Apologies for the weird title, but I was trying to understand how much time each step of the inference process takes, and I found something baffling.
In the file
`/opt/conda/envs/unsloth_env/lib/python3.11/site-packages/transformers/generation/utils.py`
I put a timer around the call to `self._sample` in the calling function:

```python
# 12. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
timer = time.time()
result = self._sample(
    input_ids,
    logits_processor=prepared_logits_processor,
    stopping_criteria=prepared_stopping_criteria,
    generation_config=generation_config,
    synced_gpus=synced_gpus,
    streamer=streamer,
    **model_kwargs,
)
print('Sampling time:: ', time.time() - timer)
```
For a certain prompt this reports a total of about 22 seconds. But inside `_sample` I am also timing the main while loop, which includes the forward pass for every generated token and the softmax at the LM head, and that comes to about 15-16 seconds. I am at my wits' end trying to figure out where the missing ~7 seconds go between the return statement at the end of `_sample` and my print in the caller.

I print a summary at the end of that while loop, and another timestamp just before the return:
```python
while self._has_unfinished_sequences(
    this_peer_finished, synced_gpus, device=input_ids.device, cur_len=cur_len, max_length=max_length
):
```
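To show the shape of what I'm measuring, here is a standalone toy version of the per-token timing. It uses a hypothetical single-layer stand-in for the model, not the real `_sample` internals, and like my real instrumentation it does not call `torch.cuda.synchronize()`:

```python
import time

import torch

# Hypothetical stand-in for the per-token generation loop: one Linear layer
# plays the role of model + LM head. Variable names match my summary print.
device = "cuda" if torch.cuda.is_available() else "cpu"
hidden, vocab = 1024, 32000
lm_head = torch.nn.Linear(hidden, vocab).to(device)
state = torch.randn(1, hidden, device=device)

total_fwd_time_, sample_time_, num_toks = 0.0, 0.0, 0

for _ in range(289):  # same number of tokens as in the log below
    t0 = time.time()
    logits = lm_head(state)                 # "forward pass" for this token
    total_fwd_time_ += time.time() - t0

    t1 = time.time()
    probs = torch.softmax(logits, dim=-1)   # softmax at the LM head
    next_tok = torch.multinomial(probs, 1)  # sampling step
    sample_time_ += time.time() - t1
    num_toks += 1

# printed at the end of the while loop in the real code
print('Summary total_fwd_time_, sample_time_, num_toks =',
      total_fwd_time_, sample_time_, num_toks)
```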
Here is the small log snippet from the real run:

```
GOING INTO GEN:: 0.025022506713867188
Summary total_fwd_time_, sample_time_, num_toks = 15.918846130371094 0.14170002937316895 289
time tbetween prev print and about to return 9.5367431640625e-07
Sampling time:: 22.605921983718872
```
Everything above says the loop exited after roughly 15-16 seconds, and yet the calling function says it took 22.
I asked grok and it gave me an answer about CUDA synchronization that made no sense to me (a sketch of what I think it was getting at is below). Any pointers, please?
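For reference, this is the kind of effect I believe grok was pointing at, as far as I understand it; the sketch below is a minimal standalone example that assumes a CUDA device and has nothing to do with my actual model or setup. CUDA kernel launches return immediately, so a bare `time.time()` taken right after launching work can under-report GPU time, and the "missing" time gets charged to whatever later call ends up waiting on the GPU. I still don't see how that would add up to ~7 seconds in my case.

```python
import time

import torch

# Minimal illustration of asynchronous CUDA execution (requires a GPU):
# the loop appears to finish almost instantly because the matmuls are only
# queued; the real GPU time only shows up once something waits for the device.
device = "cuda"
x = torch.randn(4096, 4096, device=device)

torch.cuda.synchronize()
t0 = time.time()
for _ in range(100):
    x = x @ x          # launched asynchronously on the GPU
    x = x / x.norm()   # keep the values from blowing up
t_no_sync = time.time() - t0  # measured WITHOUT waiting for the GPU

torch.cuda.synchronize()      # now actually wait for the GPU to finish
t_synced = time.time() - t0

print(f'loop wall time without synchronize: {t_no_sync:.3f}s')
print(f'loop wall time after synchronize:   {t_synced:.3f}s')
```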