Apologies for the weird title, but I was trying to understand how much time each step of the inference process takes, and I found something baffling.
In the file
`/opt/conda/envs/unsloth_env/lib/python3.11/site-packages/transformers/generation/utils.py`
I put a timer around the call to `self._sample` in the calling function:

```python
# 12. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
timer = time.time()
result = self._sample(
    input_ids,
    logits_processor=prepared_logits_processor,
    stopping_criteria=prepared_stopping_criteria,
    generation_config=generation_config,
    synced_gpus=synced_gpus,
    streamer=streamer,
    **model_kwargs,
)
print('Sampling time:: ', time.time() - timer)
```
For a certain prompt this reports a total of about 22 seconds. But inside `_sample` I am also timing the main while loop, which includes the forward pass for every generated token and the softmax at the LM head, and that comes to about 15-16 seconds. I am at my wits' end trying to figure out where the missing ~7 seconds go between the return statement at the end of `_sample` and my print in the caller.

I print a summary at the end of that while loop, and another timestamp just before the return:
```python
while self._has_unfinished_sequences(
    this_peer_finished, synced_gpus, device=input_ids.device, cur_len=cur_len, max_length=max_length
):
```
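To show the shape of what I'm measuring, here is a standalone toy version of the per-token timing. It uses a hypothetical single-layer stand-in for the model, not the real `_sample` internals, and like my real instrumentation it does not call `torch.cuda.synchronize()`:

```python
import time

import torch

# Hypothetical stand-in for the per-token generation loop: one Linear layer
# plays the role of model + LM head. Variable names match my summary print.
device = "cuda" if torch.cuda.is_available() else "cpu"
hidden, vocab = 1024, 32000
lm_head = torch.nn.Linear(hidden, vocab).to(device)
state = torch.randn(1, hidden, device=device)

total_fwd_time_, sample_time_, num_toks = 0.0, 0.0, 0

for _ in range(289):  # same number of tokens as in the log below
    t0 = time.time()
    logits = lm_head(state)                 # "forward pass" for this token
    total_fwd_time_ += time.time() - t0

    t1 = time.time()
    probs = torch.softmax(logits, dim=-1)   # softmax at the LM head
    next_tok = torch.multinomial(probs, 1)  # sampling step
    sample_time_ += time.time() - t1
    num_toks += 1

# printed at the end of the while loop in the real code
print('Summary total_fwd_time_, sample_time_, num_toks =',
      total_fwd_time_, sample_time_, num_toks)
```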
Here is the small log snippet from the real run:

```
GOING INTO GEN:: 0.025022506713867188
Summary total_fwd_time_, sample_time_, num_toks = 15.918846130371094 0.14170002937316895 289
time tbetween prev print and about to return 9.5367431640625e-07
Sampling time:: 22.605921983718872
```
Everything above says the loop exited after roughly 15-16 seconds, and yet the calling function says it took 22.
I asked grok and it gave me an answer about CUDA synchronization that made no sense to me (a sketch of what I think it was getting at is below). Any pointers, please?
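For reference, this is the kind of effect I believe grok was pointing at, as far as I understand it; the sketch below is a minimal standalone example that assumes a CUDA device and has nothing to do with my actual model or setup. CUDA kernel launches return immediately, so a bare `time.time()` taken right after launching work can under-report GPU time, and the "missing" time gets charged to whatever later call ends up waiting on the GPU. I still don't see how that would add up to ~7 seconds in my case.

```python
import time

import torch

# Minimal illustration of asynchronous CUDA execution (requires a GPU):
# the loop appears to finish almost instantly because the matmuls are only
# queued; the real GPU time only shows up once something waits for the device.
device = "cuda"
x = torch.randn(4096, 4096, device=device)

torch.cuda.synchronize()
t0 = time.time()
for _ in range(100):
    x = x @ x          # launched asynchronously on the GPU
    x = x / x.norm()   # keep the values from blowing up
t_no_sync = time.time() - t0  # measured WITHOUT waiting for the GPU

torch.cuda.synchronize()      # now actually wait for the GPU to finish
t_synced = time.time() - t0

print(f'loop wall time without synchronize: {t_no_sync:.3f}s')
print(f'loop wall time after synchronize:   {t_synced:.3f}s')
```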