
Conversation


@YuJiankang YuJiankang commented Oct 23, 2025

This PR ports the chunked prefill related patches from deepseek_r1 to aice 1.22; it works together with HabanaAI/vllm-hpu-extension#381.

if any(context_lens):
    assert not self.scheduler_config.chunked_prefill_enabled
    # assert not self.scheduler_config.chunked_prefill_enabled
    # prefix caching

Comment is out of date. Remove the deprecated (commented-out) assert and replace the comment with something like:

# prefix caching or chunked prefill

Author

Done


tvoas commented Nov 3, 2025

Recommend adjustments be made to `start_gaudi_vllm_server.sh` to expose chunked prefill controls. Right now, the chunk size will be set to max_model_len even if `-e "--enable-chunked-prefill"` is present on the command line. There is no way to specify the chunk size.


ikurtchen commented Nov 4, 2025

> Recommend adjustments be made to `start_gaudi_vllm_server.sh` to expose chunked prefill controls. Right now, the chunk size will be set to max_model_len even if `-e "--enable-chunked-prefill"` is present on the command line. There is no way to specify the chunk size.

Yes, `start_gaudi_vllm_server.sh` sets max_num_batched_tokens to max_model_len by default. From the scheduler code, max_num_batched_tokens is one of the configs that controls the chunk size, and it looks like the chunk size can change at runtime, for example in `_chunk_new_tokens_to_schedule()`:

        # Get the number of tokens to allocate to this prefill slot
        prefill_slot_budget = (
            remaining_token_budget if partial_prefill_metadata is None else 
            partial_prefill_budget_lookup_list[
                partial_prefill_metadata.schedulable_prefills])

        ...

        num_new_tokens = min(num_new_tokens, remaining_token_budget,
                             prefill_slot_budget)

The values in partial_prefill_budget_lookup_list are controlled by max_num_partial_prefills; when there is more than one prefill (i.e. max_num_partial_prefills > 1), the chunk size is halved.
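
For illustration, here is a minimal sketch of how that budget split behaves; build_budget_lookup is a hypothetical helper with made-up values, not the actual scheduler code:

# Hypothetical sketch: per-prefill token budget as a function of how many
# partial prefills can be scheduled. Not the real vLLM lookup construction.
def build_budget_lookup(max_num_batched_tokens: int,
                        max_num_partial_prefills: int) -> list[int]:
    # Index i = number of schedulable prefills; the token budget is split
    # evenly among them.
    return [max_num_batched_tokens // max(i, 1)
            for i in range(max_num_partial_prefills + 1)]

print(build_budget_lookup(2048, 2))  # [2048, 2048, 1024] -> halves with two prefills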

Currently this patch does chunked prefill based on the prompt length (aligned to the block size). I'm wondering whether we could instead follow the chunk size provided by the scheduler, but then we would need to consider the padding and warmup combinations. Do you have any suggestions on this?

@ikurtchen

Another question: when there are both prefill and decode requests in one batch, are the decode tokens left unpadded? The prefill and decode tokens are concatenated and sent to model.forward(); will this cause dynamic shapes?

@YuJiankang force-pushed the chunked_prefill branch 5 times, most recently from ff0da82 to 0f10ea6 on November 21, 2025 08:48
@YuJiankang force-pushed the chunked_prefill branch 3 times, most recently from aaa66c1 to f087809 on November 25, 2025 01:40
@YuJiankang
Author

> Another question: when there are both prefill and decode requests in one batch, are the decode tokens left unpadded? The prefill and decode tokens are concatenated and sent to model.forward(); will this cause dynamic shapes?

I have added logic to pad the decode tokens during warmup, thanks.
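
For reference, a minimal sketch of what this padding looks like; the bucket sizes and helper below are illustrative assumptions, not the exact code in this PR:

import torch
import torch.nn.functional as F

DECODE_BUCKETS = [32, 64, 128]  # assumed warmed-up decode bucket sizes

def pad_decode_tokens(decode_token_ids: torch.Tensor,
                      pad_id: int = 0) -> torch.Tensor:
    # Round the decode count up to the nearest warmed-up bucket so the
    # concatenated prefill+decode batch keeps a static shape on HPU.
    num_decodes = decode_token_ids.shape[0]
    target = next(b for b in DECODE_BUCKETS if b >= num_decodes)
    return F.pad(decode_token_ids, (0, target - num_decodes), value=pad_id)

padded = pad_decode_tokens(torch.arange(50))
print(padded.shape)  # torch.Size([64])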

@YuJiankang
Author

@czhu15 @yangulei @taotod please help review, thanks a lot.


taotod commented Nov 27, 2025

@YuJiankang Please fix all the pre-commit issues.
[screenshot of pre-commit failures]


taotod commented Nov 27, 2025

@YuJiankang, please update scripts/README.md with how to enable chunked prefill and the recommended scenarios.
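
For the README, something along these lines could work as an offline example (a sketch using standard vLLM engine arguments; the model name and token counts are placeholders, and I'm assuming max_num_partial_prefills is exposed on this branch):

from vllm import LLM

# Sketch: enable chunked prefill and bound the per-step token budget, which
# is what effectively caps the prefill chunk size. Values are placeholders.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1",    # example model
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,        # chunk-size control
    max_num_partial_prefills=2,         # allow two concurrent prefill chunks
)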

os.environ["VLLM_SKIP_WARMUP"] = "true"
os.environ['VLLM_CONTIGUOUS_PA'] = 'false'
os.environ['VLLM_MLA_DISABLE_REQUANTIZATION']='1'
os.environ['PT_HPU_ENABLE_LAZY_COLLECTIVES']='true'

Do we need the env vars below for the aice/v1.22.0 branch?
os.environ['VLLM_MLA_DISABLE_REQUANTIZATION']='1'
os.environ['VLLM_MLA_PERFORM_MATRIX_ABSORPTION']='0'
os.environ['VLLM_MTP_PRINT_ACCPET_RATE']='0'

os.environ['VLLM_MLA_PERFORM_MATRIX_ABSORPTION']='0'
os.environ['VLLM_MTP_PRINT_ACCPET_RATE']='0'
os.environ['PT_HPU_LAZY_MODE']='1'
os.environ['VLLM_DELAYED_SAMPLING']='false'

does chunked prefill conflict with delayed sampling?

value: torch.Tensor, kv_cache: torch.Tensor,
attn_metadata: HPUAttentionMetadata,
is_prefill: bool) -> HPUAttentionData:
attn_data: HPUAttentionData = HPUAttentionData()

It would be good to add a description of the preprocess_forward API, including its purpose, arguments, and return values.
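
For example, a hypothetical docstring sketch (parameter list abridged to the part shown in the diff; the actual semantics need to come from the author):

import torch

def preprocess_forward(value: torch.Tensor,
                       kv_cache: torch.Tensor,
                       attn_metadata: "HPUAttentionMetadata",
                       is_prefill: bool) -> "HPUAttentionData":
    """Prepare attention inputs before the HPU attention kernels run.

    Args:
        value: value tensor for the current step.
        kv_cache: paged KV cache tensor to read from / write into.
        attn_metadata: HPU attention metadata (slot mapping, context
            lengths, prefill/decode counts, ...).
        is_prefill: whether this call handles the prefill part of the batch.

    Returns:
        HPUAttentionData holding the preprocessed tensors used by forward().
    """
    ...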

slot_mapping = attn_metadata.slot_mapping.flatten(
) if attn_metadata.slot_mapping is not None else None
batch_size = attn_metadata.num_prefills
# Convert Flat inputs into 2D Inputs

Wrong comment? It should be 3D input, i.e. [batch_size, seq_len, hidden_size].
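
For context, a minimal sketch of the reshape in question (shapes are made-up values):

import torch

batch_size, seq_len, hidden_size = 2, 4, 8

# Flat token-major input: [batch_size * seq_len, hidden_size]
flat_hidden = torch.randn(batch_size * seq_len, hidden_size)

# "Convert flat inputs into 3D inputs": [batch_size, seq_len, hidden_size]
hidden_3d = flat_hidden.view(batch_size, seq_len, hidden_size)
print(hidden_3d.shape)  # torch.Size([2, 4, 8])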

attn_metadata: HPUAttentionMetadata,
output: Optional[torch.Tensor] = None,
) -> torch.Tensor:
"""Forward pass with xFormers and PagedAttention.

Wrong comment? The docstring mentions xFormers, which doesn't apply to this backend.

"VLLM_SLEEP_WHEN_IDLE":
lambda: bool(int(os.getenv("VLLM_SLEEP_WHEN_IDLE", "0"))),

# Use chunked prefill with dynamic input shapes for HPU backend.

What's the meaning of VLLM_HPU_CHUNKED_PREFILL_DYNAMIC_INPUT? When should it be set?

paddings = [max_len - q for q in temp_query_lens]
paddings = [0] + paddings[:-1]
paddings = list(itertools.accumulate(paddings))
for i, seq_group_metadata in enumerate(seq_group_metadata_list):

Why do we need to add these lines?
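
For reference, here is what these lines seem to compute: the cumulative padding inserted before each sequence, which I assume is used to shift per-sequence token indices into the right-padded flat layout. The lengths below are made up:

import itertools

temp_query_lens = [3, 2, 4]   # made-up query lengths for three sequences
max_len = 4                   # each sequence is right-padded to max_len

paddings = [max_len - q for q in temp_query_lens]  # [1, 2, 0]
paddings = [0] + paddings[:-1]                     # padding inserted before each seq
paddings = list(itertools.accumulate(paddings))    # [0, 1, 3]

# paddings[i] is the offset shift for sequence i's token indices when moving
# from the unpadded concatenation to the padded flat layout.
for i, pad in enumerate(paddings):
    last_unpadded = sum(temp_query_lens[:i + 1]) - 1
    print(f"seq {i}: unpadded index {last_unpadded} -> padded index {last_unpadded + pad}")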

align_worker=align_worker)

selected_token_indices = None
temp_query_lens = query_lens.copy()

Suggest renaming temp_query_lens to a more meaningful name.

logger_msg = "Multimodal bucket : " + str(self.multimodal_buckets)
logger.info(logger_msg)

if max_batch_size < 1:

When will max_batch_size be < 1? Should we print a warning message or raise an exception if this case is not expected?

num_iters=3,
align_worker=False,
is_dummy_run=False) -> None:
phase = 'mix'

Please add a description of the purpose of warmup_scenario_mix and what it does. Can you reuse the current warmup_scenario function? There seems to be a lot of common code.
