
Conversation


@YuJiankang YuJiankang commented Oct 23, 2025

This PR ports the chunked prefill related patches from deepseek_r1 to aice 1.22; it works together with HabanaAI/vllm-hpu-extension#381.

if any(context_lens):
    assert not self.scheduler_config.chunked_prefill_enabled
    # assert not self.scheduler_config.chunked_prefill_enabled
    # prefix caching

Comment is out of date. Remove the deprecated (commented-out) assert and replace the comment with something like:

# prefix caching or chunked prefill

Author

Done


tvoas commented Nov 3, 2025

Recommend adjustments be made to `start_gaudi_vllm_server.sh` to expose chunked prefill controls. Right now, the chunk size will be set to max_model_len even if `-e "--enable-chunked-prefill"` is present on the command line. There is no way to specify the chunk size.


ikurtchen commented Nov 4, 2025

> Recommend adjustments be made to `start_gaudi_vllm_server.sh` to expose chunked prefill controls. Right now, the chunk size will be set to max_model_len even if `-e "--enable-chunked-prefill"` is present on the command line. There is no way to specify the chunk size.

Yes, `start_gaudi_vllm_server.sh` sets max_num_batched_tokens to max_model_len by default. From the scheduler code, max_num_batched_tokens is one of the configs that controls the chunk size, and it looks like the chunk size can change at runtime, for example in `_chunk_new_tokens_to_schedule()`:

        # Get the number of tokens to allocate to this prefill slot
        prefill_slot_budget = (
            remaining_token_budget if partial_prefill_metadata is None else 
            partial_prefill_budget_lookup_list[
                partial_prefill_metadata.schedulable_prefills])

        ...

        num_new_tokens = min(num_new_tokens, remaining_token_budget,
                             prefill_slot_budget)

The values in partial_prefill_budget_lookup_list are controlled by max_num_partial_prefills; when there is more than one prefill (i.e. max_num_partial_prefills > 1), the chunk size is halved.
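
For illustration, here is a minimal sketch of how that budget split behaves; build_budget_lookup is a hypothetical helper with made-up values, not the actual scheduler code:

# Hypothetical sketch: per-prefill token budget as a function of how many
# partial prefills can be scheduled. Not the real vLLM lookup construction.
def build_budget_lookup(max_num_batched_tokens: int,
                        max_num_partial_prefills: int) -> list[int]:
    # Index i = number of schedulable prefills; the token budget is split
    # evenly among them.
    return [max_num_batched_tokens // max(i, 1)
            for i in range(max_num_partial_prefills + 1)]

print(build_budget_lookup(2048, 2))  # [2048, 2048, 1024] -> halves with two prefills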

Currently this patch does chunked prefill based on the prompt length (aligned to the block size). I'm wondering whether we could instead follow the chunk size provided by the scheduler, but then we would need to consider the padding and warmup combinations. Do you have any suggestions on this?

@ikurtchen

Another question: when there are both prefill and decode requests in one batch, are the decode tokens left unpadded? The prefill and decode tokens are concatenated and sent to model.forward(); will this cause dynamic shapes?

@YuJiankang force-pushed the chunked_prefill branch 5 times, most recently from ff0da82 to 0f10ea6 on November 21, 2025 08:48
@YuJiankang force-pushed the chunked_prefill branch 3 times, most recently from aaa66c1 to f087809 on November 25, 2025 01:40
@YuJiankang
Author

> Another question: when there are both prefill and decode requests in one batch, are the decode tokens left unpadded? The prefill and decode tokens are concatenated and sent to model.forward(); will this cause dynamic shapes?

I have added logic to pad the decode tokens during warmup, thanks.
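
For reference, a minimal sketch of what this padding looks like; the bucket sizes and helper below are illustrative assumptions, not the exact code in this PR:

import torch
import torch.nn.functional as F

DECODE_BUCKETS = [32, 64, 128]  # assumed warmed-up decode bucket sizes

def pad_decode_tokens(decode_token_ids: torch.Tensor,
                      pad_id: int = 0) -> torch.Tensor:
    # Round the decode count up to the nearest warmed-up bucket so the
    # concatenated prefill+decode batch keeps a static shape on HPU.
    num_decodes = decode_token_ids.shape[0]
    target = next(b for b in DECODE_BUCKETS if b >= num_decodes)
    return F.pad(decode_token_ids, (0, target - num_decodes), value=pad_id)

padded = pad_decode_tokens(torch.arange(50))
print(padded.shape)  # torch.Size([64])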

@YuJiankang
Author

@czhu15 @yangulei @taotod please help review, thanks a lot.


taotod commented Nov 27, 2025

@YuJiankang Please fix all the pre-commit issues.
[screenshot of pre-commit failures]


taotod commented Nov 27, 2025

@YuJiankang, please update scripts/README.md with how to enable chunked prefill and the recommended scenarios.
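
For the README, something along these lines could work as an offline example (a sketch using standard vLLM engine arguments; the model name and token counts are placeholders, and I'm assuming max_num_partial_prefills is exposed on this branch):

from vllm import LLM

# Sketch: enable chunked prefill and bound the per-step token budget, which
# is what effectively caps the prefill chunk size. Values are placeholders.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1",    # example model
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,        # chunk-size control
    max_num_partial_prefills=2,         # allow two concurrent prefill chunks
)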

os.environ["VLLM_SKIP_WARMUP"] = "true"
os.environ['VLLM_CONTIGUOUS_PA'] = 'false'
os.environ['VLLM_MLA_DISABLE_REQUANTIZATION']='1'
os.environ['PT_HPU_ENABLE_LAZY_COLLECTIVES']='true'

Do we need the env vars below for the aice/v1.22.0 branch?
os.environ['VLLM_MLA_DISABLE_REQUANTIZATION']='1'
os.environ['VLLM_MLA_PERFORM_MATRIX_ABSORPTION']='0'
os.environ['VLLM_MTP_PRINT_ACCPET_RATE']='0'

os.environ['VLLM_MLA_PERFORM_MATRIX_ABSORPTION']='0'
os.environ['VLLM_MTP_PRINT_ACCPET_RATE']='0'
os.environ['PT_HPU_LAZY_MODE']='1'
os.environ['VLLM_DELAYED_SAMPLING']='false'

does chunked prefill conflict with delayed sampling?

value: torch.Tensor, kv_cache: torch.Tensor,
attn_metadata: HPUAttentionMetadata,
is_prefill: bool) -> HPUAttentionData:
attn_data: HPUAttentionData = HPUAttentionData()

It would be good to add a description of the preprocess_forward API, including its purpose, arguments, and return values.
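
For example, a hypothetical docstring sketch (parameter list abridged to the part shown in the diff; the actual semantics need to come from the author):

import torch

def preprocess_forward(value: torch.Tensor,
                       kv_cache: torch.Tensor,
                       attn_metadata: "HPUAttentionMetadata",
                       is_prefill: bool) -> "HPUAttentionData":
    """Prepare attention inputs before the HPU attention kernels run.

    Args:
        value: value tensor for the current step.
        kv_cache: paged KV cache tensor to read from / write into.
        attn_metadata: HPU attention metadata (slot mapping, context
            lengths, prefill/decode counts, ...).
        is_prefill: whether this call handles the prefill part of the batch.

    Returns:
        HPUAttentionData holding the preprocessed tensors used by forward().
    """
    ...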

slot_mapping = attn_metadata.slot_mapping.flatten(
) if attn_metadata.slot_mapping is not None else None
batch_size = attn_metadata.num_prefills
# Convert Flat inputs into 2D Inputs

Wrong comment? It should be 3D input, i.e. [batch_size, seq_len, hidden_size].
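
For context, a minimal sketch of the reshape in question (shapes are made-up values):

import torch

batch_size, seq_len, hidden_size = 2, 4, 8

# Flat token-major input: [batch_size * seq_len, hidden_size]
flat_hidden = torch.randn(batch_size * seq_len, hidden_size)

# "Convert flat inputs into 3D inputs": [batch_size, seq_len, hidden_size]
hidden_3d = flat_hidden.view(batch_size, seq_len, hidden_size)
print(hidden_3d.shape)  # torch.Size([2, 4, 8])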

attn_metadata: HPUAttentionMetadata,
output: Optional[torch.Tensor] = None,
) -> torch.Tensor:
"""Forward pass with xFormers and PagedAttention.

Wrong comment? The docstring mentions xFormers, which doesn't apply to this backend.

"VLLM_SLEEP_WHEN_IDLE":
lambda: bool(int(os.getenv("VLLM_SLEEP_WHEN_IDLE", "0"))),

# Use chunked prefill with dynamic input shapes for HPU backend.

What's the meaning of VLLM_HPU_CHUNKED_PREFILL_DYNAMIC_INPUT? When should it be set?

paddings = [max_len - q for q in temp_query_lens]
paddings = [0] + paddings[:-1]
paddings = list(itertools.accumulate(paddings))
for i, seq_group_metadata in enumerate(seq_group_metadata_list):

Why do we need to add these lines?
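
For reference, here is what these lines seem to compute: the cumulative padding inserted before each sequence, which I assume is used to shift per-sequence token indices into the right-padded flat layout. The lengths below are made up:

import itertools

temp_query_lens = [3, 2, 4]   # made-up query lengths for three sequences
max_len = 4                   # each sequence is right-padded to max_len

paddings = [max_len - q for q in temp_query_lens]  # [1, 2, 0]
paddings = [0] + paddings[:-1]                     # padding inserted before each seq
paddings = list(itertools.accumulate(paddings))    # [0, 1, 3]

# paddings[i] is the offset shift for sequence i's token indices when moving
# from the unpadded concatenation to the padded flat layout.
for i, pad in enumerate(paddings):
    last_unpadded = sum(temp_query_lens[:i + 1]) - 1
    print(f"seq {i}: unpadded index {last_unpadded} -> padded index {last_unpadded + pad}")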

align_worker=align_worker)

selected_token_indices = None
temp_query_lens = query_lens.copy()

Suggest renaming temp_query_lens to a more meaningful name.

logger_msg = "Multimodal bucket : " + str(self.multimodal_buckets)
logger.info(logger_msg)

if max_batch_size < 1:

When will max_batch_size be < 1? Should we print a warning message or raise an exception if this case is not expected?

num_iters=3,
align_worker=False,
is_dummy_run=False) -> None:
phase = 'mix'

Please add a description of the purpose of warmup_scenario_mix and what it does. Can you reuse the current warmup_scenario function? There seems to be a lot of common code.
