Enable chunked prefill on aice 1.22 #2070
base: aice/v1.22.0
Conversation
Force-pushed from 6f24ce2 to da62a3d (compare)
Force-pushed from 7a60181 to dd5eb1f (compare)
vllm/worker/hpu_model_runner.py
Outdated
if any(context_lens):
    assert not self.scheduler_config.chunked_prefill_enabled
    # assert not self.scheduler_config.chunked_prefill_enabled
    # prefix caching
The comment is out of date. Remove the deprecated (commented-out) assert and replace the comment with something like:
# prefix caching or chunked prefill
Done
Recommend adjustments be made to
Yes. Currently this patch does chunked prefill with the prompt length (aligned to block size). I'm thinking about whether we can change this and follow the chunk size provided by the scheduler, but we may need to consider the padding and warmup combinations. Do you have any suggestions on this?
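For illustration, a minimal sketch of the two chunking strategies discussed above; chunk_by_prompt_len, chunk_by_scheduler and their arguments are assumed names for this example, not the actual scheduler API:

```python
def chunk_by_prompt_len(prompt_len: int, block_size: int) -> int:
    """Current behavior in this patch: process the whole remaining prompt,
    rounded up to a multiple of the block size."""
    return ((prompt_len + block_size - 1) // block_size) * block_size

def chunk_by_scheduler(prompt_len: int, scheduler_chunk_size: int,
                       block_size: int) -> int:
    """Alternative: honor the scheduler-provided chunk size, still aligned
    to the block size so padding/warmup buckets stay predictable."""
    chunk = min(prompt_len, scheduler_chunk_size)
    return ((chunk + block_size - 1) // block_size) * block_size

# Example: prompt of 300 tokens, block size 128, scheduler chunk size 256
print(chunk_by_prompt_len(300, 128))      # 384 -> one large prefill step
print(chunk_by_scheduler(300, 256, 128))  # 256 -> first chunk, rest deferred
```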
Another question: when there are both prefill and decode requests in one batch, are the decode tokens left unpadded? The prefill and decode tokens are concatenated and sent to model.forward(); will this cause dynamic shapes?
Force-pushed from ff0da82 to 0f10ea6 (compare)
Co-authored-by: Jiang, Zhoulong <[email protected]>
Signed-off-by: jkyu <[email protected]>
Force-pushed from aaa66c1 to f087809 (compare)
I have added the logic to pad the decode tokens during warmup, thanks.
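For context, a rough sketch of the kind of decode padding being described; the bucket sizes and helper names are illustrative assumptions, not the actual hpu_model_runner code:

```python
import torch

# Pad the decode portion of a mixed batch to a fixed bucket so the
# concatenated prefill+decode input keeps a static shape.
DECODE_BUCKETS = [1, 2, 4, 8, 16, 32]  # assumed warmup buckets

def pad_decode_tokens(decode_tokens: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    n = decode_tokens.shape[0]
    bucket = next(b for b in DECODE_BUCKETS if b >= n)
    padded = torch.full((bucket,), pad_id, dtype=decode_tokens.dtype)
    padded[:n] = decode_tokens
    return padded

prefill_tokens = torch.arange(256)          # already padded to a prefill bucket
decode_tokens = torch.tensor([7, 11, 13])   # 3 decode requests
batch_input = torch.cat([prefill_tokens, pad_decode_tokens(decode_tokens)])
print(batch_input.shape)  # torch.Size([260]) -> 256 prefill + decode bucket of 4
```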
Signed-off-by: jkyu <[email protected]>
@YuJiankang Please fix all the pre-commit issues.
@YuJiankang, please update scripts/README.md with how to enable chunked prefill and the recommended scenarios.
| os.environ["VLLM_SKIP_WARMUP"] = "true" | ||
| os.environ['VLLM_CONTIGUOUS_PA'] = 'false' | ||
| os.environ['VLLM_MLA_DISABLE_REQUANTIZATION']='1' | ||
| os.environ['PT_HPU_ENABLE_LAZY_COLLECTIVES']='true' |
Do we need the env vars below for the aice/v1.22.0 branch?
os.environ['VLLM_MLA_DISABLE_REQUANTIZATION']='1'
os.environ['VLLM_MLA_PERFORM_MATRIX_ABSORPTION']='0'
os.environ['VLLM_MTP_PRINT_ACCPET_RATE']='0'
os.environ['VLLM_MLA_PERFORM_MATRIX_ABSORPTION']='0'
os.environ['VLLM_MTP_PRINT_ACCPET_RATE']='0'
os.environ['PT_HPU_LAZY_MODE']='1'
os.environ['VLLM_DELAYED_SAMPLING']='false'
Does chunked prefill conflict with delayed sampling?
value: torch.Tensor, kv_cache: torch.Tensor,
attn_metadata: HPUAttentionMetadata,
is_prefill: bool) -> HPUAttentionData:
    attn_data: HPUAttentionData = HPUAttentionData()
It would be good to add a description of the preprocess_forward API, including its purpose, arguments, and return values.
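As a rough starting point, a docstring sketch shaped after the signature shown in the diff; the query/key parameters and all argument descriptions are my guesses from the names and would need to be confirmed by the author:

```python
import torch

def preprocess_forward(self, query: torch.Tensor, key: torch.Tensor,
                       value: torch.Tensor, kv_cache: torch.Tensor,
                       attn_metadata: "HPUAttentionMetadata",
                       is_prefill: bool) -> "HPUAttentionData":
    """Prepare the attention inputs for one phase of a mixed batch.

    Args:
        query/key/value: projected Q/K/V tensors for the current batch.
        kv_cache: paged KV cache that is read from and written into.
        attn_metadata: HPU attention metadata (slot mapping, block tables, ...).
        is_prefill: True when handling the prefill part of the batch,
            False for the decode part.

    Returns:
        An HPUAttentionData object holding the reshaped tensors and metadata
        consumed by the attention kernel.
    """
    ...
```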
slot_mapping = attn_metadata.slot_mapping.flatten(
) if attn_metadata.slot_mapping is not None else None
batch_size = attn_metadata.num_prefills
# Convert Flat inputs into 2D Inputs
Wrong comment? It should be a 3D input, i.e. [batch_size, seq_len, hidden_size].
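For illustration only (tensor names and sizes are made up), the reshape the comment seems to refer to:

```python
import torch

# Turn flat (token-major) prefill inputs into [batch_size, seq_len, hidden_size]
# so each prompt occupies one row of the batch.
batch_size, seq_len, hidden_size = 2, 4, 8
flat_hidden = torch.randn(batch_size * seq_len, hidden_size)  # flat inputs
hidden_3d = flat_hidden.view(batch_size, seq_len, hidden_size)
print(hidden_3d.shape)  # torch.Size([2, 4, 8])
```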
attn_metadata: HPUAttentionMetadata,
output: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    """Forward pass with xFormers and PagedAttention.
wrong comment?
| "VLLM_SLEEP_WHEN_IDLE": | ||
| lambda: bool(int(os.getenv("VLLM_SLEEP_WHEN_IDLE", "0"))), | ||
|
|
||
| # Use chunked prefill with dynamic input shapes for HPU backend. |
What's the meaning of VLLM_HPU_CHUNKED_PREFILL_DYNAMIC_INPUT? When should it be set?
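For reference, a sketch of how the flag would typically be read, following the same lambda-over-getenv pattern as VLLM_SLEEP_WHEN_IDLE above; the default value and what the flag actually controls are exactly what this question asks the author to document:

```python
import os

# Assumed registration pattern, mirroring the existing env-var table above.
environment_variables = {
    "VLLM_HPU_CHUNKED_PREFILL_DYNAMIC_INPUT":
    lambda: bool(int(os.getenv("VLLM_HPU_CHUNKED_PREFILL_DYNAMIC_INPUT", "0"))),
}

if environment_variables["VLLM_HPU_CHUNKED_PREFILL_DYNAMIC_INPUT"]():
    print("chunked prefill will use dynamic input shapes")
```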
paddings = [max_len - q for q in temp_query_lens]
paddings = [0] + paddings[:-1]
paddings = list(itertools.accumulate(paddings))
for i, seq_group_metadata in enumerate(seq_group_metadata_list):
Why do we need to add these lines?
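To make the question concrete, here is what those three lines compute with made-up query lengths; if I read them correctly, the result is the cumulative padding inserted in front of each sequence:

```python
import itertools

# Made-up example: three prefill sequences padded to max_len = 8.
temp_query_lens = [3, 5, 2]
max_len = 8

paddings = [max_len - q for q in temp_query_lens]   # [5, 3, 6]
paddings = [0] + paddings[:-1]                       # [0, 5, 3]
paddings = list(itertools.accumulate(paddings))      # [0, 5, 8]
print(paddings)
# Each entry is the cumulative padding before sequence i, i.e. the offset to
# add when mapping that sequence's token indices into the padded, flattened input.
```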
align_worker=align_worker)

selected_token_indices = None
temp_query_lens = query_lens.copy()
Suggest renaming temp_query_lens to a more meaningful name.
logger_msg = "Multimodal bucket : " + str(self.multimodal_buckets)
logger.info(logger_msg)

if max_batch_size < 1:
When will max_batch_size < 1? Should we print a warning message or raise an exception if this case is not expected?
num_iters=3,
align_worker=False,
is_dummy_run=False) -> None:
    phase = 'mix'
Please add a description of the purpose of warmup_scenario_mix and what it does.
Can you re-use the current warmup_scenario function? There seems to be a lot of common code there.

This PR ports the chunked prefill related patches from deepseek_r1 to aice 1.22.
It works together with HabanaAI/vllm-hpu-extension#381.