Optimize scheduler for chunk prefill #7454
Open: liyonghua0910 wants to merge 1 commit into `PaddlePaddle:release/2.4` from `liyonghua0910:release/2.4+20260416_opt_prefill`.
Changes from all commits
```diff
@@ -587,6 +587,7 @@ def schedule(self):
         preempted_reqs: list[Request] = []
         error_reqs: list[tuple[str, str]] = []
         token_budget = self.config.scheduler_config.max_num_batched_tokens
+        chunk_prefill_in_running_not_satisfied = False

         # First, schedule the RUNNING requests.
         req_index = 0
@@ -694,27 +695,27 @@ def _allocate_decode_and_extend():
                 )
                 num_new_tokens = self._get_num_new_tokens(request, token_budget)
                 num_new_block = self.get_new_block_nums(request, num_new_tokens)
+                can_schedule_block_num_threshold = self._get_can_schedule_prefill_threshold_block(num_new_block)
                 # Allocate blocks to prefill
-                if self.cache_manager.can_allocate_gpu_blocks(num_new_block):
-                    request.block_tables.extend(self.cache_manager.allocate_gpu_blocks(num_new_block))
-                    # Prepare prefill task
-                    scheduled_reqs.append(self._prepare_prefill_task(request, num_new_tokens))
-                else:  # Not enough blocks to allocate, trigger preemption
-                    can_schedule = self._trigger_preempt(request, num_new_block, preempted_reqs, scheduled_reqs)
-                    if not can_schedule:
-                        break
-                    request.block_tables.extend(self.cache_manager.allocate_gpu_blocks(num_new_block))
+                if self.cache_manager.can_allocate_gpu_blocks(can_schedule_block_num_threshold):
+                    request.block_tables.extend(
+                        self.cache_manager.allocate_gpu_blocks(num_new_block, request.request_id)
+                    )
+                    # Prepare prefill task
+                    scheduled_reqs.append(self._prepare_prefill_task(request, num_new_tokens))
+                else:  # Not enough blocks to allocate
+                    chunk_prefill_in_running_not_satisfied = True
+                    break  # For chunk prefill request, if not satisfy condition for prefill, just break
                 token_budget -= num_new_tokens
                 request.num_computed_tokens += num_new_tokens
                 if self.config.cache_config.enable_prefix_caching:
                     self.cache_manager.update_cache_blocks(
                         request, self.config.cache_config.block_size, request.num_computed_tokens
                     )
                 req_index += 1
-        # schedule the WAITING requests.
-        if not preempted_reqs:
+        # Second, schedule the WAITING requests.
+        if (not preempted_reqs) and (not chunk_prefill_in_running_not_satisfied):
             skip_requests: list[Request] = []
             while self.waiting and token_budget > 0:
                 if len(self.running) == self.max_num_seqs:
```

Review comment on `self.cache_manager.allocate_gpu_blocks(num_new_block, request.request_id)`:

🔴 Bug: note that every other `allocate_gpu_blocks` call in this file passes only the block count. Suggested fix:

```python
request.block_tables.extend(
    self.cache_manager.allocate_gpu_blocks(num_new_block)
)
```
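The gate this patch adds can be illustrated with a minimal, self-contained sketch. Everything below (`FakeCacheManager`, the `schedule` function, `threshold_blocks`) is a simplified stand-in for the real FastDeploy scheduler classes, not their actual API; it only mirrors the control flow of the diff: a RUNNING chunk-prefill request that cannot meet the block threshold sets a flag and breaks instead of preempting, and that flag then suppresses scheduling of the WAITING queue for this step.

```python
# Minimal sketch of the scheduling gate added in this PR.
# All names here are simplified stand-ins, not the real scheduler API.

class FakeCacheManager:
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks

    def can_allocate_gpu_blocks(self, n):
        return self.free_blocks >= n

    def allocate_gpu_blocks(self, n):
        self.free_blocks -= n
        return list(range(n))


def schedule(running, waiting, cache, threshold_blocks):
    """Return (scheduled names, whether the WAITING queue was considered)."""
    scheduled = []
    chunk_prefill_not_satisfied = False

    # First, the RUNNING chunk-prefill requests.
    for name, num_new_block in running:
        if cache.can_allocate_gpu_blocks(threshold_blocks):
            cache.allocate_gpu_blocks(num_new_block)
            scheduled.append(name)
        else:
            # Not enough blocks: record the shortfall and stop
            # scheduling for this step instead of preempting.
            chunk_prefill_not_satisfied = True
            break

    # Second, the WAITING requests -- skipped entirely when a running
    # chunk-prefill request could not be satisfied above.
    considered_waiting = not chunk_prefill_not_satisfied
    if considered_waiting:
        scheduled.extend(name for name, _ in waiting)
    return scheduled, considered_waiting


cache = FakeCacheManager(free_blocks=4)
sched, saw_waiting = schedule(
    running=[("r1", 3), ("r2", 3)],  # r2 cannot get 3 of the remaining 1
    waiting=[("w1", 1)],
    cache=cache,
    threshold_blocks=3,
)
print(sched, saw_waiting)  # ['r1'] False
```

Under the old code, the `else` branch would instead try to preempt other requests to free blocks; the patch trades that preemption for simply waiting until the next scheduling step.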
Review comment on `can_schedule_block_num_threshold = self._get_can_schedule_prefill_threshold_block(num_new_block)`:

🔴 Bug: the signature of `_get_can_schedule_prefill_threshold_block` is `(self, request, num_chunk_new_block)`, so it requires two arguments, but only `num_new_block` is passed here; `request` is missing. At runtime this raises `TypeError: _get_can_schedule_prefill_threshold_block() missing 1 required positional argument`. Note that the calls in the WAITING-request scheduling section of the same file (lines 762 and 810) are correct and pass both arguments, `(request, num_new_block)`. Suggested fix: pass both arguments here as well.
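The failure mode the reviewer describes is ordinary Python arity checking; a reduced sketch with a stub method of the same name and shape (a placeholder, not the real scheduler class) reproduces it:

```python
class Scheduler:
    # Same shape as the signature the reviewer quotes: two required
    # positional arguments after self. The body is a placeholder.
    def _get_can_schedule_prefill_threshold_block(self, request, num_chunk_new_block):
        return num_chunk_new_block


s = Scheduler()
try:
    # The call as written in the patch: only one argument is supplied,
    # so `num_chunk_new_block` is missing.
    s._get_can_schedule_prefill_threshold_block(8)
except TypeError as e:
    print(type(e).__name__)  # TypeError

# The corrected call, matching the WAITING-queue call sites, passes both:
s._get_can_schedule_prefill_threshold_block("req", 8)
```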