[PERF] Remove TRTLLM Gen attn kernel limitation max_seq_len <=131072
#28755
Conversation
Signed-off-by: Vadim Gimpelson <[email protected]>
Code Review
This pull request removes the max_seq_len <= 131072 limitation for using the TRTLLM attention kernel on NVIDIA Blackwell GPUs. The changes involve removing the sequence length check in vllm/utils/flashinfer.py and the corresponding logic in vllm/config/vllm.py that disabled full CUDA graphs for longer sequences. The goal is to simplify kernel dispatching and enable full CUDA graphs for very long sequences. The author has provided test results showing correctness for sequence lengths up to 2**20. While the provided benchmarks show some performance regressions for the standalone attention kernel in specific scenarios, this change is expected to enable better end-to-end performance by allowing cudagraphs. The changes are straightforward and consistent with the PR's objective. I find no issues of high or critical severity.
I'm not 100% sure that the performance at 200K+ CTX is good enough here. Maybe getting full-cuda-graphs offsets the cost, but at 4-8 concurrency (which is probably very common for requests with 200K+ tokens in context) it seems consistently 10-30% slower.
Agree that for 4-8 concurrency performance is worse, but I'm not sure these are common for very large contexts. Why not 32?
Is there anything that can be done to improve the trtllm attention for those poor cases? At this point I feel like removing the dynamism should be prioritized since the impact of cudagraph support is so important.
I vote for removing the restriction to enable the full cuda graph path and filing a flashinfer kernel issue to improve kernel performance for long sequence lengths.
pavanimajety left a comment
LGTM, thanks Vadim.
[PERF] Remove TRTLLM Gen attn kernel limitation max_seq_len <=131072 (vllm-project#28755) Signed-off-by: Vadim Gimpelson <[email protected]> Signed-off-by: George D. Torres <[email protected]>
Purpose
Right now we have a limitation: the TRTLLM-Gen full attn kernel is used only if the longest sequence length in the batch is less than or equal to 131072.
From one point of view, it is a bit unclear where this limit came from. The TRTLLM-Gen full attn kernel works correctly for any `max_seq_len`, and performance data do not show a meaningful difference beyond `max_seq_len=64K` and `max_seq_len=128K` - behavior for bigger `max_seq_len` is the same as for `max_seq_len=64K` and `max_seq_len=128K`. From the other point of view, dynamic switching between different kernels causes headaches - for example, we have to disable cudagraph (see #27114).
This PR removes the `max_seq_len <= 131072` limitation for the TRTLLM Gen attn kernel.
Functional Tests Result
Ran `tests/kernels/attention/test_flashinfer_trtllm_attention.py` with modified `MAX_SEQ_LENS`: all tests successfully passed.
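For reproducibility, a sketch of how the test parametrization might be extended; the list contents below are an assumption, and only the 2**20 upper bound comes from the results reported in this PR:

```python
# Hypothetical edit to tests/kernels/attention/test_flashinfer_trtllm_attention.py.
# The original list contents are assumed here; only the 2**20 upper bound is
# taken from the test results mentioned in this PR.
MAX_SEQ_LENS = [2**17, 2**18, 2**19, 2**20]  # extend past the old 131072 (2**17) cap

# Run with, for example:
#   pytest tests/kernels/attention/test_flashinfer_trtllm_attention.py -v
```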
Performance Tests Result
Used `benchmarks/kernels/benchmark_trtllm_decode_attention.py` on B200 (see the example invocation below).
cc @pavanimajety @benchislett @mgoin
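A minimal way to reproduce the benchmark run, assuming the script executes its default sweep when launched directly; any CLI flags it may support are not shown here:

```python
# Sketch of launching the decode-attention benchmark on a B200 node.
# Assumes the script runs a built-in sweep when invoked without arguments.
import subprocess

subprocess.run(
    ["python", "benchmarks/kernels/benchmark_trtllm_decode_attention.py"],
    check=True,  # raise if the benchmark exits with a non-zero status
)
```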