
Conversation

@vadiklyutiy (Collaborator) commented Nov 14, 2025

Purpose

Right now the TRTLLM-Gen full attention kernel is used only when the longest sequence length in the batch is less than or equal to 131072.
On one hand, it is unclear where this limit came from: the TRTLLM-Gen full attention kernel works correctly for any max_seq_len, and the performance data do not show a significant difference from max_seq_len=64K and max_seq_len=128K - larger max_seq_len behaves the same as 64K and 128K.
On the other hand, dynamically switching between different kernels causes headaches - for example, we had to disable CUDA graphs (see #27114).

This PR removes the max_seq_len <= 131072 limitation for the TRTLLM-Gen attention kernel.
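
For context, here is a minimal sketch of the kind of dispatch guard this PR removes. The function and constant names below are illustrative placeholders, not the actual code in vllm/utils/flashinfer.py.

```python
# Illustrative sketch only: placeholder names, not the exact vLLM code.
# Before this PR, kernel selection depended on the longest sequence in the batch.

TRTLLM_GEN_MAX_SEQ_LEN = 131072  # the cap removed by this PR


def use_trtllm_gen_attention_before(max_seq_len: int) -> bool:
    """Old behavior: fall back to another kernel for long sequences,
    which made the kernel choice dynamic across batches."""
    return max_seq_len <= TRTLLM_GEN_MAX_SEQ_LEN


def use_trtllm_gen_attention_after(max_seq_len: int) -> bool:
    """New behavior: the sequence-length cap is gone, so the choice is static."""
    return True
```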

Functional Tests Result

Ran tests/kernels/attention/test_flashinfer_trtllm_attention.py with a modified MAX_SEQ_LENS:

MAX_SEQ_LENS = [(1024, 200000), (1024, 250000), (1024, 300000), (1024, 500000), (1024, 1000000), (1024, 2**20)]

All tests passed successfully.
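
For reproducibility, a minimal sketch of re-running the modified test in-process; only the verbosity flag is an illustrative choice.

```python
# Minimal sketch: rerun the functional test after editing MAX_SEQ_LENS in the
# test file. The "-v" flag is just an illustrative choice.
import pytest

exit_code = pytest.main([
    "tests/kernels/attention/test_flashinfer_trtllm_attention.py",
    "-v",
])
raise SystemExit(exit_code)
```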

Performance Tests Result

Used benchmarks/kernels/benchmark_trtllm_decode_attention.py on a B200 GPU.
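
Note on reading the table: the speedup_percent column is a fraction rather than a percent (positive means TRTLLM-Gen is faster than the baseline). A minimal sketch of how the column appears to be derived, inferred from the numbers rather than taken from the benchmark script itself:

```python
# Inferred from the table values, not taken from the benchmark script:
# speedup appears to be (baseline_mean - trtllm_mean) / baseline_mean.
def speedup(baseline_mean: float, trtllm_mean: float) -> float:
    return (baseline_mean - trtllm_mean) / baseline_mean


# batch_size=1, max_seq_len=65536: the table reports 0.127, i.e. ~12.7% faster.
print(speedup(0.080, 0.070))  # ~0.125 using the rounded means shown in the table
```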

batch_size max_seq_len trtllm_mean trtllm_std baseline_mean baseline_std speedup_percent
1 65536 0.070 0.006 0.080 0.005 0.127
4 65536 0.149 0.006 0.155 0.006 0.035
8 65536 0.236 0.006 0.237 0.005 0.006
16 65536 0.421 0.008 0.420 0.011 -0.003
32 65536 0.807 0.009 0.900 0.005 0.104
64 65536 1.401 0.015 1.644 0.023 0.148
128 65536 2.543 0.022 2.631 0.030 0.033
256 65536 4.813 0.023 4.378 0.049 -0.099
1 131072 0.102 0.005 0.114 0.005 0.106
4 131072 0.260 0.006 0.261 0.006 0.001
8 131072 0.431 0.006 0.384 0.005 -0.122
16 131072 0.771 0.013 0.623 0.007 -0.237
32 131072 1.521 0.022 1.626 0.007 0.065
64 131072 2.608 0.062 3.069 0.043 0.150
128 131072 4.863 0.042 5.101 0.048 0.047
256 131072 9.398 0.031 8.812 0.088 -0.066
1 200000 0.138 0.006 0.154 0.005 0.102
4 200000 0.330 0.006 0.321 0.005 -0.030
8 200000 0.651 0.027 0.620 0.006 -0.049
16 200000 1.214 0.008 1.148 0.006 -0.057
32 200000 2.397 0.021 2.749 0.009 0.128
64 200000 4.186 0.029 5.088 0.071 0.177
128 200000 7.783 0.078 8.058 0.078 0.034
256 200000 14.480 0.038 13.518 0.130 -0.071
1 262144 0.169 0.005 0.188 0.006 0.101
4 262144 0.418 0.006 0.386 0.006 -0.083
8 262144 0.755 0.005 0.652 0.006 -0.158
16 262144 1.459 0.006 1.252 0.005 -0.165
32 262144 2.869 0.033 3.404 0.006 0.157
64 262144 5.033 0.105 6.107 0.028 0.176
128 262144 9.393 0.084 9.653 0.096 0.027
256 262144 18.107 0.091 17.162 0.247 -0.055
1 500000 0.290 0.006 0.319 0.005 0.090
4 500000 0.705 0.006 0.587 0.006 -0.201
8 500000 1.393 0.006 1.102 0.008 -0.264
16 500000 2.809 0.006 2.481 0.013 -0.132
32 500000 5.430 0.214 5.916 0.009 0.082
64 500000 9.468 0.142 11.631 0.175 0.186
128 500000 17.470 0.199 18.441 0.239 0.053
256 500000 34.626 0.180 32.607 0.459 -0.062
1 1000000 0.548 0.006 0.588 0.005 0.069
4 1000000 1.530 0.006 1.399 0.004 -0.093
8 1000000 2.946 0.008 2.499 0.011 -0.179
16 1000000 6.160 0.007 5.786 0.027 -0.065
32 1000000 11.341 0.327 12.224 0.016 0.072
64 1000000 19.175 0.503 24.092 0.306 0.204
128 1000000 35.811 0.408 38.114 0.498 0.060
256 1000000 69.572 0.209 66.793 0.972 -0.042
1 1048576 0.572 0.006 0.617 0.006 0.072
4 1048576 1.540 0.006 1.392 0.007 -0.107
8 1048576 2.832 0.006 2.249 0.009 -0.259
16 1048576 5.957 0.007 5.165 0.026 -0.153
32 1048576 12.276 0.183 13.113 0.018 0.064
64 1048576 21.462 0.249 27.042 0.434 0.206
128 1048576 38.207 0.294 41.279 0.558 0.074
256 1048576 70.951 0.627 68.312 1.276 -0.039

cc @pavanimajety @benchislett @mgoin

@gemini-code-assist (bot) left a comment

Code Review

This pull request removes the max_seq_len <= 131072 limitation for using the TRTLLM attention kernel on NVIDIA Blackwell GPUs. The changes involve removing the sequence length check in vllm/utils/flashinfer.py and the corresponding logic in vllm/config/vllm.py that disabled full CUDA graphs for longer sequences. The goal is to simplify kernel dispatching and enable full CUDA graphs for very long sequences. The author has provided test results showing correctness for sequence lengths up to 2**20. While the provided benchmarks show some performance regressions for the standalone attention kernel in specific scenarios, this change is expected to enable better end-to-end performance by allowing cudagraphs. The changes are straightforward and consistent with the PR's objective. I find no issues of high or critical severity.
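
To make the second change concrete, here is a rough sketch of the kind of config-time fallback being removed. The names and mode strings are placeholders, assuming full CUDA graph capture was downgraded whenever the model's maximum length exceeded the kernel's former cap (as in #27114); this is not the actual vllm/config/vllm.py code.

```python
# Illustrative placeholder code, not the actual vllm/config/vllm.py logic.
TRTLLM_GEN_MAX_SEQ_LEN = 131072


def resolve_cudagraph_mode(requested: str, max_model_len: int) -> str:
    """Old behavior: downgrade full CUDA graphs when long sequences could force
    a runtime switch away from the TRTLLM-Gen kernel."""
    if requested == "FULL" and max_model_len > TRTLLM_GEN_MAX_SEQ_LEN:
        return "PIECEWISE"
    return requested


# With the sequence-length cap gone, the kernel choice no longer changes at
# runtime, so this downgrade path can be deleted and "FULL" stays in effect.
```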

@benchislett (Collaborator) commented:

I'm not 100% sure that the performance at 200K+ CTX is good enough here. Maybe getting full-cuda-graphs offsets the cost, but at 4-8 concurrency (which is probably very common for requests with 200K+ tokens in context) it seems consistently 10-30% slower.

@vadiklyutiy (Collaborator, Author) commented:

> I'm not 100% sure that the performance at 200K+ CTX is good enough here. Maybe getting full-cuda-graphs offsets the cost, but at 4-8 concurrency (which is probably very common for requests with 200K+ tokens in context) it seems consistently 10-30% slower.

I agree that performance is worse at 4-8 concurrency, but I'm not sure those batch sizes are common for very large contexts - why not 32?
I also want to emphasize that the slowdown affects large contexts only, whereas CUDA graphs are currently disabled for ALL context lengths.

@mgoin added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Nov 15, 2025
@mgoin (Member) commented Nov 15, 2025

Is there anything that can be done to improve the trtllm attention for those poor cases? At this point I feel like removing the dynamism should be prioritized, since the impact of cudagraph support is so important.

@pavanimajety (Collaborator) commented:

I vote for removing the restriction to enable the full cuda graph path and filing a flashinfer kernel issue to improve kernel performance for long sequence lengths.

@pavanimajety enabled auto-merge (squash) November 15, 2025 10:13
@pavanimajety (Collaborator) left a comment

LGTM, thanks Vadim.

@pavanimajety merged commit 173b356 into vllm-project:main Nov 15, 2025
49 checks passed