[PERF] Remove TRTLLM Gen attn kernel limitation max_seq_len <=131072
#28755
Conversation
Signed-off-by: Vadim Gimpelson <[email protected]>
Code Review
This pull request removes the max_seq_len <= 131072 limitation for using the TRTLLM attention kernel on NVIDIA Blackwell GPUs. The changes involve removing the sequence length check in vllm/utils/flashinfer.py and the corresponding logic in vllm/config/vllm.py that disabled full CUDA graphs for longer sequences. The goal is to simplify kernel dispatching and enable full CUDA graphs for very long sequences. The author has provided test results showing correctness for sequence lengths up to 2**20. While the provided benchmarks show some performance regressions for the standalone attention kernel in specific scenarios, this change is expected to enable better end-to-end performance by allowing cudagraphs. The changes are straightforward and consistent with the PR's objective. I find no issues of high or critical severity.
I'm not 100% sure that the performance at 200K+ CTX is good enough here. Maybe getting full-cuda-graphs offsets the cost, but at 4-8 concurrency (which is probably very common for requests with 200K+ tokens in context) it seems consistently 10-30% slower.
Agree that for 4-8 concurrency performance is worse, but I'm not sure these are common for very large contexts. Why not 32?
Is there anything that can be done to improve the trtllm attention for those poor cases? At this point I feel like removing the dynamism should be prioritized since the impact of cudagraph support is so important.
I vote for removing the restriction to enable the full cuda graph path and filing a flashinfer kernel issue to improve kernel performance for long sequence lengths.
pavanimajety left a comment
LGTM, thanks Vadim.
[PERF] Remove TRTLLM Gen attn kernel limitation max_seq_len <=131072 (vllm-project#28755) Signed-off-by: Vadim Gimpelson <[email protected]> Signed-off-by: George D. Torres <[email protected]>
Purpose
Right now we have a limitation: the TRTLLM-Gen full attn kernel is used only if the longest sequence length in the batch is less than or equal to 131072.
From one point of view, it is a bit unclear where this limit came from. The TRTLLM-Gen full attn kernel works correctly for any `max_seq_len`, and performance data do not show a meaningful difference beyond `max_seq_len=64K` and `max_seq_len=128K` - behavior for bigger `max_seq_len` is the same as for `max_seq_len=64K` and `max_seq_len=128K`. From the other point of view, dynamic switching between different kernels causes headaches - for example, we have to disable cudagraph (see #27114).
This PR removes the `max_seq_len <= 131072` limitation for the TRTLLM Gen attn kernel.
Functional Tests Result
Ran `tests/kernels/attention/test_flashinfer_trtllm_attention.py` with modified `MAX_SEQ_LENS`: all tests successfully passed.
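For reproducibility, a sketch of how the test parametrization might be extended; the list contents below are an assumption, and only the 2**20 upper bound comes from the results reported in this PR:

```python
# Hypothetical edit to tests/kernels/attention/test_flashinfer_trtllm_attention.py.
# The original list contents are assumed here; only the 2**20 upper bound is
# taken from the test results mentioned in this PR.
MAX_SEQ_LENS = [2**17, 2**18, 2**19, 2**20]  # extend past the old 131072 (2**17) cap

# Run with, for example:
#   pytest tests/kernels/attention/test_flashinfer_trtllm_attention.py -v
```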
Performance Tests Result
Used `benchmarks/kernels/benchmark_trtllm_decode_attention.py` on B200 (see the example invocation below).
cc @pavanimajety @benchislett @mgoin
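A minimal way to reproduce the benchmark run, assuming the script executes its default sweep when launched directly; any CLI flags it may support are not shown here:

```python
# Sketch of launching the decode-attention benchmark on a B200 node.
# Assumes the script runs a built-in sweep when invoked without arguments.
import subprocess

subprocess.run(
    ["python", "benchmarks/kernels/benchmark_trtllm_decode_attention.py"],
    check=True,  # raise if the benchmark exits with a non-zero status
)
```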