
Conversation


@Chen-0210 Chen-0210 commented Nov 11, 2025

Motivation

Support piecewise cuda graph for Qwen3-next
#11490

Modifications

  1. Temporarily split the entire GDN attention out of the compiled graph, due to the large number of parameters of the linear-attention function and other constraints. Performance is not ideal at the moment, but this can be refactored and optimized later.
  2. Disable dual_stream when piecewise CUDA graph is enabled.
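The split described in item 1 can be pictured with a minimal, framework-agnostic sketch (the names `SPLIT_OPS` and `piecewise_split` are illustrative, not the actual sglang API): walk the traced op sequence and cut the graph at every op that must run eagerly, so the runs of ops between cuts become separately capturable subgraphs.

```python
# Sketch of piecewise graph splitting: ops listed in SPLIT_OPS (e.g. the
# GDN linear-attention op) run eagerly, and each run of ops between them
# becomes a separately CUDA-graph-capturable segment.
SPLIT_OPS = {"sglang.gdn_with_output"}

def piecewise_split(op_names):
    """Return segments as ("capture", [ops...]) or ("eager", op)."""
    segments, current = [], []
    for name in op_names:
        if name in SPLIT_OPS:
            if current:
                segments.append(("capture", current))
                current = []
            segments.append(("eager", name))
        else:
            current.append(name)
    if current:
        segments.append(("capture", current))
    return segments

# For example, a layer of [qkv_proj, gdn, o_proj] splits into three segments:
# the projections are captured, while the GDN op stays eager.
segments = piecewise_split(["qkv_proj", "sglang.gdn_with_output", "o_proj"])
```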

Accuracy Tests

```shell
python3 -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Instruct/ --tp 2 --enable-piecewise-cuda-graph --piecewise-cuda-graph-compiler eager
python3 benchmark/gsm8k/bench_sglang.py --parallel 1319 --num-questions 1319
```

```
100%|████████████████████████████████████████| 1319/1319 [01:11<00:00, 18.50it/s]
Accuracy: 0.942
Invalid: 0.000
Latency: 71.401 s
Output throughput: 3114.390 token/s
```

Benchmarking and Profiling (TTFT at bs=1 on 2×H200)

```shell
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 50 --random-input-len 4096 --random-output-len 1 --random-range-ratio 1 --max-concurrency 1
```

| Input length | 1024 | 2048 | 4096 |
| --- | --- | --- | --- |
| with piecewise CUDA graph | 67.83 ms | 68.72 ms | 105.29 ms |
| without piecewise CUDA graph | 99.17 ms | 101.52 ms | 127.20 ms |

Checklist

@gemini-code-assist

Summary of Changes

Hello @Chen-0210, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces initial support for piecewise CUDA graphs for the Qwen3-next model. By enabling specific attention and gated delta rule operations to be compiled into CUDA graphs, the changes aim to optimize the model's execution performance. The modifications involve extending the graph compilation backend, defining custom operations for efficient tensor handling, and adapting the Qwen3-next model's forward pass to integrate these graph-based optimizations.

Highlights

  • Piecewise CUDA Graph Integration for Qwen3-next: Extended the graph splitting mechanism to include 'sglang.gdn_with_output' for Qwen3-next, enabling its attention operations to be compiled into piecewise CUDA graphs.
  • Custom Gated Delta Rule (GDN) Operation: Introduced 'gdn_with_output' and 'chunk_gated_delta_rule_with_output' as custom operations, allowing the Gated Delta Rule and chunked attention outputs to be handled efficiently within the CUDA graph context by copying results into preallocated tensors.
  • Enhanced Attention Layer Detection: Improved the model runner's ability to identify various attention layer types ('attn', 'linear_attn') within models, ensuring broader compatibility for piecewise CUDA graph application.
  • Refactored Qwen3-next Forward Pass: Modified the Qwen3-next attention and main model forward passes to conditionally leverage the new piecewise CUDA graph capabilities, including adjusting stream thresholds and context management.
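The "with_output" pattern in the second highlight — writing results into a caller-provided, preallocated buffer so tensor addresses stay stable across graph replays — can be sketched with a pure-Python stand-in (the real op copies into preallocated CUDA tensors; `gdn_with_output` here is a toy placeholder for the gated-delta-rule kernel):

```python
# Sketch of the "with_output" convention: the op mutates a preallocated
# output buffer in place instead of returning a fresh allocation, so the
# buffer's memory address stays stable across CUDA-graph replays.
def gdn_with_output(inputs, output):
    # Stand-in for the gated-delta-rule kernel: here just a scaled copy.
    for i, x in enumerate(inputs):
        output[i] = 2 * x  # write into the caller-owned buffer

buf = [0, 0, 0]            # preallocated once, reused on every replay
gdn_with_output([1, 2, 3], buf)
```

CUDA graphs record fixed device addresses, so any op that allocates a new output on each call would break replay; copying into a stable buffer is the standard workaround.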


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for piecewise CUDA graphs for Qwen3-next models, which involves refactoring attention mechanisms and integrating custom operations. Key changes include modifying graph splitting logic to include sglang.gdn_with_output, refactoring Qwen3GatedDeltaNet's forward pass, and adding new custom operations for gated delta rule and GDN with output. The review identified a critical syntax error, potential performance implications from disabling dual-stream optimization, and some minor code cleanup opportunities.

@yuan-luo yuan-luo self-requested a review November 12, 2025 01:49
@Chen-0210 Chen-0210 changed the title [WIP]Support piecewise cuda graph for Qwen3-next Support piecewise cuda graph for Qwen3-next Nov 17, 2025
@Chen-0210 Chen-0210 marked this pull request as ready for review November 17, 2025 06:29
@Chen-0210 Chen-0210 force-pushed the support_piece_cuda_graph_Qwen3-next branch from 4a97551 to 1883eab Compare November 17, 2025 09:08
@Chen-0210 Chen-0210 force-pushed the support_piece_cuda_graph_Qwen3-next branch from 1883eab to 7896915 Compare November 17, 2025 09:11