
Conversation


@Chen-0210 Chen-0210 commented Nov 11, 2025

Motivation

Support piecewise cuda graph for Qwen3-next
#11490

Modifications

  1. Temporarily split the entire GDN attention out of the compiled graph, due to the large number of parameters of the linear-attention function and other constraints. Performance is not ideal at the moment, but this can be refactored and optimized later.
  2. Disable dual_stream when piecewise CUDA graph is enabled.
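The split described in item 1 can be pictured with a minimal, framework-agnostic sketch (the names `SPLIT_OPS` and `piecewise_split` are illustrative, not the actual sglang API): walk the traced op sequence and cut the graph at every op that must run eagerly, so the runs of ops between cuts become separately capturable subgraphs.

```python
# Sketch of piecewise graph splitting: ops listed in SPLIT_OPS (e.g. the
# GDN linear-attention op) run eagerly, and each run of ops between them
# becomes a separately CUDA-graph-capturable segment.
SPLIT_OPS = {"sglang.gdn_with_output"}

def piecewise_split(op_names):
    """Return segments as ("capture", [ops...]) or ("eager", op)."""
    segments, current = [], []
    for name in op_names:
        if name in SPLIT_OPS:
            if current:
                segments.append(("capture", current))
                current = []
            segments.append(("eager", name))
        else:
            current.append(name)
    if current:
        segments.append(("capture", current))
    return segments

# For example, a layer of [qkv_proj, gdn, o_proj] splits into three segments:
# the projections are captured, while the GDN op stays eager.
segments = piecewise_split(["qkv_proj", "sglang.gdn_with_output", "o_proj"])
```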

Accuracy Tests

```shell
python3 -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Instruct/ --tp 2 --enable-piecewise-cuda-graph --piecewise-cuda-graph-compiler eager
python3 benchmark/gsm8k/bench_sglang.py --parallel 1319 --num-questions 1319
```

```
100%|████████████████████████████████████████| 1319/1319 [01:11<00:00, 18.50it/s]
Accuracy: 0.942
Invalid: 0.000
Latency: 71.401 s
Output throughput: 3114.390 token/s
```

Benchmarking and Profiling (TTFT at bs=1 on 2×H200)

```shell
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 50 --random-input-len 4096 --random-output-len 1 --random-range-ratio 1 --max-concurrency 1
```

| Input length | 1024 | 2048 | 4096 |
| --- | --- | --- | --- |
| with piecewise CUDA graph | 67.83 ms | 68.72 ms | 105.29 ms |
| without piecewise CUDA graph | 99.17 ms | 101.52 ms | 127.20 ms |

Checklist

@gemini-code-assist

Summary of Changes

Hello @Chen-0210, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces initial support for piecewise CUDA graphs for the Qwen3-next model. By enabling specific attention and gated delta rule operations to be compiled into CUDA graphs, the changes aim to optimize the model's execution performance. The modifications involve extending the graph compilation backend, defining custom operations for efficient tensor handling, and adapting the Qwen3-next model's forward pass to integrate these graph-based optimizations.

Highlights

  • Piecewise CUDA Graph Integration for Qwen3-next: Extended the graph splitting mechanism to include 'sglang.gdn_with_output' for Qwen3-next, enabling its attention operations to be compiled into piecewise CUDA graphs.
  • Custom Gated Delta Rule (GDN) Operation: Introduced 'gdn_with_output' and 'chunk_gated_delta_rule_with_output' as custom operations, allowing the Gated Delta Rule and chunked attention outputs to be handled efficiently within the CUDA graph context by copying results into preallocated tensors.
  • Enhanced Attention Layer Detection: Improved the model runner's ability to identify various attention layer types ('attn', 'linear_attn') within models, ensuring broader compatibility for piecewise CUDA graph application.
  • Refactored Qwen3-next Forward Pass: Modified the Qwen3-next attention and main model forward passes to conditionally leverage the new piecewise CUDA graph capabilities, including adjusting stream thresholds and context management.
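The "with_output" pattern in the second highlight — writing results into a caller-provided, preallocated buffer so tensor addresses stay stable across graph replays — can be sketched with a pure-Python stand-in (the real op copies into preallocated CUDA tensors; `gdn_with_output` here is a toy placeholder for the gated-delta-rule kernel):

```python
# Sketch of the "with_output" convention: the op mutates a preallocated
# output buffer in place instead of returning a fresh allocation, so the
# buffer's memory address stays stable across CUDA-graph replays.
def gdn_with_output(inputs, output):
    # Stand-in for the gated-delta-rule kernel: here just a scaled copy.
    for i, x in enumerate(inputs):
        output[i] = 2 * x  # write into the caller-owned buffer

buf = [0, 0, 0]            # preallocated once, reused on every replay
gdn_with_output([1, 2, 3], buf)
```

CUDA graphs record fixed device addresses, so any op that allocates a new output on each call would break replay; copying into a stable buffer is the standard workaround.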


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for piecewise CUDA graphs for Qwen3-next models, which involves refactoring attention mechanisms and integrating custom operations. Key changes include modifying graph splitting logic to include sglang.gdn_with_output, refactoring Qwen3GatedDeltaNet's forward pass, and adding new custom operations for gated delta rule and GDN with output. The review identified a critical syntax error, potential performance implications from disabling dual-stream optimization, and some minor code cleanup opportunities.

@yuan-luo yuan-luo self-requested a review November 12, 2025 01:49
@Chen-0210 Chen-0210 changed the title [WIP]Support piecewise cuda graph for Qwen3-next Support piecewise cuda graph for Qwen3-next Nov 17, 2025
@Chen-0210 Chen-0210 marked this pull request as ready for review November 17, 2025 06:29
@Chen-0210 Chen-0210 force-pushed the support_piece_cuda_graph_Qwen3-next branch from 4a97551 to 1883eab Compare November 17, 2025 09:08
@Chen-0210 Chen-0210 force-pushed the support_piece_cuda_graph_Qwen3-next branch from 1883eab to 7896915 Compare November 17, 2025 09:11