Support piecewise cuda graph for Qwen3-next #13081
Summary of Changes
Hello @Chen-0210, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.
This pull request introduces initial support for piecewise CUDA graphs for the Qwen3-next model. By enabling specific attention and gated delta rule operations to be compiled into CUDA graphs, the changes aim to optimize the model's execution performance. The modifications involve extending the graph compilation backend, defining custom operations for efficient tensor handling, and adapting the Qwen3-next model's forward pass to integrate these graph-based optimizations.
Code Review
This pull request introduces support for piecewise CUDA graphs for Qwen3-next models, which involves refactoring attention mechanisms and integrating custom operations. Key changes include modifying graph splitting logic to include sglang.gdn_with_output, refactoring Qwen3GatedDeltaNet's forward pass, and adding new custom operations for gated delta rule and GDN with output. The review identified a critical syntax error, potential performance implications from disabling dual-stream optimization, and some minor code cleanup opportunities.
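The graph-splitting idea mentioned in the review can be illustrated with a small, self-contained sketch. This is not the code from this PR or from SGLang's compilation backend: `split_into_pieces` and every op name other than `sglang.gdn_with_output` are hypothetical, and the sketch assumes the common piecewise scheme in which the op sequence is cut at designated split ops so that the pieces between them can be captured as separate CUDA graphs.

```python
from typing import List, Set

# Only "sglang.gdn_with_output" comes from the PR text; treating it as the
# sole split op here is an assumption made for illustration.
SPLIT_OPS: Set[str] = {"sglang.gdn_with_output"}


def split_into_pieces(ops: List[str], split_ops: Set[str] = SPLIT_OPS) -> List[List[str]]:
    """Cut a flat op sequence into pieces, isolating each split op in its own piece."""
    pieces: List[List[str]] = []
    current: List[str] = []
    for op in ops:
        if op in split_ops:
            # Close the piece accumulated so far, then emit the split op alone.
            if current:
                pieces.append(current)
                current = []
            pieces.append([op])
        else:
            current.append(op)
    if current:
        pieces.append(current)
    return pieces


# Hypothetical op sequence for two layers followed by an MLP.
ops = [
    "rms_norm", "linear", "sglang.gdn_with_output",
    "linear", "add", "sglang.gdn_with_output",
    "mlp",
]
pieces = split_into_pieces(ops)

# Pieces that contain no split op are the candidates for graph capture
# under the assumed scheme.
capturable = [p for p in pieces if p[0] not in SPLIT_OPS]
```

Under these assumptions, adding an op name to the split set is what changes where the graph is cut, which is the kind of change the review describes for `sglang.gdn_with_output`.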
Motivation
Support piecewise CUDA graphs for Qwen3-next.
#11490
Modifications
Accuracy Tests
Benchmarking and Profiling (TTFT at bs=1 on H200x2)
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 50 --random-input-len 4096 --random-output-len 1 --random-range-ratio 1 --max-concurrency 1
Checklist