Conversation

@fzyzcjy (Collaborator) commented Nov 15, 2025

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist (Contributor)

Summary of Changes

Hello @fzyzcjy, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the DeepseekV2 model's weight handling for Blackwell DeepGEMM by introducing a controlled requantization process. It ensures that weight requantization and subsequent scale adjustments are performed precisely once, optimizing performance and maintaining accuracy for quantized models by preventing duplicate operations.

Highlights

  • DeepGEMM Weight Requantization Control: Introduced a mechanism to ensure that the _weight_requant_ue8m0 process is executed only once during the post_load_weights phase for Blackwell DeepGEMM, preventing redundant or incorrect requantization.
  • Weight Scale Transformation: Integrated inverse_transform_scale_ue8m0 to correctly adjust weight scales after requantization, ensuring accurate model behavior when the requantization has been performed.
  • Quantization Utility Import: Added the import for inverse_transform_scale_ue8m0 from sglang.srt.quantization.quantization_utils to support the new weight scale transformation logic.
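The run-once guard described in the highlights can be sketched as follows. The method and flag names (`_weight_requant_ue8m0`, `_executed_weight_requant_ue8m0`, `inverse_transform_scale_ue8m0`) come from this PR's summary, but the surrounding class structure is a simplified assumption for illustration, not the actual DeepseekV2 implementation in sglang:

```python
class DeepseekV2WeightHandlerSketch:
    """Hypothetical simplification of the run-once requantization guard."""

    def __init__(self):
        # Tracks whether UE8M0 requantization has already been applied.
        self._executed_weight_requant_ue8m0 = False
        self.requant_calls = 0
        self.inverse_calls = 0

    def _weight_requant_ue8m0(self):
        # Stand-in for the real Blackwell DeepGEMM weight requantization.
        self.requant_calls += 1

    def _inverse_transform_scale_ue8m0(self):
        # Stand-in for applying inverse_transform_scale_ue8m0 to the weight scales.
        self.inverse_calls += 1

    def post_load_weights(self):
        if not self._executed_weight_requant_ue8m0:
            # First load: requantize exactly once and remember it.
            self._weight_requant_ue8m0()
            self._executed_weight_requant_ue8m0 = True
        else:
            # Subsequent weight updates: weights are already requantized, so
            # only adjust the scales instead of requantizing a second time.
            self._inverse_transform_scale_ue8m0()
```

Calling `post_load_weights` twice on this sketch performs one requantization and one inverse scale transform, which is the duplicate-operation behavior the PR is guarding against.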
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces support for weight updates for DeepGEMM on Blackwell GPUs in the DeepSeek V2 model. It adds a flag, _executed_weight_requant_ue8m0, to ensure that weight requantization is performed only once. Subsequent calls to post_load_weights will correctly handle the already-quantized weights by applying an inverse transformation to the weight scales. The logic appears sound and correctly addresses the need for dynamic weight updates.

My review includes a couple of suggestions to improve code clarity and reduce redundancy by reusing an existing variable instead of repeatedly calling getattr.

Comment on lines +3319 to +3321

```python
weight_block_size=getattr(
    self.quant_config, "weight_block_size", None
)
```
Severity: medium

The weight_block_size variable is already defined at the beginning of the post_load_weights function. Reusing it here instead of calling getattr again would improve code clarity and avoid redundancy.

```python
weight_block_size=weight_block_size
```
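As a tiny illustration of the reviewer's point (with a hypothetical `QuantConfig` stand-in): binding the `getattr` result to a local once at the top of the function, then reusing that local at each call site, avoids repeated attribute lookups and keeps the call sites guaranteed to agree:

```python
class QuantConfig:
    """Hypothetical stand-in for self.quant_config in the PR."""
    weight_block_size = (128, 128)  # assumed example value

def post_load_weights(quant_config):
    # Bind once at the top of the function, as the reviewer suggests...
    weight_block_size = getattr(quant_config, "weight_block_size", None)
    # ...then reuse the local at every call site instead of repeating getattr.
    return weight_block_size, weight_block_size

print(post_load_weights(QuantConfig()))  # → ((128, 128), (128, 128))
```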

```python
if (
    not ENABLE_FLASHINFER_FP8_GEMM
    and should_deepgemm_weight_requant_ue8m0(
        weight_block_size=getattr(self.quant_config, "weight_block_size", None)
```
Severity: medium

Similar to the previous comment, weight_block_size is already defined at the top of the function. It's best to reuse it here to avoid the redundant getattr call.

```python
weight_block_size=weight_block_size
```

@fzyzcjy (Collaborator, Author) commented Nov 17, 2025

pass ci


@fzyzcjy fzyzcjy merged commit f3e9336 into sgl-project:main Nov 17, 2025
41 of 64 checks passed
00INDEX pushed a commit to 00INDEX/sglang that referenced this pull request Nov 18, 2025
