Add DeepSeek MoE detection and export mapping in HF PTQ/export path #1125
Charles-JCJ wants to merge 1 commit into NVIDIA:main
Conversation
📝 Walkthrough
Adds DeepSeek (v2/v3) export support across the export pipeline: HF config mapping, MoE detection and expert extraction, Megatron/NEMO layer handling tweaks, PQS quantization pre-fuse mappings, and unit tests covering DeepSeek MoE behaviors and mappings.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
🧹 Nitpick comments (1)
modelopt/torch/export/layer_utils.py (1)
330-343: LGTM on DeepSeek detection, but verify `gptossmoe` coverage.
The `_is_deepseek_moe_name` helper and its usage in `is_moe` correctly identify DeepSeek MoE modules by requiring both "deepseek" and "moe" in the name plus the presence of `gate` and `experts` attributes. However, `gptossmoe` was added to the explicit matches (line 343) while the test file only covers DeepSeek. Consider adding test coverage for `GptOssMoE` detection to prevent regressions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/layer_utils.py` around lines 330 - 343, Add a unit test to assert that is_moe correctly detects modules named/typed as GptOssMoE: create a minimal nn.Module subclass whose class name includes "GptOssMoE" (or set type(module).__name__ to that via a real class name) and verify is_moe(module) returns True; place this alongside the existing DeepSeek test to ensure the explicit match for "gptossmoe" in is_moe is covered and guarded against regressions.
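A minimal sketch of the suggested regression test, assuming `is_moe` can be called with just the module and matches on the lowercased class name; the `GptOssMoE` stand-in class and its attributes are illustrative, not taken from the real modeling code.

```python
import torch.nn as nn

from modelopt.torch.export.layer_utils import is_moe


class GptOssMoE(nn.Module):
    """Stand-in whose class name should hit the explicit 'gptossmoe' match in is_moe."""

    def __init__(self):
        super().__init__()
        self.router = nn.Linear(8, 4)
        self.experts = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])


def test_is_moe_detects_gptoss_moe():
    # The review suggests verifying the explicit match added for GptOssMoE modules.
    assert is_moe(GptOssMoE())
```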
📒 Files selected for processing (4)
- modelopt/torch/export/hf_config_map.py
- modelopt/torch/export/layer_utils.py
- modelopt/torch/export/quant_utils.py
- tests/unit/torch/export/test_deepseek_export_support.py
Signed-off-by: Charles.J <jiangchangjian247@gmail.com>
The branch was force-pushed from 0cae4cb to de8820f.
🧹 Nitpick comments (1)
modelopt/torch/export/layer_utils.py (1)
155-180: Consider adding defensive attribute checks for legacy NEMO model handling.
The code assumes `language_model.embedding` and `language_model.encoder` exist after finding a `TransformerLanguageModel`. While this is likely always true for valid NEMO models, adding explicit checks would make the code more robust and provide clearer error messages if assumptions are violated.
💡 Suggested defensive checks

```diff
 if language_model:
     warn("Warning: this is an old NEMO checkpoint format and will be deprecated soon.")
+    if not hasattr(language_model, "embedding") or not hasattr(language_model, "encoder"):
+        raise ValueError(
+            "TransformerLanguageModel missing expected 'embedding' or 'encoder' attributes"
+        )
     layers = list(language_model.embedding.children()) + list(
         language_model.encoder.children()
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/layer_utils.py` around lines 155 - 180, The legacy NEMO branch that locates a TransformerLanguageModel (variable language_model) should defensively verify attributes before accessing them: check hasattr(language_model, "embedding") and hasattr(language_model, "encoder") before building layers, and only call list(language_model.embedding.children()) or list(language_model.encoder.children()) when present; if either is missing raise or warn with a clear message mentioning TransformerLanguageModel so callers know the checkpoint is unexpected; likewise guard access to language_model.output_layer (already partially handled) and append it only if hasattr(language_model, "output_layer"); update the code around the language_model handling in layer_utils.py to perform these attribute checks and produce explicit error/warning messages rather than assuming attributes exist.
📒 Files selected for processing (4)
- modelopt/torch/export/hf_config_map.py
- modelopt/torch/export/layer_utils.py
- modelopt/torch/export/quant_utils.py
- tests/unit/torch/export/test_deepseek_export_support.py
✅ Files skipped from review due to trivial changes (1)
- tests/unit/torch/export/test_deepseek_export_support.py
🚧 Files skipped from review as they are similar to previous changes (1)
- modelopt/torch/export/hf_config_map.py
@Charles-JCJ is this work to support HF quantized checkpoint export or TRTLLM (old format) checkpoint export?
cjluo-nv left a comment
Summary: Adds DeepSeek V2/V3 MoE support to the HF PTQ/export pipeline: config mapping for `n_routed_experts`, MoE detection, expert linear name resolution, and pre-quant fusion rules. Also includes an unrelated Megatron/NEMO `get_transformer_layers` change and several comment fixups.
Issues Found:
- [Readability] Duplicated branch in `get_experts_list` (layer_utils.py, the `elif "deepseek" in model_type:` block around lines 101-102): The new `elif "deepseek" in model_type:` branch sets `linear_names = ["gate_proj", "down_proj", "up_proj"]`, identical to the Qwen branch immediately above it. Merge DeepSeek into the existing `any(...)` check:

  ```python
  elif any(
      variant in model_type
      for variant in [
          "qwenmoeforcausallm",
          "qwen2moeforcausallm",
          "qwen3moeforcausallm",
          "qwen3nextforcausallm",
          "deepseek",
      ]
  ):
  ```

  Or alternatively, keep it separate but add a comment explaining the intentional separation; as-is it looks like an oversight.

- [Correctness] `get_experts_list` uses a loose substring match (layer_utils.py, new line ~101): `"deepseek" in model_type` would match any model_type containing "deepseek" as a substring. Currently that's fine, but it's inconsistent with the more specific checks for other models (e.g., `"mixtralforcausallm"`, `"qwen2moeforcausallm"`). If a future DeepSeek variant uses different expert linear names, this broad match would silently produce wrong results. Consider matching `"deepseekv2"` or `"deepseekv3"` explicitly, or at minimum `"deepseek"` with a comment noting the assumption.

- [Correctness] Scope creep: Megatron/NEMO `get_transformer_layers` changes (layer_utils.py, diff lines 153-181). The Megatron/NEMO handling block (BFS for `TransformerLanguageModel`, legacy checkpoint support) is unrelated to DeepSeek MoE detection. This is a significant chunk of new logic (~30 lines) with no tests and no mention in the PR description. It should either:
  - Be split into a separate PR with its own tests, or
  - Be explicitly documented in this PR's description with test coverage.

- [Tests] No tests for the Megatron/NEMO `get_transformer_layers` additions: The new Megatron/NEMO branch in `get_transformer_layers` includes a BFS search, conditional unwrapping, and a deprecation warning, all untested. This is particularly risky because it modifies the model traversal entry point.

- [Tests] No negative test for the `is_moe` structural check: The new `is_moe` logic requires both `hasattr(module, "gate")` and `hasattr(module, "experts")`. A test with a DeepSeek-named module missing one of these attributes would validate the guard; currently only the happy path is tested (see the sketch after this list).

- [Readability] Copyright year (test_deepseek_export_support.py, line 1): The copyright header says 2024, but this is a new file created in 2026.

- [Readability] Comment-only changes mixed in (layer_utils.py): The `Megatron-core` → `NEMO` comment rename (line 393→393 in the diff) and the `MCore` → `MCore / NeMo` comment fix (around line 1493) are fine individually but further blur the PR's scope.
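A possible shape for the missing negative test, assuming `is_moe` takes the module directly and that its structural guard requires both `gate` and `experts`; the class and attribute names below are illustrative only.

```python
import torch.nn as nn

from modelopt.torch.export.layer_utils import is_moe


class DeepseekV3MoE(nn.Module):
    """DeepSeek-named module that deliberately lacks the `experts` attribute."""

    def __init__(self):
        super().__init__()
        self.gate = nn.Linear(16, 4)  # has `gate`, but no `experts`


def test_is_moe_rejects_deepseek_module_without_experts():
    # The structural guard should reject DeepSeek-named modules missing `experts`.
    assert not is_moe(DeepseekV3MoE())
```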
Suggestions:
- The `_is_deepseek_moe_name` helper is a nice pattern; consider also using it in `get_experts_list` instead of the raw `"deepseek" in model_type` check, for consistency.
- The `DeepseekV2Attention`/`DeepseekV3Attention` entries in `PQS_FUSE_MODULE_MAPPING` are added to the same tuple as `LlamaAttention`. If DeepSeek attention has a fundamentally different structure (e.g., MLA with latent attention), verify the `v_proj`/`o_proj` fusion math still holds.
- Consider adding `"qwen3_5moeforcausallm"` to `get_experts_list` since `Qwen3_5MoeSparseMoeBlock` is already in `get_expert_linear_names`.
Overall Assessment: The core DeepSeek MoE changes are straightforward and correct. However, the Megatron/NEMO changes are out of scope and untested, which is the primary blocking concern. The DeepSeek-specific changes themselves are low-risk additions to existing patterns.
What does this PR do?
Type of change: Bug fix
This PR adds missing DeepSeek MoE support in the Hugging Face PTQ/export path by:
- Mapping `n_routed_experts` to `moe_num_experts` in the HF config map (illustrated below)
- Resolving DeepSeek MoE expert linear names to `gate_proj`/`down_proj`/`up_proj`
Usage
This improves DeepSeek MoE compatibility in the existing Hugging Face PTQ/export flow.
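A purely hypothetical illustration of the config mapping described above; the actual entry lives in `modelopt/torch/export/hf_config_map.py` and its real structure may differ.

```python
# DeepSeek V2/V3 HF configs expose the routed-expert count as `n_routed_experts`;
# the export path is described as surfacing it under the generic `moe_num_experts` key.
deepseek_hf_config = {"n_routed_experts": 256, "num_experts_per_tok": 8}

exported = {"moe_num_experts": deepseek_hf_config["n_routed_experts"]}
assert exported["moe_num_experts"] == 256
```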
Testing
- Added unit tests in tests/unit/torch/export/test_deepseek_export_support.py
- python3 -m py_compile ...
Before your PR is "Ready for review"
- CONTRIBUTING.md: N/A
Summary by CodeRabbit
New Features
Tests