
Conversation

@danielafrimi
Collaborator

@danielafrimi danielafrimi commented Nov 18, 2025

Add support for ModelOpt quantization configuration formats (config.json and hf_quant_config.json) and align the quantization config structure with llm-compressor's quantization_config format.

These changes target TRT-LLM 1.2.0.

Example quantization_config block in config.json:

"quantization_config": {
        "ignore": [
            "lm_head"
        ],
        "quant_algo": "W4A16_AWQ",
        "kv_cache_scheme": {
            "dynamic": false,
            "num_bits": 8,
            "type": "float"
        },
        "producer": {
            "name": "modelopt",
            "version": "0.40.0.dev4+g14fa1e575"
        },
        "quant_method": "modelopt"
    }

Refactor the code base so that the logic for all quantization formats lives in the QuantConfig class; until now this logic was split between llm_utils and ModelConfig, and the two paths were not aligned with each other. In addition, add support for the new ModelOpt format.
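
For illustration, a minimal usage sketch of the consolidated entry point follows. The keyword arguments are taken from the updated unit tests; the checkpoint path and the printed fields are placeholders, and the actual API may differ in detail.

from pathlib import Path

from tensorrt_llm.models.modeling_utils import QuantConfig

# Hypothetical directory containing either an hf_quant_config.json or a config.json
# with the "quantization_config" block shown above.
model_dir = Path("/path/to/quantized_checkpoint")

quant_config = QuantConfig()

# update_from_model_ckpt() detects which quantization config file is present
# (hf_quant_config.json, config.json, or dtypes.json), updates the QuantConfig
# in place, and returns whether anything changed plus any per-layer config.
is_changed, layer_quant_config = quant_config.update_from_model_ckpt(
    model_ckpt_path=model_dir,
    moe_backend=None,
)

print(quant_config.quant_algo, quant_config.kv_cache_quant_algo)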

Summary by CodeRabbit

  • New Features

    • Added file locking mechanism with fallback support for concurrent remote configuration access to prevent race conditions.
  • Bug Fixes

    • Consolidated quantization configuration loading with improved error handling and centralized initialization paths.
  • Tests

    • Updated quantization tests to validate new unified configuration loading behavior.

Signed-off-by: Daniel Afrimi <[email protected]>
Signed-off-by: Daniel Afrimi <[email protected]>
@danielafrimi danielafrimi requested review from a team as code owners November 18, 2025 14:14
@danielafrimi danielafrimi changed the title ModelOpt - Align quantization_config with llm-compressor format [None][feat] ModelOpt - Align quantization_config with llm-compressor format Nov 18, 2025
@danielafrimi
Collaborator Author

/bot run

1 similar comment
@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #24917 [ run ] triggered by Bot. Commit: bb7e1e1

@tensorrt-cicd
Collaborator

PR_Github #24919 [ run ] triggered by Bot. Commit: bb7e1e1

@tensorrt-cicd
Collaborator

PR_Github #24917 [ run ] completed with state ABORTED. Commit: bb7e1e1

Signed-off-by: Daniel Afrimi <[email protected]>
@coderabbitai
Contributor

coderabbitai bot commented Nov 18, 2025

📝 Walkthrough

Walkthrough

The PR consolidates quantization configuration loading from multiple static methods in ModelConfig to a unified update_from_model_ckpt method in QuantConfig. The new approach handles hf_quant_config.json, config.json, dtypes.json, and modelopt formats through a single interface, with added file locking for remote configs.

Changes

  • Quantization config refactoring—ModelConfig (tensorrt_llm/_torch/model_config.py): Removed static methods load_modelopt_quant_config, load_hf_quant_config, load_quant_config_from_dtypes_json, get_mxfp4_quant_algo, and override_quant_algo. Removed imports of json, os, get_sm_version, and QuantAlgo. Added a config_file_lock context manager for serializing remote config access. Updated the from_pretrained flow to delegate all quant config updates to QuantConfig.update_from_model_ckpt.
  • Quantization methods implementation—QuantConfig (tensorrt_llm/models/modeling_utils.py): Added nine new methods to QuantConfig: override_quant_algo(), get_mxfp4_quant_algo(), _infer_kv_cache_quant_algo_from_scheme(), _map_new_to_legacy_args(), _update_from_quant_config_json(), _update_from_legacy_args(), _update_from_legacy_quant_config_json(), load_quant_config_from_dtypes_json(), and update_from_model_ckpt(). Added an import of get_sm_version from tensorrt_llm._utils.
  • Consumer updates (tensorrt_llm/llmapi/llm_utils.py): Split combined imports and added a no-op lint annotation for QuantAlgo. Replaced in-place HF quant config parsing in _update_from_hf_quant_config with a single call to quant_config.update_from_model_ckpt(). Removed explicit validation and error handling, delegating to the unified updater.
  • Test updates (tests/unittest/llmapi/test_llm_quant.py): Updated two test cases to use QuantConfig.update_from_model_ckpt(model_ckpt_path=model_dir, moe_backend=None) instead of ModelConfig.load_modelopt_quant_config, and wrapped file paths with the Path() constructor.

Sequence Diagram(s)

sequenceDiagram
    participant Caller as from_pretrained/<br/>_update_from_hf_quant_config
    participant QuantConfig as QuantConfig.<br/>update_from_model_ckpt
    participant FSCheck as Checkpoint<br/>Detection
    participant Updaters as Specific<br/>Updaters

    Caller->>QuantConfig: update_from_model_ckpt(checkpoint_dir,<br/>moe_backend)
    
    QuantConfig->>FSCheck: Check for hf_quant_config.json
    alt hf_quant_config.json exists
        FSCheck-->>QuantConfig: Found
        QuantConfig->>Updaters: _update_from_quant_config_json()
    else config.json with quant_method
        FSCheck-->>QuantConfig: Not found, try config.json
        QuantConfig->>Updaters: _update_from_legacy_quant_config_json()
    else dtypes.json exists
        FSCheck-->>QuantConfig: Not found, try dtypes.json
        QuantConfig->>Updaters: load_quant_config_from_dtypes_json()
    else No quant config found
        FSCheck-->>QuantConfig: Not found
    end
    
    Updaters-->>QuantConfig: Returns is_changed, updates<br/>internal state
    QuantConfig-->>Caller: (is_quant_config_changed,<br/>layer_quant_config)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Multiple quantization paths and logic consolidation: Nine new methods added to QuantConfig, each handling different quantization formats (hf_quant_config.json, legacy HF args, dtypes.json, modelopt) with distinct parsing and validation logic. Review requires understanding all branches and validation flows.
  • Cross-module refactoring with dependencies: Logic moved from ModelConfig to QuantConfig introduces interdependencies that must be verified (e.g., correct fallback behavior, layer_quant_config population, error handling consistency).
  • New file locking mechanism: config_file_lock context manager adds synchronization logic with fallback paths and warning conditions that warrant careful validation.
  • Call site updates: Changes in llm_utils.py and test files depend on correct return contracts and state mutations of update_from_model_ckpt.

Areas requiring extra attention:

  • Completeness and ordering of the quant config detection/loading fallback chain in update_from_model_ckpt
  • Validation of HF quant config mapping logic in _map_new_to_legacy_args and _update_from_legacy_args for all supported quantization methods (see the illustrative mapping sketch after this list)
  • Layer quantization config merging behavior in _update_from_legacy_args to ensure correctness for mixed-precision scenarios
  • File locking semantics and fallback safety in config_file_lock, especially error recovery paths
  • Correctness of SM version checks and backend-specific logic in get_mxfp4_quant_algo
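
For context on the new-to-legacy mapping mentioned above, here is a purely illustrative sketch. The new-format keys come from the example config in the PR description and the legacy field names from QuantConfig; the helper name, signature, and exact rules are assumptions, not the PR's actual implementation.

from tensorrt_llm.quantization.mode import QuantAlgo


def map_new_to_legacy_args_sketch(quantization_config: dict) -> dict:
    """Hypothetical mapping from the llm-compressor style block to legacy QuantConfig fields."""
    legacy = {
        # "quant_algo" carries over directly, e.g. "W4A16_AWQ" (lookup by enum member name).
        "quant_algo": QuantAlgo[quantization_config["quant_algo"]],
        # llm-compressor's "ignore" list plays the role of the legacy "exclude_modules".
        "exclude_modules": quantization_config.get("ignore"),
    }
    # An 8-bit float kv_cache_scheme is read here as an FP8 KV cache (assumption).
    kv_scheme = quantization_config.get("kv_cache_scheme")
    if kv_scheme and kv_scheme.get("type") == "float" and kv_scheme.get("num_bits") == 8:
        legacy["kv_cache_quant_algo"] = QuantAlgo.FP8
    return legacy

Feeding the example config.json block from the PR description through such a mapping would yield quant_algo=W4A16_AWQ, exclude_modules=['lm_head'], and kv_cache_quant_algo=FP8.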

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
  • Description check (❓ Inconclusive): The description explains the main objective (consolidating quantization format logic into QuantConfig and adding ModelOpt format support) but lacks explicit test coverage details and PR checklist completion required by the template. Resolution: complete the PR checklist section and explicitly list test cases that validate the quantization config changes across different formats (modelopt, hf, dtypes.json).
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title clearly summarizes the main change: aligning quantization configuration with llm-compressor format and adding ModelOpt support, which directly corresponds to the core refactoring across multiple files.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (2)
tensorrt_llm/models/modeling_utils.py (1)

373-384: Consider more specific exception handling.

Line 381 catches a generic Exception, which is too broad. Consider catching specific exceptions like OSError, FileNotFoundError, or transformers library exceptions.

             if checkpoint_dir is not None:
                 try:
                     import transformers
                     mixed_quant_config_file = transformers.utils.hub.cached_file(
                         checkpoint_dir, 'quant_cfg.json')
                     with open(mixed_quant_config_file) as fm:
                         json_extended_quant_configs = json.load(fm)
-                except Exception:
+                except (OSError, FileNotFoundError, ValueError, json.JSONDecodeError) as e:
                     logger.info(
-                        "No quant_cfg.json found for layer quant info, using base quantization config."
+                        f"No quant_cfg.json found for layer quant info: {e}. Using base quantization config."
                     )
tensorrt_llm/_torch/model_config.py (1)

278-289: Consider using or removing is_quant_config_changed.

The is_quant_config_changed return value from update_from_model_ckpt is assigned but never used in all three call sites (lines 278, 283, 287). Consider either:

  1. Using this value to log whether a quantization config was found and loaded
  2. Using _ prefix if intentionally discarding: _is_quant_config_changed, layer_quant_config = ...
         # quantized ckpt in modelopt format
         if quant_config_file := cached_file(checkpoint_dir,
                                             'hf_quant_config.json'):
-            is_quant_config_changed, layer_quant_config = quant_config.update_from_model_ckpt(
+            _is_quant_config_changed, layer_quant_config = quant_config.update_from_model_ckpt(
                 checkpoint_dir, moe_backend)
 
         # quantized ckpt in other formats
         elif hasattr(pretrained_config, "quantization_config"):
-            is_quant_config_changed, layer_quant_config = quant_config.update_from_model_ckpt(
+            _is_quant_config_changed, layer_quant_config = quant_config.update_from_model_ckpt(
                 checkpoint_dir, moe_backend)
 
         elif quant_config_file := cached_file(checkpoint_dir, 'dtypes.json'):
-            is_quant_config_changed, layer_quant_config = quant_config.update_from_model_ckpt(
+            _is_quant_config_changed, layer_quant_config = quant_config.update_from_model_ckpt(
                 checkpoint_dir, moe_backend)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 160b361 and 2facb5e.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/model_config.py (2 hunks)
  • tensorrt_llm/llmapi/llm_utils.py (2 hunks)
  • tensorrt_llm/models/modeling_utils.py (2 hunks)
  • tests/unittest/llmapi/test_llm_quant.py (2 hunks)
🧰 Additional context used
🧠 Learnings (16)
📓 Common learnings
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/llmapi/test_llm_quant.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tests/unittest/llmapi/test_llm_quant.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/unittest/llmapi/test_llm_quant.py
  • tensorrt_llm/_torch/model_config.py
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.

Applied to files:

  • tests/unittest/llmapi/test_llm_quant.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tests/unittest/llmapi/test_llm_quant.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/unittest/llmapi/test_llm_quant.py
  • tensorrt_llm/llmapi/llm_utils.py
  • tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

  • tests/unittest/llmapi/test_llm_quant.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tests/unittest/llmapi/test_llm_quant.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-10-17T13:21:31.724Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 8398
File: tensorrt_llm/_torch/pyexecutor/sampling_utils.py:237-272
Timestamp: 2025-10-17T13:21:31.724Z
Learning: The setup.py file in TensorRT-LLM explicitly requires Python 3.10+ via `python_requires=">=3.10, <4"`, making match/case statements and other Python 3.10+ features appropriate throughout the codebase.

Applied to files:

  • tests/unittest/llmapi/test_llm_quant.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/model_config.py
📚 Learning: 2025-08-09T02:04:49.623Z
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 6760
File: tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py:81-98
Timestamp: 2025-08-09T02:04:49.623Z
Learning: In TensorRT-LLM's auto_deploy module, torch.dtype values in configuration dictionaries must be stored as string representations (e.g., "float16" instead of torch.float16) because OmegaConf.merge does not support torch.dtype types. These string representations are converted to actual torch.dtype objects in downstream code.

Applied to files:

  • tensorrt_llm/_torch/model_config.py
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/llmapi/llm_utils.py
  • tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

  • tensorrt_llm/_torch/model_config.py
📚 Learning: 2025-08-27T14:23:55.566Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.

Applied to files:

  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.

Applied to files:

  • tensorrt_llm/models/modeling_utils.py
🧬 Code graph analysis (4)
tests/unittest/llmapi/test_llm_quant.py (1)
tensorrt_llm/models/modeling_utils.py (2)
  • QuantConfig (133-578)
  • update_from_model_ckpt (478-504)
tensorrt_llm/_torch/model_config.py (1)
tensorrt_llm/models/modeling_utils.py (1)
  • update_from_model_ckpt (478-504)
tensorrt_llm/llmapi/llm_utils.py (3)
tensorrt_llm/quantization/mode.py (1)
  • QuantAlgo (23-47)
tensorrt_llm/models/modeling_utils.py (3)
  • PretrainedConfig (676-877)
  • QuantConfig (133-578)
  • update_from_model_ckpt (478-504)
tensorrt_llm/llmapi/llm_args.py (2)
  • quant_config (2718-2721)
  • quant_config (2724-2725)
tensorrt_llm/models/modeling_utils.py (2)
tensorrt_llm/_utils.py (1)
  • get_sm_version (740-742)
tensorrt_llm/quantization/mode.py (2)
  • QuantAlgo (23-47)
  • has_mxfp4 (197-199)
🪛 Ruff (0.14.5)
tests/unittest/llmapi/test_llm_quant.py

121-121: Unpacked variable is_quant_config_changed is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


165-165: Unpacked variable is_quant_config_changed is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tensorrt_llm/_torch/model_config.py

286-286: Local variable quant_config_file is assigned to but never used

(F841)


287-287: Unpacked variable is_quant_config_changed is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tensorrt_llm/llmapi/llm_utils.py

28-28: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

tensorrt_llm/models/modeling_utils.py

341-343: Avoid specifying long messages outside the exception class

(TRY003)


345-345: Avoid specifying long messages outside the exception class

(TRY003)


359-361: Avoid specifying long messages outside the exception class

(TRY003)


366-368: Avoid specifying long messages outside the exception class

(TRY003)


367-367: Undefined name quant_config

(F821)


381-381: Do not catch blind exception: Exception

(BLE001)


401-403: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/models/modeling_utils.py (1)

201-217: LGTM: Clean environment variable override mechanism.

The environment-based override with validation and clear warning messages follows good practices.

tensorrt_llm/llmapi/llm_utils.py (1)

335-339: LGTM: Clean delegation to centralized config updater.

The refactored method properly delegates quantization config updates to QuantConfig.update_from_model_ckpt, simplifying the logic and improving maintainability.

tensorrt_llm/_torch/model_config.py (1)

27-69: LGTM: Robust file locking with graceful fallbacks.

The context manager provides comprehensive fallback handling for file locking, ensuring safe concurrent access to model configs while gracefully degrading if locking fails. The multiple fallback paths (HF cache → temp dir → no lock) make this resilient to various permission and timeout scenarios.
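
For readers unfamiliar with the pattern, a simplified sketch of this kind of degrading lock follows. It uses the filelock package and invented paths; the real config_file_lock in model_config.py may differ in lock location, timeout, and warning behavior.

import contextlib
import tempfile
from pathlib import Path

import filelock  # assumed locking primitive; the real code may use something else


@contextlib.contextmanager
def config_file_lock_sketch(lock_name: str = "config.lock", timeout: float = 10.0):
    """Serialize access to shared config files, degrading gracefully if locking fails."""
    # Try progressively weaker locations: a user cache dir, then the system temp dir,
    # and finally give up on locking entirely rather than failing the load.
    candidates = [Path.home() / ".cache", Path(tempfile.gettempdir())]
    lock = None
    for base in candidates:
        try:
            base.mkdir(parents=True, exist_ok=True)
            lock = filelock.FileLock(str(base / lock_name), timeout=timeout)
            lock.acquire()
            break
        except (OSError, filelock.Timeout):
            lock = None  # fall through to the next candidate, or to no lock at all
    try:
        yield
    finally:
        if lock is not None:
            lock.release()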

Comment on lines +218 to +231
    @staticmethod
    def get_mxfp4_quant_algo(moe_backend, is_dynamic_quant=False):
        quant_algo = QuantConfig.override_quant_algo()
        if quant_algo is None and not is_dynamic_quant:
            if get_sm_version() >= 100:
                if moe_backend == 'TRITON':
                    return QuantAlgo.W4A8_MXFP4_FP8
                else:
                    return QuantAlgo.W4A8_MXFP4_MXFP8
            else:
                return QuantAlgo.W4A16_MXFP4
        else:
            return quant_algo

Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify CUDA availability before calling get_sm_version().

The method calls get_sm_version() at line 222, which internally uses torch.cuda.get_device_properties(0). This will raise a runtime error if no CUDA device is available or if device index 0 doesn't exist.

Please verify that get_mxfp4_quant_algo is only invoked in contexts where CUDA is guaranteed to be available. Consider adding a guard or documentation if this assumption is not always valid.


🏁 Script executed:

#!/bin/bash
# Check call sites of get_mxfp4_quant_algo to verify CUDA context assumptions
rg -nC3 'get_mxfp4_quant_algo'

Length of output: 1720


🏁 Script executed:

#!/bin/bash
# 1. Find get_sm_version implementation
echo "=== get_sm_version implementation ==="
rg -nA10 'def get_sm_version' tensorrt_llm/_utils.py

# 2. Check the context of _update_from_quant_config_json
echo -e "\n=== _update_from_quant_config_json context ==="
rg -nB5 '_update_from_quant_config_json' tensorrt_llm/models/modeling_utils.py | head -30

# 3. Check the context of load_quant_config_from_dtypes_json
echo -e "\n=== load_quant_config_from_dtypes_json context ==="
rg -nB5 'load_quant_config_from_dtypes_json' tensorrt_llm/models/modeling_utils.py | head -30

# 4. Check if there are any guards before calling get_mxfp4_quant_algo
echo -e "\n=== Checking for CUDA guards before get_mxfp4_quant_algo calls ==="
rg -nB10 'get_mxfp4_quant_algo' tensorrt_llm/models/modeling_utils.py | grep -A10 -B10 'cuda\|device\|gpu'

Length of output: 1683



🏁 Script executed:

#!/bin/bash
# Get full method definitions to understand context better
echo "=== _update_from_quant_config_json full method ==="
ast-grep --pattern $'def _update_from_quant_config_json($_) {
  $$$
}'

echo -e "\n=== load_quant_config_from_dtypes_json full method ==="
ast-grep --pattern $'def load_quant_config_from_dtypes_json($_) {
  $$$
}'

echo -e "\n=== Callers of load_from_json (entry point) ==="
rg -nB5 'load_from_json\(' tensorrt_llm/models/modeling_utils.py | grep -v 'test\|Test' | head -40

Length of output: 155



🏁 Script executed:

#!/bin/bash
# Extract the full method bodies using line ranges
echo "=== _update_from_quant_config_json (lines 277-320) ==="
sed -n '277,320p' tensorrt_llm/models/modeling_utils.py

echo -e "\n=== load_quant_config_from_dtypes_json (lines 445-480) ==="
sed -n '445,480p' tensorrt_llm/models/modeling_utils.py

echo -e "\n=== Check if these config loading methods check for CUDA ==="
rg -nC15 'def load_from_json' tensorrt_llm/models/modeling_utils.py | head -50

echo -e "\n=== Check if get_mxfp4_quant_algo is guarded by device availability ==="
sed -n '300,315p' tensorrt_llm/models/modeling_utils.py
sed -n '465,475p' tensorrt_llm/models/modeling_utils.py

Length of output: 5241



🏁 Script executed:

#!/bin/bash
# Find where update_from_model_ckpt is called from
echo "=== Callers of update_from_model_ckpt ==="
rg -n 'update_from_model_ckpt' tensorrt_llm/ -g '!*.pyc' | grep -v 'def update_from_model_ckpt'

echo -e "\n=== QuantConfig instantiation and usage ==="
rg -nB3 'QuantConfig\(' tensorrt_llm/models/modeling_utils.py | head -40

echo -e "\n=== Where is QuantConfig.load_from_json called ==="
rg -nB5 'load_from_json' tensorrt_llm/ -g '!*.pyc' -g '!test' | head -50

echo -e "\n=== Check if there are any torch.cuda availability checks in modeling_utils ==="
rg -n 'cuda\|device\|is_gpu' tensorrt_llm/models/modeling_utils.py | head -20

Length of output: 1695



🏁 Script executed:

#!/bin/bash
# Check the context around the callers in llmapi/llm_utils.py
echo "=== Context in llmapi/llm_utils.py around line 337 ==="
sed -n '330,345p' tensorrt_llm/llmapi/llm_utils.py

# Check the context in _torch/model_config.py
echo -e "\n=== Context in _torch/model_config.py around lines 278-287 ==="
sed -n '270,295p' tensorrt_llm/_torch/model_config.py

# Check if llmapi or model_config have any GPU/CUDA guards before calling quantization methods
echo -e "\n=== Check llmapi for CUDA guards ==="
rg -nC5 'cuda\|device\|gpu' tensorrt_llm/llmapi/llm_utils.py | head -40

echo -e "\n=== Check model_config for CUDA guards around quantization calls ==="
rg -nC5 'cuda\|device\|gpu' tensorrt_llm/_torch/model_config.py | head -40

Length of output: 2152


Add CUDA availability guard to get_sm_version() call in get_mxfp4_quant_algo().

The method calls get_sm_version() at line 222, which unconditionally calls torch.cuda.get_device_properties(0) without checking CUDA availability. This causes a runtime error when loading quantized checkpoints on CPU-only systems. Since update_from_model_ckpt() can be invoked during model initialization in non-GPU contexts, this is not safe.

Add a check or guard before accessing SM version:

if not torch.cuda.is_available():
    # Fall back to safe default for CPU environments
    return QuantAlgo.W4A16_MXFP4

Alternatively, wrap the get_sm_version() call in the method or add a device availability check at the calling sites.
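
Putting the suggestion together with the original snippet, a guarded version might look like the sketch below. The CPU-only fallback value is an assumption the maintainers would need to confirm.

# Sketch only: as it would sit inside the QuantConfig class in modeling_utils.py
# (assumes torch, get_sm_version, and QuantAlgo are available at module level).
@staticmethod
def get_mxfp4_quant_algo(moe_backend, is_dynamic_quant=False):
    quant_algo = QuantConfig.override_quant_algo()
    if quant_algo is not None or is_dynamic_quant:
        return quant_algo
    # New guard: SM-version detection requires a visible CUDA device.
    if not torch.cuda.is_available():
        return QuantAlgo.W4A16_MXFP4  # assumed safe default for CPU-only checkpoint loading
    if get_sm_version() >= 100:
        return QuantAlgo.W4A8_MXFP4_FP8 if moe_backend == 'TRITON' else QuantAlgo.W4A8_MXFP4_MXFP8
    return QuantAlgo.W4A16_MXFP4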

🤖 Prompt for AI Agents
In tensorrt_llm/models/modeling_utils.py around lines 218 to 231, the call to
get_sm_version() is unguarded and will raise on CPU-only systems because
torch.cuda.get_device_properties(0) is used; update get_mxfp4_quant_algo to
first check torch.cuda.is_available() and if CUDA is not available return a safe
CPU default (e.g., QuantAlgo.W4A16_MXFP4) instead of calling get_sm_version(),
otherwise proceed with the existing SM-version based branching; ensure this
early-return handles both is_dynamic_quant cases consistently so CPU-only runs
never call get_sm_version().

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #24921 [ run ] triggered by Bot. Commit: 2facb5e

@tensorrt-cicd
Collaborator

PR_Github #24919 [ run ] completed with state ABORTED. Commit: bb7e1e1
LLM/main/L0_MergeRequest_PR #18819 (Blue Ocean) completed with status: ABORTED

Signed-off-by: Daniel Afrimi <[email protected]>
Signed-off-by: Daniel Afrimi <[email protected]>
@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #24923 [ run ] triggered by Bot. Commit: 9d736e7

@tensorrt-cicd
Collaborator

PR_Github #24921 [ run ] completed with state ABORTED. Commit: 2facb5e
LLM/main/L0_MergeRequest_PR #18821 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Collaborator

PR_Github #24923 [ run ] completed with state SUCCESS. Commit: 9d736e7
/LLM/main/L0_MergeRequest_PR pipeline #18823 completed with status: 'FAILURE'

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #25064 [ run ] triggered by Bot. Commit: 9d736e7

@tensorrt-cicd
Collaborator

PR_Github #25064 [ run ] completed with state SUCCESS. Commit: 9d736e7
/LLM/main/L0_MergeRequest_PR pipeline #18944 completed with status: 'FAILURE'
