
Conversation

@danielafrimi
Collaborator

@danielafrimi danielafrimi commented Nov 18, 2025

Add support for ModelOpt quantization configuration formats (config.json and hf_quant_config.json) and align the quantization config structure with llm-compressor's quantization_config format.

These changes target TRT-LLM 1.2.0.

Example quantization_config block in config.json:

"quantization_config": {
        "ignore": [
            "lm_head"
        ],
        "quant_algo": "W4A16_AWQ",
        "kv_cache_scheme": {
            "dynamic": false,
            "num_bits": 8,
            "type": "float"
        },
        "producer": {
            "name": "modelopt",
            "version": "0.40.0.dev4+g14fa1e575"
        },
        "quant_method": "modelopt"
    }

Refactor the code base so that the logic for all quantization formats lives in the QuantConfig class; until now this logic was split between llm_utils and ModelConfig, and the two paths were not aligned with each other. In addition, add support for the new ModelOpt format.
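
For illustration, a minimal usage sketch of the consolidated entry point follows. The keyword arguments are taken from the updated unit tests; the checkpoint path and the printed fields are placeholders, and the actual API may differ in detail.

from pathlib import Path

from tensorrt_llm.models.modeling_utils import QuantConfig

# Hypothetical directory containing either an hf_quant_config.json or a config.json
# with the "quantization_config" block shown above.
model_dir = Path("/path/to/quantized_checkpoint")

quant_config = QuantConfig()

# update_from_model_ckpt() detects which quantization config file is present
# (hf_quant_config.json, config.json, or dtypes.json), updates the QuantConfig
# in place, and returns whether anything changed plus any per-layer config.
is_changed, layer_quant_config = quant_config.update_from_model_ckpt(
    model_ckpt_path=model_dir,
    moe_backend=None,
)

print(quant_config.quant_algo, quant_config.kv_cache_quant_algo)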

Summary by CodeRabbit

  • New Features

    • Added file locking mechanism with fallback support for concurrent remote configuration access to prevent race conditions.
  • Bug Fixes

    • Consolidated quantization configuration loading with improved error handling and centralized initialization paths.
  • Tests

    • Updated quantization tests to validate new unified configuration loading behavior.

Signed-off-by: Daniel Afrimi <[email protected]>
Signed-off-by: Daniel Afrimi <[email protected]>
@danielafrimi danielafrimi requested review from a team as code owners November 18, 2025 14:14
@danielafrimi danielafrimi changed the title ModelOpt - Align quantization_config with llm-compressor format [None][feat] ModelOpt - Align quantization_config with llm-compressor format Nov 18, 2025
@danielafrimi
Collaborator Author

/bot run

1 similar comment
@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #24917 [ run ] triggered by Bot. Commit: bb7e1e1

@tensorrt-cicd
Collaborator

PR_Github #24919 [ run ] triggered by Bot. Commit: bb7e1e1

@tensorrt-cicd
Collaborator

PR_Github #24917 [ run ] completed with state ABORTED. Commit: bb7e1e1

Signed-off-by: Daniel Afrimi <[email protected]>
@coderabbitai
Contributor

coderabbitai bot commented Nov 18, 2025

📝 Walkthrough

Walkthrough

The PR consolidates quantization configuration loading from multiple static methods in ModelConfig to a unified update_from_model_ckpt method in QuantConfig. The new approach handles hf_quant_config.json, config.json, dtypes.json, and modelopt formats through a single interface, with added file locking for remote configs.

Changes

  • Quantization config refactoring—ModelConfig (tensorrt_llm/_torch/model_config.py): Removed static methods load_modelopt_quant_config, load_hf_quant_config, load_quant_config_from_dtypes_json, get_mxfp4_quant_algo, and override_quant_algo. Removed imports of json, os, get_sm_version, and QuantAlgo. Added a config_file_lock context manager for serializing remote config access. Updated the from_pretrained flow to delegate all quant config updates to QuantConfig.update_from_model_ckpt.
  • Quantization methods implementation—QuantConfig (tensorrt_llm/models/modeling_utils.py): Added nine new methods to QuantConfig: override_quant_algo(), get_mxfp4_quant_algo(), _infer_kv_cache_quant_algo_from_scheme(), _map_new_to_legacy_args(), _update_from_quant_config_json(), _update_from_legacy_args(), _update_from_legacy_quant_config_json(), load_quant_config_from_dtypes_json(), and update_from_model_ckpt(). Added an import of get_sm_version from tensorrt_llm._utils.
  • Consumer updates (tensorrt_llm/llmapi/llm_utils.py): Split combined imports and added a no-op lint annotation for QuantAlgo. Replaced in-place HF quant config parsing in _update_from_hf_quant_config with a single call to quant_config.update_from_model_ckpt(). Removed explicit validation and error handling, delegating to the unified updater.
  • Test updates (tests/unittest/llmapi/test_llm_quant.py): Updated two test cases to use QuantConfig.update_from_model_ckpt(model_ckpt_path=model_dir, moe_backend=None) instead of ModelConfig.load_modelopt_quant_config, and wrapped file paths with the Path() constructor.

Sequence Diagram(s)

sequenceDiagram
    participant Caller as from_pretrained/<br/>_update_from_hf_quant_config
    participant QuantConfig as QuantConfig.<br/>update_from_model_ckpt
    participant FSCheck as Checkpoint<br/>Detection
    participant Updaters as Specific<br/>Updaters

    Caller->>QuantConfig: update_from_model_ckpt(checkpoint_dir,<br/>moe_backend)
    
    QuantConfig->>FSCheck: Check for hf_quant_config.json
    alt hf_quant_config.json exists
        FSCheck-->>QuantConfig: Found
        QuantConfig->>Updaters: _update_from_quant_config_json()
    else config.json with quant_method
        FSCheck-->>QuantConfig: Not found, try config.json
        QuantConfig->>Updaters: _update_from_legacy_quant_config_json()
    else dtypes.json exists
        FSCheck-->>QuantConfig: Not found, try dtypes.json
        QuantConfig->>Updaters: load_quant_config_from_dtypes_json()
    else No quant config found
        FSCheck-->>QuantConfig: Not found
    end
    
    Updaters-->>QuantConfig: Returns is_changed, updates<br/>internal state
    QuantConfig-->>Caller: (is_quant_config_changed,<br/>layer_quant_config)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Multiple quantization paths and logic consolidation: Nine new methods added to QuantConfig, each handling different quantization formats (hf_quant_config.json, legacy HF args, dtypes.json, modelopt) with distinct parsing and validation logic. Review requires understanding all branches and validation flows.
  • Cross-module refactoring with dependencies: Logic moved from ModelConfig to QuantConfig introduces interdependencies that must be verified (e.g., correct fallback behavior, layer_quant_config population, error handling consistency).
  • New file locking mechanism: config_file_lock context manager adds synchronization logic with fallback paths and warning conditions that warrant careful validation.
  • Call site updates: Changes in llm_utils.py and test files depend on correct return contracts and state mutations of update_from_model_ckpt.

Areas requiring extra attention:

  • Completeness and ordering of the quant config detection/loading fallback chain in update_from_model_ckpt
  • Validation of HF quant config mapping logic in _map_new_to_legacy_args and _update_from_legacy_args for all supported quantization methods (see the illustrative mapping sketch after this list)
  • Layer quantization config merging behavior in _update_from_legacy_args to ensure correctness for mixed-precision scenarios
  • File locking semantics and fallback safety in config_file_lock, especially error recovery paths
  • Correctness of SM version checks and backend-specific logic in get_mxfp4_quant_algo
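
For context on the new-to-legacy mapping mentioned above, here is a purely illustrative sketch. The new-format keys come from the example config in the PR description and the legacy field names from QuantConfig; the helper name, signature, and exact rules are assumptions, not the PR's actual implementation.

from tensorrt_llm.quantization.mode import QuantAlgo


def map_new_to_legacy_args_sketch(quantization_config: dict) -> dict:
    """Hypothetical mapping from the llm-compressor style block to legacy QuantConfig fields."""
    legacy = {
        # "quant_algo" carries over directly, e.g. "W4A16_AWQ" (lookup by enum member name).
        "quant_algo": QuantAlgo[quantization_config["quant_algo"]],
        # llm-compressor's "ignore" list plays the role of the legacy "exclude_modules".
        "exclude_modules": quantization_config.get("ignore"),
    }
    # An 8-bit float kv_cache_scheme is read here as an FP8 KV cache (assumption).
    kv_scheme = quantization_config.get("kv_cache_scheme")
    if kv_scheme and kv_scheme.get("type") == "float" and kv_scheme.get("num_bits") == 8:
        legacy["kv_cache_quant_algo"] = QuantAlgo.FP8
    return legacy

Feeding the example config.json block from the PR description through such a mapping would yield quant_algo=W4A16_AWQ, exclude_modules=['lm_head'], and kv_cache_quant_algo=FP8.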

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
  • Description check (❓ Inconclusive): The description explains the main objective (consolidating quantization format logic into QuantConfig and adding ModelOpt format support) but lacks explicit test coverage details and PR checklist completion required by the template. Resolution: complete the PR checklist section and explicitly list test cases that validate the quantization config changes across different formats (modelopt, hf, dtypes.json).
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title clearly summarizes the main change: aligning quantization configuration with llm-compressor format and adding ModelOpt support, which directly corresponds to the core refactoring across multiple files.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (2)
tensorrt_llm/models/modeling_utils.py (1)

373-384: Consider more specific exception handling.

Line 381 catches a generic Exception, which is too broad. Consider catching specific exceptions like OSError, FileNotFoundError, or transformers library exceptions.

             if checkpoint_dir is not None:
                 try:
                     import transformers
                     mixed_quant_config_file = transformers.utils.hub.cached_file(
                         checkpoint_dir, 'quant_cfg.json')
                     with open(mixed_quant_config_file) as fm:
                         json_extended_quant_configs = json.load(fm)
-                except Exception:
+                except (OSError, FileNotFoundError, ValueError, json.JSONDecodeError) as e:
                     logger.info(
-                        "No quant_cfg.json found for layer quant info, using base quantization config."
+                        f"No quant_cfg.json found for layer quant info: {e}. Using base quantization config."
                     )
tensorrt_llm/_torch/model_config.py (1)

278-289: Consider using or removing is_quant_config_changed.

The is_quant_config_changed return value from update_from_model_ckpt is assigned but never used in all three call sites (lines 278, 283, 287). Consider either:

  1. Using this value to log whether a quantization config was found and loaded
  2. Using _ prefix if intentionally discarding: _is_quant_config_changed, layer_quant_config = ...
         # quantized ckpt in modelopt format
         if quant_config_file := cached_file(checkpoint_dir,
                                             'hf_quant_config.json'):
-            is_quant_config_changed, layer_quant_config = quant_config.update_from_model_ckpt(
+            _is_quant_config_changed, layer_quant_config = quant_config.update_from_model_ckpt(
                 checkpoint_dir, moe_backend)
 
         # quantized ckpt in other formats
         elif hasattr(pretrained_config, "quantization_config"):
-            is_quant_config_changed, layer_quant_config = quant_config.update_from_model_ckpt(
+            _is_quant_config_changed, layer_quant_config = quant_config.update_from_model_ckpt(
                 checkpoint_dir, moe_backend)
 
         elif quant_config_file := cached_file(checkpoint_dir, 'dtypes.json'):
-            is_quant_config_changed, layer_quant_config = quant_config.update_from_model_ckpt(
+            _is_quant_config_changed, layer_quant_config = quant_config.update_from_model_ckpt(
                 checkpoint_dir, moe_backend)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 160b361 and 2facb5e.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/model_config.py (2 hunks)
  • tensorrt_llm/llmapi/llm_utils.py (2 hunks)
  • tensorrt_llm/models/modeling_utils.py (2 hunks)
  • tests/unittest/llmapi/test_llm_quant.py (2 hunks)
🧰 Additional context used
🧠 Learnings (16)
📓 Common learnings
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/llmapi/test_llm_quant.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tests/unittest/llmapi/test_llm_quant.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/unittest/llmapi/test_llm_quant.py
  • tensorrt_llm/_torch/model_config.py
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.

Applied to files:

  • tests/unittest/llmapi/test_llm_quant.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tests/unittest/llmapi/test_llm_quant.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/unittest/llmapi/test_llm_quant.py
  • tensorrt_llm/llmapi/llm_utils.py
  • tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

  • tests/unittest/llmapi/test_llm_quant.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tests/unittest/llmapi/test_llm_quant.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-10-17T13:21:31.724Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 8398
File: tensorrt_llm/_torch/pyexecutor/sampling_utils.py:237-272
Timestamp: 2025-10-17T13:21:31.724Z
Learning: The setup.py file in TensorRT-LLM explicitly requires Python 3.10+ via `python_requires=">=3.10, <4"`, making match/case statements and other Python 3.10+ features appropriate throughout the codebase.

Applied to files:

  • tests/unittest/llmapi/test_llm_quant.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/model_config.py
📚 Learning: 2025-08-09T02:04:49.623Z
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 6760
File: tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py:81-98
Timestamp: 2025-08-09T02:04:49.623Z
Learning: In TensorRT-LLM's auto_deploy module, torch.dtype values in configuration dictionaries must be stored as string representations (e.g., "float16" instead of torch.float16) because OmegaConf.merge does not support torch.dtype types. These string representations are converted to actual torch.dtype objects in downstream code.

Applied to files:

  • tensorrt_llm/_torch/model_config.py
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/llmapi/llm_utils.py
  • tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

  • tensorrt_llm/_torch/model_config.py
📚 Learning: 2025-08-27T14:23:55.566Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.

Applied to files:

  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.

Applied to files:

  • tensorrt_llm/models/modeling_utils.py
🧬 Code graph analysis (4)
tests/unittest/llmapi/test_llm_quant.py (1)
tensorrt_llm/models/modeling_utils.py (2)
  • QuantConfig (133-578)
  • update_from_model_ckpt (478-504)
tensorrt_llm/_torch/model_config.py (1)
tensorrt_llm/models/modeling_utils.py (1)
  • update_from_model_ckpt (478-504)
tensorrt_llm/llmapi/llm_utils.py (3)
tensorrt_llm/quantization/mode.py (1)
  • QuantAlgo (23-47)
tensorrt_llm/models/modeling_utils.py (3)
  • PretrainedConfig (676-877)
  • QuantConfig (133-578)
  • update_from_model_ckpt (478-504)
tensorrt_llm/llmapi/llm_args.py (2)
  • quant_config (2718-2721)
  • quant_config (2724-2725)
tensorrt_llm/models/modeling_utils.py (2)
tensorrt_llm/_utils.py (1)
  • get_sm_version (740-742)
tensorrt_llm/quantization/mode.py (2)
  • QuantAlgo (23-47)
  • has_mxfp4 (197-199)
🪛 Ruff (0.14.5)
tests/unittest/llmapi/test_llm_quant.py

121-121: Unpacked variable is_quant_config_changed is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


165-165: Unpacked variable is_quant_config_changed is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tensorrt_llm/_torch/model_config.py

286-286: Local variable quant_config_file is assigned to but never used

(F841)


287-287: Unpacked variable is_quant_config_changed is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tensorrt_llm/llmapi/llm_utils.py

28-28: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

tensorrt_llm/models/modeling_utils.py

341-343: Avoid specifying long messages outside the exception class

(TRY003)


345-345: Avoid specifying long messages outside the exception class

(TRY003)


359-361: Avoid specifying long messages outside the exception class

(TRY003)


366-368: Avoid specifying long messages outside the exception class

(TRY003)


367-367: Undefined name quant_config

(F821)


381-381: Do not catch blind exception: Exception

(BLE001)


401-403: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/models/modeling_utils.py (1)

201-217: LGTM: Clean environment variable override mechanism.

The environment-based override with validation and clear warning messages follows good practices.

tensorrt_llm/llmapi/llm_utils.py (1)

335-339: LGTM: Clean delegation to centralized config updater.

The refactored method properly delegates quantization config updates to QuantConfig.update_from_model_ckpt, simplifying the logic and improving maintainability.

tensorrt_llm/_torch/model_config.py (1)

27-69: LGTM: Robust file locking with graceful fallbacks.

The context manager provides comprehensive fallback handling for file locking, ensuring safe concurrent access to model configs while gracefully degrading if locking fails. The multiple fallback paths (HF cache → temp dir → no lock) make this resilient to various permission and timeout scenarios.
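
For readers unfamiliar with the pattern, a simplified sketch of this kind of degrading lock follows. It uses the filelock package and invented paths; the real config_file_lock in model_config.py may differ in lock location, timeout, and warning behavior.

import contextlib
import tempfile
from pathlib import Path

import filelock  # assumed locking primitive; the real code may use something else


@contextlib.contextmanager
def config_file_lock_sketch(lock_name: str = "config.lock", timeout: float = 10.0):
    """Serialize access to shared config files, degrading gracefully if locking fails."""
    # Try progressively weaker locations: a user cache dir, then the system temp dir,
    # and finally give up on locking entirely rather than failing the load.
    candidates = [Path.home() / ".cache", Path(tempfile.gettempdir())]
    lock = None
    for base in candidates:
        try:
            base.mkdir(parents=True, exist_ok=True)
            lock = filelock.FileLock(str(base / lock_name), timeout=timeout)
            lock.acquire()
            break
        except (OSError, filelock.Timeout):
            lock = None  # fall through to the next candidate, or to no lock at all
    try:
        yield
    finally:
        if lock is not None:
            lock.release()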

Comment on lines +218 to +231
    @staticmethod
    def get_mxfp4_quant_algo(moe_backend, is_dynamic_quant=False):
        quant_algo = QuantConfig.override_quant_algo()
        if quant_algo is None and not is_dynamic_quant:
            if get_sm_version() >= 100:
                if moe_backend == 'TRITON':
                    return QuantAlgo.W4A8_MXFP4_FP8
                else:
                    return QuantAlgo.W4A8_MXFP4_MXFP8
            else:
                return QuantAlgo.W4A16_MXFP4
        else:
            return quant_algo

Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify CUDA availability before calling get_sm_version().

The method calls get_sm_version() at line 222, which internally uses torch.cuda.get_device_properties(0). This will raise a runtime error if no CUDA device is available or if device index 0 doesn't exist.

Please verify that get_mxfp4_quant_algo is only invoked in contexts where CUDA is guaranteed to be available. Consider adding a guard or documentation if this assumption is not always valid.


🏁 Script executed:

#!/bin/bash
# Check call sites of get_mxfp4_quant_algo to verify CUDA context assumptions
rg -nC3 'get_mxfp4_quant_algo'

Length of output: 1720


🏁 Script executed:

#!/bin/bash
# 1. Find get_sm_version implementation
echo "=== get_sm_version implementation ==="
rg -nA10 'def get_sm_version' tensorrt_llm/_utils.py

# 2. Check the context of _update_from_quant_config_json
echo -e "\n=== _update_from_quant_config_json context ==="
rg -nB5 '_update_from_quant_config_json' tensorrt_llm/models/modeling_utils.py | head -30

# 3. Check the context of load_quant_config_from_dtypes_json
echo -e "\n=== load_quant_config_from_dtypes_json context ==="
rg -nB5 'load_quant_config_from_dtypes_json' tensorrt_llm/models/modeling_utils.py | head -30

# 4. Check if there are any guards before calling get_mxfp4_quant_algo
echo -e "\n=== Checking for CUDA guards before get_mxfp4_quant_algo calls ==="
rg -nB10 'get_mxfp4_quant_algo' tensorrt_llm/models/modeling_utils.py | grep -A10 -B10 'cuda\|device\|gpu'

Length of output: 1683



🏁 Script executed:

#!/bin/bash
# Get full method definitions to understand context better
echo "=== _update_from_quant_config_json full method ==="
ast-grep --pattern $'def _update_from_quant_config_json($_) {
  $$$
}'

echo -e "\n=== load_quant_config_from_dtypes_json full method ==="
ast-grep --pattern $'def load_quant_config_from_dtypes_json($_) {
  $$$
}'

echo -e "\n=== Callers of load_from_json (entry point) ==="
rg -nB5 'load_from_json\(' tensorrt_llm/models/modeling_utils.py | grep -v 'test\|Test' | head -40

Length of output: 155



🏁 Script executed:

#!/bin/bash
# Extract the full method bodies using line ranges
echo "=== _update_from_quant_config_json (lines 277-320) ==="
sed -n '277,320p' tensorrt_llm/models/modeling_utils.py

echo -e "\n=== load_quant_config_from_dtypes_json (lines 445-480) ==="
sed -n '445,480p' tensorrt_llm/models/modeling_utils.py

echo -e "\n=== Check if these config loading methods check for CUDA ==="
rg -nC15 'def load_from_json' tensorrt_llm/models/modeling_utils.py | head -50

echo -e "\n=== Check if get_mxfp4_quant_algo is guarded by device availability ==="
sed -n '300,315p' tensorrt_llm/models/modeling_utils.py
sed -n '465,475p' tensorrt_llm/models/modeling_utils.py

Length of output: 5241



🏁 Script executed:

#!/bin/bash
# Find where update_from_model_ckpt is called from
echo "=== Callers of update_from_model_ckpt ==="
rg -n 'update_from_model_ckpt' tensorrt_llm/ -g '!*.pyc' | grep -v 'def update_from_model_ckpt'

echo -e "\n=== QuantConfig instantiation and usage ==="
rg -nB3 'QuantConfig\(' tensorrt_llm/models/modeling_utils.py | head -40

echo -e "\n=== Where is QuantConfig.load_from_json called ==="
rg -nB5 'load_from_json' tensorrt_llm/ -g '!*.pyc' -g '!test' | head -50

echo -e "\n=== Check if there are any torch.cuda availability checks in modeling_utils ==="
rg -n 'cuda\|device\|is_gpu' tensorrt_llm/models/modeling_utils.py | head -20

Length of output: 1695



🏁 Script executed:

#!/bin/bash
# Check the context around the callers in llmapi/llm_utils.py
echo "=== Context in llmapi/llm_utils.py around line 337 ==="
sed -n '330,345p' tensorrt_llm/llmapi/llm_utils.py

# Check the context in _torch/model_config.py
echo -e "\n=== Context in _torch/model_config.py around lines 278-287 ==="
sed -n '270,295p' tensorrt_llm/_torch/model_config.py

# Check if llmapi or model_config have any GPU/CUDA guards before calling quantization methods
echo -e "\n=== Check llmapi for CUDA guards ==="
rg -nC5 'cuda\|device\|gpu' tensorrt_llm/llmapi/llm_utils.py | head -40

echo -e "\n=== Check model_config for CUDA guards around quantization calls ==="
rg -nC5 'cuda\|device\|gpu' tensorrt_llm/_torch/model_config.py | head -40

Length of output: 2152


Add CUDA availability guard to get_sm_version() call in get_mxfp4_quant_algo().

The method calls get_sm_version() at line 222, which unconditionally calls torch.cuda.get_device_properties(0) without checking CUDA availability. This causes a runtime error when loading quantized checkpoints on CPU-only systems. Since update_from_model_ckpt() can be invoked during model initialization in non-GPU contexts, this is not safe.

Add a check or guard before accessing SM version:

if not torch.cuda.is_available():
    # Fall back to safe default for CPU environments
    return QuantAlgo.W4A16_MXFP4

Alternatively, wrap the get_sm_version() call in the method or add a device availability check at the calling sites.
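
Putting the suggestion together with the original snippet, a guarded version might look like the sketch below. The CPU-only fallback value is an assumption the maintainers would need to confirm.

# Sketch only: as it would sit inside the QuantConfig class in modeling_utils.py
# (assumes torch, get_sm_version, and QuantAlgo are available at module level).
@staticmethod
def get_mxfp4_quant_algo(moe_backend, is_dynamic_quant=False):
    quant_algo = QuantConfig.override_quant_algo()
    if quant_algo is not None or is_dynamic_quant:
        return quant_algo
    # New guard: SM-version detection requires a visible CUDA device.
    if not torch.cuda.is_available():
        return QuantAlgo.W4A16_MXFP4  # assumed safe default for CPU-only checkpoint loading
    if get_sm_version() >= 100:
        return QuantAlgo.W4A8_MXFP4_FP8 if moe_backend == 'TRITON' else QuantAlgo.W4A8_MXFP4_MXFP8
    return QuantAlgo.W4A16_MXFP4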

🤖 Prompt for AI Agents
In tensorrt_llm/models/modeling_utils.py around lines 218 to 231, the call to
get_sm_version() is unguarded and will raise on CPU-only systems because
torch.cuda.get_device_properties(0) is used; update get_mxfp4_quant_algo to
first check torch.cuda.is_available() and if CUDA is not available return a safe
CPU default (e.g., QuantAlgo.W4A16_MXFP4) instead of calling get_sm_version(),
otherwise proceed with the existing SM-version based branching; ensure this
early-return handles both is_dynamic_quant cases consistently so CPU-only runs
never call get_sm_version().

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #24921 [ run ] triggered by Bot. Commit: 2facb5e

@tensorrt-cicd
Collaborator

PR_Github #24919 [ run ] completed with state ABORTED. Commit: bb7e1e1
LLM/main/L0_MergeRequest_PR #18819 (Blue Ocean) completed with status: ABORTED

Signed-off-by: Daniel Afrimi <[email protected]>
Signed-off-by: Daniel Afrimi <[email protected]>
@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #24923 [ run ] triggered by Bot. Commit: 9d736e7

@tensorrt-cicd
Collaborator

PR_Github #24921 [ run ] completed with state ABORTED. Commit: 2facb5e
LLM/main/L0_MergeRequest_PR #18821 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Collaborator

PR_Github #24923 [ run ] completed with state SUCCESS. Commit: 9d736e7
/LLM/main/L0_MergeRequest_PR pipeline #18823 completed with status: 'FAILURE'

@danielafrimi
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #25064 [ run ] triggered by Bot. Commit: 9d736e7

@tensorrt-cicd
Collaborator

PR_Github #25064 [ run ] completed with state SUCCESS. Commit: 9d736e7
/LLM/main/L0_MergeRequest_PR pipeline #18944 completed with status: 'FAILURE'
