[None][feat] ModelOpt - Align quantization_config with llm-compressor format #9267
Conversation
Signed-off-by: Daniel Afrimi <[email protected]>
Signed-off-by: Daniel Afrimi <[email protected]>
/bot run

1 similar comment

/bot run

PR_Github #24917 [ run ] triggered by Bot. Commit:

PR_Github #24919 [ run ] triggered by Bot. Commit:

PR_Github #24917 [ run ] completed with state
Signed-off-by: Daniel Afrimi <[email protected]>
📝 Walkthrough
The PR consolidates quantization configuration loading from multiple static methods in ModelConfig into a unified QuantConfig.update_from_model_ckpt entry point.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant Caller as from_pretrained/<br/>_update_from_hf_quant_config
    participant QuantConfig as QuantConfig.<br/>update_from_model_ckpt
    participant FSCheck as Checkpoint<br/>Detection
    participant Updaters as Specific<br/>Updaters

    Caller->>QuantConfig: update_from_model_ckpt(checkpoint_dir,<br/>moe_backend)
    QuantConfig->>FSCheck: Check for hf_quant_config.json
    alt hf_quant_config.json exists
        FSCheck-->>QuantConfig: Found
        QuantConfig->>Updaters: _update_from_quant_config_json()
    else config.json with quant_method
        FSCheck-->>QuantConfig: Not found, try config.json
        QuantConfig->>Updaters: _update_from_legacy_quant_config_json()
    else dtypes.json exists
        FSCheck-->>QuantConfig: Not found, try dtypes.json
        QuantConfig->>Updaters: load_quant_config_from_dtypes_json()
    else No quant config found
        FSCheck-->>QuantConfig: Not found
    end
    Updaters-->>QuantConfig: Returns is_changed, updates<br/>internal state
    QuantConfig-->>Caller: (is_quant_config_changed,<br/>layer_quant_config)
```
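The detection flow in the diagram is essentially a priority-ordered dispatch. A minimal sketch of that logic is shown below; the updater names come from the diagram, but their argument lists, return shapes, and the plain-filesystem checks are assumptions (the real code resolves files through the Hugging Face cache), so treat this as an illustration rather than the actual implementation.

```python
import os
from typing import Optional, Tuple


def update_from_model_ckpt_sketch(quant_config, checkpoint_dir: str,
                                  moe_backend: Optional[str] = None
                                  ) -> Tuple[bool, Optional[dict]]:
    """Pick the first quantization-config format found in the checkpoint."""
    # 1. ModelOpt-style hf_quant_config.json takes precedence.
    if os.path.isfile(os.path.join(checkpoint_dir, "hf_quant_config.json")):
        return quant_config._update_from_quant_config_json(checkpoint_dir, moe_backend)
    # 2. Otherwise try config.json (the real code additionally requires a
    #    quant_method entry inside its quantization_config section).
    if os.path.isfile(os.path.join(checkpoint_dir, "config.json")):
        return quant_config._update_from_legacy_quant_config_json(checkpoint_dir, moe_backend)
    # 3. Finally, fall back to the dtypes.json format.
    if os.path.isfile(os.path.join(checkpoint_dir, "dtypes.json")):
        return quant_config.load_quant_config_from_dtypes_json(checkpoint_dir, moe_backend)
    # 4. No quantization config found: nothing changed, no per-layer config.
    return False, None
```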
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Areas requiring extra attention:
Pre-merge checks and finishing touches
❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 7
🧹 Nitpick comments (2)
tensorrt_llm/models/modeling_utils.py (1)
373-384: Consider more specific exception handling.
Line 381 catches a generic Exception, which is too broad. Consider catching specific exceptions like OSError, FileNotFoundError, or transformers library exceptions.

```diff
 if checkpoint_dir is not None:
     try:
         import transformers
         mixed_quant_config_file = transformers.utils.hub.cached_file(
             checkpoint_dir, 'quant_cfg.json')
         with open(mixed_quant_config_file) as fm:
             json_extended_quant_configs = json.load(fm)
-    except Exception:
+    except (OSError, FileNotFoundError, ValueError, json.JSONDecodeError) as e:
         logger.info(
-            "No quant_cfg.json found for layer quant info, using base quantization config."
+            f"No quant_cfg.json found for layer quant info: {e}. Using base quantization config."
         )
```

tensorrt_llm/_torch/model_config.py (1)
278-289: Consider using or removing is_quant_config_changed.
The is_quant_config_changed return value from update_from_model_ckpt is assigned but never used in all three call sites (lines 278, 283, 287). Consider either:
- Using this value to log whether a quantization config was found and loaded
- Using a _ prefix if intentionally discarding: _is_quant_config_changed, layer_quant_config = ...

```diff
 # quantized ckpt in modelopt format
 if quant_config_file := cached_file(checkpoint_dir, 'hf_quant_config.json'):
-    is_quant_config_changed, layer_quant_config = quant_config.update_from_model_ckpt(
+    _is_quant_config_changed, layer_quant_config = quant_config.update_from_model_ckpt(
         checkpoint_dir, moe_backend)
 # quantized ckpt in other formats
 elif hasattr(pretrained_config, "quantization_config"):
-    is_quant_config_changed, layer_quant_config = quant_config.update_from_model_ckpt(
+    _is_quant_config_changed, layer_quant_config = quant_config.update_from_model_ckpt(
         checkpoint_dir, moe_backend)
 elif quant_config_file := cached_file(checkpoint_dir, 'dtypes.json'):
-    is_quant_config_changed, layer_quant_config = quant_config.update_from_model_ckpt(
+    _is_quant_config_changed, layer_quant_config = quant_config.update_from_model_ckpt(
         checkpoint_dir, moe_backend)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- tensorrt_llm/_torch/model_config.py (2 hunks)
- tensorrt_llm/llmapi/llm_utils.py (2 hunks)
- tensorrt_llm/models/modeling_utils.py (2 hunks)
- tests/unittest/llmapi/test_llm_quant.py (2 hunks)
🧰 Additional context used
🧠 Learnings (16)
📓 Common learnings
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
- tests/unittest/llmapi/test_llm_quant.py
- tensorrt_llm/_torch/model_config.py
- tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.
Applied to files:
- tests/unittest/llmapi/test_llm_quant.py
- tensorrt_llm/_torch/model_config.py
- tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
- tests/unittest/llmapi/test_llm_quant.py
- tensorrt_llm/_torch/model_config.py
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.
Applied to files:
tests/unittest/llmapi/test_llm_quant.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
- tests/unittest/llmapi/test_llm_quant.py
- tensorrt_llm/_torch/model_config.py
- tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
- tests/unittest/llmapi/test_llm_quant.py
- tensorrt_llm/llmapi/llm_utils.py
- tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").
Applied to files:
tests/unittest/llmapi/test_llm_quant.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
- tests/unittest/llmapi/test_llm_quant.py
- tensorrt_llm/_torch/model_config.py
- tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-10-17T13:21:31.724Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 8398
File: tensorrt_llm/_torch/pyexecutor/sampling_utils.py:237-272
Timestamp: 2025-10-17T13:21:31.724Z
Learning: The setup.py file in TensorRT-LLM explicitly requires Python 3.10+ via `python_requires=">=3.10, <4"`, making match/case statements and other Python 3.10+ features appropriate throughout the codebase.
Applied to files:
tests/unittest/llmapi/test_llm_quant.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/model_config.py
📚 Learning: 2025-08-09T02:04:49.623Z
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 6760
File: tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py:81-98
Timestamp: 2025-08-09T02:04:49.623Z
Learning: In TensorRT-LLM's auto_deploy module, torch.dtype values in configuration dictionaries must be stored as string representations (e.g., "float16" instead of torch.float16) because OmegaConf.merge does not support torch.dtype types. These string representations are converted to actual torch.dtype objects in downstream code.
Applied to files:
tensorrt_llm/_torch/model_config.py
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Applied to files:
- tensorrt_llm/_torch/model_config.py
- tensorrt_llm/llmapi/llm_utils.py
- tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
Applied to files:
tensorrt_llm/_torch/model_config.py
📚 Learning: 2025-08-27T14:23:55.566Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.
Applied to files:
- tensorrt_llm/_torch/model_config.py
- tensorrt_llm/models/modeling_utils.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Applied to files:
tensorrt_llm/models/modeling_utils.py
🧬 Code graph analysis (4)
tests/unittest/llmapi/test_llm_quant.py (1)
tensorrt_llm/models/modeling_utils.py (2)
- QuantConfig (133-578)
- update_from_model_ckpt (478-504)
tensorrt_llm/_torch/model_config.py (1)
tensorrt_llm/models/modeling_utils.py (1)
update_from_model_ckpt(478-504)
tensorrt_llm/llmapi/llm_utils.py (3)
tensorrt_llm/quantization/mode.py (1)
- QuantAlgo (23-47)
tensorrt_llm/models/modeling_utils.py (3)
- PretrainedConfig (676-877)
- QuantConfig (133-578)
- update_from_model_ckpt (478-504)
tensorrt_llm/llmapi/llm_args.py (2)
- quant_config (2718-2721)
- quant_config (2724-2725)
tensorrt_llm/models/modeling_utils.py (2)
tensorrt_llm/_utils.py (1)
- get_sm_version (740-742)
tensorrt_llm/quantization/mode.py (2)
- QuantAlgo (23-47)
- has_mxfp4 (197-199)
🪛 Ruff (0.14.5)
tests/unittest/llmapi/test_llm_quant.py
121-121: Unpacked variable is_quant_config_changed is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
165-165: Unpacked variable is_quant_config_changed is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
tensorrt_llm/_torch/model_config.py
286-286: Local variable quant_config_file is assigned to but never used
(F841)
287-287: Unpacked variable is_quant_config_changed is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
tensorrt_llm/llmapi/llm_utils.py
28-28: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
tensorrt_llm/models/modeling_utils.py
341-343: Avoid specifying long messages outside the exception class
(TRY003)
345-345: Avoid specifying long messages outside the exception class
(TRY003)
359-361: Avoid specifying long messages outside the exception class
(TRY003)
366-368: Avoid specifying long messages outside the exception class
(TRY003)
367-367: Undefined name quant_config
(F821)
381-381: Do not catch blind exception: Exception
(BLE001)
401-403: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/models/modeling_utils.py (1)
201-217: LGTM: Clean environment variable override mechanism. The environment-based override with validation and clear warning messages follows good practices.
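As a rough illustration of the kind of override hook being praised here: the review does not show the actual environment-variable name, so TRTLLM_OVERRIDE_QUANT_ALGO below is a hypothetical placeholder, and the sketch only assumes that QuantAlgo is an enum.

```python
import logging
import os

from tensorrt_llm.quantization.mode import QuantAlgo

logger = logging.getLogger(__name__)


def override_quant_algo_sketch():
    """Return a QuantAlgo forced via an environment variable, or None."""
    # Hypothetical variable name; the real implementation may use another.
    raw = os.environ.get("TRTLLM_OVERRIDE_QUANT_ALGO")
    if not raw:
        return None
    if raw not in QuantAlgo.__members__:
        logger.warning("Ignoring unknown quantization algo override: %s", raw)
        return None
    logger.warning("Overriding quantization algorithm to %s via env var", raw)
    return QuantAlgo[raw]
```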
tensorrt_llm/llmapi/llm_utils.py (1)
335-339: LGTM: Clean delegation to centralized config updater. The refactored method properly delegates quantization config updates to QuantConfig.update_from_model_ckpt, simplifying the logic and improving maintainability.
tensorrt_llm/_torch/model_config.py (1)
27-69: LGTM: Robust file locking with graceful fallbacks. The context manager provides comprehensive fallback handling for file locking, ensuring safe concurrent access to model configs while gracefully degrading if locking fails. The multiple fallback paths (HF cache → temp dir → no lock) make this resilient to various permission and timeout scenarios.
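A rough sketch of that fallback pattern, assuming the filelock package and illustrative lock-file locations; the real paths, timeout values, and helper names in model_config.py may differ.

```python
import contextlib
import os
import tempfile

import filelock


@contextlib.contextmanager
def config_file_lock_sketch(timeout: int = 10):
    """Serialize config reads/writes, degrading gracefully if locking fails."""
    lock = None
    # Preferred location: a lock file in the HF cache; fall back to the temp
    # dir, and finally to no locking at all if neither can be acquired.
    for lock_dir in (os.path.expanduser("~/.cache/huggingface"),
                     tempfile.gettempdir()):
        try:
            candidate = filelock.FileLock(
                os.path.join(lock_dir, "trtllm_config.lock"), timeout=timeout)
            candidate.acquire()
            lock = candidate
            break
        except (OSError, filelock.Timeout):
            continue  # try the next location
    try:
        yield
    finally:
        if lock is not None:
            lock.release()
```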
```python
@staticmethod
def get_mxfp4_quant_algo(moe_backend, is_dynamic_quant=False):
    quant_algo = QuantConfig.override_quant_algo()
    if quant_algo is None and not is_dynamic_quant:
        if get_sm_version() >= 100:
            if moe_backend == 'TRITON':
                return QuantAlgo.W4A8_MXFP4_FP8
            else:
                return QuantAlgo.W4A8_MXFP4_MXFP8
        else:
            return QuantAlgo.W4A16_MXFP4
    else:
        return quant_algo
```
🧩 Analysis chain
Verify CUDA availability before calling get_sm_version().
The method calls get_sm_version() at line 222, which internally uses torch.cuda.get_device_properties(0). This will raise a runtime error if no CUDA device is available or if device index 0 doesn't exist.
Please verify that get_mxfp4_quant_algo is only invoked in contexts where CUDA is guaranteed to be available. Consider adding a guard or documentation if this assumption is not always valid.
🏁 Script executed:
#!/bin/bash
# Check call sites of get_mxfp4_quant_algo to verify CUDA context assumptions
rg -nC3 'get_mxfp4_quant_algo'

Length of output: 1720
🏁 Script executed:
#!/bin/bash
# 1. Find get_sm_version implementation
echo "=== get_sm_version implementation ==="
rg -nA10 'def get_sm_version' tensorrt_llm/_utils.py
# 2. Check the context of _update_from_quant_config_json
echo -e "\n=== _update_from_quant_config_json context ==="
rg -nB5 '_update_from_quant_config_json' tensorrt_llm/models/modeling_utils.py | head -30
# 3. Check the context of load_quant_config_from_dtypes_json
echo -e "\n=== load_quant_config_from_dtypes_json context ==="
rg -nB5 'load_quant_config_from_dtypes_json' tensorrt_llm/models/modeling_utils.py | head -30
# 4. Check if there are any guards before calling get_mxfp4_quant_algo
echo -e "\n=== Checking for CUDA guards before get_mxfp4_quant_algo calls ==="
rg -nB10 'get_mxfp4_quant_algo' tensorrt_llm/models/modeling_utils.py | grep -A10 -B10 'cuda\|device\|gpu'

Length of output: 1683
🏁 Script executed:
#!/bin/bash
# Get full method definitions to understand context better
echo "=== _update_from_quant_config_json full method ==="
ast-grep --pattern $'def _update_from_quant_config_json($_) {
$$$
}'
echo -e "\n=== load_quant_config_from_dtypes_json full method ==="
ast-grep --pattern $'def load_quant_config_from_dtypes_json($_) {
$$$
}'
echo -e "\n=== Callers of load_from_json (entry point) ==="
rg -nB5 'load_from_json\(' tensorrt_llm/models/modeling_utils.py | grep -v 'test\|Test' | head -40

Length of output: 155
🏁 Script executed:
#!/bin/bash
# Extract the full method bodies using line ranges
echo "=== _update_from_quant_config_json (lines 277-320) ==="
sed -n '277,320p' tensorrt_llm/models/modeling_utils.py
echo -e "\n=== load_quant_config_from_dtypes_json (lines 445-480) ==="
sed -n '445,480p' tensorrt_llm/models/modeling_utils.py
echo -e "\n=== Check if these config loading methods check for CUDA ==="
rg -nC15 'def load_from_json' tensorrt_llm/models/modeling_utils.py | head -50
echo -e "\n=== Check if get_mxfp4_quant_algo is guarded by device availability ==="
sed -n '300,315p' tensorrt_llm/models/modeling_utils.py
sed -n '465,475p' tensorrt_llm/models/modeling_utils.py

Length of output: 5241
🏁 Script executed:
#!/bin/bash
# Find where update_from_model_ckpt is called from
echo "=== Callers of update_from_model_ckpt ==="
rg -n 'update_from_model_ckpt' tensorrt_llm/ -g '!*.pyc' | grep -v 'def update_from_model_ckpt'
echo -e "\n=== QuantConfig instantiation and usage ==="
rg -nB3 'QuantConfig\(' tensorrt_llm/models/modeling_utils.py | head -40
echo -e "\n=== Where is QuantConfig.load_from_json called ==="
rg -nB5 'load_from_json' tensorrt_llm/ -g '!*.pyc' -g '!test' | head -50
echo -e "\n=== Check if there are any torch.cuda availability checks in modeling_utils ==="
rg -n 'cuda\|device\|is_gpu' tensorrt_llm/models/modeling_utils.py | head -20

Length of output: 1695
🏁 Script executed:
#!/bin/bash
# Check the context around the callers in llmapi/llm_utils.py
echo "=== Context in llmapi/llm_utils.py around line 337 ==="
sed -n '330,345p' tensorrt_llm/llmapi/llm_utils.py
# Check the context in _torch/model_config.py
echo -e "\n=== Context in _torch/model_config.py around lines 278-287 ==="
sed -n '270,295p' tensorrt_llm/_torch/model_config.py
# Check if llmapi or model_config have any GPU/CUDA guards before calling quantization methods
echo -e "\n=== Check llmapi for CUDA guards ==="
rg -nC5 'cuda\|device\|gpu' tensorrt_llm/llmapi/llm_utils.py | head -40
echo -e "\n=== Check model_config for CUDA guards around quantization calls ==="
rg -nC5 'cuda\|device\|gpu' tensorrt_llm/_torch/model_config.py | head -40

Length of output: 2152
Add CUDA availability guard to get_sm_version() call in get_mxfp4_quant_algo().
The method calls get_sm_version() at line 222, which unconditionally calls torch.cuda.get_device_properties(0) without checking CUDA availability. This causes a runtime error when loading quantized checkpoints on CPU-only systems. Since update_from_model_ckpt() can be invoked during model initialization in non-GPU contexts, this is not safe.
Add a check or guard before accessing SM version:
```python
if not torch.cuda.is_available():
    # Fall back to safe default for CPU environments
    return QuantAlgo.W4A16_MXFP4
```
Alternatively, wrap the get_sm_version() call in the method or add a device availability check at the calling sites.
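Putting the suggested guard into the method shown earlier, one possible shape for the fix looks like the sketch below. This is only the reviewer's suggestion applied to the excerpt, not the change actually committed in this PR.

```python
import torch


@staticmethod
def get_mxfp4_quant_algo(moe_backend, is_dynamic_quant=False):
    quant_algo = QuantConfig.override_quant_algo()
    if quant_algo is not None or is_dynamic_quant:
        return quant_algo
    # Guard: on CPU-only systems get_sm_version() would raise because it
    # queries device 0, so fall back to the weight-only default instead.
    if not torch.cuda.is_available():
        return QuantAlgo.W4A16_MXFP4
    if get_sm_version() >= 100:
        # SM >= 100 GPUs can use the FP8-activation MXFP4 variants.
        return (QuantAlgo.W4A8_MXFP4_FP8
                if moe_backend == 'TRITON' else QuantAlgo.W4A8_MXFP4_MXFP8)
    return QuantAlgo.W4A16_MXFP4
```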
🤖 Prompt for AI Agents
In tensorrt_llm/models/modeling_utils.py around lines 218 to 231, the call to
get_sm_version() is unguarded and will raise on CPU-only systems because
torch.cuda.get_device_properties(0) is used; update get_mxfp4_quant_algo to
first check torch.cuda.is_available() and if CUDA is not available return a safe
CPU default (e.g., QuantAlgo.W4A16_MXFP4) instead of calling get_sm_version(),
otherwise proceed with the existing SM-version based branching; ensure this
early-return handles both is_dynamic_quant cases consistently so CPU-only runs
never call get_sm_version().
/bot run

PR_Github #24921 [ run ] triggered by Bot. Commit:

PR_Github #24919 [ run ] completed with state
Signed-off-by: Daniel Afrimi <[email protected]>
Signed-off-by: Daniel Afrimi <[email protected]>
/bot run

PR_Github #24923 [ run ] triggered by Bot. Commit:

PR_Github #24921 [ run ] completed with state

PR_Github #24923 [ run ] completed with state

/bot run

PR_Github #25064 [ run ] triggered by Bot. Commit:

PR_Github #25064 [ run ] completed with state
Add support for ModelOpt quantization configuration formats (config.json and hf_quant_config.json) and align the quantization config structure with llm-compressor's quantization_config. These changes are for TRT-LLM 1.2.0.

Example quantization config in config.json:

Refactor the code base so that the logic for all quantization formats lives in the QuantConfig class (until now the logic was split between llm_utils and ModelConfig and the two were not aligned with each other). In addition, add support for the new ModelOpt format.
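For reference, a minimal sketch of how a caller can use the consolidated entry point after this change. The signature update_from_model_ckpt(checkpoint_dir, moe_backend) and its (is_changed, layer_quant_config) return value are taken from the review above; the checkpoint path and the moe_backend value are illustrative placeholders.

```python
from tensorrt_llm.models.modeling_utils import QuantConfig

# Illustrative local checkpoint path; detection of the config format
# (hf_quant_config.json, config.json quantization_config, dtypes.json)
# happens inside the call.
checkpoint_dir = "/path/to/quantized-checkpoint"

quant_config = QuantConfig()
is_changed, layer_quant_config = quant_config.update_from_model_ckpt(
    checkpoint_dir, moe_backend=None)

if is_changed:
    print("Loaded quantization algorithm:", quant_config.quant_algo)
if layer_quant_config is not None:
    print("Mixed / per-layer quantization config detected")
```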
Summary by CodeRabbit
New Features
Bug Fixes
Tests