
Fix vllm quantization for new vllm >= 0.17 #1146

Merged
mxinO merged 4 commits into main from mxin/vllm-update
Apr 1, 2026
Conversation

Contributor

@mxinO mxinO commented Mar 31, 2026

What does this PR do?

Type of change: Bug fix

Fix for vllm >= 0.17

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • Refactor
    • Improved vLLM quantization detection with guarded checks and cached results to handle optional components more robustly.
    • Broadened handling of missing attention implementations to treat multiple import-related failures uniformly.
    • Enhanced KV-cache handling to accept single or list/tuple formats and safely derive device/dtype.
    • Added tolerance for environments missing certain distributed groups and made attention-related quant module registration conditional.
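The guarded-detection pattern in the first bullet can be sketched as follows. This is a hedged illustration, not modelopt's actual helper: the function name, the caching choice, and the probed module path are assumptions.

```python
import functools
import importlib

# Hypothetical sketch of try/except-guarded detection with a cached result.
# The helper name and the module/symbol arguments are assumptions; the actual
# modelopt plugin may probe different vllm symbols.
@functools.lru_cache(maxsize=None)
def has_vllm_symbol(module_name: str, symbol: str) -> bool:
    """Return True only if `module_name` imports cleanly and exposes `symbol`.

    The result is cached so repeated checks do not re-trigger imports.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # vllm absent, or one of its optional sub-components failed to import.
        return False
    return hasattr(module, symbol)
```

A caller would then gate class definition and registry registration on a check like this instead of assuming the attribute exists, which is what makes the registration of the attention QuantModules conditional.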

Signed-off-by: Meng Xin <mxin@nvidia.com>
@mxinO mxinO requested a review from a team as a code owner March 31, 2026 07:33
@mxinO mxinO requested a review from Edwardf0t1 March 31, 2026 07:33
@coderabbitai
Contributor

coderabbitai bot commented Mar 31, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3e3f782d-af50-49b1-9bb3-996156b387b0

📥 Commits

Reviewing files that changed from the base of the PR and between 58daff1 and 3065069.

📒 Files selected for processing (1)
  • modelopt/torch/quantization/plugins/vllm.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/quantization/plugins/vllm.py

📝 Walkthrough

Walkthrough

Replace fragile vLLM presence checks with guarded try/except logic, conditionally define/register attention QuantModules only when symbols exist, make KV-cache handling accept lists/tuples or single tensors, and allow create_parallel_state() to proceed when EP groups are absent by catching related exceptions.
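The broadened KV-cache handling described above could look roughly like this. `FakeTensor` is a stand-in for `torch.Tensor` so the sketch runs without torch or vllm installed, and the function name mirrors, but is not guaranteed to match, the plugin's `_get_device_dtype()`.

```python
from dataclasses import dataclass

# Stand-in for torch.Tensor so this sketch is self-contained; the real code
# would check isinstance(entry, torch.Tensor) instead.
@dataclass
class FakeTensor:
    device: str
    dtype: str

def get_device_dtype(kv_cache):
    """Derive (device, dtype) from a KV-cache passed either as a single
    object or as a list/tuple, skipping entries that are not tensor-like."""
    entries = kv_cache if isinstance(kv_cache, (list, tuple)) else [kv_cache]
    for entry in entries:
        if hasattr(entry, "device") and hasattr(entry, "dtype"):
            return entry.device, entry.dtype
    # No valid tensor found; callers must handle the absent case.
    return None, None
```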

Changes

Cohort / File(s) Summary

vLLM plugin — modelopt/torch/quantization/plugins/vllm.py
  • Replaced direct find_spec checks with try/except-guarded detection and cached boolean flags.
  • Treat a missing vLLM MLAAttention uniformly via AttributeError/ImportError.
  • Conditionally declare and register _QuantVLLMCrossAttention and _QuantVLLMEncoderOnlyAttention only when the corresponding symbols exist.
  • Updated _get_device_dtype() to accept the KV-cache as a list/tuple or a single object and to validate tensors.
  • Made create_parallel_state() tolerate missing EP groups by catching AssertionError/RuntimeError.
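The EP-group tolerance in create_parallel_state() follows a simple pattern: attempt the expert-parallel initialization and degrade gracefully when the distributed groups were never set up. A hedged sketch, where `init_ep_group` is a hypothetical callable standing in for vllm's group initializer:

```python
def create_parallel_state(init_ep_group):
    """Build parallel state, tolerating environments without EP groups.

    `init_ep_group` is a hypothetical stand-in for vllm's expert-parallel
    group initializer, which can raise AssertionError or RuntimeError when
    the groups were never configured.
    """
    state = {"tp": "initialized"}
    try:
        state["ep"] = init_ep_group()
    except (AssertionError, RuntimeError):
        # No expert-parallel groups in this environment; proceed without EP.
        state["ep"] = None
    return state
```

Catching only these two exception types keeps genuinely unexpected failures (e.g. TypeError from a bad call) visible instead of silently swallowing them.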

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 66.67%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'Fix vllm quantization for new vllm >= 0.17' directly reflects the main change, robustifying vllm quantization code for vllm >= 0.17 compatibility through defensive error handling and conditional registry registration.
  • Security Anti-Patterns — ✅ Passed: the PR modifies only vllm.py for vLLM >= 0.17 compatibility, with no critical security anti-patterns.


@mxinO mxinO requested a review from kinjalpatel27 March 31, 2026 07:34
@github-actions
Contributor

github-actions bot commented Mar 31, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-01 05:47 UTC

Signed-off-by: Meng Xin <mxin@nvidia.com>
@mxinO mxinO requested a review from realAsma March 31, 2026 07:41
@codecov

codecov bot commented Mar 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.18%. Comparing base (ada1e26) to head (3065069).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1146   +/-   ##
=======================================
  Coverage   70.18%   70.18%           
=======================================
  Files         230      230           
  Lines       26080    26080           
=======================================
  Hits        18304    18304           
  Misses       7776     7776           


@mxinO mxinO enabled auto-merge (squash) March 31, 2026 08:10
Contributor

@kinjalpatel27 kinjalpatel27 left a comment


LGTM

@mxinO mxinO merged commit f1beaba into main Apr 1, 2026
73 of 77 checks passed
@mxinO mxinO deleted the mxin/vllm-update branch April 1, 2026 05:46


3 participants