Skip to content

Conversation

@sheikheddy
Copy link

@sheikheddy sheikheddy commented Nov 17, 2025

This commit includes:

  • New lora_utils module for unpacking INT4 weights to enable LoRA adapter injection
  • Comprehensive test suite for lora_utils functionality
  • Integration with compressed_tensors_utils for automatic LoRA metadata generation
  • Documentation for INT4+LoRA integration with vLLM
  • Code formatting improvements across multiple modules (ruff format)

The new utilities enable using LoRA adapters with INT4 quantized models by providing on-demand unpacking of compressed weights to floating-point format.

🤖 Generated with Claude Code

SUMMARY:
Enables int 4 + lora for moe models

TEST PLAN:
I asked Claude Code to write and run some tests for this, but haven't read them closely, and am planning to try out Mixtral or Qwen first and then Kimi K2 Thinking.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @sheikheddy, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the capability to use LoRA adapters with INT4 quantized models, particularly targeting integration with vLLM. It achieves this by providing the necessary utilities to unpack compressed INT4 weights on demand and by ensuring that models saved with quantization include specific metadata required for LoRA compatibility. The changes also include updated documentation and general code formatting for improved maintainability.

Highlights

  • LoRA INT4 Compatibility Utilities: Introduced a new lora_utils module that provides functionalities to unpack INT4 quantized weights back to floating-point format, enabling LoRA adapter injection for such models.
  • LoRA Metadata Generation: Integrated the new lora_utils with compressed_tensors_utils to automatically generate and save LoRA-specific metadata (like lora_metadata.json and lora_compatible flag in config.json) when a compressed model is saved. This metadata is crucial for downstream frameworks like vLLM to correctly handle LoRA with INT4 models.
  • Documentation for vLLM Integration: Added comprehensive documentation, including a quick start guide (docs/lora_int4_quickstart.md) and a detailed design document (docs/vllm_lora_int4_design.md), explaining how to use LoRA adapters with INT4 quantized models in vLLM.
  • Code Formatting: Applied extensive code formatting improvements across multiple modules using ruff format to enhance readability and maintain consistency.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions
Copy link

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

This commit includes:
- New lora_utils module for unpacking INT4 weights to enable LoRA adapter injection
- Comprehensive test suite for lora_utils functionality
- Integration with compressed_tensors_utils for automatic LoRA metadata generation
- Documentation for INT4+LoRA integration with vLLM
- Code formatting improvements across multiple modules (ruff format)

The new utilities enable using LoRA adapters with INT4 quantized models by
providing on-demand unpacking of compressed weights to floating-point format.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: sheikheddy <[email protected]>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant new functionality to support LoRA adapters with INT4 quantized models. The changes include a new lora_utils module for unpacking INT4 weights, comprehensive tests, and integration into the model saving pipeline to include LoRA metadata. Additionally, new documentation in the form of a quick start guide and a design document has been added. The code is well-structured and the new features are well-tested. I have a couple of minor suggestions for improvement in the documentation and code style.

Comment on lines 347 to 351
try:
from llmcompressor.transformers.compression.lora_utils import get_lora_metadata
except ImportError:
logger.warning("Could not import lora_utils, skipping LoRA metadata generation")
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The import for get_lora_metadata is wrapped in a try...except ImportError. Since lora_utils is a new module being added to the project in this same pull request, it should be considered a core dependency rather than an optional one, making the try-except block unnecessary. For better code style, it's recommended to move all imports, including this one and import json on line 345, to the top of the file.

    from llmcompressor.transformers.compression.lora_utils import get_lora_metadata

sheikheddy and others added 3 commits November 17, 2025 03:35
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Sheikh Abdur Raheem Ali <[email protected]>
Refactor compressed_tensors_utils.py to follow Python best practices by
moving all imports to the top of the file:
- Move json import to standard library imports section
- Move get_lora_metadata import to local imports section
- Remove try/except wrapper around lora_utils import

This improves code readability and follows PEP 8 import conventions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: sheikheddy <[email protected]>
@sheikheddy
Copy link
Author

@brian-dellabetta @dsikka @HDCharles hey, need some help with next steps before I'd be comfortable marking it as ready to merge. I'm a new contributor to vllm so happy to hop on call or answer questions async about what I'm trying to achieve if anything is unclear. the vllm part is at vllm-project/vllm#28791.

Tests INT4 quantization + LoRA compatibility in vLLM PR #28791

Results:
- ✅ INT4 + LoRA works for dense models (32B Qwen2)
- ❌ INT4 + LoRA fails for MoE with shared experts (Qwen MoE)
- Bug: SharedFusedMoE missing w2_weight attribute at LoRA init

Affected architectures:
- Qwen MoE, Kimi K2, DeepSeek V3 (all use SharedFusedMoE)
- Mixtral should work (uses standard FusedMoE)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: sheikheddy <[email protected]>
), f"Event lifecycle did not return an event for {event_type}"
assert event is not None, (
f"Event lifecycle did not return an event for {event_type}"
)
Copy link
Collaborator

@HDCharles HDCharles Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already a beast of a PR, can you separate out this formatting stuff into a separate PR.

Also side point: we use linting as outlined https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md#code-styling-and-formatting-checks

whereas it looks like you focused on ruff format and i'm not 100% sure that that specific ruff invocation is going to automatically work with what we're doing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll break it up. I was using ruff since that was what was in vllm proper and I assumed the static rules would be consistent standards across vllm project repos, but in retrospect I should have checked.

)

print("✅ Model quantized and saved to ./model-int4")
print(" - Includes LoRA metadata for vLLM compatibility")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like normal quantization was applied, where was Lora added? or is the assumption this was a Lora model to begin with?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember off the top of my head, I can look into it.



def materialize_weights_for_lora(
model: torch.nn.Module,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels like this function may make more sense in compressed-tensors or vllm especially if its only really used for debugging the quantized+lora model.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting one contribution across three repos introduces more points of failure, two is already pushing it in terms of ambition, so I'll plan to move it into vllm (even though I do have a fork of compressed-tensors locally).

@@ -0,0 +1,293 @@
# LoRA + INT4 Quantization Quick Start

This guide shows how to use LoRA adapters with INT4 quantized models using llm-compressor and vLLM.
Copy link
Collaborator

@HDCharles HDCharles Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i feel like it may be good to elaborate on the use case here or in a different README, maybe i'm dumb but at first i thought this was for doing Qlora whereas (i hope i'm getting this right) it looks like its acutally for improving inference speed for of unfused Lora models.

As an example in https://github.com/vllm-project/llm-compressor/blob/99e231e16d7ef45e2fab67c4c77178900eb00f33/examples/awq/README.md?plain=1 we link to documentation for AWQ in general before going into our implementation of it.

Copy link
Author

@sheikheddy sheikheddy Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's some more context on what I'm trying to achieve more generally in this doc (in particular, solution #2): https://docs.google.com/document/d/19CsSgU_aPnYTwNoz67TN9Vdfba_EvlGX4TvRcOQ9Nzw/edit?tab=t.0

I actually don't know the difference between QLora and unfused lora (though I can kinda guess from the name). I'll look it up.

- Check out [quantization recipes](../examples/quantization_w4a16/) for different strategies
- See [LoRA examples](https://docs.vllm.ai/en/latest/models/lora.html) in vLLM docs

## Contributing
Copy link
Collaborator

@HDCharles HDCharles Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idk if you need details about how to contribute to vllm in a llm-compressor doc, also these files should probably not be in root in docs. maybe guides/lora/*?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, those are redundant. Oh I hadn't seen guides/lora, thanks for the pointer!

def inject_lora(base_module, lora_adapter):
# Detect and unpack INT4 weights
if is_int4_quantized(base_module):
base_weight = unpack_int4_for_lora(base_module) # ✅ Unpack to FP16
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the intention that this will be generalized to other quantization techniques? May make more sense to start with a general unpack helper (with only int4 implemented) that gets dispatched based on the quantization technique rather than have everything checking for int4

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't thinking of other quantization techniques. Sure, I can refactor to make it more general, but let me know if I go too far into enterprise design pattern hell.

@HDCharles
Copy link
Collaborator

need to link directly to related PRs in description.

@HDCharles
Copy link
Collaborator

Still not done reviewing, this is a really cool contribution so thanks for working on this. Will try to finish reviewing later today. Feel free to reach out to me on the vllm slack since i think this is going to take a few iterations to get everything working between this and the vllm PR.

Will try to finish reviewing this later today.

@HDCharles HDCharles self-assigned this Nov 19, 2025
│ ├─> For each target module (q_proj, v_proj, etc.): │
│ │ ├─> Read packed_weight (uint8) │
│ │ ├─> Read weight_scale, weight_zero_point │
│ │ ├─> Unpack: INT4 → FP16 │
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure I understand whats going on here. Why are the int4 weights being unpacked to fp16 if the intention is to use the int4 kernel as outlined in line 104 of this doc

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is weird. I had three parallel instances of claude code going in split terminal windows and it was going too fast for me to consistently pay attention to when the agents were producing slop.

**File**: `vllm/model_executor/layers/quantization/compressed_tensors.py`

```python
class CompressedTensorsConfig:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing any of this in the vllm PR, is this just hallucination?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, I will remove it.


### 2. Unpacking Module

**File**: `vllm/lora/int4_utils.py` (new file)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also seems like a hallucination

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I think I got lazy

)


def save_lora_metadata(model: torch.nn.Module, save_directory: str):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unclear why some things are in this utils file and others are in the lora utils

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, i think that only ones related to lora should be in lora utils.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but that might not be the case currently. i'll take a look and see if any make sense to move.

Copy link
Collaborator

@HDCharles HDCharles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. the formatting stuff should be in an unrelated PR
  2. I have several questions about the design I hope you can answer
  3. It looks like there's a bunch of hallucinations in the docs, you should read through all this and verify it yourself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants