Add LoRA INT4 compatibility utilities and apply code formatting #2037

sheikheddy · 2025-11-17T08:32:28Z

This commit includes:

New lora_utils module for unpacking INT4 weights to enable LoRA adapter injection
Comprehensive test suite for lora_utils functionality
Integration with compressed_tensors_utils for automatic LoRA metadata generation
Documentation for INT4+LoRA integration with vLLM
Code formatting improvements across multiple modules (ruff format)

The new utilities enable using LoRA adapters with INT4 quantized models by providing on-demand unpacking of compressed weights to floating-point format.

🤖 Generated with Claude Code

SUMMARY:
Enables int 4 + lora for moe models

TEST PLAN:
I asked Claude Code to write and run some tests for this, but haven't read them closely, and am planning to try out Mixtral or Qwen first and then Kimi K2 Thinking.

gemini-code-assist · 2025-11-17T08:32:58Z

Summary of Changes

Hello @sheikheddy, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the capability to use LoRA adapters with INT4 quantized models, particularly targeting integration with vLLM. It achieves this by providing the necessary utilities to unpack compressed INT4 weights on demand and by ensuring that models saved with quantization include specific metadata required for LoRA compatibility. The changes also include updated documentation and general code formatting for improved maintainability.

Highlights

LoRA INT4 Compatibility Utilities: Introduced a new lora_utils module that provides functionalities to unpack INT4 quantized weights back to floating-point format, enabling LoRA adapter injection for such models.
LoRA Metadata Generation: Integrated the new lora_utils with compressed_tensors_utils to automatically generate and save LoRA-specific metadata (like lora_metadata.json and lora_compatible flag in config.json) when a compressed model is saved. This metadata is crucial for downstream frameworks like vLLM to correctly handle LoRA with INT4 models.
Documentation for vLLM Integration: Added comprehensive documentation, including a quick start guide (docs/lora_int4_quickstart.md) and a detailed design document (docs/vllm_lora_int4_design.md), explaining how to use LoRA adapters with INT4 quantized models in vLLM.
Code Formatting: Applied extensive code formatting improvements across multiple modules using ruff format to enhance readability and maintain consistency.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

github-actions · 2025-11-17T08:33:13Z

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

This commit includes: - New lora_utils module for unpacking INT4 weights to enable LoRA adapter injection - Comprehensive test suite for lora_utils functionality - Integration with compressed_tensors_utils for automatic LoRA metadata generation - Documentation for INT4+LoRA integration with vLLM - Code formatting improvements across multiple modules (ruff format) The new utilities enable using LoRA adapters with INT4 quantized models by providing on-demand unpacking of compressed weights to floating-point format. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> Signed-off-by: sheikheddy <[email protected]>

gemini-code-assist

Code Review

This pull request introduces significant new functionality to support LoRA adapters with INT4 quantized models. The changes include a new lora_utils module for unpacking INT4 weights, comprehensive tests, and integration into the model saving pipeline to include LoRA metadata. Additionally, new documentation in the form of a quick start guide and a design document has been added. The code is well-structured and the new features are well-tested. I have a couple of minor suggestions for improvement in the documentation and code style.

docs/lora_int4_quickstart.md

gemini-code-assist · 2025-11-17T08:35:02Z

src/llmcompressor/transformers/compression/compressed_tensors_utils.py

+    try:
+        from llmcompressor.transformers.compression.lora_utils import get_lora_metadata
+    except ImportError:
+        logger.warning("Could not import lora_utils, skipping LoRA metadata generation")
+        return


The import for get_lora_metadata is wrapped in a try...except ImportError. Since lora_utils is a new module being added to the project in this same pull request, it should be considered a core dependency rather than an optional one, making the try-except block unnecessary. For better code style, it's recommended to move all imports, including this one and import json on line 345, to the top of the file.

from llmcompressor.transformers.compression.lora_utils import get_lora_metadata

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Sheikh Abdur Raheem Ali <[email protected]>

Refactor compressed_tensors_utils.py to follow Python best practices by moving all imports to the top of the file: - Move json import to standard library imports section - Move get_lora_metadata import to local imports section - Remove try/except wrapper around lora_utils import This improves code readability and follows PEP 8 import conventions. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> Signed-off-by: sheikheddy <[email protected]>

sheikheddy · 2025-11-18T16:34:52Z

@brian-dellabetta @dsikka @HDCharles hey, need some help with next steps before I'd be comfortable marking it as ready to merge. I'm a new contributor to vllm so happy to hop on call or answer questions async about what I'm trying to achieve if anything is unclear. the vllm part is at vllm-project/vllm#28791.

Tests INT4 quantization + LoRA compatibility in vLLM PR #28791 Results: - ✅ INT4 + LoRA works for dense models (32B Qwen2) - ❌ INT4 + LoRA fails for MoE with shared experts (Qwen MoE) - Bug: SharedFusedMoE missing w2_weight attribute at LoRA init Affected architectures: - Qwen MoE, Kimi K2, DeepSeek V3 (all use SharedFusedMoE) - Mixtral should work (uses standard FusedMoE) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> Signed-off-by: sheikheddy <[email protected]>

HDCharles · 2025-11-19T16:28:58Z

src/llmcompressor/core/lifecycle.py

-        ), f"Event lifecycle did not return an event for {event_type}"
+        assert event is not None, (
+            f"Event lifecycle did not return an event for {event_type}"
+        )


This is already a beast of a PR, can you separate out this formatting stuff into a separate PR.

Also side point: we use linting as outlined https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md#code-styling-and-formatting-checks

whereas it looks like you focused on ruff format and i'm not 100% sure that that specific ruff invocation is going to automatically work with what we're doing.

I'll break it up. I was using ruff since that was what was in vllm proper and I assumed the static rules would be consistent standards across vllm project repos, but in retrospect I should have checked.

HDCharles · 2025-11-19T16:37:23Z

docs/lora_int4_quickstart.md

+)
+
+print("✅ Model quantized and saved to ./model-int4")
+print("   - Includes LoRA metadata for vLLM compatibility")


it looks like normal quantization was applied, where was Lora added? or is the assumption this was a Lora model to begin with?

I don't remember off the top of my head, I can look into it.

HDCharles · 2025-11-19T16:42:39Z

src/llmcompressor/transformers/compression/lora_utils.py

+
+
+def materialize_weights_for_lora(
+    model: torch.nn.Module,


feels like this function may make more sense in compressed-tensors or vllm especially if its only really used for debugging the quantized+lora model.

Splitting one contribution across three repos introduces more points of failure, two is already pushing it in terms of ambition, so I'll plan to move it into vllm (even though I do have a fork of compressed-tensors locally).

HDCharles · 2025-11-19T16:46:51Z

docs/lora_int4_quickstart.md

@@ -0,0 +1,293 @@
+# LoRA + INT4 Quantization Quick Start
+
+This guide shows how to use LoRA adapters with INT4 quantized models using llm-compressor and vLLM.


i feel like it may be good to elaborate on the use case here or in a different README, maybe i'm dumb but at first i thought this was for doing Qlora whereas (i hope i'm getting this right) it looks like its acutally for improving inference speed for of unfused Lora models.

As an example in https://github.com/vllm-project/llm-compressor/blob/99e231e16d7ef45e2fab67c4c77178900eb00f33/examples/awq/README.md?plain=1 we link to documentation for AWQ in general before going into our implementation of it.

there's some more context on what I'm trying to achieve more generally in this doc (in particular, solution #2): https://docs.google.com/document/d/19CsSgU_aPnYTwNoz67TN9Vdfba_EvlGX4TvRcOQ9Nzw/edit?tab=t.0

I actually don't know the difference between QLora and unfused lora (though I can kinda guess from the name). I'll look it up.

HDCharles · 2025-11-19T16:48:42Z

docs/lora_int4_quickstart.md

+- Check out [quantization recipes](../examples/quantization_w4a16/) for different strategies
+- See [LoRA examples](https://docs.vllm.ai/en/latest/models/lora.html) in vLLM docs
+
+## Contributing


idk if you need details about how to contribute to vllm in a llm-compressor doc, also these files should probably not be in root in docs. maybe guides/lora/*?

Yeah, those are redundant. Oh I hadn't seen guides/lora, thanks for the pointer!

HDCharles · 2025-11-19T16:53:35Z

docs/vllm_lora_int4_design.md

+def inject_lora(base_module, lora_adapter):
+    # Detect and unpack INT4 weights
+    if is_int4_quantized(base_module):
+        base_weight = unpack_int4_for_lora(base_module)  # ✅ Unpack to FP16


is the intention that this will be generalized to other quantization techniques? May make more sense to start with a general unpack helper (with only int4 implemented) that gets dispatched based on the quantization technique rather than have everything checking for int4

I wasn't thinking of other quantization techniques. Sure, I can refactor to make it more general, but let me know if I go too far into enterprise design pattern hell.

HDCharles · 2025-11-19T16:54:19Z

need to link directly to related PRs in description.

HDCharles · 2025-11-19T17:12:56Z

Still not done reviewing, this is a really cool contribution so thanks for working on this. Will try to finish reviewing later today. Feel free to reach out to me on the vllm slack since i think this is going to take a few iterations to get everything working between this and the vllm PR.

Will try to finish reviewing this later today.

HDCharles · 2025-11-19T18:37:11Z

docs/vllm_lora_int4_design.md

+│     ├─> For each target module (q_proj, v_proj, etc.):       │
+│     │   ├─> Read packed_weight (uint8)                       │
+│     │   ├─> Read weight_scale, weight_zero_point            │
+│     │   ├─> Unpack: INT4 → FP16                            │


i'm not sure I understand whats going on here. Why are the int4 weights being unpacked to fp16 if the intention is to use the int4 kernel as outlined in line 104 of this doc

I agree this is weird. I had three parallel instances of claude code going in split terminal windows and it was going too fast for me to consistently pay attention to when the agents were producing slop.

HDCharles · 2025-11-19T18:41:37Z

docs/vllm_lora_int4_design.md

+**File**: `vllm/model_executor/layers/quantization/compressed_tensors.py`
+
+```python
+class CompressedTensorsConfig:


I'm not seeing any of this in the vllm PR, is this just hallucination?

I think so, I will remove it.

HDCharles · 2025-11-19T18:42:38Z

docs/vllm_lora_int4_design.md

+
+### 2. Unpacking Module
+
+**File**: `vllm/lora/int4_utils.py` (new file)


this also seems like a hallucination

Sorry, I think I got lazy

HDCharles · 2025-11-19T18:44:03Z

src/llmcompressor/transformers/compression/compressed_tensors_utils.py

    )


+def save_lora_metadata(model: torch.nn.Module, save_directory: str):


unclear why some things are in this utils file and others are in the lora utils

hmm, i think that only ones related to lora should be in lora utils.

but that might not be the case currently. i'll take a look and see if any make sense to move.

HDCharles

the formatting stuff should be in an unrelated PR
I have several questions about the design I hope you can answer
It looks like there's a bunch of hallucinations in the docs, you should read through all this and verify it yourself.

sheikheddy force-pushed the main branch from 68c093a to d73e6fd Compare November 17, 2025 08:33

gemini-code-assist bot reviewed Nov 17, 2025

View reviewed changes

sheikheddy and others added 3 commits November 17, 2025 03:35

Update docs/lora_int4_quickstart.md

2f18d3f

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Sheikh Abdur Raheem Ali <[email protected]>

Merge branch 'main' into main

38d4ce5

sheikheddy force-pushed the main branch from 4bd7178 to 9d7d429 Compare November 18, 2025 23:05

HDCharles reviewed Nov 19, 2025

View reviewed changes

HDCharles self-assigned this Nov 19, 2025

HDCharles reviewed Nov 19, 2025

View reviewed changes

HDCharles requested changes Nov 19, 2025

View reviewed changes

		@@ -0,0 +1,293 @@
		# LoRA + INT4 Quantization Quick Start

		This guide shows how to use LoRA adapters with INT4 quantized models using llm-compressor and vLLM.


		### 2. Unpacking Module

		File: `vllm/lora/int4_utils.py` (new file)

		)


		def save_lora_metadata(model: torch.nn.Module, save_directory: str):

Add LoRA INT4 compatibility utilities and apply code formatting #2037

Are you sure you want to change the base?

Add LoRA INT4 compatibility utilities and apply code formatting #2037

Conversation

sheikheddy commented Nov 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist bot commented Nov 17, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

github-actions bot commented Nov 17, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

gemini-code-assist bot Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

sheikheddy commented Nov 18, 2025

Uh oh!

HDCharles Nov 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HDCharles Nov 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sheikheddy Nov 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HDCharles Nov 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HDCharles commented Nov 19, 2025

Uh oh!

HDCharles commented Nov 19, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sheikheddy commented Nov 17, 2025 •

edited

Loading

HDCharles Nov 19, 2025 •

edited

Loading

HDCharles Nov 19, 2025 •

edited

Loading

sheikheddy Nov 19, 2025 •

edited

Loading

HDCharles Nov 19, 2025 •

edited

Loading