Labels: bug (Something isn't working), llama (For any PR / issue related to Llama herd support)
Description
⚙️ Your current environment
The output of `python collect_env.py`:
### Environment Information ###
Operating System: `Linux-6.8.0-85-generic-x86_64-with-glibc2.39`
Python Version: `3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0]`
llm-compressor Version: `0.8.2.dev54+g6fea8880.d20251119`
compressed-tensors Version: `0.12.3a20251114`
transformers Version: `4.57.1`
torch Version: `2.9.0`
CUDA Devices: `['NVIDIA B200', 'NVIDIA B200', 'NVIDIA B200', 'NVIDIA B200', 'NVIDIA B200', 'NVIDIA B200', 'NVIDIA B200', 'NVIDIA B200']`
AMD Devices: `None`
🐛 Describe the bug
When running `examples/quantization_w4a4_fp4/llama4_example.py` to quantize the Llama-4-Maverick-17B-128E-Instruct model, the generated `config.json` places all routed MoE experts in the ignore list, so none of them are quantized. These experts should be quantized.
The shared expert is quantized correctly, but all 128 routed experts remain in full precision, and no quantized expert tensors such as `w1_weight`, `w2_weight`, or `w3_weight` are produced. A quick way to confirm this from the exported `config.json` is sketched below.
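A minimal inspection sketch, assuming the checkpoint was saved with a compressed-tensors style `quantization_config` block in `config.json`; the output path below is hypothetical:

```python
import json
import re

# Hypothetical path to the quantized output directory
config_path = "Llama-4-Maverick-17B-128E-Instruct-NVFP4/config.json"

with open(config_path) as f:
    config = json.load(f)

quant_config = config.get("quantization_config", {})
ignore = quant_config.get("ignore", [])

# Count how many ignore entries point at routed experts vs. everything else
expert_entries = [name for name in ignore if re.search(r"\.experts\.", name)]
print(f"total ignored modules: {len(ignore)}")
print(f"ignored routed-expert modules: {len(expert_entries)}")
print("sample:", expert_entries[:5])
```

In the failing case, essentially every routed-expert module shows up in the ignore list, while the shared expert does not.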
This causes vLLM to fail when loading the model with:
`KeyError: 'layers.17.feed_forward.experts.126.w1_weight'`
because the expected quantized expert parameters were never created. A sketch for checking which expert tensors actually exist in the exported checkpoint follows.
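A minimal sketch, assuming the exported checkpoint is sharded safetensors and that quantized tensors carry compressed-tensors style suffixes (e.g. `weight_packed` / `weight_scale`); the path and those suffixes are assumptions and may differ for this export:

```python
import glob
from safetensors import safe_open

# Hypothetical path to the quantized output directory
ckpt_dir = "Llama-4-Maverick-17B-128E-Instruct-NVFP4"

expert_keys = []
for shard in glob.glob(f"{ckpt_dir}/*.safetensors"):
    with safe_open(shard, framework="pt") as f:
        expert_keys.extend(k for k in f.keys() if ".experts." in k)

# If the routed experts were quantized, quantized-weight suffixes should appear;
# plain .weight keys indicate the experts were serialized unquantized.
quantized = [k for k in expert_keys if "weight_packed" in k or "weight_scale" in k]
print(f"expert tensors found: {len(expert_keys)}")
print(f"expert tensors with quantized suffixes: {len(quantized)}")
print("sample:", sorted(expert_keys)[:5])
```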
Impact:
MoE expert weights are silently omitted during quantization, producing incomplete checkpoints that cannot be served with vLLM.
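For reference, a hypothetical recipe tweak I considered while debugging, which explicitly targets the routed-expert projections instead of relying only on the `Linear` target; the module-name regexes are assumptions about the Llama 4 naming scheme and this sketch has not been verified to resolve the issue:

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# Sketch only: add an explicit regex target for the routed experts so they
# cannot silently fall into the ignore list. Module names are assumptions and
# should be checked against model.named_modules() before use.
recipe = QuantizationModifier(
    targets=["Linear", r"re:.*feed_forward\.experts.*"],
    scheme="NVFP4",
    ignore=["re:.*lm_head", "re:.*router"],
)
```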
🛠️ Steps to reproduce
No response