
[Bug]: Llama-4-Maverick-17B-128E-Instruct quantization skips all MoE experts → missing expert weights → vLLM load failure #2060

@shubhra

Description


⚙️ Your current environment

The output of `python collect_env.py`
### Environment Information ###
Operating System: `Linux-6.8.0-85-generic-x86_64-with-glibc2.39`
Python Version: `3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0]`
llm-compressor Version: `0.8.2.dev54+g6fea8880.d20251119`
compressed-tensors Version: `0.12.3a20251114`
transformers Version: `4.57.1`
torch Version: `2.9.0`
CUDA Devices: `['NVIDIA B200', 'NVIDIA B200', 'NVIDIA B200', 'NVIDIA B200', 'NVIDIA B200', 'NVIDIA B200', 'NVIDIA B200', 'NVIDIA B200']`
AMD Devices: `None`

🐛 Describe the bug

When running examples/quantization_w4a4_fp4/llama4_example.py to quantize the Llama-4-Maverick-17B-128E-Instruct model, the generated config.json places every MoE expert module under the ignore list, so none of the routed experts are quantized even though they should be.
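
The symptom is visible directly in the emitted config; below is a minimal inspection sketch (the output directory name is an assumption, not from the report):

```python
# Minimal sketch: inspect the quantization_config that was written to config.json
# and see which modules ended up in the ignore list (output directory is hypothetical).
import json

with open("Llama-4-Maverick-17B-128E-Instruct-NVFP4/config.json") as f:
    cfg = json.load(f)

ignore = cfg["quantization_config"]["ignore"]
expert_entries = [m for m in ignore if "experts" in m]

# In the buggy run every routed-expert module shows up in `ignore`,
# so none of them are covered by the quantization scheme.
print(f"{len(ignore)} ignored modules, {len(expert_entries)} of them expert modules")
print(expert_entries[:5])
```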

The shared expert is quantized correctly, but all 128 routed experts remain unquantized, so no quantized expert tensors such as w1_weight, w2_weight, or w3_weight are produced.
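
The missing tensors can be confirmed by listing the expert keys across the saved safetensors shards; a quick sketch, again assuming a local output directory:

```python
# Sketch: enumerate expert-related tensor names across the saved shards
# (checkpoint directory is hypothetical).
import glob
from safetensors import safe_open

ckpt_dir = "Llama-4-Maverick-17B-128E-Instruct-NVFP4"

expert_keys = set()
for shard in glob.glob(f"{ckpt_dir}/*.safetensors"):
    with safe_open(shard, framework="pt") as f:
        expert_keys.update(k for k in f.keys() if "experts" in k)

# A correct checkpoint should contain compressed weights plus scales for every
# routed expert; here only the original unquantized expert tensors are present.
print(f"{len(expert_keys)} expert tensors")
for k in sorted(expert_keys)[:10]:
    print(k)
```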

As a result, vLLM fails to load the model with:

 KeyError: 'layers.17.feed_forward.experts.126.w1_weight'

because the expected quantized expert parameters were never created.
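
For completeness, the failure is reproduced simply by pointing vLLM at the resulting checkpoint; a minimal load sketch (the path and parallelism setting are assumptions):

```python
# Sketch: loading the compressed checkpoint in vLLM; with the incomplete checkpoint
# this fails during weight loading with the KeyError shown above (path is hypothetical).
from vllm import LLM

llm = LLM(
    model="Llama-4-Maverick-17B-128E-Instruct-NVFP4",
    tensor_parallel_size=8,
)
print(llm.generate("Hello")[0].outputs[0].text)
```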

Impact:
MoE expert weights are silently omitted during quantization, producing incomplete checkpoints incompatible with vLLM inference.

🛠️ Steps to reproduce

No response


Labels: bug (Something isn't working), llama (For any PR / issue related to Llama herd support)
