The llm-compressor project does not provide a dynamic FP8 per-tensor quantization scheme out of the box, so I added the following scheme to the quant_scheme.py file of the compressed_tensors library:
FP8_TENSOR = dict(
    # static, symmetric per-tensor FP8 weights
    weights=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        strategy=QuantizationStrategy.TENSOR,
        symmetric=True,
        dynamic=False,
    ),
    # dynamic, symmetric per-token FP8 input activations (no observer needed)
    input_activations=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        strategy=QuantizationStrategy.TOKEN,
        symmetric=True,
        dynamic=True,
        observer=None,
    ),
)
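For context, a minimal sketch of how a scheme like this would typically be applied with llm-compressor's oneshot flow; it assumes the new FP8_TENSOR entry is also registered in PRESET_SCHEMES in quant_scheme.py so it can be referenced by name, and the model ID, ignore list, and output directory below are illustrative, not part of the original report:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-30B-A3B"      # illustrative Qwen3 MoE checkpoint
OUTPUT_DIR = "Qwen3-MoE-FP8-TENSOR"  # illustrative output path

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Data-free recipe: weights are quantized statically per tensor, activations
# dynamically per token at runtime, so no calibration dataset is required.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_TENSOR",                    # the custom preset defined above
    ignore=["lm_head", "re:.*mlp.gate$"],   # router gates are usually left unquantized
)

oneshot(model=model, recipe=recipe)

model.save_pretrained(OUTPUT_DIR, save_compressed=True)
tokenizer.save_pretrained(OUTPUT_DIR)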
Using this quantization scheme, I was able to quantize the Qwen3 MoE model, but it failed during inference in vLLM. The key part of the error is:
quant_config.per_out_ch_quant == (quant_config.w1_scale.size(1) == w1_q.size(1))
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
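The shape mismatch can be reproduced outside vLLM: with per-tensor weight quantization each expert contributes only a single scale, so the stacked w1_scale tensor is 1-D and has no dimension 1 to index. A minimal illustration (not vLLM's actual code):

import torch

num_experts, hidden = 4, 768

# Per-channel weight scales: one value per output channel of each expert -> 2-D
per_channel_scale = torch.rand(num_experts, hidden)
print(per_channel_scale.size(1))  # 768, what the per_out_ch_quant check compares against

# Per-tensor weight scales: one value per expert -> 1-D
per_tensor_scale = torch.rand(num_experts)
print(per_tensor_scale.size(1))   # IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)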
My questions are:
- Is the quantization scheme I added for dynamic FP8 per-tensor quantization correct?
- Why can't the quantized MoE model run inference under vLLM?