⚙️ Your current environment
The output of `python collect_env.py`
### Environment Information ###
Operating System: `Linux-6.8.0-64-generic-x86_64-with-glibc2.35`
Python Version: `3.11.14 (main, Oct 31 2025, 23:04:14) [Clang 21.1.4 ]`
llm-compressor Version: `0.8.1`
compressed-tensors Version: `0.12.2`
transformers Version: `4.56.2`
torch Version: `2.8.0`
CUDA Devices: `['NVIDIA RTX PRO 6000 Blackwell Server Edition', 'NVIDIA RTX PRO 6000 Blackwell Server Edition']`
AMD Devices: `None`
🐛 Describe the bug
When quantizing ERNIE-4.5-VL-28B-A3B-Thinking (baidu/ERNIE-4.5-VL-28B-A3B-Thinking) with LLM-Compressor, the SequentialPipeline inferred by default crashes during FX tracing with:
TypeError: to() received an invalid combination of arguments - got (device=MetaDeviceAttribute, )
This happens before calibration even starts, on a very small dataset slice.
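For context, and purely as an assumption on my part, MetaDeviceAttribute looks like the proxy the HF FX tracer substitutes for a .device attribute, so .to(device=...) ends up receiving something that is not a real torch.device. A trivial, hypothetical snippet showing the same class of TypeError (not the actual call site in the pipeline):
import torch

layer = torch.nn.Linear(4, 4)

# A real device (string or torch.device) is accepted
layer.to(device="cpu")

# Any non-device object raises the same kind of error as in the trace above:
# TypeError: to() received an invalid combination of arguments - got (device=object, )
try:
    layer.to(device=object())
except TypeError as e:
    print(e)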
If I instead force pipeline="basic", oneshot runs (until it eventually hits CUDA OOM for larger calibration settings); see the snippet after the repro script below. So the model itself can be quantized, but the sequential/onloading pipeline is currently incompatible with it.
Maybe I'm just doing something wrong here, but this is generally how I've been quantizing other models to NVFP4.
🛠️ Steps to reproduce
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"

# Load tokenizer
tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

# Load model with bf16 dtype to avoid unnecessary FP32 tensors
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Tiny dataset slice to make the repro cheap
raw = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:2]")

def to_text(ex):
    return {"text": tok.apply_chat_template(ex["messages"], tokenize=False)}

raw = raw.map(to_text)
ds = raw.map(
    lambda s: tok(s["text"], truncation=True, max_length=256),
    remove_columns=raw.column_names,
)

# Minimal NVFP4 recipe
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head"],
)

# Run oneshot quantization with the default (inferred) pipeline
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    num_calibration_samples=2,
    max_seq_length=256,
    trust_remote_code_model=True,
)

Running this should produce the same error. The number of samples and seq length are extremely low just for troubleshooting speed.
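For reference, a minimal sketch of the workaround, assuming pipeline is passed as a keyword argument to oneshot (and reusing model, ds, and recipe from the script above); this gets past tracing but eventually hits CUDA OOM with larger calibration settings:
# Workaround: force the basic pipeline instead of the inferred sequential one
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    num_calibration_samples=2,
    max_seq_length=256,
    trust_remote_code_model=True,
    pipeline="basic",
)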