[Bug]: SequentialPipeline fails on ERNIE-4.5-VL (remote code) with FX trace TypeError: to(device=MetaDeviceAttribute) #2033

@Firworksyt

Description

⚙️ Your current environment

The output of `python collect_env.py`:
### Environment Information ###
Operating System: `Linux-6.8.0-64-generic-x86_64-with-glibc2.35`
Python Version: `3.11.14 (main, Oct 31 2025, 23:04:14) [Clang 21.1.4 ]`
llm-compressor Version: `0.8.1`
compressed-tensors Version: `0.12.2`
transformers Version: `4.56.2`
torch Version: `2.8.0`
CUDA Devices: `['NVIDIA RTX PRO 6000 Blackwell Server Edition', 'NVIDIA RTX PRO 6000 Blackwell Server Edition']`
AMD Devices: `None`

🐛 Describe the bug

When quantizing ERNIE-4.5-VL-28B-A3B-Thinking (baidu/ERNIE-4.5-VL-28B-A3B-Thinking) with LLM-Compressor, the default (inferred) SequentialPipeline crashes during FX tracing with:
TypeError: to() received an invalid combination of arguments - got (device=MetaDeviceAttribute, )

This happens before calibration even starts, on a very small dataset slice.

If I instead force pipeline="basic", oneshot runs (until it eventually hits CUDA OOM with larger calibration settings). So the model itself can be quantized; it is the sequential/onloading pipeline that is currently incompatible.
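
For reference, the workaround looks roughly like this (a minimal sketch; only the explicit pipeline argument differs from the repro script below, everything else is unchanged):

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    num_calibration_samples=2,
    max_seq_length=256,
    trust_remote_code_model=True,
    pipeline="basic",  # bypass the inferred SequentialPipeline
)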

Maybe I'm just doing something stupid here, but this is generally how I've been quantizing other models to NVFP4.

🛠️ Steps to reproduce

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"

# Load tokenizer
tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

# Load model with bf16 dtype to avoid unnecessary FP32 tensors
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Tiny dataset slice to make repro cheap
raw = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:2]")

def to_text(ex):
    return {"text": tok.apply_chat_template(ex["messages"], tokenize=False)}

raw = raw.map(to_text)

ds = raw.map(
    lambda s: tok(s["text"], truncation=True, max_length=256),
    remove_columns=raw.column_names,
)

# Minimal NVFP4 recipe
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head"],
)

# Run oneshot quantization with default (inferred) pipeline
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    num_calibration_samples=2,
    max_seq_length=256,
    trust_remote_code_model=True,
)

Running this should reproduce the error. The number of samples and the sequence length are deliberately tiny just to keep the repro fast.

Labels

bug (Something isn't working), tracing (Issues related to model tracing)
