Labels: bug (Something isn't working)
Description
⚙️ Your current environment
python3 collect_env.py
[I1119 04:52:31.684685962 debug.cpp:50] [c10d] The debug level is set to INFO.
### Environment Information ###
Operating System: `Linux-5.15.0-1053-nvidia-x86_64-with-glibc2.35`
Python Version: `3.12.11 (main, Jun 4 2025, 08:56:18) [GCC 11.4.0]`
llm-compressor Version: `0.8.2.dev51+gc600e2e3.d20251119`
compressed-tensors Version: `0.12.3a20251114`
transformers Version: `4.54.1`
torch Version: `2.7.1+cu128`
CUDA Devices: `['NVIDIA H100 80GB HBM3', 'NVIDIA H100 80GB HBM3', 'NVIDIA H100 80GB HBM3', 'NVIDIA H100 80GB HBM3', 'NVIDIA H100 80GB HBM3', 'NVIDIA H100 80GB HBM3', 'NVIDIA H100 80GB HBM3', 'NVIDIA H100 80GB HBM3']`
AMD Devices: `None`
🐛 Describe the bug
I quantized Qwen3-30B-A3B to an FP8 version. When I tested accuracy, there was no significant difference from the original model. However, when I ran a benchmark with tp=4, the FP8 model's QPS was lower than the original model's. Profiling some prompts shows that the FP8 version spends more time in cross_device_reduce_1stage (all_reduce):
- original model: 3.154 ms (15.52%), 291 calls
- FP8 version: 7.582 ms (29.28%), 291 calls
- diff: +4.428 ms (about 140% slower)
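For reference, the diff and slowdown figures above follow directly from the two measured all_reduce times; a minimal sketch of the arithmetic (the helper name is hypothetical, not part of any profiler API):

```python
def allreduce_slowdown(base_ms: float, fp8_ms: float) -> tuple[float, float]:
    """Return (absolute time difference in ms, relative slowdown in %)
    between a baseline and an FP8 run of the same all_reduce kernel."""
    diff = fp8_ms - base_ms
    return diff, 100.0 * diff / base_ms

# Numbers reported in this issue (self time of cross_device_reduce_1stage):
diff, pct = allreduce_slowdown(3.154, 7.582)
print(f"+{diff:.3f} ms ({pct:.0f}% slower)")  # → +4.428 ms (140% slower)
```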
🛠️ Steps to reproduce
No response