
[Bug]: qwen3-30b-a3b-fp8 is no better than the original model when I set tp=4 #2050


Description

@youngze0016

⚙️ Your current environment

python3 collect_env.py
[I1119 04:52:31.684685962 debug.cpp:50] [c10d] The debug level is set to INFO.

### Environment Information ###
Operating System: `Linux-5.15.0-1053-nvidia-x86_64-with-glibc2.35`
Python Version: `3.12.11 (main, Jun 4 2025, 08:56:18) [GCC 11.4.0]`
llm-compressor Version: `0.8.2.dev51+gc600e2e3.d20251119`
compressed-tensors Version: `0.12.3a20251114`
transformers Version: `4.54.1`
torch Version: `2.7.1+cu128`
CUDA Devices: `NVIDIA H100 80GB HBM3` × 8
AMD Devices: `None`

🐛 Describe the bug

I quantized qwen3-30b-a3b to an FP8 version. When I tested accuracy, there was no significant difference from the original. But when I ran a benchmark with tp=4, the FP8 model's QPS was lower than the original model's. I also profiled some prompts and found that the FP8 version spends more time in cross_device_reduce_1stage (all_reduce):
Original model: 3.154 ms (15.52%) ← 291 calls
FP8 version: 7.582 ms (29.28%) ← 291 calls

Diff: +4.428 ms (~140% slower) ⚠️
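Since the issue does not state the quantization recipe, here is a minimal sketch of how such an FP8 checkpoint is typically produced with llm-compressor. The scheme (`FP8_DYNAMIC`), the ignore list, and the save path are assumptions taken from the library's published MoE examples, not the reporter's actual settings:

```python
# Hedged sketch of a data-free FP8 (dynamic per-token) quantization pass
# with llm-compressor; scheme and ignore list are assumptions based on the
# library's Qwen MoE examples, not the reporter's actual recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-30B-A3B"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC needs no calibration data; keep lm_head and the MoE router
# ("mlp.gate") in higher precision, as the library's MoE examples do.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*mlp.gate$"],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = "Qwen3-30B-A3B-FP8-Dynamic"  # hypothetical output path
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```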
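Likewise, a hedged sketch of a tp=4 run that would surface the all_reduce cost in a trace. The model path and prompts are placeholders, and the profiler hooks (vLLM's `VLLM_TORCH_PROFILER_DIR` with `start_profile()`/`stop_profile()`) are an assumed setup, since the issue gives no reproduction steps:

```python
# Hedged sketch: run the FP8 checkpoint at tensor parallel size 4 under
# torch profiling so time in cross_device_reduce_1stage (the custom
# all_reduce kernel) can be compared against the BF16 baseline.
# VLLM_TORCH_PROFILER_DIR must be set before the engine starts.
import os

os.environ["VLLM_TORCH_PROFILER_DIR"] = "./vllm_profile"  # traces written here

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen3-30B-A3B-FP8-Dynamic",  # hypothetical path from the sketch above
    tensor_parallel_size=4,             # tp=4, as in the report
)

prompts = ["Explain tensor parallelism in one paragraph."] * 32  # placeholders
params = SamplingParams(max_tokens=128)

llm.start_profile()
llm.generate(prompts, params)
llm.stop_profile()

# Open the trace in Perfetto / chrome://tracing and sum the time attributed
# to cross_device_reduce_1stage across ranks; repeat with the BF16 model
# to get the baseline numbers.
```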

🛠️ Steps to reproduce

No response
