Labels: bug (Something isn't working)
Description
⚙️ Your current environment
python3 collect_env.py
[I1119 04:52:31.684685962 debug.cpp:50] [c10d] The debug level is set to INFO.
### Environment Information ###
Operating System: `Linux-5.15.0-1053-nvidia-x86_64-with-glibc2.35`
Python Version: `3.12.11 (main, Jun 4 2025, 08:56:18) [GCC 11.4.0]`
llm-compressor Version: `0.8.2.dev51+gc600e2e3.d20251119`
compressed-tensors Version: `0.12.3a20251114`
transformers Version: `4.54.1`
torch Version: `2.7.1+cu128`
CUDA Devices: `['NVIDIA H100 80GB HBM3', 'NVIDIA H100 80GB HBM3', 'NVIDIA H100 80GB HBM3', 'NVIDIA H100 80GB HBM3', 'NVIDIA H100 80GB HBM3', 'NVIDIA H100 80GB HBM3', 'NVIDIA H100 80GB HBM3', 'NVIDIA H100 80GB HBM3']`
AMD Devices: `None`
🐛 Describe the bug
I quantized Qwen3-30B-A3B to an FP8 version. When I tested accuracy, there was no significant difference from the original model. However, when I ran a benchmark with tp=4, the FP8 model's QPS was lower than the original model's. Profiling some prompts shows that the FP8 version spends more time in cross_device_reduce_1stage (all_reduce):
- original model: 3.154 ms (15.52%), 291 calls
- FP8 version: 7.582 ms (29.28%), 291 calls
- diff: +4.428 ms (about 140% slower)
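For reference, the diff and slowdown figures above follow directly from the two measured all_reduce times; a minimal sketch of the arithmetic (the helper name is hypothetical, not part of any profiler API):

```python
def allreduce_slowdown(base_ms: float, fp8_ms: float) -> tuple[float, float]:
    """Return (absolute time difference in ms, relative slowdown in %)
    between a baseline and an FP8 run of the same all_reduce kernel."""
    diff = fp8_ms - base_ms
    return diff, 100.0 * diff / base_ms

# Numbers reported in this issue (self time of cross_device_reduce_1stage):
diff, pct = allreduce_slowdown(3.154, 7.582)
print(f"+{diff:.3f} ms ({pct:.0f}% slower)")  # → +4.428 ms (140% slower)
```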
🛠️ Steps to reproduce
No response