Description
Hello DeepSeek team and community,
Thank you for your incredible work and contributions to this project. It's a fantastic resource for the community.
I'm writing to report an issue we've observed while profiling MoE model inference under a simulated unbalanced load. Specifically, we are seeing a regular, cyclical latency pattern in the combine operation.
The Problem
While running inference with the Qwen3-235B-A22B model, the latency of the combine operation follows a repeating cycle: one iteration takes approximately 2 ms, and the two subsequent iterations each take around 70 us. This ~2 ms -> ~70 us -> ~70 us pattern recurs consistently.
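For context, here is a minimal sketch of how we measure per-iteration combine latency with CUDA events. It assumes an already-initialized communication buffer (`buffer`) and the `x`/`handle` outputs of a preceding dispatch; the call signature and return handling are simplified and illustrative rather than our exact profiling code.

```python
import torch

def time_combine_us(buffer, x, handle):
    """Time a single combine call with CUDA events and return microseconds.

    `buffer`, `x`, and `handle` are placeholders for an initialized
    communication buffer and the outputs of a preceding dispatch.
    """
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    _ = buffer.combine(x, handle)  # assumed call; return values ignored here
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1000.0  # elapsed_time() reports milliseconds

# Collecting a few dozen iterations makes the cycle visible, e.g.:
#   [~2000, ~70, ~70, ~2000, ~70, ~70, ...]
```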
Our Experiment Setup
To help reproduce or diagnose this, here is our environment and scenario:
- Hardware: 4 nodes, each with 8x H200 GPUs (32 H200 GPUs in total).
- Interconnect: We have explicitly disabled NVLink between GPUs to simulate a specific hardware environment.
- Scenario: We simulate an unbalanced inference workload, with the load distributed as follows:
  - A subset of ranks is assigned sequences of 256 tokens.
  - The remaining ranks are assigned much smaller sequences of 16 tokens.
- Configuration: We enable forced perfect expert balancing, so that tokens are routed evenly across all experts. This eliminates token routing imbalance as a potential cause of the performance variation (see the sketch after this list).
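To make the setup concrete, below is a rough sketch of how the unbalanced token counts and the forced perfect expert balancing can be constructed. The expert count, top-k, hidden size, and the particular set of heavy ranks are assumptions for illustration, not necessarily our exact configuration.

```python
import torch

NUM_EXPERTS, TOP_K, HIDDEN = 128, 8, 4096   # assumed Qwen3-235B-A22B MoE shape
HEAVY_RANKS = {0, 1, 2, 3}                  # hypothetical subset given 256 tokens

def make_rank_inputs(rank: int):
    """Build per-rank inputs: unbalanced token counts, perfectly balanced routing."""
    num_tokens = 256 if rank in HEAVY_RANKS else 16
    x = torch.randn(num_tokens, HIDDEN, dtype=torch.bfloat16, device="cuda")
    # Round-robin expert assignment: every expert receives the same number of
    # tokens, so routing imbalance cannot explain latency differences.
    topk_idx = (torch.arange(num_tokens * TOP_K, device="cuda")
                % NUM_EXPERTS).view(num_tokens, TOP_K)
    topk_weights = torch.full((num_tokens, TOP_K), 1.0 / TOP_K,
                              dtype=torch.float32, device="cuda")
    return x, topk_idx, topk_weights
```

With this construction, every expert sees the same number of tokens per step, which is why we believe routing imbalance can be ruled out as the source of the cyclical latency.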
Question
Given that we have already forced perfect expert load balancing, we are trying to understand what else could be causing this cyclical latency behavior in the combine operation.
Could you provide any guidance on other areas we should investigate?
Any insights or suggestions for further debugging would be greatly appreciated. If there are specific logs or profiling metrics you'd find helpful, please let us know.
Thank you for your time and assistance.
For reference, the single-node topology (`nvidia-smi topo -m`) is:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 NIC12 NIC13 NIC14 NIC15 NIC16 NIC17 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS NODE SYS 0-47,96-143 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE PIX NODE NODE SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE SYS 0-47,96-143 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE SYS 0-47,96-143 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS NODE SYS 0-47,96-143 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS PIX NODE NODE NODE SYS NODE 48-95,144-191 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE PIX NODE NODE SYS NODE 48-95,144-191 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE PIX NODE SYS NODE 48-95,144-191 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE PIX SYS NODE 48-95,144-191 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS NODE SYS
NIC1 NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE NODE SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE SYS
NIC2 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE X NODE SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE SYS
NIC3 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE X SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS NODE SYS
NIC4 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS SYS SYS SYS PIX NODE NODE NODE SYS NODE
NIC5 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE NODE SYS SYS SYS SYS NODE PIX NODE NODE SYS NODE
NIC6 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE X NODE SYS SYS SYS SYS NODE NODE PIX NODE SYS NODE
NIC7 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE X SYS SYS SYS SYS NODE NODE NODE PIX SYS NODE
NIC8 PIX NODE NODE NODE SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS SYS SYS SYS NODE SYS
NIC9 NODE PIX NODE NODE SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE NODE SYS SYS SYS SYS NODE SYS
NIC10 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE X NODE SYS SYS SYS SYS NODE SYS
NIC11 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE X SYS SYS SYS SYS NODE SYS
NIC12 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS NODE
NIC13 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE NODE SYS NODE
NIC14 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE X NODE SYS NODE
NIC15 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE X SYS NODE
NIC16 NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS X SYS
NIC17 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_20
NIC9: mlx5_21
NIC10: mlx5_22
NIC11: mlx5_23
NIC12: mlx5_24
NIC13: mlx5_25
NIC14: mlx5_26
NIC15: mlx5_27
NIC16: mlx5_bond_0
NIC17: mlx5_data_0