Cyclical Latency Variation in combine Operation During Unbalanced MoE Inference #485

@FirwoodLin

Hello DeepSeek team and community,

Thank you for your incredible work and contributions to this project. It's a fantastic resource for the community.

I'm writing to report an issue we've observed while profiling MoE model inference under a simulated unbalanced load. Specifically, we are seeing a regular, cyclical latency pattern in the combine operation.

The Problem

While running inference on the Qwen3-235B-A22B model, the combine operation's latency follows a three-iteration cycle: one iteration takes approximately 2 ms, and each of the next two takes around 70 us. This ~2 ms -> ~70 us -> ~70 us pattern repeats consistently.
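
For anyone who wants to check their own traces for the same pattern, here is a minimal sketch that looks for evenly spaced spikes in a list of per-iteration combine latencies. The function name `detect_cycle`, the `spike_factor` threshold, and the sample trace are illustrative assumptions, not part of our profiling setup or of the library.

```python
from statistics import median

def detect_cycle(latencies_us, spike_factor=5.0):
    """Return (period, phase) if spikes are evenly spaced, else None.

    A sample counts as a spike when it exceeds spike_factor times the
    median latency. Both the name and the threshold are hypothetical.
    """
    base = median(latencies_us)
    spikes = [i for i, v in enumerate(latencies_us) if v > spike_factor * base]
    if len(spikes) < 2:
        return None
    # All gaps between consecutive spikes must be identical.
    gaps = {b - a for a, b in zip(spikes, spikes[1:])}
    if len(gaps) != 1:
        return None
    period = gaps.pop()
    return period, spikes[0] % period

# Synthetic trace shaped like the pattern described above (values in us).
trace = [2000, 70, 70] * 4
print(detect_cycle(trace))  # (3, 0)
```

A period of 3 with a stable phase would suggest the slow iteration is tied to something that recurs every third call rather than to random jitter.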

Our Experiment Setup

To help reproduce or diagnose this, here is our environment and scenario:

  • Hardware: 4 nodes, each with 8x H200 GPUs (32 GPUs total).
  • Interconnect: We have explicitly disabled NVLink between GPUs to simulate a specific hardware environment.
  • Scenario: We are simulating an unbalanced inference workload. The load is distributed as follows:
    • A subset of ranks is assigned a sequence of 256 tokens.
    • The remaining ranks are assigned a much smaller sequence of 16 tokens.
  • Configuration: We have enabled forced perfect expert balancing, which ensures that tokens are routed evenly across all experts. We did this to eliminate token routing imbalance as a potential cause for performance variations.
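
To make the scenario concrete, the sketch below builds a round-robin routing table of the kind "forced perfect expert balancing" implies and counts the resulting traffic. It is an illustration under assumptions: `balanced_topk` is a hypothetical helper, not our actual configuration code, and the expert count (128) and top-k (8) are assumed values for this model.

```python
from collections import Counter

def balanced_topk(num_tokens, num_experts, top_k, offset=0):
    """Hypothetical round-robin routing: token t gets top_k consecutive
    expert ids, so every expert receives the same number of tokens
    whenever num_tokens * top_k is a multiple of num_experts."""
    return [[(offset + t * top_k + j) % num_experts for j in range(top_k)]
            for t in range(num_tokens)]

NUM_EXPERTS, TOP_K = 128, 8  # assumed values, not taken from our config

heavy = balanced_topk(256, NUM_EXPERTS, TOP_K)  # a 256-token rank
light = balanced_topk(16, NUM_EXPERTS, TOP_K)   # a 16-token rank

heavy_counts = Counter(e for row in heavy for e in row)
light_counts = Counter(e for row in light for e in row)

# Per-expert load is uniform for both kinds of rank.
print(set(heavy_counts.values()), set(light_counts.values()))  # {16} {1}

# Number of (token, expert) results each rank gets back in combine.
print(len(heavy) * TOP_K, len(light) * TOP_K)  # 2048 128
```

Under these assumptions the experts are evenly loaded, yet a 256-token rank still receives 16x as many (token, expert) results in combine as a 16-token rank. Expert balancing therefore removes routing skew but not the per-rank volume skew, which may be relevant to the latency pattern.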

Question

Given that we have already forced perfect expert load balancing, we are trying to understand what else could be causing this cyclical latency behavior in the combine operation.

Could you provide any guidance on other areas we should investigate?

Any insights or suggestions for further debugging would be greatly appreciated. If there are specific logs or profiling metrics you'd find helpful, please let us know.

Thank you for your time and assistance.

Image
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    NIC10   NIC11   NIC12   NIC13   NIC14   NIC15   NIC16   NIC17   CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    SYS     0-47,96-143     0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE    SYS     0-47,96-143     0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    SYS     0-47,96-143     0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    SYS     0-47,96-143     0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     NODE    48-95,144-191   1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     NODE    48-95,144-191   1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     NODE    48-95,144-191   1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     NODE    48-95,144-191   1               N/A
NIC0    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    SYS
NIC1    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE    SYS
NIC2    NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    SYS
NIC3    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    SYS
NIC4    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     NODE
NIC5    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     NODE
NIC6    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     NODE
NIC7    SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     NODE
NIC8    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    SYS
NIC9    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    SYS     SYS     SYS     SYS     NODE    SYS
NIC10   NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    SYS     SYS     SYS     SYS     NODE    SYS
NIC11   NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      SYS     SYS     SYS     SYS     NODE    SYS
NIC12   SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS     NODE
NIC13   SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    SYS     NODE
NIC14   SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    SYS     NODE
NIC15   SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      SYS     NODE
NIC16   NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      SYS
NIC17   SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_20
  NIC9: mlx5_21
  NIC10: mlx5_22
  NIC11: mlx5_23
  NIC12: mlx5_24
  NIC13: mlx5_25
  NIC14: mlx5_26
  NIC15: mlx5_27
  NIC16: mlx5_bond_0
  NIC17: mlx5_data_0
