-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Open
Description
How is this issue impacting you?
Lower performance than expected
Share Your Debug Logs
No response
Steps to Reproduce the Issue
No response
NCCL Version
NCCL version 2.28.9+cuda12.9
Your platform details
Number of nodes: 2
GPUs per node: 8 × H100
Total GPUs: 16
nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE SYS SYS SYS SYS SYS NODE 0-41,96-137 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE PIX NODE NODE SYS SYS SYS SYS SYS NODE 0-41,96-137 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE PIX NODE SYS SYS SYS SYS SYS NODE 0-41,96-137 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE PIX SYS SYS SYS SYS SYS NODE 0-41,96-137 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS PIX NODE NODE NODE NODE SYS 48-89,144-185 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS NODE PIX NODE NODE NODE SYS 48-89,144-185 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS NODE NODE PIX NODE NODE SYS 48-89,144-185 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS NODE NODE NODE PIX NODE SYS 48-89,144-185 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS SYS SYS SYS SYS NODE
NIC1 NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE NODE SYS SYS SYS SYS SYS NODE
NIC2 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE X NODE SYS SYS SYS SYS SYS NODE
NIC3 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE X SYS SYS SYS SYS SYS NODE
NIC4 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE NODE SYS
NIC5 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE NODE NODE SYS
NIC6 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE X NODE NODE SYS
NIC7 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE X NODE SYS
NIC8 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE X SYS
NIC9 NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS SYS X
Error Message & Behavior
- When
NCCL_NCHANNELS_PER_NET_PEER=32,send/recvhangs. - Reducing
NCCL_NCHANNELS_PER_NET_PEERto lower values (e.g., 8 or 16) allows communication to complete successfully.
No response
Metadata
Metadata
Assignees
Labels
No labels