How is this issue impacting you?
Application crash
Share Your Debug Logs
2025-11-09 15:39:31.043 +08:00 [2025-11-09 07:39:30] h100-28:238:1313 [0] transport/net_ib.cc:139 NCCL WARN communicator encountered a fatal error (detected in ncclIbTest)
2025-11-09 15:39:31.043 +08:00 [2025-11-09 07:39:30] h100-28:238:1219 [0] transport/net_ib.cc:180 NCCL WARN NET/IB : mlx5_ib0:1 async fatal event on QP (0x7f69f01519a8): local access violation work queue error
2025-11-09 15:39:31.043 +08:00 [2025-11-09 07:39:30] h100-28:238:1313 [0] transport/net_ib.cc:2451 NCCL WARN NET/IB: Got completion from peer 172.16.12.34<38081> with status=4 opcode=0 len=0 vendor err 83 (Send) hca mlx5_ib0
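For reference, here is a minimal sketch that decodes the `status` field of the completion line above. It assumes NCCL prints the raw `ibv_wc_status` value from libibverbs; the helper name `decode_completion` and the regex are hypothetical and only for illustration. Under that assumption, `status=4` maps to `IBV_WC_LOC_PROT_ERR` (local protection error), which would be consistent with the "local access violation work queue error" async event on the same HCA.

```python
import re

# First entries of `enum ibv_wc_status` from libibverbs (<infiniband/verbs.h>).
# Assumption: the `status=` number in the NCCL log is this enum's raw value.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Hypothetical pattern matching the "Got completion from peer ..." warning.
COMPLETION_RE = re.compile(
    r"peer (?P<peer>\S+?)<(?P<port>\d+)> with status=(?P<status>\d+) "
    r"opcode=(?P<opcode>\d+) len=(?P<len>\d+) vendor err (?P<vendor_err>\d+)"
)


def decode_completion(line):
    """Return the parsed fields of a completion warning, or None if no match."""
    m = COMPLETION_RE.search(line)
    if m is None:
        return None
    status = int(m["status"])
    return {
        "peer": m["peer"],
        "port": int(m["port"]),
        "status": status,
        "status_name": IBV_WC_STATUS.get(status, "unknown"),
        "opcode": int(m["opcode"]),
        "vendor_err": int(m["vendor_err"]),
    }


# The completion line from the log above, without the timestamp prefix.
SAMPLE = (
    "NCCL WARN NET/IB: Got completion from peer 172.16.12.34<38081> "
    "with status=4 opcode=0 len=0 vendor err 83 (Send) hca mlx5_ib0"
)

if __name__ == "__main__":
    print(decode_completion(SAMPLE))
```

The same function can be run over a whole log file to count which peers and status codes appear when the failure occurs.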
Steps to Reproduce the Issue
nodeA: 8 x H100
nodeB: 8 x H100
The model runs with tensor parallelism (TP) of 16 across the two nodes, and communication works normally 99% of the time. Occasionally the error shown in the logs occurs. What could be causing it?
NCCL Version
2.21.5-1+cuda12.4
Your platform details
No response
Error Message & Behavior
No response