Open
Description
How is this issue impacting you?
Application crash
Share Your Debug Logs
No response
Steps to Reproduce the Issue
Hi, I found that rmtAddr may be nullptr when ncclSend/ncclRecv and collectives (such as ncclAllReduce, ncclAllGather, and so on) are called in sequence on the same user buffer.
The cause: peerRmtAddrs is stored differently for IPC_COLLECTIVES and for P2P (device memory in one case, host memory in the other). When I call P2P first and then a collective, needUpdate is false on the collective path, so the collective's peerRmtAddrs is never filled.
https://github.com/NVIDIA/nccl/blob/master/src/transport/p2p.cc#L911
https://github.com/NVIDIA/nccl/blob/master/src/transport/p2p.cc#L927
Suggested fix: when regRecord->regIpcAddrs.devPeerRmtAddrs is first allocated, the contents of hostRmtAddrs should be copied into devPeerRmtAddrs, even if needUpdate is false.
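To make the sequence concrete, here is a minimal sketch of the caching logic as I understand it. It is a simplified model and not the actual NCCL code: `RegRecord`, `importPeer`, `collectiveRegister`, and the `copyOnFirstAlloc` flag are hypothetical names, plain host vectors stand in for both the host and the device table, and it assumes the P2P path only fills the host table while collectives read the device table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for an IPC registration record.
// Both tables are ordinary host vectors here; no CUDA is involved.
struct RegRecord {
  std::vector<uintptr_t> hostPeerRmtAddrs;  // host-side table (assumed: filled by the P2P path)
  std::vector<uintptr_t> devPeerRmtAddrs;   // models the device-side table read by collectives
  bool devAllocated = false;                // whether the "device" table exists yet
};

// Models the import step shared by both paths: caches the remote address
// for `peer` in the host table. Returns true only when a new address was
// imported, which plays the role of needUpdate in this model.
bool importPeer(RegRecord& reg, std::size_t nPeers, std::size_t peer,
                uintptr_t rmtAddr) {
  if (reg.hostPeerRmtAddrs.empty()) reg.hostPeerRmtAddrs.assign(nPeers, 0);
  if (reg.hostPeerRmtAddrs[peer] != 0) return false;  // already cached
  reg.hostPeerRmtAddrs[peer] = rmtAddr;
  return true;
}

// Models the collective path. With copyOnFirstAlloc == false it mirrors the
// reported behavior: the device table is only refreshed when needUpdate is
// true. With copyOnFirstAlloc == true it applies the suggested fix.
uintptr_t collectiveRegister(RegRecord& reg, std::size_t nPeers,
                             std::size_t peer, uintptr_t rmtAddr,
                             bool copyOnFirstAlloc) {
  bool needUpdate = importPeer(reg, nPeers, peer, rmtAddr);
  bool firstAlloc = !reg.devAllocated;
  if (firstAlloc) {
    reg.devPeerRmtAddrs.assign(nPeers, 0);  // freshly allocated, all null
    reg.devAllocated = true;
  }
  if (needUpdate || (copyOnFirstAlloc && firstAlloc)) {
    // Stands in for the host-to-device copy of the address table.
    reg.devPeerRmtAddrs = reg.hostPeerRmtAddrs;
  }
  return reg.devPeerRmtAddrs[peer];  // the rmtAddr the collective would use
}
```

In this model, calling `importPeer` first (the P2P step) and then `collectiveRegister` with `copyOnFirstAlloc == false` returns 0 for the peer, which corresponds to the nullptr rmtAddr. With `copyOnFirstAlloc == true` the cached address is returned. If the collective runs first on a fresh record, both variants return the correct address, which matches the observation that only the P2P-then-collective order is affected.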
Thanks.
NCCL Version
v2.24.3-1 cuda12
Your platform details
No response
Error Message & Behavior
No response