[Issue]: user-buffer p2p + coll rmtAddr nullptr error #1859

@visualxu

Description

How is this issue impacting you?

Application crash

Share Your Debug Logs

No response

Steps to Reproduce the Issue

Hi, I found that rmtAddr can be nullptr when ncclSend/ncclRecv and NCCL collectives (such as ncclAllReduce or ncclAllGather) are called in sequence on the same user buffer.
The reason is that the peerRmtAddrs used by IPC_COLLECTIVES and by P2P are different arrays (device memory vs. host memory). When I call p2p first and then a collective, the variable needUpdate is false, so the peerRmtAddrs needed by the collective is never filled.
https://github.com/NVIDIA/nccl/blob/master/src/transport/p2p.cc#L911
https://github.com/NVIDIA/nccl/blob/master/src/transport/p2p.cc#L927
Fix suggestion: when regRecord->regIpcAddrs.devPeerRmtAddrs is first allocated, the existing hostRmtAddrs should be copied into devPeerRmtAddrs.
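
To make the ordering problem concrete, here is a minimal host-only sketch of the failure and of the suggested fix. It is a simplified model, not NCCL's code: `RegRecord`, `registerP2p`, `registerCollective`, and the `applyFix` flag are hypothetical names, a `std::vector` stands in for the device-side array, and it assumes the collective path is the one that reads the device-side addresses.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a buffer registration record.
struct RegRecord {
  std::vector<std::uintptr_t> hostPeerRmtAddrs;  // host-side copy, written by both paths
  std::vector<std::uintptr_t> devPeerRmtAddrs;   // models the device-side array (assumed: read by collectives)
  bool devAllocated = false;                     // device-side array is allocated lazily

  explicit RegRecord(std::size_t nPeers) : hostPeerRmtAddrs(nPeers, 0) {}
};

// Models the p2p path: records the peer address on the host side only,
// mirroring it to the device side only if that array already exists.
void registerP2p(RegRecord& rec, std::size_t peer, std::uintptr_t rmtAddr) {
  bool needUpdate = false;
  if (rec.hostPeerRmtAddrs[peer] == 0) {
    rec.hostPeerRmtAddrs[peer] = rmtAddr;
    needUpdate = true;
  }
  if (needUpdate && rec.devAllocated) rec.devPeerRmtAddrs[peer] = rmtAddr;
}

// Models the collective path: allocates the device-side array on first use
// and writes an entry only when needUpdate is true.
void registerCollective(RegRecord& rec, std::size_t peer, std::uintptr_t rmtAddr, bool applyFix) {
  bool needUpdate = false;
  if (rec.hostPeerRmtAddrs[peer] == 0) {
    rec.hostPeerRmtAddrs[peer] = rmtAddr;
    needUpdate = true;
  }
  if (!rec.devAllocated) {
    rec.devPeerRmtAddrs.assign(rec.hostPeerRmtAddrs.size(), 0);
    rec.devAllocated = true;
    // Suggested fix: seed the freshly allocated device-side array with
    // the addresses the host side already knows about.
    if (applyFix) rec.devPeerRmtAddrs = rec.hostPeerRmtAddrs;
  }
  // Without the fix, a prior p2p call leaves needUpdate false here,
  // so the device-side entry stays null.
  if (needUpdate) rec.devPeerRmtAddrs[peer] = rmtAddr;
}
```

In this model, calling `registerP2p` and then `registerCollective` for the same peer with `applyFix` set to `false` leaves the device-side entry at zero, while the same sequence with `applyFix` set to `true` carries the address over. Calling the collective path first works either way, which matches the observation that only the p2p-then-collective order fails.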

Thanks.

NCCL Version

v2.24.3-1 cuda12

Your platform details

No response

Error Message & Behavior

No response
