@marksantesson
Collaborator

This is an Early Access release. It contains new features and APIs that may change before the official General Access release. In particular, the host one-sided API is not frozen. Performance may also be degraded and is not representative of the General Access release.

Host One-Sided RMA API

  • Introduces one-sided ncclPut, ncclSignal and ncclWaitSignal APIs. Please refer to src/nccl.h.in for API definitions.
  • The one-sided API supports both intra-node (NVL) and inter-node (network) operations.
  • The implementation of the one-sided APIs does not use any SMs (zero-SM).
  • Put/WaitSignal over network can achieve ~25% latency reduction compared to send/recv.
  • The API signature, functionality and final performance are subject to change for the official release.

Support Requirements

  • The one-sided API requires NCCL window registration of the user buffers
  • To enable the one-sided API over the network, set the following environment variables:
    • NCCL_GIN_TYPE=2
    • NCCL_NET=IB
  • Current one-sided APIs do not work with CUDA graph capture
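
As a minimal sketch of the network settings listed above, the two variables can also be set from inside the process rather than in the launch script. This assumes NCCL reads them during communicator initialization, so the call must happen before the first NCCL call; the helper name `enable_one_sided_over_network` is hypothetical and not part of NCCL, and `setenv` is the POSIX function.

```c
#include <stdlib.h>

/* Hypothetical helper (not an NCCL API): sets the two environment
 * variables this PR lists for one-sided operations over the network.
 * Call it before any NCCL initialization in the process.
 * Returns 0 on success, -1 if setenv fails. */
static int enable_one_sided_over_network(void) {
    /* Values taken from the Support Requirements list above. */
    if (setenv("NCCL_GIN_TYPE", "2", 1) != 0) return -1;
    if (setenv("NCCL_NET", "IB", 1) != 0) return -1;
    return 0;
}
```

Exporting the same variables in the job launcher is equivalent and is usually simpler when every rank should share the configuration.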

Example

The example below demonstrates a ping-pong communication pattern between two ranks using ncclPut and ncclWaitSignal.

// Configure NCCL with one-sided RMA support
ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
config.numRmaCtx = 1;  // Enable RMA with 1 context
config.blocking = 1;
NCCLCHECK(ncclCommInitRankConfig(&comm, nRanks, id, myRank, &config));

int ctx = 0;  // Use context 0 for RMA operations

// Allocate symmetric memory for RMA operations
void *sendbuff, *recvbuff;
NCCLCHECK(ncclMemAlloc((void**)&sendbuff, size));
NCCLCHECK(ncclMemAlloc((void**)&recvbuff, size));

// Register both send and receive buffers as symmetric windows for RMA operations
ncclWindow_t sendWindow, recvWindow;
// Register the same number of bytes that was allocated above
NCCLCHECK(ncclCommWindowRegister(comm, sendbuff, size, &sendWindow, NCCL_WIN_COLL_SYMMETRIC));
NCCLCHECK(ncclCommWindowRegister(comm, recvbuff, size, &recvWindow, NCCL_WIN_COLL_SYMMETRIC));

// Ensure all ranks have completed window registration before proceeding
MPICHECK(MPI_Barrier(MPI_COMM_WORLD));

if (myRank == 0) {  // Rank 0: wait then put
    // Wait for signal from peer
    int nsignals = 1;
    int peer = 1;
    NCCLCHECK(ncclWaitSignal(ctx, &peer, &nsignals, 1, NCCL_SIGNAL_DISTINCT, comm, stream));

    // Put data with signal to peer's receive buffer
    NCCLCHECK(ncclPut(ctx, sendbuff, nelems, ncclInt, peer, 0, recvWindow,
                        NCCL_SIGNAL_DISTINCT, comm, stream));

} else {   // Rank 1: put then wait
    // Put data with signal to peer's receive buffer
    int peer = 0;
    NCCLCHECK(ncclPut(ctx, sendbuff, nelems, ncclInt, peer, 0, recvWindow,
                        NCCL_SIGNAL_DISTINCT, comm, stream));

    // Wait for signal from peer
    int nsignals = 1;
    NCCLCHECK(ncclWaitSignal(ctx, &peer, &nsignals, 1, NCCL_SIGNAL_DISTINCT, comm, stream));

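}

// Wait for the enqueued put/wait operations to complete before tearing down
CUDACHECK(cudaStreamSynchronize(stream));
NCCLCHECK(ncclCommWindowDeregister(comm, sendWindow));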
NCCLCHECK(ncclCommWindowDeregister(comm, recvWindow));
NCCLCHECK(ncclMemFree(sendbuff));
NCCLCHECK(ncclMemFree(recvbuff));
CUDACHECK(cudaStreamDestroy(stream));
NCCLCHECK(ncclCommFinalize(comm));
NCCLCHECK(ncclCommDestroy(comm));
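
The example allocates `size` bytes but passes `nelems` elements of type ncclInt to ncclPut without showing how the two relate. One consistent choice, sketched here as an assumption rather than as what the PR's test code does, is to derive the element count from the byte size so that a put never exceeds the registered window. ncclInt is a 32-bit signed integer; the helper name `nccl_int_elems_for_bytes` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of whole ncclInt (32-bit) elements that
 * fit in a buffer of `bytes` bytes. Any trailing partial element is
 * dropped so the transfer stays inside the registered window. */
static size_t nccl_int_elems_for_bytes(size_t bytes) {
    return bytes / sizeof(int32_t);
}
```

With this helper, `size_t nelems = nccl_int_elems_for_bytes(size);` would give a count that matches the allocation used for both buffers.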

@xiaofanl-nvidia xiaofanl-nvidia changed the title NCCL Put API Preview [Feature Preview] Introduce one-sided API with ncclPut Nov 8, 2025