@marksantesson commented Nov 6, 2025

NCCL4Py provides Python language bindings for NCCL, offering a Pythonic interface to the NCCL library's functionality. It lets Python applications use NCCL's GPU-accelerated multi-GPU and multi-node communication for distributed computing workloads.

Key Features

  • Pythonic Interface: Clean, intuitive Python API that follows Python conventions and best practices
  • Seamless Integration: Direct compatibility with PyTorch and CuPy for zero-copy GPU data transfer
  • Complete NCCL Support: Access to all NCCL collective operations, point-to-point communication, and advanced features
  • Type Safety: Comprehensive type annotations for better IDE support and code validation
  • Resource Management: Automatic resource cleanup with Python's context managers and garbage collection
  • Flexible Initialization: Multiple initialization methods including MPI and manual unique ID distribution

NCCL4Py supports all NCCL collective communication primitives: AllReduce, Broadcast, Reduce, AllGather, ReduceScatter, AllToAll, Gather, and Scatter.
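
For readers less familiar with the collectives named above, the following CPU-only sketch models what a few of them compute. It is not NCCL4Py code: the function names are illustrative, each "rank" is simply one Python list, and the even chunking in `reduce_scatter` assumes the buffer length is divisible by the number of ranks.

```python
from functools import reduce
import operator


def all_reduce(bufs, op=operator.add):
    """Every rank ends up with the element-wise reduction of all inputs."""
    reduced = [reduce(op, column) for column in zip(*bufs)]
    return [list(reduced) for _ in bufs]


def broadcast(bufs, root=0):
    """Every rank ends up with a copy of the root rank's buffer."""
    return [list(bufs[root]) for _ in bufs]


def all_gather(bufs):
    """Every rank ends up with all buffers concatenated in rank order."""
    gathered = [value for buf in bufs for value in buf]
    return [list(gathered) for _ in bufs]


def reduce_scatter(bufs, op=operator.add):
    """Reduce element-wise, then give rank r the r-th equal-sized chunk."""
    nranks = len(bufs)
    reduced = [reduce(op, column) for column in zip(*bufs)]
    chunk = len(reduced) // nranks  # assumes an even split
    return [reduced[r * chunk:(r + 1) * chunk] for r in range(nranks)]


# Two simulated ranks, two elements each.
inputs = [[0, 1], [10, 11]]
summed = all_reduce(inputs)
gathered = all_gather(inputs)
scattered = reduce_scatter(inputs)
copied = broadcast(inputs, root=1)
```

With these inputs, `all_reduce` leaves `[10, 12]` on both ranks, while `reduce_scatter` leaves `[10]` on rank 0 and `[12]` on rank 1.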

Usage Model

NCCL4Py follows a simple workflow:

  • Obtain Unique ID: Generate or receive an NCCL unique ID for communicator initialization
  • Initialize Communicator: Create a communicator that connects multiple ranks using the unique ID
  • Allocate Buffers: Create GPU buffers using PyTorch, CuPy, or NCCL's memory allocator
  • Perform Communication: Use collective operations (all_reduce, broadcast, etc.) or point-to-point (send/recv)
  • Cleanup: Destroy the communicator when done
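
The first step can be done without MPI. Below is a minimal sketch of one possible way to distribute the unique ID manually, through a file on a filesystem that all ranks can see. The helpers `publish_id` and `wait_for_id` are hypothetical and not part of NCCL4Py. The sketch assumes the unique ID object can be pickled, which seems plausible given that the Quick Start passes it through mpi4py's pickle-based `bcast`.

```python
import os
import pickle
import tempfile
import time


def publish_id(path, unique_id):
    """Rank 0: write the pickled ID atomically so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(unique_id, f)
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def wait_for_id(path, timeout=30.0, poll_interval=0.05):
    """Other ranks: poll until the file appears, then load the ID."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)
        time.sleep(poll_interval)
    raise TimeoutError(f"no unique ID appeared at {path!r} within {timeout} s")


# Demonstration with a stand-in payload instead of a real NCCL unique ID.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "nccl_id.pkl")
publish_id(demo_path, b"stand-in-id")
received = wait_for_id(demo_path, timeout=1.0)
```

In a real job, rank 0 would call `publish_id` with the result of `nccl.get_unique_id()` and every other rank would call `wait_for_id` before initializing the communicator. Because `pickle.load` can execute arbitrary code, the file should live in a location that only trusted processes can write to.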

Limitations

  • GPU-Only: NCCL4Py only supports CUDA-enabled GPUs (CPU operations are not supported)
  • Python 3.10+: Requires Python 3.10 or later
  • Dependencies: Requires the NCCL library and the CUDA toolkit; PyTorch or CuPy is optional and only needed for array operations
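
A script can check some of these requirements up front. The sketch below is only an illustration: `preflight` is a hypothetical helper, and it merely tests the Python version and whether the optional array libraries can be found, without importing them or probing for GPUs.

```python
import importlib.util
import sys


def preflight(optional=("torch", "cupy")):
    """Report whether Python is new enough and which optional packages are installed."""
    return {
        "python_ok": sys.version_info >= (3, 10),
        "optional": {
            name: importlib.util.find_spec(name) is not None
            for name in optional
        },
    }


report = preflight()
print(report)
```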

For more details, see the respective sections in this documentation.

Quick Start

Here's a minimal example demonstrating NCCL4Py with an AllReduce operation:

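from mpi4py import MPI
import cupy as cp
from cuda.core.experimental import Device, system
import nccl.core as nccl

# MPI is used only to discover ranks and to distribute the NCCL unique ID.
comm_mpi = MPI.COMM_WORLD
rank = comm_mpi.Get_rank()
nranks = comm_mpi.Get_size()

# Bind this process to one GPU, wrapping around if there are more ranks than devices.
dev = Device(rank % system.num_devices)
dev.set_current()

# Rank 0 creates the unique ID; every rank receives it through an MPI broadcast.
unique_id = nccl.get_unique_id() if rank == 0 else None
unique_id = comm_mpi.bcast(unique_id, root=0)

# Create the NCCL communicator for this rank.
nccl_comm = nccl.Communicator.init(nranks=nranks, rank=rank, unique_id=unique_id)

# Each rank contributes its own rank number.
data = cp.array([float(rank)], dtype=cp.float32)

# In-place sum across all ranks, then synchronize CuPy's null stream before reading the result.
nccl_comm.all_reduce(data, data, nccl.SUM)
cp.cuda.Stream.null.synchronize()

# The sum 0 + 1 + ... + (nranks - 1) equals nranks * (nranks - 1) / 2.
expected = float(nranks * (nranks - 1) // 2)
print(f"Rank {rank}: result = {float(data[0]):.0f} (expected {expected:.0f})")

nccl_comm.destroy()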
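
In the example above, `destroy()` is skipped if any earlier line raises. The Key Features mention context-manager support, but its exact form is not shown here, so the following is a generic sketch that relies only on the `destroy()` method used in the Quick Start. The `destroying` helper and the `FakeComm` stand-in are hypothetical names introduced for illustration.

```python
from contextlib import contextmanager


@contextmanager
def destroying(comm):
    """Yield comm and call comm.destroy() on exit, even if the body raises."""
    try:
        yield comm
    finally:
        comm.destroy()


class FakeComm:
    """Stand-in for a communicator so the sketch runs without GPUs."""

    def __init__(self):
        self.destroyed = False

    def destroy(self):
        self.destroyed = True


fake = FakeComm()
try:
    with destroying(fake) as comm:
        raise RuntimeError("simulated failure during communication")
except RuntimeError:
    pass

print(fake.destroyed)
```

With the real library, the same wrapper could presumably be used as `with destroying(nccl.Communicator.init(...)) as nccl_comm:`, with the collective calls placed inside the block.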

@sjeaugey changed the title from "Add nccl4py" to "[Feature Preview] Introduce Python bindings for NCCL with nccl4py" on Nov 6, 2025
"nccl.bindings" = ["*.pyi"]

[tool.setuptools.exclude-package-data]
"*" = ["__pycache__/*", "*.py[co]", "*.stamp", "*.pyx", "*.pxd", "*.cpp", "*.c"]

All *.pxd files are excluded. However, files like cynccl.pxd may still be useful for third-party code, as it provides access to the core NCCL API. Maybe this file should be promoted to the top-level package directory and exposed to third-party Cython code? That's what I do in mpi4py with libmpi.pxd, though I have no idea how much people use it.

Yes. By design, cynccl.pxd is meant to be public facing. However, I don't think this is currently tested by nccl4py? If not, then we should document that the Cython interface is experimental.

@xiakun-lu commented Nov 11, 2025

No, only nccl.core is tested. The first-phase goal of nccl4py is to provide Pythonic interfaces to NCCL, which is the main reason I excluded the .pxd files; I expected users to copy them from the source tree if they want to experiment with the Cython interface. Exposing them and marking them as experimental is a good idea.
