Add NCCL Communication Backend for DeepEP #521
Open
aamirshafi wants to merge 69 commits into deepseek-ai:main from aamirshafi:nccl
Conversation
…-ai#217) * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * more * add flag * add test * fix * more * apply
Signed-off-by: wangfakang <[email protected]>
- Add TORCH_DISTRIBUTED_BACKEND env var configuration - Fix tensor shape compatibility between NCCL and Gloo - Add backend-aware wrappers for distributed operations - Update test files to work with different backends
Signed-off-by: Georgios Theodorakis <[email protected]>
Signed-off-by: Georgios Theodorakis <[email protected]>
…. Removed redundant/dead code. Signed-off-by: Georgios Theodorakis <[email protected]>
Signed-off-by: Georgios Theodorakis <[email protected]>
Signed-off-by: Georgios Theodorakis <[email protected]>
Signed-off-by: Georgios Theodorakis <[email protected]>
Signed-off-by: Georgios Theodorakis <[email protected]>
…de for the DeepEP buffers. Signed-off-by: Georgios Theodorakis <[email protected]>
NVLink comms are disabled for now.
… enabled. Removing unnecessary comments from internode.cu Signed-off-by: Georgios Theodorakis <[email protected]>
…mmunicators only across symmetric RDMA ranks). Signed-off-by: Georgios Theodorakis <[email protected]>
Signed-off-by: Georgios Theodorakis <[email protected]>
- Fix COMBINE_LAUNCH_CASE macro redefinition in internode_ll.cu - Fix Python linting errors (unused imports/variables, missing imports) - Enable half and bfloat16 operators by undefining PyTorch's NO_* flags in setup.py - Fix missing os import in test_low_latency.py - Remove trailing whitespace in test utils
Signed-off-by: Georgios Theodorakis <[email protected]>
Signed-off-by: Georgios Theodorakis <[email protected]>
Signed-off-by: Georgios Theodorakis <[email protected]>
Signed-off-by: Georgios Theodorakis <[email protected]>
Signed-off-by: Georgios Theodorakis <[email protected]>
Extends existing --disable-nvlink flag (already supported in NVSHMEM) to work with NCCL GIN backend. Implements device-side P2P pointer resolution via ncclGetPeerPointer with fallback to RDMA when P2P unavailable or disabled.
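Roughly, the path selection this commit describes (try a direct P2P pointer, otherwise fall back to RDMA) can be sketched as below; `get_peer_ptr`, the callables, and all signatures are illustrative assumptions, not DeepEP's or NCCL's actual code.

```cuda
#include <cstddef>

// Illustrative stand-in for device-side peer pointer resolution (the commit uses
// ncclGetPeerPointer); the signature and behaviour here are assumptions.
__device__ void* get_peer_ptr(int peer, bool nvlink_disabled) {
    if (nvlink_disabled)
        return nullptr;   // --disable-nvlink: always take the RDMA path
    // stub: the real code asks the NCCL window whether the peer is P2P-reachable
    return nullptr;       // nullptr also when P2P is unavailable
}

// Copy to a peer over NVLink when a P2P pointer is available, otherwise fall back
// to a GPU-initiated RDMA put (both callables are caller-supplied stand-ins).
template <typename CopyFn, typename RdmaPutFn>
__device__ void send_to_peer(int peer, const void* src, std::size_t bytes,
                             bool nvlink_disabled, CopyFn copy_p2p, RdmaPutFn rdma_put) {
    if (void* dst = get_peer_ptr(peer, nvlink_disabled))
        copy_p2p(dst, src, bytes);   // direct stores/copies over NVLink
    else
        rdma_put(peer, src, bytes);  // RDMA fallback
}
```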
buffer cleanup to prepare for next dispatch/combine. This helps avoid a sync at the end of dispatch. Previously signals for the current dispatch/combine were cleared towards the end of the dispatch/combine kernels.
This commit resolves compilation issues when building with NCCL-only or
NVSHMEM modes by properly guarding backend-specific code with preprocessor
directives. It also documents the LD_PRELOAD workaround needed for NCCL
symbol resolution with PyTorch.
Key changes:
1. Renamed flag DISABLE_NVSHMEM to DISABLE_NVSHMEM_AND_NCCL for clarity
- Updated all occurrences across setup.py, configs.cuh, runtime.cu,
config.hpp, and deep_ep.cpp
2. Fixed conditional compilation in internode.cu and internode_ll.cu:
- Guarded #include "ibgda_device.cuh" with #ifdef ENABLE_NVSHMEM
- Guarded extern nvshmem_team_t cpu_rdma_team with #ifdef ENABLE_NVSHMEM
- Changed all #else blocks to #elif defined(ENABLE_NVSHMEM) for explicit
backend selection (31 instances in internode.cu, 18 in internode_ll.cu)
- Made function signatures and kernel parameters conditional based on
ENABLE_NCCL vs ENABLE_NVSHMEM
- Fixed kernel launch macros to pass correct parameters per backend
3. Fixed runtime.cu NVSHMEM header guards:
- Added nested #ifdef ENABLE_NVSHMEM within #ifndef DISABLE_NVSHMEM_AND_NCCL
- Ensures NVSHMEM headers only included when actually enabled
4. Fixed setup.py:
- Removed duplicate include_dirs.append()
- Fixed undefined nccl_lib variable reference
- Added -dlink flag to nvcc_dlink for proper CUDA device linking with RDC
5. Added cooperative_groups support to internode_ll.cu:
- Added #include <cooperative_groups.h> and namespace alias
- Resolves cg::this_grid().sync() compilation errors
6. Fixed internode.cu warnings:
- Added #undef DISPATCH_LAUNCH_CASE to prevent macro redefinition
- Guarded rdma_rank declaration with #ifdef ENABLE_NVSHMEM
7. Fixed NVSHMEM kernel parameter mismatch:
- Added missing nccl_windows and signals_base to NOTIFY_DISPATCH_LAUNCH_CASE
- Added missing signals_base to cached_notify kernel launch
8. Documentation:
- Added LD_PRELOAD documentation to README-NCCL.md explaining the
workaround for PyTorch's bundled NCCL vs custom GIN-enabled NCCL
This allows clean compilation in both NCCL-only mode (ENABLE_NCCL=1) and
NVSHMEM mode (NVSHMEM_DIR set), with proper symbol resolution at runtime.
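A minimal sketch of the guard pattern these items describe (an illustrative file, not the actual internode.cu; parameter names follow the commit message, everything else is assumed):

```cuda
#include <cstdint>

// Backend-specific code is selected at compile time via ENABLE_NCCL / ENABLE_NVSHMEM,
// and DISABLE_NVSHMEM_AND_NCCL turns both off.
#ifndef DISABLE_NVSHMEM_AND_NCCL
#ifdef ENABLE_NVSHMEM
#include <nvshmem.h>                     // NVSHMEM headers only when that backend is built
#include "ibgda_device.cuh"
extern nvshmem_team_t cpu_rdma_team;     // guarded, as in item 2 above
#endif
#endif

__global__ void example_kernel(int rank
#ifdef ENABLE_NCCL
                               , void** nccl_windows, uint64_t* signals_base   // NCCL GIN parameters
#elif defined(ENABLE_NVSHMEM)
                               , void* rdma_buffer_ptr                         // NVSHMEM symmetric buffer
#endif
) {
#ifdef ENABLE_NCCL
    // NCCL GIN path: resolve window handles / signals on device
#elif defined(ENABLE_NVSHMEM)
    // NVSHMEM path: symmetric-heap addressing (IBGDA)
#endif
}
```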
…intranode kernels, cleaned-up nvshmem/nccl gin only code. Signed-off-by: Georgios Theodorakis <[email protected]>
- Fix include order: move configs.cuh before CUDA headers in internode.cu and internode_ll.cu to properly undef PyTorch's half/bfloat16 restrictions - Remove redundant -U flags from setup.py (no longer needed with correct include order) - Remove NCCL src/include path, use only build/include (public API) - Consolidate extra_link_args.extend() calls in setup.py - Add NCCL path to build output - Update internode::init() signature to accept num_rdma_ranks parameter - Add reference to GIN paper (arXiv:2511.15076) in README-NCCL.md
Signed-off-by: Georgios Theodorakis <[email protected]>
- Add rdma_rank parameter to init() function signature - Update call site in Buffer::sync to pass rdma_rank - Fix default backend type string to "nccl"
…atch - Remove bulk signal reset loops from dispatch/combine initialization - Add inline net.resetSignal() after receiving data in dispatch recv - This ensures signals are reset immediately after use rather than upfront
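The receive-side pattern this commit describes, sketched with a plain volatile counter standing in for the NCCL signal (`net.resetSignal()`) used by the PR; names are illustrative, not DeepEP's actual code.

```cuda
#include <cstdint>

// Wait on a signal, consume the chunk, then reset that signal inline, so no bulk
// reset pass (and no trailing sync) is needed before the next dispatch/combine.
__device__ void wait_consume_reset(volatile uint64_t* signal, uint64_t expected) {
    while (*signal < expected) { /* spin until the sender's SignalAdd arrives */ }
    // ... copy / unpack the received chunk here ...
    *signal = 0;   // reset immediately after use, not in an upfront bulk loop
}
```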
- Remove num_comms parameter from nccl_get_p2p_ptr - Remove signals_base_next from dispatch/combine kernel signatures - Simplify nccl_get_p2p_ptr return statement to single line - Conditionally clean next buffer only when P2P is not disabled - Remove unused comments
…CL_ prefixed macros to DEEP_EP_
Signed-off-by: Georgios Theodorakis <[email protected]>
…ption.cuh and deep_ep.cpp
…tion
1) Used ncclCoopWarp() for RDMA put operations 2) Added net.signal() code for signaling. Keeping net.put() with 0 bytes for signaling because of better Dispatch-Send and Combine-Send latency.
…tion - setup.py: Remove /build/ suffix from include/lib paths - README-NCCL.md: Update documentation for NCCL_DIR convention - All LL/HT scripts: Update NCCL_DIR and LD_LIBRARY_PATH accordingly
…domains (36/72 GPUs)
Replace the hardcoded 8 GPUs/node assumption with dynamic NCCL LSA team-based peer detection. This allows low-latency mode to work on systems with varying GPU configurations without recompilation. Also removes unused DEEPEP_DEBUG_PRINT macro.
Signed-off-by: Georgios Theodorakis <[email protected]>
…onfig Revert README.md to H800 benchmarks, add NCCL_GIN_TYPE docs and QP depth env vars
Contributor
Good job! It seems that NCCL v2.29.1 has not been released yet, so how did you get the code for this version?
Contributor
Because he works at NVIDIA lol
Author
@alpha-baby - Thanks for letting us know. Updated the NCCL version to 2.28.9. @polarstormx - You got that right 💚
Summary
This PR adds NCCL as an additional communication backend for DeepEP, leveraging NCCL's Device API for GPU-initiated network operations. The integration introduces a `CommunicationBackend` abstraction that lets users choose either NVSHMEM or NCCL as the communication backend, so they can select their preferred backend based on deployment requirements.
Build/Runtime Selection of NVSHMEM and NCCL Backends
Both backends are fully supported and can be selected at build/runtime:
- NVSHMEM backend: build with `NVSHMEM_DIR=/path/to/nvshmem`
- NCCL backend: build with `ENABLE_NCCL=1 NCCL_DIR=/path/to/nccl`
- Runtime selection: `DEEP_EP_BACKEND=nccl`

Why Support Both NVSHMEM and NCCL Backends:
Unchanged Buffer Interface
The `Buffer` class interface exported to AI frameworks remains completely unchanged. This means the `dispatch()`, `combine()`, `low_latency_dispatch()`, and `low_latency_combine()` APIs work transparently with either backend.

Communication Backend Interface
We introduce an abstract `CommunicationBackend` interface that decouples DeepEP kernels from the underlying communication library.

Backend Selection:
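As a rough illustration only, the abstraction and the runtime switch might be shaped like this; method names, signatures, and the default backend string are assumptions, not this PR's actual code (only the `DEEP_EP_BACKEND` variable name comes from the PR):

```cuda
#include <cstdlib>
#include <cstddef>
#include <string>

class CommunicationBackend {
public:
    virtual ~CommunicationBackend() = default;
    virtual void init(int rank, int num_ranks) = 0;   // set up communicators / teams
    virtual void* alloc(std::size_t bytes) = 0;       // allocate + register comm buffers
    virtual void barrier() = 0;                       // host-side synchronization
    virtual void finalize() = 0;
};

class NvshmemBackend : public CommunicationBackend { /* ... NVSHMEM calls ... */ };
class NcclBackend    : public CommunicationBackend { /* ... NCCL GIN calls ... */ };

// Runtime selection via the DEEP_EP_BACKEND environment variable;
// the default shown here is an assumption.
inline std::string selected_backend() {
    const char* env = std::getenv("DEEP_EP_BACKEND");
    return env ? std::string(env) : std::string("nccl");
}
```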
NCCL Backend Integration
The NCCL backend uses the Device API for GPU-initiated network operations (GIN), translating between NVSHMEM's PGAS model and NCCL's window-based model:
- `put_nbi(dst_ptr, src_ptr, count, pe)` maps to `put(peer, dstWin, dstOff, srcWin, srcOff, bytes)`
- memory-based synchronization maps to NCCL signal primitives (`signal()`, `readSignal()`)
- `quiet()` maps to `flush()`
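A hedged sketch of this mapping: the `Net` wrapper, member names, and signatures below are illustrative stand-ins (empty stubs), not the actual NCCL GIN device API or DeepEP code; the comments pair each member with its NVSHMEM analogue.

```cuda
#include <cstdint>
#include <cstddef>

struct Net {
    // nvshmem put_nbi(dst_ptr, src_ptr, count, pe)  ->  window/offset based put
    __device__ void put(int peer, int dst_win, std::size_t dst_off,
                        int src_win, std::size_t src_off, std::size_t bytes) { /* stub */ }
    // memory atomics / flags  ->  signal primitives
    __device__ void signal(int peer, int signal_id, std::uint64_t add) { /* stub */ }
    __device__ std::uint64_t readSignal(int signal_id) { return 0; /* stub */ }
    __device__ void resetSignal(int signal_id) { /* stub */ }
    // nvshmem_quiet()  ->  flush(): wait for completion of outstanding puts
    __device__ void flush() { /* stub */ }
};
```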
Key Integration Challenges:

Multi-Communicator Mapping: Each NCCL GIN context wraps a Queue Pair (QP), and NCCL provides 4 GIN contexts per communicator. DeepEP's QP requirements are met using ⌈QPs/4⌉ communicators (see the arithmetic sketch after this list).
Memory Registration: Buffers registered with all communicators; window handles stored in GPU memory for kernel access.
Signal-Based Synchronization: Pre-allocated signal layouts map memory atomics to NCCL signal primitives.
Semantic Preservation: Zero-byte `put()` with SignalAdd is semantically equivalent to `net.signal()` but performs better in the current NCCL release.
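A small arithmetic sketch of the multi-communicator mapping above: with 4 GIN contexts (one QP each) per communicator, DeepEP's QP requirement is covered by ⌈QPs/4⌉ communicators. The constant and helper name are illustrative.

```cuda
inline int num_communicators(int required_qps) {
    constexpr int kGinContextsPerComm = 4;   // current per-communicator limit noted in the PR
    return (required_qps + kGinContextsPerComm - 1) / kGinContextsPerComm;
}
// Example: 10 QPs -> 3 communicators; each buffer is then registered with all of them.
```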
NCCL GIN Type Selection
For GPU-initiated network operations, NCCL supports multiple backends, selected via the `NCCL_GIN_TYPE` environment variable (e.g., `2` or `3`).
Performance Results
Benchmarked on H100 (900 GB/s NVLink) with 8×400 Gbit/s InfiniBand (~50 GB/s per NIC).
High-Throughput Kernels
4096 tokens, 7168 hidden, top-8 experts, BF16 dispatch, BF16 combine
Low-Latency Kernels
128 tokens, 7168 hidden, top-8 experts, FP8 dispatch, BF16 combine
Requirements
Tested configurations:
Usage
Future Work
Reduce Communicator Count: Currently, we initialize multiple NCCL communicators due to the limitation of 4 GIN contexts per communicator. Work is in progress on the NCCL roadmap to increase the number of GIN contexts per communicator, which will simplify the integration.
Unified Backend Interface for NVSHMEM: The current NVSHMEM code does not follow the `CommunicationBackend` interface, in order to stay close to the upstream DeepEP main branch. Migrating NVSHMEM to this interface should be addressed in consultation with the DeepEP maintainers.

References
📄 Paper: GPU-Initiated Networking for NCCL (arXiv:2511.15076)
🔗 NCCL Repository: https://github.com/NVIDIA/nccl
📄 NCCL README
Co-authored with @grtheod.