Commit f130899

NCCL 2.28.3-1

Device API (Experimental)
 * Introduces device-side APIs to integrate NCCL communication directly into application kernels.
 * Supports LSA (Load/Store Access) for CUDA P2P communication over NVLink and some PCIe platforms.
 * Supports Multimem for hardware multicast using NVLink SHARP.
 * Adds initial framework for GIN (GPU-Initiated Networking), currently under development.
 * Introduces device communicators created using ncclDevCommCreate.
 * Enables device-side communication operations with synchronization (ncclLsaBarrierSession) and memory accessors (ncclGetLsaPointer, ncclGetLsaMultimemPointer).
 * Experimental APIs - signatures and functionality may evolve in future releases.
 * No ABI compatibility is guaranteed; applications must be recompiled with each new NCCL release.

Symmetric memory improvements
 * Support for aggregating symmetric operations using ncclGroupStart/End APIs.
 * Reimplement symmetric kernels using device API.

New Host APIs
 * Introduce new host collective APIs: ncclAlltoAll, ncclScatter, ncclGather.

CE (Copy Engine) Collectives
 * Reduce SM utilization for alltoall, scatter, gather, and allgather within a single (MN)NVL domain.
 * Free up SM capacity for the application to do computation at the same time.
 * To enable the feature for ncclAllGather, ncclAlltoAll, ncclGather, ncclScatter, register buffers into symmetric windows and use the NCCL_CTA_POLICY_ZERO flag in the communicator config_t.

NCCL Inspector Plugin
 * Introduces an Inspector plugin for always-on performance monitoring.
 * Produces structured JSON output with metadata, execution time, bandwidth, and optional event traces for each NCCL operation.
 * Enables integration with analysis tools such as Performance Exporter to visualize NCCL performance bottlenecks.
 * Lightweight to enable via environment variables NCCL_PROFILER_PLUGIN and NCCL_INSPECTOR_ENABLE.

CMake support (Experimental)
 * Adds a CMake build system as an alternative to existing Makefiles.
 * Known issues: pkg.build and Device API currently do not work with CMake.
 * The known issues will be addressed in a future release.

Decreased max CTA count from 32 to 16 on Blackwell
 * SM overhead is decreased by 50% with this improvement.
 * This may cause some perf drop on Blackwell because of the reduced SM usage.
 * If the extra SM capacity is not desired, two options are available to restore the previous behavior: 1) setting the NCCL_MIN_CTAS=32 NCCL_MAX_CTAS=32 environment variables; 2) setting the communicator config to override the max CTA count to 32.
 * Based on community feedback, future versions may consider different trade-offs between performance and SM overhead.

Plugins
 * Network
   * App-aware Network plugin. NCCL passes information about communication operations to be executed on the network end point. This allows for better tuning of network end points and their use in the plugins.
   * Improve handling of physical and virtual network devices and load/unload.
   * Network plugin version 11 - add explicit context and communication ID support for per communicator init/finalize.
   * Add Multi-Request Net API. Using this will help NCCL to anticipate multiple send/recv requests and optimize for it. See maxMultiRequestSize field in ncclNetProperties_v11_t.
 * Profiler
   * Add support for API events (group, collective, and p2p) and for tracking kernel launches in the profiler plugin.
   * Add Inspector Profiler Plugin (see section above).
   * Add a hook to Google’s CoMMA profiler on GitHub.
 * Tuner
   * Expose NCCL tuning constants at tuner initialization via ncclTunerConstants_v5_t.
   * Add NVL Domain Information API.
 * Support multiple plugin types from a single shared object.

New Parameterization and ncclConfig changes:
 * Add new option NCCL_MNNVL_CLIQUE_ID=-2 which will use rack serial number to partition the MNNVL clique. This will limit NVLink domains to GPUs within a single rack.
 * Add NCCL_NETDEVS_POLICY to control how NET devices are assigned to GPUs. The default (AUTO) is the policy used in previous versions.
 * Add NCCL_SINGLE_PROC_MEM_REG_ENABLE control variable to enable NVLS UB registration in the “one process, multiple ranks” case as an opt-in.
 * Move nChannelsPerNetPeer into ncclConfig. NCCL_NCHANNELS_PER_NET_PEER can override the value in ncclConfig.
 * Enable PxN over C2C by default
   * PxN over C2C will improve performance for Grace-Blackwell platforms by allowing NCCL to leverage the NIC attached to a peer GPU over NVLINK, C2C, and PCIe.
   * This behavior can be overridden by setting NCCL_PXN_C2C=0.

Other Improvements:
 * Allow FP8 support for non-reductive operations on pre-sm90 devices. (See pytorch/pytorch#151594 (comment))
 * Fix NVLS+CollNet and temporarily disable COLLNET_CHAIN for >8 GPUs.
 * Only consider running interfaces for socket traffic. NCCL will not attempt to use interfaces that do not have the IFF_RUNNING bit. (#1798)
 * Modernize mutex management. Convert to std::mutex and std::lock_guard.
 * Remove sm35 and sm50 GENCODE targets which have long been deprecated and were causing issues with the latest NCCL release builds.
 * Improved NVLS/NVLSTree tuning prediction to improve algorithm and protocol selection.
 * NVLSTree Tuning Fixes. Update tuning data for H100, GB200-NV72.
 * Respond better to RoCE link flaps. Instead of reporting an “unknown event” it will now report “GID table changed”.
 * Move libvirt bridge interfaces to the end of the list of possible interfaces so that they are considered last. These interfaces are usually virtual bridges to relay traffic to containers running on the host and cannot be used for traffic to a remote node and are therefore unsuitable.
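The first option for restoring the previous CTA count on Blackwell (the `NCCL_MIN_CTAS` and `NCCL_MAX_CTAS` environment variables) can also be applied from inside the application. The following is a minimal sketch, assuming the variables are set before the first NCCL call in the process; `exampleRestoreBlackwellCtas` is a hypothetical helper name and not part of NCCL.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not an NCCL API): sets the two environment variables
 * named in the release notes. Assumption: it runs before any communicator
 * is created, so that NCCL sees the values when it initializes. */
int exampleRestoreBlackwellCtas(void) {
  if (setenv("NCCL_MIN_CTAS", "32", 1) != 0) return -1;
  if (setenv("NCCL_MAX_CTAS", "32", 1) != 0) return -1;
  return 0;
}
```

Setting the same variables in the launch environment has the same effect and avoids any ordering concern inside the process.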
1 parent 593de54 commit f130899

212 files changed: +15532 -2935 lines changed


ext-net/README.md

Lines changed: 16 additions & 9 deletions
@@ -60,36 +60,36 @@ of newer ones.
 The `nccl/` directory is populated with `net_vX.h` files extracting all relevant definitions
 from old API versions. It also provides error codes in `err.h`.

-# API (v10)
+# API (v11)

-Below is the main `ncclNet_v10` struct. Each function is explained in later sections.
+Below is the main `ncclNet_v11` struct. Each function is explained in later sections.

 ```
 typedef struct {
   // Name of the network (mainly for logs)
   const char* name;
   // Initialize the network.
-  ncclResult_t (*init)(ncclDebugLogger_t logFunction, ncclProfilerCallback_t profFunction);
+  ncclResult_t (*init)(void** ctx, uint64_t commId, ncclNetCommConfig_v11_t* config, ncclDebugLogger_t logFunction, ncclProfilerCallback_t profFunction);
   // Return the number of adapters.
   ncclResult_t (*devices)(int* ndev);
   // Get various device properties.
-  ncclResult_t (*getProperties)(int dev, ncclNetProperties_v10_t* props);
+  ncclResult_t (*getProperties)(int dev, ncclNetProperties_v11_t* props);
   // Create a receiving object and provide a handle to connect to it. The
   // handle can be up to NCCL_NET_HANDLE_MAXSIZE bytes and will be exchanged
   // between ranks to create a connection.
-  ncclResult_t (*listen)(int dev, void* handle, void** listenComm);
+  ncclResult_t (*listen)(void* ctx, int dev, void* handle, void** listenComm);
   // Connect to a handle and return a sending comm object for that peer.
   // This call must not block for the connection to be established, and instead
   // should return successfully with sendComm == NULL with the expectation that
   // it will be called again until sendComm != NULL.
   // If *sendDevComm points to a valid object, then NCCL is requesting device offload for this connection
-  ncclResult_t (*connect)(int dev, ncclNetCommConfig_v10_t* config, void* handle, void** sendComm, ncclNetDeviceHandle_v10_t** sendDevComm);
+  ncclResult_t (*connect)(void* ctx, int dev, void* handle, void** sendComm, ncclNetDeviceHandle_v11_t** sendDevComm);
   // Finalize connection establishment after remote peer has called connect.
   // This call must not block for the connection to be established, and instead
   // should return successfully with recvComm == NULL with the expectation that
   // it will be called again until recvComm != NULL.
   // If *recvDevComm points to a valid object, then NCCL is requesting device offload for this connection
-  ncclResult_t (*accept)(void* listenComm, void** recvComm, ncclNetDeviceHandle_v10_t** recvDevComm);
+  ncclResult_t (*accept)(void* listenComm, void** recvComm, ncclNetDeviceHandle_v11_t** recvDevComm);
   // Register/Deregister memory. Comm can be either a sendComm or a recvComm.
   // Type is either NCCL_PTR_HOST or NCCL_PTR_CUDA.
   ncclResult_t (*regMr)(void* comm, void* data, size_t size, int type, void** mhandle);
@@ -191,6 +191,12 @@ This will allow the plugin to discover network devices and make sure they are us
 `init` function does not return `ncclSuccess`, then NCCL will not use the plugin and fall back on
 internal ones.

+Every call to `init` returns an opaque context that the plugin uses internally to allocate resources
+and manage state. Such context is passed to other net plugin calls that create further resources,
+such as `listen` and `connect`. Every context is uniquely associated to a communicator
+using the commId. The network can also be initialized with a per communicator configuration using
+the `config` argument.
+
 To allow the plugin logs to integrate into the NCCL logs seemlessly, NCCL provides a logging
 function to `init`. This function is typically used to allow for `INFO` and `WARN` macros within
 the plugin code adding the following definitions:
@@ -282,7 +288,7 @@ side.
 `listen`

 To create a connection, NCCL will start by calling `listen` on the receiver side. This function
-takes a device number as input argument, and should return a local `listenComm` object, and a
+takes the opaque plugin context returned by `init` and a device number as input argument, and should return a local `listenComm` object, and a
 `handle` to pass to the other side, so that the sender side can connect to the receiver.

 The `handle` is a buffer of size `NCCL_NET_HANDLE_MAXSIZE` and is provided by NCCL.
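The context flow described in these README changes (an opaque context created by `init`, then handed to `listen`) can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the example plugin's code: `examplePluginCtx`, `exampleListenComm`, `exampleHandle` and the `example*` functions are hypothetical names, the NCCL types and the `NCCL_NET_HANDLE_MAXSIZE` value are local stand-ins so the snippet compiles on its own, and the logger and profiler arguments of `init` are omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-ins for NCCL definitions (assumptions, not the real headers). */
typedef enum { ncclSuccess = 0, ncclSystemError = 2 } ncclResult_t;
#define NCCL_NET_HANDLE_MAXSIZE 128
typedef struct { int trafficClass; } ncclNetCommConfig_v11_t;

/* Hypothetical per-communicator state owned by the plugin. */
typedef struct {
  uint64_t commId;
  int trafficClass;
  int nListen;
} examplePluginCtx;

/* Hypothetical listen object and wire handle. */
typedef struct { examplePluginCtx* ctx; int dev; } exampleListenComm;
typedef struct { uint64_t commId; int dev; } exampleHandle;

/* init: allocate the context and remember the communicator id and config. */
ncclResult_t exampleInit(void** ctx, uint64_t commId, ncclNetCommConfig_v11_t* config) {
  examplePluginCtx* c = calloc(1, sizeof(*c));
  if (c == NULL) return ncclSystemError;
  c->commId = commId;
  c->trafficClass = config ? config->trafficClass : -1;
  *ctx = c;
  return ncclSuccess;
}

/* listen: receives the context from init and fills the caller-provided handle. */
ncclResult_t exampleListen(void* ctx, int dev, void* handle, void** listenComm) {
  examplePluginCtx* c = (examplePluginCtx*)ctx;
  exampleListenComm* lc = malloc(sizeof(*lc));
  if (lc == NULL) return ncclSystemError;
  assert(sizeof(exampleHandle) <= NCCL_NET_HANDLE_MAXSIZE);
  lc->ctx = c;
  lc->dev = dev;
  exampleHandle h = { c->commId, dev };
  memcpy(handle, &h, sizeof(h));
  c->nListen++;
  *listenComm = lc;
  return ncclSuccess;
}

/* finalize: release the context created by init. */
ncclResult_t exampleFinalize(void* ctx) {
  free(ctx);
  return ncclSuccess;
}
```

Keeping all state behind the context pointer, rather than in globals, is what lets one plugin instance serve several communicators independently.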
@@ -304,7 +310,8 @@ the `listen` call previously. If the sender did not connect yet, `accept` should
 should return `ncclSuccess`, setting `recvComm` to `NULL`. NCCL will call `accept` again until it
 succeeds.

-The `connect` API takes a `ncclNetCommConfig_t`, which contains a trafficClass field.
+The `connect` API takes the opaque plugin context returned by `init`. The plugin context can reference
+the `ncclNetCommConfig_t` passed to the `init` function and containing a trafficClass field.
 This field can be used by the network plugin to specify the QoS level of the connection. By default,
 `trafficClass` is set to -1 but can be configured by the application during communicator initialization
 to select a plugin-supported QoS level.
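The non-blocking contract for `accept` described above (return `ncclSuccess` with `recvComm` set to `NULL` until the peer has connected) can be illustrated with a small self-contained sketch. The names `exampleListenComm`, `exampleAccept` and `examplePollAccept` are hypothetical, the `ncclResult_t` enum is a local stand-in, the device-offload argument is omitted, and the "peer arrives after N polls" counter is only a simulation of real connection progress.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the NCCL result type (assumption). */
typedef enum { ncclSuccess = 0 } ncclResult_t;

/* Hypothetical listen object: simulates a peer that connects after a few polls. */
typedef struct {
  int pollsUntilReady;
  int recvCommStorage;  /* placeholder for a real receive comm object */
} exampleListenComm;

/* accept never blocks: while the peer is missing it succeeds with *recvComm == NULL. */
ncclResult_t exampleAccept(void* listenComm, void** recvComm) {
  exampleListenComm* lc = (exampleListenComm*)listenComm;
  if (lc->pollsUntilReady > 0) {
    lc->pollsUntilReady--;
    *recvComm = NULL;
    return ncclSuccess;
  }
  *recvComm = &lc->recvCommStorage;
  return ncclSuccess;
}

/* Caller-side loop: retry until a comm is returned; reports the number of calls made. */
int examplePollAccept(exampleListenComm* lc, void** recvComm) {
  int calls = 0;
  *recvComm = NULL;
  while (*recvComm == NULL) {
    if (exampleAccept(lc, recvComm) != ncclSuccess) return -1;
    calls++;
  }
  return calls;
}
```

The same retry pattern applies to `connect` and `sendComm`; a real caller would interleave other progress work between attempts instead of spinning.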

ext-net/example/CMakeLists.txt

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
+set(SRC_FILES
+  ${CMAKE_CURRENT_SOURCE_DIR}/plugin.c
+)
+
+# Create shared library
+add_library(nccl-net-example SHARED ${SRC_FILES})
+
+# Set include directories
+target_include_directories(nccl-net-example PRIVATE
+  ${CMAKE_CURRENT_SOURCE_DIR}/nccl
+)
+
+# Set output name to match Makefile
+set_target_properties(nccl-net-example PROPERTIES
+  OUTPUT_NAME "nccl-net-example"
+  PREFIX "lib"
+  POSITION_INDEPENDENT_CODE ON
+  LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/test/unit/plugins
+)

ext-net/example/nccl/net.h

Lines changed: 6 additions & 4 deletions
@@ -22,7 +22,9 @@

 // Maximum number of requests per comm object
 #define NCCL_NET_MAX_REQUESTS 32
+#define NCCL_NET_MAX_DEVS_PER_NIC 4

+#include "net_v11.h"
 #include "net_v10.h"
 #include "net_v9.h"
 #include "net_v8.h"
@@ -33,9 +35,9 @@
 #include "net_v3.h"
 #include "net_v2.h"

-typedef ncclNet_v10_t ncclNet_t;
-typedef ncclNetProperties_v10_t ncclNetProperties_t;
-typedef ncclNetVDeviceProps_v10_t ncclNetVDeviceProps_t;
-typedef ncclNetCommConfig_v10_t ncclNetCommConfig_t;
+typedef ncclNet_v11_t ncclNet_t;
+typedef ncclNetProperties_v11_t ncclNetProperties_t;
+typedef ncclNetVDeviceProps_v11_t ncclNetVDeviceProps_t;
+typedef ncclNetCommConfig_v11_t ncclNetCommConfig_t;

 #endif // end include guard

ext-net/example/nccl/net_device.h

Lines changed: 3 additions & 2 deletions
@@ -12,7 +12,7 @@

 // Arbitrary version number - A given NCCL build will only be compatible with a single device networking plugin
 // version. NCCL will check the supplied version number from net->getProperties() and compare to its internal version.
-#define NCCL_NET_DEVICE_UNPACK_VERSION 0x7
+#define NCCL_NET_DEVICE_UNPACK_VERSION 0x7

 typedef enum {NCCL_NET_DEVICE_HOST=0, NCCL_NET_DEVICE_UNPACK=1} ncclNetDeviceType;

@@ -27,6 +27,7 @@ typedef struct {
 typedef ncclNetDeviceHandle_v7_t ncclNetDeviceHandle_v8_t;
 typedef ncclNetDeviceHandle_v8_t ncclNetDeviceHandle_v9_t;
 typedef ncclNetDeviceHandle_v9_t ncclNetDeviceHandle_v10_t;
-typedef ncclNetDeviceHandle_v10_t ncclNetDeviceHandle_t;
+typedef ncclNetDeviceHandle_v10_t ncclNetDeviceHandle_v11_t;
+typedef ncclNetDeviceHandle_v11_t ncclNetDeviceHandle_t;

 #endif

ext-net/example/nccl/net_v10.h

Lines changed: 1 addition & 2 deletions
@@ -5,10 +5,9 @@
 #ifndef NET_V10_H_
 #define NET_V10_H_

-#define NCCL_NET_MAX_DEVS_PER_NIC_V10 4
 typedef struct {
   int ndevs;
-  int devs[NCCL_NET_MAX_DEVS_PER_NIC_V10];
+  int devs[NCCL_NET_MAX_DEVS_PER_NIC];
 } ncclNetVDeviceProps_v10_t;

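With the per-version macros replaced by the single `NCCL_NET_MAX_DEVS_PER_NIC` defined in `net.h`, code that fills a virtual-device property struct can bound its writes against one constant. The sketch below repeats the macro and struct from this diff so that it compiles on its own; `exampleVPropsAdd` is a hypothetical helper and not part of the plugin API.

```c
#include <assert.h>

/* Repeated from net.h and net_v10.h in this commit so the sketch is self-contained. */
#define NCCL_NET_MAX_DEVS_PER_NIC 4

typedef struct {
  int ndevs;
  int devs[NCCL_NET_MAX_DEVS_PER_NIC];
} ncclNetVDeviceProps_v10_t;

/* Hypothetical helper: append a physical device index to the list.
 * Returns 0 on success, or -1 when the fixed-size array is already full. */
int exampleVPropsAdd(ncclNetVDeviceProps_v10_t* props, int dev) {
  if (props->ndevs >= NCCL_NET_MAX_DEVS_PER_NIC) return -1;
  props->devs[props->ndevs++] = dev;
  return 0;
}
```

Because the v9, v10 and v11 structs now share the same bound, a helper like this would work unchanged for each of them, apart from the struct type name.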

ext-net/example/nccl/net_v11.h

Lines changed: 120 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,120 @@
1+
/*
2+
* Copyright (c) 2017-2022, NVIDIA CORPORATION. All rights reserved.
3+
*/
4+
5+
#ifndef NET_V11_H_
6+
#define NET_V11_H_
7+
8+
typedef struct {
9+
int ndevs;
10+
int devs[NCCL_NET_MAX_DEVS_PER_NIC];
11+
} ncclNetVDeviceProps_v11_t;
12+
13+
#define NCCL_NET_TRAFFIC_CLASS_UNDEF -1
14+
15+
typedef struct {
16+
// Plugin-specific TC value
17+
int trafficClass;
18+
} ncclNetCommConfig_v11_t;
19+
20+
21+
typedef struct {
22+
char* name; // Used mostly for logging.
23+
char* pciPath; // Path to the PCI device in /sys.
24+
uint64_t guid; // Unique identifier for the NIC chip. Important for
25+
// cards with multiple PCI functions (Physical or virtual).
26+
int ptrSupport; // [NCCL_PTR_HOST|NCCL_PTR_CUDA|NCCL_PTR_DMABUF]
27+
int regIsGlobal; // regMr is not tied to a particular comm
28+
int forceFlush; // Force a flush on receives
29+
int speed; // Port speed in Mbps.
30+
int port; // Port number.
31+
float latency; // Network latency
32+
int maxComms; // Maximum number of comms we can create
33+
int maxRecvs; // Maximum number of grouped receives.
34+
ncclNetDeviceType netDeviceType; // Network offload type
35+
int netDeviceVersion; // Version number for network offload
36+
ncclNetVDeviceProps_v11_t vProps;
37+
size_t maxP2pBytes; // Max transfer size for point-to-point operations
38+
size_t maxCollBytes; // Max transfer size for collective operations
39+
int maxMultiRequestSize; // Maximum number of requests supported in a single multi-request.
40+
} ncclNetProperties_v11_t;
41+
42+
typedef struct {
43+
int32_t maxConcurrentPeers;
44+
int32_t minConcurrentPeers;
45+
int32_t maxFlowsPerPeer;
46+
int32_t minFlowsPerPeer;
47+
} ncclNetCommAttr_v11_t;
48+
49+
typedef struct {
50+
ncclNetCommAttr_v11_t sendCommAttr;
51+
ncclNetCommAttr_v11_t recvCommAttr;
52+
uint32_t op;
53+
uint32_t algo;
54+
uint32_t proto;
55+
} ncclNetAttr_v11_t;
56+
57+
typedef struct {
58+
// Name of the network (mainly for logs)
59+
const char* name;
60+
// Initialize the network.
61+
ncclResult_t (*init)(void** ctx, uint64_t commId, ncclNetCommConfig_v11_t* config, ncclDebugLogger_t logFunction, ncclProfilerCallback_t profFunction);
62+
// Return the number of adapters.
63+
ncclResult_t (*devices)(int* ndev);
64+
// Get various device properties.
65+
ncclResult_t (*getProperties)(int dev, ncclNetProperties_v11_t* props);
66+
// Create a receiving object and provide a handle to connect to it. The
67+
// handle can be up to NCCL_NET_HANDLE_MAXSIZE bytes and will be exchanged
68+
// between ranks to create a connection.
69+
ncclResult_t (*listen)(void* ctx, int dev, void* handle, void** listenComm);
70+
// Connect to a handle and return a sending comm object for that peer.
71+
// This call must not block for the connection to be established, and instead
72+
// should return successfully with sendComm == NULL with the expectation that
73+
// it will be called again until sendComm != NULL.
74+
// If *sendDevComm points to a valid object, then NCCL is requesting device offload for this connection
75+
ncclResult_t (*connect)(void* ctx, int dev, void* handle, void** sendComm, ncclNetDeviceHandle_v11_t** sendDevComm);
76+
// Finalize connection establishment after remote peer has called connect.
77+
// This call must not block for the connection to be established, and instead
78+
// should return successfully with recvComm == NULL with the expectation that
79+
// it will be called again until recvComm != NULL.
80+
// If *recvDevComm points to a valid object, then NCCL is requesting device offload for this connection
81+
ncclResult_t (*accept)(void* listenComm, void** recvComm, ncclNetDeviceHandle_v11_t** recvDevComm);
82+
// Register/Deregister memory. Comm can be either a sendComm or a recvComm.
83+
// Type is either NCCL_PTR_HOST or NCCL_PTR_CUDA.
84+
ncclResult_t (*regMr)(void* comm, void* data, size_t size, int type, void** mhandle);
85+
/* DMA-BUF support */
86+
ncclResult_t (*regMrDmaBuf)(void* comm, void* data, size_t size, int type, uint64_t offset, int fd, void** mhandle);
87+
ncclResult_t (*deregMr)(void* comm, void* mhandle);
88+
// Asynchronous send to a peer.
89+
// May return request == NULL if the call cannot be performed (or would block)
90+
ncclResult_t (*isend)(void* sendComm, void* data, size_t size, int tag, void* mhandle, void* phandle, void** request);
91+
// Asynchronous recv from a peer.
92+
// May return request == NULL if the call cannot be performed (or would block)
93+
ncclResult_t (*irecv)(void* recvComm, int n, void** data, size_t* sizes, int* tags, void** mhandles, void** phandles, void** request);
94+
// Perform a flush/fence to make sure all data received with NCCL_PTR_CUDA is
95+
// visible to the GPU
96+
ncclResult_t (*iflush)(void* recvComm, int n, void** data, int* sizes, void** mhandles, void** request);
97+
// Test whether a request is complete. If size is not NULL, it returns the
98+
// number of bytes sent/received.
99+
ncclResult_t (*test)(void* request, int* done, int* sizes);
100+
// Close and free send/recv comm objects
101+
ncclResult_t (*closeSend)(void* sendComm);
102+
ncclResult_t (*closeRecv)(void* recvComm);
103+
ncclResult_t (*closeListen)(void* listenComm);
104+
105+
// Copy the given mhandle to a dptr in a format usable by this plugin's device code
106+
ncclResult_t (*getDeviceMr)(void* comm, void* mhandle, void** dptr_mhandle);
107+
108+
// Notify the plugin that a recv has completed by the device
109+
ncclResult_t (*irecvConsumed)(void* recvComm, int n, void* request);
110+
111+
// Virtual NIC APIs. makeVDevice will create a virtual NIC given the specified properties, and tell the caller
112+
// what index this new vNIC exists at
113+
ncclResult_t (*makeVDevice)(int* d, ncclNetVDeviceProps_v11_t* props);
114+
// Finalize the network.
115+
ncclResult_t (*finalize)(void* ctx);
116+
117+
ncclResult_t (*setNetAttr)(void* ctx, ncclNetAttr_v11_t* netAttr);
118+
} ncclNet_v11_t;
119+
120+
#endif // end include guard
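`ncclNet_v11_t` is a table of function pointers that a plugin fills in and the caller invokes through the table. The sketch below shows that pattern with a deliberately trimmed, hypothetical struct (`exampleNet_t`) that keeps only four members and drops the config, logger and profiler arguments of `init`; the `ncclResult_t` enum is a local stand-in, and none of these names come from the NCCL headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-in for the NCCL result type (assumption). */
typedef enum { ncclSuccess = 0, ncclSystemError = 2 } ncclResult_t;

/* Trimmed, hypothetical mirror of ncclNet_v11_t: only four of its members. */
typedef struct {
  const char* name;
  ncclResult_t (*init)(void** ctx, uint64_t commId);
  ncclResult_t (*devices)(int* ndev);
  ncclResult_t (*finalize)(void* ctx);
} exampleNet_t;

/* The context here is just the communicator id stored on the heap. */
static ncclResult_t exInit(void** ctx, uint64_t commId) {
  uint64_t* state = malloc(sizeof(*state));
  if (state == NULL) return ncclSystemError;
  *state = commId;
  *ctx = state;
  return ncclSuccess;
}

/* Report a single (pretend) adapter. */
static ncclResult_t exDevices(int* ndev) {
  *ndev = 1;
  return ncclSuccess;
}

/* Release the context allocated by exInit. */
static ncclResult_t exFinalize(void* ctx) {
  free(ctx);
  return ncclSuccess;
}

/* The table a caller would use; designated initializers keep it readable
 * when the struct gains members in a later version. */
const exampleNet_t exampleNet = {
  .name = "example",
  .init = exInit,
  .devices = exDevices,
  .finalize = exFinalize,
};
```

A caller only ever touches `exampleNet`, for instance `exampleNet.init(&ctx, commId)` followed later by `exampleNet.finalize(ctx)`, which mirrors how `finalize` in v11 takes the context that `init` produced.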

ext-net/example/nccl/net_v9.h

Lines changed: 1 addition & 2 deletions
@@ -5,10 +5,9 @@
 #ifndef NET_V9_H_
 #define NET_V9_H_

-#define NCCL_NET_MAX_DEVS_PER_NIC_V9 4
 typedef struct {
   int ndevs;
-  int devs[NCCL_NET_MAX_DEVS_PER_NIC_V9];
+  int devs[NCCL_NET_MAX_DEVS_PER_NIC];
 } ncclNetVDeviceProps_v9_t;

 typedef struct {
