[SYNC] 2.28.3 #2040
Conversation
Device API (Experimental)
* Introduces device-side APIs to integrate NCCL communication directly into application kernels.
* Supports LSA (Load/Store Access) for CUDA P2P communication over NVLink and some PCIe platforms.
* Supports Multimem for hardware multicast using NVLink SHARP.
* Adds the initial framework for GIN (GPU-Initiated Networking), currently under development.
* Introduces device communicators created using ncclDevCommCreate.
* Enables device-side communication operations with synchronization (ncclLsaBarrierSession) and memory accessors (ncclGetLsaPointer, ncclGetLsaMultimemPointer).
* Experimental APIs: signatures and functionality may evolve in future releases.
* No ABI compatibility is guaranteed; applications must be recompiled with each new NCCL release.

Symmetric memory improvements
* Support aggregating symmetric operations using the ncclGroupStart/End APIs.
* Reimplement symmetric kernels using the device API.

New Host APIs
* Introduce new host collective APIs: ncclAlltoAll, ncclScatter, ncclGather.

CE (Copy Engine) Collectives
* Reduce SM utilization for alltoall, scatter, gather, and allgather within a single (MN)NVL domain.
* Free up SM capacity for the application to run computation at the same time.
* To enable the feature for ncclAllGather, ncclAlltoAll, ncclGather, and ncclScatter, register buffers into symmetric windows and use the NCCL_CTA_POLICY_ZERO flag in the communicator config_t (see the sketch after these release notes).

NCCL Inspector Plugin
* Introduces an Inspector plugin for always-on performance monitoring.
* Produces structured JSON output with metadata, execution time, bandwidth, and optional event traces for each NCCL operation.
* Enables integration with analysis tools such as Performance Exporter to visualize NCCL performance bottlenecks.
* Lightweight to enable via the NCCL_PROFILER_PLUGIN and NCCL_INSPECTOR_ENABLE environment variables.

CMake support (Experimental)
* Adds a CMake build system as an alternative to the existing Makefiles.
* Known issues: pkg.build and the Device API currently do not work with CMake. These will be addressed in a future release.

Decreased max CTA count from 32 to 16 on Blackwell
* SM overhead is decreased by 50% with this change.
* This may cause some performance drop on Blackwell because of the reduced SM usage.
* If the extra SM capacity is not desired, two options restore the previous behavior: 1) set the NCCL_MIN_CTAS=32 and NCCL_MAX_CTAS=32 environment variables; 2) set the communicator config to overwrite the max CTA count to 32.
* Based on community feedback, future versions may consider different trade-offs between performance and SM overhead.

Plugins
* Network
  * App-aware network plugin: NCCL passes information about the communication operations to be executed on a network endpoint, allowing better tuning of endpoints and of their use in the plugins.
  * Improve handling of physical and virtual network devices and their load/unload.
  * Network plugin version 11: add explicit context and communication ID support for per-communicator init/finalize.
  * Add the Multi-Request Net API, which lets NCCL anticipate multiple send/recv requests and optimize for them. See the maxMultiRequestSize field in ncclNetProperties_v11_t.
* Profiler
  * Add support for API events (group, collective, and p2p) and for tracking kernel launches in the profiler plugin.
  * Add the Inspector profiler plugin (see section above).
  * Add a hook to Google's CoMMA profiler on GitHub.
* Tuner
  * Expose NCCL tuning constants at tuner initialization via ncclTunerConstants_v5_t.
  * Add the NVL Domain Information API.
  * Support multiple plugin types from a single shared object.

New Parameterization and ncclConfig changes:
* Add the new option NCCL_MNNVL_CLIQUE_ID=-2, which uses the rack serial number to partition the MNNVL clique. This limits NVLink domains to GPUs within a single rack.
* Add NCCL_NETDEVS_POLICY to control how NET devices are assigned to GPUs. The default (AUTO) is the policy used in previous versions.
* Add the NCCL_SINGLE_PROC_MEM_REG_ENABLE control variable to enable NVLS UB registration in the "one process, multiple ranks" case as an opt-in.
* Move nChannelsPerNetPeer into ncclConfig. NCCL_NCHANNELS_PER_NET_PEER can override the value in ncclConfig.
* Enable PxN over C2C by default.
  * PxN over C2C improves performance on Grace-Blackwell platforms by allowing NCCL to leverage the NIC attached to a peer GPU over NVLink, C2C, and PCIe.
  * This behavior can be overridden by setting NCCL_PXN_C2C=0.

Other Improvements:
* Allow FP8 support for non-reductive operations on pre-sm90 devices. (See pytorch/pytorch#151594 (comment).)
* Fix NVLS+CollNet and temporarily disable COLLNET_CHAIN for >8 GPUs.
* Only consider running interfaces for socket traffic: NCCL will not attempt to use interfaces that do not have the IFF_RUNNING bit set. (NVIDIA/nccl#1798)
* Modernize mutex management: convert to std::mutex and std::lock_guard.
* Remove the long-deprecated sm35 and sm50 GENCODE targets, which were causing issues with the latest NCCL release builds.
* Improve NVLS/NVLSTree tuning prediction for better algorithm and protocol selection.
* NVLSTree tuning fixes: update tuning data for H100 and GB200-NVL72.
* Respond better to RoCE link flaps: instead of reporting an "unknown event", NCCL now reports "GID table changed".
* Move libvirt bridge interfaces to the end of the interface list so that they are considered last. These are usually virtual bridges relaying traffic to containers on the host; they cannot carry traffic to a remote node and are therefore unsuitable.
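For the CE-collectives opt-in and the new nChannelsPerNetPeer config field described above, a host-side sketch might look like the following. This is illustrative only: the exact ncclConfig_t field names (CTAPolicy, nChannelsPerNetPeer) and the window-registration call are assumptions pieced together from these notes and the existing symmetric-window API, not verified signatures.

```c
// Sketch: opt in to copy-engine (CE) collectives on a symmetric window.
// nranks, myrank, and id are assumed to come from the usual bootstrap;
// NCCLCHECK is the usual error-checking macro.
ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
config.CTAPolicy = NCCL_CTA_POLICY_ZERO;   // flag named in the release notes (field name assumed)
config.nChannelsPerNetPeer = 2;            // now a config field; NCCL_NCHANNELS_PER_NET_PEER overrides it

ncclComm_t comm;
NCCLCHECK(ncclCommInitRankConfig(&comm, nranks, id, myrank, &config));

// Register buffers into a symmetric window so that the CE versions of
// allgather/alltoall/gather/scatter can run without consuming SMs.
void* buf;
ncclWindow_t win;
NCCLCHECK(ncclMemAlloc(&buf, bytes));
NCCLCHECK(ncclCommWindowRegister(comm, buf, bytes, &win, NCCL_WIN_COLL_SYMMETRIC));
```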
```diff
-    rankOffset >= 744 * 1024 && rankAlign != 4 && rcclParamAllToAllPivotEnable()) {
-      struct ncclInfo info = { ncclFuncAllToAllPivot, "AllToAllPivot",
+    rankOffset >= 744 * 1024 && rankAlign != 4 /*&& rcclParamAllToAllPivotEnable()*/) {
+      struct ncclInfo info = { ncclFuncAlltoAll, "AlltoAll",
```
I do not see any send-recv operations being posted here in NCCL; they are posted in enqueue. Any reason we are not doing the same?
```cpp
// gfx name from hipDeviceProp_t [RCCL]
char* archName;
// multiProcessorCount from hipDeviceProp_t [RCCL]
int cuCount;
```
archName and cuCount were removed; I assume this was intentional?
What is changing here? Is it just the newline style?
There are some NCCL changes missing in here.
```cpp
/*! @brief All-To-All
    @details Device (i) sends the (j)th block of data to device (j), where it is
    placed as the (i)th block. Each block sent or received has *count* elements,
    so *recvbuff* and *sendbuff* must each hold nranks*count elements.
    In-place operation is NOT supported; it is the user's responsibility
    to ensure that sendbuff and recvbuff are distinct.
    @return Result code. See @ref rccl_result_code for more details.

    @param[in]  sendbuff Data array to send (contains blocks for each other rank)
    @param[out] recvbuff Data array to receive (contains blocks from each other rank)
    @param[in]  count    Number of elements to send between each pair of ranks
    @param[in]  datatype Data buffer element datatype
    @param[in]  comm     Communicator group object to execute on
    @param[in]  stream   HIP stream to execute collective on */
ncclResult_t ncclAllToAll(const void* sendbuff, void* recvbuff, size_t count,
    ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream);
/*! @cond include_hidden */
```
We should maintain this documentation style in the declaration added above, but the content should match the new comment.
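For context, a host-side call matching the declaration above might look like this (a sketch; nranks, comm, and stream are assumed to be set up through the usual RCCL/HIP flow, and HIPCHECK/NCCLCHECK are the usual error-checking macros):

```cpp
// Each rank exchanges `count` elements with every other rank, so both
// buffers must hold nranks * count elements; they must also be distinct,
// since in-place operation is not supported.
size_t count = 1024;
float *sendbuff, *recvbuff;
HIPCHECK(hipMalloc(&sendbuff, nranks * count * sizeof(float)));
HIPCHECK(hipMalloc(&recvbuff, nranks * count * sizeof(float)));
NCCLCHECK(ncclAllToAll(sendbuff, recvbuff, count, ncclFloat, comm, stream));
HIPCHECK(hipStreamSynchronize(stream));
```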
```cpp
 *
 * In-place operations will happen if recvbuff == sendbuff + root * count.
 */
ncclResult_t ncclScatter(const void* sendbuff, void* recvbuff, size_t count,
```
The same applies to Gather, Scatter, and Send regarding documentation style.
```cpp
  }
}

NCCL_PARAM(PxnC2c, "PXN_C2C", 0);
```
Default for NCCL_PXN_C2C was changed to 1 in NCCL, was this left out intentionally?
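If the upstream default is adopted, the fix would presumably be limited to the default argument (sketch):

```cpp
// Match upstream NCCL 2.28, which enables PxN over C2C by default
// and lets users opt out with NCCL_PXN_C2C=0.
NCCL_PARAM(PxnC2c, "PXN_C2C", 1);
```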
```cpp
// LL128 max BW per channel
static const double llMaxBws[][3] = {
  /* Volta-N1/Intel-N2/Intel-N4) */ {39.0, 39.0, 20.4},
  /* Ampere-N1/AMD-N2/AMD-N4) */ {87.7, 22.5 /*avg of ring & tree*/, 19.0},
  /* Hopper-N1/AMD-N2/AMD-N4) */ {141.0, 45.0 /*avg of ring & tree*/, 35.0},
  /* Blackwell-N1/AMD-N2/AMD-N4) */ {2*141.0, 2*45.0 /*avg of ring & tree*/, 2*35.0},
};

static const double perChMaxRingLL128Bws[][3] = {
  /* Volta (N1/N2/N4) */ {20.0, 20.0, 20.0},
  /* Ampere (N1/N2/N4) */ {20.0, 20.0, 20.0},
  /* Hopper (N1/N2/N4) */ {36.7, 36.7, 36.7},
  /* Blackwell (N1/N2/N4) */ {2*36.7, 2*36.7, 2*36.7},
};
static const double perChMaxTreeLL128Bws[][3] = {
  /* Volta (N1/N2/N4) */ {20.0, 20.0, 20.0},
  /* Ampere (N1/N2/N4) */ {20.0, 20.0, 20.0},
  /* Hopper (N1/N2/N4) */ {36.7, 36.7, 29.0},
  /* Blackwell (N1/N2/N4) */ {2*36.7, 2*36.7, 2*29.0},
};
static const double perChMaxTreeBws[][3] = {
  /* Volta (N1/N2/N4) */ {26.5, 18.5, 10.0},
  /* Ampere (N1/N2/N4) */ {24.0, 23.6, 17.8},
  /* Hopper (N1/N2/N4) */ {38.7, 41.4, 36.0},
  /* Blackwell (N1/N2/N4) */ {2*38.7, 2*41.4, 2*36.0},
};
```
I believe all this code can be deleted now.
```cpp
/* Array indexes used below */
#define VOLTA_COMPCAP_IDX 0
#define AMPERE_COMPCAP_IDX 1
#define HOPPER_COMPCAP_IDX 2
#define BLACKWELL_COMPCAP_IDX 3
```
Can be removed now.
```cpp
// Latencies in us, Bandwidths in GB/s
// Tree { LL, LL128, Simple } , Ring { LL, LL128, Simple }
static const float baseLat [NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS] = {
  { 12.0, 12.0, 17.0 }, { 12.0, 12.0, 17.0 }, // Tree, Ring
  { 12.0, 12.0, 17.0 }, { 12.0, 12.0, 17.0 }, // Collnet Direct, Chain
  { 0, 0, 0 }, { 0, 0, 0 }};                  // NVLS, NVLS Tree

// NVLink, PCI, Network
#define NCCL_HW_NVLINK 0
#define NCCL_HW_PCI 1
#define NCCL_HW_NET 2
```
Code can be removed.
```cpp
struct tuningModel {
  float hwLat [3][NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
  float bwRatio [2][NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
  float treeCorrectionFactor[NCCL_NUM_PROTOCOLS][27];
  float ringCorrectionFactor[NCCL_NUM_PROTOCOLS][27];
  uint64_t llProtoRanges[RCCL_TUNABLE_COLLS][NCCL_NUM_PROTOCOLS - 1][RCCL_PROTOCOL_ENTRY_SIZE];
  uint64_t channelThresholds[RCCL_TUNABLE_COLLS][RCCL_CHANNELS_TUNABLE_ENTRIES][3]; // per collective, one {min,max,nchannels} entry per channel count: 2,4,8,16,32,40,48,56,64
};
```
NCCL has changed its tuning model code, we should see if anything needs to happen here. Might want to check with others. @nusislam @mustafabar @PedramAlizadeh @isaki001 please see NVIDIA/nccl@f130899#diff-8029994d1ac745bdd5aa9db869bc52ccd10ad396e54b88acbb4d52e4cb343085
```cpp
int nRanks = comm->nRanks;
if (nRanks <= 1) return ncclSuccess;
#if !defined(__HIP_PLATFORM_AMD__) && !defined(__HIPCC__)
int compCapIndex = minCompCap >= 100 ? BLACKWELL_COMPCAP_IDX : (minCompCap >= 90 ? HOPPER_COMPCAP_IDX : minCompCap >= 80 ? AMPERE_COMPCAP_IDX : VOLTA_COMPCAP_IDX);
```
Missed
```cpp
//INFO(NCCL_INIT, "algo %s proto %s busBw %f baseBw %f bw %f nChannels %d bwIntra %f bwInter %f", ncclAlgoStr[a], ncclProtoStr[p], busBw, comm->topo->baseBw, bw, graphs[a]->nChannels, graphs[a]->bwIntra, graphs[a]->bwInter);

if (a == NCCL_ALGO_NVLS) {
#if !defined(__HIP_PLATFORM_AMD__) && !defined(__HIPCC__)
```
It makes sense, but I'd just like to double-check: why is this disabled on AMD?
```cpp
float lat = rcclTuningModel[comm->topo->tuning].hwLat[hw[a]][a][p];
//float lat = comm->tunerConstants.hwLatencies[hw[a]][a][p];
```
Yes, this is good because it will show up better in future merges.
```cpp
NCCL_API(ncclResult_t, ncclMemAlloc, void **ptr, size_t size);
ncclResult_t ncclMemAlloc_impl(void **ptr, size_t size) {
  NVTX3_FUNC_RANGE_IN(nccl_domain);
```
Missing change
```cpp
NCCL_API(ncclResult_t, ncclMemFree, void *ptr);
ncclResult_t ncclMemFree_impl(void *ptr) {
  NVTX3_FUNC_RANGE_IN(nccl_domain);
```
Missing change
```cpp
#define RCCL_API_TRACE_VERSION_MAJOR 0

// should be increased every time new members are added to existing dispatch tables
#define RCCL_API_TRACE_VERSION_PATCH 2
```
With this change we may need to bump the major version.
Alternatively, we could keep the old API all-to-all function and just make it call into the new all-to-all function. Then we only need to bump the patch version.
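Sketched out, the compatibility shim suggested above could be as thin as the following (hypothetical; it assumes the new host API keeps the same argument list as the old RCCL entry point):

```cpp
// Hypothetical shim: keep the old RCCL entry point and forward to the new
// NCCL host API, so only RCCL_API_TRACE_VERSION_PATCH needs to be bumped.
ncclResult_t ncclAllToAll(const void* sendbuff, void* recvbuff, size_t count,
                          ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream) {
  return ncclAlltoAll(sendbuff, recvbuff, count, datatype, comm, stream);
}
```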
Can you squash all non-NCCL commits into one? It should have all original NCCL commits + 1 merge commit, similar to what we have done in all earlier merges, i.e. #1928

This is the intention once fixes are complete. Just need to ensure no fixes are lost.
Force-pushed d464291 to b7b318b.
```diff
 #ifdef GENERATE_SYM_KERNELS
   for (int sym=0; sym <= 1; sym++) {
-    int kcount = sym==0 ? KernelCount : ncclSymKernelCount;
+    int kcount = sym==0 ? ncclDevKernelCount : ncclSymkKernelCount;
```
Suggested change:
```diff
-    int kcount = sym==0 ? ncclDevKernelCount : ncclSymkKernelCount;
+    int kcount = sym==0 ? KernelCount : ncclSymkKernelCount;
```
RCCL seems to have purposely maintained the use of KernelCount. @BertanDogancay do you know why?
We should keep KernelCount since rccl has its own kernel list defined at the beginning of this file.
```diff
-  uint32_t isCollnet:1, isNvls:1;
-  uint32_t devFuncId:30;
+  uint32_t isCollnet:1, isNvls:1, isSymLast:1;
+  uint32_t devFuncId:29;
```
Knocking this down to 29 bits might not work; see device.h for the bit packing formula.
```cpp
#if defined(HIP_HOST_UNCACHED_MEMORY)
  CUDACHECKGOTO(cudaHostRegister((void*)hptr, realShmSize, cudaHostRegisterPortable | cudaHostRegisterMapped | hipExtHostRegisterUncached), ret, fail);
#else
  CUDACHECKGOTO(cudaThreadExchangeStreamCaptureMode(&mode), ret, fail);
```
Delete this; it is already in L118.
```cpp
comm->latencies[coll][a][p] = comm->tunerConstants.baseLatencies[a][p];
float intraLat = comm->tunerConstants.hwLatencies[intraHw[a]][a][p];
// With ppn=1 latencies are fully exposed, use the Tree network latency
float interLat = ppn == 1 ? comm->tunerConstants.hwLatencies[NCCL_HW_NET][NCCL_ALGO_TREE][p] : comm->tunerConstants.hwLatencies[NCCL_HW_NET][a][p];
```
Should switch back to rcclTuningModel.
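Following the pattern already accepted in the earlier hwLat hunk, the revert might look like this (a sketch; it assumes intraHw indexes the same tables and leaves the base-latency source untouched):

```cpp
// Hypothetical revert to the RCCL tuning model tables:
float intraLat = rcclTuningModel[comm->topo->tuning].hwLat[intraHw[a]][a][p];
// With ppn=1 latencies are fully exposed, use the Tree network latency
float interLat = ppn == 1
    ? rcclTuningModel[comm->topo->tuning].hwLat[NCCL_HW_NET][NCCL_ALGO_TREE][p]
    : rcclTuningModel[comm->topo->tuning].hwLat[NCCL_HW_NET][a][p];
```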
Details
Work item: [SYNC] 2.28.3
What were the changes?
NCCL 2.28.3-1
Device API (Experimental)
Symmetric memory improvements
New Host APIs
CE (Copy Engine) Collectives
NCCL Inspector Plugin
CMake support (Experimental)
Decreased max CTA count from 32 to 16 on Blackwell
Plugins
New Parameterization and ncclConfig changes:
Other Improvements: