[SYNC] 2.28.3 #2040
Conversation
Device API (Experimental)
* Introduces device-side APIs to integrate NCCL communication directly into application kernels.
* Supports LSA (Load/Store Access) for CUDA P2P communication over NVLink and some PCIe platforms.
* Supports Multimem for hardware multicast using NVLink SHARP.
* Adds the initial framework for GIN (GPU-Initiated Networking), currently under development.
* Introduces device communicators created using ncclDevCommCreate.
* Enables device-side communication operations with synchronization (ncclLsaBarrierSession) and memory accessors (ncclGetLsaPointer, ncclGetLsaMultimemPointer).
* Experimental APIs: signatures and functionality may evolve in future releases.
* No ABI compatibility is guaranteed; applications must be recompiled with each new NCCL release.

Symmetric memory improvements
* Support aggregating symmetric operations using the ncclGroupStart/End APIs.
* Reimplement symmetric kernels using the device API.

New Host APIs
* Introduce new host collective APIs: ncclAlltoAll, ncclScatter, ncclGather.

CE (Copy Engine) Collectives
* Reduce SM utilization for alltoall, scatter, gather, and allgather within a single (MN)NVL domain.
* Free up SM capacity for the application to run computation at the same time.
* To enable the feature for ncclAllGather, ncclAlltoAll, ncclGather, and ncclScatter, register buffers into symmetric windows and use the NCCL_CTA_POLICY_ZERO flag in the communicator config_t (see the sketch after these release notes).

NCCL Inspector Plugin
* Introduces an Inspector plugin for always-on performance monitoring.
* Produces structured JSON output with metadata, execution time, bandwidth, and optional event traces for each NCCL operation.
* Enables integration with analysis tools such as Performance Exporter to visualize NCCL performance bottlenecks.
* Lightweight to enable via the NCCL_PROFILER_PLUGIN and NCCL_INSPECTOR_ENABLE environment variables.

CMake support (Experimental)
* Adds a CMake build system as an alternative to the existing Makefiles.
* Known issues: pkg.build and the Device API currently do not work with CMake. These will be addressed in a future release.

Decreased max CTA count from 32 to 16 on Blackwell
* SM overhead is decreased by 50% with this change.
* This may cause some performance drop on Blackwell because of the reduced SM usage.
* If the extra SM capacity is not desired, two options restore the previous behavior: 1) set the NCCL_MIN_CTAS=32 and NCCL_MAX_CTAS=32 environment variables; 2) set the communicator config to overwrite the max CTA count to 32.
* Based on community feedback, future versions may consider different trade-offs between performance and SM overhead.

Plugins
* Network
  * App-aware network plugin: NCCL passes information about the communication operations to be executed on a network endpoint, allowing better tuning of endpoints and of their use in the plugins.
  * Improve handling of physical and virtual network devices and their load/unload.
  * Network plugin version 11: add explicit context and communication ID support for per-communicator init/finalize.
  * Add the Multi-Request Net API, which lets NCCL anticipate multiple send/recv requests and optimize for them. See the maxMultiRequestSize field in ncclNetProperties_v11_t.
* Profiler
  * Add support for API events (group, collective, and p2p) and for tracking kernel launches in the profiler plugin.
  * Add the Inspector profiler plugin (see section above).
  * Add a hook to Google's CoMMA profiler on GitHub.
* Tuner
  * Expose NCCL tuning constants at tuner initialization via ncclTunerConstants_v5_t.
  * Add the NVL Domain Information API.
  * Support multiple plugin types from a single shared object.

New Parameterization and ncclConfig changes:
* Add the new option NCCL_MNNVL_CLIQUE_ID=-2, which uses the rack serial number to partition the MNNVL clique. This limits NVLink domains to GPUs within a single rack.
* Add NCCL_NETDEVS_POLICY to control how NET devices are assigned to GPUs. The default (AUTO) is the policy used in previous versions.
* Add the NCCL_SINGLE_PROC_MEM_REG_ENABLE control variable to enable NVLS UB registration in the "one process, multiple ranks" case as an opt-in.
* Move nChannelsPerNetPeer into ncclConfig. NCCL_NCHANNELS_PER_NET_PEER can override the value in ncclConfig.
* Enable PxN over C2C by default.
  * PxN over C2C improves performance on Grace-Blackwell platforms by allowing NCCL to leverage the NIC attached to a peer GPU over NVLink, C2C, and PCIe.
  * This behavior can be overridden by setting NCCL_PXN_C2C=0.

Other Improvements:
* Allow FP8 support for non-reductive operations on pre-sm90 devices. (See pytorch/pytorch#151594 (comment).)
* Fix NVLS+CollNet and temporarily disable COLLNET_CHAIN for >8 GPUs.
* Only consider running interfaces for socket traffic: NCCL will not attempt to use interfaces that do not have the IFF_RUNNING bit set. (NVIDIA/nccl#1798)
* Modernize mutex management: convert to std::mutex and std::lock_guard.
* Remove the long-deprecated sm35 and sm50 GENCODE targets, which were causing issues with the latest NCCL release builds.
* Improve NVLS/NVLSTree tuning prediction for better algorithm and protocol selection.
* NVLSTree tuning fixes: update tuning data for H100 and GB200-NVL72.
* Respond better to RoCE link flaps: instead of reporting an "unknown event", NCCL now reports "GID table changed".
* Move libvirt bridge interfaces to the end of the interface list so that they are considered last. These are usually virtual bridges relaying traffic to containers on the host; they cannot carry traffic to a remote node and are therefore unsuitable.
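For the CE-collectives opt-in and the new nChannelsPerNetPeer config field described above, a host-side sketch might look like the following. This is illustrative only: the exact ncclConfig_t field names (CTAPolicy, nChannelsPerNetPeer) and the window-registration call are assumptions pieced together from these notes and the existing symmetric-window API, not verified signatures.

```c
// Sketch: opt in to copy-engine (CE) collectives on a symmetric window.
// nranks, myrank, and id are assumed to come from the usual bootstrap;
// NCCLCHECK is the usual error-checking macro.
ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
config.CTAPolicy = NCCL_CTA_POLICY_ZERO;   // flag named in the release notes (field name assumed)
config.nChannelsPerNetPeer = 2;            // now a config field; NCCL_NCHANNELS_PER_NET_PEER overrides it

ncclComm_t comm;
NCCLCHECK(ncclCommInitRankConfig(&comm, nranks, id, myrank, &config));

// Register buffers into a symmetric window so that the CE versions of
// allgather/alltoall/gather/scatter can run without consuming SMs.
void* buf;
ncclWindow_t win;
NCCLCHECK(ncclMemAlloc(&buf, bytes));
NCCLCHECK(ncclCommWindowRegister(comm, buf, bytes, &win, NCCL_WIN_COLL_SYMMETRIC));
```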
```diff
-    rankOffset >= 744 * 1024 && rankAlign != 4 && rcclParamAllToAllPivotEnable()) {
-      struct ncclInfo info = { ncclFuncAllToAllPivot, "AllToAllPivot",
+    rankOffset >= 744 * 1024 && rankAlign != 4 /*&& rcclParamAllToAllPivotEnable()*/) {
+      struct ncclInfo info = { ncclFuncAlltoAll, "AlltoAll",
```
I do not see any send-recv operations being posted here in NCCL; they are posted in enqueue. Any reason we are not doing the same?
```cpp
// gfx name from hipDeviceProp_t [RCCL]
char* archName;
// multiProcessorCount from hipDeviceProp_t [RCCL]
int cuCount;
```
archName and cuCount were removed; I assume this was intentional?
What is changing here? Is it just the newline style?
There are some NCCL changes missing in here.
```cpp
/*! @brief All-To-All
    @details Device (i) sends the (j)th block of data to device (j), where it is
    placed as the (i)th block. Each block sent or received has *count* elements,
    so *recvbuff* and *sendbuff* must each hold nranks*count elements.
    In-place operation is NOT supported; it is the user's responsibility
    to ensure that sendbuff and recvbuff are distinct.
    @return Result code. See @ref rccl_result_code for more details.

    @param[in]  sendbuff Data array to send (contains blocks for each other rank)
    @param[out] recvbuff Data array to receive (contains blocks from each other rank)
    @param[in]  count    Number of elements to send between each pair of ranks
    @param[in]  datatype Data buffer element datatype
    @param[in]  comm     Communicator group object to execute on
    @param[in]  stream   HIP stream to execute collective on */
ncclResult_t ncclAllToAll(const void* sendbuff, void* recvbuff, size_t count,
    ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream);
/*! @cond include_hidden */
```
We should maintain this documentation style in the declaration added above, but the content should match the new comment.
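For context, a host-side call matching the declaration above might look like this (a sketch; nranks, comm, and stream are assumed to be set up through the usual RCCL/HIP flow, and HIPCHECK/NCCLCHECK are the usual error-checking macros):

```cpp
// Each rank exchanges `count` elements with every other rank, so both
// buffers must hold nranks * count elements; they must also be distinct,
// since in-place operation is not supported.
size_t count = 1024;
float *sendbuff, *recvbuff;
HIPCHECK(hipMalloc(&sendbuff, nranks * count * sizeof(float)));
HIPCHECK(hipMalloc(&recvbuff, nranks * count * sizeof(float)));
NCCLCHECK(ncclAllToAll(sendbuff, recvbuff, count, ncclFloat, comm, stream));
HIPCHECK(hipStreamSynchronize(stream));
```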
```cpp
 *
 * In-place operations will happen if recvbuff == sendbuff + root * count.
 */
ncclResult_t ncclScatter(const void* sendbuff, void* recvbuff, size_t count,
```
The same applies to Gather, Scatter, and Send regarding documentation style.
```cpp
  }
}

NCCL_PARAM(PxnC2c, "PXN_C2C", 0);
```
Default for NCCL_PXN_C2C was changed to 1 in NCCL, was this left out intentionally?
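If the upstream default is adopted, the fix would presumably be limited to the default argument (sketch):

```cpp
// Match upstream NCCL 2.28, which enables PxN over C2C by default
// and lets users opt out with NCCL_PXN_C2C=0.
NCCL_PARAM(PxnC2c, "PXN_C2C", 1);
```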
```cpp
// LL128 max BW per channel
static const double llMaxBws[][3] = {
  /* Volta-N1/Intel-N2/Intel-N4) */ {39.0, 39.0, 20.4},
  /* Ampere-N1/AMD-N2/AMD-N4) */ {87.7, 22.5 /*avg of ring & tree*/, 19.0},
  /* Hopper-N1/AMD-N2/AMD-N4) */ {141.0, 45.0 /*avg of ring & tree*/, 35.0},
  /* Blackwell-N1/AMD-N2/AMD-N4) */ {2*141.0, 2*45.0 /*avg of ring & tree*/, 2*35.0},
};

static const double perChMaxRingLL128Bws[][3] = {
  /* Volta (N1/N2/N4) */ {20.0, 20.0, 20.0},
  /* Ampere (N1/N2/N4) */ {20.0, 20.0, 20.0},
  /* Hopper (N1/N2/N4) */ {36.7, 36.7, 36.7},
  /* Blackwell (N1/N2/N4) */ {2*36.7, 2*36.7, 2*36.7},
};
static const double perChMaxTreeLL128Bws[][3] = {
  /* Volta (N1/N2/N4) */ {20.0, 20.0, 20.0},
  /* Ampere (N1/N2/N4) */ {20.0, 20.0, 20.0},
  /* Hopper (N1/N2/N4) */ {36.7, 36.7, 29.0},
  /* Blackwell (N1/N2/N4) */ {2*36.7, 2*36.7, 2*29.0},
};
static const double perChMaxTreeBws[][3] = {
  /* Volta (N1/N2/N4) */ {26.5, 18.5, 10.0},
  /* Ampere (N1/N2/N4) */ {24.0, 23.6, 17.8},
  /* Hopper (N1/N2/N4) */ {38.7, 41.4, 36.0},
  /* Blackwell (N1/N2/N4) */ {2*38.7, 2*41.4, 2*36.0},
};
```
I believe all this code can be deleted now.
```cpp
/* Array indexes used below */
#define VOLTA_COMPCAP_IDX 0
#define AMPERE_COMPCAP_IDX 1
#define HOPPER_COMPCAP_IDX 2
#define BLACKWELL_COMPCAP_IDX 3
```
Can be removed now.
```cpp
// Latencies in us, Bandwidths in GB/s
// Tree { LL, LL128, Simple } , Ring { LL, LL128, Simple }
static const float baseLat [NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS] = {
  { 12.0, 12.0, 17.0 }, { 12.0, 12.0, 17.0 }, // Tree, Ring
  { 12.0, 12.0, 17.0 }, { 12.0, 12.0, 17.0 }, // Collnet Direct, Chain
  { 0, 0, 0 }, { 0, 0, 0 }};                  // NVLS, NVLS Tree

// NVLink, PCI, Network
#define NCCL_HW_NVLINK 0
#define NCCL_HW_PCI 1
#define NCCL_HW_NET 2
```
Code can be removed.
```cpp
struct tuningModel {
  float hwLat [3][NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
  float bwRatio [2][NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
  float treeCorrectionFactor[NCCL_NUM_PROTOCOLS][27];
  float ringCorrectionFactor[NCCL_NUM_PROTOCOLS][27];
  uint64_t llProtoRanges[RCCL_TUNABLE_COLLS][NCCL_NUM_PROTOCOLS - 1][RCCL_PROTOCOL_ENTRY_SIZE];
  uint64_t channelThresholds[RCCL_TUNABLE_COLLS][RCCL_CHANNELS_TUNABLE_ENTRIES][3]; // per collective, one {min,max,nchannels} entry per channel count: 2,4,8,16,32,40,48,56,64
};
```
NCCL has changed its tuning model code, we should see if anything needs to happen here. Might want to check with others. @nusislam @mustafabar @PedramAlizadeh @isaki001 please see NVIDIA/nccl@f130899#diff-8029994d1ac745bdd5aa9db869bc52ccd10ad396e54b88acbb4d52e4cb343085
```cpp
int nRanks = comm->nRanks;
if (nRanks <= 1) return ncclSuccess;
#if !defined(__HIP_PLATFORM_AMD__) && !defined(__HIPCC__)
int compCapIndex = minCompCap >= 100 ? BLACKWELL_COMPCAP_IDX : (minCompCap >= 90 ? HOPPER_COMPCAP_IDX : minCompCap >= 80 ? AMPERE_COMPCAP_IDX : VOLTA_COMPCAP_IDX);
```
Missed
```cpp
//INFO(NCCL_INIT, "algo %s proto %s busBw %f baseBw %f bw %f nChannels %d bwIntra %f bwInter %f", ncclAlgoStr[a], ncclProtoStr[p], busBw, comm->topo->baseBw, bw, graphs[a]->nChannels, graphs[a]->bwIntra, graphs[a]->bwInter);

if (a == NCCL_ALGO_NVLS) {
#if !defined(__HIP_PLATFORM_AMD__) && !defined(__HIPCC__)
```
It makes sense, but I'd just like to double-check: why is this disabled on AMD?
```cpp
float lat = rcclTuningModel[comm->topo->tuning].hwLat[hw[a]][a][p];
//float lat = comm->tunerConstants.hwLatencies[hw[a]][a][p];
```
Yes, this is good because it will show up better in future merges.
```cpp
NCCL_API(ncclResult_t, ncclMemAlloc, void **ptr, size_t size);
ncclResult_t ncclMemAlloc_impl(void **ptr, size_t size) {
  NVTX3_FUNC_RANGE_IN(nccl_domain);
```
Missing change
```cpp
NCCL_API(ncclResult_t, ncclMemFree, void *ptr);
ncclResult_t ncclMemFree_impl(void *ptr) {
  NVTX3_FUNC_RANGE_IN(nccl_domain);
```
Missing change
```cpp
#define RCCL_API_TRACE_VERSION_MAJOR 0

// should be increased every time new members are added to existing dispatch tables
#define RCCL_API_TRACE_VERSION_PATCH 2
```
With this change we may need to bump the major version.
Alternatively, we could keep the old API all-to-all function and just make it call into the new all-to-all function. Then we only need to bump the patch version.
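Sketched out, the compatibility shim suggested above could be as thin as the following (hypothetical; it assumes the new host API keeps the same argument list as the old RCCL entry point):

```cpp
// Hypothetical shim: keep the old RCCL entry point and forward to the new
// NCCL host API, so only RCCL_API_TRACE_VERSION_PATCH needs to be bumped.
ncclResult_t ncclAllToAll(const void* sendbuff, void* recvbuff, size_t count,
                          ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream) {
  return ncclAlltoAll(sendbuff, recvbuff, count, datatype, comm, stream);
}
```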
Can you squash all non-NCCL commits into one? It should have all original NCCL commits + 1 merge commit, similar to what we have done in all earlier merges, i.e. #1928

This is the intention once fixes are complete. Just need to ensure no fixes are lost.
Force-pushed d464291 to b7b318b.
```diff
 #ifdef GENERATE_SYM_KERNELS
   for (int sym=0; sym <= 1; sym++) {
-    int kcount = sym==0 ? KernelCount : ncclSymKernelCount;
+    int kcount = sym==0 ? ncclDevKernelCount : ncclSymkKernelCount;
```
Suggested change:
```diff
-    int kcount = sym==0 ? ncclDevKernelCount : ncclSymkKernelCount;
+    int kcount = sym==0 ? KernelCount : ncclSymkKernelCount;
```
RCCL seems to have purposely maintained the use of KernelCount. @BertanDogancay do you know why?
We should keep KernelCount since rccl has its own kernel list defined at the beginning of this file.
```diff
-  uint32_t isCollnet:1, isNvls:1;
-  uint32_t devFuncId:30;
+  uint32_t isCollnet:1, isNvls:1, isSymLast:1;
+  uint32_t devFuncId:29;
```
Knocking this down to 29 bits might not work; see device.h for the bit packing formula.
```cpp
#if defined(HIP_HOST_UNCACHED_MEMORY)
  CUDACHECKGOTO(cudaHostRegister((void*)hptr, realShmSize, cudaHostRegisterPortable | cudaHostRegisterMapped | hipExtHostRegisterUncached), ret, fail);
#else
  CUDACHECKGOTO(cudaThreadExchangeStreamCaptureMode(&mode), ret, fail);
```
Delete this; it is already in L118.
```cpp
comm->latencies[coll][a][p] = comm->tunerConstants.baseLatencies[a][p];
float intraLat = comm->tunerConstants.hwLatencies[intraHw[a]][a][p];
// With ppn=1 latencies are fully exposed, use the Tree network latency
float interLat = ppn == 1 ? comm->tunerConstants.hwLatencies[NCCL_HW_NET][NCCL_ALGO_TREE][p] : comm->tunerConstants.hwLatencies[NCCL_HW_NET][a][p];
```
Should switch back to rcclTuningModel.
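Following the pattern already accepted in the earlier hwLat hunk, the revert might look like this (a sketch; it assumes intraHw indexes the same tables and leaves the base-latency source untouched):

```cpp
// Hypothetical revert to the RCCL tuning model tables:
float intraLat = rcclTuningModel[comm->topo->tuning].hwLat[intraHw[a]][a][p];
// With ppn=1 latencies are fully exposed, use the Tree network latency
float interLat = ppn == 1
    ? rcclTuningModel[comm->topo->tuning].hwLat[NCCL_HW_NET][NCCL_ALGO_TREE][p]
    : rcclTuningModel[comm->topo->tuning].hwLat[NCCL_HW_NET][a][p];
```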
Details
Work item: [SYNC] 2.28.3
What were the changes?
NCCL 2.28.3-1
Device API (Experimental)
Symmetric memory improvements
New Host APIs
CE (Copy Engine) Collectives
NCCL Inspector Plugin
CMake support (Experimental)
Decreased max CTA count from 32 to 16 on Blackwell
Plugins
New Parameterization and ncclConfig changes:
Other Improvements: