
Conversation

Contributor

@mberenjk commented Nov 7, 2025

Details

Do not mention proprietary info or link to internal work items in this PR.

Work item: [SYNC] 2.28.3

What were the changes?
NCCL 2.28.3-1
Device API (Experimental)

  • Introduces device-side APIs to integrate NCCL communication directly into application kernels.
  • Supports LSA (Load/Store Access) for CUDA P2P communication over NVLink and some PCIe platforms.
  • Supports Multimem for hardware multicast using NVLink SHARP.
  • Adds initial framework for GIN (GPU-Initiated Networking), currently under development.
  • Introduces device communicators created using ncclDevCommCreate.
  • Enables device-side communication operations with synchronization (ncclLsaBarrierSession) and memory accessors (ncclGetLsaPointer, ncclGetLsaMultimemPointer).
  • Experimental APIs - signatures and functionality may evolve in future releases.
  • No ABI compatibility is guaranteed — applications must be recompiled with each new NCCL release.
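
A rough host-side sketch of the flow above. Hedged heavily: the ncclDevCommRequirements type, its zero-initialization, and the ncclDevCommCreate argument order are assumptions inferred from these notes rather than verified prototypes; consult the device API headers shipped with 2.28.3.

/* Hypothetical sketch: derive a device communicator from an existing host communicator. */
ncclDevCommRequirements reqs = {};     /* assumed: zero-initialized defaults */
ncclDevComm devComm;
NCCLCHECK(ncclDevCommCreate(comm, &reqs, &devComm));
/* devComm is then passed by value into application CUDA kernels, where the
 * device-side primitives (ncclLsaBarrierSession for synchronization,
 * ncclGetLsaPointer / ncclGetLsaMultimemPointer for peer memory access) use it. */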

Symmetric memory improvements

  • Support for aggregating symmetric operations using ncclGroupStart/End APIs.
  • Reimplement symmetric kernels using device API.
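
For illustration, a minimal sketch of aggregating two symmetric-memory collectives into one group; it assumes comm and stream already exist and that the buffers were allocated with ncclMemAlloc and registered into symmetric windows beforehand.

NCCLCHECK(ncclGroupStart());
NCCLCHECK(ncclAllReduce(sendA, recvA, countA, ncclFloat, ncclSum, comm, stream));
NCCLCHECK(ncclAllReduce(sendB, recvB, countB, ncclFloat, ncclSum, comm, stream));
NCCLCHECK(ncclGroupEnd());   /* both symmetric operations are aggregated into one launch */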

New Host APIs

  • Introduce new host collective APIs: ncclAlltoAll, ncclScatter, ncclGather.
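
A hedged usage sketch that follows the established NCCL collective conventions (per-pair element count, explicit root for the rooted collectives); the exact parameter names and positions are assumptions, so treat nccl.h in the release as authoritative.

NCCLCHECK(ncclAlltoAll(sendbuff, recvbuff, count, ncclFloat, comm, stream));              /* each rank exchanges count elements with every other rank */
NCCLCHECK(ncclScatter(sendbuff, recvbuff, count, ncclFloat, 0 /*root*/, comm, stream));   /* root sends count elements to each rank */
NCCLCHECK(ncclGather(sendbuff, recvbuff, count, ncclFloat, 0 /*root*/, comm, stream));    /* each rank sends count elements to root */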

CE (Copy Engine) Collectives

  • Reduce SM utilization for alltoall, scatter, gather, and allgather within a single (MN)NVL domain.
  • Free up SM capacity for the application to do computation at the same time.
  • To enable the feature for ncclAllGather, ncclAlltoAll, ncclGather, and ncclScatter, register buffers into symmetric windows and set the NCCL_CTA_POLICY_ZERO flag in the communicator's ncclConfig_t.
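
A sketch of opting in, assuming the usual id/rank/nranks variables; the CTAPolicy field name is inferred from the note above and should be checked against the ncclConfig_t definition in 2.28.3, while the window-registration call follows the symmetric-memory API from earlier releases.

ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
config.CTAPolicy = NCCL_CTA_POLICY_ZERO;          /* assumed field name; allows SM-free execution */
NCCLCHECK(ncclCommInitRankConfig(&comm, nranks, id, rank, &config));

void* buf;
ncclWindow_t win;
NCCLCHECK(ncclMemAlloc(&buf, bytes));
NCCLCHECK(ncclCommWindowRegister(comm, buf, bytes, &win, NCCL_WIN_COLL_SYMMETRIC));
/* ncclAllGather/ncclAlltoAll/ncclGather/ncclScatter on buf can now be offloaded to copy engines */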

NCCL Inspector Plugin

  • Introduces an Inspector plugin for always-on performance monitoring.
  • Produces structured JSON output with metadata, execution time, bandwidth, and optional event traces for each NCCL operation.
  • Enables integration with analysis tools such as Performance Exporter to visualize NCCL performance bottlenecks.
  • Lightweight; enabled via the NCCL_PROFILER_PLUGIN and NCCL_INSPECTOR_ENABLE environment variables.
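
For example, a launcher can enable it programmatically before any communicator is created; the shared-object filename below is only a placeholder, use whatever name your Inspector plugin build produces.

setenv("NCCL_PROFILER_PLUGIN", "libnccl-profiler-inspector.so", 1);   /* placeholder filename */
setenv("NCCL_INSPECTOR_ENABLE", "1", 1);
/* subsequent NCCL operations emit JSON records with timing and bandwidth data */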

CMake support (Experimental)

  • Adds a CMake build system as an alternative to existing Makefiles.
  • Known issues: pkg.build and Device API currently do not work with CMake.
  • The known issues will be addressed in a future release.

Decreased max CTA count from 32 to 16 on Blackwell

  • SM overhead is decreased by 50% with this improvement.
  • This may cause some performance drop on Blackwell because of the reduced SM usage.
  • If the extra SM capacity is not desired, two options are available to restore the previous behavior: 1) set the NCCL_MIN_CTAS=32 and NCCL_MAX_CTAS=32 environment variables; 2) set the communicator config to override the max CTA count to 32 (see the sketch after this list).
  • Based on community feedback, future versions may consider different trade-offs between performance and SM overhead.
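
A minimal sketch of option 2, using the existing minCTAs/maxCTAs fields of ncclConfig_t and assuming the usual id/rank/nranks variables:

ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
config.minCTAs = 32;          /* restore the pre-2.28 CTA count */
config.maxCTAs = 32;
NCCLCHECK(ncclCommInitRankConfig(&comm, nranks, id, rank, &config));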

Plugins

  • Network
    • App-aware network plugin. NCCL passes information about the communication operations to be executed on a network endpoint, allowing better tuning of endpoints and their use in the plugins.
    • Improve handling of physical and virtual network devices and load/unload.
    • Network plugin version 11 - add explicit context and communication ID support for per communicator init/finalize.
    • Add Multi-Request Net API. Using it helps NCCL anticipate multiple send/recv requests and optimize for them. See the maxMultiRequestSize field in ncclNetProperties_v11_t.
  • Profiler
    • Add support for API events (group, collective, and p2p) and for tracking kernel launches in the profiler plugin.
    • Add Inspector Profiler Plugin (see section above).
    • Add a hook to Google’s CoMMA profiler on github.
  • Tuner
    • Expose NCCL tuning constants at tuner initialization via ncclTunerConstants_v5_t.
    • Add NVL Domain Information API.
  • Support multiple plugin types from a single shared object.

New Parameterization and ncclConfig changes:

  • Add new option NCCL_MNNVL_CLIQUE_ID=-2 which will use rack serial number to partition the MNNVL clique. This will limit NVLink domains to GPUs within a single rack.
  • Add NCCL_NETDEVS_POLICY to control how NET devices are assigned to GPUs. The default (AUTO) is the policy used in previous versions.
  • Add NCCL_SINGLE_PROC_MEM_REG_ENABLE control variable to enable NVLS UB registration in the “one process, multiple ranks” case as an opt-in.
  • Move nChannelsPerNetPeer into ncclConfig. NCCL_NCHANNELS_PER_NET_PEER can override the value in ncclConfig.
  • Enable PxN over C2C by default
    • PxN over C2C will improve performance for Grace-Blackwell platforms by allowing NCCL to leverage the NIC attached to a peer GPU over NVLINK, C2C, and PCIe.
    • This behavior can be overridden by setting NCCL_PXN_C2C=0.

Other Improvements:

  • Allow FP8 support for non-reductive operations on pre sm90 devices. (See [CI][CUDA][Distributed] Move cuda 11.8 distributed pull jobs to cuda 12.6  pytorch/pytorch#151594 (comment))
  • Fix NVLS+CollNet and temporarily disable COLLNET_CHAIN for >8 GPUs.
  • Only consider running interfaces for socket traffic. NCCL will not attempt to use interfaces that do not have the IFF_RUNNING bit. ([RFE]: src/misc/socket.cc: findInterfaces shouldn't return an offline interface NVIDIA/nccl#1798)
  • Modernize mutex management. Convert to std::mutex and std::lock_guard.
  • Remove sm35 and sm50 GENCODE targets which have long been deprecated and were causing issues with the latest NCCL release builds.
  • Improved NVLS/NVLSTree tuning prediction to improve algorithm and protocol selection.
  • NVLSTree Tuning Fixes. Update tuning data for H100, GB200-NV72.
  • Respond better to RoCE link flaps. Instead of reporting an “unknown event” it will now report “GID table changed”.
  • Move libvirt bridge interfaces to the end of the candidate interface list so that they are considered last. These are usually virtual bridges that relay traffic to containers running on the host; they cannot carry traffic to a remote node and are therefore unsuitable.

@mberenjk marked this pull request as ready for review November 11, 2025 02:14
rankOffset >= 744 * 1024 && rankAlign != 4 && rcclParamAllToAllPivotEnable()) {
struct ncclInfo info = { ncclFuncAllToAllPivot, "AllToAllPivot",
rankOffset >= 744 * 1024 && rankAlign != 4 /*&& rcclParamAllToAllPivotEnable()*/) {
struct ncclInfo info = { ncclFuncAlltoAll, "AlltoAll",
Contributor

I do not see any send-recv operations being posted here in nccl, they are posted in enqueue. Any reason we are not doing the same?

@nileshnegi added the ci:regression-detection label (Run through all collectives and data types to identify any performance issues) Nov 12, 2025
@ROCmMathLibrariesBot

regression-detection run on commit b327cf0

Artifacts - Results

@ROCmMathLibrariesBot

regression-detection run on commit d464291

Artifacts - Results

Comment on lines 720 to 723
// gfx name from hipDeviceProp_t [RCCL]
char* archName;
// multiProcessorCount from hipDeviceProp_t [RCCL]
int cuCount;
Contributor

archName and cuCount removed, I assume this was intentional?

Contributor

What is changing here, is it just newline style?

Contributor

There are some NCCL changes missing in here.

Contributor

There are some NCCL changes missing in here.

Comment on lines -776 to -792
/*! @brief All-To-All
@details Device (i) send (j)th block of data to device (j) and be placed as (i)th
block. Each block for sending/receiving has *count* elements, which means
that *recvbuff* and *sendbuff* should have a size of nranks*count elements.
In-place operation is NOT supported. It is the user's responsibility
to ensure that sendbuff and recvbuff are distinct.
@return Result code. See @ref rccl_result_code for more details.

@param[in] sendbuff Data array to send (contains blocks for each other rank)
@param[out] recvbuff Data array to receive (contains blocks from each other rank)
@param[in] count Number of elements to send between each pair of ranks
@param[in] datatype Data buffer element datatype
@param[in] comm Communicator group object to execute on
@param[in] stream HIP stream to execute collective on */
ncclResult_t ncclAllToAll(const void* sendbuff, void* recvbuff, size_t count,
ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream);
/*! @cond include_hidden */
Contributor

We should maintain this documentation style in the declaration added above, but the content should match the new comment.

*
* In-place operations will happen if recvbuff == sendbuff + root * count.
*/
ncclResult_t ncclScatter(const void* sendbuff, void* recvbuff, size_t count,
Contributor

Same for Gather, Scatter, Send on documentation style.

}
}

NCCL_PARAM(PxnC2c, "PXN_C2C", 0);
Contributor

Default for NCCL_PXN_C2C was changed to 1 in NCCL, was this left out intentionally?

Comment on lines 436 to 461
// LL128 max BW per channel
static const double llMaxBws[][3] = {
/* Volta-N1/Intel-N2/Intel-N4) */ {39.0, 39.0, 20.4},
/* Ampere-N1/AMD-N2/AMD-N4) */ {87.7, 22.5 /*avg of ring & tree*/, 19.0},
/* Hopper-N1/AMD-N2/AMD-N4) */ {141.0, 45.0 /*avg of ring & tree*/, 35.0},
/* Blackwell-N1/AMD-N2/AMD-N4) */ {2*141.0, 2*45.0 /*avg of ring & tree*/, 2*35.0},
};

static const double perChMaxRingLL128Bws[][3] = {
/* Volta (N1/N2/N4) */ {20.0, 20.0, 20.0},
/* Ampere (N1/N2/N4) */ {20.0, 20.0, 20.0},
/* Hopper (N1/N2/N4) */ {36.7, 36.7, 36.7},
/* Blackwell (N1/N2/N4) */ {2*36.7, 2*36.7, 2*36.7},
};
static const double perChMaxTreeLL128Bws[][3] = {
/* Volta (N1/N2/N4) */ {20.0, 20.0, 20.0},
/* Ampere (N1/N2/N4) */ {20.0, 20.0, 20.0},
/* Hopper (N1/N2/N4) */ {36.7, 36.7, 29.0},
/* Blackwell (N1/N2/N4) */ {2*36.7, 2*36.7, 2*29.0},
};
static const double perChMaxTreeBws[][3] = {
/* Volta (N1/N2/N4) */ {26.5, 18.5, 10.0},
/* Ampere (N1/N2/N4) */ {24.0, 23.6, 17.8},
/* Hopper (N1/N2/N4) */ {38.7, 41.4, 36.0},
/* Blackwell (N1/N2/N4) */ {2*38.7, 2*41.4, 2*36.0},
};
Contributor

I believe all this code can be deleted now.

Comment on lines 429 to 433
/* Array indexes used below */
#define VOLTA_COMPCAP_IDX 0
#define AMPERE_COMPCAP_IDX 1
#define HOPPER_COMPCAP_IDX 2
#define BLACKWELL_COMPCAP_IDX 3
Contributor

Can be removed now.

Comment on lines 135 to 145
// Latencies in us, Bandwidths in GB/s
// Tree { LL, LL128, Simple } , Ring { LL, LL128, Simple }
static const float baseLat [NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS] = {
{ 12.0, 12.0, 17.0 }, { 12.0, 12.0, 17.0 }, // Tree, Ring
{ 12.0, 12.0, 17.0 }, { 12.0, 12.0, 17.0 }, // Collnet Direct, Chain
{ 0, 0, 0 }, { 0, 0, 0 }}; // NVLS, NVLS Tree

// NVLink, PCI, Network
#define NCCL_HW_NVLINK 0
#define NCCL_HW_PCI 1
#define NCCL_HW_NET 2
Contributor

Code can be removed.

Comment on lines 149 to 156
struct tuningModel {
float hwLat [3][NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float bwRatio [2][NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float treeCorrectionFactor[NCCL_NUM_PROTOCOLS][27];
float ringCorrectionFactor[NCCL_NUM_PROTOCOLS][27];
uint64_t llProtoRanges[RCCL_TUNABLE_COLLS][NCCL_NUM_PROTOCOLS - 1][RCCL_PROTOCOL_ENTRY_SIZE];
uint64_t channelThresholds[RCCL_TUNABLE_COLLS][RCCL_CHANNELS_TUNABLE_ENTRIES][3]; //for each collective, set for 5 channel-counts: 2,4,8,16,32,40,48,56,64, {min,max,nchannels}
};
Contributor

NCCL has changed its tuning model code, we should see if anything needs to happen here. Might want to check with others. @nusislam @mustafabar @PedramAlizadeh @isaki001 please see NVIDIA/nccl@f130899#diff-8029994d1ac745bdd5aa9db869bc52ccd10ad396e54b88acbb4d52e4cb343085

@speriaswamy-amd added and then removed the ci:regression-detection label Nov 13, 2025
int nRanks = comm->nRanks;
if (nRanks <= 1) return ncclSuccess;
#if !defined(__HIP_PLATFORM_AMD__) && !defined(__HIPCC__)
int compCapIndex = minCompCap >= 100 ? BLACKWELL_COMPCAP_IDX : (minCompCap >= 90 ? HOPPER_COMPCAP_IDX : minCompCap >= 80 ? AMPERE_COMPCAP_IDX : VOLTA_COMPCAP_IDX);
Contributor

Missed

//INFO(NCCL_INIT, "algo %s proto %s busBw %f baseBw %f bw %f nChannels %d bwIntra %f bwInter %f", ncclAlgoStr[a], ncclProtoStr[p], busBw, comm->topo->baseBw, bw, graphs[a]->nChannels, graphs[a]->bwIntra, graphs[a]->bwInter);

if (a == NCCL_ALGO_NVLS) {
#if !defined(__HIP_PLATFORM_AMD__) && !defined(__HIPCC__)
Contributor

It makes sense, but I'd just like to double-check: why is this disabled on AMD?

Comment on lines 727 to +728
float lat = rcclTuningModel[comm->topo->tuning].hwLat[hw[a]][a][p];
//float lat = comm->tunerConstants.hwLatencies[hw[a]][a][p];
Contributor

Yes, this is good because it will show up better in future merges.


NCCL_API(ncclResult_t, ncclMemAlloc, void **ptr, size_t size);
ncclResult_t ncclMemAlloc_impl(void **ptr, size_t size) {
NVTX3_FUNC_RANGE_IN(nccl_domain);
Contributor

Missing change


NCCL_API(ncclResult_t, ncclMemFree, void *ptr);
ncclResult_t ncclMemFree_impl(void *ptr) {
NVTX3_FUNC_RANGE_IN(nccl_domain);
Contributor

Missing change

Comment on lines 31 to 34
#define RCCL_API_TRACE_VERSION_MAJOR 0

// should be increased every time new members are added to existing dispatch tables
#define RCCL_API_TRACE_VERSION_PATCH 2
Contributor

With this change we may need to bump the major version.

Alternatively, we could keep the old API all-to-all function and just make it call into the new all-to-all function. Then we only need to bump the patch version.

@wenkaidu
Contributor

Can you squash all non-NCCL commits to one? It should have all original NCCL commits + 1 merge commit, similar to what we have done in all earlier merges, i.e. #1928

@corey-derochie-amd
Contributor

Can you squash all non-NCCL commits to one? It should have all original NCCL commits + 1 merge commit, similar to what we have done in all earlier merges, i.e. #1928

This is the intention once fixes are complete. Just need to ensure no fixes are lost.

@mberenjk force-pushed the mberenjk/nccl-sync-2.28 branch from d464291 to b7b318b on November 17, 2025 18:44
#ifdef GENERATE_SYM_KERNELS
for (int sym=0; sym <= 1; sym++) {
int kcount = sym==0 ? KernelCount : ncclSymKernelCount;
int kcount = sym==0 ? ncclDevKernelCount : ncclSymkKernelCount;
Contributor

Suggested change
int kcount = sym==0 ? ncclDevKernelCount : ncclSymkKernelCount;
int kcount = sym==0 ? KernelCount : ncclSymkKernelCount;

RCCL seems to have purposely maintained the use of KernelCount. @BertanDogancay do you know why?

Contributor

We should keep KernelCount since rccl has its own kernel list defined at the beginning of this file.

uint32_t isCollnet:1, isNvls:1;
uint32_t devFuncId:30;
uint32_t isCollnet:1, isNvls:1, isSymLast:1;
uint32_t devFuncId:29;
Contributor

Knocking this down to 29 bits might not work, see device.h for the bit packing formula.

#if defined(HIP_HOST_UNCACHED_MEMORY)
CUDACHECKGOTO(cudaHostRegister((void*)hptr, realShmSize, cudaHostRegisterPortable | cudaHostRegisterMapped | hipExtHostRegisterUncached), ret, fail);
#else
CUDACHECKGOTO(cudaThreadExchangeStreamCaptureMode(&mode), ret, fail);
Contributor

delete this, it is already in L118.

Comment on lines +718 to +721
comm->latencies[coll][a][p] = comm->tunerConstants.baseLatencies[a][p];
float intraLat = comm->tunerConstants.hwLatencies[intraHw[a]][a][p];
// With ppn=1 latencies are fully exposed, use the Tree network latency
float interLat = ppn == 1 ? comm->tunerConstants.hwLatencies[NCCL_HW_NET][NCCL_ALGO_TREE][p] : comm->tunerConstants.hwLatencies[NCCL_HW_NET][a][p];
Contributor

Should switch back to rcclTuningModel.
