Skip to content

Conversation

@mustafabar
Copy link
Contributor

@mustafabar mustafabar commented Nov 22, 2025

Work item: Internal

What were the changes?
This pull request introduces significant changes to enable per-warp channel assignment ("WarpSpeed") in RCCL device kernels. The main goal is to allow each GPU warp to operate on its own communication channel. Most device code is updated to reference the correct per-warp channel and channel ID, and kernel argument handling is generalized.

image

Why were the changes made?
It was found from TransferBench that higher/peak BW can be achieved on a single node with lower number of CUs when each warp within a CU transfers data to a remote GPU over XGMI on MI3xx platforms.

How was the outcome achieved?

WarpSpeed Enablement and Per-Warp Channel Assignment

  • Added support for per-warp channel assignment in shared memory (ncclShmem), including new fields warpChannelId and warpChannel, and logic to assign channels to warps based on a mask and communication mode
  • Updated kernel launch signatures and shared memory setup to use a generalized kernel argument storage type (ncclDevKernelArgsDefaultStorage), replacing the previous fixed-size type.

Device Collectives and Primitives Refactoring

  • Update all device collective implementations (all_gather.h, all_reduce.h, broadcast.h, reduce.h, reduce_scatter.h) to use per-warp channel and channel ID for ring-based algorithms.
  • Updated device primitives (prims_simple.h, prims_ll.h, prims_ll128.h) to reference per-warp channel data for peer notifications, barrier synchronization, and connection setup, supporting both warp and block-level communication.

Kernel Execution Logic

  • Modified the main kernel execution logic to compute warp indices, assign channels per warp, and copy channel data to shared memory for each warp. Also added logic to handle both warp-level and block-level communication modes.
  • Updated collective work scheduling to make sure that warp groups are only used when warp communication is enabled.

Miscellaneous

  • Fixed a type issue in the enqueue logic, changing the type of nChannels to allow more than 128 channels.

Additional Documentation:

  • This feature has potential for e2e workloads: Notably higher relative peak BW when smaller grid is used (e.g. 16 x 256/512).
  • Higher multimode performance up to 32 x 256 grid dimensions
  • Best performance requires medium size message algo/protocol/CU usage optimization for small to medium messages
  • May need thresholding since it is not a latency-optimal approach (e.g. implementation uses more channel and synchronization)
  • Currently provided as experimental feature for further testing with e2e requiring overlap with compute and lower CU usage from comms
  • To experiment peak performance impact on a single MI350 node use RCCL_WARP_SPEED_ENABLE=1 RCCL_UNROLL_FACTOR=1 RCCL_WARP_SPEED_CU_COUNT=56 RCCL_THREADS_PER_BLOCK=256

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

@mustafabar mustafabar requested a review from a team November 22, 2025 01:12
if( rcclParamUnrollFactor() != -1 ) {
comm->unroll = rcclParamUnrollFactor(); //-1 to map to 0 based indexing
if(comm->unroll < NCCL_UNROLL_1 || comm->unroll >= NCCL_NUM_UNROLLS) {
WARN("Invalid RCCL_UNROLL_FACTOR %d specified. Valid values are 0 to 2 corresponding to unroll factors of 1, 2, and 4 respectively.", comm->unroll);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this won't work for local gpu build targets, right, where we have a limited no. of unroll factors (depending on GPU targets)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it won't work unless all unrolls are built. It is more of a debugging feature and should not be recommended to users. We need to have a way to build all unrolls irrespective of the target for experimentation purposes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is possible to build with all unrolls, if you do not use -l and instead specify the target GPU like --amdgpu_targets=gfx950

@nileshnegi
Copy link
Contributor

UnitTests failing due to MSCCL.

@wenkaidu
Copy link
Contributor

To reduce risk and make future NCCL sync easier, can we guard these code using #ifdef since this is a very significant code change?

@mustafabar
Copy link
Contributor Author

UnitTests failing due to MSCCL.

I added a fix for MSCCL. I see there are still some failures in AllGather out-of-place. I am running those UTs locally to see

@mustafabar
Copy link
Contributor Author

mustafabar commented Nov 25, 2025

To reduce risk and make future NCCL sync easier, can we guard these code using #ifdef since this is a very significant code change?

@wenkaidu The code is currently a Macro chaos due to the NPKIT injections. I thought of this to make it less messy. Will this address your concern? any issues with this approach or ideas ?


​RCCL_INJECT_WS(
{
    // RCCL custom logic
},
{
    // NCCL original logic 
});

​#pragma once



#ifndef RCCL_WARP_SPEED
#define RCCL_WARP_SPEED 1
#endif



#if RCCL_WARP_SPEED



// run RCCL block, ignore NCCL block
#define RCCL_INJECT_WS(block_rccl, block_nccl) \
  do { block_rccl } while(0)



#else



// run NCCL block, ignore RCCL block
#define RCCL_INJECT_WS(block_rccl, block_nccl) \
  do { block_nccl } while(0)



#endif

@wenkaidu
Copy link
Contributor

To reduce risk and make future NCCL sync easier, can we guard these code using #ifdef since this is a very significant code change?

@wenkaidu The code is currently a Macro chaos due to the NPKIT injections. I thought of this to make it less messy. Will this address your concern? any issues with this approach or ideas ?


​RCCL_INJECT_WS(
{
    // RCCL custom logic
},
{
    // NCCL original logic 
});
​#pragma once



#ifndef RCCL_WARP_SPEED
#define RCCL_WARP_SPEED 1
#endif



#if RCCL_WARP_SPEED



// run RCCL block, ignore NCCL block
#define RCCL_INJECT_WS(block_rccl, block_nccl) \
  do { block_rccl } while(0)



#else



// run NCCL block, ignore RCCL block
#define RCCL_INJECT_WS(block_rccl, block_nccl) \
  do { block_nccl } while(0)



#endif

Since this is a very specific enhancement for gfx950 only, it would be better to use cmake compilation option to have the capability to turn it off for majority of cases. I think ifdef is the best option as original code path is preserved, thus understanding/merging NCCL sync will be much easier.

@mustafabar
Copy link
Contributor Author

@wenkaidu I have added compiler based Macro ENABLE_WARP_SPEED that guards the changes and is only enabled for MI3xx by default (there's also opt out option with --disable-warp-speed)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants