[Device] WarpSpeed feature enablement #2073

mustafabar · 2025-11-22T00:38:22Z

Work item: Internal

What were the changes?
This pull request introduces significant changes to enable per-warp channel assignment ("WarpSpeed") in RCCL device kernels. The main goal is to allow each GPU warp to operate on its own communication channel. Most device code is updated to reference the correct per-warp channel and channel ID, and kernel argument handling is generalized.

Why were the changes made?
TransferBench measurements showed that, on a single node, higher peak BW can be achieved with fewer CUs when each warp within a CU transfers data to a remote GPU over XGMI on MI3xx platforms.

How was the outcome achieved?

WarpSpeed Enablement and Per-Warp Channel Assignment

Added support for per-warp channel assignment in shared memory (ncclShmem), including new fields warpChannelId and warpChannel, and logic to assign channels to warps based on a mask and the communication mode.
Updated kernel launch signatures and shared memory setup to use a generalized kernel argument storage type (ncclDevKernelArgsDefaultStorage), replacing the previous fixed-size type.
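
As a rough illustration of assigning channels to warps from a mask, the sketch below picks the n-th enabled bit of a channel mask for the n-th warp. It is a host-side simplification, not the PR's code: the helper names nthSetBit and warpChannelIdFor are hypothetical, and a single 64-bit mask is assumed even though the PR mentions larger channel counts.

```cpp
#include <cstdint>

// Hypothetical helper: position of the n-th set bit (0-based) in a 64-bit
// channel mask, or -1 if the mask has fewer than n+1 bits set.
int nthSetBit(uint64_t mask, int n) {
  for (int bit = 0; bit < 64; ++bit) {
    if (mask & (uint64_t{1} << bit)) {
      if (n == 0) return bit;
      --n;
    }
  }
  return -1;
}

// Hypothetical per-warp assignment: the warp with global index `globalWarp`
// takes the globalWarp-th enabled channel of the mask.
int warpChannelIdFor(uint64_t channelMask, int globalWarp) {
  return nthSetBit(channelMask, globalWarp);
}
```

With a mask of 0x2D (bits 0, 2, 3, and 5 set), warps 0 through 3 would map to channels 0, 2, 3, and 5, and a fifth warp would get -1 because no channel is left for it.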

Device Collectives and Primitives Refactoring

Updated all device collective implementations (all_gather.h, all_reduce.h, broadcast.h, reduce.h, reduce_scatter.h) to use per-warp channel and channel ID for ring-based algorithms.
Updated device primitives (prims_simple.h, prims_ll.h, prims_ll128.h) to reference per-warp channel data for peer notifications, barrier synchronization, and connection setup, supporting both warp and block-level communication.

Kernel Execution Logic

Modified the main kernel execution logic to compute warp indices, assign channels per warp, and copy channel data to shared memory for each warp. Also added logic to handle both warp-level and block-level communication modes.
Updated collective work scheduling to make sure that warp groups are only used when warp communication is enabled.
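
The warp-level versus block-level distinction can be sketched as a small index computation. This is only an assumed model of the logic described above: the struct CommUnit and the function commUnitFor are hypothetical names, and the real kernel works with device built-ins rather than plain integer parameters.

```cpp
// Hypothetical description of the group of threads that drives one channel.
struct CommUnit {
  int index;      // which channel slot this thread's unit maps to
  int nThreads;   // how many threads cooperate on that channel
  int tidInUnit;  // this thread's rank inside the unit
};

// warpMode == true:  each warp is its own unit (the per-warp path).
// warpMode == false: the whole block is one unit (the block-level path).
CommUnit commUnitFor(int blockIdx, int blockDim, int tid, int warpSize,
                     bool warpMode) {
  if (!warpMode) return {blockIdx, blockDim, tid};
  int warpsPerBlock = blockDim / warpSize;
  int warpInBlock = tid / warpSize;
  return {blockIdx * warpsPerBlock + warpInBlock, warpSize, tid % warpSize};
}
```

For example, thread 130 of block 2 in a 256-thread block with 64-wide warps lands in unit 10 in warp mode (4 warps per block, third warp of the block), while in block mode it stays in unit 2 together with all 256 threads.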

Miscellaneous

Fixed a type issue in the enqueue logic, changing the type of nChannels to allow more than 128 channels.
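
The description does not say which type nChannels had before the fix. As a generic illustration of why a narrow integer type caps the channel count, the hypothetical helper canHold below checks whether a value fits in a given integer type.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: true if `value` is representable in integer type T.
template <typename T>
constexpr bool canHold(long long value) {
  return value >= static_cast<long long>(std::numeric_limits<T>::min()) &&
         value <= static_cast<long long>(std::numeric_limits<T>::max());
}

// A signed 8-bit field stops at 127 and an unsigned one at 255, so neither
// could carry the 256 or 512 channel counts mentioned in the commit list.
static_assert(!canHold<int8_t>(128), "int8_t cannot hold 128");
static_assert(!canHold<uint8_t>(256), "uint8_t cannot hold 256");
static_assert(canHold<int>(512), "int can hold 512");
```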

Additional Documentation:

This feature has potential for e2e workloads: notably higher relative peak BW when a smaller grid is used (e.g. 16 x 256/512).
Higher multimode performance up to 32 x 256 grid dimensions.
Best performance requires algo/protocol/CU usage optimization for small to medium messages.
May need thresholding since it is not a latency-optimal approach (e.g. the implementation uses more channels and synchronization).
Currently provided as an experimental feature for further testing with e2e workloads that require overlap with compute and lower CU usage from comms.
To experiment with the peak performance impact on a single MI350 node, use RCCL_WARP_SPEED_ENABLE=1 RCCL_UNROLL_FACTOR=1 RCCL_WARP_SPEED_CU_COUNT=56 RCCL_THREADS_PER_BLOCK=256

Approval Checklist

Do not approve until these items are satisfied.

Verify the CHANGELOG has been updated, if
- there are any NCCL API version changes,
- any changes impact library users, and/or
- any changes impact any other ROCm library.

nileshnegi · 2025-11-22T01:57:43Z

src/rccl_wrap.cc

+  if( rcclParamUnrollFactor() != -1 ) {
+    comm->unroll = rcclParamUnrollFactor(); //-1 to map to 0 based indexing
+    if(comm->unroll < NCCL_UNROLL_1 || comm->unroll >= NCCL_NUM_UNROLLS) {
+      WARN("Invalid RCCL_UNROLL_FACTOR %d specified. Valid values are 0 to 2 corresponding to unroll factors of 1, 2, and 4 respectively.", comm->unroll);


This won't work for local GPU build targets, right, where we have a limited number of unroll factors (depending on the GPU targets)?

No, it won't work unless all unrolls are built. It is more of a debugging feature and should not be recommended to users. We need a way to build all unrolls irrespective of the target for experimentation purposes.

It is possible to build with all unrolls if you do not use -l and instead specify the target GPU, like --amdgpu_targets=gfx950.
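
For reference, the index-to-factor mapping stated in the warning text (indices 0 to 2 for unroll factors 1, 2, and 4) can be sketched as below. The constants kUnroll1 and kNumUnrolls and the function unrollFactorForIndex are hypothetical stand-ins; the values of NCCL_UNROLL_1 and NCCL_NUM_UNROLLS are assumed from the message, not taken from the headers.

```cpp
// Assumed values, inferred from the warning text in the snippet above.
constexpr int kUnroll1 = 0;     // stand-in for NCCL_UNROLL_1
constexpr int kNumUnrolls = 3;  // stand-in for NCCL_NUM_UNROLLS

// Returns the unroll factor for a valid index (0 -> 1, 1 -> 2, 2 -> 4),
// or -1 when the index is outside the supported range.
int unrollFactorForIndex(int index) {
  if (index < kUnroll1 || index >= kNumUnrolls) return -1;
  return 1 << index;
}
```

This only checks the range. As noted in the thread, a valid index can still fail at run time if the corresponding unroll variant was not built for the target.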

nileshnegi · 2025-11-24T17:43:21Z

UnitTests failing due to MSCCL.

wenkaidu · 2025-11-24T17:54:58Z

To reduce risk and make future NCCL syncs easier, can we guard this code with #ifdef, since this is a very significant code change?

mustafabar · 2025-11-25T13:57:30Z

UnitTests failing due to MSCCL.

I added a fix for MSCCL. I see there are still some failures in AllGather out-of-place. I am running those UTs locally to investigate.

mustafabar · 2025-11-25T15:45:38Z

To reduce risk and make future NCCL syncs easier, can we guard this code with #ifdef, since this is a very significant code change?

@wenkaidu The code is currently macro chaos due to the NPKIT injections. I thought of this to make it less messy. Will this address your concern? Any issues with this approach, or other ideas?


RCCL_INJECT_WS(
{
    // RCCL custom logic
},
{
    // NCCL original logic 
});

#pragma once

#ifndef RCCL_WARP_SPEED
#define RCCL_WARP_SPEED 1
#endif

#if RCCL_WARP_SPEED

// run RCCL block, ignore NCCL block
#define RCCL_INJECT_WS(block_rccl, block_nccl) \
  do { block_rccl } while(0)

#else

// run NCCL block, ignore RCCL block
#define RCCL_INJECT_WS(block_rccl, block_nccl) \
  do { block_nccl } while(0)

#endif
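
A self-contained sketch of how such a macro might be used at a call site is shown below. The function threadsPerChannel is hypothetical and only exists to exercise the macro; the macro definition repeats the proposal above with the default RCCL_WARP_SPEED value of 1.

```cpp
#ifndef RCCL_WARP_SPEED
#define RCCL_WARP_SPEED 1
#endif

#if RCCL_WARP_SPEED
// run RCCL block, ignore NCCL block
#define RCCL_INJECT_WS(block_rccl, block_nccl) do { block_rccl } while (0)
#else
// run NCCL block, ignore RCCL block
#define RCCL_INJECT_WS(block_rccl, block_nccl) do { block_nccl } while (0)
#endif

// Hypothetical call site: start from the block-wide value, then let the
// selected branch override it.
int threadsPerChannel(int blockDim, int warpSize) {
  int n = blockDim;
  RCCL_INJECT_WS(
  { n = warpSize; },
  { /* keep the block-wide value */ });
  return n;
}
```

One limitation of this pattern is that the preprocessor splits macro arguments on any comma not enclosed in parentheses, and braces do not count. A block containing something like `int a = 1, b = 2;` would be seen as extra arguments and fail to compile unless the macro is made variadic or the commas are wrapped in parentheses.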

wenkaidu · 2025-11-25T16:41:03Z

Since this is a very specific enhancement for gfx950 only, it would be better to use a CMake compilation option so that it can be turned off in the majority of cases. I think #ifdef is the best option because the original code path is preserved, which makes understanding and merging NCCL syncs much easier.

mustafabar · 2025-11-25T23:54:58Z

@wenkaidu I have added a compile-time macro, ENABLE_WARP_SPEED, that guards the changes and is enabled by default only for MI3xx (there is also an opt-out option, --disable-warp-speed).
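
A minimal sketch of what a guard of this kind could look like follows. It assumes the build passes -DENABLE_WARP_SPEED for the enabled targets; the function channelSlot and the constant kWarpSpeedCompiledIn are hypothetical, and the PR's actual guarded code is not reproduced here.

```cpp
// True only when the (assumed) -DENABLE_WARP_SPEED definition is present.
constexpr bool kWarpSpeedCompiledIn =
#if defined(ENABLE_WARP_SPEED)
    true;
#else
    false;
#endif

// Hypothetical guarded function: the per-warp path is compiled only when the
// macro is defined; otherwise the original per-block path is kept untouched.
int channelSlot(int blockIdx, int warpsPerBlock, int warpInBlock) {
#if defined(ENABLE_WARP_SPEED)
  return blockIdx * warpsPerBlock + warpInBlock;
#else
  (void)warpsPerBlock;
  (void)warpInBlock;
  return blockIdx;
#endif
}
```

Keeping the original branch verbatim under #else is what makes later upstream merges easier to read, which is the concern raised in the review.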

mustafabar and others added 30 commits November 7, 2025 15:24

Add support for 256 channel count

8568514

Add LL with validation issues

8ef18e4

Fix bug and add concept simple support

bf97215

Add cleanup

3cd8227

Add minor edits

ff1d576

Add slightly improved version

67b95c7

Add a working v1 in drafty phase

e8546e6

Force channels to be multiple of 7

3545ce9

No barrier when nthreads == WARP_SIZE

e6e0b25

Add all_gather warp_level

385729d

Gen all unrolls for mi350

c2f80c3

Fix cases where nChannels is not multiple of 7 for single node

1915071

Enable up to 512 channels

773077b

Enable any thread block size

dc05efc

Add support for LL128

312c75b

Revert MinTrafficPerChannel change

e4087e4

Add threads per block control

b1266f3

Add RS support

6ecb5b7

Generate unroll 3 and add env var

a1f32bb

Fix SendRecv and Tree

7fa1926

Rename and simplify symbols

378d54c

Avoid more than 64 channels for Tree

e15b9e0

Added install.sh flag to suppress warnings.

d9e8cb4

Add feature knobs and refactor changes

99de243

Merge branch 'warp_speed_v1' of github.com:mustafabar/rccl into warp_speed_v1

6394818

Add warpspeed tuning

0cda162

Merge conflicts

64c4549

Fix channel tuning for multinode

3190e86

Reduce Kernel Argsize

a77f6c8

Add broadcast support

a13c782

mustafabar added 16 commits November 21, 2025 17:34

Add Reduce support

a47eddb

Revert "Added install.sh flag to suppress warnings."

f77efa0

This reverts commit d9e8cb4.

Use NCCL_MAX_GROUPS for max Warps per block

78259af

Add clarifying comments on the Warp's channel loading

0c0eaaf

Reuse tidInBlock in init

50d914b

Remove comment

9fefa8b

Edit comments

d717584

Reflect correct type name

4cdf269

Return channel logic for WARP_SIZE < 64

fc53c54

Return UNROLLs to original

9b695e3

Add unroll 4

c87510b

Unroll back to what they were

ad8d80e

Go back to -O1 for debug build

fa7d972

Modify unroll factor treatment

dff8220

Use -O1 for debug

c3539b3

Remove unneeded ringIx

e0ad5f8

mustafabar requested a review from a team November 22, 2025 01:12

nileshnegi reviewed Nov 22, 2025

Fix MSCCL compatibility

9f18a01

mustafabar and others added 3 commits November 25, 2025 23:36

Guard changes by MACRO enabled for MI3xx targets only

6b8cf5c

Better align diffs for RunWorkBatch

e6905e9

Merge branch 'develop' into warp_speed_v1

8b5a1f5

Fix preprocessor directive syntax in all_gather.h

0410c4d
