[Device] WarpSpeed feature enablement #2073
base: develop
This reverts commit d9e8cb4.
if( rcclParamUnrollFactor() != -1 ) {
  comm->unroll = rcclParamUnrollFactor(); //-1 to map to 0 based indexing
  if(comm->unroll < NCCL_UNROLL_1 || comm->unroll >= NCCL_NUM_UNROLLS) {
    WARN("Invalid RCCL_UNROLL_FACTOR %d specified. Valid values are 0 to 2 corresponding to unroll factors of 1, 2, and 4 respectively.", comm->unroll);
This won't work for local GPU build targets, right? There we only have a limited number of unroll factors (depending on the GPU targets).
No, it won't work unless all unrolls are built. It is more of a debugging feature and should not be recommended to users. We need a way to build all unrolls irrespective of the target for experimentation purposes.
It is possible to build with all unrolls if you do not use -l and instead specify the target GPU, e.g. --amdgpu_targets=gfx950.
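The range check discussed in this thread can be sketched in isolation. This is only an illustration based on the warning text (values 0 to 2 map to unroll factors 1, 2, and 4). The enum `UnrollIndexSketch` and the function `unrollFactorFromIndex` are hypothetical names, not RCCL's actual definitions, and the sketch does not model which unrolls were actually built for a given GPU target.

```cpp
// Hypothetical stand-ins for the unroll index constants referenced in the diff.
// The real RCCL definitions may differ; only the 0..2 range comes from the WARN text.
enum UnrollIndexSketch {
  SKETCH_UNROLL_1 = 0,     // unroll factor 1
  SKETCH_UNROLL_2 = 1,     // unroll factor 2
  SKETCH_UNROLL_4 = 2,     // unroll factor 4
  SKETCH_NUM_UNROLLS = 3   // one past the last valid index
};

// Returns the unroll factor (1, 2, or 4) for a valid index,
// or -1 when the index is outside the range [0, SKETCH_NUM_UNROLLS).
int unrollFactorFromIndex(int index) {
  if (index < SKETCH_UNROLL_1 || index >= SKETCH_NUM_UNROLLS) {
    return -1;  // caller would emit a warning and fall back to a default
  }
  return 1 << index;  // 0 -> 1, 1 -> 2, 2 -> 4
}
```

A caller would treat `-1` as "invalid setting" and keep the default unroll, which mirrors the intent of the `WARN` branch above.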
UnitTests failing due to MSCCL.
To reduce risk and make future NCCL syncs easier, can we guard this code with #ifdef, since this is a very significant code change?
I added a fix for MSCCL. I see there are still some failures in AllGather out-of-place. I am running those UTs locally to see
@wenkaidu The code is currently macro chaos due to the NPKIT injections. I thought of this to make it less messy. Will this address your concern? Any issues with this approach, or other ideas?
Since this is a very specific enhancement for gfx950 only, it would be better to use a cmake compilation option so that it can be turned off for the majority of cases. I think ifdef is the best option, as the original code path is preserved, so understanding and merging NCCL syncs will be much easier.
@wenkaidu I have added a compiler-based macro.
Work item: Internal
What were the changes?
This pull request introduces significant changes to enable per-warp channel assignment ("WarpSpeed") in RCCL device kernels. The main goal is to allow each GPU warp to operate on its own communication channel. Most device code is updated to reference the correct per-warp channel and channel ID, and kernel argument handling is generalized.
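To make the idea of per-warp channel assignment concrete, here is a minimal host-side sketch, assuming each block carries a 64-bit mask of enabled channels and that warp w takes the w-th enabled channel. This is an assumption for illustration only: the helper names `nthSetBit` and `selectWarpChannelId` are hypothetical, the communication mode mentioned in the PR is not modeled, and the actual device code in this PR may assign channels differently.

```cpp
#include <cstdint>

// Hypothetical helper: index of the n-th (0-based) set bit in mask,
// or -1 if n is negative or the mask has fewer than n+1 bits set.
int nthSetBit(uint64_t mask, int n) {
  if (n < 0) return -1;
  for (int bit = 0; bit < 64; ++bit) {
    if (mask & (uint64_t{1} << bit)) {
      if (n == 0) return bit;
      --n;
    }
  }
  return -1;
}

// Hypothetical mapping: the warp containing threadIdxInBlock picks the
// channel whose position among the enabled bits equals the warp index.
// warpSize is passed in because it differs between GPU architectures.
int selectWarpChannelId(uint64_t channelMask, int threadIdxInBlock, int warpSize) {
  int warp = threadIdxInBlock / warpSize;
  return nthSetBit(channelMask, warp);
}
```

With a mask that enables channels 2, 3, and 5 and a warp size of 64, threads 0 to 63 would use channel 2, threads 64 to 127 channel 3, and threads 128 to 191 channel 5. A fourth warp would get -1, meaning no channel is available for it under this scheme.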
Why were the changes made?
TransferBench measurements showed that peak bandwidth can be reached on a single node with fewer CUs when each warp within a CU transfers data to a remote GPU over XGMI on MI3xx platforms.
How was the outcome achieved?
WarpSpeed Enablement and Per-Warp Channel Assignment
- […] ncclShmem), including new fields warpChannelId and warpChannel, and logic to assign channels to warps based on a mask and communication mode
- […] ncclDevKernelArgsDefaultStorage), replacing the previous fixed-size type.
Device Collectives and Primitives Refactoring
- […] all_gather.h, all_reduce.h, broadcast.h, reduce.h, reduce_scatter.h) to use per-warp channel and channel ID for ring-based algorithms.
- […] prims_simple.h, prims_ll.h, prims_ll128.h) to reference per-warp channel data for peer notifications, barrier synchronization, and connection setup, supporting both warp and block-level communication.
Kernel Execution Logic
Miscellaneous
- […] nChannels to allow more than 128 channels.
Additional Documentation:
RCCL_WARP_SPEED_ENABLE=1 RCCL_UNROLL_FACTOR=1 RCCL_WARP_SPEED_CU_COUNT=56 RCCL_THREADS_PER_BLOCK=256
Approval Checklist
Do not approve until these items are satisfied.