Skip to content

Conversation

@gilbertlee-amd
Copy link
Contributor

Details

Implement ability to adjust how P2P channels are mapped to parts

Work item: "Internal", or link to GitHub issue (if applicable).
SWDEV-567810

What were the changes?
Addition of new complementary ncclP2pChannelForPart / ncclP2pChannelToPart functions that take in a shiftSize variable that adjusts how work is assigned

Why were the changes made?
This allows for further investigation to see if balancing channels across XCCs can yield improved performance

How was the outcome achieved?
Modifications to the two key functions listed above, then additions to ncclComm / ncclDevComm to carry this new information.
A new RCCL environment variable to adjust, as well as a function for properly setting the value / establishing default shiftSize.

Additional Documentation:
image

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

int nChannelsLog2 = countOneBits(nP2pChannels-1);
int delta = (channel-base) & (nP2pChannels-1);
return reverseBits(delta, nChannelsLog2);
return (((channel - base) - ((base>>shiftSize)<<shiftSize))>>shiftSize) & (nParts-1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nParts should reflect CHANNELS_PER_NET_PEER. This is usually a smaller number << nChannels, and 1 at large scale, which results in ANDing with 0. I have not debugged this to see what the final value in ncclShmem.comm.p2pnChannelsPerPeer is, but believe it is safer if we AND with nP2PChannels-1 instead? I wonder how this worked when tested by AINIC team.
Note that NCCL completely avoids nParts in its functions https://github.com/NVIDIA/nccl/blob/dbc86fd06e8b0c4517b95d8958a09ccacf9520c9/src/include/device.h#L270

Copy link
Contributor

@mustafabar mustafabar Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thought about it again, I think this will hang if we force channels per net peer for > 2 nodes to a value > 1. It usually ends up auto tuned to 1 for > 2 nodes, so nParts is 1 and part id is 0 therefore it went unnoticed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants