@nileshnegi
Contributor

(cherry picked from commit 641c0eb)

Details

Work item:
Internal (SWDEV-565260)

What were the changes?

  • Prevent batching when the send and recv byte counts don't match.
  • Prevent batching beyond 32 nodes.
  • Restore bit reversal for the channel-to-part mapping.
  • Correct the computation of the channel-to-part mapping.
  • Disable p2p batching by default.
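
The batching conditions listed above can be read as a single eligibility check. The following is a minimal sketch of that reading, not the code from this PR: `P2pBatchPolicy`, `canBatchP2p`, and their fields are hypothetical names, and treating "beyond 32 nodes" as `nNodes > 32` is an assumption.

```cpp
#include <cstddef>

// Hypothetical policy struct; the names are illustrative and not taken from RCCL.
struct P2pBatchPolicy {
  bool enabled = false;  // batching is off unless explicitly enabled
  int maxNodes = 32;     // assumed reading of "beyond 32 nodes": nNodes > 32
};

// Returns true only when every condition listed in the PR description holds.
inline bool canBatchP2p(const P2pBatchPolicy& policy,
                        std::size_t sendBytes,
                        std::size_t recvBytes,
                        int nNodes) {
  if (!policy.enabled) return false;           // disabled by default
  if (sendBytes != recvBytes) return false;    // send/recv sizes must match
  if (nNodes > policy.maxNodes) return false;  // no batching beyond the node limit
  return true;
}
```

With a default-constructed policy the check always returns false, which mirrors the "disabled by default" item.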

Why were the changes made?
Re-introducing bit reversal for the channel <-> part mapping prevents a performance regression on AINIC.
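
For readers unfamiliar with the technique, here is a minimal sketch of a bit-reversed part-to-channel mapping. It assumes the channel count is a power of two; `reverseLowBits`, `log2Pow2`, and `channelForPart` are hypothetical helpers written for illustration and may differ from the actual RCCL implementation.

```cpp
#include <cstdint>

// Reverse the low `nBits` bits of `x` (hypothetical helper).
inline std::uint32_t reverseLowBits(std::uint32_t x, int nBits) {
  std::uint32_t r = 0;
  for (int i = 0; i < nBits; ++i) {
    r = (r << 1) | (x & 1u);  // shift the lowest bit of x into r
    x >>= 1;
  }
  return r;
}

// Exponent k such that n == 2^k; assumes n is a power of two.
inline int log2Pow2(std::uint32_t n) {
  int k = 0;
  while ((1u << k) < n) ++k;
  return k;
}

// Map a part index to a channel by adding the bit-reversed part index to a
// base channel and wrapping around. Assumes nChannels is a power of two.
inline std::uint32_t channelForPart(std::uint32_t nChannels,
                                    std::uint32_t base,
                                    std::uint32_t part) {
  int bits = log2Pow2(nChannels);
  return (base + reverseLowBits(part, bits)) & (nChannels - 1u);
}
```

With 8 channels and base 0, parts 0, 1, 2, 3 land on channels 0, 4, 2, 6, so consecutive parts are spread across the channel range instead of occupying adjacent channels. Bit reversal is its own inverse, so the same helper recovers the part offset from a channel offset.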

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

* prevent batching when send/recv bytes don't match, restore bit reversal for channel to part mapping, prevent batching beyond 32-nodes

* correct computation for channel to part mapping

* update changelog

* disabling p2p-batching by default

(cherry picked from commit 641c0eb)