gongweibao commented Mar 5, 2025

For example, this warning:

using GPU  to perform barrier as devices used by this process are currently unknown. 
This can potentially cause a hang if this rank to GPU mapping is incorrect. 
Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id
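For readers hitting the same warning, here is a minimal sketch of the two remedies the message names. It assumes the job is launched with `torchrun`, which sets `LOCAL_RANK`, and that there is one process per GPU. The helper names `local_rank_from_env` and `init_distributed` are hypothetical and not taken from this PR.

```python
import os


def local_rank_from_env(env=None) -> int:
    """Read the per-node rank that torchrun exports as LOCAL_RANK (default 0)."""
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", "0"))


def init_distributed():
    """Bind this process to one GPU before any collective runs (hypothetical helper)."""
    # Imported here so the env helper above stays usable without torch installed.
    import torch
    import torch.distributed as dist

    local_rank = local_rank_from_env()
    device = torch.device("cuda", local_rank)
    torch.cuda.set_device(device)

    # Remedy 1: tell the process group which device this rank owns.
    dist.init_process_group(backend="nccl", device_id=device)

    # Remedy 2: name the device explicitly for the barrier itself.
    dist.barrier(device_ids=[local_rank])
    return device
```

Either remedy alone should be enough to make the rank-to-GPU mapping explicit; the sketch shows both only for illustration.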


ZeroRF commented Jun 11, 2025

It works! I had been confused by this NCCL error for a long time.
