Please provide the below details to ensure we understand your needs
For example, in the following output, ranks 12 and 13 are not stragglers, so we don't care about them. It would be more useful to print the process details for the 30 ranks that have launched only up to operation 640.
NCCL version 2.26.2 compiled with CUDA 12.6
CUDA runtime version 12060, driver version 12060
Job summary
===========
Nodes    Processes  GPUs         Processes  GPUs
(total)  per node   per process  (total)    (total)
4        8          1            32         32
Communicators... (0.00s)
=============
Group  Comms     Nodes     Ranks     Ranks     Ranks     Status   Errors
#      in group  per comm  per node  per comm  in group
0      1         4         8         32        32        RUNNING  MISMATCH
Errors
======
Warnings
========
#0-0 (fa93daea854b060b) MISMATCH
Communicator ranks have different AllGather operation counts
30 ranks have launched up to operation 640
2 ranks have launched up to operation 641
Rank 12 -- GPU 4 managed by process 323032 on node 10.137.190.124
Rank 13 -- GPU 5 managed by process 323033 on node 10.137.190.124
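
To make the request concrete, here is a minimal Python sketch of the selection logic I have in mind. It is not NCCL code and does not reflect how RAS is implemented; the function name `find_lagging_ranks` and the `op_counts` mapping are hypothetical, and the data simply mirrors the counts in the report above. The idea is to list the ranks at the lowest operation count rather than the ranks in the smallest group.

```python
from collections import defaultdict


def find_lagging_ranks(op_counts):
    """Return (lowest_count, ranks) for the ranks that are furthest behind.

    op_counts is a hypothetical mapping of rank -> last launched operation
    count for one collective type (AllGather in the report above).
    """
    if not op_counts:
        return None, []
    groups = defaultdict(list)
    for rank, count in op_counts.items():
        groups[count].append(rank)
    lowest = min(groups)  # the stragglers sit at the smallest count
    return lowest, sorted(groups[lowest])


# Illustrative data only: 32 ranks, with ranks 12 and 13 one operation ahead.
op_counts = {rank: 640 for rank in range(32)}
op_counts[12] = op_counts[13] = 641

lowest, lagging = find_lagging_ranks(op_counts)
print(f"{len(lagging)} ranks have launched up to operation {lowest}")
```

With this selection, the per-rank details (GPU, process, node) would be printed for the ranks returned in `lagging` instead of for ranks 12 and 13.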