[RFE]: NCCL RAS: on mismatch, print the list of stragglers #1911

@xman1979

Description

Please provide the below details to ensure we understand your needs

For example, in the following output, ranks 12 and 13 are not stragglers, so we don't care about them. It would be more useful to print the process information for the 30 ranks that have launched only up to operation 640.

NCCL version 2.26.2 compiled with CUDA 12.6
CUDA runtime version 12060, driver version 12060

Job summary
===========

  Nodes  Processes         GPUs  Processes     GPUs
(total)   per node  per process    (total)  (total)
      4          8            1         32       32

Communicators... (0.00s)
=============

Group     Comms     Nodes     Ranks     Ranks     Ranks    Status  Errors
    #  in group  per comm  per node  per comm  in group
    0         1         4         8        32        32   RUNNING  MISMATCH

Errors
======

Warnings
========

#0-0 (fa93daea854b060b) MISMATCH
  Communicator ranks have different AllGather operation counts
  30 ranks have launched up to operation 640
  2 ranks have launched up to operation 641
  Rank 12 -- GPU 4 managed by process 323032 on node 10.137.190.124
  Rank 13 -- GPU 5 managed by process 323033 on node 10.137.190.124
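
To make the request concrete, here is a minimal sketch of the selection logic I have in mind. It is not NCCL code: `find_stragglers` and the `op_counts` mapping are hypothetical names, and it assumes a straggler is any rank whose operation count is below the highest count seen in the communicator.

```python
from typing import Dict, List


def find_stragglers(op_counts: Dict[int, int]) -> List[int]:
    """Return the ranks that are behind the furthest-ahead rank.

    op_counts maps rank -> number of operations launched (hypothetical input,
    not an NCCL data structure).
    """
    if not op_counts:
        return []
    furthest = max(op_counts.values())
    # Assumption: a rank is a straggler if it has launched fewer
    # operations than the rank(s) that are furthest ahead.
    return sorted(rank for rank, count in op_counts.items() if count < furthest)


# Reproduce the situation in the output above: 32 ranks,
# ranks 12 and 13 at operation 641, everyone else at 640.
op_counts = {rank: 640 for rank in range(32)}
op_counts[12] = 641
op_counts[13] = 641

stragglers = find_stragglers(op_counts)
print(f"{len(stragglers)} stragglers, first few: {stragglers[:5]}")
```

With this rule, the report would list the 30 ranks at operation 640 (with their GPU, process, and node), and ranks 12 and 13 would be left out.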
