Commit 7efe568
Yinglin Sun
More logging in socketPollConnect and let some error cases retry
Recently we noticed a hang during NCCL bootstrap.
Signatures in the job error log:
rank_604.s12gn3064.4031852.log:[0604/1024][2025-02-16 22:31:34,422] [misc/socket.cc:586] [s12gn3064:pid=3942408] NCCL WARN socketPollConnect poll() returned 1, no POLLOUT events
rank_607.s12gn3064.4031852.log:[0607/1024][2025-02-16 22:31:34,418] [misc/socket.cc:586] [s12gn3064:pid=3942411] NCCL WARN socketPollConnect poll() returned 1, no POLLOUT events
In our case, the issue was caused by the destination node dropping TCP SYN packets, which triggers a TCP connection timeout on the source side. However, this is a general case: when poll() returns 1 and there is no POLLOUT event, the event could be POLLERR or POLLHUP. In such cases, we want to go to socketConnectCheck for a retry instead of returning an error.
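The decision described above can be sketched roughly as follows. This is a minimal illustration, not the NCCL code: the names `ConnectStep`, `classifyPollResult`, `pendingSocketError`, and `isRetryableConnectError` are hypothetical, and the set of errno values treated as retryable is an assumption. Only `poll()` event flags and `getsockopt(SO_ERROR)` are standard POSIX.

```cpp
#include <poll.h>
#include <sys/socket.h>
#include <cerrno>

// Hypothetical names for illustration only; this is not the NCCL implementation.
enum class ConnectStep { KeepPolling, CheckSocketError, Fail };

// Classify the result of poll() on one socket with a pending non-blocking
// connect(). `ret` is poll()'s return value, `revents` the returned event mask.
constexpr ConnectStep classifyPollResult(int ret, short revents) {
  return ret < 0  ? ConnectStep::Fail         // poll() itself failed
       : ret == 0 ? ConnectStep::KeepPolling  // timed out, nothing ready yet
       // POLLOUT, POLLERR and POLLHUP all mean the connect attempt has
       // finished one way or another, so inspect the socket error next
       // instead of failing right away.
       : (revents & (POLLOUT | POLLERR | POLLHUP)) ? ConnectStep::CheckSocketError
       : ConnectStep::Fail;                   // e.g. POLLNVAL
}

// Read the pending error of the socket. 0 means the connection succeeded.
inline int pendingSocketError(int fd) {
  int err = 0;
  socklen_t len = sizeof(err);
  if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) != 0) return errno;
  return err;
}

// Assumption: these errors are worth another connect attempt after a sleep.
constexpr bool isRetryableConnectError(int err) {
  return err == ETIMEDOUT || err == ECONNREFUSED || err == EHOSTUNREACH;
}
```

A caller would poll, and on `CheckSocketError` call `pendingSocketError(fd)`: 0 means connected, a retryable error means close the socket and reconnect after a delay, and anything else is reported as a failure. On Linux, `POLLERR` has the value 0x008, which is consistent with the `revents 8` and `Connection timed out` pair in the testing log below.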
Testing:
We reproduced the issue. The job ran into this case, the retry worked, and the job completed successfully.
[03/16][2025-02-25 15:46:48,389] [misc/socket.cc:602] [s11gn1268:pid=3987720] NCCL WARN socketPollConnect poll() failed, ret 1, error Connection timed out, revents 8, connect to 172.16.30.229:39953
[03/16][2025-02-25 15:46:48,390] [misc/socket.cc:551] [s11gn1268:pid=3987720] NCCL INFO socketPollConnect: connect returned Connection timed out, retrying (1/34) after sleep for 100 msec
1 parent 80f6bda commit 7efe568
1 file changed, +20 -7 lines changed