Skip to content

Commit 7efe568

Browse files
author
Yinglin Sun
committed
More logging in socketPollConnect and let some error cases retry
Recently we noticed a hang case during NCCL bootstrap. Some signature in job error log: rank_604.s12gn3064.4031852.log:[0604/1024][2025-02-16 22:31:34,422] [misc/socket.cc:586] [s12gn3064:pid=3942408] NCCL WARN socketPollConnect poll() returned 1, no POLLOUT events rank_607.s12gn3064.4031852.log:[0607/1024][2025-02-16 22:31:34,418] [misc/socket.cc:586] [s12gn3064:pid=3942411] NCCL WARN socketPollConnect poll() returned 1, no POLLOUT events For our case, the issue is caused by the destination node dropping TCP SYN packets, which triggers TCP connection timeout on src side. However, this is general case. When poll returns 1 and there is no POLLOUT event, it could be POLLERR or POLLHUP. For such cases, we would want to go to socketConnectCheck for retry, instead of returning error. Testing: we reproduced the issue. The job ran into this case and retry worked. The job completed successfully. [03/16][2025-02-25 15:46:48,389] [misc/socket.cc:602] [s11gn1268:pid=3987720] NCCL WARN socketPollConnect poll() failed, ret 1, error Connection timed out, revents 8, connect to 172.16.30.229:39953 [03/16][2025-02-25 15:46:48,390] [misc/socket.cc:551] [s11gn1268:pid=3987720] NCCL INFO socketPollConnect: connect returned Connection timed out, retrying (1/34) after sleep for 100 msec
1 parent 80f6bda commit 7efe568

File tree

1 file changed

+20
-7
lines changed

1 file changed

+20
-7
lines changed

src/misc/socket.cc

Lines changed: 20 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -565,8 +565,12 @@ static ncclResult_t socketStartConnect(struct ncclSocket* sock) {
565565

566566
static ncclResult_t socketPollConnect(struct ncclSocket* sock) {
567567
struct pollfd pfd;
568-
int timeout = 1, ret;
569-
socklen_t rlen = sizeof(int);
568+
int timeout = 1, ret, sockErrorCode;
569+
socklen_t errCodeLen = sizeof(sockErrorCode);
570+
char sockstr[SOCKET_NAME_MAXLEN+1];
571+
const char *warnStr = "socketPollConnect poll() failed, ret %d, error %s, revents %hd, connect to %s";
572+
573+
ncclSocketToString(&sock->addr, sockstr);
570574

571575
memset(&pfd, 0, sizeof(struct pollfd));
572576
pfd.fd = sock->fd;
@@ -576,16 +580,25 @@ static ncclResult_t socketPollConnect(struct ncclSocket* sock) {
576580
if (ret == 0 || (ret < 0 && errno == EINTR)) {
577581
return ncclSuccess;
578582
} else if (ret < 0) {
579-
WARN("socketPollConnect poll() failed with error %s", strerror(errno));
583+
WARN(warnStr, ret, strerror(errno), pfd.revents, sockstr);
580584
return ncclRemoteError;
581-
} else if (ret != 1 || (pfd.revents & POLLOUT) == 0) {
582-
WARN("socketPollConnect poll() returned %d%s", ret, (pfd.revents & POLLOUT) ? "" : ", no POLLOUT events");
585+
} else if (ret != 1) {
586+
WARN(warnStr, ret, strerror(errno), pfd.revents, sockstr);
583587
return ncclSystemError;
584588
}
585589

586590
/* check socket status */
587-
SYSCHECK(getsockopt(sock->fd, SOL_SOCKET, SO_ERROR, (void*)&ret, &rlen), "getsockopt");
588-
return socketConnectCheck(sock, ret, __func__);
591+
SYSCHECK(getsockopt(sock->fd, SOL_SOCKET, SO_ERROR, (void*)&sockErrorCode, &errCodeLen), "getsockopt");
592+
if ((pfd.revents & POLLOUT) == 0) {
593+
/*
594+
* When poll returns 1 however no POLLOUT, it could be POLLERR or POLLHUP.
595+
* Those cases could be triggered by transient networking issue, so we would
596+
* want to go to socketConnectCheck for retry, instead of returning error directly.
597+
*/
598+
WARN(warnStr, ret, strerror(sockErrorCode), pfd.revents, sockstr);
599+
}
600+
601+
return socketConnectCheck(sock, sockErrorCode, __func__);
589602
}
590603

591604
ncclResult_t ncclSocketPollConnect(struct ncclSocket* sock) {

0 commit comments

Comments
 (0)