@qingyunlei-tencent

This PR addresses the issue raised in #1808 by introducing a new socket state dedicated to invalid magic values, which prevents the program from entering an infinite loop. It builds on #1834 and extends it to cover a key edge case: an external connection succeeds, but the peer sends no magic value or fewer bytes than the magic field requires. The state transitions and the handling mechanism are shown in the diagram below:
[image: state transition diagram]
Test method:
A shell script runs nccl-tests alltoall_perf in a loop. While it runs, naabu is used to simulate external connections. When one of these connections reaches the port used by the proxy thread, the program now logs the event and keeps running instead of hanging.
image
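
For readers without naabu at hand, the external connection in the test above can be approximated with a small probe. The sketch below is a stand-in under stated assumptions, not naabu and not code from this PR: `probePort` is a hypothetical helper that opens a full TCP connection to a loopback port, optionally sends a few bytes (zero bytes, or fewer than a magic field would need), and returns the socket so the caller decides how long to keep it open.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: connect to 127.0.0.1:port and send payloadLen bytes
// (payloadLen may be 0 to mimic a peer that never sends a magic value).
// Returns the connected socket, or -1 on failure. The caller closes it.
int probePort(uint16_t port, const void* payload, size_t payloadLen) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

  if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
    close(fd);
    return -1;
  }
  if (payloadLen > 0) {
    // MSG_NOSIGNAL avoids SIGPIPE if the listener already dropped us.
    ssize_t sent = send(fd, payload, payloadLen, MSG_NOSIGNAL);
    if (sent != static_cast<ssize_t>(payloadLen)) {
      close(fd);
      return -1;
    }
  }
  return fd;
}
```

Holding the returned socket open without sending anything exercises the "no magic value" case; sending a short payload exercises the "too few bytes" case.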

We have also extended the solution to cover an important edge case, where an external connection is established successfully but either:

  1. no magic value is transmitted, or
  2. the data sent is shorter than the magic field.

In these cases a timeout prevents the program from hanging indefinitely, so incomplete or malformed connection handshakes are handled gracefully.
The timeout defaults to a value chosen to tolerate legitimate network delays while still preventing permanent hangs.
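
The timeout behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `MagicResult`, `kExpectedMagic`, and `readMagicWithTimeout` are hypothetical names, the magic is assumed to be a 64-bit value, and the deadline is enforced with `poll` on a monotonic clock.

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

// Hypothetical outcome codes; these are not NCCL's socket states.
enum class MagicResult { Ok, Invalid, Timeout, Closed, Error };

// Placeholder value for illustration only.
constexpr uint64_t kExpectedMagic = 0x1122334455667788ULL;

// Reads exactly sizeof(uint64_t) bytes from fd and compares them with
// `expected`. Gives up once `timeoutMs` has elapsed in total, so a peer that
// sends nothing, or only part of the field, cannot stall the caller forever.
MagicResult readMagicWithTimeout(int fd, uint64_t expected, int timeoutMs) {
  using Clock = std::chrono::steady_clock;
  const auto deadline = Clock::now() + std::chrono::milliseconds(timeoutMs);

  unsigned char buf[sizeof(uint64_t)];
  size_t received = 0;
  while (received < sizeof(buf)) {
    const auto remaining = std::chrono::duration_cast<std::chrono::milliseconds>(
                               deadline - Clock::now()).count();
    if (remaining <= 0) return MagicResult::Timeout;

    struct pollfd pfd = {fd, POLLIN, 0};
    int rc = poll(&pfd, 1, static_cast<int>(remaining));
    if (rc < 0) {
      if (errno == EINTR) continue;  // interrupted: re-check the deadline
      return MagicResult::Error;
    }
    if (rc == 0) return MagicResult::Timeout;

    ssize_t n = recv(fd, buf + received, sizeof(buf) - received, 0);
    if (n == 0) return MagicResult::Closed;  // peer closed before sending all bytes
    if (n < 0) {
      if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK) continue;
      return MagicResult::Error;
    }
    received += static_cast<size_t>(n);
  }

  uint64_t magic = 0;
  std::memcpy(&magic, buf, sizeof(magic));
  return magic == expected ? MagicResult::Ok : MagicResult::Invalid;
}
```

Tracking one overall deadline, rather than restarting the timer on every partial read, keeps a peer that trickles one byte at a time from extending the wait indefinitely.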

// Added printing of local listening address and link type information
ncclSocketAddress listenAddr;
struct sockaddr_in addr;
socklen_t len = sizeof(addr);

Nit: shouldn't this be constexpr? Or does getsockname not like that?

Author

> Nit: shouldn't this be constexpr? Or does getsockname not like that?

I checked some online examples, and this way of writing it appears to be standard. Could you elaborate on your concern?

Suggested change
- socklen_t len = sizeof(addr);
+ constexpr socklen_t len = sizeof(addr);
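
One point worth checking before applying the suggestion, assuming `len` is later passed to `getsockname` as the shape of the snippet suggests: POSIX declares the length argument as `socklen_t *` and writes the actual address size back through it. A `constexpr` variable is implicitly `const`, so `&len` would have type `const socklen_t *` and would not convert to the required pointer type. A minimal sketch of the usual pattern (`localPort` is a hypothetical helper, not code from this PR):

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <netinet/in.h>
#include <sys/socket.h>

// Hypothetical helper: returns the local IPv4 port a socket is bound to,
// or -1 if the query fails or the socket is not AF_INET.
int localPort(int fd) {
  sockaddr_in addr{};
  // Must stay mutable: getsockname() reads the buffer size from len and
  // then overwrites it with the size of the address it actually stored.
  socklen_t len = sizeof(addr);
  if (getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0) return -1;
  if (addr.sin_family != AF_INET) return -1;
  return ntohs(addr.sin_port);
}
```

If a compile-time constant is still wanted, it could name the buffer size separately and be used to initialise the mutable `len`.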
