Fix potential hang issues when encountering illegal external connections #1880
This PR addresses the issue raised in #1808 by introducing a new socket state that specifically handles invalid magic values, which prevents the program from entering an infinite loop. It is based on #1834 and extends it to cover a key edge case: the external connection succeeds, but the peer sends no magic value or a magic field of invalid length. The state transitions and handling mechanism are shown in the accompanying state diagram.
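Separately from the diagram, the magic check behind the new state can be sketched in C. This is a minimal illustration and not the code in this PR: the state names, the `check_magic` helper, and the assumption that the magic is a 64-bit value are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical state names; the identifiers used in the PR may differ. */
enum sock_state {
    SOCK_WAIT_MAGIC,    /* peer accepted, magic not yet fully received */
    SOCK_READY,         /* magic matched: treat the peer as legitimate */
    SOCK_INVALID_MAGIC  /* magic mismatched: drop the peer, accept again */
};

/* Classify the bytes received so far from a newly accepted peer.
 * Assumes (for illustration) that the magic is a 64-bit value. */
enum sock_state check_magic(const unsigned char *buf, size_t len,
                            uint64_t expected)
{
    uint64_t got;

    assert(buf != NULL || len == 0);
    if (len < sizeof got)
        return SOCK_WAIT_MAGIC;      /* too short to decide yet */
    memcpy(&got, buf, sizeof got);   /* avoids unaligned access */
    return got == expected ? SOCK_READY : SOCK_INVALID_MAGIC;
}
```

A caller that sees `SOCK_INVALID_MAGIC` would close the offending descriptor and return to accepting connections rather than retrying the same peer, which is the behavior that avoids the loop. `SOCK_WAIT_MAGIC` on its own is not enough, since a peer that never sends the remaining bytes would keep the socket in that state forever.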


Test method:
Write a shell script that runs nccl-tests alltoall_perf in a loop. While it is running, use naabu to simulate external connections. When naabu connects to the port used by the proxy thread, the program logs the event and continues instead of hanging.
Additionally, we have extended the solution to cover an important edge case. It arises when an external connection is successfully established but the peer either sends no magic value or sends a magic field of invalid length.
A timeout mechanism has been implemented to prevent the program from hanging indefinitely in such cases. This ensures the system can gracefully handle incomplete or malformed connection handshakes, improving overall stability and fault tolerance.
The timeout duration is set to a reasonable default that balances responsiveness with allowance for legitimate network delays, while still preventing permanent hangs in failure scenarios.
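The timeout handling described above can be sketched as a bounded read of the magic value. This is an illustrative sketch under stated assumptions, not the PR's implementation: `read_magic_with_timeout`, the result codes, the 64-bit magic, and the millisecond timeout parameter are all hypothetical, and the sketch uses plain POSIX `poll` and `read` on a descriptor.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical result codes for the handshake read. */
enum magic_result {
    MAGIC_OK, MAGIC_MISMATCH, MAGIC_TIMEOUT, MAGIC_CLOSED, MAGIC_ERROR
};

/* Monotonic clock in milliseconds, unaffected by wall-clock changes. */
static long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long)ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

/* Read a 64-bit magic from fd, giving up after timeout_ms in total. */
enum magic_result read_magic_with_timeout(int fd, uint64_t expected,
                                          int timeout_ms)
{
    unsigned char buf[sizeof(uint64_t)];
    size_t have = 0;
    long deadline;
    uint64_t got;

    assert(timeout_ms >= 0);
    deadline = now_ms() + timeout_ms;

    while (have < sizeof buf) {
        long remaining = deadline - now_ms();
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        ssize_t n;
        int rc;

        if (remaining <= 0)
            return MAGIC_TIMEOUT;        /* no magic, or only part of it */
        rc = poll(&pfd, 1, (int)remaining);
        if (rc == 0)
            return MAGIC_TIMEOUT;
        if (rc < 0) {
            if (errno == EINTR)
                continue;                /* interrupted: recompute and retry */
            return MAGIC_ERROR;
        }
        n = read(fd, buf + have, sizeof buf - have);
        if (n == 0)
            return MAGIC_CLOSED;         /* peer closed before sending magic */
        if (n < 0) {
            if (errno == EINTR || errno == EAGAIN)
                continue;
            return MAGIC_ERROR;
        }
        have += (size_t)n;
    }
    memcpy(&got, buf, sizeof got);
    return got == expected ? MAGIC_OK : MAGIC_MISMATCH;
}
```

The deadline is computed once and shared across partial reads, so a peer that trickles in a few bytes at a time cannot extend the wait indefinitely. Any result other than `MAGIC_OK` would lead the caller to close the descriptor and resume accepting connections.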