[QUESTION] Gloo connectFullMesh fails when more than 60 nodes set "export GLOO_SOCKET_IFNAME=bond4" #977
Unanswered
Genlovy-Hoo
asked this question in
Q&A
Replies: 0 comments
When we train models on a multi-node cluster, training raises "RuntimeError: Gloo connectFullMesh failed ..." once more than 60 nodes set "export GLOO_SOCKET_IFNAME=bond4" (for example, 64 nodes). It works when the number of bond4 nodes is 60 or fewer.
Are there any restrictions on using the Gloo backend with a bond4 network configuration during training?
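
To help narrow this down, here is a minimal diagnostic sketch that could be run on every node before launching training. It only checks that each interface named in GLOO_SOCKET_IFNAME actually exists on the local host. It does not explain the failure above 60 nodes, and it may well report that everything is fine. The helper name `check_gloo_ifnames` and its parameters are made up for illustration and are not part of Gloo or PyTorch. The comma-separated handling assumes the variable may list several interfaces.

```python
import os
import socket


def check_gloo_ifnames(env=None, available=None):
    """Map each interface named in GLOO_SOCKET_IFNAME to whether it exists locally.

    `env` and `available` are hypothetical injection points for testing;
    by default the real environment and the host's interfaces are used.
    """
    env = os.environ if env is None else env
    if available is None:
        # socket.if_nameindex() returns (index, name) tuples for local interfaces.
        available = {name for _, name in socket.if_nameindex()}
    raw = env.get("GLOO_SOCKET_IFNAME", "")
    # Assumption: the variable may hold a comma-separated list of interfaces.
    names = [n.strip() for n in raw.split(",") if n.strip()]
    return {n: n in available for n in names}


if __name__ == "__main__":
    # Print one line per node so results can be compared across the cluster.
    print(socket.gethostname(), check_gloo_ifnames())
```

A node that prints `False` for bond4, or an empty mapping because the variable was not exported in that shell, would be worth checking first, especially among the nodes added beyond the 60 that work.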