Title: JACCL placement fails on 2-node Thunderbolt 5 RDMA topology — Phase never records SocketConnection to peer
Labels: bug, jaccl, thunderbolt
Environment
- exo: latest main branch (cloned Apr 23, 2026)
- macOS: 26.4.1 on both nodes
- Hardware: 2× Mac Studio M3 Ultra 512GB, direct Thunderbolt 5 cable, RDMA enabled via rdma_ctl enable
- RDMA interface: rdma_en5 on both nodes, 10.10.0.1 ↔ 10.10.0.2 (/24)
- MLX: 0.31.3
- MLX-LM: 0.31.2
Model
mlx-community/Kimi-K2.6-mlx-DQ3_K_M-q8 (K2.6, 470 GB, MoE 1T/32B active, kimi_k25 architecture)
Symptom
exo's libp2p peer discovery works: both nodes find each other and the model is registered in the topology. The RDMA interface (rdma_en5) is detected. However, Phase (node A) never records a SocketConnection to Mana (node B), despite:
- Manual HTTP reachability probes from Phase → Mana succeeding
- check_reachability firing on Phase
From Phase's perspective, ping discovered is never logged either.
As a result, placement_utils.get_mlx_jaccl_coordinators cannot build a valid symmetric placement and the model never starts serving via exo's auto-parallel scheduler.
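A minimal sketch of the kind of manual probe referred to above, reduced to a plain TCP connect from Phase to Mana. This is not exo's check_reachability. The function name tcp_reachable is hypothetical, and the address and port in the usage comment are placeholders (which node holds 10.10.0.2 and which port exo listens on are assumptions to adjust).

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Example usage on Phase (placeholder values, adjust to the real setup):
#   tcp_reachable("10.10.0.2", 52415)
```

If a probe like this succeeds while no SocketConnection is recorded, the gap is more likely in how exo turns reachability results into topology records than in the link itself.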
Expected behavior
On a 2-node RDMA topology with a single direct TB5 cable, exo should record bidirectional SocketConnections and produce a valid JACCL placement for tensor-parallel inference.
Workaround
Bare MLX-distributed via mlx.launch --backend jaccl --hostfile hostfile.json works perfectly on the same hardware with the same model. The JACCL transport, all-reduce correctness, and inference all function — only exo's placement layer is broken.
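For anyone reproducing the workaround, the sketch below generates a hostfile of the kind passed to mlx.launch for this 2-node setup. The ssh hostnames (phase.local, mana.local) and the assignment of IPs to nodes are assumptions for illustration. The ssh/ips/rdma layout, with one rdma entry per rank and null for the rank itself, is likewise an assumption based on the MLX distributed documentation and should be verified against the installed MLX version.

```python
import json


def build_two_node_hostfile(node_a: dict, node_b: dict) -> list:
    """Build a 2-rank hostfile; each node dict has 'ssh', 'ip', 'rdma_device'."""
    return [
        {
            "ssh": node_a["ssh"],
            "ips": [node_a["ip"]],
            # Rank 0: no device for itself, local RDMA device toward rank 1.
            "rdma": [None, node_a["rdma_device"]],
        },
        {
            "ssh": node_b["ssh"],
            "ips": [node_b["ip"]],
            # Rank 1: local RDMA device toward rank 0, no device for itself.
            "rdma": [node_b["rdma_device"], None],
        },
    ]


# Hypothetical hostnames; IP-to-node mapping is assumed, not taken from the report.
hostfile = build_two_node_hostfile(
    {"ssh": "phase.local", "ip": "10.10.0.1", "rdma_device": "rdma_en5"},
    {"ssh": "mana.local", "ip": "10.10.0.2", "rdma_device": "rdma_en5"},
)
print(json.dumps(hostfile, indent=2))  # None is serialized as JSON null
```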
Suspected root cause
placement_utils.get_mlx_jaccl_coordinators appears to assume a symmetric topology: it seems to require matching SocketConnection records from both sides before proceeding, but the discovery/connection flow doesn't guarantee that both sides register simultaneously, or at all, in a 2-node direct-cable topology.
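To make the suspected failure mode concrete, here is a toy model of a symmetry check over directed connection records. It is not exo's code: the SocketConnection dataclass and both helper functions are hypothetical stand-ins, and the example assumes Mana did record its side of the link, which the logs above do not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SocketConnection:
    """Toy stand-in for a directed connection record (source -> sink)."""
    source: str
    sink: str


def symmetric_pairs(connections):
    """Return the node pairs that have records in both directions."""
    edges = {(c.source, c.sink) for c in connections}
    return {tuple(sorted(e)) for e in edges if (e[1], e[0]) in edges}


def missing_reverse_edges(connections):
    """Return the directed edges that would be needed to make every record symmetric."""
    edges = {(c.source, c.sink) for c in connections}
    return sorted((b, a) for (a, b) in edges if (b, a) not in edges)


# Assumed state: only Mana -> Phase was recorded.
observed = [SocketConnection("Mana", "Phase")]
print(symmetric_pairs(observed))        # no pair has records in both directions
print(missing_reverse_edges(observed))  # the Phase -> Mana record is the missing one
```

Under a strict requirement like symmetric_pairs, a single missing direction leaves nothing to place on, which would match the observed behavior. Reporting the missing reverse edges, or retrying until they appear, would at least make the stall visible.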
Reproducer
- Two Mac Studio M3 Ultra machines, direct TB5 cable, RDMA enabled, IPs on a /24 subnet
- Install exo from source on both
- Set matching EXO_LIBP2P_NAMESPACE
- Launch exo --models-dir <dir> on both
- Observe dashboard: both nodes visible, model registered, but no serving endpoint materializes
- Check logs: Phase shows check_reachability but no SocketConnection to Mana
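The log check in the last step can be automated with a small counter over the marker strings named in this report. This is a sketch: count_markers is a hypothetical helper, the log path in the usage comment is a placeholder, and matching is plain substring search because exo's exact log line format is not assumed here.

```python
MARKERS = ("check_reachability", "SocketConnection", "ping discovered")


def count_markers(lines, markers=MARKERS):
    """Count how many lines contain each marker substring."""
    counts = {marker: 0 for marker in markers}
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


# Example usage on Phase (placeholder path):
#   with open("phase.log", encoding="utf-8", errors="replace") as fh:
#       print(count_markers(fh))
```

On Phase, the state described above would show a non-zero count for check_reachability and zero for the other two markers.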
Related issues