JACCL placement fails on 2-node Thunderbolt 5 RDMA topology — Phase never records SocketConnection to peer #1975

@Pau1co

Labels: bug, jaccl, thunderbolt


Environment

  • exo: latest main branch (cloned Apr 23, 2026)
  • macOS: 26.4.1 on both nodes
  • Hardware: 2× Mac Studio M3 Ultra 512GB, direct Thunderbolt 5 cable, RDMA enabled via rdma_ctl enable
  • RDMA interface: rdma_en5 on both nodes, 10.10.0.1 ↔ 10.10.0.2 (/24)
  • MLX: 0.31.3
  • MLX-LM: 0.31.2

Model

mlx-community/Kimi-K2.6-mlx-DQ3_K_M-q8 (K2.6, 470 GB, MoE 1T/32B active, kimi_k25 architecture)

Symptom

exo's libp2p peer discovery works: both nodes find each other and the model is registered in the topology. The RDMA interface (rdma_en5) is detected. However, Phase (node A) never records a SocketConnection to Mana (node B), even though:

  • Manual HTTP reachability probes from Phase → Mana succeed
  • check_reachability fires on Phase

The "ping discovered" message is also never logged from Phase's perspective.
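
For anyone reproducing this, a manual HTTP reachability probe of the kind described above might look like the sketch below. It uses only the Python standard library; the function name probe is made up for illustration and is not part of exo, and the URL you pass (host and port) depends on your own setup.

```python
import http.client
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the peer sends back any HTTP response, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The peer answered with an error status, so it is still reachable.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, malformed response, etc.
        return False
```

Running something like probe("http://10.10.0.2:<port>/") from Phase, with <port> replaced by whatever port exo listens on in your install, should return True if Mana is reachable over the Thunderbolt link.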

As a result, placement_utils.get_mlx_jaccl_coordinators cannot build a valid symmetric placement and the model never starts serving via exo's auto-parallel scheduler.

Expected behavior

On a 2-node RDMA topology with a single direct TB5 cable, exo should record bidirectional SocketConnections and produce a valid JACCL placement for tensor-parallel inference.

Workaround

Bare MLX-distributed via mlx.launch --backend jaccl --hostfile hostfile.json works perfectly on the same hardware with the same model. The JACCL transport, all-reduce correctness, and inference all function — only exo's placement layer is broken.

Suspected root cause

A symmetry assumption in placement_utils.get_mlx_jaccl_coordinators. The function appears to require SocketConnection records from both sides before proceeding, but the discovery/connection flow doesn't guarantee that both sides register a connection simultaneously, or at all, in a 2-node direct-cable topology.
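
To make the suspicion concrete, here is a toy model of a check that needs a connection record in both directions. This is not exo's code: the SocketConnection dataclass, its fields, and both helper functions are hypothetical stand-ins, and the example assumes (purely for illustration) that only Mana's side was recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SocketConnection:
    """Hypothetical stand-in for a directed connection record (not exo's class)."""
    source: str
    sink: str


def missing_directions(nodes, connections):
    """Return every ordered (source, sink) pair that has no recorded connection."""
    seen = {(c.source, c.sink) for c in connections}
    return [(a, b) for a in nodes for b in nodes if a != b and (a, b) not in seen]


def can_place_symmetric(nodes, connections):
    """A placement that needs both directions succeeds only if nothing is missing."""
    return not missing_directions(nodes, connections)


nodes = ["Phase", "Mana"]
one_sided = [SocketConnection("Mana", "Phase")]  # illustrative assumption
print(missing_directions(nodes, one_sided))   # [('Phase', 'Mana')]
print(can_place_symmetric(nodes, one_sided))  # False
```

If the real function behaves anything like this, a single missing direction is enough to block placement even though the link itself works.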

Reproducer

  1. Two Mac Studio M3 Ultra machines, direct TB5 cable, RDMA enabled, IPs on the same /24 subnet
  2. Install exo from source on both
  3. Set matching EXO_LIBP2P_NAMESPACE
  4. Launch exo --models-dir <dir> on both
  5. Observe dashboard: both nodes visible, model registered, but no serving endpoint materializes
  6. Check logs: Phase shows check_reachability but no SocketConnection to Mana
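
For step 6, a small helper like the following can tally how often each relevant string shows up in a log. It is only a sketch: it does plain substring matching and assumes nothing about exo's actual log format, and the function name count_markers is made up for this example.

```python
MARKERS = ("check_reachability", "SocketConnection", "ping discovered")


def count_markers(lines, markers=MARKERS):
    """Count how many lines contain each marker string (plain substring match)."""
    counts = {marker: 0 for marker in markers}
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts
```

Pass it any iterable of lines, for example an open file handle on wherever you captured Phase's output. In the failure described here, I would expect a non-zero count for check_reachability and zero for the other two.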
