
[BUG] RDMA ports stop working after being inactive for a while #1973

@leanzero-srl

Describe the bug

Yesterday, when 1.0.71 was released, I grabbed it, downloaded Qwen3.6 27B 8-bit, and ran it on my cluster. It worked very well all night with Qwen Code.

When I woke up this morning, I tried a single prompt and the entire cluster failed immediately with the error `Changing queue pair to RTR failed with errno 96`.
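
For anyone triaging this, here is a minimal sketch for turning the numeric code into a symbolic name and message. It assumes jaccl reports a standard OS errno value, which the log does not confirm. The mapping is platform-specific, so run it on the affected Mac rather than on another machine. `describe_errno` is a hypothetical helper name, not part of exo or MLX.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return a readable description of an OS errno value on this platform."""
    # errno.errorcode maps numeric codes to symbolic names for the current OS.
    name = errno.errorcode.get(code, "UNKNOWN")
    try:
        text = os.strerror(code)
    except ValueError:
        # Some platforms raise ValueError for codes they do not know.
        text = "no description available"
    return f"errno {code}: {name} ({text})"


if __name__ == "__main__":
    # 96 is the value reported in the crash message above.
    print(describe_errno(96))
```

The output is deliberately not quoted here, since the same number resolves to different names on macOS and Linux.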

The problem now is that the model fails to load at all. I restarted exo on both machines and it is still dead.

`[ 2026-04-24 09:03:00.825 | INFO | exo.main:main:275 ] ========================================
[ 2026-04-24 09:03:00.826 | INFO | exo.main:main:276 ] Starting EXO | pid=32378
[ 2026-04-24 09:03:00.826 | INFO | exo.main:main:277 ] ========================================
[ 2026-04-24 09:03:00.826 | INFO | exo.main:main:278 ] EXO_LIBP2P_NAMESPACE: 1.0.71
[ 2026-04-24 09:03:00.829 | INFO | exo.main:create:69 ] Starting node 12D3KooWDfKzXmqiWvtw2YuCf8cHJUry1Vi6f7PFP7Hh869ZbCtz
[ 2026-04-24 09:03:00.856 | INFO | exo.shared.election:run:87 ] Starting Election
[ 2026-04-24 09:03:00.856 | INFO | exo.download.coordinator:run:134 ] Starting DownloadCoordinator
[ 2026-04-24 09:03:00.856 | INFO | exo.worker.main:run:101 ] Starting Worker
[ 2026-04-24 09:03:00.856 | INFO | exo.master.main:run:101 ] Starting Master
[ 2026-04-24 09:03:00.856 | INFO | exo.api.main:run:1766 ] Starting API
[ 2026-04-24 09:03:00.876 | INFO | exo.routing.router:_networking_subscribe:182 ] Subscribed to global_events
[ 2026-04-24 09:03:00.876 | INFO | exo.routing.router:_networking_subscribe:182 ] Subscribed to local_events
[ 2026-04-24 09:03:00.876 | INFO | exo.routing.router:_networking_subscribe:182 ] Subscribed to commands
[ 2026-04-24 09:03:00.878 | INFO | exo.main:_elect_loop:200 ] Node elected Master
[ 2026-04-24 09:03:00.878 | INFO | exo.api.main:unpause:293 ] Unpausing API
[ 2026-04-24 09:03:00.878 | INFO | exo.routing.router:_networking_subscribe:182 ] Subscribed to election_messages
[ 2026-04-24 09:03:00.878 | INFO | logging:handle:1681 ] Running on http://0.0.0.0:52415 (CTRL + C to quit)
[ 2026-04-24 09:03:00.878 | INFO | logging:handle:1681 ] Running on http://0.0.0.0:52415 (CTRL + C to quit)
[ 2026-04-24 09:03:00.879 | INFO | exo.routing.router:_networking_subscribe:182 ] Subscribed to connection_messages
[ 2026-04-24 09:03:00.880 | INFO | exo.routing.router:_networking_subscribe:182 ] Subscribed to download_commands
[ 2026-04-24 09:03:06.038 | INFO | exo.shared.election:_campaign:197 ] Waiting for other campaign to finish
[ 2026-04-24 09:03:09.040 | INFO | exo.main:_elect_loop:200 ] Node elected Master
[ 2026-04-24 09:03:09.041 | INFO | exo.api.main:unpause:293 ] Unpausing API
[ 2026-04-24 09:03:09.620 | INFO | exo.master.main:_command_processor:122 ] Executing command: RequestEventLog(command_id='54e9ed3a-7e54-475f-bbf7-6d4fcd92cb0e' since_idx=0)
[ 2026-04-24 09:03:31.265 | INFO | exo.master.main:_command_processor:122 ] Executing command: CreateInstance(command_id='7a94046f-445d-47ef-8513-0ac7db40bd0d' instance=MlxJacclInstance(instance_id='90257cdb-2d84-4bc3-bfc6-e812da010fd0', shard_assignments=ShardAssignments(model_id='mlx-community/Qwen3.6-27B-8bit', runner_to_shard={'62e6b023-e928-43f7-a20a-685deeea1357': TensorShardMetadata(model_card=ModelCard(model_id='mlx-community/Qwen3.6-27B-8bit', storage_size=Memory.from_bytes(29500938720), n_layers=64, hidden_size=5120, supports_tensor=True, num_key_value_heads=4, tasks=[<ModelTask.TextGeneration: 'TextGeneration'>], components=None, family='qwen', quantization='8bit', base_model='Qwen3.6 27B', capabilities=['text', 'thinking', 'thinking_toggle', 'vision'], context_length=262144, uses_cfg=False, trust_remote_code=True, is_custom=False, vision=VisionCardConfig(image_token_id=248056, model_type='qwen3_5', weights_repo='mlx-community/Qwen3.6-27B-8bit', image_token=None, processor_repo=None), sampling_defaults=SamplingDefaults(temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, repetition_penalty=1.0, presence_penalty=1.5, frequency_penalty=None, thinking=None, non_thinking=SamplingValues(temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, repetition_penalty=1.0, presence_penalty=1.5, frequency_penalty=None))), device_rank=0, world_size=2, immediate_exception=False, should_timeout=None, start_layer=0, end_layer=64, n_layers=64), '4027544f-ee09-4146-b59f-bc2883d764df': TensorShardMetadata(model_card=ModelCard(model_id='mlx-community/Qwen3.6-27B-8bit', storage_size=Memory.from_bytes(29500938720), n_layers=64, hidden_size=5120, supports_tensor=True, num_key_value_heads=4, tasks=[<ModelTask.TextGeneration: 'TextGeneration'>], components=None, family='qwen', quantization='8bit', base_model='Qwen3.6 27B', capabilities=['text', 'thinking', 'thinking_toggle', 'vision'], context_length=262144, uses_cfg=False, trust_remote_code=True, is_custom=False, 
vision=VisionCardConfig(image_token_id=248056, model_type='qwen3_5', weights_repo='mlx-community/Qwen3.6-27B-8bit', image_token=None, processor_repo=None), sampling_defaults=SamplingDefaults(temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, repetition_penalty=1.0, presence_penalty=1.5, frequency_penalty=None, thinking=None, non_thinking=SamplingValues(temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, repetition_penalty=1.0, presence_penalty=1.5, frequency_penalty=None))), device_rank=1, world_size=2, immediate_exception=False, should_timeout=None, start_layer=0, end_layer=64, n_layers=64)}, node_to_runner={'12D3KooWA6ArqMz963AzT974o1U7gBmcLyFhbA3aY2vWMXYBau8M': '62e6b023-e928-43f7-a20a-685deeea1357', '12D3KooWDfKzXmqiWvtw2YuCf8cHJUry1Vi6f7PFP7Hh869ZbCtz': '4027544f-ee09-4146-b59f-bc2883d764df'}), jaccl_devices=[[None, 'rdma_en3'], ['rdma_en3', None]], jaccl_coordinators={'12D3KooWA6ArqMz963AzT974o1U7gBmcLyFhbA3aY2vWMXYBau8M': '0.0.0.0:63657', '12D3KooWDfKzXmqiWvtw2YuCf8cHJUry1Vi6f7PFP7Hh869ZbCtz': '192.168.8.220:63657'}))
[ 2026-04-24 09:03:31.307 | INFO | exo.worker.main:plan_step:214 ] Worker plan: CreateRunner
[ 2026-04-24 09:03:31.719 | INFO | exo.worker.runner.bootstrap:entrypoint:34 ] Fast synch flag: 1
[ 2026-04-24 09:03:32.861 | INFO | exo.worker.runner.llm_inference.runner:init:93 ] hello from the runner
[ 2026-04-24 09:03:32.861 | INFO | exo.worker.runner.llm_inference.runner:init:113 ] runner created
[ 2026-04-24 09:03:32.931 | INFO | exo.worker.main:plan_step:214 ] Worker plan: ConnectToGroup
[ 2026-04-24 09:03:32.932 | INFO | exo.worker.runner.runner_supervisor:start_task:182 ] Starting task ConnectToGroup(task_id='146af62d-d7eb-4733-b4d0-8905c88d73ea' task_status=<TaskStatus.Pending: 'Pending'> instance_id='90257cdb-2d84-4bc3-bfc6-e812da010fd0')
[ 2026-04-24 09:03:32.932 | INFO | exo.worker.runner.llm_inference.runner:handle_first_task:149 ] runner connecting
[ 2026-04-24 09:03:32.933 | INFO | exo.worker.engines.mlx.utils_mlx:mlx_distributed_init:97 ] Starting initialization for rank 1
[ 2026-04-24 09:03:32.933 | INFO | exo.worker.engines.mlx.utils_mlx:mlx_distributed_init:135 ] rank 1 MLX_IBV_DEVICES: /var/folders/41/670wc_gs2y93f8rs0330cc780000gn/T/tmps978zbfw/hosts_90257cdb-2d84-4bc3-bfc6-e812da010fd0_1.json with devices: [[null, "rdma_en3"], ["rdma_en3", null]]
[ 2026-04-24 09:03:32.933 | INFO | exo.worker.engines.mlx.utils_mlx:mlx_distributed_init:138 ] rank 1 MLX_JACCL_COORDINATOR: 192.168.8.220:63657
[ 2026-04-24 09:03:32.966 | WARNING | exo.worker.runner.bootstrap:entrypoint:59 ] Runner 4027544f-ee09-4146-b59f-bc2883d764df crashed with critical exception [jaccl] Changing queue pair to RTR failed with errno 96
Traceback (most recent call last):

File "main.py", line 38, in

File "pyi_rth_multiprocessing.py", line 48, in _freeze_support

File "multiprocessing/spawn.py", line 122, in spawn_main

File "multiprocessing/spawn.py", line 135, in _main

File "multiprocessing/process.py", line 313, in _bootstrap

File "multiprocessing/process.py", line 108, in run

File "exo/worker/runner/bootstrap.py", line 54, in entrypoint

File "exo/worker/runner/llm_inference/runner.py", line 139, in main

File "exo/worker/runner/llm_inference/runner.py", line 153, in handle_first_task

File "exo/worker/engines/mlx/utils_mlx.py", line 159, in initialize_mlx

File "exo/worker/engines/mlx/utils_mlx.py", line 142, in mlx_distributed_init

ValueError: [jaccl] Changing queue pair to RTR failed with errno 96
[ 2026-04-24 09:03:32.967 | INFO | exo.worker.runner.bootstrap:entrypoint:75 ] bye from the runner
[ 2026-04-24 09:03:32.968 | INFO | exo.worker.runner.runner_supervisor:_check_runner:255 ] Checking runner's status
[ 2026-04-24 09:03:32.968 | INFO | exo.worker.runner.runner_supervisor:_check_runner:257 ] Runner was found to be alive, attempting to join process
[ 2026-04-24 09:03:33.036 | INFO | exo.worker.main:plan_step:214 ] Worker plan: Shutdown
[ 2026-04-24 09:03:33.036 | INFO | exo.worker.runner.runner_supervisor:start_task:182 ] Starting task Shutdown(task_id='e772911c-f436-4862-a195-99b784fd1fe6' task_status=<TaskStatus.Pending: 'Pending'> instance_id='90257cdb-2d84-4bc3-bfc6-e812da010fd0' runner_id='4027544f-ee09-4146-b59f-bc2883d764df')
[ 2026-04-24 09:03:33.037 | WARNING | exo.worker.runner.runner_supervisor:start_task:190 ] Task Shutdown(task_id='e772911c-f436-4862-a195-99b784fd1fe6' task_status=<TaskStatus.Pending: 'Pending'> instance_id='90257cdb-2d84-4bc3-bfc6-e812da010fd0' runner_id='4027544f-ee09-4146-b59f-bc2883d764df') dropped, runner closed communication.
[ 2026-04-24 09:03:33.138 | INFO | exo.worker.main:plan_step:214 ] Worker plan: CreateRunner
[ 2026-04-24 09:03:33.252 | INFO | exo.worker.main:plan_step:214 ] Worker plan: Shutdown
[ 2026-04-24 09:03:33.252 | INFO | exo.worker.runner.runner_supervisor:start_task:182 ] Starting task Shutdown(task_id='3c898095-920c-46be-8d01-c6bd78446298' task_status=<TaskStatus.Pending: 'Pending'> instance_id='90257cdb-2d84-4bc3-bfc6-e812da010fd0' runner_id='4027544f-ee09-4146-b59f-bc2883d764df')
[ 2026-04-24 09:03:33.300 | INFO | exo.worker.runner.runner_supervisor:_check_runner:260 ] Runner exited with exit code 0
[ 2026-04-24 09:03:33.300 | INFO | exo.worker.runner.runner_supervisor:run:118 ] Runner supervisor shutting down
[ 2026-04-24 09:03:33.300 | INFO | exo.worker.runner.runner_supervisor:run:164 ] Runner process succesfully terminated
[ 2026-04-24 09:03:33.540 | INFO | exo.worker.runner.bootstrap:entrypoint:34 ] Fast synch flag: 1`
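
The crash is easy to miss among the INFO lines, so here is a rough sketch for pulling runner crashes out of a log like the one above. The regular expression is inferred only from the single WARNING line in this report and may not match other exo versions. `find_runner_crashes` is a hypothetical helper, not an exo API.

```python
import re
import sys
from typing import Iterable

# Pattern inferred from the one crash line in this report (an assumption).
CRASH_RE = re.compile(
    r"^\[ (?P<ts>[^|]+?) \| (?P<level>\w+) \| (?P<src>\S+) \] "
    r"Runner (?P<runner_id>[0-9a-f-]+) crashed with critical exception "
    r"(?P<message>.*)$"
)
ERRNO_RE = re.compile(r"errno (\d+)")


def find_runner_crashes(lines: Iterable[str]) -> list[dict]:
    """Collect timestamp, runner id, message and errno for each crash line."""
    crashes = []
    for line in lines:
        match = CRASH_RE.match(line.strip())
        if match is None:
            continue
        record = match.groupdict()
        errno_match = ERRNO_RE.search(record["message"])
        # errno is None when the message carries no numeric code.
        record["errno"] = int(errno_match.group(1)) if errno_match else None
        crashes.append(record)
    return crashes


if __name__ == "__main__":
    # Usage: python find_crashes.py < exo.log
    for crash in find_runner_crashes(sys.stdin):
        print(crash["ts"], crash["runner_id"], crash["message"])
```

Running it over the logs from both machines would show whether each rank fails at the same step and with the same errno.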

To Reproduce

Steps to reproduce the behavior:

  1. Use Qwen Code heavily on an RDMA + Tensor instance.
  2. Leave the cluster inactive for a while (overnight in my case).
  3. Send a prompt, or reload the model.

Expected behavior

The model keeps loading and answering prompts after the cluster has been inactive.

Actual behavior

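The runner crashes with `[jaccl] Changing queue pair to RTR failed with errno 96`, and the model no longer loads, even after restarting exo on both machines.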

Environment

  • macOS Version: 26.4.1
  • EXO Version: 1.0.71
  • Hardware:
    • M4 Max, 128 GB
    • M3 Ultra, 96 GB
