macos-app: ClusterStateService polling writes ~600 KB/sec to disk via URLCache #2004

@ecohash-co

Summary

The macOS EXO app shell dirties file-backed memory at a sustained ~490–630 KB/sec while the cluster-state polling loop is running. macOS treats this as anomalous (the per-process limit is ~25 KB/sec averaged over a day) and emits microstackshot diagnostic reports under /Library/Logs/DiagnosticReports/EXO_*.diag.

The writes are entirely from __CFURLCache::CreateAndStoreCacheNode flushing HTTP response bodies to ~/Library/Caches/exolabs.EXO/. They serve no functional purpose — ClusterStateService polls /state at 2 Hz and never reads from the cache, only writes to it.

A fix is straightforward; PR follows.

Environment

  • macOS 26.4.1 (Build 25E253)
  • Mac Studio (Mac15,14), M3 Ultra, 512 GB
  • EXO 1.0.71 (1000071999)
  • Single-node and multi-node configurations both reproduce

Symptom

Six microstackshot reports collected on one node over eight days:

| Filename | Total writes | Duration | Sustained rate |
| --- | --- | --- | --- |
| EXO_2026-04-22-131216_atlas.diag | 2.15 GB | 4077 s | 527 KB/s |
| EXO_2026-04-22-180238_atlas.diag | 8.59 GB | 17421 s | 493 KB/s |
| EXO_2026-04-23-111208_atlas.diag | 2.15 GB | 3487 s | 616 KB/s |
| EXO_2026-04-23-150044_atlas.diag | 8.59 GB | 13715 s | 626 KB/s |
| EXO_2026-04-24-062837_atlas.diag | 34.36 GB | 55673 s (≈15 h) | 617 KB/s |
| EXO_2026-04-29-125940_atlas.diag | 2.15 GB | 3463 s | 620 KB/s |

Each report headline:

Event:            disk writes
Action taken:     none
Writes:           <N> MB of file backed memory dirtied over <secs> seconds (<rate> KB/sec average),
                  exceeding limit of 24.86 KB per second over 86400 seconds
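
As a sanity check on these figures, the limit in the headline can be turned into a daily budget and compared with one of the reports. The sketch below assumes decimal units (1 GB = 1,000,000 KB), which the report format does not state; the helper name sustainedKBPerSec is illustrative only.

```swift
import Foundation

// Figures copied from the report headline and the 2026-04-29 report.
let limitKBPerSec = 24.86
let windowSeconds = 86_400.0

// Daily budget implied by the limit, in GB (assumes 1 GB = 1_000_000 KB).
let budgetGB = limitKBPerSec * windowSeconds / 1_000_000

/// Sustained rate in KB/s for a report of `totalGB` written over `seconds`.
func sustainedKBPerSec(totalGB: Double, seconds: Double) -> Double {
    totalGB * 1_000_000 / seconds
}

let rate = sustainedKBPerSec(totalGB: 2.15, seconds: 3463)
print(String(format: "daily budget: %.2f GB", budgetGB))
print(String(format: "sustained rate: %.0f KB/s (%.0fx the limit)", rate, rate / limitKBPerSec))
```

Under that assumption the budget comes out at about 2.15 GB, which matches the smallest report totals, and the sustained rate is roughly 25 times the limit.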

Root cause

The Heaviest Stack on every report (177 of 185 samples on the most recent one; 3066 of 3116 samples on the 15-hour one) is:

start_wqthread
  _pthread_wqthread
    _dispatch_workloop_worker_thread
      _dispatch_root_queue_drain_deferred_wlh
        _dispatch_lane_invoke
          _dispatch_lane_serial_drain
            _dispatch_client_callout
              _dispatch_block_async_invoke2
                invocation function for block in __CFURLCache::CreateAndStoreCacheNode(...)
                  write + 8 (libsystem_kernel.dylib)
                  _CFURLCacheFSWriteCachedResponseToFS

That stack shows that in 96–98% of samples the process is inside CFNetwork's URL response cache, writing cached response bodies to disk.

The relevant code is in app/EXO/EXO/Services/ClusterStateService.swift:

  • init defaults session: URLSession = .shared.
  • URLSession.shared ships with URLCache.shared attached, which has a nonzero diskCapacity by default.
  • startPolling(interval:) defaults to 0.5 s and calls fetchSnapshot() on every tick — that's GET /state against the local exo Python server twice per second.
  • The per-URLRequest cache policy is set to .reloadIgnoringLocalCacheData, but that only affects read behavior — the response is still written to the URL cache after each successful fetch. (See Apple's docs on URLRequest.CachePolicy.)

So every snapshot poll persists its response body to disk, regardless of whether the client will ever read it back.
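
One way to check this on an affected machine is to inspect the shared cache's limits and usage. This is a minimal sketch using only public URLCache properties; CacheUsage and sharedCacheUsage are hypothetical names, and the numbers only reflect EXO's cache if the code runs inside the app process, which is an assumption about how you would wire it in.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

/// Snapshot of the shared URL cache's configured limits and current usage, in bytes.
struct CacheUsage {
    let diskCapacity: Int
    let diskUsage: Int
    let memoryCapacity: Int
    let memoryUsage: Int
}

/// Reads the limits and usage of URLCache.shared at the moment of the call.
func sharedCacheUsage() -> CacheUsage {
    let cache = URLCache.shared
    return CacheUsage(
        diskCapacity: cache.diskCapacity,
        diskUsage: cache.currentDiskUsage,
        memoryCapacity: cache.memoryCapacity,
        memoryUsage: cache.currentMemoryUsage
    )
}

let usage = sharedCacheUsage()
print("disk: \(usage.diskUsage) / \(usage.diskCapacity) bytes")
print("memory: \(usage.memoryUsage) / \(usage.memoryCapacity) bytes")
```

A nonzero disk capacity together with nonzero disk usage, in a process that only polls /state, would be consistent with the behavior described above.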

/state responses scale with the number of models loaded × peers × instances and can easily reach tens of KB or more. At 2 Hz, the observed ~490–630 KB/sec works out to roughly 250–310 KB dirtied per poll, a figure that covers the response body plus any cache bookkeeping written alongside it.

Why this matters

  • SSD wear: 34 GB of cache writes for a 15-hour idle-ish polling session is gratuitous. Internal SSDs on Mac Studios can't be replaced without sending the unit to Apple.
  • Background CPU: _dispatch_block_async_invoke2 → write is sustained on a worker thread.
  • Cache directory growth: ~/Library/Caches/exolabs.EXO/ accumulates indefinitely.
  • macOS resource-limit microstackshots: macOS tags the process as "noisy on disk" (Action taken: none today, but the OS may escalate over time).

Cross-checked: zero microstackshot reports on the same network's other M3 Ultra running EXO 1.0.71 and serving inference but not running the EXO macOS app shell (the Swift menubar process is what hits this — the headless Python exo CLI alone does not). That confirms the issue is in the Swift shell's URL cache behavior, not in the Python core.

Suggested fix

Switch ClusterStateService's default session from URLSession.shared to an ephemeral session with urlCache = nil. Cluster-state responses are time-sensitive and small; nothing benefits from being cached on disk.

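private static func makeNonCachingSession() -> URLSession {
    // Ephemeral configurations use no persistent storage for caches, cookies, or credentials.
    let config = URLSessionConfiguration.ephemeral
    // Drop the in-memory cache as well, so /state responses are never stored.
    config.urlCache = nil
    // Always fetch fresh state instead of consulting a cache.
    config.requestCachePolicy = .reloadIgnoringLocalCacheData
    return URLSession(configuration: config)
}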
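
To show where such a helper would plug in, here is a minimal sketch of a service that takes the non-caching session as its default. StatePoller, its endpoint parameter, the placeholder URL, and cachesResponses are hypothetical stand-ins, not the actual ClusterStateService API.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

// Hypothetical stand-in for ClusterStateService; only the session wiring matters here.
final class StatePoller {
    private let session: URLSession
    private let endpoint: URL

    // Callers can still inject their own session (for tests, say); the default never caches.
    init(endpoint: URL, session: URLSession = StatePoller.makeNonCachingSession()) {
        self.endpoint = endpoint
        self.session = session
    }

    static func makeNonCachingSession() -> URLSession {
        let config = URLSessionConfiguration.ephemeral
        config.urlCache = nil
        config.requestCachePolicy = .reloadIgnoringLocalCacheData
        return URLSession(configuration: config)
    }

    /// Fetches one snapshot; with no cache attached, the body is not stored anywhere.
    func fetchSnapshot() async throws -> Data {
        let (data, _) = try await session.data(for: URLRequest(url: endpoint))
        return data
    }

    /// True if the session this poller uses has any URL cache attached.
    var cachesResponses: Bool { session.configuration.urlCache != nil }
}

// Placeholder URL; substitute the address of the local exo server.
let poller = StatePoller(endpoint: URL(string: "http://localhost/state")!)
print("caches responses:", poller.cachesResponses)
```

Keeping the session injectable means existing call sites that pass their own session keep working, while the default path stops writing to the URL cache.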

PR with this fix incoming as a follow-up.

Alternative considered

App-wide URLCache.shared = URLCache(memoryCapacity: 0, diskCapacity: 0) at app launch. This would also cover BugReportService (which uses URLSession.shared for crash report uploads) and any future callers. It's a one-line change but has a larger blast radius: Sparkle.framework and any other in-process code that uses the shared session would also lose caching. The per-service fix is the smaller, more targeted change.
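
For comparison, the app-wide variant could look like the sketch below. disableSharedURLCache is a hypothetical helper name, and the call to removeAllCachedResponses() is an addition not proposed above, included on the assumption that previously stored entries should be purged as well.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

/// Intended to be called once at app launch, before any request uses URLSession.shared.
func disableSharedURLCache() {
    // Purge whatever the existing shared cache has already stored (assumed desirable).
    URLCache.shared.removeAllCachedResponses()
    // Replace it with a zero-capacity cache so nothing is kept in memory or on disk.
    URLCache.shared = URLCache(memoryCapacity: 0, diskCapacity: 0)
}

disableSharedURLCache()
print("shared disk capacity:", URLCache.shared.diskCapacity)
print("shared memory capacity:", URLCache.shared.memoryCapacity)
```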

Happy to switch the PR to the app-wide approach if maintainers prefer.

Reproduction

  1. Run the EXO 1.0.71 macOS app on macOS 26.x (any recent version).
  2. Let it idle (no inference) for an hour or more.
  3. Check /Library/Logs/DiagnosticReports/ for EXO_*.diag files.
  4. The first report arrives once the process exceeds macOS's per-process daily disk-write budget (24.86 KB/s × 86400 s ≈ 2.15 GB), which took roughly an hour at the rates observed above.

The headless Python exo CLI does not reproduce — only the macOS menubar app.
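
Step 3 can be scripted. The sketch below lists matching reports using FileManager; exoDiagReports is a hypothetical helper, and reading /Library/Logs/DiagnosticReports may require an account with sufficient permissions.

```swift
import Foundation

/// Keeps only names that look like EXO microstackshot reports (EXO_*.diag), sorted by name.
func exoDiagReports(in names: [String]) -> [String] {
    names.filter { $0.hasPrefix("EXO_") && $0.hasSuffix(".diag") }.sorted()
}

let reportsDir = "/Library/Logs/DiagnosticReports"
// A missing or unreadable directory is treated as "no reports".
let names = (try? FileManager.default.contentsOfDirectory(atPath: reportsDir)) ?? []
for name in exoDiagReports(in: names) {
    print("\(reportsDir)/\(name)")
}
```

Because the filenames embed a timestamp, sorting by name also puts the reports in chronological order.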
