macos-app: ClusterStateService polling writes ~600 KB/sec to disk via URLCache

## Summary

The macOS EXO app shell dirties file-backed memory at a sustained ~490–630 KB/sec while the cluster-state polling loop is running. macOS treats this as anomalous (the per-process daily-average disk-write baseline is ~25 KB/sec) and emits microstackshot diagnostic reports under `/Library/Logs/DiagnosticReports/EXO_*.diag`.

The writes are entirely from `__CFURLCache::CreateAndStoreCacheNode` flushing HTTP response bodies to `~/Library/Caches/exolabs.EXO/`. They serve no functional purpose — `ClusterStateService` polls `/state` at 2 Hz and never reads from the cache, only writes to it.

A fix is straightforward; PR follows.

## Environment

- macOS 26.4.1 (Build 25E253)
- Mac Studio (Mac15,14), M3 Ultra, 512 GB
- EXO 1.0.71 (1000071999)
- Single-node and multi-node configurations both reproduce

## Symptom

Six microstackshot reports collected on one node over eight days:

| Filename | Total writes | Duration | Sustained rate |
|---|---|---|---|
| `EXO_2026-04-22-131216_atlas.diag` | 2.15 GB | 4077 s | 527 KB/s |
| `EXO_2026-04-22-180238_atlas.diag` | 8.59 GB | 17421 s | 493 KB/s |
| `EXO_2026-04-23-111208_atlas.diag` | 2.15 GB | 3487 s | 616 KB/s |
| `EXO_2026-04-23-150044_atlas.diag` | 8.59 GB | 13715 s | 626 KB/s |
| `EXO_2026-04-24-062837_atlas.diag` | **34.36 GB** | 55673 s (≈15 h) | 617 KB/s |
| `EXO_2026-04-29-125940_atlas.diag` | 2.15 GB | 3463 s | 620 KB/s |

Each report's headline has this form:

```
Event:            disk writes
Action taken:     none
Writes:           <N> MB of file backed memory dirtied over <secs> seconds (<rate> KB/sec average),
                  exceeding limit of 24.86 KB per second over 86400 seconds
```
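
The limit in that headline also explains why the smallest reports all land on about 2.15 GB: 24.86 KB/sec sustained over 86400 seconds is the entire daily budget. A rough worked check, assuming decimal units (1 KB = 1000 bytes) and taking 620 KB/sec from the most recent report as the observed rate:

```swift
import Foundation

// Figures copied from the report headline and the table above.
let limitKBPerSec = 24.86        // per-process limit from the report headline
let windowSeconds = 86_400.0     // the 24-hour averaging window
let observedKBPerSec = 620.0     // sustained rate in the most recent report

// Daily write budget implied by the limit, in KB (decimal units assumed).
let dailyBudgetKB = limitKBPerSec * windowSeconds

// Time needed to use up the whole daily budget at the observed rate.
let secondsToExhaust = dailyBudgetKB / observedKBPerSec

print(String(format: "daily budget: %.2f GB", dailyBudgetKB / 1_000_000))
print(String(format: "exhausted after: %.0f s (%.0f min)",
             secondsToExhaust, secondsToExhaust / 60))
```

This comes out at about 2.15 GB and a little under an hour, which lines up with the 3463 s duration of the 620 KB/s report: the app appears to spend its full 24-hour allowance in roughly 58 minutes.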

## Root cause

The Heaviest Stack on every report (177 of 185 samples on the most recent one; 3066 of 3116 samples on the 15-hour one) is:

```
start_wqthread
  _pthread_wqthread
    _dispatch_workloop_worker_thread
      _dispatch_root_queue_drain_deferred_wlh
        _dispatch_lane_invoke
          _dispatch_lane_serial_drain
            _dispatch_client_callout
              _dispatch_block_async_invoke2
                invocation function for block in __CFURLCache::CreateAndStoreCacheNode(...)
                  write + 8 (libsystem_kernel.dylib)
                  _CFURLCacheFSWriteCachedResponseToFS
```

That stack accounts for 96–98% of the samples in those two reports, and it is CFNetwork's URL response cache writing cached response bodies to disk.

The relevant code is in `app/EXO/EXO/Services/ClusterStateService.swift`:

- `init` defaults `session: URLSession = .shared`.
- `URLSession.shared` ships with `URLCache.shared` attached, which has a nonzero `diskCapacity` by default and therefore persists responses to disk.
- `startPolling(interval:)` defaults to **0.5 s** and calls `fetchSnapshot()` on every tick — that's `GET /state` against the local exo Python server twice per second.
- The per-`URLRequest` cache policy is set to `.reloadIgnoringLocalCacheData`, but that **only affects read behavior** — the response is still written to the URL cache after each successful fetch. (See [Apple's docs on `URLRequest.CachePolicy`](https://developer.apple.com/documentation/foundation/nsurlrequest/cachepolicy).)

So every snapshot poll persists its response body to disk, regardless of whether the client will ever read it back.
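
One way to confirm this from inside the app is to log the shared cache's own counters while polling is active. A minimal diagnostic sketch using only public `URLCache` properties; the `CacheUsage` type, the `sharedCacheUsage()` helper, and the `http://localhost/state` URL are placeholders for illustration, not names from the EXO codebase. It would need to run in the app process itself, since `URLCache.shared` is per-process:

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

/// Point-in-time view of the shared URL cache (hypothetical helper type).
struct CacheUsage {
    let diskCapacity: Int
    let diskUsage: Int
    let memoryUsage: Int
}

/// Reads the counters of `URLCache.shared`, the cache attached to `URLSession.shared`.
func sharedCacheUsage() -> CacheUsage {
    let cache = URLCache.shared
    return CacheUsage(
        diskCapacity: cache.diskCapacity,
        diskUsage: cache.currentDiskUsage,
        memoryUsage: cache.currentMemoryUsage
    )
}

// The request-level policy described in the list above: it bypasses cache
// reads, but per this issue it does not stop the response from being stored.
// The URL is a placeholder; the real host and port are not shown here.
var request = URLRequest(url: URL(string: "http://localhost/state")!)
request.cachePolicy = .reloadIgnoringLocalCacheData

let usage = sharedCacheUsage()
print("disk capacity: \(usage.diskCapacity) bytes, disk usage: \(usage.diskUsage) bytes")
```

Logging `sharedCacheUsage()` every few minutes during an idle polling session should show whether `diskUsage` keeps changing even though nothing ever reads from the cache.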

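`/state` responses scale with the number of models loaded × peers × instances and can easily reach tens of KB. At 2 Hz, the ~500–620 KB/sec in the diagnostic reports corresponds to roughly 250–310 KB of file-backed memory dirtied per poll, so every fetch costs a full response body plus whatever bookkeeping the cache does around it.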

## Why this matters

- **SSD wear** — 34 GB of cache writes for a 15-hour idle-ish polling session is gratuitous. Internal SSDs on Mac Studios can't be replaced without sending the unit to Apple.
- **Background CPU** — `_dispatch_block_async_invoke2 → write` sustained on a worker thread.
- **Cache directory growth** — `~/Library/Caches/exolabs.EXO/` accumulates indefinitely.
- **macOS resource-limit microstackshots** — macOS tags the process as "noisy on disk" (Action taken: none today, but the OS may escalate over time).

Cross-checked: zero microstackshot reports on the same network's other M3 Ultra running EXO 1.0.71 and serving inference but **not** running the EXO macOS app shell (the Swift menubar process is what hits this — the headless Python `exo` CLI alone does not). That confirms the issue is in the Swift shell's URL cache behavior, not in the Python core.

## Suggested fix

Switch `ClusterStateService`'s default session from `URLSession.shared` to an ephemeral session with `urlCache = nil`. Cluster-state responses are time-sensitive and small; nothing benefits from being cached on disk.

```swift
private static func makeNonCachingSession() -> URLSession {
    let config = URLSessionConfiguration.ephemeral
    config.urlCache = nil
    config.requestCachePolicy = .reloadIgnoringLocalCacheData
    return URLSession(configuration: config)
}
```
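
A sketch of how that helper might be wired in as the default session. `StatePoller` is a hypothetical stand-in for `ClusterStateService`, whose real initializer is not reproduced here; the helper is also made non-private in this sketch so it can be used as a default argument:

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

/// Hypothetical stand-in for ClusterStateService; only the session wiring is shown.
final class StatePoller {
    let session: URLSession

    // Callers (and tests) can still inject their own session.
    init(session: URLSession = StatePoller.makeNonCachingSession()) {
        self.session = session
    }

    /// Same body as the suggested fix above.
    static func makeNonCachingSession() -> URLSession {
        let config = URLSessionConfiguration.ephemeral
        config.urlCache = nil  // no cache object, so nothing to store responses in
        config.requestCachePolicy = .reloadIgnoringLocalCacheData
        return URLSession(configuration: config)
    }
}

let poller = StatePoller()
print("session has a URL cache: \(poller.session.configuration.urlCache != nil)")
```

Keeping the `session` parameter injectable means existing tests that pass a mock or custom session would not need to change.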

PR with this fix incoming as a follow-up.

## Alternative considered

App-wide `URLCache.shared = URLCache(memoryCapacity: 0, diskCapacity: 0)` at app launch. This would also cover `BugReportService` (which uses `URLSession.shared` for crash report uploads) and any future callers. It's a one-line change but has a larger blast radius: `Sparkle.framework` and any other in-process code that relies on the shared cache would also lose caching. The per-service fix is the minimum surgical change.

Happy to switch the PR to the app-wide approach if maintainers prefer.
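
For comparison, the app-wide variant would look roughly like this. The `disableSharedURLCache()` name is hypothetical, and where it gets called (app delegate, `App.init`, or similar) depends on how the shell is structured; it should run early at launch, before requests are issued through the shared session:

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

/// Replaces the process-wide cache with one that has no memory or disk budget.
func disableSharedURLCache() {
    URLCache.shared = URLCache(memoryCapacity: 0, diskCapacity: 0)
}

disableSharedURLCache()
print("shared disk capacity: \(URLCache.shared.diskCapacity)")
```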

## Reproduction

1. Run EXO 1.0.71 on macOS 26.x (any recent version).
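2. Let it idle (no inference) for at least an hour. At the ~500–620 KB/sec measured above, the 2.15 GB daily write budget (24.86 KB/sec × 86400 s) takes about an hour to exhaust.
3. Check `/Library/Logs/DiagnosticReports/` for `EXO_*.diag` files.
4. The first report appears once macOS detects that the per-process daily disk-write average has been exceeded.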

The headless Python `exo` CLI does not reproduce — only the macOS menubar app.
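
Step 3 can be scripted if that is more convenient. A small sketch that lists matching reports; the `exoDiagReports(in:)` helper is a hypothetical name, and reading the directory may require appropriate permissions on some setups:

```swift
import Foundation

/// Keeps only names shaped like `EXO_*.diag`, sorted so the oldest timestamp comes first.
func exoDiagReports(in fileNames: [String]) -> [String] {
    fileNames
        .filter { $0.hasPrefix("EXO_") && $0.hasSuffix(".diag") }
        .sorted()
}

let reportsDir = "/Library/Logs/DiagnosticReports"
// contentsOfDirectory throws if the directory is missing or unreadable;
// this sketch treats that case as "no reports".
let names = (try? FileManager.default.contentsOfDirectory(atPath: reportsDir)) ?? []
for name in exoDiagReports(in: names) {
    print(name)
}
```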

