
v0.8.66: Move sub-agent state persistence out of manager write-lock hot paths #3805

Description

@Hmbown

Problem

Sub-agent manager update paths can perform synchronous JSON serialization and file writes while the manager write lock is held. Under high fanout, launch, completion, and list operations contend on that write lock, so every lock-held persist lengthens the wait for all other callers and can amplify stalls.

Parent: #3800

Verified evidence

  • SubAgentManager::spawn inserts an agent and calls persist_state_best_effort() before returning the snapshot.
  • update_from_result / update_failed update terminal state and call persist_state_best_effort() on change.
  • persist_state_best_effort() calls persist_state(), which calls write_json_atomic.
  • write_json_atomic performs serde_json::to_string_pretty, fs::create_dir_all, fs::write, and fs::rename synchronously.
  • These methods are reached while callers hold Arc<RwLock<SubAgentManager>>::write().await.

Critical framing

Earlier broad claims that blocking I/O starved the worker pool were disproven for the old freeze. This issue should target only the manager-lock critical section: make the lock-held work small and measurable. Do not add a blanket speculative spawn_blocking patch without proving lock contention improves.

Suggested implementation options

  • Build the serializable state snapshot under the manager lock, then release the lock before serialization and disk I/O.
  • Coalesce terminal persists during completion bursts while guaranteeing final state is flushed.
  • Keep existing debounce behavior for hot per-step checkpoint paths, but ensure launch/completion persists do not monopolize the manager write lock.
  • Add timing instrumentation around manager lock hold duration and persist duration.
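
The first and last options above could be sketched together as follows. This is a minimal, std-only illustration and not the project's code: `Manager`, `PersistTimings`, and `spawn_and_persist` are hypothetical names, `std::sync::RwLock` stands in for the async lock behind `Arc<RwLock<SubAgentManager>>`, and a tab-separated line format stands in for `serde_json`. The point is only the shape: clone an owned snapshot under the lock, drop the guard, then serialize and write, timing each phase separately.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::Path;
use std::sync::RwLock;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the real manager; only the agent map matters here.
#[derive(Default)]
pub struct Manager {
    pub agents: BTreeMap<String, String>, // agent id -> status
}

/// The four phases the acceptance criteria ask instrumentation to separate.
#[derive(Debug)]
pub struct PersistTimings {
    pub lock_wait: Duration,
    pub lock_hold: Duration,
    pub serialize: Duration,
    pub disk_write: Duration,
}

/// Insert an agent, copy the state under the lock, and persist only after
/// the lock has been released.
pub fn spawn_and_persist(
    manager: &RwLock<Manager>,
    id: &str,
    path: &Path,
) -> io::Result<PersistTimings> {
    let wait_start = Instant::now();
    let mut guard = manager.write().expect("manager lock poisoned");
    let lock_wait = wait_start.elapsed();

    let hold_start = Instant::now();
    guard.agents.insert(id.to_owned(), "running".to_owned());
    let snapshot = guard.agents.clone(); // owned copy, no borrow of the guard
    drop(guard); // lock released before any serialization or disk I/O
    let lock_hold = hold_start.elapsed();

    // Serialization works on the owned snapshot (assumed format: id<TAB>status).
    let ser_start = Instant::now();
    let body: String = snapshot
        .iter()
        .map(|(agent, status)| format!("{agent}\t{status}\n"))
        .collect();
    let serialize = ser_start.elapsed();

    // Write to a sibling temp file, then rename it over the destination.
    let write_start = Instant::now();
    let tmp = path.with_extension("tmp");
    fs::write(&tmp, body)?;
    fs::rename(&tmp, path)?;
    let disk_write = write_start.elapsed();

    Ok(PersistTimings { lock_wait, lock_hold, serialize, disk_write })
}
```

With this shape, `lock_hold` covers only the map insert and the clone, so a slow disk shows up in `disk_write` instead of in other callers' `lock_wait`. One caveat the sketch does not handle: once writes happen outside the lock, two concurrent callers can finish out of order, so a real implementation would need a single writer or a sequence check to keep an older snapshot from overwriting a newer one.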

Acceptance criteria

  • Manager write-lock hold time during spawn/completion does not include synchronous disk I/O.
  • Persistence remains atomic and recoverable after process interruption.
  • Tests cover spawn/completion persistence and state reload after coalesced writes.
  • Instrumentation can distinguish manager lock wait, manager lock hold, serialization, and disk write time.
  • The 20-agent release gate in #3800 (v0.8.66: Release gate for multi sub-agent fanout freeze) shows that completion and listing no longer bunch behind persistence.
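
For the atomicity and reload criteria, a test could take roughly the following shape. This is a sketch under assumptions: `write_atomic` and `reload` are hypothetical helpers (not the existing `write_json_atomic`), the on-disk format is an illustrative `id<TAB>status` line per agent, and the rename is assumed to replace the destination in one step, which holds on POSIX filesystems when both paths are on the same filesystem.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Sibling temp file used while a write is in flight (hypothetical naming).
fn temp_path(path: &Path) -> PathBuf {
    path.with_extension("tmp")
}

/// Write `body` to a temp file, sync it, then rename it over `path`.
/// Readers see either the old complete file or the new complete file.
pub fn write_atomic(path: &Path, body: &str) -> io::Result<()> {
    if let Some(parent) = path.parent() {
        fs::create_dir_all(parent)?;
    }
    let tmp = temp_path(path);
    let mut file = fs::File::create(&tmp)?;
    file.write_all(body.as_bytes())?;
    file.sync_all()?; // flush the bytes to disk before they become visible
    fs::rename(&tmp, path)
}

/// Load `id<TAB>status` lines from the last completed write.
/// A temp file left behind by an interrupted write is discarded, never read.
pub fn reload(path: &Path) -> io::Result<Vec<(String, String)>> {
    let _ = fs::remove_file(temp_path(path)); // best effort; usually absent
    let text = fs::read_to_string(path)?;
    Ok(text
        .lines()
        .filter_map(|line| {
            let (id, status) = line.split_once('\t')?;
            Some((id.to_owned(), status.to_owned()))
        })
        .collect())
}
```

A test can then simulate interruption by writing a complete state with `write_atomic`, dropping a truncated file at the temp path, and asserting that `reload` still returns the complete state with terminal statuses intact.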

Security / policy guardrails

Persistence refactors must preserve state integrity:

  • Build a consistent serializable snapshot while holding the manager lock, then release the lock before expensive serialization/disk I/O.
  • Preserve atomic write behavior and existing symlink/path hardening.
  • Do not lose terminal completion/failure/cancellation state during coalesced writes; final state must flush on terminal transitions and shutdown.
  • Recovery after restart must not resurrect already-terminal agents as running or hide failed/cancelled children.
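
One way to coalesce without losing terminal state is a latest-wins slot, sketched below. `PersistCoalescer` is a hypothetical type, and the sketch assumes every submitted snapshot is a full copy of manager state (not a delta), so dropping an intermediate snapshot loses nothing as long as the newest one is eventually written.

```rust
use std::sync::Mutex;

/// Latest-wins slot: a burst of submissions collapses into one pending snapshot.
#[derive(Default)]
pub struct PersistCoalescer {
    pending: Mutex<Option<String>>,
}

impl PersistCoalescer {
    /// Replace any pending snapshot with a newer full snapshot. No I/O here.
    pub fn submit(&self, snapshot: String) {
        *self.pending.lock().expect("coalescer lock poisoned") = Some(snapshot);
    }

    /// Write the newest pending snapshot, if there is one.
    /// Returns true when a write happened. Intended to be called after
    /// terminal transitions and on shutdown so the final state is flushed.
    pub fn flush<F: FnMut(&str)>(&self, mut write: F) -> bool {
        // The slot's guard is a temporary dropped at the end of this
        // statement, so the slow `write` callback runs without holding it.
        let taken = self.pending.lock().expect("coalescer lock poisoned").take();
        match taken {
            Some(snapshot) => {
                write(&snapshot);
                true
            }
            None => false,
        }
    }
}
```

If three completions submit snapshots before the writer runs, a single `flush` writes only the third, and a second `flush` is a no-op. The guarantee in the guardrails above then reduces to one rule: always call `flush` after the last terminal transition and during shutdown.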

Metadata

Assignees

No one assigned

    Labels

    bug: Something isn't working
    release-blocker: Must be fixed before the next release
    reliability: Reliability, flaky behavior, retries, fallbacks, and robustness
    subagents: Sub-agent orchestration, lifecycle, and completion handling
    tui: Terminal UI behavior, rendering, or interaction
    v0.8.66: Targeting v0.8.66

    Projects

    Status
    Done

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests
