Skip to content

v0.8.66: Release gate for multi sub-agent fanout freeze #3800

Description

@Hmbown

Problem

A high-fanout parent turn can make the TUI appear frozen while launching or completing many sub-agents. The current repro is a user-triggered fanout of roughly 20 agent calls: sidebar state changes, input responsiveness stalls, and completions appear to roll in slowly or inconsistently.

This issue is the release-gate umbrella for the v0.8.66 freeze work. It should stay open until CodeWhale can launch and observe a large sub-agent fanout without freezing the UI or losing critical completion state.

Verified evidence from current tree

  • agent inherits supports_parallel() == false from the default tool spec (crates/tui/src/tools/spec.rs), so each agent call is planned as a serial tool batch.
  • Non-parallel tools take the write side of the global tool execution lock in crates/tui/src/core/engine/tool_execution.rs.
  • turn_loop.rs already documents the user-visible freeze symptom for serial fanout: six agent calls resolving routes under the global tool lock can read as a hard TUI freeze.
  • The engine op channel is bounded at 32 and the event channel at 256 (crates/tui/src/core/engine.rs). UI code awaits Op::ListSubAgents after event-drain batches.
  • Sub-agent completion UI events use try_send, so under event-channel pressure they can be dropped from the UI stream even if parent completion signaling uses a separate path.
  • ListSubAgents and sub-agent completions contend on the SubAgentManager write lock; manager update paths can persist JSON state synchronously while holding that lock.
  • The async TUI loop still calls blocking std::sync::Mutex::lock() on the shell manager in sidebar/live-output refresh paths.

Critical framing

Do not treat this as the old v0.8.61 freeze unless fresh measurements prove it is the same failure mode. The earlier speculative “blocking I/O starves the worker pool” theory was disproven. This lane should fix the currently observed fanout/backpressure paths with targeted evidence, not a blanket spawn_blocking patch.

Required outcome

A parent turn that requests 20 sub-agents should remain interruptible and visibly alive:

  • TUI input polling continues while launches are in progress.
  • The sidebar/status view updates without blocking the event loop.
  • Launching N agents should not take roughly N × route-resolution/spawn latency when the work is independent and within configured sub-agent limits.
  • Completion events that matter to parent-turn correctness are not lost; UI refresh events can be coalesced but must converge to correct state.
  • Cancelling the parent turn interrupts any remaining launch queue quickly and leaves well-formed tool results.

Acceptance criteria

  • Add an automated or documented release-gate repro that launches a high-fanout sub-agent batch (target: 20) and records launch latency, input responsiveness, event backlog, and completion convergence.
  • The repro passes with no multi-second TUI input freeze on a normal local checkout.
  • The fix set includes targeted tests for tool dispatch batching, engine/op channel backpressure, ListSubAgents coalescing, and manager persistence/locking behavior.
  • Runtime logs include enough bounded diagnostics to distinguish launch serialization, event-channel pressure, op-channel pressure, and manager-lock contention.
  • The implementation does not remove sub-agent depth/concurrency configurability and does not reintroduce old lifecycle/delegate tool surfaces.

Child work

Create/track child issues for:

  • parallel-safe agent launch dispatch,
  • engine/TUI channel backpressure,
  • nonblocking/coalesced sub-agent sidebar refresh,
  • shell-manager lock removal from async UI hot paths,
  • SubAgentManager lock/persistence hot-path cleanup.

Tracked child issues

Security / policy guardrails

Responsiveness fixes must not change the authority model. The implementation must preserve these constraints:

  • Parallel agent launch must not bypass approval policy, sandbox policy, tool gates, sub-agent depth limits, sub-agent concurrency limits, token/budget limits, or cancellation.
  • Parallelism should only reduce waiting to start independent children; it must not grant children more authority than a serial launch would.
  • Critical events must remain durable or recoverable: approval prompts, tool results, fatal errors, user-input requests, cancellation, parent completion, and terminal sub-agent state. Only refresh/status events may be coalesced or dropped, and only if the next snapshot converges to correct state.
  • Cleanup cannot disappear when sidebar listing becomes read-only. Stale child timeout/cancel enforcement must remain periodic or event-driven.
  • UI snapshots may become stale under contention, but enforcement must never depend on a successful UI refresh.
  • Persistence changes must keep atomic writes, symlink/path hardening, and recoverable terminal state.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingrelease-blockerMust be fixed before the next releasereliabilityReliability, flaky behavior, retries, fallbacks, and robustnesssubagentsSub-agent orchestration, lifecycle, and completion handlingtuiTerminal UI behavior, rendering, or interactionv0.8.66Targeting v0.8.66

    Projects

    Status
    Done

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions