Skip to content

fix(channel-runtime): Lark /delete-agent tombstone-not-reflected race#413

Merged
eanzhao merged 6 commits intodevfrom
fix/2026-04-25_delete-agent-tombstone-visibility
Apr 25, 2026
Merged

fix(channel-runtime): Lark /delete-agent tombstone-not-reflected race#413
eanzhao merged 6 commits intodevfrom
fix/2026-04-25_delete-agent-tombstone-visibility

Conversation

@eanzhao
Copy link
Copy Markdown
Contributor

@eanzhao eanzhao commented Apr 25, 2026

Summary

  • Lark /delete-agent <id> confirm was returning "Delete submitted but registry tombstone is not yet reflected" and leaving the agent visible in /agents. Three compounding causes, all fixed.
  • Same fix lifts /disable-agent, /enable-agent, and AgentDeliveryTargetTool upsert/delete onto a 15 s, projection-primed wait.

Root cause

# Cause Effect
1 Polling budget = 10 × 500 ms = 5 s Below the 95th percentile of production projection lag; tool fell through to "accepted" while the document was still materializing.
2 Projection scope grain not guaranteed alive at dispatch time If the Orleans projection scope grain idle-deactivated, the actor's CommittedStateEventPublished for the tombstone landed in the stream with no subscribed materializerDeleteAsync was never called and the document persisted indefinitely. The activation contract is "be alive when the event arrives", not "replay missed events", so a late prime cannot recover an event the projector already missed. delete_agent did no priming at all; existing create_agent primed after dispatch (same race, narrower window).
3 Polling was absence-only, no version anchor Any stale-replica read counted as "still present" and consumed a poll attempt without progress.

OrleansActor.HandleEventAsync is fire-and-forget (return _streams.GetStream(Id).ProduceAsync(envelope, ct)) — await only signals "envelope queued", not "handler ran". So the post-dispatch wait must be defensive.

Fix

  • agents/Aevatar.GAgents.ChannelRuntime/ProjectionWaitDefaults.cs (new)
    • Shared internal static class with Attempts = 30 / DelayMilliseconds = 500 (15 s budget). Used as default values for constructor parameters in both tools below; tests inject shrunk values per-instance via the constructor instead of mutating process-global state.
  • agents/Aevatar.GAgents.ChannelRuntime/AgentBuilderTool.cs
    • Per-instance _projectionWaitAttempts / _projectionWaitDelayMilliseconds readonly fields populated from constructor parameters (defaulted to ProjectionWaitDefaults).
    • DeleteAgentAsync now: capture versionBeforeEnsureUserAgentCatalogProjectionAsync → dispatch disable to skill-runner → revoke API key → dispatch tombstone to registry → WaitForTombstoneReflectedAsync (version-aware, exits early on document deletion). Priming runs before every dispatch so the projection scope is alive when the actor publishes; a late prime cannot recover a missed event.
    • DisableAgentAsync / EnableAgentAsync now also prime before dispatch (previously relied on WaitForAgentStatusAsync's late prime, which was a no-op for the missed-event case).
    • WaitForAgentStatusAsync is now version+status dual-gated (mirrors WaitForCreatedAgentAsync) and returns (bool Confirmed, UserAgentCatalogEntry? Entry): the caller captures versionBefore before dispatch and the helper waits for StateVersion > versionBefore && Status == expected. When the budget exhausts the helper returns (Confirmed: false, Entry: null)no un-gated terminal GetAsync fallback that would surface a stale-but-expected-looking entry and let callers report success despite the contract not being satisfied. Both halves of the gate matter: status alone admits stale replicas that surface a historically expected status (e.g., from a prior disable→enable→disable cycle) before this dispatch materializes; version alone admits unrelated state events that advance the version without changing status. An earlier iteration captured versionBefore inside the helper after dispatch, which let a fast projection make versionAfter == versionBefore and burned the budget — that's why capture must happen before dispatch.
    • DisableAgentAsync / EnableAgentAsync branch on observation.Confirmed: confirmed → success copy ("Agent disabled. Scheduling paused." / "Agent enabled. Scheduling resumed.") with the projector-observed entry; unconfirmed → surface the pre-dispatch entry (preserving the original status field) plus an honest propagating note ("Disable submitted. Run /agent-status in a few seconds to confirm the agent is paused."), symmetric with DeleteAgentAsync's deleted vs accepted / "Tombstone is propagating" pattern.
    • Fallback copy when budget exhausts: "Tombstone is propagating. Run /agents in a few seconds to confirm the agent is gone." (no longer suggests the dispatch may not have landed).
  • agents/Aevatar.GAgents.ChannelRuntime/AgentDeliveryTargetTool.cs
    • Same per-instance constructor-injected budget (replaces the previous private const, which prevented tests from exercising the budget-exhausted branch).
    • DeleteAsync mirrors the AgentBuilderTool pattern (prime-before-dispatch + version-aware tombstone wait + 15 s). UpsertAsync picks up the longer budget through the same fields.

Other slash commands reviewed

Command Status
/agents, /agent-status, /templates Pure reads, no actor dispatch — unaffected.
/run-agent Fire-and-forget by design (user follows up with /agent-status); no polling, no race.
/disable-agent, /enable-agent Same root cause — fixed via prime-before-dispatch + dual-gated (StateVersion > versionBefore && Status == expected) WaitForAgentStatusAsync.
/daily, /social-media (→ create_agent) Now also primes before dispatch (was post-dispatch — same race-class as the original delete bug). Wait helper continues to use the existing 5 s budget; if create surfaces the budget-exhausted symptom in the future, switching the helper defaults to ProjectionWaitDefaults is a one-liner.
agent_delivery_targets upsert / delete Fixed (constructor-injected budget + prime-before-dispatch on both upsert and delete). List is read-only.

Test plan

  • dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj — 417 / 417 pass in ~6 s
  • bash tools/ci/test_stability_guards.sh — passed
  • bash tools/ci/query_projection_priming_guard.sh — passed
  • bash tools/ci/projection_route_mapping_guard.sh — passed
  • bash tools/ci/projection_state_version_guard.sh — passed
  • bash tools/ci/projection_state_mirror_current_state_guard.sh — passed
  • ExecuteAsync_DeleteAgent_ReturnsAcceptedWithPropagatingHint_WhenTombstoneDoesNotReflectWithinBudget covers the fallback branch by injecting a shrunk wait budget through the constructor (projectionWaitAttempts: 3, projectionWaitDelayMilliseconds: 1) and asserts the new copy is shown — no process-global mutation, no try/finally reset.
  • ExecuteAsync_DisableAgent_ReturnsStatusFast_WhenProjectionAdvancesOnFirstPoll pins the dual-gate first-poll exit (versionBefore=42, helper poll sees versionAfter=43 && Status==Disabled → return immediately, < 1 s). A regression that re-introduces status-only polling or moves the version capture back inside the helper makes this test fail.
  • ExecuteAsync_DisableAgent_KeepsWaitingWhenStatusMatchesButVersionStale: stale-replica defense with both path-level and outcome-level assertions. Path: GetStateVersionAsync is called exactly 4 times (1 caller baseline + 3 helper polls). Outcome: when the dual gate never passes, the user-facing payload must surface status=Running (the pre-dispatch value) and a propagating note ("Disable submitted" + "/agent-status") — never "Scheduling paused". A regression that re-introduces the un-gated terminal GetAsync OR drops the Confirmed/Entry branching makes this test fail.
  • Existing disable / enable / workflow-disable tests now mock GetStateVersionAsync to return the caller-side baseline 5L then advance to 6L on the helper poll — exercising the dual-gate fast-exit (suite recovers from ~50 s → ~6 s).
  • After merge: regression-verify on aevatar.eanzhao.com by recreating + deleting a skill-runner via Lark and confirming /agents reflects the delete within ~15 s.

Review-driven changes

  • 948cba58 — drop the helper-side version gate from WaitForAgentStatusAsync (codex P2: versionBefore captured after dispatch let a fast projection burn the full budget). Replaced again in 062f4728 with a caller-side version gate that does not have this race.
  • 5183b0c5 — move priming before every dispatch (delete / disable / enable); switch shared budget from mutable static to constructor-injected per-instance readonly fields; extract ProjectionWaitDefaults.
  • 062f4728 — prime before create_agent dispatches as well (daily report + social media — same race-class as the originally fixed delete path); add caller-side versionBefore capture for disable_agent / enable_agent and dual-gate WaitForAgentStatusAsync on StateVersion > versionBefore && Status == expected (defends against stale read replicas surfacing a historically expected status).
  • be4292d0(this commit) close the dual-gate bypass: WaitForAgentStatusAsync's terminal fallback used to do an un-gated GetAsync and callers used ?? entry.value + unconditional success copy, so stale replicas could still produce "Scheduling paused" despite the dual gate having been violated. Helper now returns (Confirmed, Entry); callers branch on Confirmed and surface honest "submitted / propagating" copy when the gate never passes. The new test for stale-replica defense gained an outcome-level assertion that was missing in 062f4728.

🤖 Generated with Claude Code

The Lark `/delete-agent <id> confirm` flow was returning "Delete submitted
but registry tombstone is not yet reflected" and leaving the agent visible
in subsequent `/agents` listings. Three compounding causes:

* The post-tombstone wait polled the read model for only 10 attempts ×
  500 ms = 5 s. Production projection latency (actor -> committed event ->
  projection scope -> projector -> Elasticsearch) routinely exceeds this
  budget, so the polling loop fell through to the "accepted" branch
  while the document was still being materialized.
* Unlike `create_agent`, the delete path never called
  `EnsureUserAgentCatalogProjectionAsync`. If the projection scope grain
  had idle-deactivated, the actor's published `CommittedStateEventPublished`
  for the tombstone landed in the stream with no subscribed materializer
  to translate it into a `DeleteAsync` on the document store - the
  document then persisted indefinitely.
* The polling was absence-only, with no version anchor. Any read-model
  blip (stale replica, cached scope) that returned the pre-tombstone
  document was treated as "still present" and consumed an attempt.

Mirrors the create_agent pattern so the user-visible delete now:
  1. Captures the read-model state version before issuing the tombstone.
  2. Calls EnsureUserAgentCatalogProjectionAsync after dispatching the
     tombstone so a deactivated scope reactivates before the projector
     would otherwise miss the committed event.
  3. Waits 30 × 500 ms = 15 s using a version-aware helper that returns
     successful as soon as either the document is deleted (GetStateVersion
     returns null) or version advances past the captured version with
     GetAsync returning null.
  4. When the budget is exhausted, returns the same "accepted" payload
     but with a copy that points the user at /agents to verify
     ("Tombstone is propagating. Run /agents in a few seconds...")
     rather than implying the tombstone might not have landed.

Same fix applied to /disable-agent + /enable-agent (WaitForAgentStatusAsync
now version-aware and primes the projection scope) and to
AgentDeliveryTargetTool.DeleteAsync. AgentDeliveryTargetTool.UpsertAsync
also picks up the longer budget through the shared constants.

Tests:
* New `ExecuteAsync_DeleteAgent_ReturnsAcceptedWithPropagatingHint_*`
  covers the not-reflected branch by shrinking the wait budget via
  internal-only static fields and asserting the new copy.
* Existing disable / enable / workflow-disable tests now mock
  `GetStateVersionAsync` to advance on the second poll, exercising the
  version-aware fast-exit path instead of timing out the full 15 s.
* All 403 tests in Aevatar.GAgents.ChannelRuntime.Tests pass; suite
  drops from ~50 s back to ~6 s once the mocks are aligned with the
  version-aware wait.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9e984f981

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +791 to +793
var versionAfter = await queryPort.GetStateVersionAsync(agentId, ct) ?? -1;
if (versionAfter <= versionBefore)
continue;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check status even when state version has not advanced

WaitForAgentStatusAsync now skips GetAsync unless GetStateVersionAsync is strictly greater than the captured baseline, but this method is called after dispatching disable_agent/enable_agent. If the projection applies the status change before the baseline read, versionAfter never exceeds versionBefore, so the loop waits the full 15-second budget before returning even though the agent is already in the expected status. This introduces a noticeable latency regression for fast-path updates in production.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — pushed 948cba58 removing the version gate from WaitForAgentStatusAsync. The status field is the authoritative signal (enum-like, only flips on lifecycle event commit), so polling it directly handles both the slow-projection case and the codex-flagged fast-projection-before-baseline race. Added ExecuteAsync_DisableAgent_ReturnsStatusFast_WhenProjectionUpdatedBeforeBaselineRead to pin the no-deadlock behavior with GetStateVersionAsync returning a flat value. WaitForTombstoneReflectedAsync keeps its version anchor because its caller (DeleteAgentAsync) captures versionBefore before dispatch and the absence-vs-presence signal benefits from ignoring stale replicas.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.39%. Comparing base (6131ed7) to head (be4292d).
⚠️ Report is 7 commits behind head on dev.

@@           Coverage Diff           @@
##              dev     #413   +/-   ##
=======================================
  Coverage   70.38%   70.39%           
=======================================
  Files        1175     1175           
  Lines       84452    84452           
  Branches    11124    11124           
=======================================
+ Hits        59441    59446    +5     
+ Misses      20719    20716    -3     
+ Partials     4292     4290    -2     
Flag Coverage Δ
ci 70.39% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.
see 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Codex P2 (r3141672095): the version-gated fast-exit on
`WaitForAgentStatusAsync` was wrong. Because the helper is called *after*
dispatching `disable_agent`/`enable_agent`, a fast Local actor + projection
could land the new status before the helper captured `versionBefore`. In
that race the loop saw `versionAfter == versionBefore` on every poll and
burned the entire 15 s budget despite the read model already being
correct - a noticeable latency regression for fast-path updates.

The version anchor only buys a stale-replica defense, but the `Status`
field is enum-like and only flips when the lifecycle event materializes,
so reading it on every attempt is both correct and cheap. Drop the gate;
keep the longer wait window and the `EnsureUserAgentCatalogProjectionAsync`
priming added in this PR. `WaitForTombstoneReflectedAsync` keeps its
version anchor because the caller (`DeleteAgentAsync`) captures version
*before* dispatch and the absence-vs-presence signal benefits from
ignoring stale-replica reads of the pre-tombstone document.

Tests:
* New `ExecuteAsync_DisableAgent_ReturnsStatusFast_WhenProjectionUpdatedBeforeBaselineRead`
  pins the codex scenario: `GetStateVersionAsync` returns the same value
  on every poll while `GetAsync` already surfaces the disabled status.
  Asserts the call returns in <1 s. A regression that re-introduces the
  version gate would deadlock until the 15 s budget elapses.
* The temporary `GetStateVersionAsync` mocks I added to the existing
  disable / enable / workflow-disable tests are now unnecessary - the
  status-only path matches the original mock contract.
* All 404 channel runtime tests pass in ~6 s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor Author

@eanzhao eanzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Root cause analysis is solid and the version-aware fast-exit in WaitForTombstoneReflectedAsync is well-reasoned. Three things worth addressing — see inline.

Meta point: PR description says WaitForCreatedAgentAsync is left at the old 5 s budget because creates aren't in user reports. But it's the same pipeline, same projection lag — if production lag legitimately exceeds 5 s for delete/disable, the same lag is silently degrading /daily and /social-media creates today: user gets accepted + "registry projection is not yet confirmed" and the run_immediately branch gets skipped on creates that actually did succeed. Worth aligning to 15 s for symmetry; there's no reason to keep two different timeouts for the same materializer.

for (var attempt = 0; attempt < 10; attempt++)
// Capture the readmodel version before waiting so we can distinguish "the
// expected status hasn't materialized yet" from "the read model is stale".
var versionBefore = await queryPort.GetStateVersionAsync(agentId, ct) ?? -1;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Race: versionBefore is captured after the dispatch already happened.

DisableAgentAsync (line 600) and EnableAgentAsync (line 622) call DispatchAgentLifecycleAsync first and only then call into WaitForAgentStatusAsync, where this line captures versionBefore. Since HandleEventAsync is fire-and-forget (per your own PR description), nothing prevents the actor + projector from already advancing the document version by the time we get to line 780. Then the loop's versionAfter <= versionBefore continue-check waits for another, unrelated event to advance the version — which may never happen, exhausting the full 15 s budget on a flow that actually succeeded.

The fact that the existing disable/enable/workflow-disable tests had to mock GetStateVersionAsync to return 7L then 8L to avoid timing out (PR body: "~50 s → ~6 s once mocks align") is exactly this race surfacing — it's just hidden in production behind the actor not normally being faster than the GetStateVersionAsync round trip.

DeleteAgentAsync (line 509) and both create paths capture versionBefore before dispatch, which is the safer pattern. Suggest the same here: capture in the caller before DispatchAgentLifecycleAsync and pass it as a parameter (mirrors WaitForCreatedAgentAsync's signature).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already resolved in 948cba58 (the version gate was the bug — dropped it from WaitForAgentStatusAsync). The status field is now the authoritative signal: poll GetAsync and check Status == expected on every attempt; no version anchor inside the helper. Pinned with ExecuteAsync_DisableAgent_ReturnsStatusFast_WhenProjectionUpdatedBeforeBaselineRead (mocks GetStateVersionAsync to return a flat value across polls, asserts <1 s). WaitForTombstoneReflectedAsync keeps the version anchor because its caller (DeleteAgentAsync) captures versionBefore before dispatch, and the absence-vs-presence read benefits from filtering stale replicas.

// Internal (not const) so tests can shrink the budget when they need to exercise
// the "did not reflect within budget" branch without burning the full 15 s.
internal static int ProjectionWaitAttempts = 30;
internal static int ProjectionWaitDelayMilliseconds = 500;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutable static budget is a process-global — flaky under future parallel test additions.

The one test that mutates these (ExecuteAsync_DeleteAgent_ReturnsAcceptedWithPropagatingHint_WhenTombstoneDoesNotReflectWithinBudget) is safe today because xUnit serializes tests within a class, and AgentBuilderToolTests is the only consumer. The moment anyone adds a second test class that exercises any polling path on AgentBuilderTool, it will see the shrunk 3 × 1ms values whenever the bug test happens to be running in another collection — classic action-at-a-distance flake.

Two cleaner options:

  • Inject via IOptions<ProjectionWaitOptions> so the budget is per-instance and overridable in DI.
  • Or pass maxAttempts / delayMilliseconds as parameters into WaitForAgentStatusAsync and WaitForTombstoneReflectedAsync, matching the shape WaitForCreatedAgentAsync already has at line 743-744.

Also: AgentDeliveryTargetTool line 20-21 duplicates these constants (as private const, so its tests can't shrink them — meaning there's no equivalent budget-exhausted test on the delivery-target delete path). Worth extracting to a single shared ProjectionWaitDefaults.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 5183b0c5. Switched both tools to per-instance readonly fields populated via constructor parameters with ProjectionWaitDefaults as the shared default source. Tests inject shrunk values directly: new AgentBuilderTool(sp, projectionWaitAttempts: 3, projectionWaitDelayMilliseconds: 1). No process-global mutation, no try/finally reset, no action-at-a-distance flake when collections parallelize. WaitForTombstoneReflectedAsync now also takes maxAttempts/delayMilliseconds parameters with default fallbacks, matching WaitForCreatedAgentAsync's shape. AgentDeliveryTargetTool's previously private const budget is now also constructor-injected so its tests can exercise the budget-exhausted branch the same way.

// committed tombstone event has a subscribed materializer to translate into a
// document delete. Without this priming, an Orleans grain that idle-deactivated
// may queue the tombstone with no consumer.
await EnsureUserAgentCatalogProjectionAsync(ct);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment claims this priming rescues the case where "an Orleans grain that idle-deactivated may queue the tombstone with no consumer" — but the priming runs after line 520 already published the event into the stream. If the projection grain was genuinely deactivated at publish time and the stream provider isn't rewindable / implicit-subscription-with-replay, calling EnsureProjection here is too late: the event is already gone.

This matches the existing create_agent flow so it's not a regression introduced by this PR, but the rationale in the comment is only correct under specific stream-provider semantics. Either:

  • Move the prime call to before HandleEventAsync (defensive — costs one extra activation in the warm-grain case), or
  • Update the comment to be honest about the assumption ("relies on rewindable / implicit-subscription stream provider so late-attached subscribers replay the queued tombstone").

Worth confirming which is actually true before shipping a 15 s wait that depends on it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 5183b0c5. Moved EnsureUserAgentCatalogProjectionAsync to before every dispatch in DeleteAgentAsync, DisableAgentAsync, EnableAgentAsync (and removed the now-redundant call from inside WaitForAgentStatusAsync). AgentDeliveryTargetTool already primed before dispatch on the upsert/delete paths. Activation cost: at most one extra warm-grain round trip, paid up front; semantics: the projection scope is guaranteed alive when the actor publishes, so we no longer rely on rewindable-stream / late-attached-replay assumptions for correctness. Comment text updated to be honest about the rationale ("a late prime can't recover an event the projector already missed"). The create_agent flow has the same shape and is technically subject to the same constraint, but its run_immediately path is the only place where it would matter — left untouched here to keep this PR scoped to the reported bug; happy to send a follow-up if we want it tightened.

Copy link
Copy Markdown
Contributor Author

@eanzhao eanzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

评审反馈:战术补丁 LGTM,但留了几条架构债

用户报告的 bug 确实修了,新增测试也精准锁定了对应回归点(尤其 ExecuteAsync_DisableAgent_ReturnsStatusFast_WhenProjectionUpdatedBeforeBaselineRead 漂亮地把 codex review 的 P2 变成了永久 guard)。下面只列违反 CLAUDE.md 原则或本 PR 内部不一致的点。


1. PR 描述与最终代码不同步(合并前请更新 description)

PR body 里仍写 WaitForAgentStatusAsync upgraded to **capture-version** + prime + 15 s window,但 followup commit 948cba58 "drop version gate on WaitForAgentStatusAsync" 已经显式去掉了 version 锚点,改成 status-only 轮询。followup 的修正方向是对的(dispatch 后捕获的 versionBefore 会让 fast projection 烧光 15s 预算),但 description 没跟上。

2. 架构层面:callsite-priming 模式违反 CLAUDE.md「写侧预挂接 projection」

PR Cause #2 自己点破了真正根因 —— projection scope grain idle-deactivated 时 committed event 没有 subscriber。CLAUDE.md「Projection Pipeline」要求:

committed domain event 必须可观察:write-side 完成 committed event 后必须送入 projection 主链;禁止只落 event store 而不进入可观察流

刷新须通过正式 projection 会话、后台 materializer 或写侧预挂接 projection完成。

现在每个 Tool.cs 端点(create / delete / disable / enable / upsert delivery target / delete delivery target)都自己调一次 EnsureProjectionForActorAsync,projection 生命周期被泄露到所有命令派发器。符合 CLAUDE.md 的两种修法任选其一:

  • 生命周期联动:UserAgentCatalogGAgent 激活时同时激活 projection scope grain(grain lifecycle hook),actor 活着即保证 subscriber 活着。
  • 持久化订阅 + 重放:projection scope 用 durable cursor,grain 失活重激活时从 cursor 回放未消费 committed event(Orleans implicit subscription 的天然能力)。

两种方案任一都让所有 EnsureProjectionForActorAsync 调用消失。本 PR 不必扩大 scope,但建议另开 issue 跟进;否则下一个写新命令端点的开发者还得记得 "先 prime"。

3. 单元测试静态可变状态

ExecuteAsync_DeleteAgent_ReturnsAcceptedWithPropagatingHint_WhenTombstoneDoesNotReflectWithinBudget 用 try/finally 修改 AgentBuilderTool.ProjectionWaitAttempts(internal static int)。如果未来 xUnit 切到并行 collection 或测试顺序变化,可能跨测试漏读旧值。可接受折中,但属于隐患 —— 注释里加一行 [CollectionDefinition(DisableParallelization = true)] 或者把可调旋钮做成 instance field + DI 会更稳。


具体 inline 见下方两条评论。

承诺方面:已完成 delete-agent + disable/enable + delivery-target delete 的 15s + prime + version-aware 升级;未完成承诺(可接受未完成) create_agent / delivery-target upsert 的预算统一(见 inline #2)。

// committed tombstone event has a subscribed materializer to translate into a
// document delete. Without this priming, an Orleans grain that idle-deactivated
// may queue the tombstone with no consumer.
await EnsureUserAgentCatalogProjectionAsync(ct);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

违反:priming 顺序与本 PR 内其他文件不一致(实际仍有 race)

本行的 EnsureUserAgentCatalogProjectionAsync(ct) 出现在 await registryActor.HandleEventAsync(...)(L520-522)之后。如果 projection scope grain 此刻 idle-deactivated,顺序是:

  1. tombstone 派发到 stream(此时 subscriber 仍未激活)
  2. 才开始 prime projection scope

步骤 1 与步骤 2 之间,事件能否被新激活的 subscriber 消费取决于底层 stream provider 的回放语义,不是 CLAUDE.md「committed domain event 必须可观察」要求的强保证。15s 预算只是把命中概率压低,不能根除。

同 PR 内已有正确范式可对齐

  • AgentDeliveryTargetTool.UpsertAsync (L193-223):先 prime,后 dispatch
  • AgentDeliveryTargetTool.DeleteAsync (本 PR 新增,L300-323):先 prime,后 dispatch
  • 本处 AgentBuilderTool.DeleteAgentAsync:先 dispatch,后 prime

注释 // Mirrors create_agent 是真的 —— 但 create_agent (L403-405) 同样也是 dispatch-before-prime,根因相同的 bug 早就存在,不是本 PR 引入的。建议本 PR 顺手把这里和 create_agent 一起改成:

// 先 prime
await EnsureUserAgentCatalogProjectionAsync(ct);

// 再 dispatch
var registryActor = await actorRuntime.GetAsync(...) ?? await actorRuntime.CreateAsync<...>(...);
await registryActor.HandleEventAsync(...);

这样和同 PR 的 AgentDeliveryTargetTool 完全对齐,也符合 CLAUDE.md「写侧预挂接 projection」的精神。

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已修复 ✅ (commit 5183b0c5)

而且超出了我提议的范围:

  • DeleteAgentAsync(L513-518)— ✅ priming 移到 dispatch 之前
  • DisableAgentAsync(L620-624)— ✅ 新增了 priming(原本完全没有,只靠 helper 内部 late-prime)
  • EnableAgentAsync(L647-651)— ✅ 新增了 priming(同上)
  • WaitForAgentStatusAsync — ✅ 移除了内部那个 "too-late" 的 priming,注释里也明确写了 "Projection scope priming happens in the caller before the dispatch"

rationale 也写得很正:

Activating up front costs at most one extra warm-grain round trip; eliminates the ambiguous comment.

现在四个 callsite 与 AgentDeliveryTargetTool 全部对齐成 prime-before-dispatch,符合「committed domain event 必须可观察」的要求。

//
// Internal (not const) so tests can shrink the budget when they need to exercise
// the "did not reflect within budget" branch without burning the full 15 s.
internal static int ProjectionWaitAttempts = 30;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

违反:同根因不统一治理

本 PR 引入了 ProjectionWaitAttempts = 30 / DelayMilliseconds = 500(15s 预算)作为「shared」常量,但 WaitForCreatedAgentAsync(此文件 L730-732)没切到这两个常量,仍是硬编码:

int maxAttempts = 10,
int delayMilliseconds = 500)

create_agent 走的还是旧的 5s 预算。

PR description 给出的理由是 create_agent... window only matters during creation; not in user reports —— 但根因完全一样:同样的 projection lag,同样的 idle-deactivation,同样的 fire-and-forget HandleEventAsync。create 没暴露只是因为前面有 API key 创建 + Nyx HTTP 给了 projection 更多 wall-clock 启动窗口才碰巧没碰到 race。

下次报障还是这个类。建议:

private async Task<bool> WaitForCreatedAgentAsync(
    IUserAgentCatalogQueryPort queryPort,
    string agentId,
    long versionBefore,
    Func<UserAgentCatalogEntry, bool> predicate,
    CancellationToken ct)
// 删掉 maxAttempts / delayMilliseconds 默认参数,内部直接用共享常量

如果 run_immediately=true 需要更长窗口,把那个特殊性放到调用点,而不是默认参数里散开。

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

部分采纳,可接受 ⚠️(commit 5183b0c5)

作者的取舍:

  • ✅ 把 WaitForTombstoneReflectedAsync 也升级成「外部传入 + 默认 fallback」shape,默认值用 ProjectionWaitDefaults.Attempts/DelayMilliseconds —— shape 对齐了。
  • ❌ 但 WaitForCreatedAgentAsync 自己没改(此文件 L757-761 现在仍是 int maxAttempts = 10, int delayMilliseconds = 500),调用点也仍传 maxAttempts: 10/20 硬编码值。

PR description 里的理由是 "create window only matters during creation; not in user reports",我能接受这个判断 —— create 路径前面有 API key 创建 + Nyx HTTP roundtrip,wall-clock buffer 比 delete 宽很多,5s 在生产里目前够用。

但留个 trip-wire 在这里:下次 create_agent 报障 "projection not yet confirmed" 时,这两行默认值直接改成 ProjectionWaitDefaults.Attempts / ProjectionWaitDefaults.DelayMilliseconds 即可,根因和 delete 完全一样

…ce budget

Three review threads from the second pass:

* r3141706674 (race: versionBefore captured after dispatch) — already
  resolved by 948cba5 dropping the version gate from
  WaitForAgentStatusAsync. The status field is the authoritative signal.

* r3141707010 (priming runs after HandleEventAsync — too late if grain
  idle-deactivated). The previous comment claim that
  EnsureUserAgentCatalogProjectionAsync "rescues" a deactivated grain
  was only correct under specific stream-provider semantics. Move the
  call to *before* every dispatch (DeleteAgentAsync, DisableAgentAsync,
  EnableAgentAsync) so the projection scope is alive when the actor
  publishes. Costs at most one extra warm-grain round trip; eliminates
  the ambiguous comment. Removes the call from inside
  WaitForAgentStatusAsync — the helper is post-dispatch and a late prime
  there cannot recover an event the projector already missed.

* r3141706856 (mutable static budget = action-at-a-distance flake risk).
  Extract `ProjectionWaitDefaults` (shared `internal static` constants)
  and switch both AgentBuilderTool and AgentDeliveryTargetTool to
  per-instance readonly fields populated via constructor parameters.
  Tests inject shrunk values directly:
  `new AgentBuilderTool(sp, projectionWaitAttempts: 3, projectionWaitDelayMilliseconds: 1)`
  No process-global mutation, no try/finally reset, no flake when the
  test surface eventually parallelizes across collections.
  WaitForTombstoneReflectedAsync now also accepts maxAttempts/delayMs
  parameters with default fallbacks, matching the existing shape of
  WaitForCreatedAgentAsync. AgentDeliveryTargetTool's previously
  `private const` budget — which prevented its tests from exercising the
  budget-exhausted branch — is now a constructor-injected field too.

Tests:
* `ExecuteAsync_DeleteAgent_ReturnsAcceptedWithPropagatingHint_*` no
  longer mutates a static; constructs with `projectionWaitAttempts: 3,
  projectionWaitDelayMilliseconds: 1` instead.
* All 404 channel runtime tests pass in ~7 s.
* tools/ci/test_stability_guards.sh, query_projection_priming_guard.sh,
  projection_route_mapping_guard.sh — all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor Author

@eanzhao eanzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

第二轮 review:修复评估

followup commit 5183b0c5 的修复非常干净。打分:

反馈 状态
inline #1 priming 顺序 ✅ 修复 + 扩展到 disable/enable(超出最小要求)
inline #2 WaitForCreatedAgentAsync 默认值 ⚠️ 没改,但作者给出了能接受的理由(create wall-clock buffer 更宽)
body #3 mutable static 测试隐患 ✅ 修复(改成 constructor-injected per-instance readonly fields)
body #1 PR description 与代码不同步 仍未更新(见下)
body #2 callsite-priming 架构债 ⏭ follow-up,不在本 PR 范围

单独说一下 PR description

现在 PR body 与代码已经在 3 个地方不一致:

  1. "Shared internal static int ProjectionWaitAttempts = 30" —— 该静态字段已经被 5183b0c5 删除,改成 ProjectionWaitDefaults + per-instance readonly fields。
  2. "DeleteAgentAsync now: capture versionBeforedispatch tombstoneEnsureUserAgentCatalogProjectionAsync → ..." —— 实际顺序已改为 prime-before-dispatch。
  3. "WaitForAgentStatusAsync upgraded to capture-version + prime + 15 s window" —— version 锚点早在 948cba58 被去掉,prime 也在 5183b0c5 移到 caller。

合并前请同步 description,否则未来读 PR 历史的人会困惑(尤其 "prime after dispatch" 这种描述会让 reader 以为代码里也是这么做的)。

一个值得肯定的点

AgentDeliveryTargetTool 之前的 private const ProjectionWaitAttempts 也被改成了构造函数注入,让 delivery target 的 budget-exhausted 分支也变成可测试的 —— 这超出了我反馈的范围,是主动的质量提升。

其余意见

  • 架构层面的 callsite-priming 模式仍然违反 CLAUDE.md「写侧预挂接 projection」原则,但这是 follow-up issue 范畴,本 PR 不应扩大 scope。建议合并后另开 issue 跟进 "projection scope co-tied with actor lifecycle" 的设计。

PR description 同步后即可合并(从代码角度看 LGTM)。

@eanzhao
Copy link
Copy Markdown
Contributor Author

eanzhao commented Apr 25, 2026

Reviewed latest head 5183b0c5.

Code direction is good: the reported /delete-agent tombstone visibility race is materially improved. delete_agent, disable_agent, enable_agent, and delivery-target delete/upsert now use the longer 15s wait budget, and the dangerous mutable static test hook was replaced with per-instance constructor injection. I also verified locally:

  • dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo -> 404/404 passed
  • bash tools/ci/test_stability_guards.sh -> passed
  • bash tools/ci/query_projection_priming_guard.sh -> passed
  • bash tools/ci/projection_route_mapping_guard.sh -> passed
  • bash tools/ci/projection_state_version_guard.sh -> passed
  • bash tools/ci/projection_state_mirror_current_state_guard.sh -> passed

Remaining issues:

  1. create_agent still has the same late-prime shape. In AgentBuilderTool.cs, daily report dispatches initialize at line 277, optional immediate trigger at 281-283, then only primes projection at 285. Social media dispatches initialize at 415, then primes at 417. Given this PR's own corrected reasoning that "late prime can't recover an event the projector already missed", this leaves the same root-cause class open for create. If this PR is scoped strictly to delete/status, acceptable as follow-up; if the claim is "same root cause fixed", create should prime before dispatch too.

  2. WaitForAgentStatusAsync is now status-only (AgentBuilderTool.cs:815-817). That avoids the bad post-dispatch baseline bug, but it can still accept a stale replica that already had the expected historical status. Since the PR explicitly treats stale read-model replicas as part of the production problem, the stronger shape is: capture versionBefore before disable/enable dispatch, then wait for StateVersion > versionBefore && Status == expected. That preserves the fast-projection case without turning a stale status read into confirmation.

  3. PR description is stale and should be updated before merge. It still says the wait budget is mutable static, that delete order is dispatch tombstone -> EnsureUserAgentCatalogProjectionAsync, and that WaitForAgentStatusAsync is capture-version + prime. The code no longer does those things.

Architecture note: the callsite-level EnsureUserAgentCatalogProjectionAsync pattern is still a workaround. The better direction is a narrow catalog command/observation port or command skeleton that owns prime/bind projection -> dispatch -> accepted receipt -> observe read-model version, and ultimately projection scope lifecycle should be tied to actor/runtime lifecycle so committed events are observable by contract rather than because each tool remembered to prime.

…lifecycle wait

Two follow-up threads from the third review pass (#issuecomment-4318611186):

* create_agent late-prime: CreateDailyReportAgentAsync and
  CreateSocialMediaAgentAsync primed the projection AFTER
  HandleEventAsync, leaving the same race-class open as the originally
  fixed delete path. Move EnsureUserAgentCatalogProjectionAsync to
  before each dispatch so the projection scope is alive when the actor
  publishes initialize/trigger events. Same rationale as
  DeleteAgentAsync's fix in 5183b0c: a late prime cannot recover an
  event the projector already missed.

* WaitForAgentStatusAsync stale-replica defense: status-only polling
  could accept a read replica that surfaces a historically expected
  status (e.g., prior disable->enable->disable cycle's projected
  Disabled state) before the current dispatch materializes. Capture
  versionBefore in the caller (before dispatch — the original 948cba5
  fix removed it because it was being captured AFTER dispatch in the
  helper) and require StateVersion > versionBefore AND Status ==
  expected. Mirrors WaitForCreatedAgentAsync's contract.

Tests:
* DispatchesDisableAndReturnsStatus / DispatchesEnableAndReturnsStatus
  / DispatchesWorkflowDisableAndReturnsStatus: added
  GetStateVersionAsync mocks (caller baseline 5L -> helper poll 6L)
  for the new dual-gate contract.
* ReturnsStatusFast_WhenProjectionAdvancesOnFirstPoll (renamed from
  *_WhenProjectionUpdatedBeforeBaselineRead): rewritten to pin the
  new dual-gate first-poll exit (versionBefore=42, versionAfter=43
  AND status=Disabled -> return immediately, <1s).
* New KeepsWaitingWhenStatusMatchesButVersionStale: stale-replica
  defense — even if GetAsync surfaces an expected-looking Disabled,
  the wait must keep iterating until version advances. Pinned to a
  3-attempt budget injected through the constructor.

Verification:
* dotnet test Aevatar.GAgents.ChannelRuntime.Tests -> 405/405 pass in ~6s
* tools/ci/test_stability_guards.sh -> passed
* tools/ci/query_projection_priming_guard.sh -> passed
* tools/ci/projection_route_mapping_guard.sh -> passed
* tools/ci/projection_state_version_guard.sh -> passed
* tools/ci/projection_state_mirror_current_state_guard.sh -> passed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eanzhao
Copy link
Copy Markdown
Contributor Author

eanzhao commented Apr 25, 2026

Addressed in 062f4728:

(1) create_agent late-prime — fixed. Both CreateDailyReportAgentAsync and CreateSocialMediaAgentAsync now call EnsureUserAgentCatalogProjectionAsync immediately after capturing versionBefore and before actor.HandleEventAsync / TriggerSkillRunnerExecutionCommand / InitializeWorkflowAgentCommand. Same rationale as the delete-path fix in 5183b0c5 — a late prime can't recover an event the projector already missed.

(2) WaitForAgentStatusAsync stale-replica defense — fixed. Switched to your proposed dual-gate shape: caller captures versionBefore before dispatch (avoiding the bug 948cba58 removed, where the capture happened after dispatch in the helper), and the helper waits for StateVersion > versionBefore && Status == expected. Mirrors WaitForCreatedAgentAsync's contract exactly. Status alone would admit stale replicas surfacing a historically expected status; version alone would admit unrelated state events. Both gates needed.

(3) PR description — fixed. Updated to match the current code state (was already updated in the second pass; this commit adds an entry under ## Review-driven changes and corrects the WaitForAgentStatusAsync description from "status-driven (no version anchor)" to dual-gated).

(Architecture note — agreed, deferred.) Co-tying projection scope lifecycle to actor lifecycle (or moving to durable subscription replay) so callsites don't need to remember to prime is the right direction. That removes both the create-path race-class and the delete-path race-class structurally rather than per-callsite. Worth its own issue post-merge.

Verification:

  • dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj — 405/405 pass in ~6s
  • tools/ci/test_stability_guards.sh — passed
  • tools/ci/query_projection_priming_guard.sh — passed
  • tools/ci/projection_route_mapping_guard.sh — passed
  • tools/ci/projection_state_version_guard.sh — passed
  • tools/ci/projection_state_mirror_current_state_guard.sh — passed

New tests pinning the dual-gate contract:

  • ReturnsStatusFast_WhenProjectionAdvancesOnFirstPoll — first-poll exit when versionAfter > versionBefore && Status == expected (< 1 s).
  • KeepsWaitingWhenStatusMatchesButVersionStale — even when GetAsync surfaces an expected-looking status, the wait keeps iterating until version advances. Pinned to a 3-attempt injected budget so a status-only regression is visible (would call GetStateVersionAsync once instead of 4 times).

@eanzhao
Copy link
Copy Markdown
Contributor Author

eanzhao commented Apr 25, 2026

Reviewed 062f4728. The three requested fixes are mostly addressed:

  • create_agent now primes before dispatch on both daily report and social media paths.
  • disable_agent / enable_agent now capture versionBefore before dispatch and use a version+status dual gate.
  • PR description has been updated and now matches the current implementation.

Tests/guards I reran locally:

  • dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo -> 405/405 passed
  • bash tools/ci/test_stability_guards.sh -> passed
  • bash tools/ci/query_projection_priming_guard.sh -> passed
  • bash tools/ci/projection_state_version_guard.sh -> passed
  • bash tools/ci/projection_state_mirror_current_state_guard.sh -> passed
  • bash tools/ci/projection_route_mapping_guard.sh -> passed

One remaining bug in the new dual-gate path:

WaitForAgentStatusAsync correctly short-circuits while versionAfter <= versionBefore, but when the budget exhausts it does return await queryPort.GetAsync(agentId, ct) (AgentBuilderTool.cs:850). The callers then do:

  • var after = await WaitForAgentStatusAsync(...) ?? entry.value;
  • return SerializeAgentStatus(after, "Agent disabled. Scheduling paused.") / "Agent enabled. Scheduling resumed."

So the stale-replica scenario is still observable as a successful user response after the wait budget is exhausted. The new test ExecuteAsync_DisableAgent_KeepsWaitingWhenStatusMatchesButVersionStale actually sets this up: the terminal fallback GetAsync returns a stale-but-expected-looking Disabled entry, but the test only asserts GetStateVersionAsync call count and never asserts the returned payload. As written, the command can still return status=disabled + "Agent disabled. Scheduling paused." even though no StateVersion > versionBefore was observed.

Suggested fix: make the helper return an observation result, not a nullable entry, e.g. (confirmed, entry). If the dual gate never passes, return confirmed=false and have DisableAgentAsync / EnableAgentAsync use honest accepted/propagating copy rather than success copy. At minimum, do not perform the un-gated final GetAsync in WaitForAgentStatusAsync; it bypasses the contract the helper just enforced.

eanzhao and others added 2 commits April 25, 2026 17:49
…gate on budget exhaustion

Reviewer comment 4318670031 caught a real bug in 062f472:

WaitForAgentStatusAsync's terminal fallback was
`return await queryPort.GetAsync(agentId, ct)` — an un-gated read that
bypassed the dual gate the helper had just enforced. Callers
(DisableAgentAsync / EnableAgentAsync) used `?? entry.value` and then
unconditionally serialized the result with the success copy
("Agent disabled. Scheduling paused." / "Agent enabled. Scheduling
resumed."). Net effect: a stale-but-expected-looking replica observed
by the terminal fallback (or a pre-dispatch entry surfaced via the
null-coalesce) would still be reported as confirmed success even though
StateVersion never advanced past versionBefore.

The new test KeepsWaitingWhenStatusMatchesButVersionStale set up exactly
this scenario but only asserted the GetStateVersionAsync call count and
never asserted the returned payload — so the bug was invisible to the
test even though the test "passed".

Fix: change WaitForAgentStatusAsync's return type to
`(bool Confirmed, UserAgentCatalogEntry? Entry)`. Drop the un-gated
final GetAsync; if the dual gate never passes, return (false, null).
DisableAgentAsync / EnableAgentAsync now branch on Confirmed:

  - Confirmed=true:   SerializeAgentStatus(observation.Entry!,
                       "Agent disabled. Scheduling paused.")
  - Confirmed=false:  SerializeAgentStatus(entry.value,
                       "Disable submitted. Run /agent-status in a few
                       seconds to confirm the agent is paused.")

The unconfirmed branch surfaces the pre-dispatch entry (so status
field stays as the pre-disable value) plus an honest "submitted /
propagating" note — symmetric with DeleteAgentAsync's
"deleted" vs "accepted" / "Tombstone is propagating" pattern.

Tests:
* KeepsWaitingWhenStatusMatchesButVersionStale: added outcome
  assertions that the payload status field stays Running (the
  pre-dispatch value) and the note contains "Disable submitted" +
  "/agent-status" rather than "Scheduling paused". This is the
  assertion that should have been there from the start.
* No other test changes — Dispatches*ReturnsStatus and
  ReturnsStatusFast_WhenProjectionAdvancesOnFirstPoll all exercise
  the Confirmed=true path and continue to pass.

Verification:
* dotnet test Aevatar.GAgents.ChannelRuntime.Tests -> 417/417 pass in ~6s
* tools/ci/test_stability_guards.sh -> passed
* tools/ci/query_projection_priming_guard.sh -> passed
* tools/ci/projection_route_mapping_guard.sh -> passed
* tools/ci/projection_state_version_guard.sh -> passed
* tools/ci/projection_state_mirror_current_state_guard.sh -> passed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eanzhao
Copy link
Copy Markdown
Contributor Author

eanzhao commented Apr 25, 2026

Acknowledged — that was a real bug, and the test I added in 062f4728 had path-level assertions but no outcome-level assertions, so the bypass was invisible to it. Fixed in be4292d0:

Helper signature change. WaitForAgentStatusAsync now returns (bool Confirmed, UserAgentCatalogEntry? Entry). The terminal fallback await queryPort.GetAsync(agentId, ct) is gone — when the dual gate never passes, the helper returns (false, null) and refuses to surface a stale-but-expected-looking entry.

Caller branching. DisableAgentAsync / EnableAgentAsync now split on observation.Confirmed:

  • Confirmed=true — success copy ("Agent disabled. Scheduling paused." / "Agent enabled. Scheduling resumed.") with the projector-observed entry.
  • Confirmed=false — surface the pre-dispatch entry (so the status field stays at its real pre-disable/pre-enable value) plus a propagating note ("Disable submitted. Run /agent-status in a few seconds to confirm the agent is paused."), symmetric with DeleteAgentAsync's deleted / accepted + "Tombstone is propagating" pattern.

Test. KeepsWaitingWhenStatusMatchesButVersionStale gained the missing outcome assertions:

doc.RootElement.GetProperty("status").GetString().Should().Be(SkillRunnerDefaults.StatusRunning);
var note = doc.RootElement.GetProperty("note").GetString();
note.Should().Contain("Disable submitted")
    .And.Contain("/agent-status")
    .And.NotContain("Scheduling paused");

This is the assertion that should have been there from the start — a regression that re-introduces the un-gated terminal read, or drops the Confirmed/Entry branching, makes this test fail.

Verification:

  • dotnet test Aevatar.GAgents.ChannelRuntime.Tests417/417 pass in ~6 s (count went up from 405 because dev was merged in 64066e14 between commits)
  • All five projection / test-stability / route-mapping / state-version / state-mirror guards pass

Good catch — this was the right fix to push, not just the "At minimum" version.

@eanzhao
Copy link
Copy Markdown
Contributor Author

eanzhao commented Apr 25, 2026

Reviewed be4292d0.

The previous issue is fixed. WaitForAgentStatusAsync now returns (Confirmed, Entry) and no longer performs an un-gated terminal GetAsync; DisableAgentAsync / EnableAgentAsync branch on Confirmed and return honest submitted/propagating copy when the dual gate never passes.

The stale-replica test now has outcome-level assertions: status remains pre-dispatch, the note points users to /agent-status, and it does not claim Scheduling paused. That pins the behavior that was previously missed.

I reran:

  • dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo -> 417/417 passed
  • bash tools/ci/test_stability_guards.sh -> passed
  • bash tools/ci/query_projection_priming_guard.sh -> passed
  • bash tools/ci/projection_state_version_guard.sh -> passed
  • bash tools/ci/projection_state_mirror_current_state_guard.sh -> passed
  • bash tools/ci/projection_route_mapping_guard.sh -> passed

No remaining code-level blockers from my side. The only remaining item is the stated post-merge production regression check.

@eanzhao eanzhao merged commit cc95470 into dev Apr 25, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant