fix(channel-runtime): Lark /delete-agent tombstone-not-reflected race by eanzhao · Pull Request #413 · aevatarAI/aevatar

eanzhao · 2026-04-25T07:49:55Z

Summary

Lark /delete-agent <id> confirm was returning "Delete submitted but registry tombstone is not yet reflected" and leaving the agent visible in /agents. Three compounding causes, all fixed.
Same fix lifts /disable-agent, /enable-agent, and AgentDeliveryTargetTool upsert/delete onto a 15 s, projection-primed wait.

Root cause

#	Cause	Effect
1	Polling budget = 10 × 500 ms = 5 s	Below the 95th percentile of production projection lag; tool fell through to "accepted" while the document was still materializing.
2	Projection scope grain not guaranteed alive at dispatch time	If the Orleans projection scope grain idle-deactivated, the actor's `CommittedStateEventPublished` for the tombstone landed in the stream with no subscribed materializer — `DeleteAsync` was never called and the document persisted indefinitely. The activation contract is "be alive when the event arrives", not "replay missed events", so a late prime cannot recover an event the projector already missed. `delete_agent` did no priming at all; existing `create_agent` primed after dispatch (same race, narrower window).
3	Polling was absence-only, no version anchor	Any stale-replica read counted as "still present" and consumed a poll attempt without progress.

OrleansActor.HandleEventAsync is fire-and-forget (return _streams.GetStream(Id).ProduceAsync(envelope, ct)) — await only signals "envelope queued", not "handler ran". So the post-dispatch wait must be defensive.

Fix

agents/Aevatar.GAgents.ChannelRuntime/ProjectionWaitDefaults.cs (new)
- Shared internal static class with Attempts = 30 / DelayMilliseconds = 500 (15 s budget). Used as default values for constructor parameters in both tools below; tests inject shrunk values per-instance via the constructor instead of mutating process-global state.
agents/Aevatar.GAgents.ChannelRuntime/AgentBuilderTool.cs
- Per-instance _projectionWaitAttempts / _projectionWaitDelayMilliseconds readonly fields populated from constructor parameters (defaulted to ProjectionWaitDefaults).
- DeleteAgentAsync now: capture versionBefore → EnsureUserAgentCatalogProjectionAsync → dispatch disable to skill-runner → revoke API key → dispatch tombstone to registry → WaitForTombstoneReflectedAsync (version-aware, exits early on document deletion). Priming runs before every dispatch so the projection scope is alive when the actor publishes; a late prime cannot recover a missed event.
- DisableAgentAsync / EnableAgentAsync now also prime before dispatch (previously relied on WaitForAgentStatusAsync's late prime, which was a no-op for the missed-event case).
- WaitForAgentStatusAsync is now version+status dual-gated (mirrors WaitForCreatedAgentAsync) and returns (bool Confirmed, UserAgentCatalogEntry? Entry): the caller captures versionBefore before dispatch and the helper waits for StateVersion > versionBefore && Status == expected. When the budget exhausts the helper returns (Confirmed: false, Entry: null) — no un-gated terminal GetAsync fallback that would surface a stale-but-expected-looking entry and let callers report success despite the contract not being satisfied. Both halves of the gate matter: status alone admits stale replicas that surface a historically expected status (e.g., from a prior disable→enable→disable cycle) before this dispatch materializes; version alone admits unrelated state events that advance the version without changing status. An earlier iteration captured versionBefore inside the helper after dispatch, which let a fast projection make versionAfter == versionBefore and burned the budget — that's why capture must happen before dispatch.
- DisableAgentAsync / EnableAgentAsync branch on observation.Confirmed: confirmed → success copy ("Agent disabled. Scheduling paused." / "Agent enabled. Scheduling resumed.") with the projector-observed entry; unconfirmed → surface the pre-dispatch entry (preserving the original status field) plus an honest propagating note ("Disable submitted. Run /agent-status in a few seconds to confirm the agent is paused."), symmetric with DeleteAgentAsync's deleted vs accepted / "Tombstone is propagating" pattern.
- Fallback copy when budget exhausts: "Tombstone is propagating. Run /agents in a few seconds to confirm the agent is gone." (no longer suggests the dispatch may not have landed).
agents/Aevatar.GAgents.ChannelRuntime/AgentDeliveryTargetTool.cs
- Same per-instance constructor-injected budget (replaces the previous private const, which prevented tests from exercising the budget-exhausted branch).
- DeleteAsync mirrors the AgentBuilderTool pattern (prime-before-dispatch + version-aware tombstone wait + 15 s). UpsertAsync picks up the longer budget through the same fields.

Other slash commands reviewed

Command	Status
`/agents`, `/agent-status`, `/templates`	Pure reads, no actor dispatch — unaffected.
`/run-agent`	Fire-and-forget by design (user follows up with `/agent-status`); no polling, no race.
`/disable-agent`, `/enable-agent`	Same root cause — fixed via prime-before-dispatch + dual-gated (`StateVersion > versionBefore && Status == expected`) `WaitForAgentStatusAsync`.
`/daily`, `/social-media` (→ create_agent)	Now also primes before dispatch (was post-dispatch — same race-class as the original delete bug). Wait helper continues to use the existing 5 s budget; if create surfaces the budget-exhausted symptom in the future, switching the helper defaults to `ProjectionWaitDefaults` is a one-liner.
`agent_delivery_targets` upsert / delete	Fixed (constructor-injected budget + prime-before-dispatch on both upsert and delete). List is read-only.

Test plan

Review-driven changes

948cba58 — drop the helper-side version gate from WaitForAgentStatusAsync (codex P2: versionBefore captured after dispatch let a fast projection burn the full budget). Replaced again in 062f4728 with a caller-side version gate that does not have this race.
5183b0c5 — move priming before every dispatch (delete / disable / enable); switch shared budget from mutable static to constructor-injected per-instance readonly fields; extract ProjectionWaitDefaults.
062f4728 — prime before create_agent dispatches as well (daily report + social media — same race-class as the originally fixed delete path); add caller-side versionBefore capture for disable_agent / enable_agent and dual-gate WaitForAgentStatusAsync on StateVersion > versionBefore && Status == expected (defends against stale read replicas surfacing a historically expected status).
be4292d0 — (this commit) close the dual-gate bypass: WaitForAgentStatusAsync's terminal fallback used to do an un-gated GetAsync and callers used ?? entry.value + unconditional success copy, so stale replicas could still produce "Scheduling paused" despite the dual gate having been violated. Helper now returns (Confirmed, Entry); callers branch on Confirmed and surface honest "submitted / propagating" copy when the gate never passes. The new test for stale-replica defense gained an outcome-level assertion that was missing in 062f4728.

🤖 Generated with Claude Code

The Lark `/delete-agent <id> confirm` flow was returning "Delete submitted but registry tombstone is not yet reflected" and leaving the agent visible in subsequent `/agents` listings. Three compounding causes: * The post-tombstone wait polled the read model for only 10 attempts × 500 ms = 5 s. Production projection latency (actor -> committed event -> projection scope -> projector -> Elasticsearch) routinely exceeds this budget, so the polling loop fell through to the "accepted" branch while the document was still being materialized. * Unlike `create_agent`, the delete path never called `EnsureUserAgentCatalogProjectionAsync`. If the projection scope grain had idle-deactivated, the actor's published `CommittedStateEventPublished` for the tombstone landed in the stream with no subscribed materializer to translate it into a `DeleteAsync` on the document store - the document then persisted indefinitely. * The polling was absence-only, with no version anchor. Any read-model blip (stale replica, cached scope) that returned the pre-tombstone document was treated as "still present" and consumed an attempt. Mirrors the create_agent pattern so the user-visible delete now: 1. Captures the read-model state version before issuing the tombstone. 2. Calls EnsureUserAgentCatalogProjectionAsync after dispatching the tombstone so a deactivated scope reactivates before the projector would otherwise miss the committed event. 3. Waits 30 × 500 ms = 15 s using a version-aware helper that returns successful as soon as either the document is deleted (GetStateVersion returns null) or version advances past the captured version with GetAsync returning null. 4. When the budget is exhausted, returns the same "accepted" payload but with a copy that points the user at /agents to verify ("Tombstone is propagating. Run /agents in a few seconds...") rather than implying the tombstone might not have landed. Same fix applied to /disable-agent + /enable-agent (WaitForAgentStatusAsync now version-aware and primes the projection scope) and to AgentDeliveryTargetTool.DeleteAsync. AgentDeliveryTargetTool.UpsertAsync also picks up the longer budget through the shared constants. Tests: * New `ExecuteAsync_DeleteAgent_ReturnsAcceptedWithPropagatingHint_*` covers the not-reflected branch by shrinking the wait budget via internal-only static fields and asserting the new copy. * Existing disable / enable / workflow-disable tests now mock `GetStateVersionAsync` to advance on the second poll, exercising the version-aware fast-exit path instead of timing out the full 15 s. * All 403 tests in Aevatar.GAgents.ChannelRuntime.Tests pass; suite drops from ~50 s back to ~6 s once the mocks are aligned with the version-aware wait. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9e984f981

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-25T07:53:12Z

+            var versionAfter = await queryPort.GetStateVersionAsync(agentId, ct) ?? -1;
+            if (versionAfter <= versionBefore)
+                continue;


Check status even when state version has not advanced

WaitForAgentStatusAsync now skips GetAsync unless GetStateVersionAsync is strictly greater than the captured baseline, but this method is called after dispatching disable_agent/enable_agent. If the projection applies the status change before the baseline read, versionAfter never exceeds versionBefore, so the loop waits the full 15-second budget before returning even though the agent is already in the expected status. This introduces a noticeable latency regression for fast-path updates in production.

Useful? React with 👍 / 👎.

Good catch — pushed 948cba58 removing the version gate from WaitForAgentStatusAsync. The status field is the authoritative signal (enum-like, only flips on lifecycle event commit), so polling it directly handles both the slow-projection case and the codex-flagged fast-projection-before-baseline race. Added ExecuteAsync_DisableAgent_ReturnsStatusFast_WhenProjectionUpdatedBeforeBaselineRead to pin the no-deadlock behavior with GetStateVersionAsync returning a flat value. WaitForTombstoneReflectedAsync keeps its version anchor because its caller (DeleteAgentAsync) captures versionBefore before dispatch and the absence-vs-presence signal benefits from ignoring stale replicas.

codecov · 2026-04-25T08:06:14Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.39%. Comparing base (6131ed7) to head (be4292d).
⚠️ Report is 7 commits behind head on dev.

@@           Coverage Diff           @@
##              dev     #413   +/-   ##
=======================================
  Coverage   70.38%   70.39%           
=======================================
  Files        1175     1175           
  Lines       84452    84452           
  Branches    11124    11124           
=======================================
+ Hits        59441    59446    +5     
+ Misses      20719    20716    -3     
+ Partials     4292     4290    -2

Flag	Coverage Δ
ci	`70.39% <ø> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.
see 3 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Codex P2 (r3141672095): the version-gated fast-exit on `WaitForAgentStatusAsync` was wrong. Because the helper is called *after* dispatching `disable_agent`/`enable_agent`, a fast Local actor + projection could land the new status before the helper captured `versionBefore`. In that race the loop saw `versionAfter == versionBefore` on every poll and burned the entire 15 s budget despite the read model already being correct - a noticeable latency regression for fast-path updates. The version anchor only buys a stale-replica defense, but the `Status` field is enum-like and only flips when the lifecycle event materializes, so reading it on every attempt is both correct and cheap. Drop the gate; keep the longer wait window and the `EnsureUserAgentCatalogProjectionAsync` priming added in this PR. `WaitForTombstoneReflectedAsync` keeps its version anchor because the caller (`DeleteAgentAsync`) captures version *before* dispatch and the absence-vs-presence signal benefits from ignoring stale-replica reads of the pre-tombstone document. Tests: * New `ExecuteAsync_DisableAgent_ReturnsStatusFast_WhenProjectionUpdatedBeforeBaselineRead` pins the codex scenario: `GetStateVersionAsync` returns the same value on every poll while `GetAsync` already surfaces the disabled status. Asserts the call returns in <1 s. A regression that re-introduces the version gate would deadlock until the 15 s budget elapses. * The temporary `GetStateVersionAsync` mocks I added to the existing disable / enable / workflow-disable tests are now unnecessary - the status-only path matches the original mock contract. * All 404 channel runtime tests pass in ~6 s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

eanzhao

Root cause analysis is solid and the version-aware fast-exit in WaitForTombstoneReflectedAsync is well-reasoned. Three things worth addressing — see inline.

Meta point: PR description says WaitForCreatedAgentAsync is left at the old 5 s budget because creates aren't in user reports. But it's the same pipeline, same projection lag — if production lag legitimately exceeds 5 s for delete/disable, the same lag is silently degrading /daily and /social-media creates today: user gets accepted + "registry projection is not yet confirmed" and the run_immediately branch gets skipped on creates that actually did succeed. Worth aligning to 15 s for symmetry; there's no reason to keep two different timeouts for the same materializer.

eanzhao · 2026-04-25T08:30:49Z

-        for (var attempt = 0; attempt < 10; attempt++)
+        // Capture the readmodel version before waiting so we can distinguish "the
+        // expected status hasn't materialized yet" from "the read model is stale".
+        var versionBefore = await queryPort.GetStateVersionAsync(agentId, ct) ?? -1;


Race: versionBefore is captured after the dispatch already happened.

DisableAgentAsync (line 600) and EnableAgentAsync (line 622) call DispatchAgentLifecycleAsync first and only then call into WaitForAgentStatusAsync, where this line captures versionBefore. Since HandleEventAsync is fire-and-forget (per your own PR description), nothing prevents the actor + projector from already advancing the document version by the time we get to line 780. Then the loop's versionAfter <= versionBefore continue-check waits for another, unrelated event to advance the version — which may never happen, exhausting the full 15 s budget on a flow that actually succeeded.

The fact that the existing disable/enable/workflow-disable tests had to mock GetStateVersionAsync to return 7L then 8L to avoid timing out (PR body: "~50 s → ~6 s once mocks align") is exactly this race surfacing — it's just hidden in production behind the actor not normally being faster than the GetStateVersionAsync round trip.

DeleteAgentAsync (line 509) and both create paths capture versionBefore before dispatch, which is the safer pattern. Suggest the same here: capture in the caller before DispatchAgentLifecycleAsync and pass it as a parameter (mirrors WaitForCreatedAgentAsync's signature).

Already resolved in 948cba58 (the version gate was the bug — dropped it from WaitForAgentStatusAsync). The status field is now the authoritative signal: poll GetAsync and check Status == expected on every attempt; no version anchor inside the helper. Pinned with ExecuteAsync_DisableAgent_ReturnsStatusFast_WhenProjectionUpdatedBeforeBaselineRead (mocks GetStateVersionAsync to return a flat value across polls, asserts <1 s). WaitForTombstoneReflectedAsync keeps the version anchor because its caller (DeleteAgentAsync) captures versionBefore before dispatch, and the absence-vs-presence read benefits from filtering stale replicas.

eanzhao · 2026-04-25T08:31:00Z

+    // Internal (not const) so tests can shrink the budget when they need to exercise
+    // the "did not reflect within budget" branch without burning the full 15 s.
+    internal static int ProjectionWaitAttempts = 30;
+    internal static int ProjectionWaitDelayMilliseconds = 500;


Mutable static budget is a process-global — flaky under future parallel test additions.

The one test that mutates these (ExecuteAsync_DeleteAgent_ReturnsAcceptedWithPropagatingHint_WhenTombstoneDoesNotReflectWithinBudget) is safe today because xUnit serializes tests within a class, and AgentBuilderToolTests is the only consumer. The moment anyone adds a second test class that exercises any polling path on AgentBuilderTool, it will see the shrunk 3 × 1ms values whenever the bug test happens to be running in another collection — classic action-at-a-distance flake.

Two cleaner options:

Inject via IOptions<ProjectionWaitOptions> so the budget is per-instance and overridable in DI.

Or pass maxAttempts / delayMilliseconds as parameters into WaitForAgentStatusAsync and WaitForTombstoneReflectedAsync, matching the shape WaitForCreatedAgentAsync already has at line 743-744.

Also: AgentDeliveryTargetTool line 20-21 duplicates these constants (as private const, so its tests can't shrink them — meaning there's no equivalent budget-exhausted test on the delivery-target delete path). Worth extracting to a single shared ProjectionWaitDefaults.

Fixed in 5183b0c5. Switched both tools to per-instance readonly fields populated via constructor parameters with ProjectionWaitDefaults as the shared default source. Tests inject shrunk values directly: new AgentBuilderTool(sp, projectionWaitAttempts: 3, projectionWaitDelayMilliseconds: 1). No process-global mutation, no try/finally reset, no action-at-a-distance flake when collections parallelize. WaitForTombstoneReflectedAsync now also takes maxAttempts/delayMilliseconds parameters with default fallbacks, matching WaitForCreatedAgentAsync's shape. AgentDeliveryTargetTool's previously private const budget is now also constructor-injected so its tests can exercise the budget-exhausted branch the same way.

eanzhao · 2026-04-25T08:31:08Z

+        // committed tombstone event has a subscribed materializer to translate into a
+        // document delete. Without this priming, an Orleans grain that idle-deactivated
+        // may queue the tombstone with no consumer.
+        await EnsureUserAgentCatalogProjectionAsync(ct);


Comment claims this priming rescues the case where "an Orleans grain that idle-deactivated may queue the tombstone with no consumer" — but the priming runs after line 520 already published the event into the stream. If the projection grain was genuinely deactivated at publish time and the stream provider isn't rewindable / implicit-subscription-with-replay, calling EnsureProjection here is too late: the event is already gone.

This matches the existing create_agent flow so it's not a regression introduced by this PR, but the rationale in the comment is only correct under specific stream-provider semantics. Either:

Move the prime call to before HandleEventAsync (defensive — costs one extra activation in the warm-grain case), or

Update the comment to be honest about the assumption ("relies on rewindable / implicit-subscription stream provider so late-attached subscribers replay the queued tombstone").

Worth confirming which is actually true before shipping a 15 s wait that depends on it.

Fixed in 5183b0c5. Moved EnsureUserAgentCatalogProjectionAsync to before every dispatch in DeleteAgentAsync, DisableAgentAsync, EnableAgentAsync (and removed the now-redundant call from inside WaitForAgentStatusAsync). AgentDeliveryTargetTool already primed before dispatch on the upsert/delete paths. Activation cost: at most one extra warm-grain round trip, paid up front; semantics: the projection scope is guaranteed alive when the actor publishes, so we no longer rely on rewindable-stream / late-attached-replay assumptions for correctness. Comment text updated to be honest about the rationale ("a late prime can't recover an event the projector already missed"). The create_agent flow has the same shape and is technically subject to the same constraint, but its run_immediately path is the only place where it would matter — left untouched here to keep this PR scoped to the reported bug; happy to send a follow-up if we want it tightened.

eanzhao

评审反馈:战术补丁 LGTM,但留了几条架构债

用户报告的 bug 确实修了,新增测试也精准锁定了对应回归点(尤其 ExecuteAsync_DisableAgent_ReturnsStatusFast_WhenProjectionUpdatedBeforeBaselineRead 漂亮地把 codex review 的 P2 变成了永久 guard)。下面只列违反 CLAUDE.md 原则或本 PR 内部不一致的点。

1. PR 描述与最终代码不同步(合并前请更新 description)

PR body 里仍写 WaitForAgentStatusAsync upgraded to **capture-version** + prime + 15 s window,但 followup commit 948cba58 "drop version gate on WaitForAgentStatusAsync" 已经显式去掉了 version 锚点,改成 status-only 轮询。followup 的修正方向是对的(dispatch 后捕获的 versionBefore 会让 fast projection 烧光 15s 预算),但 description 没跟上。

2. 架构层面:callsite-priming 模式违反 CLAUDE.md「写侧预挂接 projection」

PR Cause #2 自己点破了真正根因 —— projection scope grain idle-deactivated 时 committed event 没有 subscriber。CLAUDE.md「Projection Pipeline」要求:

committed domain event 必须可观察:write-side 完成 committed event 后必须送入 projection 主链;禁止只落 event store 而不进入可观察流。

刷新须通过正式 projection 会话、后台 materializer 或写侧预挂接 projection完成。

现在每个 Tool.cs 端点(create / delete / disable / enable / upsert delivery target / delete delivery target)都自己调一次 EnsureProjectionForActorAsync,projection 生命周期被泄露到所有命令派发器。符合 CLAUDE.md 的两种修法任选其一:

生命周期联动:UserAgentCatalogGAgent 激活时同时激活 projection scope grain(grain lifecycle hook),actor 活着即保证 subscriber 活着。
持久化订阅 + 重放:projection scope 用 durable cursor,grain 失活重激活时从 cursor 回放未消费 committed event(Orleans implicit subscription 的天然能力)。

两种方案任一都让所有 EnsureProjectionForActorAsync 调用消失。本 PR 不必扩大 scope,但建议另开 issue 跟进;否则下一个写新命令端点的开发者还得记得 "先 prime"。

3. 单元测试静态可变状态

ExecuteAsync_DeleteAgent_ReturnsAcceptedWithPropagatingHint_WhenTombstoneDoesNotReflectWithinBudget 用 try/finally 修改 AgentBuilderTool.ProjectionWaitAttempts(internal static int)。如果未来 xUnit 切到并行 collection 或测试顺序变化,可能跨测试漏读旧值。可接受折中,但属于隐患 —— 注释里加一行 [CollectionDefinition(DisableParallelization = true)] 或者把可调旋钮做成 instance field + DI 会更稳。

具体 inline 见下方两条评论。

承诺方面:已完成 delete-agent + disable/enable + delivery-target delete 的 15s + prime + version-aware 升级;未完成承诺(可接受未完成) create_agent / delivery-target upsert 的预算统一(见 inline #2)。

eanzhao · 2026-04-25T08:37:54Z

+        // committed tombstone event has a subscribed materializer to translate into a
+        // document delete. Without this priming, an Orleans grain that idle-deactivated
+        // may queue the tombstone with no consumer.
+        await EnsureUserAgentCatalogProjectionAsync(ct);


违反:priming 顺序与本 PR 内其他文件不一致(实际仍有 race)

本行的 EnsureUserAgentCatalogProjectionAsync(ct) 出现在 await registryActor.HandleEventAsync(...)(L520-522)之后。如果 projection scope grain 此刻 idle-deactivated,顺序是:

tombstone 派发到 stream(此时 subscriber 仍未激活)

才开始 prime projection scope

步骤 1 与步骤 2 之间,事件能否被新激活的 subscriber 消费取决于底层 stream provider 的回放语义,不是 CLAUDE.md「committed domain event 必须可观察」要求的强保证。15s 预算只是把命中概率压低,不能根除。

同 PR 内已有正确范式可对齐

AgentDeliveryTargetTool.UpsertAsync (L193-223):先 prime,后 dispatch ✅

AgentDeliveryTargetTool.DeleteAsync (本 PR 新增,L300-323):先 prime,后 dispatch ✅

本处 AgentBuilderTool.DeleteAgentAsync:先 dispatch,后 prime ❌

注释 // Mirrors create_agent 是真的 —— 但 create_agent (L403-405) 同样也是 dispatch-before-prime,根因相同的 bug 早就存在,不是本 PR 引入的。建议本 PR 顺手把这里和 create_agent 一起改成:

// 先 prime await EnsureUserAgentCatalogProjectionAsync(ct); // 再 dispatch var registryActor = await actorRuntime.GetAsync(...) ?? await actorRuntime.CreateAsync<...>(...); await registryActor.HandleEventAsync(...);

这样和同 PR 的 AgentDeliveryTargetTool 完全对齐,也符合 CLAUDE.md「写侧预挂接 projection」的精神。

已修复 ✅ (commit 5183b0c5)

而且超出了我提议的范围:

DeleteAgentAsync(L513-518)— ✅ priming 移到 dispatch 之前

DisableAgentAsync(L620-624)— ✅ 新增了 priming(原本完全没有,只靠 helper 内部 late-prime)

EnableAgentAsync(L647-651)— ✅ 新增了 priming(同上)

WaitForAgentStatusAsync — ✅ 移除了内部那个 "too-late" 的 priming,注释里也明确写了 "Projection scope priming happens in the caller before the dispatch"

rationale 也写得很正:

Activating up front costs at most one extra warm-grain round trip; eliminates the ambiguous comment.

现在四个 callsite 与 AgentDeliveryTargetTool 全部对齐成 prime-before-dispatch,符合「committed domain event 必须可观察」的要求。

eanzhao · 2026-04-25T08:37:54Z

+    //
+    // Internal (not const) so tests can shrink the budget when they need to exercise
+    // the "did not reflect within budget" branch without burning the full 15 s.
+    internal static int ProjectionWaitAttempts = 30;


违反:同根因不统一治理

本 PR 引入了 ProjectionWaitAttempts = 30 / DelayMilliseconds = 500(15s 预算)作为「shared」常量,但 WaitForCreatedAgentAsync(此文件 L730-732)没切到这两个常量,仍是硬编码:

int maxAttempts = 10, int delayMilliseconds = 500)

即 create_agent 走的还是旧的 5s 预算。

PR description 给出的理由是 create_agent... window only matters during creation; not in user reports —— 但根因完全一样:同样的 projection lag,同样的 idle-deactivation,同样的 fire-and-forget HandleEventAsync。create 没暴露只是因为前面有 API key 创建 + Nyx HTTP 给了 projection 更多 wall-clock 启动窗口才碰巧没碰到 race。

下次报障还是这个类。建议:

private async Task<bool> WaitForCreatedAgentAsync( IUserAgentCatalogQueryPort queryPort, string agentId, long versionBefore, Func<UserAgentCatalogEntry, bool> predicate, CancellationToken ct) // 删掉 maxAttempts / delayMilliseconds 默认参数,内部直接用共享常量

如果 run_immediately=true 需要更长窗口,把那个特殊性放到调用点,而不是默认参数里散开。

部分采纳,可接受 ⚠️(commit 5183b0c5)

作者的取舍:

✅ 把 WaitForTombstoneReflectedAsync 也升级成「外部传入 + 默认 fallback」shape,默认值用 ProjectionWaitDefaults.Attempts/DelayMilliseconds —— shape 对齐了。

❌ 但 WaitForCreatedAgentAsync 自己没改(此文件 L757-761 现在仍是 int maxAttempts = 10, int delayMilliseconds = 500),调用点也仍传 maxAttempts: 10/20 硬编码值。

PR description 里的理由是 "create window only matters during creation; not in user reports",我能接受这个判断 —— create 路径前面有 API key 创建 + Nyx HTTP roundtrip,wall-clock buffer 比 delete 宽很多,5s 在生产里目前够用。

但留个 trip-wire 在这里:下次 create_agent 报障 "projection not yet confirmed" 时,这两行默认值直接改成 ProjectionWaitDefaults.Attempts / ProjectionWaitDefaults.DelayMilliseconds 即可,根因和 delete 完全一样。

…ce budget Three review threads from the second pass: * r3141706674 (race: versionBefore captured after dispatch) — already resolved by 948cba5 dropping the version gate from WaitForAgentStatusAsync. The status field is the authoritative signal. * r3141707010 (priming runs after HandleEventAsync — too late if grain idle-deactivated). The previous comment claim that EnsureUserAgentCatalogProjectionAsync "rescues" a deactivated grain was only correct under specific stream-provider semantics. Move the call to *before* every dispatch (DeleteAgentAsync, DisableAgentAsync, EnableAgentAsync) so the projection scope is alive when the actor publishes. Costs at most one extra warm-grain round trip; eliminates the ambiguous comment. Removes the call from inside WaitForAgentStatusAsync — the helper is post-dispatch and a late prime there cannot recover an event the projector already missed. * r3141706856 (mutable static budget = action-at-a-distance flake risk). Extract `ProjectionWaitDefaults` (shared `internal static` constants) and switch both AgentBuilderTool and AgentDeliveryTargetTool to per-instance readonly fields populated via constructor parameters. Tests inject shrunk values directly: `new AgentBuilderTool(sp, projectionWaitAttempts: 3, projectionWaitDelayMilliseconds: 1)` No process-global mutation, no try/finally reset, no flake when the test surface eventually parallelizes across collections. WaitForTombstoneReflectedAsync now also accepts maxAttempts/delayMs parameters with default fallbacks, matching the existing shape of WaitForCreatedAgentAsync. AgentDeliveryTargetTool's previously `private const` budget — which prevented its tests from exercising the budget-exhausted branch — is now a constructor-injected field too. Tests: * `ExecuteAsync_DeleteAgent_ReturnsAcceptedWithPropagatingHint_*` no longer mutates a static; constructs with `projectionWaitAttempts: 3, projectionWaitDelayMilliseconds: 1` instead. * All 404 channel runtime tests pass in ~7 s. * tools/ci/test_stability_guards.sh, query_projection_priming_guard.sh, projection_route_mapping_guard.sh — all pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

eanzhao

第二轮 review:修复评估

followup commit 5183b0c5 的修复非常干净。打分:

反馈	状态
inline #1 priming 顺序	✅ 修复 + 扩展到 disable/enable(超出最小要求)
inline #2 `WaitForCreatedAgentAsync` 默认值	⚠️ 没改,但作者给出了能接受的理由(create wall-clock buffer 更宽)
body #3 mutable static 测试隐患	✅ 修复(改成 constructor-injected per-instance readonly fields)
body #1 PR description 与代码不同步	❌ 仍未更新(见下)
body #2 callsite-priming 架构债	⏭ follow-up,不在本 PR 范围

单独说一下 PR description

现在 PR body 与代码已经在 3 个地方不一致:

"Shared internal static int ProjectionWaitAttempts = 30" —— 该静态字段已经被 5183b0c5 删除,改成 ProjectionWaitDefaults + per-instance readonly fields。
"DeleteAgentAsync now: capture versionBefore → dispatch tombstone → EnsureUserAgentCatalogProjectionAsync → ..." —— 实际顺序已改为 prime-before-dispatch。
"WaitForAgentStatusAsync upgraded to capture-version + prime + 15 s window" —— version 锚点早在 948cba58 被去掉,prime 也在 5183b0c5 移到 caller。

合并前请同步 description,否则未来读 PR 历史的人会困惑(尤其 "prime after dispatch" 这种描述会让 reader 以为代码里也是这么做的)。

一个值得肯定的点

AgentDeliveryTargetTool 之前的 private const ProjectionWaitAttempts 也被改成了构造函数注入,让 delivery target 的 budget-exhausted 分支也变成可测试的 —— 这超出了我反馈的范围,是主动的质量提升。

其余意见

架构层面的 callsite-priming 模式仍然违反 CLAUDE.md「写侧预挂接 projection」原则,但这是 follow-up issue 范畴,本 PR 不应扩大 scope。建议合并后另开 issue 跟进 "projection scope co-tied with actor lifecycle" 的设计。

PR description 同步后即可合并(从代码角度看 LGTM)。

eanzhao · 2026-04-25T09:09:32Z

Reviewed latest head 5183b0c5.

Code direction is good: the reported /delete-agent tombstone visibility race is materially improved. delete_agent, disable_agent, enable_agent, and delivery-target delete/upsert now use the longer 15s wait budget, and the dangerous mutable static test hook was replaced with per-instance constructor injection. I also verified locally:

dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo -> 404/404 passed
bash tools/ci/test_stability_guards.sh -> passed
bash tools/ci/query_projection_priming_guard.sh -> passed
bash tools/ci/projection_route_mapping_guard.sh -> passed
bash tools/ci/projection_state_version_guard.sh -> passed
bash tools/ci/projection_state_mirror_current_state_guard.sh -> passed

Remaining issues:

create_agent still has the same late-prime shape. In AgentBuilderTool.cs, daily report dispatches initialize at line 277, optional immediate trigger at 281-283, then only primes projection at 285. Social media dispatches initialize at 415, then primes at 417. Given this PR's own corrected reasoning that "late prime can't recover an event the projector already missed", this leaves the same root-cause class open for create. If this PR is scoped strictly to delete/status, acceptable as follow-up; if the claim is "same root cause fixed", create should prime before dispatch too.
WaitForAgentStatusAsync is now status-only (AgentBuilderTool.cs:815-817). That avoids the bad post-dispatch baseline bug, but it can still accept a stale replica that already had the expected historical status. Since the PR explicitly treats stale read-model replicas as part of the production problem, the stronger shape is: capture versionBefore before disable/enable dispatch, then wait for StateVersion > versionBefore && Status == expected. That preserves the fast-projection case without turning a stale status read into confirmation.
PR description is stale and should be updated before merge. It still says the wait budget is mutable static, that delete order is dispatch tombstone -> EnsureUserAgentCatalogProjectionAsync, and that WaitForAgentStatusAsync is capture-version + prime. The code no longer does those things.

Architecture note: the callsite-level EnsureUserAgentCatalogProjectionAsync pattern is still a workaround. The better direction is a narrow catalog command/observation port or command skeleton that owns prime/bind projection -> dispatch -> accepted receipt -> observe read-model version, and ultimately projection scope lifecycle should be tied to actor/runtime lifecycle so committed events are observable by contract rather than because each tool remembered to prime.

…lifecycle wait Two follow-up threads from the third review pass (#issuecomment-4318611186): * create_agent late-prime: CreateDailyReportAgentAsync and CreateSocialMediaAgentAsync primed the projection AFTER HandleEventAsync, leaving the same race-class open as the originally fixed delete path. Move EnsureUserAgentCatalogProjectionAsync to before each dispatch so the projection scope is alive when the actor publishes initialize/trigger events. Same rationale as DeleteAgentAsync's fix in 5183b0c: a late prime cannot recover an event the projector already missed. * WaitForAgentStatusAsync stale-replica defense: status-only polling could accept a read replica that surfaces a historically expected status (e.g., prior disable->enable->disable cycle's projected Disabled state) before the current dispatch materializes. Capture versionBefore in the caller (before dispatch — the original 948cba5 fix removed it because it was being captured AFTER dispatch in the helper) and require StateVersion > versionBefore AND Status == expected. Mirrors WaitForCreatedAgentAsync's contract. Tests: * DispatchesDisableAndReturnsStatus / DispatchesEnableAndReturnsStatus / DispatchesWorkflowDisableAndReturnsStatus: added GetStateVersionAsync mocks (caller baseline 5L -> helper poll 6L) for the new dual-gate contract. * ReturnsStatusFast_WhenProjectionAdvancesOnFirstPoll (renamed from *_WhenProjectionUpdatedBeforeBaselineRead): rewritten to pin the new dual-gate first-poll exit (versionBefore=42, versionAfter=43 AND status=Disabled -> return immediately, <1s). * New KeepsWaitingWhenStatusMatchesButVersionStale: stale-replica defense — even if GetAsync surfaces an expected-looking Disabled, the wait must keep iterating until version advances. Pinned to a 3-attempt budget injected through the constructor. Verification: * dotnet test Aevatar.GAgents.ChannelRuntime.Tests -> 405/405 pass in ~6s * tools/ci/test_stability_guards.sh -> passed * tools/ci/query_projection_priming_guard.sh -> passed * tools/ci/projection_route_mapping_guard.sh -> passed * tools/ci/projection_state_version_guard.sh -> passed * tools/ci/projection_state_mirror_current_state_guard.sh -> passed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

eanzhao · 2026-04-25T09:21:54Z

Addressed in 062f4728:

(1) create_agent late-prime — fixed. Both CreateDailyReportAgentAsync and CreateSocialMediaAgentAsync now call EnsureUserAgentCatalogProjectionAsync immediately after capturing versionBefore and before actor.HandleEventAsync / TriggerSkillRunnerExecutionCommand / InitializeWorkflowAgentCommand. Same rationale as the delete-path fix in 5183b0c5 — a late prime can't recover an event the projector already missed.

(2) WaitForAgentStatusAsync stale-replica defense — fixed. Switched to your proposed dual-gate shape: caller captures versionBefore before dispatch (avoiding the bug 948cba58 removed, where the capture happened after dispatch in the helper), and the helper waits for StateVersion > versionBefore && Status == expected. Mirrors WaitForCreatedAgentAsync's contract exactly. Status alone would admit stale replicas surfacing a historically expected status; version alone would admit unrelated state events. Both gates needed.

(3) PR description — fixed. Updated to match the current code state (was already updated in the second pass; this commit adds an entry under ## Review-driven changes and corrects the WaitForAgentStatusAsync description from "status-driven (no version anchor)" to dual-gated).

(Architecture note — agreed, deferred.) Co-tying projection scope lifecycle to actor lifecycle (or moving to durable subscription replay) so callsites don't need to remember to prime is the right direction. That removes both the create-path race-class and the delete-path race-class structurally rather than per-callsite. Worth its own issue post-merge.

Verification:

dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj — 405/405 pass in ~6s
tools/ci/test_stability_guards.sh — passed
tools/ci/query_projection_priming_guard.sh — passed
tools/ci/projection_route_mapping_guard.sh — passed
tools/ci/projection_state_version_guard.sh — passed
tools/ci/projection_state_mirror_current_state_guard.sh — passed

New tests pinning the dual-gate contract:

ReturnsStatusFast_WhenProjectionAdvancesOnFirstPoll — first-poll exit when versionAfter > versionBefore && Status == expected (< 1 s).
KeepsWaitingWhenStatusMatchesButVersionStale — even when GetAsync surfaces an expected-looking status, the wait keeps iterating until version advances. Pinned to a 3-attempt injected budget so a status-only regression is visible (would call GetStateVersionAsync once instead of 4 times).

eanzhao · 2026-04-25T09:46:20Z

Reviewed 062f4728. The three requested fixes are mostly addressed:

create_agent now primes before dispatch on both daily report and social media paths.
disable_agent / enable_agent now capture versionBefore before dispatch and use a version+status dual gate.
PR description has been updated and now matches the current implementation.

Tests/guards I reran locally:

dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo -> 405/405 passed
bash tools/ci/test_stability_guards.sh -> passed
bash tools/ci/query_projection_priming_guard.sh -> passed
bash tools/ci/projection_state_version_guard.sh -> passed
bash tools/ci/projection_state_mirror_current_state_guard.sh -> passed
bash tools/ci/projection_route_mapping_guard.sh -> passed

One remaining bug in the new dual-gate path:

WaitForAgentStatusAsync correctly short-circuits while versionAfter <= versionBefore, but when the budget exhausts it does return await queryPort.GetAsync(agentId, ct) (AgentBuilderTool.cs:850). The callers then do:

var after = await WaitForAgentStatusAsync(...) ?? entry.value;
return SerializeAgentStatus(after, "Agent disabled. Scheduling paused.") / "Agent enabled. Scheduling resumed."

So the stale-replica scenario is still observable as a successful user response after the wait budget is exhausted. The new test ExecuteAsync_DisableAgent_KeepsWaitingWhenStatusMatchesButVersionStale actually sets this up: the terminal fallback GetAsync returns a stale-but-expected-looking Disabled entry, but the test only asserts GetStateVersionAsync call count and never asserts the returned payload. As written, the command can still return status=disabled + "Agent disabled. Scheduling paused." even though no StateVersion > versionBefore was observed.

Suggested fix: make the helper return an observation result, not a nullable entry, e.g. (confirmed, entry). If the dual gate never passes, return confirmed=false and have DisableAgentAsync / EnableAgentAsync use honest accepted/propagating copy rather than success copy. At minimum, do not perform the un-gated final GetAsync in WaitForAgentStatusAsync; it bypasses the contract the helper just enforced.

…gate on budget exhaustion Reviewer comment 4318670031 caught a real bug in 062f472: WaitForAgentStatusAsync's terminal fallback was `return await queryPort.GetAsync(agentId, ct)` — an un-gated read that bypassed the dual gate the helper had just enforced. Callers (DisableAgentAsync / EnableAgentAsync) used `?? entry.value` and then unconditionally serialized the result with the success copy ("Agent disabled. Scheduling paused." / "Agent enabled. Scheduling resumed."). Net effect: a stale-but-expected-looking replica observed by the terminal fallback (or a pre-dispatch entry surfaced via the null-coalesce) would still be reported as confirmed success even though StateVersion never advanced past versionBefore. The new test KeepsWaitingWhenStatusMatchesButVersionStale set up exactly this scenario but only asserted the GetStateVersionAsync call count and never asserted the returned payload — so the bug was invisible to the test even though the test "passed". Fix: change WaitForAgentStatusAsync's return type to `(bool Confirmed, UserAgentCatalogEntry? Entry)`. Drop the un-gated final GetAsync; if the dual gate never passes, return (false, null). DisableAgentAsync / EnableAgentAsync now branch on Confirmed: - Confirmed=true: SerializeAgentStatus(observation.Entry!, "Agent disabled. Scheduling paused.") - Confirmed=false: SerializeAgentStatus(entry.value, "Disable submitted. Run /agent-status in a few seconds to confirm the agent is paused.") The unconfirmed branch surfaces the pre-dispatch entry (so status field stays as the pre-disable value) plus an honest "submitted / propagating" note — symmetric with DeleteAgentAsync's "deleted" vs "accepted" / "Tombstone is propagating" pattern. Tests: * KeepsWaitingWhenStatusMatchesButVersionStale: added outcome assertions that the payload status field stays Running (the pre-dispatch value) and the note contains "Disable submitted" + "/agent-status" rather than "Scheduling paused". This is the assertion that should have been there from the start. * No other test changes — Dispatches*ReturnsStatus and ReturnsStatusFast_WhenProjectionAdvancesOnFirstPoll all exercise the Confirmed=true path and continue to pass. Verification: * dotnet test Aevatar.GAgents.ChannelRuntime.Tests -> 417/417 pass in ~6s * tools/ci/test_stability_guards.sh -> passed * tools/ci/query_projection_priming_guard.sh -> passed * tools/ci/projection_route_mapping_guard.sh -> passed * tools/ci/projection_state_version_guard.sh -> passed * tools/ci/projection_state_mirror_current_state_guard.sh -> passed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

eanzhao · 2026-04-25T09:55:20Z

Acknowledged — that was a real bug, and the test I added in 062f4728 had path-level assertions but no outcome-level assertions, so the bypass was invisible to it. Fixed in be4292d0:

Helper signature change. WaitForAgentStatusAsync now returns (bool Confirmed, UserAgentCatalogEntry? Entry). The terminal fallback await queryPort.GetAsync(agentId, ct) is gone — when the dual gate never passes, the helper returns (false, null) and refuses to surface a stale-but-expected-looking entry.

Caller branching. DisableAgentAsync / EnableAgentAsync now split on observation.Confirmed:

Confirmed=true — success copy ("Agent disabled. Scheduling paused." / "Agent enabled. Scheduling resumed.") with the projector-observed entry.
Confirmed=false — surface the pre-dispatch entry (so the status field stays at its real pre-disable/pre-enable value) plus a propagating note ("Disable submitted. Run /agent-status in a few seconds to confirm the agent is paused."), symmetric with DeleteAgentAsync's deleted / accepted + "Tombstone is propagating" pattern.

Test. KeepsWaitingWhenStatusMatchesButVersionStale gained the missing outcome assertions:

doc.RootElement.GetProperty("status").GetString().Should().Be(SkillRunnerDefaults.StatusRunning);
var note = doc.RootElement.GetProperty("note").GetString();
note.Should().Contain("Disable submitted")
    .And.Contain("/agent-status")
    .And.NotContain("Scheduling paused");

This is the assertion that should have been there from the start — a regression that re-introduces the un-gated terminal read, or drops the Confirmed/Entry branching, makes this test fail.

Verification:

dotnet test Aevatar.GAgents.ChannelRuntime.Tests → 417/417 pass in ~6 s (count went up from 405 because dev was merged in 64066e14 between commits)
All five projection / test-stability / route-mapping / state-version / state-mirror guards pass

Good catch — this was the right fix to push, not just the "At minimum" version.

eanzhao · 2026-04-25T10:01:32Z

Reviewed be4292d0.

The previous issue is fixed. WaitForAgentStatusAsync now returns (Confirmed, Entry) and no longer performs an un-gated terminal GetAsync; DisableAgentAsync / EnableAgentAsync branch on Confirmed and return honest submitted/propagating copy when the dual gate never passes.

The stale-replica test now has outcome-level assertions: status remains pre-dispatch, the note points users to /agent-status, and it does not claim Scheduling paused. That pins the behavior that was previously missed.

I reran:

dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo -> 417/417 passed
bash tools/ci/test_stability_guards.sh -> passed
bash tools/ci/query_projection_priming_guard.sh -> passed
bash tools/ci/projection_state_version_guard.sh -> passed
bash tools/ci/projection_state_mirror_current_state_guard.sh -> passed
bash tools/ci/projection_route_mapping_guard.sh -> passed

No remaining code-level blockers from my side. The only remaining item is the stated post-merge production regression check.

chatgpt-codex-connector Bot reviewed Apr 25, 2026

View reviewed changes

eanzhao commented Apr 25, 2026

View reviewed changes

eanzhao and others added 2 commits April 25, 2026 17:49

Merge branch 'dev' into fix/2026-04-25_delete-agent-tombstone-visibility

64066e1

eanzhao merged commit cc95470 into dev Apr 25, 2026
12 checks passed

Conversation

eanzhao commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Fix

Other slash commands reviewed

Test plan

Review-driven changes

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

eanzhao left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eanzhao left a comment

Choose a reason for hiding this comment

评审反馈:战术补丁 LGTM,但留了几条架构债

1. PR 描述与最终代码不同步(合并前请更新 description)

2. 架构层面:callsite-priming 模式违反 CLAUDE.md「写侧预挂接 projection」

3. 单元测试静态可变状态

具体 inline 见下方两条评论。

Uh oh!

Choose a reason for hiding this comment

违反:priming 顺序与本 PR 内其他文件不一致(实际仍有 race)

同 PR 内已有正确范式可对齐

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

违反:同根因不统一治理

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eanzhao left a comment

Choose a reason for hiding this comment

第二轮 review:修复评估

单独说一下 PR description

一个值得肯定的点

其余意见

Uh oh!

eanzhao commented Apr 25, 2026

Uh oh!

eanzhao commented Apr 25, 2026

Uh oh!

eanzhao commented Apr 25, 2026

Uh oh!

eanzhao commented Apr 25, 2026

Uh oh!

eanzhao commented Apr 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

eanzhao commented Apr 25, 2026 •

edited

Loading

codecov Bot commented Apr 25, 2026 •

edited

Loading