Conversation
The Lark `/delete-agent <id> confirm` flow was returning "Delete submitted
but registry tombstone is not yet reflected" and leaving the agent visible
in subsequent `/agents` listings. Three compounding causes:
* The post-tombstone wait polled the read model for only 10 attempts ×
500 ms = 5 s. Production projection latency (actor -> committed event ->
projection scope -> projector -> Elasticsearch) routinely exceeds this
budget, so the polling loop fell through to the "accepted" branch
while the document was still being materialized.
* Unlike `create_agent`, the delete path never called
`EnsureUserAgentCatalogProjectionAsync`. If the projection scope grain
had idle-deactivated, the actor's published `CommittedStateEventPublished`
for the tombstone landed in the stream with no subscribed materializer
to translate it into a `DeleteAsync` on the document store - the
document then persisted indefinitely.
* The polling was absence-only, with no version anchor. Any read-model
blip (stale replica, cached scope) that returned the pre-tombstone
document was treated as "still present" and consumed an attempt.
Mirrors the create_agent pattern so the user-visible delete now:
1. Captures the read-model state version before issuing the tombstone.
2. Calls EnsureUserAgentCatalogProjectionAsync after dispatching the
tombstone so a deactivated scope reactivates before the projector
would otherwise miss the committed event.
3. Waits 30 × 500 ms = 15 s using a version-aware helper that returns
successful as soon as either the document is deleted (GetStateVersion
returns null) or version advances past the captured version with
GetAsync returning null.
4. When the budget is exhausted, returns the same "accepted" payload
but with a copy that points the user at /agents to verify
("Tombstone is propagating. Run /agents in a few seconds...")
rather than implying the tombstone might not have landed.
Same fix applied to /disable-agent + /enable-agent (WaitForAgentStatusAsync
now version-aware and primes the projection scope) and to
AgentDeliveryTargetTool.DeleteAsync. AgentDeliveryTargetTool.UpsertAsync
also picks up the longer budget through the shared constants.
Tests:
* New `ExecuteAsync_DeleteAgent_ReturnsAcceptedWithPropagatingHint_*`
covers the not-reflected branch by shrinking the wait budget via
internal-only static fields and asserting the new copy.
* Existing disable / enable / workflow-disable tests now mock
`GetStateVersionAsync` to advance on the second poll, exercising the
version-aware fast-exit path instead of timing out the full 15 s.
* All 403 tests in Aevatar.GAgents.ChannelRuntime.Tests pass; suite
drops from ~50 s back to ~6 s once the mocks are aligned with the
version-aware wait.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d9e984f981
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| var versionAfter = await queryPort.GetStateVersionAsync(agentId, ct) ?? -1; | ||
| if (versionAfter <= versionBefore) | ||
| continue; |
There was a problem hiding this comment.
Check status even when state version has not advanced
WaitForAgentStatusAsync now skips GetAsync unless GetStateVersionAsync is strictly greater than the captured baseline, but this method is called after dispatching disable_agent/enable_agent. If the projection applies the status change before the baseline read, versionAfter never exceeds versionBefore, so the loop waits the full 15-second budget before returning even though the agent is already in the expected status. This introduces a noticeable latency regression for fast-path updates in production.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — pushed 948cba58 removing the version gate from WaitForAgentStatusAsync. The status field is the authoritative signal (enum-like, only flips on lifecycle event commit), so polling it directly handles both the slow-projection case and the codex-flagged fast-projection-before-baseline race. Added ExecuteAsync_DisableAgent_ReturnsStatusFast_WhenProjectionUpdatedBeforeBaselineRead to pin the no-deadlock behavior with GetStateVersionAsync returning a flat value. WaitForTombstoneReflectedAsync keeps its version anchor because its caller (DeleteAgentAsync) captures versionBefore before dispatch and the absence-vs-presence signal benefits from ignoring stale replicas.
Codecov Report✅ All modified and coverable lines are covered by tests. @@ Coverage Diff @@
## dev #413 +/- ##
=======================================
Coverage 70.38% 70.39%
=======================================
Files 1175 1175
Lines 84452 84452
Branches 11124 11124
=======================================
+ Hits 59441 59446 +5
+ Misses 20719 20716 -3
+ Partials 4292 4290 -2
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Codex P2 (r3141672095): the version-gated fast-exit on `WaitForAgentStatusAsync` was wrong. Because the helper is called *after* dispatching `disable_agent`/`enable_agent`, a fast Local actor + projection could land the new status before the helper captured `versionBefore`. In that race the loop saw `versionAfter == versionBefore` on every poll and burned the entire 15 s budget despite the read model already being correct - a noticeable latency regression for fast-path updates. The version anchor only buys a stale-replica defense, but the `Status` field is enum-like and only flips when the lifecycle event materializes, so reading it on every attempt is both correct and cheap. Drop the gate; keep the longer wait window and the `EnsureUserAgentCatalogProjectionAsync` priming added in this PR. `WaitForTombstoneReflectedAsync` keeps its version anchor because the caller (`DeleteAgentAsync`) captures version *before* dispatch and the absence-vs-presence signal benefits from ignoring stale-replica reads of the pre-tombstone document. Tests: * New `ExecuteAsync_DisableAgent_ReturnsStatusFast_WhenProjectionUpdatedBeforeBaselineRead` pins the codex scenario: `GetStateVersionAsync` returns the same value on every poll while `GetAsync` already surfaces the disabled status. Asserts the call returns in <1 s. A regression that re-introduces the version gate would deadlock until the 15 s budget elapses. * The temporary `GetStateVersionAsync` mocks I added to the existing disable / enable / workflow-disable tests are now unnecessary - the status-only path matches the original mock contract. * All 404 channel runtime tests pass in ~6 s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eanzhao
left a comment
There was a problem hiding this comment.
Root cause analysis is solid and the version-aware fast-exit in WaitForTombstoneReflectedAsync is well-reasoned. Three things worth addressing — see inline.
Meta point: PR description says WaitForCreatedAgentAsync is left at the old 5 s budget because creates aren't in user reports. But it's the same pipeline, same projection lag — if production lag legitimately exceeds 5 s for delete/disable, the same lag is silently degrading /daily and /social-media creates today: user gets accepted + "registry projection is not yet confirmed" and the run_immediately branch gets skipped on creates that actually did succeed. Worth aligning to 15 s for symmetry; there's no reason to keep two different timeouts for the same materializer.
| for (var attempt = 0; attempt < 10; attempt++) | ||
| // Capture the readmodel version before waiting so we can distinguish "the | ||
| // expected status hasn't materialized yet" from "the read model is stale". | ||
| var versionBefore = await queryPort.GetStateVersionAsync(agentId, ct) ?? -1; |
There was a problem hiding this comment.
Race: versionBefore is captured after the dispatch already happened.
DisableAgentAsync (line 600) and EnableAgentAsync (line 622) call DispatchAgentLifecycleAsync first and only then call into WaitForAgentStatusAsync, where this line captures versionBefore. Since HandleEventAsync is fire-and-forget (per your own PR description), nothing prevents the actor + projector from already advancing the document version by the time we get to line 780. Then the loop's versionAfter <= versionBefore continue-check waits for another, unrelated event to advance the version — which may never happen, exhausting the full 15 s budget on a flow that actually succeeded.
The fact that the existing disable/enable/workflow-disable tests had to mock GetStateVersionAsync to return 7L then 8L to avoid timing out (PR body: "~50 s → ~6 s once mocks align") is exactly this race surfacing — it's just hidden in production behind the actor not normally being faster than the GetStateVersionAsync round trip.
DeleteAgentAsync (line 509) and both create paths capture versionBefore before dispatch, which is the safer pattern. Suggest the same here: capture in the caller before DispatchAgentLifecycleAsync and pass it as a parameter (mirrors WaitForCreatedAgentAsync's signature).
There was a problem hiding this comment.
Already resolved in 948cba58 (the version gate was the bug — dropped it from WaitForAgentStatusAsync). The status field is now the authoritative signal: poll GetAsync and check Status == expected on every attempt; no version anchor inside the helper. Pinned with ExecuteAsync_DisableAgent_ReturnsStatusFast_WhenProjectionUpdatedBeforeBaselineRead (mocks GetStateVersionAsync to return a flat value across polls, asserts <1 s). WaitForTombstoneReflectedAsync keeps the version anchor because its caller (DeleteAgentAsync) captures versionBefore before dispatch, and the absence-vs-presence read benefits from filtering stale replicas.
| // Internal (not const) so tests can shrink the budget when they need to exercise | ||
| // the "did not reflect within budget" branch without burning the full 15 s. | ||
| internal static int ProjectionWaitAttempts = 30; | ||
| internal static int ProjectionWaitDelayMilliseconds = 500; |
There was a problem hiding this comment.
Mutable static budget is a process-global — flaky under future parallel test additions.
The one test that mutates these (ExecuteAsync_DeleteAgent_ReturnsAcceptedWithPropagatingHint_WhenTombstoneDoesNotReflectWithinBudget) is safe today because xUnit serializes tests within a class, and AgentBuilderToolTests is the only consumer. The moment anyone adds a second test class that exercises any polling path on AgentBuilderTool, it will see the shrunk 3 × 1ms values whenever the bug test happens to be running in another collection — classic action-at-a-distance flake.
Two cleaner options:
- Inject via
IOptions<ProjectionWaitOptions>so the budget is per-instance and overridable in DI. - Or pass
maxAttempts/delayMillisecondsas parameters intoWaitForAgentStatusAsyncandWaitForTombstoneReflectedAsync, matching the shapeWaitForCreatedAgentAsyncalready has at line 743-744.
Also: AgentDeliveryTargetTool line 20-21 duplicates these constants (as private const, so its tests can't shrink them — meaning there's no equivalent budget-exhausted test on the delivery-target delete path). Worth extracting to a single shared ProjectionWaitDefaults.
There was a problem hiding this comment.
Fixed in 5183b0c5. Switched both tools to per-instance readonly fields populated via constructor parameters with ProjectionWaitDefaults as the shared default source. Tests inject shrunk values directly: new AgentBuilderTool(sp, projectionWaitAttempts: 3, projectionWaitDelayMilliseconds: 1). No process-global mutation, no try/finally reset, no action-at-a-distance flake when collections parallelize. WaitForTombstoneReflectedAsync now also takes maxAttempts/delayMilliseconds parameters with default fallbacks, matching WaitForCreatedAgentAsync's shape. AgentDeliveryTargetTool's previously private const budget is now also constructor-injected so its tests can exercise the budget-exhausted branch the same way.
| // committed tombstone event has a subscribed materializer to translate into a | ||
| // document delete. Without this priming, an Orleans grain that idle-deactivated | ||
| // may queue the tombstone with no consumer. | ||
| await EnsureUserAgentCatalogProjectionAsync(ct); |
There was a problem hiding this comment.
Comment claims this priming rescues the case where "an Orleans grain that idle-deactivated may queue the tombstone with no consumer" — but the priming runs after line 520 already published the event into the stream. If the projection grain was genuinely deactivated at publish time and the stream provider isn't rewindable / implicit-subscription-with-replay, calling EnsureProjection here is too late: the event is already gone.
This matches the existing create_agent flow so it's not a regression introduced by this PR, but the rationale in the comment is only correct under specific stream-provider semantics. Either:
- Move the prime call to before
HandleEventAsync(defensive — costs one extra activation in the warm-grain case), or - Update the comment to be honest about the assumption ("relies on rewindable / implicit-subscription stream provider so late-attached subscribers replay the queued tombstone").
Worth confirming which is actually true before shipping a 15 s wait that depends on it.
There was a problem hiding this comment.
Fixed in 5183b0c5. Moved EnsureUserAgentCatalogProjectionAsync to before every dispatch in DeleteAgentAsync, DisableAgentAsync, EnableAgentAsync (and removed the now-redundant call from inside WaitForAgentStatusAsync). AgentDeliveryTargetTool already primed before dispatch on the upsert/delete paths. Activation cost: at most one extra warm-grain round trip, paid up front; semantics: the projection scope is guaranteed alive when the actor publishes, so we no longer rely on rewindable-stream / late-attached-replay assumptions for correctness. Comment text updated to be honest about the rationale ("a late prime can't recover an event the projector already missed"). The create_agent flow has the same shape and is technically subject to the same constraint, but its run_immediately path is the only place where it would matter — left untouched here to keep this PR scoped to the reported bug; happy to send a follow-up if we want it tightened.
eanzhao
left a comment
There was a problem hiding this comment.
评审反馈:战术补丁 LGTM,但留了几条架构债
用户报告的 bug 确实修了,新增测试也精准锁定了对应回归点(尤其 ExecuteAsync_DisableAgent_ReturnsStatusFast_WhenProjectionUpdatedBeforeBaselineRead 漂亮地把 codex review 的 P2 变成了永久 guard)。下面只列违反 CLAUDE.md 原则或本 PR 内部不一致的点。
1. PR 描述与最终代码不同步(合并前请更新 description)
PR body 里仍写 WaitForAgentStatusAsync upgraded to **capture-version** + prime + 15 s window,但 followup commit 948cba58 "drop version gate on WaitForAgentStatusAsync" 已经显式去掉了 version 锚点,改成 status-only 轮询。followup 的修正方向是对的(dispatch 后捕获的 versionBefore 会让 fast projection 烧光 15s 预算),但 description 没跟上。
2. 架构层面:callsite-priming 模式违反 CLAUDE.md「写侧预挂接 projection」
PR Cause #2 自己点破了真正根因 —— projection scope grain idle-deactivated 时 committed event 没有 subscriber。CLAUDE.md「Projection Pipeline」要求:
committed domain event 必须可观察:write-side 完成 committed event 后必须送入 projection 主链;禁止只落 event store 而不进入可观察流。
刷新须通过正式 projection 会话、后台 materializer 或写侧预挂接 projection完成。
现在每个 Tool.cs 端点(create / delete / disable / enable / upsert delivery target / delete delivery target)都自己调一次 EnsureProjectionForActorAsync,projection 生命周期被泄露到所有命令派发器。符合 CLAUDE.md 的两种修法任选其一:
- 生命周期联动:
UserAgentCatalogGAgent激活时同时激活 projection scope grain(grain lifecycle hook),actor 活着即保证 subscriber 活着。 - 持久化订阅 + 重放:projection scope 用 durable cursor,grain 失活重激活时从 cursor 回放未消费 committed event(Orleans implicit subscription 的天然能力)。
两种方案任一都让所有 EnsureProjectionForActorAsync 调用消失。本 PR 不必扩大 scope,但建议另开 issue 跟进;否则下一个写新命令端点的开发者还得记得 "先 prime"。
3. 单元测试静态可变状态
ExecuteAsync_DeleteAgent_ReturnsAcceptedWithPropagatingHint_WhenTombstoneDoesNotReflectWithinBudget 用 try/finally 修改 AgentBuilderTool.ProjectionWaitAttempts(internal static int)。如果未来 xUnit 切到并行 collection 或测试顺序变化,可能跨测试漏读旧值。可接受折中,但属于隐患 —— 注释里加一行 [CollectionDefinition(DisableParallelization = true)] 或者把可调旋钮做成 instance field + DI 会更稳。
具体 inline 见下方两条评论。
承诺方面:已完成 delete-agent + disable/enable + delivery-target delete 的 15s + prime + version-aware 升级;未完成承诺(可接受未完成) create_agent / delivery-target upsert 的预算统一(见 inline #2)。
| // committed tombstone event has a subscribed materializer to translate into a | ||
| // document delete. Without this priming, an Orleans grain that idle-deactivated | ||
| // may queue the tombstone with no consumer. | ||
| await EnsureUserAgentCatalogProjectionAsync(ct); |
There was a problem hiding this comment.
违反:priming 顺序与本 PR 内其他文件不一致(实际仍有 race)
本行的 EnsureUserAgentCatalogProjectionAsync(ct) 出现在 await registryActor.HandleEventAsync(...)(L520-522)之后。如果 projection scope grain 此刻 idle-deactivated,顺序是:
- tombstone 派发到 stream(此时 subscriber 仍未激活)
- 才开始 prime projection scope
步骤 1 与步骤 2 之间,事件能否被新激活的 subscriber 消费取决于底层 stream provider 的回放语义,不是 CLAUDE.md「committed domain event 必须可观察」要求的强保证。15s 预算只是把命中概率压低,不能根除。
同 PR 内已有正确范式可对齐
AgentDeliveryTargetTool.UpsertAsync(L193-223):先 prime,后 dispatch ✅AgentDeliveryTargetTool.DeleteAsync(本 PR 新增,L300-323):先 prime,后 dispatch ✅- 本处
AgentBuilderTool.DeleteAgentAsync:先 dispatch,后 prime ❌
注释 // Mirrors create_agent 是真的 —— 但 create_agent (L403-405) 同样也是 dispatch-before-prime,根因相同的 bug 早就存在,不是本 PR 引入的。建议本 PR 顺手把这里和 create_agent 一起改成:
// 先 prime
await EnsureUserAgentCatalogProjectionAsync(ct);
// 再 dispatch
var registryActor = await actorRuntime.GetAsync(...) ?? await actorRuntime.CreateAsync<...>(...);
await registryActor.HandleEventAsync(...);这样和同 PR 的 AgentDeliveryTargetTool 完全对齐,也符合 CLAUDE.md「写侧预挂接 projection」的精神。
There was a problem hiding this comment.
已修复 ✅ (commit 5183b0c5)
而且超出了我提议的范围:
DeleteAgentAsync(L513-518)— ✅ priming 移到 dispatch 之前DisableAgentAsync(L620-624)— ✅ 新增了 priming(原本完全没有,只靠 helper 内部 late-prime)EnableAgentAsync(L647-651)— ✅ 新增了 priming(同上)WaitForAgentStatusAsync— ✅ 移除了内部那个 "too-late" 的 priming,注释里也明确写了 "Projection scope priming happens in the caller before the dispatch"
rationale 也写得很正:
Activating up front costs at most one extra warm-grain round trip; eliminates the ambiguous comment.
现在四个 callsite 与 AgentDeliveryTargetTool 全部对齐成 prime-before-dispatch,符合「committed domain event 必须可观察」的要求。
| // | ||
| // Internal (not const) so tests can shrink the budget when they need to exercise | ||
| // the "did not reflect within budget" branch without burning the full 15 s. | ||
| internal static int ProjectionWaitAttempts = 30; |
There was a problem hiding this comment.
违反:同根因不统一治理
本 PR 引入了 ProjectionWaitAttempts = 30 / DelayMilliseconds = 500(15s 预算)作为「shared」常量,但 WaitForCreatedAgentAsync(此文件 L730-732)没切到这两个常量,仍是硬编码:
int maxAttempts = 10,
int delayMilliseconds = 500)即 create_agent 走的还是旧的 5s 预算。
PR description 给出的理由是 create_agent... window only matters during creation; not in user reports —— 但根因完全一样:同样的 projection lag,同样的 idle-deactivation,同样的 fire-and-forget HandleEventAsync。create 没暴露只是因为前面有 API key 创建 + Nyx HTTP 给了 projection 更多 wall-clock 启动窗口才碰巧没碰到 race。
下次报障还是这个类。建议:
private async Task<bool> WaitForCreatedAgentAsync(
IUserAgentCatalogQueryPort queryPort,
string agentId,
long versionBefore,
Func<UserAgentCatalogEntry, bool> predicate,
CancellationToken ct)
// 删掉 maxAttempts / delayMilliseconds 默认参数,内部直接用共享常量如果 run_immediately=true 需要更长窗口,把那个特殊性放到调用点,而不是默认参数里散开。
There was a problem hiding this comment.
部分采纳,可接受 5183b0c5)
作者的取舍:
- ✅ 把
WaitForTombstoneReflectedAsync也升级成「外部传入 + 默认 fallback」shape,默认值用ProjectionWaitDefaults.Attempts/DelayMilliseconds—— shape 对齐了。 - ❌ 但
WaitForCreatedAgentAsync自己没改(此文件 L757-761 现在仍是int maxAttempts = 10, int delayMilliseconds = 500),调用点也仍传maxAttempts: 10/20硬编码值。
PR description 里的理由是 "create window only matters during creation; not in user reports",我能接受这个判断 —— create 路径前面有 API key 创建 + Nyx HTTP roundtrip,wall-clock buffer 比 delete 宽很多,5s 在生产里目前够用。
但留个 trip-wire 在这里:下次 create_agent 报障 "projection not yet confirmed" 时,这两行默认值直接改成 ProjectionWaitDefaults.Attempts / ProjectionWaitDefaults.DelayMilliseconds 即可,根因和 delete 完全一样。
…ce budget Three review threads from the second pass: * r3141706674 (race: versionBefore captured after dispatch) — already resolved by 948cba5 dropping the version gate from WaitForAgentStatusAsync. The status field is the authoritative signal. * r3141707010 (priming runs after HandleEventAsync — too late if grain idle-deactivated). The previous comment claim that EnsureUserAgentCatalogProjectionAsync "rescues" a deactivated grain was only correct under specific stream-provider semantics. Move the call to *before* every dispatch (DeleteAgentAsync, DisableAgentAsync, EnableAgentAsync) so the projection scope is alive when the actor publishes. Costs at most one extra warm-grain round trip; eliminates the ambiguous comment. Removes the call from inside WaitForAgentStatusAsync — the helper is post-dispatch and a late prime there cannot recover an event the projector already missed. * r3141706856 (mutable static budget = action-at-a-distance flake risk). Extract `ProjectionWaitDefaults` (shared `internal static` constants) and switch both AgentBuilderTool and AgentDeliveryTargetTool to per-instance readonly fields populated via constructor parameters. Tests inject shrunk values directly: `new AgentBuilderTool(sp, projectionWaitAttempts: 3, projectionWaitDelayMilliseconds: 1)` No process-global mutation, no try/finally reset, no flake when the test surface eventually parallelizes across collections. WaitForTombstoneReflectedAsync now also accepts maxAttempts/delayMs parameters with default fallbacks, matching the existing shape of WaitForCreatedAgentAsync. AgentDeliveryTargetTool's previously `private const` budget — which prevented its tests from exercising the budget-exhausted branch — is now a constructor-injected field too. Tests: * `ExecuteAsync_DeleteAgent_ReturnsAcceptedWithPropagatingHint_*` no longer mutates a static; constructs with `projectionWaitAttempts: 3, projectionWaitDelayMilliseconds: 1` instead. * All 404 channel runtime tests pass in ~7 s. * tools/ci/test_stability_guards.sh, query_projection_priming_guard.sh, projection_route_mapping_guard.sh — all pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eanzhao
left a comment
There was a problem hiding this comment.
第二轮 review:修复评估
followup commit 5183b0c5 的修复非常干净。打分:
| 反馈 | 状态 |
|---|---|
| inline #1 priming 顺序 | ✅ 修复 + 扩展到 disable/enable(超出最小要求) |
inline #2 WaitForCreatedAgentAsync 默认值 |
|
| body #3 mutable static 测试隐患 | ✅ 修复(改成 constructor-injected per-instance readonly fields) |
| body #1 PR description 与代码不同步 | ❌ 仍未更新(见下) |
| body #2 callsite-priming 架构债 | ⏭ follow-up,不在本 PR 范围 |
单独说一下 PR description
现在 PR body 与代码已经在 3 个地方不一致:
- "Shared
internal static int ProjectionWaitAttempts = 30" —— 该静态字段已经被5183b0c5删除,改成ProjectionWaitDefaults+ per-instance readonly fields。 - "
DeleteAgentAsyncnow: captureversionBefore→ dispatch tombstone →EnsureUserAgentCatalogProjectionAsync→ ..." —— 实际顺序已改为 prime-before-dispatch。 - "
WaitForAgentStatusAsyncupgraded to capture-version + prime + 15 s window" —— version 锚点早在948cba58被去掉,prime 也在5183b0c5移到 caller。
合并前请同步 description,否则未来读 PR 历史的人会困惑(尤其 "prime after dispatch" 这种描述会让 reader 以为代码里也是这么做的)。
一个值得肯定的点
AgentDeliveryTargetTool 之前的 private const ProjectionWaitAttempts 也被改成了构造函数注入,让 delivery target 的 budget-exhausted 分支也变成可测试的 —— 这超出了我反馈的范围,是主动的质量提升。
其余意见
- 架构层面的 callsite-priming 模式仍然违反 CLAUDE.md「写侧预挂接 projection」原则,但这是 follow-up issue 范畴,本 PR 不应扩大 scope。建议合并后另开 issue 跟进 "projection scope co-tied with actor lifecycle" 的设计。
PR description 同步后即可合并(从代码角度看 LGTM)。
|
Reviewed latest head Code direction is good: the reported
Remaining issues:
Architecture note: the callsite-level |
…lifecycle wait Two follow-up threads from the third review pass (#issuecomment-4318611186): * create_agent late-prime: CreateDailyReportAgentAsync and CreateSocialMediaAgentAsync primed the projection AFTER HandleEventAsync, leaving the same race-class open as the originally fixed delete path. Move EnsureUserAgentCatalogProjectionAsync to before each dispatch so the projection scope is alive when the actor publishes initialize/trigger events. Same rationale as DeleteAgentAsync's fix in 5183b0c: a late prime cannot recover an event the projector already missed. * WaitForAgentStatusAsync stale-replica defense: status-only polling could accept a read replica that surfaces a historically expected status (e.g., prior disable->enable->disable cycle's projected Disabled state) before the current dispatch materializes. Capture versionBefore in the caller (before dispatch — the original 948cba5 fix removed it because it was being captured AFTER dispatch in the helper) and require StateVersion > versionBefore AND Status == expected. Mirrors WaitForCreatedAgentAsync's contract. Tests: * DispatchesDisableAndReturnsStatus / DispatchesEnableAndReturnsStatus / DispatchesWorkflowDisableAndReturnsStatus: added GetStateVersionAsync mocks (caller baseline 5L -> helper poll 6L) for the new dual-gate contract. * ReturnsStatusFast_WhenProjectionAdvancesOnFirstPoll (renamed from *_WhenProjectionUpdatedBeforeBaselineRead): rewritten to pin the new dual-gate first-poll exit (versionBefore=42, versionAfter=43 AND status=Disabled -> return immediately, <1s). * New KeepsWaitingWhenStatusMatchesButVersionStale: stale-replica defense — even if GetAsync surfaces an expected-looking Disabled, the wait must keep iterating until version advances. Pinned to a 3-attempt budget injected through the constructor. Verification: * dotnet test Aevatar.GAgents.ChannelRuntime.Tests -> 405/405 pass in ~6s * tools/ci/test_stability_guards.sh -> passed * tools/ci/query_projection_priming_guard.sh -> passed * tools/ci/projection_route_mapping_guard.sh -> passed * tools/ci/projection_state_version_guard.sh -> passed * tools/ci/projection_state_mirror_current_state_guard.sh -> passed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Addressed in (1) (2) (3) PR description — fixed. Updated to match the current code state (was already updated in the second pass; this commit adds an entry under (Architecture note — agreed, deferred.) Co-tying projection scope lifecycle to actor lifecycle (or moving to durable subscription replay) so callsites don't need to remember to prime is the right direction. That removes both the create-path race-class and the delete-path race-class structurally rather than per-callsite. Worth its own issue post-merge. Verification:
New tests pinning the dual-gate contract:
|
|
Reviewed
Tests/guards I reran locally:
One remaining bug in the new dual-gate path:
So the stale-replica scenario is still observable as a successful user response after the wait budget is exhausted. The new test Suggested fix: make the helper return an observation result, not a nullable entry, e.g. |
…gate on budget exhaustion Reviewer comment 4318670031 caught a real bug in 062f472: WaitForAgentStatusAsync's terminal fallback was `return await queryPort.GetAsync(agentId, ct)` — an un-gated read that bypassed the dual gate the helper had just enforced. Callers (DisableAgentAsync / EnableAgentAsync) used `?? entry.value` and then unconditionally serialized the result with the success copy ("Agent disabled. Scheduling paused." / "Agent enabled. Scheduling resumed."). Net effect: a stale-but-expected-looking replica observed by the terminal fallback (or a pre-dispatch entry surfaced via the null-coalesce) would still be reported as confirmed success even though StateVersion never advanced past versionBefore. The new test KeepsWaitingWhenStatusMatchesButVersionStale set up exactly this scenario but only asserted the GetStateVersionAsync call count and never asserted the returned payload — so the bug was invisible to the test even though the test "passed". Fix: change WaitForAgentStatusAsync's return type to `(bool Confirmed, UserAgentCatalogEntry? Entry)`. Drop the un-gated final GetAsync; if the dual gate never passes, return (false, null). DisableAgentAsync / EnableAgentAsync now branch on Confirmed: - Confirmed=true: SerializeAgentStatus(observation.Entry!, "Agent disabled. Scheduling paused.") - Confirmed=false: SerializeAgentStatus(entry.value, "Disable submitted. Run /agent-status in a few seconds to confirm the agent is paused.") The unconfirmed branch surfaces the pre-dispatch entry (so status field stays as the pre-disable value) plus an honest "submitted / propagating" note — symmetric with DeleteAgentAsync's "deleted" vs "accepted" / "Tombstone is propagating" pattern. Tests: * KeepsWaitingWhenStatusMatchesButVersionStale: added outcome assertions that the payload status field stays Running (the pre-dispatch value) and the note contains "Disable submitted" + "/agent-status" rather than "Scheduling paused". This is the assertion that should have been there from the start. * No other test changes — Dispatches*ReturnsStatus and ReturnsStatusFast_WhenProjectionAdvancesOnFirstPoll all exercise the Confirmed=true path and continue to pass. Verification: * dotnet test Aevatar.GAgents.ChannelRuntime.Tests -> 417/417 pass in ~6s * tools/ci/test_stability_guards.sh -> passed * tools/ci/query_projection_priming_guard.sh -> passed * tools/ci/projection_route_mapping_guard.sh -> passed * tools/ci/projection_state_version_guard.sh -> passed * tools/ci/projection_state_mirror_current_state_guard.sh -> passed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Acknowledged — that was a real bug, and the test I added in Helper signature change. Caller branching.
Test. doc.RootElement.GetProperty("status").GetString().Should().Be(SkillRunnerDefaults.StatusRunning);
var note = doc.RootElement.GetProperty("note").GetString();
note.Should().Contain("Disable submitted")
.And.Contain("/agent-status")
.And.NotContain("Scheduling paused");This is the assertion that should have been there from the start — a regression that re-introduces the un-gated terminal read, or drops the Verification:
Good catch — this was the right fix to push, not just the "At minimum" version. |
|
Reviewed The previous issue is fixed. The stale-replica test now has outcome-level assertions: status remains pre-dispatch, the note points users to I reran:
No remaining code-level blockers from my side. The only remaining item is the stated post-merge production regression check. |
Summary
/delete-agent <id> confirmwas returning "Delete submitted but registry tombstone is not yet reflected" and leaving the agent visible in/agents. Three compounding causes, all fixed./disable-agent,/enable-agent, andAgentDeliveryTargetToolupsert/delete onto a 15 s, projection-primed wait.Root cause
CommittedStateEventPublishedfor the tombstone landed in the stream with no subscribed materializer —DeleteAsyncwas never called and the document persisted indefinitely. The activation contract is "be alive when the event arrives", not "replay missed events", so a late prime cannot recover an event the projector already missed.delete_agentdid no priming at all; existingcreate_agentprimed after dispatch (same race, narrower window).OrleansActor.HandleEventAsyncis fire-and-forget (return _streams.GetStream(Id).ProduceAsync(envelope, ct)) —awaitonly signals "envelope queued", not "handler ran". So the post-dispatch wait must be defensive.Fix
agents/Aevatar.GAgents.ChannelRuntime/ProjectionWaitDefaults.cs(new)internal static classwithAttempts = 30/DelayMilliseconds = 500(15 s budget). Used as default values for constructor parameters in both tools below; tests inject shrunk values per-instance via the constructor instead of mutating process-global state.agents/Aevatar.GAgents.ChannelRuntime/AgentBuilderTool.cs_projectionWaitAttempts/_projectionWaitDelayMillisecondsreadonly fields populated from constructor parameters (defaulted toProjectionWaitDefaults).DeleteAgentAsyncnow: captureversionBefore→EnsureUserAgentCatalogProjectionAsync→ dispatch disable to skill-runner → revoke API key → dispatch tombstone to registry →WaitForTombstoneReflectedAsync(version-aware, exits early on document deletion). Priming runs before every dispatch so the projection scope is alive when the actor publishes; a late prime cannot recover a missed event.DisableAgentAsync/EnableAgentAsyncnow also prime before dispatch (previously relied onWaitForAgentStatusAsync's late prime, which was a no-op for the missed-event case).WaitForAgentStatusAsyncis now version+status dual-gated (mirrorsWaitForCreatedAgentAsync) and returns(bool Confirmed, UserAgentCatalogEntry? Entry): the caller capturesversionBeforebefore dispatch and the helper waits forStateVersion > versionBefore && Status == expected. When the budget exhausts the helper returns(Confirmed: false, Entry: null)— no un-gated terminalGetAsyncfallback that would surface a stale-but-expected-looking entry and let callers report success despite the contract not being satisfied. Both halves of the gate matter: status alone admits stale replicas that surface a historically expected status (e.g., from a prior disable→enable→disable cycle) before this dispatch materializes; version alone admits unrelated state events that advance the version without changing status. An earlier iteration capturedversionBeforeinside the helper after dispatch, which let a fast projection makeversionAfter == versionBeforeand burned the budget — that's why capture must happen before dispatch.DisableAgentAsync/EnableAgentAsyncbranch onobservation.Confirmed: confirmed → success copy ("Agent disabled. Scheduling paused."/"Agent enabled. Scheduling resumed.") with the projector-observed entry; unconfirmed → surface the pre-dispatch entry (preserving the original status field) plus an honest propagating note ("Disable submitted. Run /agent-status in a few seconds to confirm the agent is paused."), symmetric withDeleteAgentAsync'sdeletedvsaccepted/"Tombstone is propagating"pattern."Tombstone is propagating. Run /agents in a few seconds to confirm the agent is gone."(no longer suggests the dispatch may not have landed).agents/Aevatar.GAgents.ChannelRuntime/AgentDeliveryTargetTool.csprivate const, which prevented tests from exercising the budget-exhausted branch).DeleteAsyncmirrors theAgentBuilderToolpattern (prime-before-dispatch + version-aware tombstone wait + 15 s).UpsertAsyncpicks up the longer budget through the same fields.Other slash commands reviewed
/agents,/agent-status,/templates/run-agent/agent-status); no polling, no race./disable-agent,/enable-agentStateVersion > versionBefore && Status == expected)WaitForAgentStatusAsync./daily,/social-media(→ create_agent)ProjectionWaitDefaultsis a one-liner.agent_delivery_targetsupsert / deleteTest plan
dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj— 417 / 417 pass in ~6 sbash tools/ci/test_stability_guards.sh— passedbash tools/ci/query_projection_priming_guard.sh— passedbash tools/ci/projection_route_mapping_guard.sh— passedbash tools/ci/projection_state_version_guard.sh— passedbash tools/ci/projection_state_mirror_current_state_guard.sh— passedExecuteAsync_DeleteAgent_ReturnsAcceptedWithPropagatingHint_WhenTombstoneDoesNotReflectWithinBudgetcovers the fallback branch by injecting a shrunk wait budget through the constructor (projectionWaitAttempts: 3, projectionWaitDelayMilliseconds: 1) and asserts the new copy is shown — no process-global mutation, no try/finally reset.ExecuteAsync_DisableAgent_ReturnsStatusFast_WhenProjectionAdvancesOnFirstPollpins the dual-gate first-poll exit (versionBefore=42, helper poll seesversionAfter=43 && Status==Disabled→ return immediately, < 1 s). A regression that re-introduces status-only polling or moves the version capture back inside the helper makes this test fail.ExecuteAsync_DisableAgent_KeepsWaitingWhenStatusMatchesButVersionStale: stale-replica defense with both path-level and outcome-level assertions. Path:GetStateVersionAsyncis called exactly 4 times (1 caller baseline + 3 helper polls). Outcome: when the dual gate never passes, the user-facing payload must surfacestatus=Running(the pre-dispatch value) and a propagating note ("Disable submitted"+"/agent-status") — never"Scheduling paused". A regression that re-introduces the un-gated terminalGetAsyncOR drops theConfirmed/Entrybranching makes this test fail.GetStateVersionAsyncto return the caller-side baseline5Lthen advance to6Lon the helper poll — exercising the dual-gate fast-exit (suite recovers from ~50 s → ~6 s)./agentsreflects the delete within ~15 s.Review-driven changes
948cba58— drop the helper-side version gate fromWaitForAgentStatusAsync(codex P2:versionBeforecaptured after dispatch let a fast projection burn the full budget). Replaced again in062f4728with a caller-side version gate that does not have this race.5183b0c5— move priming before every dispatch (delete / disable / enable); switch shared budget from mutable static to constructor-injected per-instance readonly fields; extractProjectionWaitDefaults.062f4728— prime beforecreate_agentdispatches as well (daily report + social media — same race-class as the originally fixed delete path); add caller-sideversionBeforecapture fordisable_agent/enable_agentand dual-gateWaitForAgentStatusAsynconStateVersion > versionBefore && Status == expected(defends against stale read replicas surfacing a historically expected status).be4292d0— (this commit) close the dual-gate bypass:WaitForAgentStatusAsync's terminal fallback used to do an un-gatedGetAsyncand callers used?? entry.value+ unconditional success copy, so stale replicas could still produce"Scheduling paused"despite the dual gate having been violated. Helper now returns(Confirmed, Entry); callers branch onConfirmedand surface honest "submitted / propagating" copy when the gate never passes. The new test for stale-replica defense gained an outcome-level assertion that was missing in062f4728.🤖 Generated with Claude Code