Conversation
…n replay Two production issues observed after PR #409 shipped: ## Bug A — `99992364 user id cross tenant` on SkillRunner DM PR #409 switched p2p outbound to `union_id`, which is tenant-stable but still fails when the relay-side ingress and outbound proxy live in different Lark tenants (this deployment's NyxID `s/api-lark-bot` proxy is bound to a different tenant than the user's own bot that subscribed to events). Even the tenant-stable identifier is rejected: `code:99992364 user id cross tenant`. Switch the BuildFromInbound priority to `chat_id` first for ALL conversation types (DM and group). chat_id (`oc_*`) is the literal Lark chat where the inbound was received — when the outbound proxy authenticates as the same Lark app (the most common real configuration), sending back via `receive_id_type=chat_id` targets the same chat WITHOUT traversing any user-id translation. Falls back to union_id then open_id (with FellBack=true breadcrumbs) when chat_id is unavailable. ## Bug B — `Reply token already used` after card payload triggers NyxID 502 PR #409 introduced interactive card replies for /agents and /agent-status. Production showed NyxID's `channel-relay/reply` returning 502 for the card payload, after which the legacy "Interactive relay reply rejected; degrading to text" path re-sent the same relay token as plain text and got `401 Reply token already used` from NyxID — the relay token is single-use and was already consumed by the failed first attempt. The 401 escalated as `relay_reply_rejected`, queued an inbound turn retry, and the bot looked silent on every subsequent DM. Drop the post-dispatch text fallback in `TrySendInteractiveRelayReplyAsync`. Single-use semantics demand exactly one attempt per inbound; when the dispatcher fails, surface the error to the grain-level retry path instead of replaying the consumed token. The dispatcher's INTERNAL pre-flight fallbacks (no producer / composer rejects unsupported) are preserved because those run before the token is consumed. ## Other changes * `LarkBotErrorCodes.UserIdCrossTenant = 99992364` plus actionable hint in `SkillRunnerGAgent.BuildLarkRejectionMessage` and `FeishuCardHumanInteractionPort.BuildLarkRejectionMessage`. The hint surfaces in `last_error` shown by `/agent-status` so operators / users can correlate cross-tenant rejections with the recreate-the-agent recovery (`/agents` → Delete → `/daily`) the same way the existing cross-app hint does. ## Tests * `LarkConversationTargetsTests`: pin the new chat_id-first priority for p2p; pin the union_id and open_id fallbacks both setting `FellBackToPrefixInference=true` so call sites emit Debug breadcrumbs. * `AgentBuilderToolTests.PinsLarkChatId_When_RelayPropagatesIt` (renamed from `PinsLarkUnionId_*`): integration counterpart asserting the typed delivery target on `InitializeSkillRunnerCommand` lands as `(oc_*, "chat_id")` when the relay surfaces both LarkChatId and LarkUnionId. * `ChannelConversationTurnRunnerTests.RunLlmReplyAsync_ShouldNotRetryAsText_ WhenInteractiveDispatcherFails`: critical regression test that pins the NEW contract — when the dispatcher reports failure, the runner must NOT make a second HTTP call to the relay endpoint. Asserts the relay handler stays empty and the result surfaces `ErrorCode=relay_reply_rejected` with the original detail in `ErrorSummary`. * `SkillRunnerGAgentTests.ShouldIncludeRecreateHint_When_LarkRejectsAsCross TenantUserId`: pin the cross_tenant hint contract. Verification: 403 → 405 ChannelRuntime tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1f3f53d968
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codecov Report✅ All modified and coverable lines are covered by tests. @@ Coverage Diff @@
## dev #412 +/- ##
==========================================
- Coverage 70.38% 70.38% -0.01%
==========================================
Files 1175 1175
Lines 84452 84452
Branches 11124 11124
==========================================
- Hits 59443 59439 -4
- Misses 20718 20721 +3
- Partials 4291 4292 +1
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
…lback + #411 preflight Three review concerns from PR #412 plus the GitHub-403 issue, all in one PR per the user's request. ## eanzhao (relay-token replay still happens at grain level) Comment quote: "ToRelayFailure(...) still turns this into a transient relay_reply_rejected ... ConversationGAgent.HandleInboundTurnTransientFailureAsync will then persist InboundTurnRetryScheduledEvent and re-run the same inbound turn with the same relay reply token." Fix: introduce a distinct `relay_reply_token_consumed` error code that `ToRelayFailure` maps to `PermanentFailure` (vs transient `relay_reply_rejected`), so the grain-level retry queue does not re-run the same inbound turn after the dispatcher already consumed the single-use token. The in-turn replay drop from PR #412 was necessary but not sufficient — without the routing change, the 401 cascade just shifts to grain-level replay. Pinned by `RunLlmReplyAsync_ShouldNotRetryAsText_WhenInteractiveDispatcherFails`, which now asserts both `ErrorCode=relay_reply_token_consumed` and `FailureKind=PermanentAdapterError` plus the existing relay-handler-empty contract. ## codex-bot P1 (chat_id-first regresses cross-app same-tenant) Comment quote: "In deployments where the inbound relay bot and outbound proxy use different Lark apps (same-tenant cross-app), the outbound app is typically not a member of the inbound DM chat, so receive_id_type= chat_id fails while union_id was the working identifier in the previous logic." Fix: capture the cross-tenant-safe union_id at agent-create time as a SECONDARY delivery target alongside the chat_id primary. New proto fields `lark_receive_id_fallback` / `lark_receive_id_type_fallback` on `SkillRunnerOutboundConfig`, `UserAgentCatalogEntry/Document/UpsertCommand`, `WorkflowAgentState`/`Init`/`InitializedEvent` (mirrored end-to-end through `UserAgentCatalogProjector` + `UserAgentCatalogQueryPort` per the typed-field-projection-mirror lesson). New `LarkConversationTargets.BuildFromInboundWithFallback` returns `(primary, optional fallback)` — fallback is captured ONLY for p2p with a chat_id primary AND a union_id surfaced at ingress (groups don't need it, non-chat_id primaries are already the safest identifier we have). Runtime fallback retry: `SkillRunnerGAgent.SendOutputAsync` and `FeishuCardHumanInteractionPort.SendMessageAsync` now try the primary, then on Lark `230002 bot not in chat` (`LarkBotErrorCodes.BotNotInChat`) exactly retry once with the fallback typed pair. Other Lark codes (e.g. `99992364 cross_tenant`) propagate immediately so users see the actionable recovery hint for the actual failure mode rather than a misleading retry. Pinned by `SendOutputAsync_ShouldRetryWithFallback_When_PrimaryRejectedAsBotNotInChat` (asserts request order: primary chat_id then fallback union_id) and `SendOutputAsync_ShouldNotRetry_When_PrimaryRejectedWithDifferentLarkCode` (asserts only 230002 triggers the retry). ## Issue #411 (daily_report GitHub proxy 403 at runtime) The new agent API key is allowed_service_ids=api-github but might lack a bound GitHub credential, so every scheduled run hits 401/403 from proxy/s/api-github and the user sees an empty / degraded report with no hint that recreation is needed. Add a preflight in `AgentBuilderTool.CreateDailyReportAgentAsync` that calls `proxy/s/api-github/rate_limit` with the freshly minted key — if NyxID's envelope reports HTTP 401/403, return a structured `github_proxy_access_denied` error from the tool BEFORE persisting the agent, so the user is told to verify GitHub OAuth + API key bindings in NyxID instead of receiving a "scheduled" agent that never produces output. Pinned by `ExecuteAsync_CreateAgent_DailyReport_FailsClosed_When_GithubProxyDeniedForNewKey` which asserts the structured error is returned AND the SkillRunner actor never receives `InitializeSkillRunnerCommand` (no half-initialized agent left in the catalog). ## Verification - 411/411 ChannelRuntime tests pass (was 405 before; +6 covering primary+ fallback BuildFromInbound priority, runtime fallback retry, GitHub preflight, consumed-token PermanentFailure mapping, and a no-fallback contract for non-DM and non-chat_id primaries). - Build: 0 errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
顺手把 #411 一起修了:daily_report 创建时新加 GitHub proxy preflight,调用 `proxy/s/api-github/rate_limit` 用刚拿到的 agent API key;NyxID envelope 报 401/403 就直接返 `github_proxy_access_denied` 结构化错误,不持久化 agent。 {
"error": "github_proxy_access_denied",
"http_status": 403,
"proxy_body": "{\"message\":\"Bad credentials\"}",
"hint": "The new agent API key was created with `allowed_service_ids=api-github` but cannot reach GitHub via NyxID. Verify the GitHub OAuth provider is connected at NyxID and that the key picks up the binding (NyxID `api-keys/{id}/bindings`). Until this is resolved the daily report will return empty/degraded output every run.",
"nyx_provider_slug": "api-lark-bot"
}Test #411 issue 里还提了三条,本 PR 没做:
最关键的"agent 创建后必失败"的流程已经堵住了。 |
…voke Three reviewer concerns from the second pass on PR #412, all production-blocking because they prevent the just-added recovery paths from firing in real deployments. ## r3141700469 — LarkProxyResponse misses Lark code nested in HTTP-400 body Reviewer: "production failures arrive through `NyxIdApiClient.SendAsync` as an HTTP-400 Nyx envelope: `{\"error\": true, \"status\": 400, \"body\": \"{\\\"code\\\":99992364,...}\"}`. `LarkProxyResponse.TryGetError` currently returns true for that shape but leaves `larkCode=null` because it does not parse the nested `body`." Confirmed by reading `NyxIdApiClient.cs:680` — `SendAsync` wraps every HTTP non-2xx as `{"error": true, "status": <http>, "body": "<raw>"}`. The Lark business code lives INSIDE the `body` STRING. The original parser only checked top-level `code`, so every production HTTP-400 path (the common `230002 bot not in chat`, `99992364 cross_tenant`, etc.) fell through with `larkCode=null` — meaning the new `BotNotInChat` retry branch and the `UserIdCrossTenant` recovery hint NEVER fired in the real wrapped path. Fix: `LarkProxyResponse.TryGetError` now parses nested `body` JSON when the top-level Nyx error envelope is present. Returns the Lark code with a detail line like `nyx_status=400 lark_code=99992364 msg=user id cross tenant` so the layered context is preserved in log lines and exception messages. ## r3141699476 — GitHub preflight uses wrong field name Reviewer: "this parser does not catch the actual 403 shape produced by our `NyxIdApiClient.SendAsync`. For non-2xx responses `SendAsync` wraps the response as `{\"error\": true, \"status\": 403, \"body\": ...}` … while the new preflight only reads `code`." Same root cause as r3141700469 — the SendAsync wrapper uses `status`, not `code`. Fix: read both `status` (the SendAsync envelope) AND `code` (any top-level Lark code shape) so the preflight catches the actual 403/401 production envelope. ## r3141699756 — Orphan agent API key on preflight failure Reviewer: "the freshly created NyxID API key is left behind … repeated `/daily` attempts that hit the GitHub preflight will accumulate orphan proxy keys." Fix: best-effort `DeleteApiKeyAsync` immediately before returning the structured error. Failures during revoke are logged at Warning but do NOT propagate — the structured create-time error is the user-facing signal; an orphan key is an ops cleanup concern, not a hard failure that should mask the original preflight diagnosis. ## Tests - `LarkProxyResponse` tests are exercised via the integration tests below; the parser change has 100% coverage through callers. - New `SendOutputAsync_ShouldRetryWithFallback_When_PrimaryRejectedAsBot NotInChat_ViaHttp400Envelope` — uses the actual `SendAsync` HTTP-400 envelope shape `{"error":true,"status":400,"body":"{\"code\":230002,...}"}` and asserts the runtime retry runs against the union_id fallback. - New `SendOutputAsync_ShouldThrowCrossTenantHint_When_LarkCodeNestedInHttp 400Body` — same envelope shape but with `99992364`, asserts the cross-tenant recreate-the-agent hint fires (which it didn't in production before this fix). - Updated `ExecuteAsync_CreateAgent_DailyReport_FailsClosed_When_GithubProxy DeniedForNewKey` — now uses the real `{"error":true,"status":403,"body":...}` envelope shape AND asserts the DELETE on `/api/v1/api-keys/{id}` runs before the structured error is returned (orphan key revocation contract). Verification: 411 → 413 ChannelRuntime tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review: 承诺兑现情况 + 架构观察整体兑现良好,两轮 review 反馈处理得很到位。但有几个值得修的点: ✅ 承诺兑现
二审 review(包括我自己提的 r3141699476 / r3141699756 / r3141700469)也都修了:
|
Reviewer (long-form review §4) flagged that PR #412 silently reversed the branch order in LarkProxyResponse.TryGetError (was: error-envelope first, then top-level code; now: top-level code first, then error envelope). The two production shapes are mutually exclusive today so the change is a no-op on every observed response, but the priority is fixed deliberately for forward-compat against hypothetical hybrid envelopes like {"error":true,"status":200,"code":230002,...} where the top-level Lark business code is the more specific signal. Add the invariant + rationale to the TryGetError docstring so a future reader does not "fix" the order back without understanding why. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review second pass: one remaining test gap承诺的主行为基本已经兑现: 剩下一个建议合并前补上的测试缺口:PR 改了 这个路径和 SkillRunner 不完全等价:Feishu port 从 建议加一条 架构上更好的长期形状已经由 #408 / #414 / #415 覆盖:typed |
Reviewer (PR #412 second-pass review) noted that the 230002 → fallback retry was added to FeishuCardHumanInteractionPort.SendMessageAsync but catalog-backed coverage existed only in SkillRunnerGAgentTests. Without a port-side regression, projector / query-mirror drift on the new LarkReceiveIdFallback / LarkReceiveIdTypeFallback fields could go unnoticed while production cards stop delivering on cross-app same-tenant DMs. Add a regression test that: - Stubs IUserAgentCatalogRuntimeQueryPort with chat_id primary + union_id fallback typed pair. - Returns the real wrapped Nyx envelope shape on the primary POST: {"error":true,"status":400,"body":"{\"code\":230002,\"msg\":\"Bot is not in the chat\"}"}. - Asserts two POSTs, first with receive_id_type=chat_id + receive_id=oc_*, second with receive_id_type=union_id + receive_id=on_*, and msg_type=interactive on the fallback (so the retry preserves the card payload, not just the receive header). A SequencedRecordingHandler helper mirrors the SkillRunnerGAgentTests SequencedHandler — different response per request, full request/body recording for ordered assertions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
补上了 Feishu fallback retry 的回归测试 (
#414 / #415 长期会把这两份 retry 实现合并到 |
|
Review of
One small correction remains: the PR comment says this test covers the This is a test/documentation accuracy issue, not a new behavior blocker in the Feishu port fix itself. |
…or tests Reviewer (PR #412 comment 4318615107) caught that the previous Feishu port fallback regression test stubs IUserAgentCatalogRuntimeQueryPort directly, so it covers `entry → port retry` but not `projector → query mirror → entry`. The implementation does mirror LarkReceiveIdFallback and LarkReceiveIdTypeFallback in both UserAgentCatalogProjector.Materialize and UserAgentCatalogQueryPort.ToEntry, but UserAgentCatalogProjectorTests only asserted the primary LarkReceiveId / LarkReceiveIdType fields — so a silent drop of the fallback mirror would still leave 414/414 green while production cards stop falling back on cross-app same-tenant DMs. Extend the two existing typed-round-trip tests: - ProjectAsync_WithValidCommittedEvent_UpsertsDocument: input state now carries the chat_id primary + union_id fallback typed pair; the document assertions cover both LarkReceiveIdFallback and LarkReceiveIdTypeFallback alongside the existing primary fields. - ToEntry_ShouldRoundTripTypedLarkReceiveTarget_FromDocumentToEntry: input document now carries the fallback pair; the entry assertions cover both fallback fields surviving the document → entry conversion that FeishuCardHumanInteractionPort and SkillRunnerGAgent depend on. Both tests' inline rationale now points at PR #412 explicitly so a future reader knows why the fallback pair is part of the round-trip contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
说得对——上一条 comment 把覆盖范围说大了。
两条测试的 inline 注释里点名 PR #412 + 加上
|
|
Rechecked latest fix (13521e7). The fallback target coverage gap from #412 (comment) is addressed now: Verified locally:
No further blocking issues from this pass. |
Summary
This PR landed in three commits because review caught real production-blocking gaps each time. The current scope spans four behavior changes plus one issue fix, all on the SkillRunner / channel-runtime outbound delivery path. Reviewer (4318563419) called out that the prior PR body was severely stale; this rewrite documents what actually shipped.
Behavior changes
1. p2p outbound: chat_id primary + persisted union_id fallback with runtime retry on
230002LarkConversationTargets.BuildFromInboundnow pickschat_idfirst for ALL conversation types (was:union_idfor p2p,chat_idfor groups). Production showedunion_idgetting rejected with99992364 user id cross tenant— the relay-side ingress ands/api-lark-botoutbound apps live in different Lark tenants.chat_idis the only Lark identifier that survives both cross-app and cross-tenant boundaries when the same Lark app is on both ends of the relay.To avoid regressing cross-app same-tenant deployments (where the outbound app is NOT a member of the inbound DM and chat_id fails with
230002 bot not in chat), the newBuildFromInboundWithFallbackreturns(primary, optional fallback). Fallback is captured ONLY for p2p with a chat_id primary AND a union_id surfaced at ingress; groups skip the fallback (chat_id is tenant-scoped — either the outbound app is in the group or no user-id helps).New runtime retry:
SkillRunnerGAgent.SendOutputAsyncandFeishuCardHumanInteractionPort.SendMessageAsynctry the primary, on Lark230002(LarkBotErrorCodes.BotNotInChat) retry exactly once with the fallback typed pair. Other Lark codes propagate immediately so users see actionable hints for the actual failure mode.Persistence: 14 new fields across 7 proto messages (
SkillRunnerOutboundConfig,UserAgentCatalogEntry,UserAgentCatalogDocument,UserAgentCatalogUpsertCommand,WorkflowAgentState,InitializeWorkflowAgentCommand,WorkflowAgentInitializedEvent), mirrored end-to-end throughUserAgentCatalogProjector.Materialize+UserAgentCatalogQueryPort.ToEntryper the typed-field-projection-mirror lesson.The full Lark identifier failure ladder, in case future debugging needs the table:
open_id(ou_*)99992361 open_id cross appunion_id(on_*)99992364 user id cross tenantchat_id(oc_*) of inbound chat2. Single-use reply token:
relay_reply_token_consumed→PermanentFailure(NOT transient)PR #409's interactive cards triggered NyxID
channel-relay/reply502 → aevatar's legacy "degrade to text" replayed the same token → 401 "Reply token already used" → bot looked silent on every subsequent DM.PR #412 fixed the in-turn replay first; reviewer (r3141663815) caught that this only shifted the 401 cascade from in-turn replay to grain-level replay because
ToRelayFailurewas routing toTransientFailure. Final fix: distinct error coderelay_reply_token_consumed→PermanentFailure, soConversationGAgent.HandleInboundTurnTransientFailureAsyncdoes NOT queue anInboundTurnRetryScheduledEventfor the consumed-token case. Next inbound carries a fresh token; current turn is a write-off.3. Cross-tenant
99992364actionable error messageSkillRunnerGAgent.BuildLarkRejectionMessageandFeishuCardHumanInteractionPort.BuildLarkRejectionMessagenow expand the bare99992364 user id cross tenantLark error into:The string rides
SkillRunnerExecutionFailedEvent.Errorto/agent-statuslast_error, so users see the actionable recovery flow without reading source.4.
LarkProxyResponse.TryGetErrorparses the actualNyxIdApiClient.SendAsyncenvelopeReviewer (r3141700469) caught that the helper only checked top-level
code, butNyxIdApiClient.SendAsync(NyxIdApiClient.cs:680) wraps every HTTP non-2xx as{"error": true, "status": <http>, "body": "<raw downstream JSON>"}— Lark's business code (e.g.99992364,230002) lives INSIDE thebodySTRING. The new parser walks that string when the top-level Nyx envelope is present so thelarkCode-gated branches (BotNotInChat retry, UserIdCrossTenant hint) actually fire on the production path.Detail format:
nyx_status=400 lark_code=99992364 msg=user id cross tenantso log lines preserve both layers.5. Issue #411: GitHub proxy preflight + orphan API key revoke
A
daily_reportSkillRunner created withallowed_service_ids=api-githubwould persist successfully even when NyxID's binding from the new agent API key to the user's GitHub OAuth was missing — every scheduled run hit GitHub 403 and the user saw empty/degraded reports with no signal that recreation was needed.AgentBuilderTool.CreateDailyReportAgentAsyncnow callsproxy/s/api-github/rate_limitwith the freshly-minted key BEFORE persisting the agent. On HTTP 401/403, returns a structuredgithub_proxy_access_deniederror with the recovery hint, AND best-effort revokes the orphan API key (reviewer r3141699756 caught that without revoke, repeated/dailyattempts accumulate orphan proxy keys).The preflight envelope parser uses BOTH
status(the SendAsync wrapper field) ANDcodefor forward-compatibility (reviewer r3141699476).Files
Verification
BuildFromInboundWithFallbackpriority/factory testsRunLlmReplyAsync_ShouldNotRetryAsText(assertsrelayHandler.Requests.Empty+ErrorCode=relay_reply_token_consumed+FailureKind=PermanentAdapterError)BotNotInChatfallback retry (synthetic top-level + real wrapped HTTP-400 envelope)cross_tenanthint (real wrapped HTTP-400 envelope)non-230002 codes don't trigger fallbackAgentBuilderTool.PinsLarkChatId_When_RelayPropagatesIt(integration of chat_id capture)AgentBuilderTool.LogsFallbackBreadcrumb_When_LarkUnionIdMissing(LogDebug breadcrumb on legacy fallback)AgentBuilderTool.DoesNotLogFallback_When_LarkUnionIdPresent(no noise when not falling back)AgentBuilderTool.FailsClosed_When_GithubProxyDeniedForNewKey(preflight + actor-not-initialized + DELETE orphan key)tools/ci/architecture_guards.shreportsPlayground asset drift detected for app.js / app.css— pre-existing onorigin/dev, unrelated.Migration
Same as PR #409: existing agents pinned to
LarkReceiveIdType=open_idorunion_idwon't self-heal because the persisted typed pair is treated as authoritative on the read path. Users see actionablelast_errortext in/agent-statusand recover via/agents→ Delete →/daily(two clicks with the card UI from PR #409). New agents created after this PR get the chat_id primary + union_id fallback automatically.Out of scope (architectural follow-ups, tracked separately)
Reviewer (4318563419) flagged three architecture-quality observations as non-blocking. Each is filed as a separate issue so they don't get lost:
TrySendWithFallbackAsyncis duplicated inSkillRunnerGAgent+FeishuCardHumanInteractionPort(with already-drifting log strings); extract a shared retry helper.LarkReceiveTargetsub-message; a third-level fallback would explode further. Typed sub-message refactor is the long-term shape.ILarkOutboundDispatcher), not in actor/port code, so platform-specific identifier knowledge does not leak. Subsumes Extract shared Lark outbound retry helper (DRY TrySendWithFallbackAsync between SkillRunner and FeishuCardHumanInteractionPort) #414 once landed.A fourth observation from the long-form review §4 (
LarkProxyResponse.TryGetErrorbranch-order rationale) is addressed in this PR byfdf66780: the priority-order invariant + forward-compat reasoning is now in the docstring so future readers do not silently revert it.🤖 Generated with Claude Code