Skip to content

fix(channel-runtime): chat_id-first outbound + fallback retry + consumed-token PermanentFailure + #411 GitHub preflight#412

Merged
eanzhao merged 6 commits intodevfrom
fix/2026-04-25_lark-prefer-chat-id-for-dm
Apr 25, 2026
Merged

fix(channel-runtime): chat_id-first outbound + fallback retry + consumed-token PermanentFailure + #411 GitHub preflight#412
eanzhao merged 6 commits intodevfrom
fix/2026-04-25_lark-prefer-chat-id-for-dm

Conversation

@eanzhao
Copy link
Copy Markdown
Contributor

@eanzhao eanzhao commented Apr 25, 2026

Summary

This PR landed in three commits because review caught real production-blocking gaps each time. The current scope spans four behavior changes plus one issue fix, all on the SkillRunner / channel-runtime outbound delivery path. Reviewer (4318563419) called out that the prior PR body was severely stale; this rewrite documents what actually shipped.

Behavior changes

1. p2p outbound: chat_id primary + persisted union_id fallback with runtime retry on 230002

LarkConversationTargets.BuildFromInbound now picks chat_id first for ALL conversation types (was: union_id for p2p, chat_id for groups). Production showed union_id getting rejected with 99992364 user id cross tenant — the relay-side ingress and s/api-lark-bot outbound apps live in different Lark tenants. chat_id is the only Lark identifier that survives both cross-app and cross-tenant boundaries when the same Lark app is on both ends of the relay.

To avoid regressing cross-app same-tenant deployments (where the outbound app is NOT a member of the inbound DM and chat_id fails with 230002 bot not in chat), the new BuildFromInboundWithFallback returns (primary, optional fallback). Fallback is captured ONLY for p2p with a chat_id primary AND a union_id surfaced at ingress; groups skip the fallback (chat_id is tenant-scoped — either the outbound app is in the group or no user-id helps).

New runtime retry: SkillRunnerGAgent.SendOutputAsync and FeishuCardHumanInteractionPort.SendMessageAsync try the primary, on Lark 230002 (LarkBotErrorCodes.BotNotInChat) retry exactly once with the fallback typed pair. Other Lark codes propagate immediately so users see actionable hints for the actual failure mode.

Persistence: 14 new fields across 7 proto messages (SkillRunnerOutboundConfig, UserAgentCatalogEntry, UserAgentCatalogDocument, UserAgentCatalogUpsertCommand, WorkflowAgentState, InitializeWorkflowAgentCommand, WorkflowAgentInitializedEvent), mirrored end-to-end through UserAgentCatalogProjector.Materialize + UserAgentCatalogQueryPort.ToEntry per the typed-field-projection-mirror lesson.

The full Lark identifier failure ladder, in case future debugging needs the table:

Identifier Same app Different apps, same tenant Different apps, different tenants
open_id (ou_*) 99992361 open_id cross app
union_id (on_*) 99992364 user id cross tenant
chat_id (oc_*) of inbound chat ✅ if outbound app in chat ✅ when same app received the inbound

2. Single-use reply token: relay_reply_token_consumedPermanentFailure (NOT transient)

PR #409's interactive cards triggered NyxID channel-relay/reply 502 → aevatar's legacy "degrade to text" replayed the same token → 401 "Reply token already used" → bot looked silent on every subsequent DM.

PR #412 fixed the in-turn replay first; reviewer (r3141663815) caught that this only shifted the 401 cascade from in-turn replay to grain-level replay because ToRelayFailure was routing to TransientFailure. Final fix: distinct error code relay_reply_token_consumedPermanentFailure, so ConversationGAgent.HandleInboundTurnTransientFailureAsync does NOT queue an InboundTurnRetryScheduledEvent for the consumed-token case. Next inbound carries a fresh token; current turn is a write-off.

3. Cross-tenant 99992364 actionable error message

SkillRunnerGAgent.BuildLarkRejectionMessage and FeishuCardHumanInteractionPort.BuildLarkRejectionMessage now expand the bare 99992364 user id cross tenant Lark error into:

Lark message delivery rejected (code=99992364): user id cross tenant. The outbound Lark app is in a different tenant than the inbound app, so user-id translation is impossible. Delete and recreate the agent (/agents → Delete → /daily) so the new chat_id-preferred outbound path takes effect, or align the NyxID s/api-lark-bot proxy with the channel-bot that received the inbound event.

The string rides SkillRunnerExecutionFailedEvent.Error to /agent-status last_error, so users see the actionable recovery flow without reading source.

4. LarkProxyResponse.TryGetError parses the actual NyxIdApiClient.SendAsync envelope

Reviewer (r3141700469) caught that the helper only checked top-level code, but NyxIdApiClient.SendAsync (NyxIdApiClient.cs:680) wraps every HTTP non-2xx as {"error": true, "status": <http>, "body": "<raw downstream JSON>"} — Lark's business code (e.g. 99992364, 230002) lives INSIDE the body STRING. The new parser walks that string when the top-level Nyx envelope is present so the larkCode-gated branches (BotNotInChat retry, UserIdCrossTenant hint) actually fire on the production path.

Detail format: nyx_status=400 lark_code=99992364 msg=user id cross tenant so log lines preserve both layers.

5. Issue #411: GitHub proxy preflight + orphan API key revoke

A daily_report SkillRunner created with allowed_service_ids=api-github would persist successfully even when NyxID's binding from the new agent API key to the user's GitHub OAuth was missing — every scheduled run hit GitHub 403 and the user saw empty/degraded reports with no signal that recreation was needed.

AgentBuilderTool.CreateDailyReportAgentAsync now calls proxy/s/api-github/rate_limit with the freshly-minted key BEFORE persisting the agent. On HTTP 401/403, returns a structured github_proxy_access_denied error with the recovery hint, AND best-effort revokes the orphan API key (reviewer r3141699756 caught that without revoke, repeated /daily attempts accumulate orphan proxy keys).

The preflight envelope parser uses BOTH status (the SendAsync wrapper field) AND code for forward-compatibility (reviewer r3141699476).

Files

agents/Aevatar.GAgents.ChannelRuntime/LarkConversationTargets.cs       (chat_id-first + BuildFromInboundWithFallback)
agents/Aevatar.GAgents.ChannelRuntime/LarkProxyResponse.cs             (nested body parsing for HTTP-non-2xx envelope)
agents/Aevatar.GAgents.ChannelRuntime/LarkBotErrorCodes.cs             (+ UserIdCrossTenant 99992364, BotNotInChat 230002)
agents/Aevatar.GAgents.ChannelRuntime/SkillRunnerGAgent.cs             (TrySendWithFallbackAsync + cross_tenant hint)
agents/Aevatar.GAgents.ChannelRuntime/FeishuCardHumanInteractionPort.cs (TrySendWithFallbackAsync + cross_tenant hint)
agents/Aevatar.GAgents.ChannelRuntime/AgentBuilderTool.cs              (PreflightGitHubProxyAsync + BestEffortRevokeApiKeyAsync + delivery target capture)
agents/Aevatar.GAgents.ChannelRuntime/UserAgentCatalogGAgent.cs        (merge fallback fields on upsert)
agents/Aevatar.GAgents.ChannelRuntime/UserAgentCatalogProjector.cs     (mirror fallback to document)
agents/Aevatar.GAgents.ChannelRuntime/UserAgentCatalogQueryPort.cs     (mirror fallback document → entry)
agents/Aevatar.GAgents.ChannelRuntime/WorkflowAgentGAgent.cs           (mirror fallback through state apply + upsert)
agents/Aevatar.GAgents.ChannelRuntime/ChannelConversationTurnRunner.cs (consumed-token PermanentFailure routing + drop post-dispatch text replay)
agents/Aevatar.GAgents.ChannelRuntime/channel_runtime_messages.proto   (14 new fallback fields across 7 messages)

test/Aevatar.GAgents.ChannelRuntime.Tests/LarkConversationTargetsTests.cs       (priority + WithFallback factory pinning)
test/Aevatar.GAgents.ChannelRuntime.Tests/AgentBuilderToolTests.cs              (PinsLarkChatId + GitHub preflight + orphan revoke)
test/Aevatar.GAgents.ChannelRuntime.Tests/ChannelConversationTurnRunnerTests.cs (no-retry-as-text + PermanentFailure mapping)
test/Aevatar.GAgents.ChannelRuntime.Tests/SkillRunnerGAgentTests.cs             (BotNotInChat fallback retry [synthetic + real envelope] + cross_tenant hint [synthetic + real envelope])

Verification

dotnet build agents/Aevatar.GAgents.ChannelRuntime/Aevatar.GAgents.ChannelRuntime.csproj --nologo
dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo
  • Build: 0 errors.
  • Tests: 413/413 pass. New coverage adds:
    • BuildFromInboundWithFallback priority/factory tests
    • RunLlmReplyAsync_ShouldNotRetryAsText (asserts relayHandler.Requests.Empty + ErrorCode=relay_reply_token_consumed + FailureKind=PermanentAdapterError)
    • 2× SkillRunner BotNotInChat fallback retry (synthetic top-level + real wrapped HTTP-400 envelope)
    • 1× SkillRunner cross_tenant hint (real wrapped HTTP-400 envelope)
    • 1× SkillRunner non-230002 codes don't trigger fallback
    • AgentBuilderTool.PinsLarkChatId_When_RelayPropagatesIt (integration of chat_id capture)
    • AgentBuilderTool.LogsFallbackBreadcrumb_When_LarkUnionIdMissing (LogDebug breadcrumb on legacy fallback)
    • AgentBuilderTool.DoesNotLogFallback_When_LarkUnionIdPresent (no noise when not falling back)
    • AgentBuilderTool.FailsClosed_When_GithubProxyDeniedForNewKey (preflight + actor-not-initialized + DELETE orphan key)

tools/ci/architecture_guards.sh reports Playground asset drift detected for app.js / app.css — pre-existing on origin/dev, unrelated.

Migration

Same as PR #409: existing agents pinned to LarkReceiveIdType=open_id or union_id won't self-heal because the persisted typed pair is treated as authoritative on the read path. Users see actionable last_error text in /agent-status and recover via /agents → Delete → /daily (two clicks with the card UI from PR #409). New agents created after this PR get the chat_id primary + union_id fallback automatically.

Out of scope (architectural follow-ups, tracked separately)

Reviewer (4318563419) flagged three architecture-quality observations as non-blocking. Each is filed as a separate issue so they don't get lost:

A fourth observation from the long-form review §4 (LarkProxyResponse.TryGetError branch-order rationale) is addressed in this PR by fdf66780: the priority-order invariant + forward-compat reasoning is now in the docstring so future readers do not silently revert it.

🤖 Generated with Claude Code

…n replay

Two production issues observed after PR #409 shipped:

## Bug A — `99992364 user id cross tenant` on SkillRunner DM

PR #409 switched p2p outbound to `union_id`, which is tenant-stable but still
fails when the relay-side ingress and outbound proxy live in different Lark
tenants (this deployment's NyxID `s/api-lark-bot` proxy is bound to a
different tenant than the user's own bot that subscribed to events). Even
the tenant-stable identifier is rejected: `code:99992364 user id cross
tenant`.

Switch the BuildFromInbound priority to `chat_id` first for ALL conversation
types (DM and group). chat_id (`oc_*`) is the literal Lark chat where the
inbound was received — when the outbound proxy authenticates as the same
Lark app (the most common real configuration), sending back via
`receive_id_type=chat_id` targets the same chat WITHOUT traversing any
user-id translation. Falls back to union_id then open_id (with
FellBack=true breadcrumbs) when chat_id is unavailable.

## Bug B — `Reply token already used` after card payload triggers NyxID 502

PR #409 introduced interactive card replies for /agents and /agent-status.
Production showed NyxID's `channel-relay/reply` returning 502 for the card
payload, after which the legacy "Interactive relay reply rejected; degrading
to text" path re-sent the same relay token as plain text and got
`401 Reply token already used` from NyxID — the relay token is single-use
and was already consumed by the failed first attempt. The 401 escalated as
`relay_reply_rejected`, queued an inbound turn retry, and the bot looked
silent on every subsequent DM.

Drop the post-dispatch text fallback in `TrySendInteractiveRelayReplyAsync`.
Single-use semantics demand exactly one attempt per inbound; when the
dispatcher fails, surface the error to the grain-level retry path instead
of replaying the consumed token. The dispatcher's INTERNAL pre-flight
fallbacks (no producer / composer rejects unsupported) are preserved
because those run before the token is consumed.

## Other changes

* `LarkBotErrorCodes.UserIdCrossTenant = 99992364` plus actionable hint in
  `SkillRunnerGAgent.BuildLarkRejectionMessage` and
  `FeishuCardHumanInteractionPort.BuildLarkRejectionMessage`. The hint
  surfaces in `last_error` shown by `/agent-status` so operators / users
  can correlate cross-tenant rejections with the recreate-the-agent
  recovery (`/agents` → Delete → `/daily`) the same way the existing
  cross-app hint does.

## Tests

* `LarkConversationTargetsTests`: pin the new chat_id-first priority for
  p2p; pin the union_id and open_id fallbacks both setting
  `FellBackToPrefixInference=true` so call sites emit Debug breadcrumbs.
* `AgentBuilderToolTests.PinsLarkChatId_When_RelayPropagatesIt` (renamed
  from `PinsLarkUnionId_*`): integration counterpart asserting the typed
  delivery target on `InitializeSkillRunnerCommand` lands as
  `(oc_*, "chat_id")` when the relay surfaces both LarkChatId and
  LarkUnionId.
* `ChannelConversationTurnRunnerTests.RunLlmReplyAsync_ShouldNotRetryAsText_
  WhenInteractiveDispatcherFails`: critical regression test that pins the
  NEW contract — when the dispatcher reports failure, the runner must NOT
  make a second HTTP call to the relay endpoint. Asserts the relay handler
  stays empty and the result surfaces `ErrorCode=relay_reply_rejected` with
  the original detail in `ErrorSummary`.
* `SkillRunnerGAgentTests.ShouldIncludeRecreateHint_When_LarkRejectsAsCross
  TenantUserId`: pin the cross_tenant hint contract.

Verification: 403 → 405 ChannelRuntime tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f3f53d968

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread agents/Aevatar.GAgents.ChannelRuntime/LarkConversationTargets.cs
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.38%. Comparing base (8fc13a2) to head (13521e7).
⚠️ Report is 7 commits behind head on dev.

@@            Coverage Diff             @@
##              dev     #412      +/-   ##
==========================================
- Coverage   70.38%   70.38%   -0.01%     
==========================================
  Files        1175     1175              
  Lines       84452    84452              
  Branches    11124    11124              
==========================================
- Hits        59443    59439       -4     
- Misses      20718    20721       +3     
- Partials     4291     4292       +1     
Flag Coverage Δ
ci 70.38% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.
see 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…lback + #411 preflight

Three review concerns from PR #412 plus the GitHub-403 issue, all in one PR per
the user's request.

## eanzhao (relay-token replay still happens at grain level)

Comment quote: "ToRelayFailure(...) still turns this into a transient
relay_reply_rejected ... ConversationGAgent.HandleInboundTurnTransientFailureAsync
will then persist InboundTurnRetryScheduledEvent and re-run the same inbound
turn with the same relay reply token."

Fix: introduce a distinct `relay_reply_token_consumed` error code that
`ToRelayFailure` maps to `PermanentFailure` (vs transient
`relay_reply_rejected`), so the grain-level retry queue does not re-run the
same inbound turn after the dispatcher already consumed the single-use token.
The in-turn replay drop from PR #412 was necessary but not sufficient —
without the routing change, the 401 cascade just shifts to grain-level
replay. Pinned by
`RunLlmReplyAsync_ShouldNotRetryAsText_WhenInteractiveDispatcherFails`,
which now asserts both `ErrorCode=relay_reply_token_consumed` and
`FailureKind=PermanentAdapterError` plus the existing relay-handler-empty
contract.

## codex-bot P1 (chat_id-first regresses cross-app same-tenant)

Comment quote: "In deployments where the inbound relay bot and outbound
proxy use different Lark apps (same-tenant cross-app), the outbound app
is typically not a member of the inbound DM chat, so receive_id_type=
chat_id fails while union_id was the working identifier in the previous
logic."

Fix: capture the cross-tenant-safe union_id at agent-create time as a
SECONDARY delivery target alongside the chat_id primary. New proto fields
`lark_receive_id_fallback` / `lark_receive_id_type_fallback` on
`SkillRunnerOutboundConfig`, `UserAgentCatalogEntry/Document/UpsertCommand`,
`WorkflowAgentState`/`Init`/`InitializedEvent` (mirrored end-to-end through
`UserAgentCatalogProjector` + `UserAgentCatalogQueryPort` per the
typed-field-projection-mirror lesson). New
`LarkConversationTargets.BuildFromInboundWithFallback` returns
`(primary, optional fallback)` — fallback is captured ONLY for p2p with a
chat_id primary AND a union_id surfaced at ingress (groups don't need it,
non-chat_id primaries are already the safest identifier we have).

Runtime fallback retry: `SkillRunnerGAgent.SendOutputAsync` and
`FeishuCardHumanInteractionPort.SendMessageAsync` now try the primary, then
on Lark `230002 bot not in chat` (`LarkBotErrorCodes.BotNotInChat`)
exactly retry once with the fallback typed pair. Other Lark codes (e.g.
`99992364 cross_tenant`) propagate immediately so users see the actionable
recovery hint for the actual failure mode rather than a misleading retry.
Pinned by
`SendOutputAsync_ShouldRetryWithFallback_When_PrimaryRejectedAsBotNotInChat`
(asserts request order: primary chat_id then fallback union_id) and
`SendOutputAsync_ShouldNotRetry_When_PrimaryRejectedWithDifferentLarkCode`
(asserts only 230002 triggers the retry).

## Issue #411 (daily_report GitHub proxy 403 at runtime)

The new agent API key is allowed_service_ids=api-github but might lack a
bound GitHub credential, so every scheduled run hits 401/403 from
proxy/s/api-github and the user sees an empty / degraded report with no
hint that recreation is needed. Add a preflight in
`AgentBuilderTool.CreateDailyReportAgentAsync` that calls
`proxy/s/api-github/rate_limit` with the freshly minted key — if NyxID's
envelope reports HTTP 401/403, return a structured
`github_proxy_access_denied` error from the tool BEFORE persisting the
agent, so the user is told to verify GitHub OAuth + API key bindings in
NyxID instead of receiving a "scheduled" agent that never produces output.
Pinned by
`ExecuteAsync_CreateAgent_DailyReport_FailsClosed_When_GithubProxyDeniedForNewKey`
which asserts the structured error is returned AND the SkillRunner actor
never receives `InitializeSkillRunnerCommand` (no half-initialized agent
left in the catalog).

## Verification

- 411/411 ChannelRuntime tests pass (was 405 before; +6 covering primary+
  fallback BuildFromInbound priority, runtime fallback retry, GitHub
  preflight, consumed-token PermanentFailure mapping, and a no-fallback
  contract for non-DM and non-chat_id primaries).
- Build: 0 errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eanzhao
Copy link
Copy Markdown
Contributor Author

eanzhao commented Apr 25, 2026

顺手把 #411 一起修了:daily_report 创建时新加 GitHub proxy preflight,调用 `proxy/s/api-github/rate_limit` 用刚拿到的 agent API key;NyxID envelope 报 401/403 就直接返 `github_proxy_access_denied` 结构化错误,不持久化 agent。

{
  "error": "github_proxy_access_denied",
  "http_status": 403,
  "proxy_body": "{\"message\":\"Bad credentials\"}",
  "hint": "The new agent API key was created with `allowed_service_ids=api-github` but cannot reach GitHub via NyxID. Verify the GitHub OAuth provider is connected at NyxID and that the key picks up the binding (NyxID `api-keys/{id}/bindings`). Until this is resolved the daily report will return empty/degraded output every run.",
  "nyx_provider_slug": "api-lark-bot"
}

Test ExecuteAsync_CreateAgent_DailyReport_FailsClosed_When_GithubProxyDeniedForNewKey pin 两件事:返结构化错误 + SkillRunner actor 绝不收到 `InitializeSkillRunnerCommand`(catalog 里不会留半拉子 agent)。

#411 issue 里还提了三条,本 PR 没做:

  • ❌ Surface "all GitHub tool calls failed" as SkillRunner failure with explicit LastError — 已经存在(PR fix(channel-runtime): resolve Lark DM receive_id_type and quiet best-effort reaction noise #403 的 throw 路径会让 `SkillRunnerExecutionFailedEvent.Error` 带具体 Lark code/detail 落到 `/agent-status` last_error)。preflight 兜底了大多数情况,剩下的真在 runtime 才坏(OAuth 中途失效)的场景照旧由 trigger handler 走 retry → ExecutionFailed。
  • ❌ Sanitized diagnostic logging for proxy error bodies — 没动,preflight 已经把 401/403 body 透传到结构化错误 `proxy_body` 里了;保留 `NyxIdApiClient.SendAsync` 不变,避免改动 prod 日志路径。
  • ❌ Tighten malformed `nyxid_proxy` calls — 那是 LLM 工具调用层的事,不在 SkillRunner 创建/运行链路上,留独立 issue 跟踪。

最关键的"agent 创建后必失败"的流程已经堵住了。

Comment thread agents/Aevatar.GAgents.ChannelRuntime/AgentBuilderTool.cs Outdated
Comment thread agents/Aevatar.GAgents.ChannelRuntime/AgentBuilderTool.cs
Comment thread agents/Aevatar.GAgents.ChannelRuntime/SkillRunnerGAgent.cs
…voke

Three reviewer concerns from the second pass on PR #412, all production-blocking
because they prevent the just-added recovery paths from firing in real
deployments.

## r3141700469 — LarkProxyResponse misses Lark code nested in HTTP-400 body

Reviewer: "production failures arrive through `NyxIdApiClient.SendAsync` as an
HTTP-400 Nyx envelope: `{\"error\": true, \"status\": 400, \"body\":
\"{\\\"code\\\":99992364,...}\"}`. `LarkProxyResponse.TryGetError` currently
returns true for that shape but leaves `larkCode=null` because it does not
parse the nested `body`."

Confirmed by reading `NyxIdApiClient.cs:680` — `SendAsync` wraps every HTTP
non-2xx as `{"error": true, "status": <http>, "body": "<raw>"}`. The Lark
business code lives INSIDE the `body` STRING. The original parser only
checked top-level `code`, so every production HTTP-400 path (the common
`230002 bot not in chat`, `99992364 cross_tenant`, etc.) fell through with
`larkCode=null` — meaning the new `BotNotInChat` retry branch and the
`UserIdCrossTenant` recovery hint NEVER fired in the real wrapped path.

Fix: `LarkProxyResponse.TryGetError` now parses nested `body` JSON when the
top-level Nyx error envelope is present. Returns the Lark code with a detail
line like `nyx_status=400 lark_code=99992364 msg=user id cross tenant` so
the layered context is preserved in log lines and exception messages.

## r3141699476 — GitHub preflight uses wrong field name

Reviewer: "this parser does not catch the actual 403 shape produced by our
`NyxIdApiClient.SendAsync`. For non-2xx responses `SendAsync` wraps the
response as `{\"error\": true, \"status\": 403, \"body\": ...}` … while the
new preflight only reads `code`."

Same root cause as r3141700469 — the SendAsync wrapper uses `status`, not
`code`. Fix: read both `status` (the SendAsync envelope) AND `code` (any
top-level Lark code shape) so the preflight catches the actual 403/401
production envelope.

## r3141699756 — Orphan agent API key on preflight failure

Reviewer: "the freshly created NyxID API key is left behind … repeated
`/daily` attempts that hit the GitHub preflight will accumulate orphan
proxy keys."

Fix: best-effort `DeleteApiKeyAsync` immediately before returning the
structured error. Failures during revoke are logged at Warning but do NOT
propagate — the structured create-time error is the user-facing signal; an
orphan key is an ops cleanup concern, not a hard failure that should mask
the original preflight diagnosis.

## Tests

- `LarkProxyResponse` tests are exercised via the integration tests below;
  the parser change has 100% coverage through callers.
- New `SendOutputAsync_ShouldRetryWithFallback_When_PrimaryRejectedAsBot
  NotInChat_ViaHttp400Envelope` — uses the actual `SendAsync` HTTP-400
  envelope shape `{"error":true,"status":400,"body":"{\"code\":230002,...}"}`
  and asserts the runtime retry runs against the union_id fallback.
- New `SendOutputAsync_ShouldThrowCrossTenantHint_When_LarkCodeNestedInHttp
  400Body` — same envelope shape but with `99992364`, asserts the
  cross-tenant recreate-the-agent hint fires (which it didn't in production
  before this fix).
- Updated `ExecuteAsync_CreateAgent_DailyReport_FailsClosed_When_GithubProxy
  DeniedForNewKey` — now uses the real `{"error":true,"status":403,"body":...}`
  envelope shape AND asserts the DELETE on `/api/v1/api-keys/{id}` runs
  before the structured error is returned (orphan key revocation contract).

Verification: 411 → 413 ChannelRuntime tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eanzhao
Copy link
Copy Markdown
Contributor Author

eanzhao commented Apr 25, 2026

Review: 承诺兑现情况 + 架构观察

整体兑现良好,两轮 review 反馈处理得很到位。但有几个值得修的点:

✅ 承诺兑现

承诺 实现
BuildFromInbound chat_id-first(all conversation types) LarkConversationTargets.cs:96-126
Fallback chat_id → union_id → open_id(带 FellBack=true 面包屑) LarkConversationTargets.cs:106-122
TrySendInteractiveRelayReplyAsync 不再 dispatch 失败后重发 ChannelConversationTurnRunner.cs:514-525
LarkBotErrorCodes.UserIdCrossTenant = 99992364
两个 port 的 BuildLarkRejectionMessage 都加 cross_tenant hint
Single-use token → PermanentFailure 而非 transient relay_reply_rejected ToRelayFailure 路由 relay_reply_token_consumedPermanentFailure

二审 review(包括我自己提的 r3141699476 / r3141699756 / r3141700469)也都修了:PreflightGitHubProxyAsync 同时读 statuscodeBestEffortRevokeApiKeyAsync 清理孤儿 key、LarkProxyResponse 解析嵌套 body。

⚠️ PR 描述严重过期(建议修)

PR title/body 只说 "chat_id-first outbound + drop reply-token replay",但实际实现还包括:

  1. 新增 runtime fallback 重试机制(重大行为变更,描述里完全没提)
    • 230002 bot not in chat 触发 primary → fallback 一次重试
    • 增加 7 个 proto message 的 lark_receive_id_fallback / lark_receive_id_type_fallback 字段
    • SkillRunnerGAgent.TrySendWithFallbackAsync + FeishuCardHumanInteractionPort.TrySendWithFallbackAsync
  2. 完整修了 SkillRunner daily_report GitHub proxy 403s at runtime #411(只在中文 comment 里提了一句"顺手")
    • PreflightGitHubProxyAsync + BestEffortRevokeApiKeyAsync + 完整测试
  3. LarkProxyResponse.TryGetError 行为变更(嵌套 body 解析 + 重排优先级)
  4. "Files" 列表漏了LarkProxyResponse.csAgentBuilderTool.csUserAgentCatalogGAgent.csUserAgentCatalogProjector.csUserAgentCatalogQueryPort.csWorkflowAgentGAgent.cschannel_runtime_messages.proto
  5. 测试数量描述失真:实际新增 8+ 个测试,不是描述里的 "3 + 1 + 1 + 1"

后续 archaeology 看不到 230002 fallback 这个新机制和 #411 的存在,建议更新 PR body。

🏗️ 架构观察(非 blocking,但值得讨论)

1. TrySendWithFallbackAsync 在两处复制(中等)

SkillRunnerGAgent.csFeishuCardHumanInteractionPort.cs 各自有一份近乎相同的 SendOutcome record + TrySendWithFallbackAsync + SendOutboundAsync。后者注释明确写着 "Mirrors SkillRunnerGAgent.TrySendWithFallbackAsync"。

按 CLAUDE.md "删除优先:空转发、重复抽象、无业务价值代码直接删除",建议抽一个小 helper(吃 Func<LarkReceiveTarget, CancellationToken, Task<string>> send delegate + primary/fallback target),把 retry policy 收口到一处。否则未来调 retry 策略要改两份,且这两份会逐渐 drift(已有不同的日志格式:一份 LogInformation 用 "Skill runner ... primary delivery target",另一份用 "Feishu human interaction port primary delivery target")。

2. Proto fallback 形状:14 个新 flat string vs repeated LarkReceiveTarget(中等)

当前在 7 个 proto message 上各加 2 个 flat string(共 14 个新字段)。如果未来要加第三级 fallback(比如 chat_id → union_id → open_id 三段都持久化),又得在 7 个 message 上各加 2 个字段。

按 CLAUDE.md "核心语义强类型:影响业务语义、控制流、稳定读取且仓库内可控的数据,必须建模为 proto field / typed option / typed sub-message",建议:

```proto
message LarkReceiveTarget {
string receive_id = 1;
string receive_id_type = 2;
}

message UserAgentCatalogEntry {
// ...
repeated LarkReceiveTarget delivery_targets = 22; // priority by index, [0] = primary
// 旧 lark_receive_id / lark_receive_id_type 通过 reserved 标记或读迁移废弃
}
```

这样 "primary + N fallbacks" 是 schema-stable 的,且把"两个字段表达一个 identifier"这个隐式耦合提升成显式 sub-message。

不过这是更大的重构,本 PR 范围内保持现状可以接受(与 #409 已有的 flat 形状一致),但建议作为 follow-up issue 跟踪。

3. Fallback 重试位置的架构归属(架构性)

当前 230002 → fallback 的重试逻辑在两处:

  • SkillRunnerGAgent.SendOutputAsync(actor-side)
  • FeishuCardHumanInteractionPort.SendMessageAsync(port-side)

按 CLAUDE.md "Actor 即业务实体" + "读写分离在 Projection Pipeline 层面实现",identifier ladder 这种"传输层目标解析"语义其实更适合放在 outbound dispatch 边界(一个 ILarkOutboundDispatcher),让 actor/port 只负责"把这段内容送到这个 conversation",identifier 选择 + 重试由 dispatcher 内部处理。这样:

  • 第三级 fallback 上线时 actor/port 不动
  • Lark/Feishu 平台特定的 identifier 知识不泄漏到 actor
  • 重试策略(次数、错误码白名单、退避)统一治理

也是 follow-up 性质,不阻塞这个 PR。

4. LarkProxyResponse.TryGetError 优先级反转(轻微)

旧实现:先检查 error envelope,再检查 top-level code
新实现:先检查 top-level code,再检查 error envelope(且嵌套 body 也解析 Lark code)。

对当前观察到的两种 envelope 形状两种顺序结果一致,但反转本身是个隐式行为变更。如果未来 NyxID 出现 `{"error":true, "status":200, "code":230002, ...}` 这类奇形,新顺序下 code 会先命中(top-level Lark business error 路径),旧顺序则进入 error 路径。建议在 docstring 里把"为什么先 top-level code"那条 invariant 显式写出来。

🔁 #411 是否应该独立 PR(轻微 process)

f9d8fbc6 一个 commit 同时做了:(a) round-1 review feedback for #412、(b) 完整实现 #411 preflight。按 CLAUDE.md "提交信息:祈使句,聚焦单一目的",这两件事如果分开 commit/PR,revert 粒度更细。下次类似情况建议拆,不阻塞当前。

总结

承诺都兑了,review 迭代质量高。主要风险点是 PR 描述与实现脱节——尤其是 230002 fallback retry#411 这两个有独立行为/接口语义的改动没在 title/body 出现。建议合并前刷一次描述,把这两块加到 "Other changes",并把 Files 列表补全。

代码层面 TrySendWithFallbackAsync 的复制和 proto fallback 形状是真问题,但都是可独立 follow-up 的,不阻塞这个回归修复。

@eanzhao eanzhao changed the title fix(channel-runtime): chat_id-first outbound + drop reply-token replay after dispatch failure fix(channel-runtime): chat_id-first outbound + fallback retry + consumed-token PermanentFailure + #411 GitHub preflight Apr 25, 2026
Reviewer (long-form review §4) flagged that PR #412 silently reversed the
branch order in LarkProxyResponse.TryGetError (was: error-envelope first,
then top-level code; now: top-level code first, then error envelope). The
two production shapes are mutually exclusive today so the change is a no-op
on every observed response, but the priority is fixed deliberately for
forward-compat against hypothetical hybrid envelopes like
{"error":true,"status":200,"code":230002,...} where the top-level Lark
business code is the more specific signal.

Add the invariant + rationale to the TryGetError docstring so a future
reader does not "fix" the order back without understanding why.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eanzhao
Copy link
Copy Markdown
Contributor Author

eanzhao commented Apr 25, 2026

Review second pass: one remaining test gap

承诺的主行为基本已经兑现:chat_id primary、230002 -> union_id fallback、嵌套 Nyx/Lark envelope 解析、relay_reply_token_consumed -> PermanentFailure#411 create-time GitHub preflight + orphan key revoke 都能在代码和测试里对上。dotnet build agents/Aevatar.GAgents.ChannelRuntime/Aevatar.GAgents.ChannelRuntime.csproj --nologodotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo 本地通过。

剩下一个建议合并前补上的测试缺口:PR 改了 FeishuCardHumanInteractionPort.SendMessageAsync,让 workflow human-interaction outbound 也执行 230002 bot not in chat -> fallback target retry,但当前 fallback retry 覆盖只在 SkillRunnerGAgentTestsFeishuCardHumanInteractionPortTests 目前只覆盖 primary send success,没有覆盖 catalog-backed target 的 fallback 行为。

这个路径和 SkillRunner 不完全等价:Feishu port 从 IUserAgentCatalogRuntimeQueryPort 读取 UserAgentCatalogDocument -> UserAgentCatalogEntry 投影后的 LarkReceiveIdFallback/LarkReceiveIdTypeFallback,再执行重试。也就是说,如果未来 projector/query mirror 或 Feishu 自己的 retry 分支漂了,现有 413 个测试仍可能全绿。

建议加一条 FeishuCardHumanInteractionPort regression test:catalog entry 带 LarkReceiveId=oc_* / chat_idLarkReceiveIdFallback=on_* / union_id;primary response 使用真实 wrapped shape {"error":true,"status":400,"body":"{\"code\":230002,\"msg\":\"Bot is not in the chat\"}"};fallback response success;断言发了两次 POST,第二次 query 是 receive_id_type=union_id 且 body 的 receive_idon_*

架构上更好的长期形状已经由 #408 / #414 / #415 覆盖:typed OutboundTarget sub-message、共享 retry helper、最终收敛到 ILarkOutboundDispatcher。这些可以不阻塞当前生产修复,但 Feishu fallback 的本地回归测试最好在这个 PR 内补齐。

Reviewer (PR #412 second-pass review) noted that the 230002 → fallback
retry was added to FeishuCardHumanInteractionPort.SendMessageAsync but
catalog-backed coverage existed only in SkillRunnerGAgentTests. Without a
port-side regression, projector / query-mirror drift on the new
LarkReceiveIdFallback / LarkReceiveIdTypeFallback fields could go unnoticed
while production cards stop delivering on cross-app same-tenant DMs.

Add a regression test that:
- Stubs IUserAgentCatalogRuntimeQueryPort with chat_id primary +
  union_id fallback typed pair.
- Returns the real wrapped Nyx envelope shape on the primary POST:
  {"error":true,"status":400,"body":"{\"code\":230002,\"msg\":\"Bot is not in the chat\"}"}.
- Asserts two POSTs, first with receive_id_type=chat_id +
  receive_id=oc_*, second with receive_id_type=union_id + receive_id=on_*,
  and msg_type=interactive on the fallback (so the retry preserves the
  card payload, not just the receive header).

A SequencedRecordingHandler helper mirrors the SkillRunnerGAgentTests
SequencedHandler — different response per request, full request/body
recording for ordered assertions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eanzhao
Copy link
Copy Markdown
Contributor Author

eanzhao commented Apr 25, 2026

补上了 Feishu fallback retry 的回归测试 (3abc33e2):

DeliverSuspensionAsync_ShouldRetryWithFallback_When_PrimaryRejectedAsBotNotInChat_ViaHttp400Envelope — pin 三件事,覆盖 review 里点名的 catalog → projector → query mirror → port retry 全链路:

  1. Catalog entry 暴露 LarkReceiveId=oc_dm_chat_1 / chat_id 主 + LarkReceiveIdFallback=on_user_1 / union_id 备(如果未来 UserAgentCatalogProjector.MaterializeUserAgentCatalogQueryPort.ToEntry 漏掉新字段的镜像,这里会立刻红)
  2. Primary 用真实生产 envelope 形状 {"error":true,"status":400,"body":"{\"code\":230002,\"msg\":\"Bot is not in the chat\"}"},验证 LarkProxyResponse 嵌套解析 + 230002 retry 在 Feishu port 路径上同样 fires
  3. 断言:发了 2 次 POST;第 1 次 query 是 receive_id_type=chat_id、body receive_id=oc_dm_chat_1;第 2 次 query 是 receive_id_type=union_id、body receive_id=on_user_1msg_type=interactive(确认 retry 保留卡片 payload,不是只换 receive 头)

SequencedRecordingHandler helper 仿照 SkillRunnerGAgentTests.SequencedHandler:每次请求按队列返回不同响应,全程记录 request + body 用于顺序断言。

dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo:414/414 通过(原 413 + 新增 1)。

#414 / #415 长期会把这两份 retry 实现合并到 ILarkOutboundDispatcher,那时这两条 port 测试会自然简化为对 dispatcher 的契约测试,但当前 PR 内本地回归覆盖已经齐了。

@eanzhao
Copy link
Copy Markdown
Contributor Author

eanzhao commented Apr 25, 2026

Review of 3abc33e2: the new Feishu port regression test covers the runtime behavior I asked for. It exercises the real wrapped HTTP-400 envelope, verifies two POSTs, and verifies the fallback POST keeps msg_type=interactive while switching to receive_id_type=union_id / receive_id=on_user_1. I ran:

  • dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo — 414/414 pass
  • dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo --filter FeishuCardHumanInteractionPortTests — 11/11 pass
  • git diff --check origin/dev...HEAD — clean

One small correction remains: the PR comment says this test covers the catalog -> projector -> query mirror -> port retry chain, but the test stubs IUserAgentCatalogRuntimeQueryPort directly, so it only covers entry -> port retry. The implementation does mirror fallback fields in UserAgentCatalogProjector and UserAgentCatalogQueryPort, but UserAgentCatalogProjectorTests still only asserts the primary LarkReceiveId/LarkReceiveIdType fields. Please either update that test to assert LarkReceiveIdFallback/LarkReceiveIdTypeFallback through ProjectAsync and ToEntry, or tone down the comment/PR note so it does not claim projector/query coverage.

This is a test/documentation accuracy issue, not a new behavior blocker in the Feishu port fix itself.

…or tests

Reviewer (PR #412 comment 4318615107) caught that the previous Feishu port
fallback regression test stubs IUserAgentCatalogRuntimeQueryPort directly,
so it covers `entry → port retry` but not `projector → query mirror →
entry`. The implementation does mirror LarkReceiveIdFallback and
LarkReceiveIdTypeFallback in both UserAgentCatalogProjector.Materialize
and UserAgentCatalogQueryPort.ToEntry, but UserAgentCatalogProjectorTests
only asserted the primary LarkReceiveId / LarkReceiveIdType fields — so a
silent drop of the fallback mirror would still leave 414/414 green while
production cards stop falling back on cross-app same-tenant DMs.

Extend the two existing typed-round-trip tests:

- ProjectAsync_WithValidCommittedEvent_UpsertsDocument: input state now
  carries the chat_id primary + union_id fallback typed pair; the document
  assertions cover both LarkReceiveIdFallback and LarkReceiveIdTypeFallback
  alongside the existing primary fields.
- ToEntry_ShouldRoundTripTypedLarkReceiveTarget_FromDocumentToEntry: input
  document now carries the fallback pair; the entry assertions cover both
  fallback fields surviving the document → entry conversion that
  FeishuCardHumanInteractionPort and SkillRunnerGAgent depend on.

Both tests' inline rationale now points at PR #412 explicitly so a future
reader knows why the fallback pair is part of the round-trip contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eanzhao
Copy link
Copy Markdown
Contributor Author

eanzhao commented Apr 25, 2026

说得对——上一条 comment 把覆盖范围说大了。3abc33e2SequencedRecordingHandler 测试 stub 了 IUserAgentCatalogRuntimeQueryPort 直返 UserAgentCatalogEntry,所以只覆盖 entry → port retry,并没真的过 UserAgentCatalogProjector.Materialize / UserAgentCatalogQueryPort.ToEntry 那两层。

13521e76UserAgentCatalogProjectorTests 里两条 typed-round-trip 测试扩展成同时盖 fallback pair:

  • ProjectAsync_WithValidCommittedEvent_UpsertsDocument —— 入参 state 改成 chat_id primary + union_id fallback typed pair;新增 document.LarkReceiveIdFallback.Should().Be("on_user_1")document.LarkReceiveIdTypeFallback.Should().Be("union_id")。如果 Materialize 漏掉这两个字段的镜像,这条测试立刻红。
  • ToEntry_ShouldRoundTripTypedLarkReceiveTarget_FromDocumentToEntry —— 入参 document 改成带 fallback pair;新增 entry.LarkReceiveIdFallback / entry.LarkReceiveIdTypeFallback 的断言。UserAgentCatalogQueryPort.ToEntry 漏镜像同样立刻红。

两条测试的 inline 注释里点名 PR #412 + 230002 bot not in chat retry,让未来读 archaeology 的人知道为什么 fallback pair 必须在 round-trip 契约里。

加上 3abc33e2 的 port 端 retry 测试,现在 catalog → projector → query → port 全链路每一段都有专门的回归断言:

测试 防护点
Materialize(actor state → document) ProjectAsync_WithValidCommittedEvent_UpsertsDocument projector 漏镜像 fallback pair
ToEntry(document → entry) ToEntry_ShouldRoundTripTypedLarkReceiveTarget_FromDocumentToEntry query port 漏镜像 fallback pair
Port retry(entry → 230002 → fallback POST) DeliverSuspensionAsync_ShouldRetryWithFallback_When_PrimaryRejectedAsBotNotInChat_ViaHttp400Envelope Feishu port 漏 retry 或漏读 fallback pair
Skill runner retry(同上 actor 侧) SendOutputAsync_ShouldRetryWithFallback_When_PrimaryRejectedAsBotNotInChat_ViaHttp400Envelope SkillRunner 漏 retry 或漏读 fallback pair

dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo:414/414 仍通过(projector 测试是扩展现有断言而不是新增 case,所以总数不变)。

@eanzhao
Copy link
Copy Markdown
Contributor Author

eanzhao commented Apr 25, 2026

Rechecked latest fix (13521e7). The fallback target coverage gap from #412 (comment) is addressed now: UserAgentCatalogProjectorTests asserts both state -> document and document -> UserAgentCatalogEntry carry LarkReceiveIdFallback / LarkReceiveIdTypeFallback, so the catalog-backed Feishu retry path is covered instead of only the port-stub path.

Verified locally:

  • dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo --filter "FullyQualifiedName~UserAgentCatalogProjectorTests|FullyQualifiedName~FeishuCardHumanInteractionPortTests" -> 18/18 passed
  • dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo -> 414/414 passed
  • bash tools/ci/test_stability_guards.sh -> passed
  • bash tools/ci/query_projection_priming_guard.sh -> passed
  • bash tools/ci/projection_state_version_guard.sh -> passed
  • bash tools/ci/projection_state_mirror_current_state_guard.sh -> passed
  • git diff --check origin/dev...HEAD -> clean

No further blocking issues from this pass.

@eanzhao eanzhao merged commit 6131ed7 into dev Apr 25, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant