fix(channel-runtime): chat_id-first outbound + fallback retry + consumed-token PermanentFailure + #411 GitHub preflight by eanzhao · Pull Request #412 · aevatarAI/aevatar

eanzhao · 2026-04-25T07:27:11Z

Summary

This PR landed in three commits because review caught real production-blocking gaps each time. The current scope spans four behavior changes plus one issue fix, all on the SkillRunner / channel-runtime outbound delivery path. Reviewer (4318563419) called out that the prior PR body was severely stale; this rewrite documents what actually shipped.

Behavior changes

1. p2p outbound: chat_id primary + persisted union_id fallback with runtime retry on `230002`

LarkConversationTargets.BuildFromInbound now picks chat_id first for ALL conversation types (was: union_id for p2p, chat_id for groups). In production, union_id was rejected with `99992364 user id cross tenant` because the relay-side ingress app and the s/api-lark-bot outbound app live in different Lark tenants. chat_id is the only Lark identifier that can still work in that configuration, and it does so only when the same Lark app that received the inbound also sends the outbound.

To avoid regressing cross-app same-tenant deployments (where the outbound app is NOT a member of the inbound DM and chat_id fails with 230002 bot not in chat), the new BuildFromInboundWithFallback returns (primary, optional fallback). Fallback is captured ONLY for p2p with a chat_id primary AND a union_id surfaced at ingress; groups skip the fallback (chat_id is tenant-scoped — either the outbound app is in the group or no user-id helps).
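The selection rule described above can be sketched as follows. This is a minimal illustration in Python, not the repository's C# code: the `ReceiveTarget` type, the function signature, and the `"p2p"` conversation-type string are assumptions made for the example, and the chat_id → union_id → open_id priority is taken from the commit message later in this thread.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ReceiveTarget:
    """Hypothetical typed pair: a Lark receive_id plus its receive_id_type."""
    receive_id: str
    receive_id_type: str  # "chat_id", "union_id", or "open_id"


def build_from_inbound_with_fallback(
    conversation_type: str,
    chat_id: Optional[str],
    union_id: Optional[str],
    open_id: Optional[str],
) -> Tuple[Optional[ReceiveTarget], Optional[ReceiveTarget]]:
    """Return (primary, fallback) using the chat_id-first priority."""
    if chat_id:
        primary = ReceiveTarget(chat_id, "chat_id")
    elif union_id:
        primary = ReceiveTarget(union_id, "union_id")
    elif open_id:
        primary = ReceiveTarget(open_id, "open_id")
    else:
        return None, None

    # A fallback is captured only for p2p with a chat_id primary and a
    # union_id surfaced at ingress; groups never get one.
    fallback = None
    if conversation_type == "p2p" and primary.receive_id_type == "chat_id" and union_id:
        fallback = ReceiveTarget(union_id, "union_id")
    return primary, fallback
```

For a DM with both identifiers present this yields a chat_id primary and a union_id fallback; for a group, or for a DM whose primary is already union_id, the fallback is `None`.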

New runtime retry: SkillRunnerGAgent.SendOutputAsync and FeishuCardHumanInteractionPort.SendMessageAsync try the primary, on Lark 230002 (LarkBotErrorCodes.BotNotInChat) retry exactly once with the fallback typed pair. Other Lark codes propagate immediately so users see actionable hints for the actual failure mode.
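A rough sketch of that retry policy, written in Python for brevity. The `send` callable, the `LarkRejected` exception, and the tuple-shaped targets are hypothetical stand-ins; only the `230002` code and the "retry exactly once, propagate everything else" behavior come from the description above.

```python
from typing import Callable, Optional, Tuple

BOT_NOT_IN_CHAT = 230002  # Lark "bot not in chat"

Target = Tuple[str, str]  # (receive_id, receive_id_type)
# Hypothetical transport: returns the Lark error code on rejection, None on success.
SendFn = Callable[[Target], Optional[int]]


class LarkRejected(Exception):
    """Raised when Lark rejects a delivery and no (further) retry applies."""

    def __init__(self, code: int, target: Target):
        super().__init__(f"Lark message delivery rejected (code={code}) via {target[1]}")
        self.code = code
        self.target = target


def send_with_fallback(send: SendFn, primary: Target, fallback: Optional[Target]) -> Target:
    """Try the primary target; on 230002 retry exactly once with the fallback."""
    code = send(primary)
    if code is None:
        return primary
    # Any other Lark code propagates immediately so the caller can surface
    # the hint for the actual failure mode.
    if code != BOT_NOT_IN_CHAT or fallback is None:
        raise LarkRejected(code, primary)
    code = send(fallback)
    if code is None:
        return fallback
    raise LarkRejected(code, fallback)
```

The function returns the target that actually delivered, which makes it easy for a caller to log whether the fallback was used.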

Persistence: 14 new fields across 7 proto messages (SkillRunnerOutboundConfig, UserAgentCatalogEntry, UserAgentCatalogDocument, UserAgentCatalogUpsertCommand, WorkflowAgentState, InitializeWorkflowAgentCommand, WorkflowAgentInitializedEvent), mirrored end-to-end through UserAgentCatalogProjector.Materialize + UserAgentCatalogQueryPort.ToEntry per the typed-field-projection-mirror lesson.

The full Lark identifier failure ladder, in case future debugging needs the table:

Identifier	Same app	Different apps, same tenant	Different apps, different tenants
`open_id` (`ou_*`)	✅	❌ `99992361 open_id cross app`	❌
`union_id` (`on_*`)	✅	✅	❌ `99992364 user id cross tenant`
`chat_id` (`oc_*`) of inbound chat	✅	✅ if outbound app in chat	✅ when same app received the inbound

2. Single-use reply token: `relay_reply_token_consumed` → `PermanentFailure` (NOT transient)

PR #409's interactive cards triggered NyxID channel-relay/reply 502 → aevatar's legacy "degrade to text" replayed the same token → 401 "Reply token already used" → bot looked silent on every subsequent DM.

PR #412 fixed the in-turn replay first; reviewer (r3141663815) caught that this only shifted the 401 cascade from in-turn replay to grain-level replay because ToRelayFailure was routing to TransientFailure. Final fix: distinct error code relay_reply_token_consumed → PermanentFailure, so ConversationGAgent.HandleInboundTurnTransientFailureAsync does NOT queue an InboundTurnRetryScheduledEvent for the consumed-token case. Next inbound carries a fresh token; current turn is a write-off.
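The routing change can be pictured with a small Python sketch. The enum, the set of permanent codes, and the `should_schedule_retry` helper are illustrative assumptions; the thread only confirms that `relay_reply_token_consumed` maps to a permanent failure and `relay_reply_rejected` stays transient.

```python
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"


# Assumption for this sketch: only the consumed-token code is permanent.
PERMANENT_ERROR_CODES = frozenset({"relay_reply_token_consumed"})


def to_relay_failure(error_code: str) -> FailureKind:
    """Classify a relay error code; consumed single-use tokens are never retried."""
    if error_code in PERMANENT_ERROR_CODES:
        return FailureKind.PERMANENT
    return FailureKind.TRANSIENT


def should_schedule_retry(error_code: str) -> bool:
    """Only transient failures may re-run the same inbound turn."""
    return to_relay_failure(error_code) is FailureKind.TRANSIENT
```

With this shape, the grain-level handler never re-queues a turn whose token was already spent, while ordinary relay rejections keep their existing retry behavior.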

3. Cross-tenant `99992364` actionable error message

SkillRunnerGAgent.BuildLarkRejectionMessage and FeishuCardHumanInteractionPort.BuildLarkRejectionMessage now expand the bare 99992364 user id cross tenant Lark error into:

Lark message delivery rejected (code=99992364): user id cross tenant. The outbound Lark app is in a different tenant than the inbound app, so user-id translation is impossible. Delete and recreate the agent (/agents → Delete → /daily) so the new chat_id-preferred outbound path takes effect, or align the NyxID s/api-lark-bot proxy with the channel-bot that received the inbound event.

The string rides SkillRunnerExecutionFailedEvent.Error to /agent-status last_error, so users see the actionable recovery flow without reading source.

4. `LarkProxyResponse.TryGetError` parses the actual `NyxIdApiClient.SendAsync` envelope

Reviewer (r3141700469) caught that the helper only checked top-level code, but NyxIdApiClient.SendAsync (NyxIdApiClient.cs:680) wraps every HTTP non-2xx as {"error": true, "status": <http>, "body": "<raw downstream JSON>"} — Lark's business code (e.g. 99992364, 230002) lives INSIDE the body STRING. The new parser walks that string when the top-level Nyx envelope is present so the larkCode-gated branches (BotNotInChat retry, UserIdCrossTenant hint) actually fire on the production path.

Detail format: nyx_status=400 lark_code=99992364 msg=user id cross tenant so log lines preserve both layers.
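To make the two-layer envelope concrete, here is a hedged Python sketch of a parser with the same priority order (top-level code first, then the Nyx error envelope with its nested body string). The function name, return tuple, and the detail string for the top-level case are assumptions; treating `code == 0` as success is also an assumption about Lark responses. Only the wrapped envelope shape and the nested detail format are taken from the text above.

```python
import json
from typing import Optional, Tuple


def try_get_error(response_text: str) -> Tuple[bool, Optional[int], str]:
    """Return (is_error, lark_code, detail) for a proxy response body."""
    try:
        doc = json.loads(response_text)
    except json.JSONDecodeError:
        return False, None, ""
    if not isinstance(doc, dict):
        return False, None, ""

    # 1) A top-level Lark business code wins when present and non-zero.
    code = doc.get("code")
    if isinstance(code, int) and code != 0:
        return True, code, f"lark_code={code} msg={doc.get('msg', '')}"

    # 2) Otherwise look for the wrapper: {"error": true, "status": N, "body": "<raw JSON>"}.
    if doc.get("error") is True:
        detail = f"nyx_status={doc.get('status')}"
        body = doc.get("body")
        if isinstance(body, str):
            try:
                inner = json.loads(body)  # the Lark code lives inside this string
            except json.JSONDecodeError:
                inner = None
            if isinstance(inner, dict) and isinstance(inner.get("code"), int):
                detail += f" lark_code={inner['code']} msg={inner.get('msg', '')}"
                return True, inner["code"], detail
        return True, None, detail

    return False, None, ""
```

A wrapper whose body is not JSON still counts as an error, but with no Lark code, so code-gated branches such as the fallback retry simply do not fire.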

5. Issue #411: GitHub proxy preflight + orphan API key revoke

A daily_report SkillRunner created with allowed_service_ids=api-github would persist successfully even when NyxID's binding from the new agent API key to the user's GitHub OAuth was missing — every scheduled run hit GitHub 403 and the user saw empty/degraded reports with no signal that recreation was needed.

AgentBuilderTool.CreateDailyReportAgentAsync now calls proxy/s/api-github/rate_limit with the freshly-minted key BEFORE persisting the agent. On HTTP 401/403, returns a structured github_proxy_access_denied error with the recovery hint, AND best-effort revokes the orphan API key (reviewer r3141699756 caught that without revoke, repeated /daily attempts accumulate orphan proxy keys).

The preflight envelope parser uses BOTH status (the SendAsync wrapper field) AND code for forward-compatibility (reviewer r3141699476).
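The create-time flow (preflight, then best-effort revoke, then fail closed) might look roughly like this in Python. The callables `preflight`, `revoke_key`, and `persist_agent`, and the `{"status": "created"}` success value, are hypothetical placeholders; the error name, the 401/403 check on both `status` and `code`, and the ordering come from the description above.

```python
from typing import Callable, Dict, Optional

DENIED_STATUSES = {401, 403}


def denied_status(envelope: Dict) -> Optional[int]:
    """Read `status` (the SendAsync wrapper field) and `code` for forward-compatibility."""
    for field in ("status", "code"):
        value = envelope.get(field)
        if isinstance(value, int) and value in DENIED_STATUSES:
            return value
    return None


def create_daily_report_agent(
    preflight: Callable[[], Dict],
    revoke_key: Callable[[str], None],
    persist_agent: Callable[[], None],
    key_id: str,
) -> Dict:
    """Run the GitHub proxy preflight BEFORE persisting the agent."""
    envelope = preflight()
    status = denied_status(envelope)
    if status is not None:
        try:
            revoke_key(key_id)  # avoid accumulating orphan keys
        except Exception:
            pass  # best-effort: a failed revoke must not mask the preflight error
        return {
            "error": "github_proxy_access_denied",
            "http_status": status,
            "proxy_body": envelope.get("body"),
        }
    persist_agent()
    return {"status": "created"}
```

Swallowing the revoke failure is deliberate in this sketch: the structured error is the user-facing signal, and a leftover key is an ops cleanup concern (real code would log it).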

Files

agents/Aevatar.GAgents.ChannelRuntime/LarkConversationTargets.cs       (chat_id-first + BuildFromInboundWithFallback)
agents/Aevatar.GAgents.ChannelRuntime/LarkProxyResponse.cs             (nested body parsing for HTTP-non-2xx envelope)
agents/Aevatar.GAgents.ChannelRuntime/LarkBotErrorCodes.cs             (+ UserIdCrossTenant 99992364, BotNotInChat 230002)
agents/Aevatar.GAgents.ChannelRuntime/SkillRunnerGAgent.cs             (TrySendWithFallbackAsync + cross_tenant hint)
agents/Aevatar.GAgents.ChannelRuntime/FeishuCardHumanInteractionPort.cs (TrySendWithFallbackAsync + cross_tenant hint)
agents/Aevatar.GAgents.ChannelRuntime/AgentBuilderTool.cs              (PreflightGitHubProxyAsync + BestEffortRevokeApiKeyAsync + delivery target capture)
agents/Aevatar.GAgents.ChannelRuntime/UserAgentCatalogGAgent.cs        (merge fallback fields on upsert)
agents/Aevatar.GAgents.ChannelRuntime/UserAgentCatalogProjector.cs     (mirror fallback to document)
agents/Aevatar.GAgents.ChannelRuntime/UserAgentCatalogQueryPort.cs     (mirror fallback document → entry)
agents/Aevatar.GAgents.ChannelRuntime/WorkflowAgentGAgent.cs           (mirror fallback through state apply + upsert)
agents/Aevatar.GAgents.ChannelRuntime/ChannelConversationTurnRunner.cs (consumed-token PermanentFailure routing + drop post-dispatch text replay)
agents/Aevatar.GAgents.ChannelRuntime/channel_runtime_messages.proto   (14 new fallback fields across 7 messages)

test/Aevatar.GAgents.ChannelRuntime.Tests/LarkConversationTargetsTests.cs       (priority + WithFallback factory pinning)
test/Aevatar.GAgents.ChannelRuntime.Tests/AgentBuilderToolTests.cs              (PinsLarkChatId + GitHub preflight + orphan revoke)
test/Aevatar.GAgents.ChannelRuntime.Tests/ChannelConversationTurnRunnerTests.cs (no-retry-as-text + PermanentFailure mapping)
test/Aevatar.GAgents.ChannelRuntime.Tests/SkillRunnerGAgentTests.cs             (BotNotInChat fallback retry [synthetic + real envelope] + cross_tenant hint [synthetic + real envelope])

Verification

dotnet build agents/Aevatar.GAgents.ChannelRuntime/Aevatar.GAgents.ChannelRuntime.csproj --nologo
dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo

Build: 0 errors.
Tests: 413/413 pass. New coverage adds:
- 3× BuildFromInboundWithFallback priority/factory tests
- 1× RunLlmReplyAsync_ShouldNotRetryAsText (asserts relayHandler.Requests.Empty + ErrorCode=relay_reply_token_consumed + FailureKind=PermanentAdapterError)
- 2× SkillRunner BotNotInChat fallback retry (synthetic top-level + real wrapped HTTP-400 envelope)
- 1× SkillRunner cross_tenant hint (real wrapped HTTP-400 envelope)
- 1× SkillRunner non-230002 codes don't trigger fallback
- 1× AgentBuilderTool.PinsLarkChatId_When_RelayPropagatesIt (integration of chat_id capture)
- 1× AgentBuilderTool.LogsFallbackBreadcrumb_When_LarkUnionIdMissing (LogDebug breadcrumb on legacy fallback)
- 1× AgentBuilderTool.DoesNotLogFallback_When_LarkUnionIdPresent (no noise when not falling back)
- 1× AgentBuilderTool.FailsClosed_When_GithubProxyDeniedForNewKey (preflight + actor-not-initialized + DELETE orphan key)

tools/ci/architecture_guards.sh reports Playground asset drift detected for app.js / app.css — pre-existing on origin/dev, unrelated.

Migration

Same as PR #409: existing agents pinned to LarkReceiveIdType=open_id or union_id won't self-heal because the persisted typed pair is treated as authoritative on the read path. Users see actionable last_error text in /agent-status and recover via /agents → Delete → /daily (two clicks with the card UI from PR #409). New agents created after this PR get the chat_id primary + union_id fallback automatically.

Out of scope (architectural follow-ups, tracked separately)

Reviewer (4318563419) flagged three architecture-quality observations as non-blocking. Each is filed as a separate issue so they don't get lost:

Extract shared Lark outbound retry helper (DRY TrySendWithFallbackAsync between SkillRunner and FeishuCardHumanInteractionPort) #414: TrySendWithFallbackAsync is duplicated in SkillRunnerGAgent + FeishuCardHumanInteractionPort (with already-drifting log strings); extract a shared retry helper.
Lift channel-runtime delivery target to typed OutboundTarget sub-message (oneof per platform) #408: Proto fallback shape uses 14 new flat strings across 7 messages instead of a LarkReceiveTarget sub-message; a third-level fallback would explode further. Typed sub-message refactor is the long-term shape.
Lift Lark identifier-ladder retry to ILarkOutboundDispatcher boundary #415: Identifier-ladder retry logic should ultimately live in an outbound dispatcher boundary (ILarkOutboundDispatcher), not in actor/port code, so platform-specific identifier knowledge does not leak. Subsumes #414 once landed.

A fourth observation from the long-form review §4 (LarkProxyResponse.TryGetError branch-order rationale) is addressed in this PR by fdf66780: the priority-order invariant + forward-compat reasoning is now in the docstring so future readers do not silently revert it.

🤖 Generated with Claude Code

…n replay

Two production issues observed after PR #409 shipped:

## Bug A: `99992364 user id cross tenant` on SkillRunner DM

PR #409 switched p2p outbound to `union_id`, which is tenant-stable but still fails when the relay-side ingress and outbound proxy live in different Lark tenants (this deployment's NyxID `s/api-lark-bot` proxy is bound to a different tenant than the user's own bot that subscribed to events). Even the tenant-stable identifier is rejected: `code:99992364 user id cross tenant`.

Switch the BuildFromInbound priority to `chat_id` first for ALL conversation types (DM and group). chat_id (`oc_*`) is the literal Lark chat where the inbound was received. When the outbound proxy authenticates as the same Lark app (the most common real configuration), sending back via `receive_id_type=chat_id` targets the same chat WITHOUT traversing any user-id translation. Falls back to union_id then open_id (with FellBack=true breadcrumbs) when chat_id is unavailable.

## Bug B: `Reply token already used` after card payload triggers NyxID 502

PR #409 introduced interactive card replies for /agents and /agent-status. Production showed NyxID's `channel-relay/reply` returning 502 for the card payload, after which the legacy "Interactive relay reply rejected; degrading to text" path re-sent the same relay token as plain text and got `401 Reply token already used` from NyxID, because the relay token is single-use and was already consumed by the failed first attempt. The 401 escalated as `relay_reply_rejected`, queued an inbound turn retry, and the bot looked silent on every subsequent DM.

Drop the post-dispatch text fallback in `TrySendInteractiveRelayReplyAsync`. Single-use semantics demand exactly one attempt per inbound; when the dispatcher fails, surface the error to the grain-level retry path instead of replaying the consumed token. The dispatcher's INTERNAL pre-flight fallbacks (no producer / composer rejects unsupported) are preserved because those run before the token is consumed.

## Other changes

* `LarkBotErrorCodes.UserIdCrossTenant = 99992364` plus actionable hint in `SkillRunnerGAgent.BuildLarkRejectionMessage` and `FeishuCardHumanInteractionPort.BuildLarkRejectionMessage`. The hint surfaces in `last_error` shown by `/agent-status` so operators / users can correlate cross-tenant rejections with the recreate-the-agent recovery (`/agents` → Delete → `/daily`) the same way the existing cross-app hint does.

## Tests

* `LarkConversationTargetsTests`: pin the new chat_id-first priority for p2p; pin the union_id and open_id fallbacks both setting `FellBackToPrefixInference=true` so call sites emit Debug breadcrumbs.
* `AgentBuilderToolTests.PinsLarkChatId_When_RelayPropagatesIt` (renamed from `PinsLarkUnionId_*`): integration counterpart asserting the typed delivery target on `InitializeSkillRunnerCommand` lands as `(oc_*, "chat_id")` when the relay surfaces both LarkChatId and LarkUnionId.
* `ChannelConversationTurnRunnerTests.RunLlmReplyAsync_ShouldNotRetryAsText_WhenInteractiveDispatcherFails`: critical regression test that pins the NEW contract: when the dispatcher reports failure, the runner must NOT make a second HTTP call to the relay endpoint. Asserts the relay handler stays empty and the result surfaces `ErrorCode=relay_reply_rejected` with the original detail in `ErrorSummary`.
* `SkillRunnerGAgentTests.ShouldIncludeRecreateHint_When_LarkRejectsAsCrossTenantUserId`: pin the cross_tenant hint contract.

Verification: 403 → 405 ChannelRuntime tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f3f53d968

codecov · 2026-04-25T07:43:41Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.38%. Comparing base (8fc13a2) to head (13521e7).
⚠️ Report is 7 commits behind head on dev.

@@            Coverage Diff             @@
##              dev     #412      +/-   ##
==========================================
- Coverage   70.38%   70.38%   -0.01%     
==========================================
  Files        1175     1175              
  Lines       84452    84452              
  Branches    11124    11124              
==========================================
- Hits        59443    59439       -4     
- Misses      20718    20721       +3     
- Partials     4291     4292       +1

Flag	Coverage Δ
ci	`70.38% <ø> (-0.01%)`	⬇️

see 2 files with indirect coverage changes

…lback + #411 preflight

Three review concerns from PR #412 plus the GitHub-403 issue, all in one PR per the user's request.

## eanzhao (relay-token replay still happens at grain level)

Comment quote: "ToRelayFailure(...) still turns this into a transient relay_reply_rejected ... ConversationGAgent.HandleInboundTurnTransientFailureAsync will then persist InboundTurnRetryScheduledEvent and re-run the same inbound turn with the same relay reply token."

Fix: introduce a distinct `relay_reply_token_consumed` error code that `ToRelayFailure` maps to `PermanentFailure` (vs transient `relay_reply_rejected`), so the grain-level retry queue does not re-run the same inbound turn after the dispatcher already consumed the single-use token. The in-turn replay drop from PR #412 was necessary but not sufficient: without the routing change, the 401 cascade just shifts to grain-level replay.

Pinned by `RunLlmReplyAsync_ShouldNotRetryAsText_WhenInteractiveDispatcherFails`, which now asserts both `ErrorCode=relay_reply_token_consumed` and `FailureKind=PermanentAdapterError` plus the existing relay-handler-empty contract.

## codex-bot P1 (chat_id-first regresses cross-app same-tenant)

Comment quote: "In deployments where the inbound relay bot and outbound proxy use different Lark apps (same-tenant cross-app), the outbound app is typically not a member of the inbound DM chat, so receive_id_type=chat_id fails while union_id was the working identifier in the previous logic."

Fix: capture the cross-app-safe union_id at agent-create time as a SECONDARY delivery target alongside the chat_id primary. New proto fields `lark_receive_id_fallback` / `lark_receive_id_type_fallback` on `SkillRunnerOutboundConfig`, `UserAgentCatalogEntry/Document/UpsertCommand`, `WorkflowAgentState`/`Init`/`InitializedEvent` (mirrored end-to-end through `UserAgentCatalogProjector` + `UserAgentCatalogQueryPort` per the typed-field-projection-mirror lesson).

New `LarkConversationTargets.BuildFromInboundWithFallback` returns `(primary, optional fallback)`. The fallback is captured ONLY for p2p with a chat_id primary AND a union_id surfaced at ingress (groups don't need it, and non-chat_id primaries are already the safest identifier we have).

Runtime fallback retry: `SkillRunnerGAgent.SendOutputAsync` and `FeishuCardHumanInteractionPort.SendMessageAsync` now try the primary, then on Lark `230002 bot not in chat` (`LarkBotErrorCodes.BotNotInChat`) retry exactly once with the fallback typed pair. Other Lark codes (e.g. `99992364 cross_tenant`) propagate immediately so users see the actionable recovery hint for the actual failure mode rather than a misleading retry.

Pinned by `SendOutputAsync_ShouldRetryWithFallback_When_PrimaryRejectedAsBotNotInChat` (asserts request order: primary chat_id then fallback union_id) and `SendOutputAsync_ShouldNotRetry_When_PrimaryRejectedWithDifferentLarkCode` (asserts only 230002 triggers the retry).

## Issue #411 (daily_report GitHub proxy 403 at runtime)

The new agent API key is allowed_service_ids=api-github but might lack a bound GitHub credential, so every scheduled run hits 401/403 from proxy/s/api-github and the user sees an empty / degraded report with no hint that recreation is needed.

Add a preflight in `AgentBuilderTool.CreateDailyReportAgentAsync` that calls `proxy/s/api-github/rate_limit` with the freshly minted key. If NyxID's envelope reports HTTP 401/403, return a structured `github_proxy_access_denied` error from the tool BEFORE persisting the agent, so the user is told to verify GitHub OAuth + API key bindings in NyxID instead of receiving a "scheduled" agent that never produces output.

Pinned by `ExecuteAsync_CreateAgent_DailyReport_FailsClosed_When_GithubProxyDeniedForNewKey`, which asserts the structured error is returned AND the SkillRunner actor never receives `InitializeSkillRunnerCommand` (no half-initialized agent left in the catalog).

## Verification

- 411/411 ChannelRuntime tests pass (was 405 before; +6 covering primary + fallback BuildFromInbound priority, runtime fallback retry, GitHub preflight, consumed-token PermanentFailure mapping, and a no-fallback contract for non-DM and non-chat_id primaries).
- Build: 0 errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

eanzhao · 2026-04-25T08:07:19Z

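Fixed #411 along the way: daily_report creation now runs a GitHub proxy preflight that calls `proxy/s/api-github/rate_limit` with the freshly minted agent API key. If the NyxID envelope reports 401/403, the tool immediately returns a structured `github_proxy_access_denied` error and does not persist the agent.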

{
  "error": "github_proxy_access_denied",
  "http_status": 403,
  "proxy_body": "{\"message\":\"Bad credentials\"}",
  "hint": "The new agent API key was created with `allowed_service_ids=api-github` but cannot reach GitHub via NyxID. Verify the GitHub OAuth provider is connected at NyxID and that the key picks up the binding (NyxID `api-keys/{id}/bindings`). Until this is resolved the daily report will return empty/degraded output every run.",
  "nyx_provider_slug": "api-lark-bot"
}

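The test ExecuteAsync_CreateAgent_DailyReport_FailsClosed_When_GithubProxyDeniedForNewKey pins two things: the structured error is returned, and the SkillRunner actor never receives `InitializeSkillRunnerCommand` (no half-initialized agent is left in the catalog).

Issue #411 raised three more items that this PR does not address:

❌ Surface "all GitHub tool calls failed" as SkillRunner failure with explicit LastError: this already exists. The throw path from PR #403 (fix(channel-runtime): resolve Lark DM receive_id_type and quiet best-effort reaction noise) puts the specific Lark code/detail into `SkillRunnerExecutionFailedEvent.Error`, which lands in `/agent-status` last_error. The preflight covers most cases; the remaining scenario, where things only break at runtime (OAuth expiring midway), still goes through the trigger handler's retry → ExecutionFailed path.
❌ Sanitized diagnostic logging for proxy error bodies: not touched. The preflight already passes the 401/403 body through in the structured error's `proxy_body`; `NyxIdApiClient.SendAsync` is left unchanged to avoid altering the prod logging path.
❌ Tighten malformed `nyxid_proxy` calls: that belongs to the LLM tool-call layer, not the SkillRunner create/run path, and is tracked in a separate issue.

The most critical flow, "agent is guaranteed to fail after creation", is now closed off.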

…voke

Three reviewer concerns from the second pass on PR #412, all production-blocking because they prevent the just-added recovery paths from firing in real deployments.

## r3141700469: LarkProxyResponse misses Lark code nested in HTTP-400 body

Reviewer: "production failures arrive through `NyxIdApiClient.SendAsync` as an HTTP-400 Nyx envelope: `{"error": true, "status": 400, "body": "{\"code\":99992364,...}"}`. `LarkProxyResponse.TryGetError` currently returns true for that shape but leaves `larkCode=null` because it does not parse the nested `body`."

Confirmed by reading `NyxIdApiClient.cs:680`: `SendAsync` wraps every HTTP non-2xx as `{"error": true, "status": <http>, "body": "<raw>"}`. The Lark business code lives INSIDE the `body` STRING. The original parser only checked top-level `code`, so every production HTTP-400 path (the common `230002 bot not in chat`, `99992364 cross_tenant`, etc.) fell through with `larkCode=null`, meaning the new `BotNotInChat` retry branch and the `UserIdCrossTenant` recovery hint NEVER fired in the real wrapped path.

Fix: `LarkProxyResponse.TryGetError` now parses nested `body` JSON when the top-level Nyx error envelope is present. Returns the Lark code with a detail line like `nyx_status=400 lark_code=99992364 msg=user id cross tenant` so the layered context is preserved in log lines and exception messages.

## r3141699476: GitHub preflight uses wrong field name

Reviewer: "this parser does not catch the actual 403 shape produced by our `NyxIdApiClient.SendAsync`. For non-2xx responses `SendAsync` wraps the response as `{"error": true, "status": 403, "body": ...}` … while the new preflight only reads `code`."

Same root cause as r3141700469: the SendAsync wrapper uses `status`, not `code`. Fix: read both `status` (the SendAsync envelope) AND `code` (any top-level Lark code shape) so the preflight catches the actual 403/401 production envelope.

## r3141699756: Orphan agent API key on preflight failure

Reviewer: "the freshly created NyxID API key is left behind … repeated `/daily` attempts that hit the GitHub preflight will accumulate orphan proxy keys."

Fix: best-effort `DeleteApiKeyAsync` immediately before returning the structured error. Failures during revoke are logged at Warning but do NOT propagate. The structured create-time error is the user-facing signal; an orphan key is an ops cleanup concern, not a hard failure that should mask the original preflight diagnosis.

## Tests

- `LarkProxyResponse` tests are exercised via the integration tests below; the parser change has 100% coverage through callers.
- New `SendOutputAsync_ShouldRetryWithFallback_When_PrimaryRejectedAsBotNotInChat_ViaHttp400Envelope`: uses the actual `SendAsync` HTTP-400 envelope shape `{"error":true,"status":400,"body":"{\"code\":230002,...}"}` and asserts the runtime retry runs against the union_id fallback.
- New `SendOutputAsync_ShouldThrowCrossTenantHint_When_LarkCodeNestedInHttp400Body`: same envelope shape but with `99992364`; asserts the cross-tenant recreate-the-agent hint fires (which it didn't in production before this fix).
- Updated `ExecuteAsync_CreateAgent_DailyReport_FailsClosed_When_GithubProxyDeniedForNewKey`: now uses the real `{"error":true,"status":403,"body":...}` envelope shape AND asserts the DELETE on `/api/v1/api-keys/{id}` runs before the structured error is returned (orphan key revocation contract).

Verification: 411 → 413 ChannelRuntime tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

eanzhao · 2026-04-25T08:38:11Z

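Review: commitments delivered + architecture observations

Overall the commitments are well delivered, and the feedback from both review rounds was handled thoroughly. A few points are still worth fixing:

✅ Commitments delivered

Commitment	Implementation
`BuildFromInbound` chat_id-first (all conversation types)	`LarkConversationTargets.cs:96-126` ✅
Fallback chat_id → union_id → open_id (with `FellBack=true` breadcrumbs)	`LarkConversationTargets.cs:106-122` ✅
`TrySendInteractiveRelayReplyAsync` no longer re-sends after a dispatch failure	`ChannelConversationTurnRunner.cs:514-525` ✅
`LarkBotErrorCodes.UserIdCrossTenant = 99992364`	✅
`BuildLarkRejectionMessage` in both ports gains the cross_tenant hint	✅
Single-use token → `PermanentFailure` instead of transient `relay_reply_rejected`	`ToRelayFailure` routes `relay_reply_token_consumed` → `PermanentFailure` ✅

The second-round review comments (including my own r3141699476 / r3141699756 / r3141700469) are also fixed: PreflightGitHubProxyAsync reads both status and code, BestEffortRevokeApiKeyAsync cleans up orphan keys, and LarkProxyResponse parses the nested body.

⚠️ PR description is severely stale (fix recommended)

The PR title/body only says "chat_id-first outbound + drop reply-token replay", but the actual implementation also includes:

A new runtime fallback retry mechanism (a major behavior change, not mentioned in the description at all)
- 230002 bot not in chat triggers a single primary → fallback retry
- lark_receive_id_fallback / lark_receive_id_type_fallback fields added to 7 proto messages
- SkillRunnerGAgent.TrySendWithFallbackAsync + FeishuCardHumanInteractionPort.TrySendWithFallbackAsync
A complete fix for #411 (SkillRunner daily_report GitHub proxy 403s at runtime), mentioned only in passing in a Chinese comment
- PreflightGitHubProxyAsync + BestEffortRevokeApiKeyAsync + full tests
A behavior change in LarkProxyResponse.TryGetError (nested body parsing + reordered priority)
Files missing from the "Files" list: LarkProxyResponse.cs, AgentBuilderTool.cs, UserAgentCatalogGAgent.cs, UserAgentCatalogProjector.cs, UserAgentCatalogQueryPort.cs, WorkflowAgentGAgent.cs, channel_runtime_messages.proto
Inaccurate test counts: 8+ tests were actually added, not the "3 + 1 + 1 + 1" in the description

Future archaeology would not find the new 230002 fallback mechanism or #411, so I recommend updating the PR body.

🏗️ Architecture observations (non-blocking, but worth discussing)

1. `TrySendWithFallbackAsync` is duplicated in two places (medium)

SkillRunnerGAgent.cs and FeishuCardHumanInteractionPort.cs each carry a near-identical SendOutcome record + TrySendWithFallbackAsync + SendOutboundAsync. The comment on the latter explicitly says "Mirrors SkillRunnerGAgent.TrySendWithFallbackAsync".

Per CLAUDE.md "Delete first: pass-through forwarding, duplicate abstractions, and code with no business value are deleted outright", I suggest extracting a small helper (taking a Func<LarkReceiveTarget, CancellationToken, Task<string>> send delegate + primary/fallback targets) so the retry policy lives in one place. Otherwise any future retry-policy change must be made twice, and the two copies will gradually drift (the log formats already differ: one LogInformation uses "Skill runner ... primary delivery target", the other uses "Feishu human interaction port primary delivery target").

2. Proto fallback shape: 14 new flat strings vs `repeated LarkReceiveTarget` (medium)

Currently each of the 7 proto messages gains 2 flat strings (14 new fields in total). If a third-level fallback is ever needed (e.g. persisting all three stages of chat_id → union_id → open_id), another 2 fields would have to be added to each of the 7 messages.

Per CLAUDE.md "Strongly type core semantics: data that affects business semantics or control flow, is read stably, and is controllable within the repo must be modeled as a proto field / typed option / typed sub-message", I suggest: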

```proto
message LarkReceiveTarget {
  string receive_id = 1;
  string receive_id_type = 2;
}

message UserAgentCatalogEntry {
  // ...
  repeated LarkReceiveTarget delivery_targets = 22; // priority by index, [0] = primary
  // Legacy lark_receive_id / lark_receive_id_type are retired via `reserved` or a read-side migration
}
```

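This makes "primary + N fallbacks" schema-stable and promotes the implicit coupling of "two fields expressing one identifier" into an explicit sub-message.

That said, this is a larger refactor. Keeping the current shape within this PR is acceptable (it is consistent with the flat shape already used in #409), but I suggest tracking it as a follow-up issue.

3. Architectural home of the fallback retry (architectural)

The 230002 → fallback retry logic currently lives in two places:

SkillRunnerGAgent.SendOutputAsync (actor-side)
FeishuCardHumanInteractionPort.SendMessageAsync (port-side)

Per CLAUDE.md "An actor is a business entity" + "Read/write separation is implemented at the Projection Pipeline layer", the identifier ladder is really transport-level target resolution and fits better at the outbound dispatch boundary (an ILarkOutboundDispatcher). The actor/port would only be responsible for "deliver this content to this conversation", with identifier selection + retry handled inside the dispatcher. That way:

The actor/port stays untouched when a third-level fallback ships
Lark/Feishu platform-specific identifier knowledge does not leak into the actor
Retry policy (attempt count, error-code allowlist, backoff) is governed in one place

This is also follow-up material and does not block this PR.

4. `LarkProxyResponse.TryGetError` priority inversion (minor)

Old implementation: check the error envelope first, then the top-level code.
New implementation: check the top-level code first, then the error envelope (and also parse the Lark code from the nested body).

For the two envelope shapes observed so far, both orders give the same result, but the inversion itself is an implicit behavior change. If NyxID ever produces an odd shape such as `{"error":true, "status":200, "code":230002, ...}`, the new order hits code first (the top-level Lark business error path), whereas the old order would take the error path. I suggest spelling out the "why top-level code first" invariant in the docstring.

🔁 Should #411 have been a separate PR? (minor, process)

Commit f9d8fbc6 does two things at once: (a) round-1 review feedback for #412, and (b) the full #411 preflight implementation. Per CLAUDE.md "Commit messages: imperative mood, focused on a single purpose", splitting these into separate commits/PRs would give finer revert granularity. I suggest splitting next time; this does not block the current PR.

Summary

All commitments are delivered and the review iterations were high quality. The main risk is the gap between the PR description and the implementation, especially the 230002 fallback retry and #411: two changes with their own behavior/interface semantics that appear nowhere in the title/body. I suggest refreshing the description before merging, adding both to "Other changes" and completing the Files list.

At the code level, the TrySendWithFallbackAsync duplication and the proto fallback shape are real problems, but both can be followed up independently and do not block this regression fix.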

Reviewer (long-form review §4) flagged that PR #412 silently reversed the branch order in LarkProxyResponse.TryGetError (was: error-envelope first, then top-level code; now: top-level code first, then error envelope). The two production shapes are mutually exclusive today so the change is a no-op on every observed response, but the priority is fixed deliberately for forward-compat against hypothetical hybrid envelopes like {"error":true,"status":200,"code":230002,...} where the top-level Lark business code is the more specific signal. Add the invariant + rationale to the TryGetError docstring so a future reader does not "fix" the order back without understanding why. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

eanzhao · 2026-04-25T09:05:02Z

Review second pass: one remaining test gap

The promised main behaviors are largely delivered: chat_id primary, 230002 -> union_id fallback, nested Nyx/Lark envelope parsing, relay_reply_token_consumed -> PermanentFailure, and the #411 create-time GitHub preflight + orphan key revoke all line up between code and tests. dotnet build agents/Aevatar.GAgents.ChannelRuntime/Aevatar.GAgents.ChannelRuntime.csproj --nologo and dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo both pass locally.

One test gap remains that I suggest closing before merge: the PR changed FeishuCardHumanInteractionPort.SendMessageAsync so that workflow human-interaction outbound also performs the 230002 bot not in chat -> fallback target retry, but fallback retry coverage currently exists only in SkillRunnerGAgentTests. FeishuCardHumanInteractionPortTests covers only primary send success and does not cover fallback behavior for a catalog-backed target.

This path is not fully equivalent to SkillRunner: the Feishu port reads LarkReceiveIdFallback/LarkReceiveIdTypeFallback through IUserAgentCatalogRuntimeQueryPort, after the UserAgentCatalogDocument -> UserAgentCatalogEntry projection, and only then retries. In other words, if the projector/query mirror or the Feishu port's own retry branch drifts in the future, the existing 413 tests could still be all green.

Suggested FeishuCardHumanInteractionPort regression test: the catalog entry carries LarkReceiveId=oc_* / chat_id and LarkReceiveIdFallback=on_* / union_id; the primary response uses the real wrapped shape {"error":true,"status":400,"body":"{\"code\":230002,\"msg\":\"Bot is not in the chat\"}"}; the fallback response succeeds; assert that two POSTs were sent, and that the second has query receive_id_type=union_id and a body receive_id of on_*.

Architecturally, the better long-term shape is already covered by #408 / #414 / #415: a typed OutboundTarget sub-message, a shared retry helper, and eventual convergence on ILarkOutboundDispatcher. None of these need to block the current production fix, but the local regression test for the Feishu fallback is best added within this PR.
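The retry rule under discussion (try the primary, retry exactly once with the fallback on 230002, propagate every other code) can be sketched in a few lines of Python. This is an illustration under assumptions, not the C# TrySendWithFallbackAsync from the PR: `Target`, `LarkSendError`, and `send_with_fallback` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Lark "bot not in chat" code (LarkBotErrorCodes.BotNotInChat in the PR).
BOT_NOT_IN_CHAT = 230002


@dataclass(frozen=True)
class Target:
    # Hypothetical typed pair: the receive id plus its receive_id_type.
    receive_id: str
    receive_id_type: str


class LarkSendError(Exception):
    """Hypothetical error carrying the Lark business code of a failed send."""

    def __init__(self, code: int, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code


def send_with_fallback(
    send: Callable[[Target], None],
    primary: Target,
    fallback: Optional[Target],
) -> Target:
    """Send to primary; on 230002 retry once with fallback. Returns the target used."""
    try:
        send(primary)
        return primary
    except LarkSendError as err:
        # Only "bot not in chat" is retried, and only if a fallback was persisted.
        if err.code != BOT_NOT_IN_CHAT or fallback is None:
            raise
    # Single retry: a failure here propagates to the caller unchanged.
    send(fallback)
    return fallback


calls = []


def fake_send(target: Target) -> None:
    """Stand-in transport: rejects chat_id targets as if the bot were not in the chat."""
    calls.append(target)
    if target.receive_id_type == "chat_id":
        raise LarkSendError(BOT_NOT_IN_CHAT, "Bot is not in the chat")


used = send_with_fallback(
    fake_send,
    Target("oc_dm_chat_1", "chat_id"),
    Target("on_user_1", "union_id"),
)
print(used, len(calls))
```

Keeping the retry to a single attempt and a single error code means other failures, such as a cross-tenant rejection, reach the caller with their original code intact.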

Reviewer (PR #412 second-pass review) noted that the 230002 → fallback retry was added to FeishuCardHumanInteractionPort.SendMessageAsync but catalog-backed coverage existed only in SkillRunnerGAgentTests. Without a port-side regression, projector / query-mirror drift on the new LarkReceiveIdFallback / LarkReceiveIdTypeFallback fields could go unnoticed while production cards stop delivering on cross-app same-tenant DMs.

Add a regression test that:
- Stubs IUserAgentCatalogRuntimeQueryPort with chat_id primary + union_id fallback typed pair.
- Returns the real wrapped Nyx envelope shape on the primary POST: {"error":true,"status":400,"body":"{\"code\":230002,\"msg\":\"Bot is not in the chat\"}"}.
- Asserts two POSTs, first with receive_id_type=chat_id + receive_id=oc_*, second with receive_id_type=union_id + receive_id=on_*, and msg_type=interactive on the fallback (so the retry preserves the card payload, not just the receive header).

A SequencedRecordingHandler helper mirrors the SkillRunnerGAgentTests SequencedHandler: a different response per request, with full request/body recording for ordered assertions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

eanzhao · 2026-04-25T09:09:54Z

Added the regression test for the Feishu fallback retry (3abc33e2):

DeliverSuspensionAsync_ShouldRetryWithFallback_When_PrimaryRejectedAsBotNotInChat_ViaHttp400Envelope pins three things, covering the full catalog → projector → query mirror → port retry chain called out in the review:

The catalog entry exposes LarkReceiveId=oc_dm_chat_1 / chat_id as primary plus LarkReceiveIdFallback=on_user_1 / union_id as fallback (if UserAgentCatalogProjector.Materialize or UserAgentCatalogQueryPort.ToEntry ever drops the mirror of the new fields, this goes red immediately)
The primary response uses the real production envelope shape {"error":true,"status":400,"body":"{\"code\":230002,\"msg\":\"Bot is not in the chat\"}"}, verifying that LarkProxyResponse nested parsing + the 230002 retry also fire on the Feishu port path
Assertions: 2 POSTs were sent; the 1st has query receive_id_type=chat_id and body receive_id=oc_dm_chat_1; the 2nd has query receive_id_type=union_id, body receive_id=on_user_1, and msg_type=interactive (confirming the retry preserves the card payload rather than only swapping the receive header)

The SequencedRecordingHandler helper is modeled on SkillRunnerGAgentTests.SequencedHandler: each request gets the next response from a queue, and every request + body is recorded for ordered assertions.
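The idea behind such a helper is easy to show outside the test project. The Python sketch below is an analogue written for illustration, not the C# SequencedRecordingHandler from 3abc33e2; the class shape, the `handle` signature, and the example.invalid URL are all assumptions.

```python
import json
from collections import deque
from dataclasses import dataclass
from typing import Dict, List
from urllib.parse import parse_qs, urlsplit


@dataclass(frozen=True)
class RecordedRequest:
    method: str
    query: Dict[str, str]
    body: dict


class SequencedRecordingHandler:
    """Returns queued responses in order and records every request it sees."""

    def __init__(self, responses: List[str]):
        self._responses = deque(responses)
        self.requests: List[RecordedRequest] = []

    def handle(self, method: str, url: str, body: str) -> str:
        # Record first, so an unexpected extra request is still visible in the log.
        query = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}
        self.requests.append(RecordedRequest(method, query, json.loads(body)))
        if not self._responses:
            raise AssertionError("unexpected extra request")
        return self._responses.popleft()


# First response: the wrapped HTTP-400 envelope; second: a plain success.
rejected = json.dumps({
    "error": True,
    "status": 400,
    "body": json.dumps({"code": 230002, "msg": "Bot is not in the chat"}),
})
handler = SequencedRecordingHandler([rejected, json.dumps({"code": 0})])

# Hypothetical URL; only the query string matters for the assertions.
base = "https://example.invalid/messages"
card = {"msg_type": "interactive", "content": "{}"}
handler.handle("POST", f"{base}?receive_id_type=chat_id",
               json.dumps({**card, "receive_id": "oc_dm_chat_1"}))
handler.handle("POST", f"{base}?receive_id_type=union_id",
               json.dumps({**card, "receive_id": "on_user_1"}))

first, second = handler.requests
print(first.query["receive_id_type"], second.query["receive_id_type"])
```

Because requests are stored in arrival order, a test can assert both the number of POSTs and the exact query and body of each one, which is what the fallback assertions above rely on.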

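dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo: 414/414 pass (previous 413 + 1 new).

Longer term, #414 / #415 will merge these two retry implementations into ILarkOutboundDispatcher, at which point these two port tests will naturally reduce to contract tests against the dispatcher. For now, local regression coverage within this PR is complete.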

eanzhao · 2026-04-25T09:12:09Z

Review of 3abc33e2: the new Feishu port regression test covers the runtime behavior I asked for. It exercises the real wrapped HTTP-400 envelope, verifies two POSTs, and verifies the fallback POST keeps msg_type=interactive while switching to receive_id_type=union_id / receive_id=on_user_1. I ran:

dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo — 414/414 pass
dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo --filter FeishuCardHumanInteractionPortTests — 11/11 pass
git diff --check origin/dev...HEAD — clean

One small correction remains: the PR comment says this test covers the catalog -> projector -> query mirror -> port retry chain, but the test stubs IUserAgentCatalogRuntimeQueryPort directly, so it only covers entry -> port retry. The implementation does mirror fallback fields in UserAgentCatalogProjector and UserAgentCatalogQueryPort, but UserAgentCatalogProjectorTests still only asserts the primary LarkReceiveId/LarkReceiveIdType fields. Please either update that test to assert LarkReceiveIdFallback/LarkReceiveIdTypeFallback through ProjectAsync and ToEntry, or tone down the comment/PR note so it does not claim projector/query coverage.

This is a test/documentation accuracy issue, not a new behavior blocker in the Feishu port fix itself.

…or tests

Reviewer (PR #412 comment 4318615107) caught that the previous Feishu port fallback regression test stubs IUserAgentCatalogRuntimeQueryPort directly, so it covers `entry → port retry` but not `projector → query mirror → entry`. The implementation does mirror LarkReceiveIdFallback and LarkReceiveIdTypeFallback in both UserAgentCatalogProjector.Materialize and UserAgentCatalogQueryPort.ToEntry, but UserAgentCatalogProjectorTests only asserted the primary LarkReceiveId / LarkReceiveIdType fields, so a silent drop of the fallback mirror would still leave 414/414 green while production cards stop falling back on cross-app same-tenant DMs.

Extend the two existing typed-round-trip tests:
- ProjectAsync_WithValidCommittedEvent_UpsertsDocument: input state now carries the chat_id primary + union_id fallback typed pair; the document assertions cover both LarkReceiveIdFallback and LarkReceiveIdTypeFallback alongside the existing primary fields.
- ToEntry_ShouldRoundTripTypedLarkReceiveTarget_FromDocumentToEntry: input document now carries the fallback pair; the entry assertions cover both fallback fields surviving the document → entry conversion that FeishuCardHumanInteractionPort and SkillRunnerGAgent depend on.

Both tests' inline rationale now points at PR #412 explicitly so a future reader knows why the fallback pair is part of the round-trip contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
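To make the round-trip contract concrete, the following Python sketch mirrors four fields across two hops (state → document → entry) and asserts that the fallback pair survives. The dataclasses and the `materialize` / `to_entry` functions are hypothetical stand-ins for the C# projector and query port, with snake_case field names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentState:
    lark_receive_id: str
    lark_receive_id_type: str
    lark_receive_id_fallback: str
    lark_receive_id_type_fallback: str


@dataclass(frozen=True)
class CatalogDocument:
    lark_receive_id: str
    lark_receive_id_type: str
    lark_receive_id_fallback: str
    lark_receive_id_type_fallback: str


@dataclass(frozen=True)
class CatalogEntry:
    lark_receive_id: str
    lark_receive_id_type: str
    lark_receive_id_fallback: str
    lark_receive_id_type_fallback: str


def materialize(state: AgentState) -> CatalogDocument:
    """Stand-in for the projector hop: every typed field is copied explicitly."""
    return CatalogDocument(
        lark_receive_id=state.lark_receive_id,
        lark_receive_id_type=state.lark_receive_id_type,
        lark_receive_id_fallback=state.lark_receive_id_fallback,
        lark_receive_id_type_fallback=state.lark_receive_id_type_fallback,
    )


def to_entry(document: CatalogDocument) -> CatalogEntry:
    """Stand-in for the query-port hop: again an explicit field-by-field mirror."""
    return CatalogEntry(
        lark_receive_id=document.lark_receive_id,
        lark_receive_id_type=document.lark_receive_id_type,
        lark_receive_id_fallback=document.lark_receive_id_fallback,
        lark_receive_id_type_fallback=document.lark_receive_id_type_fallback,
    )


state = AgentState("oc_dm_chat_1", "chat_id", "on_user_1", "union_id")
document = materialize(state)
entry = to_entry(document)

# One assertion per hop, so a dropped mirror points at the hop that lost it.
assert document.lark_receive_id_fallback == "on_user_1"
assert document.lark_receive_id_type_fallback == "union_id"
assert entry.lark_receive_id_fallback == "on_user_1"
assert entry.lark_receive_id_type_fallback == "union_id"
```

Asserting each hop separately is the point of extending both tests: a port-level test that stubs the entry directly cannot detect a field dropped earlier in the chain.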

eanzhao · 2026-04-25T09:16:34Z

You're right: the previous comment overstated the coverage. The SequencedRecordingHandler test in 3abc33e2 stubs IUserAgentCatalogRuntimeQueryPort to return a UserAgentCatalogEntry directly, so it covers only entry → port retry and never actually passes through the UserAgentCatalogProjector.Materialize / UserAgentCatalogQueryPort.ToEntry layers.

13521e76 extends the two typed-round-trip tests in UserAgentCatalogProjectorTests to also cover the fallback pair:

ProjectAsync_WithValidCommittedEvent_UpsertsDocument: the input state now carries the chat_id primary + union_id fallback typed pair; adds document.LarkReceiveIdFallback.Should().Be("on_user_1") and document.LarkReceiveIdTypeFallback.Should().Be("union_id"). If Materialize drops the mirror of these two fields, this test goes red immediately.
ToEntry_ShouldRoundTripTypedLarkReceiveTarget_FromDocumentToEntry: the input document now carries the fallback pair; adds assertions on entry.LarkReceiveIdFallback / entry.LarkReceiveIdTypeFallback. A dropped mirror in UserAgentCatalogQueryPort.ToEntry goes red just as quickly.

The inline comments in both tests name PR #412 and the 230002 bot not in chat retry, so anyone doing archaeology later knows why the fallback pair must be part of the round-trip contract.

Together with the port-side retry test from 3abc33e2, every segment of the catalog → projector → query → port chain now has a dedicated regression assertion:

Segment	Test	Guards against
Materialize (actor state → document)	`ProjectAsync_WithValidCommittedEvent_UpsertsDocument`	projector drops the fallback pair mirror
ToEntry (document → entry)	`ToEntry_ShouldRoundTripTypedLarkReceiveTarget_FromDocumentToEntry`	query port drops the fallback pair mirror
Port retry (entry → 230002 → fallback POST)	`DeliverSuspensionAsync_ShouldRetryWithFallback_When_PrimaryRejectedAsBotNotInChat_ViaHttp400Envelope`	Feishu port drops the retry or fails to read the fallback pair
Skill runner retry (same, actor side)	`SendOutputAsync_ShouldRetryWithFallback_When_PrimaryRejectedAsBotNotInChat_ViaHttp400Envelope`	SkillRunner drops the retry or fails to read the fallback pair

dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo: 414/414 still pass (the projector tests extend existing assertions rather than adding cases, so the total is unchanged).

eanzhao · 2026-04-25T09:20:06Z

Rechecked latest fix (13521e7). The fallback target coverage gap from #412 (comment) is addressed now: UserAgentCatalogProjectorTests asserts both state -> document and document -> UserAgentCatalogEntry carry LarkReceiveIdFallback / LarkReceiveIdTypeFallback, so the catalog-backed Feishu retry path is covered instead of only the port-stub path.

Verified locally:

dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo --filter "FullyQualifiedName~UserAgentCatalogProjectorTests|FullyQualifiedName~FeishuCardHumanInteractionPortTests" -> 18/18 passed
dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo -> 414/414 passed
bash tools/ci/test_stability_guards.sh -> passed
bash tools/ci/query_projection_priming_guard.sh -> passed
bash tools/ci/projection_state_version_guard.sh -> passed
bash tools/ci/projection_state_mirror_current_state_guard.sh -> passed
git diff --check origin/dev...HEAD -> clean

No further blocking issues from this pass.

chatgpt-codex-connector Bot reviewed Apr 25, 2026

Comment thread agents/Aevatar.GAgents.ChannelRuntime/LarkConversationTargets.cs

eanzhao commented Apr 25, 2026

Comment thread agents/Aevatar.GAgents.ChannelRuntime/ChannelConversationTurnRunner.cs

eanzhao commented Apr 25, 2026

Comment thread agents/Aevatar.GAgents.ChannelRuntime/AgentBuilderTool.cs Outdated

eanzhao commented Apr 25, 2026

Comment thread agents/Aevatar.GAgents.ChannelRuntime/AgentBuilderTool.cs

eanzhao commented Apr 25, 2026

Comment thread agents/Aevatar.GAgents.ChannelRuntime/SkillRunnerGAgent.cs

eanzhao changed the title ~~fix(channel-runtime): chat_id-first outbound + drop reply-token replay after dispatch failure~~ fix(channel-runtime): chat_id-first outbound + fallback retry + consumed-token PermanentFailure + #411 GitHub preflight Apr 25, 2026

eanzhao mentioned this pull request Apr 25, 2026

Extract shared Lark outbound retry helper (DRY TrySendWithFallbackAsync between SkillRunner and FeishuCardHumanInteractionPort) #414

Open

4 tasks

eanzhao mentioned this pull request Apr 25, 2026

Lift Lark identifier-ladder retry to ILarkOutboundDispatcher boundary #415

Open

eanzhao merged commit 6131ed7 into dev Apr 25, 2026
12 checks passed

eanzhao mentioned this pull request Apr 25, 2026

fix(agent-builder): use UserService.id for api-key allowed_service_ids (#417) #418

Merged

3 tasks

eanzhao merged 6 commits into dev from fix/2026-04-25_lark-prefer-chat-id-for-dm
