Skip to content

fix(channel-runtime): cross-app safe Lark outbound via tenant-stable union_id + card UX#409

Merged
eanzhao merged 3 commits intodevfrom
fix/2026-04-25_lark-cross-app-union-id
Apr 25, 2026
Merged

fix(channel-runtime): cross-app safe Lark outbound via tenant-stable union_id + card UX#409
eanzhao merged 3 commits intodevfrom
fix/2026-04-25_lark-cross-app-union-id

Conversation

@eanzhao
Copy link
Copy Markdown
Contributor

@eanzhao eanzhao commented Apr 25, 2026

Symptom

After PR #403 shipped, every SkillRunner outbound DM was rejected by Lark with code:99992361 open_id cross app (production log aevatar-console-backend-75fd899df8-s6mp7). The user now sees the Daily Report agent created successfully and the immediate ack, but every scheduled run fails on send. PR #403 fixed the typed receive_id_type plumbing — this PR fixes the actual identifier that gets sent.

NyxID API request failed: POST .../proxy/s/api-lark-bot/.../messages?receive_id_type=open_id -> 400
Skill runner skill-runner-94d754dfdfbb416aa5a676cecd0d7a71 execution failed
System.InvalidOperationException: Lark message delivery rejected:
  {"code":99992361,"msg":"open_id cross app", ... "How to Obtain OpenID"}

Root cause

Lark open_id is app-scoped: each Lark app issues its own ou_* for the same user, and an open_id minted by app A is invalid as a receive_id on app B. In our deployment:

  • Inbound side: Lark Open Platform delivers events to NyxID's relay-side Lark app. The ou_* in the event payload is scoped to that app.
  • Outbound side: SkillRunner sends through s/api-lark-bot proxy — the customer's Lark app, a different app from the relay-side ingress.

PR #403's BuildFromInbound("p2p", ...) pinned (senderId, "open_id") on the typed delivery target. Lark rejects the cross-app open_id every time; the retry loop runs out of attempts and the agent flips to error.

The right cross-app identifier is Lark union_id (on_*): tenant-stable, shared across apps in the same tenant, accepted by Lark as receive_id_type=union_id from any app. Lark events always carry it under event.sender.sender_id.union_id (and event.operator.union_id for card actions). NyxID forwards the original event verbatim under raw_platform_data (NyxID build_callback_payload), so aevatar can recover it without a NyxID change.

Fix scope

1. Surface union_id and chat_id at ingress

  • TransportExtras proto: add nyx_lark_union_id (cross-app stable user identifier) and nyx_lark_chat_id (tenant-scoped group/channel id). Lark-only.
  • NyxIdRelayTransport.ParseInboundAsync: parse raw_platform_data for Lark traffic.
    • im.message.receive_v1: read event.sender.sender_id.union_id and event.message.chat_id.
    • card.action.trigger: read event.operator.union_id and event.context.open_chat_id.
  • ChannelMetadataKeys + ChannelConversationTurnRunner.BuildReplyMetadata: add LarkUnionId / LarkChatId keys and propagate from TransportExtras.
  • AgentBuilderTool.CreateDailyReportAgentAsync + CreateSocialMediaAgentAsync: pass LarkUnionId and LarkChatId to BuildFromInbound.

2. Prefer cross-app safe targets in BuildFromInbound

p2p:
  if union_id present  → (union_id, "union_id", FellBack=false)   [cross-app safe]
  else if sender present → (sender, "open_id", FellBack=true)     [legacy fallback, breadcrumb]
  else                   → (empty, empty, FellBack=true)          [confused inbound]

group / channel / thread:
  if lark_chat_id present → (lark_chat_id, "chat_id", FellBack=false)  [cross-app safe]
  else                    → (conversation_id, "chat_id", FellBack=false) [legacy]

The legacy open_id path keeps the relay-app self-send case working in dev / single-app deployments; the FellBack flag makes the cross-app fallback observable so operators can correlate Lark 99992361 rejections with missing-union_id ingress.

3. /agents and /agent-status as interactive cards (顺手)

User asked to lift the bot's text replies to Lark cards with action buttons:

  • /agents: summary card + one section per agent + a Status button per agent (carries the full agent_id so users do not have to retype the long skill-runner id) + footer shortcuts (Create Daily / Create Social Media / Templates).
  • /agent-status <id>: status card + lifecycle action buttons (Run / Disable / Enable / Delete-with-confirm-pre-set / Back to Agents). Delete is marked danger; Lark renders it red.

Buttons reuse the existing agent_builder_action arguments handled by AgentBuilderCardFlow's card event branch — click routing is unchanged.

Files

agents/Aevatar.GAgents.Channel.Abstractions/protos/chat_activity.proto
agents/channels/Aevatar.GAgents.Channel.NyxIdRelay/NyxIdRelayTransport.cs
agents/Aevatar.GAgents.ChannelRuntime/ChannelMetadataKeys.cs
agents/Aevatar.GAgents.ChannelRuntime/ChannelConversationTurnRunner.cs
agents/Aevatar.GAgents.ChannelRuntime/AgentBuilderTool.cs
agents/Aevatar.GAgents.ChannelRuntime/LarkConversationTargets.cs
agents/Aevatar.GAgents.ChannelRuntime/NyxRelayAgentBuilderFlow.cs   (cards)

test/Aevatar.GAgents.ChannelRuntime.Tests/NyxIdRelayTransportTests.cs
test/Aevatar.GAgents.ChannelRuntime.Tests/LarkConversationTargetsTests.cs
test/Aevatar.GAgents.ChannelRuntime.Tests/AgentBuilderToolTests.cs
test/Aevatar.GAgents.ChannelRuntime.Tests/NyxRelayAgentBuilderFlowTests.cs

4. Migration: existing broken agents won't self-heal — give them a recreate hint

This fix captures union_id at agent-CREATE time, so SkillRunner / WorkflowAgent state persisted before this PR is permanently pinned to LarkReceiveIdType=open_id against a relay-app-scoped ou_*. Their next scheduled run will keep firing 99992361 open_id cross app. There is no automatic self-heal — scheduled runs do not have a fresh ingress event to mine union_id from.

To make the failure self-documenting, SkillRunnerGAgent.SendOutputAsync and FeishuCardHumanInteractionPort.SendMessageAsync now detect Lark code=99992361 (LarkBotErrorCodes.OpenIdCrossApp) and expand the thrown exception into:

Lark message delivery rejected (code=99992361): open_id cross app. This agent was created before cross-app union_id ingress existed; delete and recreate it (/agents → Delete → /daily) to pick up the cross-app safe target.

The Error field on SkillRunnerExecutionFailedEvent carries this verbatim, so /agent-status's last_error is now actionable. Combined with the new /agents / /agent-status card UI in this PR, the recovery flow is two clicks (Delete row + new /daily) — no operator-side migration tool needed.

Other migration options considered and rejected:

  • ❌ State-side union_id backfill: SkillRunner has no fresh ingress at scheduled-run time; would need a separate inbound-event hook to repair.
  • ❌ Catalog-wide repair: same problem, plus requires a customer-app users:read scope or batch_get_id translation we don't have.
  • ❌ Auto-disable agents on first 99992361: would mask transient errors and surprise users who already recreated.

Verification

dotnet build agents/Aevatar.GAgents.ChannelRuntime/Aevatar.GAgents.ChannelRuntime.csproj --nologo
dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo
bash tools/ci/test_stability_guards.sh
  • Build: 0 errors.
  • Tests: 402/402 passed (covers ingress parser, BuildFromInbound union_id-preferred / open_id fallback, AgentBuilderTool integration, fallback breadcrumb LogDebug, card output structure, Lark 99992361 recreate hint).
  • Stability guard: passed.

tools/ci/architecture_guards.sh reports Playground asset drift detected for app.js / app.css — verified pre-existing on origin/dev (PR #407), unrelated to this change.

Out of scope

  • The cross-app diagnosis applies to other Lark outbound paths too. FeishuCardHumanInteractionPort and the immediate-ack reaction in ChannelConversationTurnRunner should ALSO prefer union_id; left for a follow-up PR since the SkillRunner DM path is the production blocker today.
  • The /run-agent, /disable-agent, /enable-agent, /delete-agent text-command replies still return plain text (their card-flow counterparts in AgentBuilderCardFlow already render cards). Symmetric card output for those is a follow-up.
  • lark_* fields living on platform-agnostic protos (SkillRunnerOutboundConfig, UserAgentCatalogEntry, WorkflowAgentState, TransportExtras) is tracked under #408 — refactor to typed OutboundTarget oneof when a second platform's delivery code lands.

🤖 Generated with Claude Code

SkillRunner outbound DMs were rejected by Lark with `code:99992361 open_id
cross app` after PR #403 made the typed delivery target prefer the inbound
sender open_id (`ou_*`) for p2p. The relay-side ingress and the customer's
`api-lark-bot` are different Lark apps, and Lark `open_id` is app-scoped —
the open_id minted by the relay app is invalid as a `receive_id_type=open_id`
target on the customer app, producing `99992361` on every scheduled run.

Cross-app outbound delivery requires a tenant-stable identifier. Lark's
`union_id` (`on_*`) is the right choice: it is shared across apps in the
same tenant, so an outbound app sees the SAME `on_*` for the user as the
relay-side ingress. The Lark event payload always carries it under
`event.sender.sender_id.union_id` (and `event.operator.union_id` for card
actions); NyxID forwards the original event verbatim under
`raw_platform_data`, so aevatar can recover it without a NyxID change.

Changes
-------

* `TransportExtras` proto: add `nyx_lark_union_id` (cross-app stable user
  identifier) and `nyx_lark_chat_id` (tenant-scoped group/channel id).
  Both populated only for Lark traffic.

* `NyxIdRelayTransport`: parse `raw_platform_data` for Lark inbounds. For
  `im.message.receive_v1` the helper reads
  `event.sender.sender_id.union_id` and `event.message.chat_id`; for
  `card.action.trigger` it reads `event.operator.union_id` and
  `event.context.open_chat_id`. Other platforms leave the fields empty.

* `ChannelMetadataKeys` + `ChannelConversationTurnRunner`: add
  `LarkUnionId` / `LarkChatId` keys and propagate from `TransportExtras`
  through the AgentBuilder metadata so the tool layer sees them on every
  inbound.

* `AgentBuilderTool` (daily_report and social_media flows): pass
  `LarkUnionId` and `LarkChatId` to `LarkConversationTargets.BuildFromInbound`
  so the typed delivery target captured at agent creation pins the
  cross-app safe pair.

* `LarkConversationTargets.BuildFromInbound`: prefer `union_id`
  (`receive_id_type=union_id`) for p2p when the relay surfaces it; prefer
  the inbound Lark `chat_id` (`receive_id_type=chat_id`) for groups.
  Fallback to the prior open_id / conversation_id paths is preserved but
  flagged with `FellBackToPrefixInference=true` so call sites emit a Debug
  breadcrumb and operators can correlate Lark rejections with missing
  ingress fields.

Card UX for /agents and /agent-status
-------------------------------------

While in the same area, lift `/agents` and `/agent-status` text replies
to interactive Lark cards with action buttons. Each per-agent row in
`/agents` exposes a Status button (carries the full `agent_id` so users do
not have to retype the long skill-runner id). The agent-status card has
Run / Disable / Enable / Delete (Delete marked danger and pre-confirmed)
plus a Back-to-Agents button. Buttons reuse the existing
`agent_builder_action` arguments handled by `AgentBuilderCardFlow`'s card
event branch, so click routing is unchanged.

Tests
-----

* `NyxIdRelayTransportTests`: pin union_id / chat_id surfacing for
  message-receive and card-action events; pin empty contracts for
  non-Lark platforms and missing `raw_platform_data`.
* `LarkConversationTargetsTests`: pin union_id-preferred p2p, open_id
  fallback with FellBack=true, group chat_id-preferred path.
* `AgentBuilderToolTests`: integration test asserting the typed
  `LarkReceiveId` / `LarkReceiveIdType` become `(on_*, "union_id")` when
  `ChannelMetadataKeys.LarkUnionId` is present in the request context.
* `NyxRelayAgentBuilderFlowTests`: pin the new card output for /agents
  (summary + per-agent Status buttons + footer shortcuts) and
  /agent-status (lifecycle action buttons + danger-flagged Delete).

Verification: 393/393 ChannelRuntime tests pass;
`tools/ci/test_stability_guards.sh` passes; remaining
`tools/ci/architecture_guards.sh` failures are pre-existing playground
asset drift on origin/dev, unrelated to this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cb8a2f65d7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// path so users have one consistent UX whether they typed `/agents` or arrived via card.
content.Actions.Add(BuildButton("Create Daily Report", "open_daily_report_form", isPrimary: false));
content.Actions.Add(BuildButton("Create Social Media", "open_social_media_form", isPrimary: false));
content.Actions.Add(BuildButton("Templates", "list_templates", isPrimary: false));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle templates action from /agents card

The new /agents card adds a Templates button that emits agent_builder_action=list_templates, but AgentBuilderCardFlow.TryResolve(...) has no ListTemplatesAction case in its card-action switch, so clicking this button will not execute the tool call and falls through as an unhandled card action. This makes a visible CTA in the new card UI non-functional for users who rely on button navigation.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in e661d5b7. Added case ListTemplatesAction: to AgentBuilderCardFlow.TryResolveAsync's card-action switch, dispatching AgentBuilderFlowDecision.ToolCall(ListTemplatesAction, """{"action":"list_templates"}""") (mirrors the existing text-flow path). New unit test AgentBuilderCardFlowTests.TryResolveAsync_TemplatesCardButton_DispatchesListTemplatesTool pins the contract so a refactor cannot silently drop the routing.

Copy link
Copy Markdown
Contributor Author

@eanzhao eanzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我对照生产日志和这版 diff 看了一遍,结论是:PR 对“SkillRunner 跑完后结果推不到 Lark”的主因定位基本准确。日志里 skill-runner-94d754dfdfbb416aa5a676cecd0d7a71 已经完成初始化、触发执行、进入 LLM/tool 调用,最后是在 POST .../api-lark-bot/open-apis/im/v1/messages?receive_id_type=open_id 被 Lark 返回 99992361 open_id cross app 拒掉;失败通知也走同一 target 再次失败。所以这不是 runner 完全没启动,而是 outbound DM target 用了 relay app scoped open_id

这版把 ingress 里的 Lark union_id/chat_id 提到 TransportExtras,并在创建 daily/social agent 时优先持久化 (on_*, union_id),方向是对的,应该能解决新建 p2p SkillRunner 的跨 app 发送问题。我也在临时 worktree 跑了 dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo,398/398 通过;bash tools/ci/test_stability_guards.sh 也通过。

我只留了一个需要补齐的点:fallback observability 现在没有真正落到日志里。除此之外,日志还暴露了 api-github 403 和一次 nyxid_proxy Missing path,这属于 runner 执行质量/权限问题,不是本 PR 的 Lark 推送修复范围,建议单独追踪。issue 搜索上,99992361/open_id cross app 目前基本只有这个 PR 命中;#403 是上一层 receive_id_type 修复,#350 是已关闭的 interactive reply 架构问题,#408 追踪 typed OutboundTarget 架构债。

@@ -239,7 +239,9 @@ private async Task<string> CreateDailyReportAgentAsync(
var deliveryTarget = LarkConversationTargets.BuildFromInbound(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里拿到的 deliveryTarget.FellBackToPrefixInference 在持久化前就丢掉了。BuildFromInbound() 在缺少 LarkUnionId 时会返回 (ou_*, open_id, FellBack=true),但 InitializeSkillRunnerCommand.OutboundConfig 只保存 receive id/type;后面 SkillRunnerGAgent.SendOutputAsync -> LarkConversationTargets.Resolve() 看到 typed fields 都非空,会返回 FellBack=false,所以 PR 描述里的“fallback breadcrumb / correlate missing-union_id ingress”实际不会出现。建议在 daily/social 创建这里发现 FellBackToPrefixInference 时直接 LogDebug/LogWarning,或者把 fallback 原因显式持久化;否则生产上 union_id 没被 Nyx relay 带出来时,仍会静默退回 open_id 并继续打出同样的 99992361

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

说得对。FellBackToPrefixInferenceOutboundConfig 里被丢了,下游 Resolve() 看到 typed 字段都非空就返回 FellBack=false,PR 描述里的“fallback breadcrumb”确实不会触发。e661d5b7 把 BuildFromInbound 调用抽成 AgentBuilderTool.ResolveDeliveryTarget(...),在 fallback 分支立即 LogDebug 一次(带 agent_id / has-union_id / has-lark-chat_id / has-sender / resolved-type / 99992361 关联提示),不持久化 fallback 状态——观测点放在唯一拥有完整信息的位置(agent-create time),下游 SkillRunner 路径继续把 typed pair 当 authoritative。

新增两个测试 pin 行为:

  • LogsFallbackBreadcrumb_When_LarkUnionIdMissing:缺 union_id 时 fires 一次 LogDebug,包含上述 correlation 字段
  • DoesNotLogFallback_When_LarkUnionIdPresent:union_id 在场时绝不 log(避免正常 ingress 流量把信号淹没)

Copy link
Copy Markdown
Contributor Author

@eanzhao eanzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

补一个生产落地提醒:这版修的是“新建 agent 时捕获并持久化 union_id”。日志里的那个既有 skill-runner-94d754dfdfbb416aa5a676cecd0d7a71 如果状态里已经存了 LarkReceiveId=ou_* / LarkReceiveIdType=open_id,部署后 SendOutputAsync 仍会把 typed fields 原样拿去发,不会自动恢复成 union_id

所以如果目标是让现有失败的 SkillRunner 也立刻恢复,需要明确一个运维/迁移动作:例如让用户重建或重新初始化 agent、加一次 catalog/state repair、或者在状态里补写 union_id。否则 PR 合进去后,新 agent 会好,旧 agent 可能继续打 99992361 open_id cross app

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.37%. Comparing base (55ccbaf) to head (bf86c9b).
⚠️ Report is 4 commits behind head on dev.

@@            Coverage Diff             @@
##              dev     #409      +/-   ##
==========================================
- Coverage   70.39%   70.37%   -0.02%     
==========================================
  Files        1175     1175              
  Lines       84452    84452              
  Branches    11124    11124              
==========================================
- Hits        59448    59436      -12     
- Misses      20714    20723       +9     
- Partials     4290     4293       +3     
Flag Coverage Δ
ci 70.37% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.
see 4 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…eadcrumb

Two inline review comments after the initial PR #409 push.

* Templates card button no-op (codex-bot, P2): the new /agents card surfaces
  a `Templates` button emitting `agent_builder_action=list_templates`, but
  AgentBuilderCardFlow.TryResolveAsync had no `ListTemplatesAction` case in
  its card-action switch, so the click fell through unhandled and the
  button looked dead. Added the case to dispatch a `list_templates` tool
  call (mirroring the existing text-flow path).

* Fallback breadcrumb invisible at runtime (eanzhao,
  pulls/409#discussion_r3141562097): BuildFromInbound returns
  (ou_*, open_id, FellBack=true) when the relay omits LarkUnionId, but the
  flag was dropped before persistence — `OutboundConfig` only stores the
  receive id/type, so the downstream `LarkConversationTargets.Resolve` at
  SkillRunner send time sees populated typed fields and reports
  FellBack=false. The "fallback breadcrumb / correlate missing-union_id
  ingress" promised in the PR description never fired. Extracted
  `AgentBuilderTool.ResolveDeliveryTarget(...)` to the
  `IAgentTool.ExecuteAsync` flow so it logs once at agent-create time
  whenever the typed pair was inferred from the legacy fallback. The log
  carries agent_id, has-union_id / has-lark-chat_id / has-sender flags,
  and the resolved receive_id_type so operators can correlate Lark
  `99992361` rejections with missing-union_id ingress.

Tests:

* AgentBuilderCardFlowTests: card_action with
  `agent_builder_action=list_templates` dispatches a tool call instead of
  no-op'ing.
* AgentBuilderToolTests:
  - LogsFallbackBreadcrumb: when LarkUnionId / LarkChatId are absent, a
    single LogDebug fires with the fallback context and the `99992361`
    correlation hint.
  - DoesNotLogFallback: when LarkUnionId is present, no breadcrumb fires
    (otherwise the signal would drown in normal traffic).

Verification: 398 → 401 ChannelRuntime tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor Author

@eanzhao eanzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

复看最新提交 e661d5b,这轮 fix 把两个点都补上了:

  • /agents 卡片里的 Templates 按钮现在会走 agent_builder_action=list_templates,不再 no-op,并且有 AgentBuilderCardFlowTests 覆盖。
  • fallback breadcrumb 已经前移到 AgentBuilderTool.ResolveDeliveryTarget(...) 的 agent-create 时刻,能在 LarkUnionId 缺失时记录 agent_id / hasUnionId / hasLarkChatId / hasSenderId / resolvedReceiveIdType,避开 typed fields 持久化后无法再看出 fallback 的问题。

我在最新 PR head 上验证了:

dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo
bash tools/ci/test_stability_guards.sh

结果:401/401 通过,稳定性门禁通过。没有新的代码级 blocker。之前那条“现有已创建 SkillRunner 仍需重建/状态修复/运维处理”的提醒仍然成立,但这是 rollout 事项,不影响这轮代码修复合入。

…s app`

Reviewer concern (PR #409 pulls/409#review-4175198266): the cross-app fix
captures union_id at agent-CREATE time, so existing SkillRunner /
WorkflowAgent state pinned to `LarkReceiveIdType=open_id` will keep firing
`99992361 open_id cross app` after this PR ships — the persisted typed
pair is treated as authoritative and there is no self-heal path.

The right operational answer is "delete the broken agent and recreate it",
which is now one click via the new `/agents` card → Delete row button →
`/daily` (or `/social-media`). But the bare Lark error string in
`/agent-status`'s `last_error` field gives no signal to operators or end
users that recreation is the fix.

Surface that signal at the ONLY layer that has the structured Lark code:
when `LarkProxyResponse.TryGetError` returns code 99992361, both
`SkillRunnerGAgent.SendOutputAsync` and
`FeishuCardHumanInteractionPort.SendMessageAsync` now expand the
`InvalidOperationException` message into:

  "Lark message delivery rejected (code=99992361): open_id cross app.
   This agent was created before cross-app union_id ingress existed;
   delete and recreate it (`/agents` → Delete → `/daily`) to pick up
   the cross-app safe target."

The `Error` field on `SkillRunnerExecutionFailedEvent` carries this
message verbatim, so `/agent-status`'s `last_error` becomes
self-documenting. Other Lark codes keep the existing concise format.

Tests: SkillRunnerGAgentTests.SendOutputAsync_ShouldIncludeRecreateHint_
When_LarkRejectsAsCrossAppOpenId pins the hint contract.

Verification: 401 → 402 ChannelRuntime tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eanzhao
Copy link
Copy Markdown
Contributor Author

eanzhao commented Apr 25, 2026

这版把 ingress 里的 Lark union_id/chat_id 提到 TransportExtras,并在创建 daily/social agent 时优先持久化 (on_*, union_id),方向是对的,应该能解决新建 p2p SkillRunner

(回复 PR review 4175196647) 是的,新建路径已经覆盖:AgentBuilderTool.ResolveDeliveryTarget(...) 同时跑 daily_report 和 social_media 两条 create flow,把 LarkUnionId / LarkChatId 从请求 metadata 喂进 BuildFromInbound,p2p 命中 (on_*, union_id),群命中 (oc_*, chat_id)AgentBuilderToolTests.PinsLarkUnionId_When_RelayPropagatesIt pin 了 daily 路径,新加的 LogsFallbackBreadcrumb / DoesNotLogFallback 覆盖 union_id 缺失时的 LogDebug 行为。

所以如果目标是让现有失败的 SkillRunner 也立刻恢复,需要明确一个运维/迁移动作:例如让用户重建或重新初始化 agent、加一次 catalog/state repair、或者在状态里补写 union_id。

(回复 PR review 4175198266) 同意——这个 PR 修不了已经持久化 LarkReceiveIdType=open_id 的旧 agent。具体处理方案:

采取的:让旧 agent 失败时给出"recreate"明确指引

bf86c9b1SkillRunnerGAgent.SendOutputAsyncFeishuCardHumanInteractionPort.SendMessageAsync 里识别 Lark code=99992361(新增 LarkBotErrorCodes.OpenIdCrossApp),把 InvalidOperationException 消息扩成:

Lark message delivery rejected (code=99992361): open_id cross app. This agent was created before cross-app union_id ingress existed; delete and recreate it (`/agents` → Delete → `/daily`) to pick up the cross-app safe target.

SkillRunnerExecutionFailedEvent.Error 原样落地,/agent-statuslast_error 里直接就是这个 actionable 提示。配合 PR #409/agents / /agent-status 改成卡片 + 按钮,整个恢复流程就是:用户看到 last_error 里的提示 → 点 /agents 卡片里那行的 Delete → confirm → 重新发 /daily。无需运维介入。新测试 SkillRunnerGAgentTests.SendOutputAsync_ShouldIncludeRecreateHint_When_LarkRejectsAsCrossAppOpenId pin 了消息内容。

没采取的,及理由:

  • state 里补写 union_id:SkillRunner 调度运行时没有 fresh ingress event,拿不到 union_id;要么记一条新指令让下次 ingress 触发回填,要么给 UserAgentCatalog 加一个 union_id resolver。两种都比 recreate 复杂得多,对单租户已知用户的场景也是过度工程。
  • catalog 一次性 repair 工具:同上,没有 union_id 来源。除非通过 /contact/v3/users 反查(需要 customer 应用的 user-read scope,且要注入服务账号 token),代价 >> 收益。
  • batch revoke 旧 SkillRunner:会误伤已经手动 recreate 过的 agent;Delete 决策应该留给用户。

合并后的预期落地序列:

  1. PR 合 dev → 自动部署。
  2. 新发的 /daily 从今天起都是 union_id-pinned,跨 app 安全。
  3. 旧的 broken agent (skill-runner-94d754dfdfbb...) 下一次 scheduled run 仍会失败,但 /agent-statuslast_error 现在直接告诉用户 "delete and recreate"。
  4. 用户照做即可。

@eanzhao eanzhao merged commit 8fc13a2 into dev Apr 25, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant