Conversation
SkillRunner outbound DMs were rejected by Lark with `code:99992361 open_id cross app` after PR #403 made the typed delivery target prefer the inbound sender open_id (`ou_*`) for p2p. The relay-side ingress and the customer's `api-lark-bot` are different Lark apps, and Lark `open_id` is app-scoped — the open_id minted by the relay app is invalid as a `receive_id_type=open_id` target on the customer app, producing `99992361` on every scheduled run. Cross-app outbound delivery requires a tenant-stable identifier. Lark's `union_id` (`on_*`) is the right choice: it is shared across apps in the same tenant, so an outbound app sees the SAME `on_*` for the user as the relay-side ingress. The Lark event payload always carries it under `event.sender.sender_id.union_id` (and `event.operator.union_id` for card actions); NyxID forwards the original event verbatim under `raw_platform_data`, so aevatar can recover it without a NyxID change. Changes ------- * `TransportExtras` proto: add `nyx_lark_union_id` (cross-app stable user identifier) and `nyx_lark_chat_id` (tenant-scoped group/channel id). Both populated only for Lark traffic. * `NyxIdRelayTransport`: parse `raw_platform_data` for Lark inbounds. For `im.message.receive_v1` the helper reads `event.sender.sender_id.union_id` and `event.message.chat_id`; for `card.action.trigger` it reads `event.operator.union_id` and `event.context.open_chat_id`. Other platforms leave the fields empty. * `ChannelMetadataKeys` + `ChannelConversationTurnRunner`: add `LarkUnionId` / `LarkChatId` keys and propagate from `TransportExtras` through the AgentBuilder metadata so the tool layer sees them on every inbound. * `AgentBuilderTool` (daily_report and social_media flows): pass `LarkUnionId` and `LarkChatId` to `LarkConversationTargets.BuildFromInbound` so the typed delivery target captured at agent creation pins the cross-app safe pair. * `LarkConversationTargets.BuildFromInbound`: prefer `union_id` (`receive_id_type=union_id`) for p2p when the relay surfaces it; prefer the inbound Lark `chat_id` (`receive_id_type=chat_id`) for groups. Fallback to the prior open_id / conversation_id paths is preserved but flagged with `FellBackToPrefixInference=true` so call sites emit a Debug breadcrumb and operators can correlate Lark rejections with missing ingress fields. Card UX for /agents and /agent-status ------------------------------------- While in the same area, lift `/agents` and `/agent-status` text replies to interactive Lark cards with action buttons. Each per-agent row in `/agents` exposes a Status button (carries the full `agent_id` so users do not have to retype the long skill-runner id). The agent-status card has Run / Disable / Enable / Delete (Delete marked danger and pre-confirmed) plus a Back-to-Agents button. Buttons reuse the existing `agent_builder_action` arguments handled by `AgentBuilderCardFlow`'s card event branch, so click routing is unchanged. Tests ----- * `NyxIdRelayTransportTests`: pin union_id / chat_id surfacing for message-receive and card-action events; pin empty contracts for non-Lark platforms and missing `raw_platform_data`. * `LarkConversationTargetsTests`: pin union_id-preferred p2p, open_id fallback with FellBack=true, group chat_id-preferred path. * `AgentBuilderToolTests`: integration test asserting the typed `LarkReceiveId` / `LarkReceiveIdType` become `(on_*, "union_id")` when `ChannelMetadataKeys.LarkUnionId` is present in the request context. * `NyxRelayAgentBuilderFlowTests`: pin the new card output for /agents (summary + per-agent Status buttons + footer shortcuts) and /agent-status (lifecycle action buttons + danger-flagged Delete). Verification: 393/393 ChannelRuntime tests pass; `tools/ci/test_stability_guards.sh` passes; remaining `tools/ci/architecture_guards.sh` failures are pre-existing playground asset drift on origin/dev, unrelated to this change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cb8a2f65d7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // path so users have one consistent UX whether they typed `/agents` or arrived via card. | ||
| content.Actions.Add(BuildButton("Create Daily Report", "open_daily_report_form", isPrimary: false)); | ||
| content.Actions.Add(BuildButton("Create Social Media", "open_social_media_form", isPrimary: false)); | ||
| content.Actions.Add(BuildButton("Templates", "list_templates", isPrimary: false)); |
There was a problem hiding this comment.
Handle templates action from /agents card
The new /agents card adds a Templates button that emits agent_builder_action=list_templates, but AgentBuilderCardFlow.TryResolve(...) has no ListTemplatesAction case in its card-action switch, so clicking this button will not execute the tool call and falls through as an unhandled card action. This makes a visible CTA in the new card UI non-functional for users who rely on button navigation.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — fixed in e661d5b7. Added case ListTemplatesAction: to AgentBuilderCardFlow.TryResolveAsync's card-action switch, dispatching AgentBuilderFlowDecision.ToolCall(ListTemplatesAction, """{"action":"list_templates"}""") (mirrors the existing text-flow path). New unit test AgentBuilderCardFlowTests.TryResolveAsync_TemplatesCardButton_DispatchesListTemplatesTool pins the contract so a refactor cannot silently drop the routing.
eanzhao
left a comment
There was a problem hiding this comment.
我对照生产日志和这版 diff 看了一遍,结论是:PR 对“SkillRunner 跑完后结果推不到 Lark”的主因定位基本准确。日志里 skill-runner-94d754dfdfbb416aa5a676cecd0d7a71 已经完成初始化、触发执行、进入 LLM/tool 调用,最后是在 POST .../api-lark-bot/open-apis/im/v1/messages?receive_id_type=open_id 被 Lark 返回 99992361 open_id cross app 拒掉;失败通知也走同一 target 再次失败。所以这不是 runner 完全没启动,而是 outbound DM target 用了 relay app scoped open_id。
这版把 ingress 里的 Lark union_id/chat_id 提到 TransportExtras,并在创建 daily/social agent 时优先持久化 (on_*, union_id),方向是对的,应该能解决新建 p2p SkillRunner 的跨 app 发送问题。我也在临时 worktree 跑了 dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo,398/398 通过;bash tools/ci/test_stability_guards.sh 也通过。
我只留了一个需要补齐的点:fallback observability 现在没有真正落到日志里。除此之外,日志还暴露了 api-github 403 和一次 nyxid_proxy Missing path,这属于 runner 执行质量/权限问题,不是本 PR 的 Lark 推送修复范围,建议单独追踪。issue 搜索上,99992361/open_id cross app 目前基本只有这个 PR 命中;#403 是上一层 receive_id_type 修复,#350 是已关闭的 interactive reply 架构问题,#408 追踪 typed OutboundTarget 架构债。
| @@ -239,7 +239,9 @@ private async Task<string> CreateDailyReportAgentAsync( | |||
| var deliveryTarget = LarkConversationTargets.BuildFromInbound( | |||
There was a problem hiding this comment.
这里拿到的 deliveryTarget.FellBackToPrefixInference 在持久化前就丢掉了。BuildFromInbound() 在缺少 LarkUnionId 时会返回 (ou_*, open_id, FellBack=true),但 InitializeSkillRunnerCommand.OutboundConfig 只保存 receive id/type;后面 SkillRunnerGAgent.SendOutputAsync -> LarkConversationTargets.Resolve() 看到 typed fields 都非空,会返回 FellBack=false,所以 PR 描述里的“fallback breadcrumb / correlate missing-union_id ingress”实际不会出现。建议在 daily/social 创建这里发现 FellBackToPrefixInference 时直接 LogDebug/LogWarning,或者把 fallback 原因显式持久化;否则生产上 union_id 没被 Nyx relay 带出来时,仍会静默退回 open_id 并继续打出同样的 99992361。
There was a problem hiding this comment.
说得对。FellBackToPrefixInference 在 OutboundConfig 里被丢了,下游 Resolve() 看到 typed 字段都非空就返回 FellBack=false,PR 描述里的“fallback breadcrumb”确实不会触发。e661d5b7 把 BuildFromInbound 调用抽成 AgentBuilderTool.ResolveDeliveryTarget(...),在 fallback 分支立即 LogDebug 一次(带 agent_id / has-union_id / has-lark-chat_id / has-sender / resolved-type / 99992361 关联提示),不持久化 fallback 状态——观测点放在唯一拥有完整信息的位置(agent-create time),下游 SkillRunner 路径继续把 typed pair 当 authoritative。
新增两个测试 pin 行为:
LogsFallbackBreadcrumb_When_LarkUnionIdMissing:缺 union_id 时 fires 一次 LogDebug,包含上述 correlation 字段DoesNotLogFallback_When_LarkUnionIdPresent:union_id 在场时绝不 log(避免正常 ingress 流量把信号淹没)
eanzhao
left a comment
There was a problem hiding this comment.
补一个生产落地提醒:这版修的是“新建 agent 时捕获并持久化 union_id”。日志里的那个既有 skill-runner-94d754dfdfbb416aa5a676cecd0d7a71 如果状态里已经存了 LarkReceiveId=ou_* / LarkReceiveIdType=open_id,部署后 SendOutputAsync 仍会把 typed fields 原样拿去发,不会自动恢复成 union_id。
所以如果目标是让现有失败的 SkillRunner 也立刻恢复,需要明确一个运维/迁移动作:例如让用户重建或重新初始化 agent、加一次 catalog/state repair、或者在状态里补写 union_id。否则 PR 合进去后,新 agent 会好,旧 agent 可能继续打 99992361 open_id cross app。
Codecov Report✅ All modified and coverable lines are covered by tests. @@ Coverage Diff @@
## dev #409 +/- ##
==========================================
- Coverage 70.39% 70.37% -0.02%
==========================================
Files 1175 1175
Lines 84452 84452
Branches 11124 11124
==========================================
- Hits 59448 59436 -12
- Misses 20714 20723 +9
- Partials 4290 4293 +3
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
…eadcrumb Two inline review comments after the initial PR #409 push. * Templates card button no-op (codex-bot, P2): the new /agents card surfaces a `Templates` button emitting `agent_builder_action=list_templates`, but AgentBuilderCardFlow.TryResolveAsync had no `ListTemplatesAction` case in its card-action switch, so the click fell through unhandled and the button looked dead. Added the case to dispatch a `list_templates` tool call (mirroring the existing text-flow path). * Fallback breadcrumb invisible at runtime (eanzhao, pulls/409#discussion_r3141562097): BuildFromInbound returns (ou_*, open_id, FellBack=true) when the relay omits LarkUnionId, but the flag was dropped before persistence — `OutboundConfig` only stores the receive id/type, so the downstream `LarkConversationTargets.Resolve` at SkillRunner send time sees populated typed fields and reports FellBack=false. The "fallback breadcrumb / correlate missing-union_id ingress" promised in the PR description never fired. Extracted `AgentBuilderTool.ResolveDeliveryTarget(...)` to the `IAgentTool.ExecuteAsync` flow so it logs once at agent-create time whenever the typed pair was inferred from the legacy fallback. The log carries agent_id, has-union_id / has-lark-chat_id / has-sender flags, and the resolved receive_id_type so operators can correlate Lark `99992361` rejections with missing-union_id ingress. Tests: * AgentBuilderCardFlowTests: card_action with `agent_builder_action=list_templates` dispatches a tool call instead of no-op'ing. * AgentBuilderToolTests: - LogsFallbackBreadcrumb: when LarkUnionId / LarkChatId are absent, a single LogDebug fires with the fallback context and the `99992361` correlation hint. - DoesNotLogFallback: when LarkUnionId is present, no breadcrumb fires (otherwise the signal would drown in normal traffic). Verification: 398 → 401 ChannelRuntime tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eanzhao
left a comment
There was a problem hiding this comment.
复看最新提交 e661d5b,这轮 fix 把两个点都补上了:
/agents卡片里的Templates按钮现在会走agent_builder_action=list_templates,不再 no-op,并且有AgentBuilderCardFlowTests覆盖。- fallback breadcrumb 已经前移到
AgentBuilderTool.ResolveDeliveryTarget(...)的 agent-create 时刻,能在LarkUnionId缺失时记录agent_id / hasUnionId / hasLarkChatId / hasSenderId / resolvedReceiveIdType,避开 typed fields 持久化后无法再看出 fallback 的问题。
我在最新 PR head 上验证了:
dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo
bash tools/ci/test_stability_guards.sh结果:401/401 通过,稳定性门禁通过。没有新的代码级 blocker。之前那条“现有已创建 SkillRunner 仍需重建/状态修复/运维处理”的提醒仍然成立,但这是 rollout 事项,不影响这轮代码修复合入。
…s app` Reviewer concern (PR #409 pulls/409#review-4175198266): the cross-app fix captures union_id at agent-CREATE time, so existing SkillRunner / WorkflowAgent state pinned to `LarkReceiveIdType=open_id` will keep firing `99992361 open_id cross app` after this PR ships — the persisted typed pair is treated as authoritative and there is no self-heal path. The right operational answer is "delete the broken agent and recreate it", which is now one click via the new `/agents` card → Delete row button → `/daily` (or `/social-media`). But the bare Lark error string in `/agent-status`'s `last_error` field gives no signal to operators or end users that recreation is the fix. Surface that signal at the ONLY layer that has the structured Lark code: when `LarkProxyResponse.TryGetError` returns code 99992361, both `SkillRunnerGAgent.SendOutputAsync` and `FeishuCardHumanInteractionPort.SendMessageAsync` now expand the `InvalidOperationException` message into: "Lark message delivery rejected (code=99992361): open_id cross app. This agent was created before cross-app union_id ingress existed; delete and recreate it (`/agents` → Delete → `/daily`) to pick up the cross-app safe target." The `Error` field on `SkillRunnerExecutionFailedEvent` carries this message verbatim, so `/agent-status`'s `last_error` becomes self-documenting. Other Lark codes keep the existing concise format. Tests: SkillRunnerGAgentTests.SendOutputAsync_ShouldIncludeRecreateHint_ When_LarkRejectsAsCrossAppOpenId pins the hint contract. Verification: 401 → 402 ChannelRuntime tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(回复 PR review 4175196647) 是的,新建路径已经覆盖:
(回复 PR review 4175198266) 同意——这个 PR 修不了已经持久化 采取的:让旧 agent 失败时给出"recreate"明确指引
没采取的,及理由:
合并后的预期落地序列:
|
Symptom
After PR #403 shipped, every SkillRunner outbound DM was rejected by Lark with
code:99992361 open_id cross app(production logaevatar-console-backend-75fd899df8-s6mp7). The user now sees the Daily Report agent created successfully and the immediate ack, but every scheduled run fails on send. PR #403 fixed the typedreceive_id_typeplumbing — this PR fixes the actual identifier that gets sent.Root cause
Lark
open_idis app-scoped: each Lark app issues its ownou_*for the same user, and an open_id minted by app A is invalid as areceive_idon app B. In our deployment:ou_*in the event payload is scoped to that app.s/api-lark-botproxy — the customer's Lark app, a different app from the relay-side ingress.PR #403's
BuildFromInbound("p2p", ...)pinned(senderId, "open_id")on the typed delivery target. Lark rejects the cross-app open_id every time; the retry loop runs out of attempts and the agent flips toerror.The right cross-app identifier is Lark
union_id(on_*): tenant-stable, shared across apps in the same tenant, accepted by Lark asreceive_id_type=union_idfrom any app. Lark events always carry it underevent.sender.sender_id.union_id(andevent.operator.union_idfor card actions). NyxID forwards the original event verbatim underraw_platform_data(NyxIDbuild_callback_payload), so aevatar can recover it without a NyxID change.Fix scope
1. Surface
union_idandchat_idat ingressTransportExtrasproto: addnyx_lark_union_id(cross-app stable user identifier) andnyx_lark_chat_id(tenant-scoped group/channel id). Lark-only.NyxIdRelayTransport.ParseInboundAsync: parseraw_platform_datafor Lark traffic.im.message.receive_v1: readevent.sender.sender_id.union_idandevent.message.chat_id.card.action.trigger: readevent.operator.union_idandevent.context.open_chat_id.ChannelMetadataKeys+ChannelConversationTurnRunner.BuildReplyMetadata: addLarkUnionId/LarkChatIdkeys and propagate fromTransportExtras.AgentBuilderTool.CreateDailyReportAgentAsync+CreateSocialMediaAgentAsync: passLarkUnionIdandLarkChatIdtoBuildFromInbound.2. Prefer cross-app safe targets in
BuildFromInboundThe legacy open_id path keeps the relay-app self-send case working in dev / single-app deployments; the FellBack flag makes the cross-app fallback observable so operators can correlate Lark
99992361rejections with missing-union_id ingress.3. /agents and /agent-status as interactive cards (顺手)
User asked to lift the bot's text replies to Lark cards with action buttons:
/agents: summary card + one section per agent + a Status button per agent (carries the fullagent_idso users do not have to retype the long skill-runner id) + footer shortcuts (Create Daily / Create Social Media / Templates)./agent-status <id>: status card + lifecycle action buttons (Run / Disable / Enable / Delete-with-confirm-pre-set / Back to Agents). Delete is marked danger; Lark renders it red.Buttons reuse the existing
agent_builder_actionarguments handled byAgentBuilderCardFlow's card event branch — click routing is unchanged.Files
4. Migration: existing broken agents won't self-heal — give them a recreate hint
This fix captures
union_idat agent-CREATE time, so SkillRunner / WorkflowAgent state persisted before this PR is permanently pinned toLarkReceiveIdType=open_idagainst a relay-app-scopedou_*. Their next scheduled run will keep firing99992361 open_id cross app. There is no automatic self-heal — scheduled runs do not have a fresh ingress event to mineunion_idfrom.To make the failure self-documenting,
SkillRunnerGAgent.SendOutputAsyncandFeishuCardHumanInteractionPort.SendMessageAsyncnow detect Larkcode=99992361(LarkBotErrorCodes.OpenIdCrossApp) and expand the thrown exception into:The
Errorfield onSkillRunnerExecutionFailedEventcarries this verbatim, so/agent-status'slast_erroris now actionable. Combined with the new/agents//agent-statuscard UI in this PR, the recovery flow is two clicks (Delete row + new/daily) — no operator-side migration tool needed.Other migration options considered and rejected:
union_idbackfill: SkillRunner has no fresh ingress at scheduled-run time; would need a separate inbound-event hook to repair.users:readscope orbatch_get_idtranslation we don't have.99992361: would mask transient errors and surprise users who already recreated.Verification
99992361recreate hint).tools/ci/architecture_guards.shreportsPlayground asset drift detected for app.js / app.css— verified pre-existing onorigin/dev(PR #407), unrelated to this change.Out of scope
FeishuCardHumanInteractionPortand the immediate-ack reaction inChannelConversationTurnRunnershould ALSO prefer union_id; left for a follow-up PR since the SkillRunner DM path is the production blocker today./run-agent,/disable-agent,/enable-agent,/delete-agenttext-command replies still return plain text (their card-flow counterparts inAgentBuilderCardFlowalready render cards). Symmetric card output for those is a follow-up.lark_*fields living on platform-agnostic protos (SkillRunnerOutboundConfig,UserAgentCatalogEntry,WorkflowAgentState,TransportExtras) is tracked under #408 — refactor to typedOutboundTargetoneof when a second platform's delivery code lands.🤖 Generated with Claude Code