Lark relay callbacks never reach aevatar — no POST /api/webhooks/nyxid-relay on inbound messages #398

@eanzhao

Symptom

While reproducing #396 we observed a second, unrelated failure mode on the same bot:

  1. User DMs the Lark bot (e.g. asks it to post a card).
  2. Bot never responds.
  3. aevatar-console-backend-api logs during the user's interaction window contain only Kubernetes liveness probes:
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
      Request starting HTTP/1.1 GET http://10.2.9.37:8080/ - - -
…
      Request finished HTTP/1.1 GET http://10.2.9.37:8080/ - 200 - application/json;+charset=utf-8 0.3861ms

No POST /api/webhooks/nyxid-relay, no auth failure, no parse error — the callback never arrives at the aevatar pod we're watching.
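To make the "only probes" observation repeatable, a captured log can be summarized by request method and path. The following is a minimal sketch: it assumes the log uses the ASP.NET Core `Request starting HTTP/1.1 <METHOD> <URL>` format shown above, and the function names `summarize_requests` and `relay_callbacks_seen` are hypothetical helpers, not part of aevatar.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches ASP.NET Core hosting-diagnostics lines such as:
#   Request starting HTTP/1.1 GET http://10.2.9.37:8080/ - - -
REQUEST_START = re.compile(
    r"Request starting HTTP/\S+ (?P<method>[A-Z]+) (?P<url>\S+)"
)


def summarize_requests(log_text):
    """Count 'Request starting' lines by (method, path)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = REQUEST_START.search(line)
        if match:
            path = urlsplit(match.group("url")).path or "/"
            counts[(match.group("method"), path)] += 1
    return counts


def relay_callbacks_seen(log_text, path="/api/webhooks/nyxid-relay"):
    """Return how many POSTs to the relay webhook path appear in the log."""
    return summarize_requests(log_text)[("POST", path)]
```

A result of 0 from `relay_callbacks_seen` over the repro window, with `("GET", "/")` as the only key in the summary, matches the symptom described here.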

Contrast: when the same user later clicks a card button, the webhook does fire (that case is the unsupported_card_action log and #396).

Scope

This issue is explicitly not #396 / PR #397. Those address the case where the webhook is called and the payload is dropped because it's a card_action. Here the webhook is never called at all, so no amount of downstream routing work will help.

Hypotheses (prioritized)

  1. NyxID-side bot registration points at the wrong aevatar host / is missing. The channel bot relay stores a callback_url per bot (see POST /api/v1/webhooks/channel/lark/{bot_id} handler on NyxID origin/main). If that URL is stale (e.g. points at an old *.eanzhao.com or a previous aevatar deploy), every inbound Lark event lands somewhere else or 404s upstream.
  2. Lark → NyxID delivery is broken. Lark treats any non-200 response (even 202) as a failure, then retries and may disable the subscription; the Lark developer console may show a stream of failures. The Lark subscription may also have been reset after an app-secret rotation.
  3. Multi-pod routing. We only have logs from a single pod; if the deploy runs with >1 replica and the ingress is round-robin, the webhook may be hitting a different pod than the one we're tailing. The logs from any single pod would look exactly like the sample above for requests that land on the other replica.
  4. Bot registered under a different scope / api_key_id than the one Lark fires for. The X-NyxID-Signature check in NyxIdRelayAuthValidator would reject with 401, but even that would show a Request starting POST … line — the fact that we see nothing argues against this one; still worth confirming.
  5. Stale or deactivated channel bot record on NyxID. GET /api/v1/channel-bots?bot_id=<id> will show whether the record exists and is active.

(1), (2), and (3) are the most likely given the evidence.

Investigation steps

On NyxID (/Users/zhaoyiqi/Code/NyxID, origin/main branch — see reference_nyxid_channel_relay_branch):

  • GET /api/v1/channel-bots?bot_id=<id> — confirm bot exists, is active, and note its callback_url.
  • Tail NyxID logs for POST /api/v1/webhooks/channel/lark/{bot_id} during a repro — are Lark events reaching NyxID at all?
  • If NyxID is receiving the events, tail its outbound relay logs for POST <callback_url> to confirm where it's trying to forward.
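The bot-record check above could be scripted roughly as follows. This is a sketch under stated assumptions: the response is assumed to be a single JSON object, the `active` field name is a guess (only `callback_url` is named in this issue), and whatever authentication NyxID requires is left to the caller through the hypothetical `headers` parameter.

```python
import json
from urllib.parse import urlencode, urlsplit
from urllib.request import Request, urlopen


def fetch_bot_record(base_url, bot_id, headers=None):
    """GET /api/v1/channel-bots?bot_id=<id> and decode the JSON body.

    Assumption: the endpoint returns one JSON object for the bot.
    """
    query = urlencode({"bot_id": bot_id})
    url = f"{base_url.rstrip('/')}/api/v1/channel-bots?{query}"
    request = Request(url, headers=headers or {})
    with urlopen(request, timeout=10) as response:
        return json.load(response)


def check_bot_record(record, expected_host,
                     expected_path="/api/webhooks/nyxid-relay"):
    """Return a list of problems found in a channel-bot record."""
    if not record:
        return ["no channel bot record returned"]
    problems = []
    # Assumption: the record exposes a boolean "active" field.
    if not record.get("active"):
        problems.append("bot record is not active")
    callback_url = record.get("callback_url")
    if not callback_url:
        problems.append("callback_url is missing")
        return problems
    parts = urlsplit(callback_url)
    if parts.hostname != expected_host:
        problems.append(
            f"callback_url host {parts.hostname!r} != {expected_host!r}"
        )
    if parts.path != expected_path:
        problems.append(
            f"callback_url path {parts.path!r} != {expected_path!r}"
        )
    return problems
```

An empty list from `check_bot_record` would rule out hypotheses (1) and (5); a host mismatch would point at a stale registration.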

On aevatar:

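  • Run kubectl get pods -n <ns> and count the replicas; if there is more than one, run kubectl logs -n <ns> -l app=<app> --all-containers --prefix --tail=200 -f (or equivalent) and retry the repro so we can see which pod (if any) the webhook lands on. --prefix tags each line with its source pod; with many replicas, --max-log-requests may need raising above its default of 5.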
  • Confirm the registered callback_url (whatever NyxID has) actually resolves to the running service.
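Once logs from all replicas are collected, relay hits can be counted per pod. The sketch below assumes the logs were captured with kubectl's `--prefix` flag, so each line begins with `[pod/<pod-name>/<container-name>]`, and that request lines follow the ASP.NET Core `Request starting` format from the sample above. `relay_hits_by_pod` is a hypothetical helper name.

```python
import re

# kubectl logs --prefix prepends "[pod/<pod-name>/<container-name>] ".
PREFIX = re.compile(r"^\[pod/(?P<pod>[^/\]]+)/[^\]]+\]\s?(?P<rest>.*)$")

# A "Request starting" line for a POST to the relay webhook path.
RELAY_POST = re.compile(
    r"Request starting HTTP/\S+ POST \S*/api/webhooks/nyxid-relay(?:[?\s]|$)"
)


def relay_hits_by_pod(lines):
    """Map each pod seen in the log to its count of relay webhook POSTs.

    Pods that logged something but received no relay POST map to 0,
    so a silent replica is distinguishable from a missing one.
    """
    hits = {}
    for line in lines:
        match = PREFIX.match(line)
        if not match:
            continue  # skip lines without a pod prefix
        pod = match.group("pod")
        hits.setdefault(pod, 0)
        if RELAY_POST.search(match.group("rest")):
            hits[pod] += 1
    return hits
```

Passing the saved log lines (for example `open("pods.log")`) to `relay_hits_by_pod` gives one entry per pod. If every pod maps to 0 during a repro, hypothesis (3) can be ruled out.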

On Lark developer console:

  • Check the event subscription status for this bot — if events are failing upstream of NyxID, Lark will surface retries / disabled state.

Related

Acceptance

  • Root cause identified from the hypothesis list (or a new one) with evidence from NyxID + aevatar logs.
  • If infra/config: callback URL restored and a round-trip verified via a real Lark DM.
  • If code: fix lands behind a test that pins the contract that was violated.
