Symptom
While reproducing #396 we observed a second, unrelated failure mode on the same bot:
- User DMs the Lark bot (e.g. asks it to post a card).
- Bot never responds.
`aevatar-console-backend-api` logs during the user's interaction window contain only Kubernetes liveness probes:

```
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
      Request starting HTTP/1.1 GET http://10.2.9.37:8080/ - - -
…
      Request finished HTTP/1.1 GET http://10.2.9.37:8080/ - 200 - application/json;+charset=utf-8 0.3861ms
```
No `POST /api/webhooks/nyxid-relay`, no auth failure, no parse error — the callback never arrives at the aevatar pod we're watching.
Contrast: when the same user later clicks a card button, the webhook does fire (that case is the `unsupported_card_action` log and #396).
Scope
This issue is explicitly not #396 / PR #397. Those address the case where the webhook is called and the payload is dropped because it's a `card_action`. Here the webhook is never called at all, so no amount of downstream routing work will help.
Hypotheses (prioritized)
- NyxID-side bot registration points at the wrong aevatar host / is missing. The channel bot relay stores a `callback_url` per bot (see the `POST /api/v1/webhooks/channel/lark/{bot_id}` handler on NyxID `origin/main`). If that URL is stale (e.g. points at an old `*.eanzhao.com` host or a previous aevatar deploy), every inbound Lark event lands somewhere else or 404s upstream.
- Lark → NyxID delivery is broken. Lark treats non-200 (even 202) as failure and starts retrying / disabling the subscription; the Lark developer console may show a stream of failures. Also, the Lark subscription may have been reset after an app-secret rotation.
- Multi-pod routing. We only have logs from a single pod; if the deploy runs with >1 replica and the ingress is round-robin, the webhook may be hitting a different pod than the one we're tailing. The logs from any single pod would look exactly like the sample above for requests that land on the other replica.
- Bot registered under a different scope / `api_key_id` than the one Lark fires for. The `X-NyxID-Signature` check in `NyxIdRelayAuthValidator` would reject with 401, but even that would log a `Request starting POST …` line — the fact that we see nothing argues against this hypothesis; still worth confirming.
- Stale or deactivated channel bot record on NyxID. `GET /api/v1/channel-bots?bot_id=<id>` will show whether the record exists and is `active`.
The first three hypotheses are the most likely given the evidence.
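The signature-mismatch hypothesis can be probed offline. Below is a minimal sketch of the kind of HMAC check we'd expect, assuming NyxID signs the raw request body with a shared secret using HMAC-SHA256 and sends the hex digest in `X-NyxID-Signature` — the real scheme lives in `NyxIdRelayAuthValidator` and may differ (timestamp prefix, base64, etc.), so treat this as illustrative only:

```python
import hashlib
import hmac

def compute_signature(secret: str, body: bytes) -> str:
    """Hex HMAC-SHA256 over the raw request body (assumed scheme)."""
    return hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()

def signature_matches(secret: str, body: bytes, header_value: str) -> bool:
    """Constant-time comparison against the X-NyxID-Signature header value."""
    return hmac.compare_digest(compute_signature(secret, body), header_value)

if __name__ == "__main__":
    body = b'{"event": "im.message.receive_v1"}'
    sig = compute_signature("test-secret", body)
    print(signature_matches("test-secret", body, sig))   # True
    print(signature_matches("wrong-secret", body, sig))  # False — the 401 path after a secret rotation
```

If the secret on either side was rotated (see the app-secret hypothesis above), every relay attempt would fail this check — but, again, that would still produce a `Request starting POST` line on our side, which we don't see.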
Investigation steps
On NyxID (`/Users/zhaoyiqi/Code/NyxID`, `origin/main` branch — see `reference_nyxid_channel_relay_branch`):
- `GET /api/v1/channel-bots?bot_id=<id>` — confirm the bot exists, is `active`, and note its `callback_url`.
- Tail NyxID logs for `POST /api/v1/webhooks/channel/lark/{bot_id}` during a repro — are Lark events reaching NyxID at all?
- If NyxID is receiving the events, tail its outbound relay logs for `POST <callback_url>` to confirm where it's trying to forward.
On aevatar:
- `kubectl get pods -n <ns>` to count replicas; if >1, `kubectl logs -n <ns> -l app=<app> --all-containers --tail=200 -f` (or equivalent) and retry the repro so we can see which pod (if any) the webhook lands on.
- Confirm the registered `callback_url` (whatever NyxID has) actually resolves to the running service.
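The resolution check can be scripted. This sketch only verifies DNS for the registered `callback_url`'s hostname — an actual HTTP round-trip through the ingress would still be needed to prove routing reaches the right replica:

```python
import socket
from urllib.parse import urlparse

def callback_resolves(callback_url: str) -> bool:
    """True if the callback URL's hostname resolves via DNS."""
    parsed = urlparse(callback_url)
    host = parsed.hostname
    if not host:
        return False  # malformed URL — can't even extract a hostname
    try:
        socket.getaddrinfo(host, parsed.port or 443)
        return True
    except socket.gaierror:
        return False  # stale hostname: events forwarded here go nowhere

if __name__ == "__main__":
    print(callback_resolves("https://localhost/api/webhooks/nyxid-relay"))
```

A `False` here against the URL NyxID has registered would confirm hypothesis 1 without needing any log access.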
On Lark developer console:
- Check the event subscription status for this bot — if events are failing upstream of NyxID, Lark will surface retries / disabled state.
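One console-side failure mode worth remembering while checking subscription status: when a Lark event subscription URL is (re)configured, Lark sends a `url_verification` event and expects the `challenge` value echoed back in the 200 response; if the endpoint in front of NyxID doesn't handle that, the subscription never (re)activates and no events flow at all. A minimal sketch of the expected echo (the non-verification branch is a placeholder, not NyxID's real handler):

```python
import json

def handle_lark_event(raw_body: str) -> dict:
    """Echo Lark's url_verification challenge; pass other events through.

    Lark expects {"challenge": "<same value>"} in the 200 response body
    when verifying a newly configured request URL.
    """
    payload = json.loads(raw_body)
    if payload.get("type") == "url_verification":
        return {"challenge": payload["challenge"]}
    return {"ok": True}  # placeholder for normal event relay
```

If the subscription was reset after the app-secret rotation mentioned above, re-saving the request URL in the console triggers exactly this verification handshake.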
Related
- `reference_nyxid_channel_relay_branch` — reminder that channel bot relay lives on NyxID `origin/main`, not `dev`.
- `reference_nyxid_relay_callback_protocol` — header / HMAC / reply_token protocol we'd expect once delivery is restored.
Acceptance
- Root cause identified from the hypothesis list (or a new one) with evidence from NyxID + aevatar logs.
- If infra/config: callback URL restored and a round-trip verified via a real Lark DM.
- If code: fix lands behind a test that pins the contract that was violated.