Symptom
While reproducing #396 we observed a second, unrelated failure mode on the same bot:
- User DMs the Lark bot (e.g. asks it to post a card).
- Bot never responds.
`aevatar-console-backend-api` logs during the user's interaction window contain only Kubernetes liveness probes:

```
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
      Request starting HTTP/1.1 GET http://10.2.9.37:8080/ - - -
…
      Request finished HTTP/1.1 GET http://10.2.9.37:8080/ - 200 - application/json;+charset=utf-8 0.3861ms
```
No `POST /api/webhooks/nyxid-relay`, no auth failure, no parse error — the callback never arrives at the aevatar pod we're watching.
Contrast: when the same user later clicks a card button, the webhook does fire (that case is the `unsupported_card_action` log and #396).
Scope
This issue is explicitly not #396 / PR #397. Those address the case where the webhook is called and the payload is dropped because it's a `card_action`. Here the webhook is never called at all, so no amount of downstream routing work will help.
Hypotheses (prioritized)
- NyxID-side bot registration points at the wrong aevatar host / is missing. The channel bot relay stores a `callback_url` per bot (see the `POST /api/v1/webhooks/channel/lark/{bot_id}` handler on NyxID `origin/main`). If that URL is stale (e.g. points at an old `*.eanzhao.com` host or a previous aevatar deploy), every inbound Lark event lands somewhere else or 404s upstream.
- Lark → NyxID delivery is broken. Lark treats non-200 (even 202) as failure and starts retrying / disabling the subscription; the Lark developer console may show a stream of failures. Also, the Lark subscription may have been reset after an app-secret rotation.
- Multi-pod routing. We only have logs from a single pod; if the deploy runs with >1 replica and the ingress is round-robin, the webhook may be hitting a different pod than the one we're tailing. The logs from any single pod would look exactly like the sample above for requests that land on the other replica.
- Bot registered under a different scope / `api_key_id` than the one Lark fires for. The `X-NyxID-Signature` check in `NyxIdRelayAuthValidator` would reject with 401, but even that would log a `Request starting POST …` line — the fact that we see nothing argues against this hypothesis; still worth confirming.
- Stale or deactivated channel bot record on NyxID. `GET /api/v1/channel-bots?bot_id=<id>` will show whether the record exists and is `active`.
The first three hypotheses are the most likely given the evidence.
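The signature-mismatch hypothesis can be probed offline. Below is a minimal sketch of the kind of HMAC check we'd expect, assuming NyxID signs the raw request body with a shared secret using HMAC-SHA256 and sends the hex digest in `X-NyxID-Signature` — the real scheme lives in `NyxIdRelayAuthValidator` and may differ (timestamp prefix, base64, etc.), so treat this as illustrative only:

```python
import hashlib
import hmac

def compute_signature(secret: str, body: bytes) -> str:
    """Hex HMAC-SHA256 over the raw request body (assumed scheme)."""
    return hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()

def signature_matches(secret: str, body: bytes, header_value: str) -> bool:
    """Constant-time comparison against the X-NyxID-Signature header value."""
    return hmac.compare_digest(compute_signature(secret, body), header_value)

if __name__ == "__main__":
    body = b'{"event": "im.message.receive_v1"}'
    sig = compute_signature("test-secret", body)
    print(signature_matches("test-secret", body, sig))   # True
    print(signature_matches("wrong-secret", body, sig))  # False — the 401 path after a secret rotation
```

If the secret on either side was rotated (see the app-secret hypothesis above), every relay attempt would fail this check — but, again, that would still produce a `Request starting POST` line on our side, which we don't see.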
Investigation steps
On NyxID (`/Users/zhaoyiqi/Code/NyxID`, `origin/main` branch — see `reference_nyxid_channel_relay_branch`):
- `GET /api/v1/channel-bots?bot_id=<id>` — confirm the bot exists, is `active`, and note its `callback_url`.
- Tail NyxID logs for `POST /api/v1/webhooks/channel/lark/{bot_id}` during a repro — are Lark events reaching NyxID at all?
- If NyxID is receiving the events, tail its outbound relay logs for `POST <callback_url>` to confirm where it's trying to forward.
On aevatar:
- `kubectl get pods -n <ns>` to count replicas; if >1, `kubectl logs -n <ns> -l app=<app> --all-containers --tail=200 -f` (or equivalent) and retry the repro so we can see which pod (if any) the webhook lands on.
- Confirm the registered `callback_url` (whatever NyxID has) actually resolves to the running service.
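The resolution check can be scripted. This sketch only verifies DNS for the registered `callback_url`'s hostname — an actual HTTP round-trip through the ingress would still be needed to prove routing reaches the right replica:

```python
import socket
from urllib.parse import urlparse

def callback_resolves(callback_url: str) -> bool:
    """True if the callback URL's hostname resolves via DNS."""
    parsed = urlparse(callback_url)
    host = parsed.hostname
    if not host:
        return False  # malformed URL — can't even extract a hostname
    try:
        socket.getaddrinfo(host, parsed.port or 443)
        return True
    except socket.gaierror:
        return False  # stale hostname: events forwarded here go nowhere

if __name__ == "__main__":
    print(callback_resolves("https://localhost/api/webhooks/nyxid-relay"))
```

A `False` here against the URL NyxID has registered would confirm hypothesis 1 without needing any log access.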
On Lark developer console:
- Check the event subscription status for this bot — if events are failing upstream of NyxID, Lark will surface retries / disabled state.
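One console-side failure mode worth remembering while checking subscription status: when a Lark event subscription URL is (re)configured, Lark sends a `url_verification` event and expects the `challenge` value echoed back in the 200 response; if the endpoint in front of NyxID doesn't handle that, the subscription never (re)activates and no events flow at all. A minimal sketch of the expected echo (the non-verification branch is a placeholder, not NyxID's real handler):

```python
import json

def handle_lark_event(raw_body: str) -> dict:
    """Echo Lark's url_verification challenge; pass other events through.

    Lark expects {"challenge": "<same value>"} in the 200 response body
    when verifying a newly configured request URL.
    """
    payload = json.loads(raw_body)
    if payload.get("type") == "url_verification":
        return {"challenge": payload["challenge"]}
    return {"ok": True}  # placeholder for normal event relay
```

If the subscription was reset after the app-secret rotation mentioned above, re-saving the request URL in the console triggers exactly this verification handshake.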
Related
- `reference_nyxid_channel_relay_branch` — reminder that channel bot relay lives on NyxID `origin/main`, not `dev`.
- `reference_nyxid_relay_callback_protocol` — header / HMAC / reply_token protocol we'd expect once delivery is restored.
Acceptance
- Root cause identified from the hypothesis list (or a new one) with evidence from NyxID + aevatar logs.
- If infra/config: callback URL restored and a round-trip verified via a real Lark DM.
- If code: fix lands behind a test that pins the contract that was violated.