fix(agent-builder): use UserService.id for api-key allowed_service_ids (#417) #418

Merged
eanzhao merged 3 commits into dev from fix/2026-04-25_api-key-service-id-type
Apr 25, 2026

@eanzhao eanzhao commented Apr 25, 2026

Summary

  • Fixes #417 (api-key allowed_service_ids uses the catalog DownstreamService.id while NyxID enforcement expects UserService.id, so /daily fails with a spurious github_proxy_access_denied): /daily (and any agent_builder flow that mints a child NyxID API key) consistently fails with github_proxy_access_denied even when the user's GitHub OAuth is healthy. The root cause is that we populate the new key's allowed_service_ids with catalog DownstreamService.id values, but NyxID's proxy enforcement (proxy.rs:1030) compares against the per-user UserService.id. The mismatch is silently accepted at create time and surfaces as 403 ApiKeyScopeForbidden on every proxy call.
  • Routes service resolution through GET /api/v1/user-services and populates allowed_service_ids with each per-user UserService.id. When the same slug has multiple rows (mixed personal + org bindings, stale rows, etc.), prefers the most eligible one — only emits service_inactive / service_org_viewer_only errors when no eligible row exists.
  • Pins allow_all_services = false in the api-key payload so NyxID's enforcement actually consults allowed_service_ids (the field defaults to true, which short-circuits enforcement and silently grants broad scope; see "Why this was invisible" below).
  • Retains BestEffortRevokeApiKeyAsync: the GitHub preflight failure path still revokes the freshly minted key so retries don't accumulate orphans. The hint is reworded to point at re-authorizing the GitHub provider, not at api-key bindings (the previous hint was based on the misdiagnosis behind the #411 fix, "SkillRunner daily_report GitHub proxy 403s at runtime").

Why this was invisible until production

Session-token-minted API keys default to allow_all_services=true, which short-circuits the enforcement check (NyxID proxy.rs:1030). A developer reproducing POST /api-keys + /proxy/s/api-github/rate_limit from a CLI never tripped the bug. The agent path mints child keys via the channel-relay delegation token; NyxID forces those children to inherit allow_all_services=false from the parent, which is what activates the enforcement check and makes the ID mismatch fatal. See #417 review comment for the full repro matrix.

This PR also makes the narrow scope first-class — BuildCreateApiKeyPayload sends allow_all_services = false explicitly, so the resolver's output is enforced regardless of what the parent's setting happens to be. As a side benefit this also turns on NyxID's validate_service_ids at create-time (key_service.rs:183), so a malformed UserService.id fails fast at POST /api-keys instead of silently passing through and 403'ing every later proxy call.
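
For readers who want to see the shape of the payload change, here is a minimal sketch. It is not the actual BuildCreateApiKeyPayload: the class name and the `name` field are assumptions made for illustration, and only `allowed_service_ids` and `allow_all_services` are taken from the description above.

```csharp
using System.Collections.Generic;
using System.Text.Json;

public static class ApiKeyPayloadSketch
{
    // Hypothetical helper that mirrors the intent of BuildCreateApiKeyPayload.
    public static string Build(string name, IReadOnlyList<string> userServiceIds)
    {
        var payload = new Dictionary<string, object>
        {
            ["name"] = name,                          // assumed field, not from the PR
            ["allowed_service_ids"] = userServiceIds, // per-user UserService.id values
            ["allow_all_services"] = false,           // explicit, never left to the default
        };
        return JsonSerializer.Serialize(payload);
    }
}
```

Sending the flag explicitly means the narrow scope no longer depends on what the parent token happens to force on its children.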

Changes

  • src/Aevatar.AI.ToolProviders.NyxId/NyxIdApiClient.cs — add ListUserServicesAsync (GET /api/v1/user-services).
  • agents/Aevatar.GAgents.ChannelRuntime/AgentBuilderTool.cs:
    • Rewrite ResolveProxyServiceIdsAsync to use /user-services and return per-user UserService.id. When the same slug has duplicate rows, prefer the most eligible (is_active && !(org && allowed != true)) instead of freezing the first row seen. Specific service_inactive / service_org_viewer_only errors are only emitted when no eligible row exists for a slug.
    • Failure contract is now pre-formed JSON envelopes with stable error keys (service_not_connected, service_inactive, service_org_viewer_only, user_services_unavailable, user_services_parse_failed, no_required_slugs). Both call sites return the envelope verbatim instead of double-wrapping.
    • BuildCreateApiKeyPayload adds allow_all_services = false (per review #4175529548). allow_all_nodes left at the NyxID default — this flow does not restrict node routing.
    • PreflightGitHubProxyAsync doc comment + user-facing hint rewritten to point at re-authorizing the GitHub provider at NyxID, not at api-key bindings. BestEffortRevokeApiKeyAsync retained for the preflight failure path so retries don't accumulate orphan keys.
  • test/Aevatar.GAgents.ChannelRuntime.Tests/AgentBuilderToolTests.cs:
    • Update all 10 stub sites from /proxy/services?per_page=100 to /user-services with the new row shape (id, slug, is_active, credential_source).
    • Update _FailsClosed_When_RequiredProxyServices_AreMissing to assert the structured service_not_connected envelope instead of free-text.
    • Update _FailsClosed_When_GithubProxyDeniedForNewKey to pin the new hint wording AND that the api-key IS revoked (DELETE on /api/v1/api-keys/key-403).
    • Existing allow_all_services assertions flipped from TryGetProperty(...).Should().BeFalse() (field absent) to GetProperty(...).GetBoolean().Should().BeFalse() (field present and false).
    • Add four new tests:
      • _FailsClosed_When_RequiredSlug_IsInactive: is_active: false row → service_inactive.
      • _FailsClosed_When_OrgSharedSlug_IsViewerOnly: org-shared allowed: false → service_org_viewer_only.
      • _AllowedServiceIds_AreUserServiceIds_NotCatalogIds: regression pin. Stub id distinct from catalog_service_id, assert the api-key payload carries the per-user id.
      • _PicksEligibleRow_When_DuplicateSlugRowsExist: duplicate slug rows with the ineligible one first; assert the resolver still picks the eligible one.
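
The failure contract in the Changes list can be illustrated roughly as follows. This is a sketch under assumptions: the review thread only confirms that the envelope has an `error` field with the stable keys listed above; the `slug` field, the helper names, and the order in which the checks run are hypothetical.

```csharp
#nullable enable
using System.Text.Json;

public static class ErrorEnvelopeSketch
{
    // Returns the stable error key for a required slug, or null when the best
    // row found for that slug is usable. Checking is_active before the org
    // check is an assumption of this sketch.
    public static string? ErrorKeyFor(bool found, bool isActive, bool isOrg, bool? allowed)
    {
        if (!found) return "service_not_connected";
        if (!isActive) return "service_inactive";
        if (isOrg && allowed != true) return "service_org_viewer_only";
        return null;
    }

    // Builds a pre-formed JSON envelope that call sites can return verbatim.
    // "slug" is a hypothetical extra field.
    public static string Envelope(string errorKey, string slug) =>
        JsonSerializer.Serialize(new { error = errorKey, slug });
}
```

Returning the envelope as a finished string is what lets both call sites pass it through without wrapping it a second time.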

Test plan

  • dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj — 421/421 passing.
  • bash tools/ci/architecture_guards.sh — clean (only flag is missing buf for proto lint, unrelated).
  • Production smoke after merge: /daily alice from a Lark DM should now succeed end-to-end with a healthy GitHub OAuth connection. With a revoked GitHub token the response should be the github_proxy_access_denied error pointing at re-authorization (not at api-key bindings).

Follow-ups

…llowed_service_ids (#417)

The agent-create flow minted child API keys whose `allowed_service_ids`
carried `DownstreamService.id` (catalog UUIDs) sourced from
`GET /proxy/services`. NyxID proxy enforcement (proxy.rs:1030) compares
the API key's `allowed_service_ids` against `UserService.id` (per-user
instance UUIDs), so every proxied call returned 403
`ApiKeyScopeForbidden`. Session-token-minted keys default to
`allow_all_services=true`, which short-circuits the check — that's why
the bug was invisible in CLI repros and only surfaced for agent-path
keys (which inherit `allow_all_services=false` from the channel-relay
parent).

The fix routes through `GET /api/v1/user-services` and populates
`allowed_service_ids` with each row's per-user `id`, plus filters out
inactive rows and org-shared rows the caller cannot proxy through (the
latter would otherwise resurface as the less-actionable
`org_role_insufficient`). Failure modes return structured error JSON
with stable error keys (`service_not_connected`, `service_inactive`,
`service_org_viewer_only`) instead of the previous free-text wrap.

Also retires `BestEffortRevokeApiKeyAsync` and rewords the GitHub
preflight hint: under #411 we revoked the freshly-minted key on a
preflight 403 because we mis-attributed the failure to the api-key.
With #417 fixed, preflight only fires on actual GitHub-side issues
(OAuth grant revoked, scopes downgraded, account temp-banned) — the
api-key is healthy at that point. Hint now points users at
re-authorizing the GitHub provider, not at api-key bindings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 65b2d1c18a

Comment on lines +1077 to +1078
if (resolutionsBySlug.ContainsKey(slug))
continue;

P1: Prefer an eligible service row when duplicate slugs exist

In ResolveProxyServiceIdsAsync, the first row seen for a slug is frozen as the only candidate. If /user-services returns multiple rows for the same slug (for example an org-shared allowed:false row before a personal active row), this path will emit service_org_viewer_only/service_inactive and abort even though a valid service is available later in the payload. That creates false-negative agent creation failures for users with mixed service bindings.

Comment on lines 255 to 256
if (preflight is not null)
{
await BestEffortRevokeApiKeyAsync(nyxClient, token, apiKeyId!, "github_preflight_failed", ct);
return preflight;

P2: Revoke API keys when GitHub preflight aborts agent creation

After the key is minted, the preflight failure branch now returns immediately without deleting that key. In the revoked/downgraded GitHub-token scenario, each /daily retry will create another unused NyxID API key while still failing before actor creation, causing unbounded orphan key accumulation and stale credential surface area.

eanzhao commented Apr 25, 2026

Review: does this fix #417, and is it architecturally sound?

TL;DR — fix is correct, but ship with two outstanding concerns Codex already flagged plus one additional one below.

✅ Does this solve #417?

Yes. Root cause is correctly identified and the fix is targeted:

  • ListUserServicesAsync → GET /api/v1/user-services is the right endpoint. The returned id is the per-user UserService.id that NyxID's enforcement (proxy.rs:1030) actually checks against allowed_service_ids.
  • The new regression test _AllowedServiceIds_AreUserServiceIds_NotCatalogIds pins exactly the bug — stubs distinct id vs catalog_service_id and asserts the api-key payload carries the per-user id. This is the right shape of regression pin and would have caught the original bug pre-merge.
  • Splitting the failure modes into service_not_connected / service_inactive / service_org_viewer_only is a meaningful upgrade. The old free-text "Missing required Nyx proxy services" message conflated three distinct causes that need three distinct user actions.
  • The BuildGitHubAuthorizationResponseAsync upstream check is left untouched, so the "OAuth never connected" path still surfaces oauth_required and never falls through to github_proxy_access_denied.

⚠️ Outstanding concerns (in priority order)

1. P2 — orphan API keys on preflight failure (Codex flagged, I concur)

The PR drops BestEffortRevokeApiKeyAsync and justifies it as "the api-key is healthy". That conflates correctly scoped with useful. After this PR, the failure flow is:

  1. BuildGitHubAuthorizationResponseAsync passes (OAuth was connected at step T₀).
  2. POST /api-keys mints a key with the right allowed_service_ids.
  3. PreflightGitHubProxyAsync calls /rate_limit, gets 401/403 because the user revoked the OAuth grant between T₀ and T₁.
  4. We return the structured error. The api-key stays in NyxID, with no consumer.

Every retry of /daily while OAuth is revoked creates another orphan key. PR #412's reviewer (r3141699756) caught this exact concern under #411 and added the revoke. The reasoning that fixed it then is still valid now — the OAuth-revocation window is precisely the case the preflight was kept for, so it's the case we should expect to hit. Either re-introduce the best-effort revoke (limited to preflight failure, not OAuth-not-connected), or note this as an accepted trade-off and confirm NyxID has a TTL on unused keys.
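
The "best-effort revoke limited to preflight failure" option could look roughly like the sketch below. The delegate-based signature and the names are hypothetical and do not reflect the real BestEffortRevokeApiKeyAsync; the point is only that a failed cleanup must not mask the preflight error.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RevokeSketch
{
    // Runs the preflight. On failure, tries to delete the freshly minted key
    // and swallows deletion errors so the preflight error is what surfaces.
    public static async Task<string?> PreflightWithCleanupAsync(
        Func<CancellationToken, Task<string?>> preflight, // error JSON, or null on success
        Func<CancellationToken, Task> revokeKey,
        CancellationToken ct = default)
    {
        var error = await preflight(ct);
        if (error is null) return null;

        try
        {
            await revokeKey(ct);
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Best effort only: real code would log the failure here.
        }
        return error;
    }
}
```

Cancellation is deliberately allowed to propagate in this sketch; whether the real helper does the same is not stated in the thread.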

2. P1 — first-match-wins for duplicate slugs (Codex flagged, I concur)

if (resolutionsBySlug.ContainsKey(slug)) continue; freezes the first row seen. Stable order is server-determined. If NyxID returns [org-shared allowed:false, personal active:true] for the same slug (legacy migration leftovers + a new personal connect), the user gets service_org_viewer_only even though a usable row exists. The doc comment acknowledges this but treats it as acceptable — I'd argue it's not, because the failure surfaces an actionable-but-wrong instruction ("ask an admin to widen org role scope") when the user has already connected a personal credential.

Suggest preferring rows in this order: is_active && credential_source.type == "personal" → is_active && org && allowed:true → otherwise. Cheap to do, no new test surface needed beyond a single _PrefersActivePersonalRow_OverInactiveOrgRow case.
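
A rough sketch of that preference order, assuming a flattened row shape. The `Candidate` record and the `"org"` value for credential_source.type are assumptions; only `"personal"`, `is_active`, and `allowed` appear in this thread.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical flattened view of one /user-services row.
public sealed record Candidate(string Id, bool IsActive, string CredentialSourceType, bool? Allowed);

public static class PreferenceSketch
{
    // Lower rank wins: active personal, then active org with allowed == true, then the rest.
    public static int Rank(Candidate c)
    {
        if (c.IsActive && c.CredentialSourceType == "personal") return 0;
        if (c.IsActive && c.CredentialSourceType == "org" && c.Allowed == true) return 1;
        return 2;
    }

    // OrderBy is a stable sort, so rows with equal rank keep server order.
    public static Candidate Pick(IEnumerable<Candidate> sameSlugRows) =>
        sameSlugRows.OrderBy(Rank).First();
}
```

Because the sort is stable, two ineligible rows still resolve to the first one seen, so the specific error reported for that slug stays deterministic for a given response.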

3. (mine) Helper-function contract leak: EnumerateProxyServiceItems reused for /user-services

EnumerateProxyServiceItems (line 1621) walks services, custom_services, data. Its name and original purpose are scoped to /proxy/services (the catalog endpoint). The PR reuses it for /user-services based on the coincidence that both endpoints happen to nest under services. The new doc-comment acknowledges this — "reusing EnumerateProxyServiceItems is safe — but we accept only rows that look like UserService instances by checking presence of slug" — which is exactly the kind of "safe by coincidence" coupling that breaks silently when one of the two endpoints evolves independently at NyxID. If NyxID ever adds custom_services (org-shared overlay?) or data (paginated wrapper?) under /user-services, this enumerator picks them up with no contract check.

This violates CLAUDE.md's "single semantics per API field / one field expresses only one meaning" rule, applied at the helper level: one helper handling two endpoint shapes with no protocol contract.

Cheapest fix: rename to EnumerateServiceListItems (drop "Proxy" from the name) and inline-document that both endpoints are expected to use this nesting convention. Better fix: a dedicated EnumerateUserServiceItems that only walks services (per the actual /user-services response shape the issue body documents) so a future schema drift fails loudly. The minor duplication is preferable to silent coupling, and it matches CLAUDE.md's "correct architecture first" rule.
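
A sketch of what the dedicated enumerator might look like, assuming the response is a JSON object with a top-level `services` array as described here. The exception type and message are illustrative choices, not part of the PR.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class UserServiceEnumeratorSketch
{
    // Walks only "services". Keys such as "custom_services" or "data" are
    // ignored, and a missing or non-array "services" fails immediately.
    public static IEnumerable<JsonElement> EnumerateUserServiceItems(JsonElement root)
    {
        if (root.ValueKind != JsonValueKind.Object
            || !root.TryGetProperty("services", out var services)
            || services.ValueKind != JsonValueKind.Array)
        {
            throw new FormatException("expected an object with a 'services' array");
        }

        // No 'yield' here, so the shape check above runs eagerly at call time.
        return services.EnumerateArray();
    }
}
```

The returned elements stay valid only while the owning JsonDocument is alive, so callers should copy out the fields they need before disposing it.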

Minor

  • var isActive = TryReadBool(svc, "is_active") ?? true; — fail-open default. Defensible (the worst-case fallout is creating an api-key for an inactive service, which NyxID will reject at proxy time with a different error), but worth flipping to ?? false if NyxID's contract guarantees is_active is always present. Match server behavior; don't guess.
  • The structured error's error field uses string sentinels. For a tool wire-output that's fine, but if any upstream consumer branches on these values they should be a typed enum / proto field per CLAUDE.md's "strong typing for core semantics" rule. Not a blocker for this PR; flag for follow-up if these errors get parsed anywhere downstream.

Architecture check: clean

  • Layering respected: AgentBuilderTool (agents/) calls NyxIdApiClient (src/Aevatar.AI.ToolProviders.NyxId/). No cross-layer reverse deps introduced.
  • "Deletion first": BestEffortRevokeApiKeyAsync and dead #411 comments removed. (See concern #1: the deletion is correct in spirit but the orphan-key consequence needs handling separately.)
  • "Changes must be verifiable": tests pass, including a regression pin for the exact original bug.

Verdict

Fix #417 is correct. I'd ask for #1 (orphan revoke restoration) before merge — that one was caught and fixed once already and is regressing. #2 and #3 are improvements rather than blockers, but #2 is small enough to land in this same PR.

@eanzhao eanzhao left a comment

Core #417 direction is right: this PR switches the API-key scope resolver from the catalog list to /user-services and uses per-user UserService.id values. I found two remaining issues that can still break the flow or leave unreachable credentials behind.

// UserService rows for the same slug (legacy migration leftovers), order is
// server-determined; we don't try to pick "the best" one because the caller
// doesn't have a preference signal here.
if (resolutionsBySlug.ContainsKey(slug))

Blocking: the resolver should not freeze the first row for a slug and validate only that row. /user-services can legitimately contain multiple rows for the same catalog slug, such as an org-shared allowed:false row plus a personal active row, or inactive legacy rows plus a valid current row. With the current first-match behavior, an ineligible row that appears first returns service_org_viewer_only / service_inactive even though a valid UserService.id exists later in the payload, so /daily still fails for mixed bindings. Please collect candidates per slug and choose an eligible row (is_active == true and non-org or credential_source.allowed == true), only returning the specific inactive/org-viewer error when no eligible candidate exists. Add a regression where the invalid row appears before the valid row.

Fixed in 94db054.

bestBySlug now keeps the most eligible row per slug instead of freezing the first one — replace logic only swaps in a candidate when (a) we have no row yet, or (b) existing is ineligible AND candidate is eligible. Eligibility is encoded as ServiceResolution.IsEligible = is_active && !(org && allowed != true). Specific service_inactive / service_org_viewer_only errors are only emitted when no eligible row exists for a slug.

Regression test _PicksEligibleRow_When_DuplicateSlugRowsExist covers both invalid-before-valid orderings:

  • api-github: org-viewer (allowed:false) before personal active
  • api-lark-bot: inactive before active

Pinned that allowed_service_ids carries svc-github-personal + svc-lark-active, not the first-seen ineligible ids.
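
For reference, the replace rule described above can be sketched like this. It is an illustration built from the description in this comment, not the code in 94db054; the `ServiceRow` record and its flattened `IsOrg` / `Allowed` fields are assumptions about how the row is represented.

```csharp
using System.Collections.Generic;

// Hypothetical flattened row; the real type is ServiceResolution.
public sealed record ServiceRow(string Id, string Slug, bool IsActive, bool IsOrg, bool? Allowed)
{
    // Eligibility rule quoted above: is_active && !(org && allowed != true).
    public bool IsEligible => IsActive && !(IsOrg && Allowed != true);
}

public static class ResolverSketch
{
    public static Dictionary<string, ServiceRow> BestBySlug(IEnumerable<ServiceRow> rows)
    {
        var best = new Dictionary<string, ServiceRow>();
        foreach (var row in rows)
        {
            // Swap in when (a) there is no row yet, or (b) the existing row
            // is ineligible and the candidate is eligible.
            if (!best.TryGetValue(row.Slug, out var existing)
                || (!existing.IsEligible && row.IsEligible))
            {
                best[row.Slug] = row;
            }
        }
        return best;
    }
}
```

An eligible row is never displaced, and an ineligible row is only displaced by an eligible one, so ineligible-versus-ineligible ties keep the first row seen.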

if (preflight is not null)
{
await BestEffortRevokeApiKeyAsync(nyxClient, token, apiKeyId!, "github_preflight_failed", ct);
return preflight;

Blocking/resource lifecycle: this branch aborts after minting the NyxID API key but before any actor/catalog state owns that key. Even if the downstream GitHub credential is the failing party, the newly minted key is now unreachable from Aevatar; after the user re-authorizes GitHub it can become a usable orphan scoped to the user services. That violates the resource ownership/cleanup rule for create flows. Keep the corrected GitHub re-authorization hint, but restore best-effort key deletion for aborts that happen after CreateApiKeyAsync and before actor initialization/persistence, and flip the test to assert the DELETE is attempted.

Fixed in 94db054.

Restored BestEffortRevokeApiKeyAsync and the call in the preflight failure branch. The hint stays at the corrected GitHub re-authorization wording (no regression on what #417 fixed). Comment at the call site (AgentBuilderTool.cs:252-257) explains the new rationale: each /daily retry mints a new api-key before preflight runs, so without revoke the user accumulates one orphan key per failed retry until they re-authorize GitHub.

Test _FailsClosed_When_GithubProxyDeniedForNewKey flipped to assert the DELETE on /api/v1/api-keys/key-403 IS issued (was: assert it is NOT issued). Stub for the DELETE handler restored.

…on preflight failure

Codex review (PR #418 r3141846173, r3141846175) flagged two issues:

P1 — `ResolveProxyServiceIdsAsync` froze the first row seen per slug. If
`/user-services` returned multiple rows for the same slug (mixed personal
+ org bindings, or stale + active rows), an ineligible row arriving first
would short-circuit resolution and emit `service_org_viewer_only` /
`service_inactive` even though a valid row existed later in the response.
NyxID does not guarantee any ordering, so this was a real false-negative.

Fix: replace existing-wins dedup with "keep the most eligible row per
slug". Eligibility is `is_active && !(org && allowed != true)`. A
`ServiceResolution.IsEligible` property captures the rule. The replace
logic only swaps in a candidate when (a) we have no row yet, or (b) the
existing row is ineligible and the candidate is eligible. Ineligible-vs-
ineligible races keep first-seen so we still surface a specific error.

Added regression test
`_PicksEligibleRow_When_DuplicateSlugRowsExist` that stubs an org-viewer
row before a personal row for `api-github`, and an inactive row before an
active row for `api-lark-bot`, and pins that `allowed_service_ids` carries
the personal/active ids — not the first-seen ones.

P2 — Removing `BestEffortRevokeApiKeyAsync` was wrong. Each `/daily`
retry mints a new api-key before preflight runs, so without revoke the
user accumulates one orphan proxy-scoped key per failed retry until they
re-authorize GitHub. The api-key correctness argument from the prior
commit holds (the key itself isn't broken under #417), but operationally
the orphan accumulation matters more than the correctness signal.

Fix: restore the helper and the call site, with updated comments
explaining the new rationale (cleanup on retry, not "the key is broken").
Updated `_FailsClosed_When_GithubProxyDeniedForNewKey` to assert the
DELETE fires (was: assert it does NOT fire — flipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov Bot commented Apr 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.38%. Comparing base (cc95470) to head (d65adf5).
⚠️ Report is 4 commits behind head on dev.

@@           Coverage Diff           @@
##              dev     #418   +/-   ##
=======================================
  Coverage   70.38%   70.38%           
=======================================
  Files        1175     1175           
  Lines       84452    84453    +1     
  Branches    11124    11124           
=======================================
+ Hits        59439    59445    +6     
+ Misses      20721    20717    -4     
+ Partials     4292     4291    -1     
Flag Coverage Δ
ci 70.38% <100.00%> (+<0.01%) ⬆️

Files with missing lines Coverage Δ
...c/Aevatar.AI.ToolProviders.NyxId/NyxIdApiClient.cs 66.95% <100.00%> (+0.09%) ⬆️

... and 3 files with indirect coverage changes

@eanzhao eanzhao left a comment

Blocking after checking the local NyxID source at ~/Code/NyxID (HEAD 3279d9c95249e1ca159c66bec5754ba35281c783): the API-key create payload still needs allow_all_services = false when it sends allowed_service_ids.

NyxID contract from source:

  • CreateApiKeyRequest.allow_all_services defaults to true (backend/src/handlers/api_keys.rs:105-106).
  • create_key passes body.allowed_service_ids and body.allow_all_services straight to key_service::create_api_key (backend/src/handlers/api_keys.rs:1109-1112).
  • key_service::create_api_key validates service IDs only when !all_svcs (backend/src/services/key_service.rs:177-185).
  • proxy enforcement only checks allowed_service_ids when !auth_user.allow_all_services (backend/src/handlers/proxy.rs:1030-1033).

So with the current Aevatar payload, the key is broad and the corrected UserService.id list is ignored under the current NyxID source. That means #417 is not actually pinned by the payload, and the create flow also violates the narrow-scope boundary it is trying to enforce.
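
The contract in the bullets above can be modeled in a few lines. This is a C# model of the behavior as described, written for illustration only; it is not a port of the NyxID Rust source, and the method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ScopeModelSketch
{
    // Create path: an omitted allow_all_services is stored as true.
    public static bool StoredAllowAll(bool? requested) => requested ?? true;

    // Proxy path: the ID list is only consulted when allow_all_services is false.
    public static bool ProxyAllowed(
        bool allowAll, IEnumerable<string> allowedServiceIds, string userServiceId) =>
        allowAll || allowedServiceIds.Contains(userServiceId);
}
```

Under this model, `ProxyAllowed(StoredAllowAll(null), ids, anyId)` is true for every service, which is why a payload that omits the field yields a broad key no matter what the ID list contains.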

Please add allow_all_services = false in BuildCreateApiKeyPayload and update the tests that currently assert the field is omitted (for example AgentBuilderToolTests.cs around the TryGetProperty("allow_all_services", out _) assertions). allow_all_nodes can stay at the NyxID default unless this flow is also meant to restrict node routing.

…i-key payload

NyxID's `CreateApiKeyRequest.allow_all_services` defaults to `true`
(`backend/src/handlers/api_keys.rs:105` — `#[serde(default = "default_true")]`),
and proxy enforcement only checks `allowed_service_ids` when
`!auth_user.allow_all_services` (`proxy.rs:1030`). Omitting the field
from our payload means NyxID stored `true`, the resolved
`UserService.id` list was persisted but never consulted by enforcement,
and the new agent key had broad proxy reach.

Why production still surfaced a 403 on `/daily`: the agent path mints
child keys via the channel-relay delegation token; NyxID forces those
children to inherit `allow_all_services=false` from the parent, which
*then* makes `allowed_service_ids` matter — so the original #417 bug
(catalog id vs UserService id) was reachable. But the boundary was
brittle: any future caller minting directly under a broader parent (or
a session token) would silently bypass the narrow scope this flow tries
to enforce.

Fix: set `allow_all_services = false` explicitly in
`BuildCreateApiKeyPayload`. This also activates NyxID's `validate_service_ids`
at create-time (`key_service.rs:183`), so a malformed `UserService.id`
fails fast at `POST /api-keys` rather than passing through silently.

`allow_all_nodes` stays at the NyxID default — this flow does not
restrict node routing.

Tests: `_DispatchesInitializeAndImmediateTrigger` (line 220) and
`_SocialMedia_UpsertsWorkflowAndInitializesWorkflowAgent` (line 1750)
flipped from "field is absent" to "field is present and false".
Resolver doc comment updated to note that the narrow scope is now
first-class via the payload, not contingent on parent inheritance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

eanzhao commented Apr 25, 2026

Fixed in d65adf5.

Reviewer is right — verified independently against ~/Code/NyxID HEAD 3279d9c9:

  • backend/src/handlers/api_keys.rs:105: allow_all_services is #[serde(default = "default_true")], so omission = true.
  • backend/src/handlers/proxy.rs:1030: enforcement only fires when !auth_user.allow_all_services.
  • backend/src/services/key_service.rs:179,183: same default in the create path, and validate_service_ids only runs under narrow scope.

The 403 we saw in production reproduced because the channel-relay delegation token forces children to inherit allow_all_services=false. That's not a contract we should depend on — any future caller minting directly under a broader parent (session token, etc.) would silently bypass the narrow scope this flow is supposed to enforce, and the resolver work in this PR would be a no-op.

Changes:

  • BuildCreateApiKeyPayload now sets allow_all_services = false explicitly. Side benefit: this also turns on NyxID's validate_service_ids at create-time (key_service.rs:183), so a malformed UserService.id fails fast at POST /api-keys instead of silently passing through and 403'ing every later proxy call.
  • allow_all_nodes left at the NyxID default per your note — this flow doesn't restrict node routing.
  • Tests at AgentBuilderToolTests.cs:220 and :1750 flipped from TryGetProperty(...).Should().BeFalse() (field absent) to GetProperty("allow_all_services").GetBoolean().Should().BeFalse() (field present and false).
  • Resolver doc comment updated to note the narrow scope is now first-class via the payload, not contingent on parent inheritance.

421/421 tests passing.

@eanzhao eanzhao left a comment

Re-reviewed latest head d65adf5755646229a6d6330c1b6692d51c0b3643 against the local NyxID source at ~/Code/NyxID (3279d9c95249e1ca159c66bec5754ba35281c783). The code-level blockers are addressed:

  • BuildCreateApiKeyPayload now sends allow_all_services = false, so NyxID validates and enforces the allowed_service_ids list instead of defaulting to broad service access.
  • ResolveProxyServiceIdsAsync now prefers an eligible UserService.id when duplicate slugs are returned, and only emits inactive/org-viewer errors when no eligible row exists.
  • The GitHub preflight failure path restores best-effort key deletion after the key is minted but before any actor/catalog state owns it.

Local verification: dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo passes 421/421.

Non-blocking cleanup before merge: the PR description is stale. It still says BestEffortRevokeApiKeyAsync was retired and that the preflight test asserts the key is not revoked, and the test plan still says 420/420; latest code restored revoke and tests are 421/421. Please update the PR body so the merge record matches the final behavior.

@eanzhao eanzhao merged commit f5cc284 into dev Apr 25, 2026
12 checks passed

@eanzhao eanzhao left a comment

Follow-up after checking production-style logs from the deployed fix: the new code path is active (GET /user-services -> POST /api-keys -> GET /proxy/s/api-github/rate_limit -> 403 -> DELETE /api-keys/{id}), but /daily can still fail before actor creation.

The remaining architectural gap is that PR #418 scopes the child key to one or more resolved UserService.id values, but all proxy calls still go through the slug route (/api/v1/proxy/s/api-github/... and later State.OutboundConfig.NyxProviderSlug) without pinning the same UserService.id via NyxID's ?_nyxid_via=<user_service_id> override.

NyxID source has that override specifically because slug resolution is not a stable identity contract:

  • backend/src/handlers/proxy.rs:319-370 / :531-584 accept ?_nyxid_via=<user_service_id> and then execute with that exact UserService.id.
  • Without _nyxid_via, the slug route calls resolve_proxy_target_from_user_service(..., Some(slug), None) and resolves again at proxy time.
  • backend/src/services/user_service_service.rs:173-182 find_by_slug is a find_one over active rows with no stable sort. If multiple active rows exist for the same slug, the proxy can choose a different UserService.id than the one Aevatar put into allowed_service_ids, and NyxID then correctly returns api_key_scope_forbidden at proxy.rs:1030-1034.

So the current fix proves the create payload uses UserService.id, but it still does not bind later proxy execution to the same service identity. The production log's 403 could be a real GitHub OAuth 403, but the current Aevatar code collapses any proxy 401/403 into github_proxy_access_denied and does not log the NyxID response body, so it cannot distinguish GitHub Bad credentials from NyxID api_key_scope_forbidden / org_role_insufficient.
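
One way to separate the two failure classes is sketched below. The body shape is an assumption: the sketch guesses that NyxID reports its own denials as a JSON object with a string `error` field carrying `api_key_scope_forbidden` or `org_role_insufficient`, which this thread does not confirm. Anything else is treated as a downstream (GitHub) denial.

```csharp
using System.Text.Json;

public enum ProxyDenial { NyxIdScope, Downstream }

public static class ProxyDenialSketch
{
    // Classifies a 401/403 proxy response body. The "error" field name is an
    // assumed shape, not a documented NyxID contract.
    public static ProxyDenial Classify(string body)
    {
        try
        {
            using var doc = JsonDocument.Parse(body);
            if (doc.RootElement.ValueKind == JsonValueKind.Object
                && doc.RootElement.TryGetProperty("error", out var error)
                && error.ValueKind == JsonValueKind.String)
            {
                var code = error.GetString();
                if (code == "api_key_scope_forbidden" || code == "org_role_insufficient")
                    return ProxyDenial.NyxIdScope;
            }
        }
        catch (JsonException)
        {
            // Non-JSON bodies fall through to the downstream bucket.
        }
        return ProxyDenial.Downstream;
    }
}
```

With a split like this, the re-authorize-GitHub hint would only be shown for the downstream bucket.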

Suggested fix before treating #417 as closed:

  1. Resolve service bindings as a typed slug -> UserService.id map, not just a flat ID list.
  2. Use that map to append _nyxid_via=<id> on the GitHub preflight call and on every runtime proxy call made with the scoped child key. Persist the map in typed proto fields on the relevant agent config/catalog state rather than relying on slug-only routing.
  3. In PreflightGitHubProxyAsync, parse proxy_body and return/log distinct NyxID-scope errors separately from downstream GitHub OAuth errors. Otherwise the user gets told to re-authorize GitHub when the actual failure is still service-id scoping.
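
Steps 1 and 2 could be sketched as follows. The helper name and the in-memory dictionary are hypothetical; the suggestion above is to persist the map in typed proto fields, which this sketch does not attempt. Only the `_nyxid_via` parameter name comes from the NyxID source cited above.

```csharp
using System;
using System.Collections.Generic;

public static class ViaSketch
{
    // Appends _nyxid_via=<UserService.id> so the proxy executes against the
    // same row that was placed in allowed_service_ids.
    public static string WithVia(
        string proxyPath, string slug, IReadOnlyDictionary<string, string> idsBySlug)
    {
        if (!idsBySlug.TryGetValue(slug, out var id))
            throw new KeyNotFoundException($"no UserService.id resolved for slug '{slug}'");

        var separator = proxyPath.Contains('?') ? '&' : '?';
        return $"{proxyPath}{separator}_nyxid_via={Uri.EscapeDataString(id)}";
    }
}
```

For example, with `api-github` mapped to `svc-github-personal`, the preflight path `/api/v1/proxy/s/api-github/rate_limit` becomes `/api/v1/proxy/s/api-github/rate_limit?_nyxid_via=svc-github-personal`.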

/agents silence in the same logs looks separate: channel-relay/reply returns 502 for the interactive card payload before any agent-builder create path runs, and the relay token is single-use so the text fallback is intentionally not retried. That should be tracked outside #418 unless this PR is expected to cover the card payload issue too.
