Conversation
…llowed_service_ids (#417) The agent-create flow minted child API keys whose `allowed_service_ids` carried `DownstreamService.id` (catalog UUIDs) sourced from `GET /proxy/services`. NyxID proxy enforcement (proxy.rs:1030) compares the API key's `allowed_service_ids` against `UserService.id` (per-user instance UUIDs), so every proxied call returned 403 `ApiKeyScopeForbidden`. Session-token-minted keys default to `allow_all_services=true`, which short-circuits the check — that's why the bug was invisible in CLI repros and only surfaced for agent-path keys (which inherit `allow_all_services=false` from the channel-relay parent). The fix routes through `GET /api/v1/user-services` and populates `allowed_service_ids` with each row's per-user `id`, plus filters out inactive rows and org-shared rows the caller cannot proxy through (the latter would otherwise resurface as the less-actionable `org_role_insufficient`). Failure modes return structured error JSON with stable error keys (`service_not_connected`, `service_inactive`, `service_org_viewer_only`) instead of the previous free-text wrap. Also retires `BestEffortRevokeApiKeyAsync` and rewords the GitHub preflight hint: under #411 we revoked the freshly-minted key on a preflight 403 because we mis-attributed the failure to the api-key. With #417 fixed, preflight only fires on actual GitHub-side issues (OAuth grant revoked, scopes downgraded, account temp-banned) — the api-key is healthy at that point. Hint now points users at re-authorizing the GitHub provider, not at api-key bindings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 65b2d1c18a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (resolutionsBySlug.ContainsKey(slug)) | ||
| continue; |
There was a problem hiding this comment.
Prefer an eligible service row when duplicate slugs exist
In ResolveProxyServiceIdsAsync, the first row seen for a slug is frozen as the only candidate. If /user-services returns multiple rows for the same slug (for example an org-shared allowed:false row before a personal active row), this path will emit service_org_viewer_only/service_inactive and abort even though a valid service is available later in the payload. That creates false-negative agent creation failures for users with mixed service bindings.
Useful? React with 👍 / 👎.
| if (preflight is not null) | ||
| { | ||
| await BestEffortRevokeApiKeyAsync(nyxClient, token, apiKeyId!, "github_preflight_failed", ct); | ||
| return preflight; |
There was a problem hiding this comment.
Revoke API keys when GitHub preflight aborts agent creation
After the key is minted, the preflight failure branch now returns immediately without deleting that key. In the revoked/downgraded GitHub-token scenario, each /daily retry will create another unused NyxID API key while still failing before actor creation, causing unbounded orphan key accumulation and stale credential surface area.
Useful? React with 👍 / 👎.
Review: does this fix #417, and is it architecturally sound?TL;DR — fix is correct, but ship with two outstanding concerns Codex already flagged plus one additional one below. ✅ Does this solve #417?Yes. Root cause is correctly identified and the fix is targeted:
|
eanzhao
left a comment
There was a problem hiding this comment.
Core #417 direction is right: this PR switches the API-key scope resolver from the catalog list to /user-services and uses per-user UserService.id values. I found two remaining issues that can still break the flow or leave unreachable credentials behind.
| // UserService rows for the same slug (legacy migration leftovers), order is | ||
| // server-determined; we don't try to pick "the best" one because the caller | ||
| // doesn't have a preference signal here. | ||
| if (resolutionsBySlug.ContainsKey(slug)) |
There was a problem hiding this comment.
Blocking: the resolver should not freeze the first row for a slug and validate only that row. /user-services can legitimately contain multiple rows for the same catalog slug, such as an org-shared allowed:false row plus a personal active row, or inactive legacy rows plus a valid current row. With the current first-match behavior, an ineligible row that appears first returns service_org_viewer_only / service_inactive even though a valid UserService.id exists later in the payload, so /daily still fails for mixed bindings. Please collect candidates per slug and choose an eligible row (is_active == true and non-org or credential_source.allowed == true), only returning the specific inactive/org-viewer error when no eligible candidate exists. Add a regression where the invalid row appears before the valid row.
There was a problem hiding this comment.
Fixed in 94db054.
bestBySlug now keeps the most eligible row per slug instead of freezing the first one — replace logic only swaps in a candidate when (a) we have no row yet, or (b) existing is ineligible AND candidate is eligible. Eligibility is encoded as ServiceResolution.IsEligible = is_active && !(org && allowed != true). Specific service_inactive / service_org_viewer_only errors are only emitted when no eligible row exists for a slug.
Regression test _PicksEligibleRow_When_DuplicateSlugRowsExist covers both invalid-before-valid orderings:
api-github: org-viewer (allowed:false) before personal activeapi-lark-bot: inactive before active
Pinned that allowed_service_ids carries svc-github-personal + svc-lark-active, not the first-seen ineligible ids.
| if (preflight is not null) | ||
| { | ||
| await BestEffortRevokeApiKeyAsync(nyxClient, token, apiKeyId!, "github_preflight_failed", ct); | ||
| return preflight; |
There was a problem hiding this comment.
Blocking/resource lifecycle: this branch aborts after minting the NyxID API key but before any actor/catalog state owns that key. Even if the downstream GitHub credential is the failing party, the newly minted key is now unreachable from Aevatar; after the user re-authorizes GitHub it can become a usable orphan scoped to the user services. That violates the resource ownership/cleanup rule for create flows. Keep the corrected GitHub re-authorization hint, but restore best-effort key deletion for aborts that happen after CreateApiKeyAsync and before actor initialization/persistence, and flip the test to assert the DELETE is attempted.
There was a problem hiding this comment.
Fixed in 94db054.
Restored BestEffortRevokeApiKeyAsync and the call in the preflight failure branch. The hint stays at the corrected GitHub re-authorization wording (no regression on what #417 fixed). Comment at the call site (AgentBuilderTool.cs:252-257) explains the new rationale: each /daily retry mints a new api-key before preflight runs, so without revoke the user accumulates one orphan key per failed retry until they re-authorize GitHub.
Test _FailsClosed_When_GithubProxyDeniedForNewKey flipped to assert the DELETE on /api/v1/api-keys/key-403 IS issued (was: assert it is NOT issued). Stub for the DELETE handler restored.
…on preflight failure Codex review (PR #418 r3141846173, r3141846175) flagged two issues: P1 — `ResolveProxyServiceIdsAsync` froze the first row seen per slug. If `/user-services` returned multiple rows for the same slug (mixed personal + org bindings, or stale + active rows), an ineligible row arriving first would short-circuit resolution and emit `service_org_viewer_only` / `service_inactive` even though a valid row existed later in the response. NyxID does not guarantee any ordering, so this was a real false-negative. Fix: replace existing-wins dedup with "keep the most eligible row per slug". Eligibility is `is_active && !(org && allowed != true)`. A `ServiceResolution.IsEligible` property captures the rule. The replace logic only swaps in a candidate when (a) we have no row yet, or (b) the existing row is ineligible and the candidate is eligible. Ineligible-vs- ineligible races keep first-seen so we still surface a specific error. Added regression test `_PicksEligibleRow_When_DuplicateSlugRowsExist` that stubs an org-viewer row before a personal row for `api-github`, and an inactive row before an active row for `api-lark-bot`, and pins that `allowed_service_ids` carries the personal/active ids — not the first-seen ones. P2 — Removing `BestEffortRevokeApiKeyAsync` was wrong. Each `/daily` retry mints a new api-key before preflight runs, so without revoke the user accumulates one orphan proxy-scoped key per failed retry until they re-authorize GitHub. The api-key correctness argument from the prior commit holds (the key itself isn't broken under #417), but operationally the orphan accumulation matters more than the correctness signal. Fix: restore the helper and the call site, with updated comments explaining the new rationale (cleanup on retry, not "the key is broken"). Updated `_FailsClosed_When_GithubProxyDeniedForNewKey` to assert the DELETE fires (was: assert it does NOT fire — flipped). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. @@ Coverage Diff @@
## dev #418 +/- ##
=======================================
Coverage 70.38% 70.38%
=======================================
Files 1175 1175
Lines 84452 84453 +1
Branches 11124 11124
=======================================
+ Hits 59439 59445 +6
+ Misses 20721 20717 -4
+ Partials 4292 4291 -1
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 3 files with indirect coverage changes 🚀 New features to boost your workflow:
|
eanzhao
left a comment
There was a problem hiding this comment.
Blocking after checking the local NyxID source at ~/Code/NyxID (HEAD 3279d9c95249e1ca159c66bec5754ba35281c783): the API-key create payload still needs allow_all_services = false when it sends allowed_service_ids.
NyxID contract from source:
CreateApiKeyRequest.allow_all_servicesdefaults totrue(backend/src/handlers/api_keys.rs:105-106).create_keypassesbody.allowed_service_idsandbody.allow_all_servicesstraight tokey_service::create_api_key(backend/src/handlers/api_keys.rs:1109-1112).key_service::create_api_keyvalidates service IDs only when!all_svcs(backend/src/services/key_service.rs:177-185).- proxy enforcement only checks
allowed_service_idswhen!auth_user.allow_all_services(backend/src/handlers/proxy.rs:1030-1033).
So with the current Aevatar payload, the key is broad and the corrected UserService.id list is ignored under the current NyxID source. That means #417 is not actually pinned by the payload, and the create flow also violates the narrow-scope boundary it is trying to enforce.
Please add allow_all_services = false in BuildCreateApiKeyPayload and update the tests that currently assert the field is omitted (for example AgentBuilderToolTests.cs around the TryGetProperty("allow_all_services", out _) assertions). allow_all_nodes can stay at the NyxID default unless this flow is also meant to restrict node routing.
…i-key payload NyxID's `CreateApiKeyRequest.allow_all_services` defaults to `true` (`backend/src/handlers/api_keys.rs:105` — `#[serde(default = "default_true")]`), and proxy enforcement only checks `allowed_service_ids` when `!auth_user.allow_all_services` (`proxy.rs:1030`). Omitting the field from our payload means NyxID stored `true`, the resolved `UserService.id` list was persisted but never consulted by enforcement, and the new agent key had broad proxy reach. Why production still surfaced a 403 on `/daily`: the agent path mints child keys via the channel-relay delegation token; NyxID forces those children to inherit `allow_all_services=false` from the parent, which *then* makes `allowed_service_ids` matter — so the original #417 bug (catalog id vs UserService id) was reachable. But the boundary was brittle: any future caller minting directly under a broader parent (or a session token) would silently bypass the narrow scope this flow tries to enforce. Fix: set `allow_all_services = false` explicitly in `BuildCreateApiKeyPayload`. This also activates NyxID's `validate_service_ids` at create-time (`key_service.rs:183`), so a malformed `UserService.id` fails fast at `POST /api-keys` rather than passing through silently. `allow_all_nodes` stays at the NyxID default — this flow does not restrict node routing. Tests: `_DispatchesInitializeAndImmediateTrigger` (line 220) and `_SocialMedia_UpsertsWorkflowAndInitializesWorkflowAgent` (line 1750) flipped from "field is absent" to "field is present and false". Resolver doc comment updated to note that the narrow scope is now first-class via the payload, not contingent on parent inheritance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Fixed in d65adf5. Reviewer is right — verified independently against
The 403 we saw in production reproduced because the channel-relay delegation token forces children to inherit Changes:
421/421 tests passing. |
eanzhao
left a comment
There was a problem hiding this comment.
Re-reviewed latest head d65adf5755646229a6d6330c1b6692d51c0b3643 against the local NyxID source at ~/Code/NyxID (3279d9c95249e1ca159c66bec5754ba35281c783). The code-level blockers are addressed:
BuildCreateApiKeyPayloadnow sendsallow_all_services = false, so NyxID validates and enforces theallowed_service_idslist instead of defaulting to broad service access.ResolveProxyServiceIdsAsyncnow prefers an eligibleUserService.idwhen duplicate slugs are returned, and only emits inactive/org-viewer errors when no eligible row exists.- The GitHub preflight failure path restores best-effort key deletion after the key is minted but before any actor/catalog state owns it.
Local verification: dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo passes 421/421.
Non-blocking cleanup before merge: the PR description is stale. It still says BestEffortRevokeApiKeyAsync was retired and that the preflight test asserts the key is not revoked, and the test plan still says 420/420; latest code restored revoke and tests are 421/421. Please update the PR body so the merge record matches the final behavior.
eanzhao
left a comment
There was a problem hiding this comment.
Follow-up after checking production-style logs from the deployed fix: the new code path is active (GET /user-services -> POST /api-keys -> GET /proxy/s/api-github/rate_limit -> 403 -> DELETE /api-keys/{id}), but /daily can still fail before actor creation.
The remaining architectural gap is that PR #418 scopes the child key to one or more resolved UserService.id values, but all proxy calls still go through the slug route (/api/v1/proxy/s/api-github/... and later State.OutboundConfig.NyxProviderSlug) without pinning the same UserService.id via NyxID's ?_nyxid_via=<user_service_id> override.
NyxID source has that override specifically because slug resolution is not a stable identity contract:
backend/src/handlers/proxy.rs:319-370/:531-584accept?_nyxid_via=<user_service_id>and then execute with that exactUserService.id.- Without
_nyxid_via, the slug route callsresolve_proxy_target_from_user_service(..., Some(slug), None)and resolves again at proxy time. backend/src/services/user_service_service.rs:173-182find_by_slugis afind_oneover active rows with no stable sort. If multiple active rows exist for the same slug, the proxy can choose a differentUserService.idthan the one Aevatar put intoallowed_service_ids, and NyxID then correctly returnsapi_key_scope_forbiddenatproxy.rs:1030-1034.
So the current fix proves the create payload uses UserService.id, but it still does not bind later proxy execution to the same service identity. The production log's 403 could be a real GitHub OAuth 403, but the current Aevatar code collapses any proxy 401/403 into github_proxy_access_denied and does not log the NyxID response body, so it cannot distinguish GitHub Bad credentials from NyxID api_key_scope_forbidden / org_role_insufficient.
Suggested fix before treating #417 as closed:
- Resolve service bindings as a typed slug ->
UserService.idmap, not just a flat ID list. - Use that map to append
_nyxid_via=<id>on the GitHub preflight call and on every runtime proxy call made with the scoped child key. Persist the map in typed proto fields on the relevant agent config/catalog state rather than relying on slug-only routing. - In
PreflightGitHubProxyAsync, parseproxy_bodyand return/log distinct NyxID-scope errors separately from downstream GitHub OAuth errors. Otherwise the user gets told to re-authorize GitHub when the actual failure is still service-id scoping.
/agents silence in the same logs looks separate: channel-relay/reply returns 502 for the interactive card payload before any agent-builder create path runs, and the relay token is single-use so the text fallback is intentionally not retried. That should be tracked outside #418 unless this PR is expected to cover the card payload issue too.
Summary
/daily(and anyagent_builderflow that mints a child NyxID API key) consistently fails withgithub_proxy_access_deniedeven when the user's GitHub OAuth is healthy. Root cause is that we populate the new key'sallowed_service_idswith catalogDownstreamService.idvalues, but NyxID's proxy enforcement (proxy.rs:1030) compares against per-userUserService.id. The mismatch is silently accepted at create-time and surfaces as 403ApiKeyScopeForbiddenon every proxy call.GET /api/v1/user-servicesand populatesallowed_service_idswith each per-userUserService.id. When the same slug has multiple rows (mixed personal + org bindings, stale rows, etc.), prefers the most eligible one — only emitsservice_inactive/service_org_viewer_onlyerrors when no eligible row exists.allow_all_services = falsein the api-key payload so NyxID's enforcement actually consultsallowed_service_ids(the field defaults totrue, which short-circuits enforcement and silently grants broad scope; see "Why this was invisible" below).BestEffortRevokeApiKeyAsync: the GitHub preflight failure path still revokes the freshly minted key so retries don't accumulate orphans. Hint reworded to point at re-authorizing the GitHub provider, not at api-key bindings (the previous hint was based on the misdiagnosis that SkillRunner daily_report GitHub proxy 403s at runtime #411 fixed).Why this was invisible until production
Session-token-minted API keys default to
allow_all_services=true, which short-circuits the enforcement check (NyxIDproxy.rs:1030). A developer reproducingPOST /api-keys+/proxy/s/api-github/rate_limitfrom a CLI never tripped the bug. The agent path mints child keys via the channel-relay delegation token; NyxID forces those children to inheritallow_all_services=falsefrom the parent, which is what activates the enforcement check and makes the ID mismatch fatal. See #417 review comment for the full repro matrix.This PR also makes the narrow scope first-class —
BuildCreateApiKeyPayloadsendsallow_all_services = falseexplicitly, so the resolver's output is enforced regardless of what the parent's setting happens to be. As a side benefit this also turns on NyxID'svalidate_service_idsat create-time (key_service.rs:183), so a malformedUserService.idfails fast atPOST /api-keysinstead of silently passing through and 403'ing every later proxy call.Changes
src/Aevatar.AI.ToolProviders.NyxId/NyxIdApiClient.cs— addListUserServicesAsync(GET /api/v1/user-services).agents/Aevatar.GAgents.ChannelRuntime/AgentBuilderTool.cs:ResolveProxyServiceIdsAsyncto use/user-servicesand return per-userUserService.id. When the same slug has duplicate rows, prefer the most eligible (is_active && !(org && allowed != true)) instead of freezing the first row seen. Specificservice_inactive/service_org_viewer_onlyerrors are only emitted when no eligible row exists for a slug.errorkeys (service_not_connected,service_inactive,service_org_viewer_only,user_services_unavailable,user_services_parse_failed,no_required_slugs). Both call sites return the envelope verbatim instead of double-wrapping.BuildCreateApiKeyPayloadaddsallow_all_services = false(per review #4175529548).allow_all_nodesleft at the NyxID default — this flow does not restrict node routing.PreflightGitHubProxyAsyncdoc comment + user-facinghintrewritten to point at re-authorizing the GitHub provider at NyxID, not at api-key bindings.BestEffortRevokeApiKeyAsyncretained for the preflight failure path so retries don't accumulate orphan keys.test/Aevatar.GAgents.ChannelRuntime.Tests/AgentBuilderToolTests.cs:/proxy/services?per_page=100to/user-serviceswith the new row shape (id,slug,is_active,credential_source)._FailsClosed_When_RequiredProxyServices_AreMissingto assert the structuredservice_not_connectedenvelope instead of free-text._FailsClosed_When_GithubProxyDeniedForNewKeyto pin the new hint wording AND that the api-key IS revoked (DELETE on/api/v1/api-keys/key-403).allow_all_servicesassertions flipped fromTryGetProperty(...).Should().BeFalse()(field absent) toGetProperty(...).GetBoolean().Should().BeFalse()(field present andfalse)._FailsClosed_When_RequiredSlug_IsInactive—is_active: falserow →service_inactive._FailsClosed_When_OrgSharedSlug_IsViewerOnly— org-sharedallowed: false→service_org_viewer_only._AllowedServiceIds_AreUserServiceIds_NotCatalogIds— regression pin: stubiddistinct fromcatalog_service_id, assert the api-key payload carries the per-userid._PicksEligibleRow_When_DuplicateSlugRowsExist— duplicate slug rows with the ineligible one first; assert the resolver still picks the eligible one.Test plan
dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj— 421/421 passing.bash tools/ci/architecture_guards.sh— clean (only flag is missingbuffor proto lint, unrelated)./daily alicefrom a Lark DM should now succeed end-to-end with a healthy GitHub OAuth connection. With a revoked GitHub token the response should be thegithub_proxy_access_deniederror pointing at re-authorization (not at api-key bindings).Follow-ups
allowed_service_idsaccept catalogDownstreamService.idorslugas aliases. Would prevent future SDK consumers from repeating this footgun. Not required if this fix lands./agents502 is tracked separately at fix(lark): malformed v2 card schema causes nyx-api 502, /agents silently drops reply #416.