feat(catalog): seed unauthenticated public APIs (arXiv, OpenAlex, Crossref)#592
AlyciaBHZ wants to merge 1 commit into ChronoAIProject:main
…ssref)

Adds DEFAULT_PUBLIC_SERVICE_SEEDS + a parallel seed loop in seed_default_services for catalog entries that don't bind to any ProviderConfig. Resulting DownstreamService rows have:
- provider_config_id: None
- auth_method: "none"
- requires_user_credential: false
- no ServiceProviderRequirement

build_catalog_entry already tolerates `provider: None` and emits `requires_credential: false`, so these surface in the AI Services dialog as one-click no-auth services. The proxy injects nothing; the benefit is centralised audit logging and a single place to manage polite-pool / rate-limit headers across agents that hit the same public source.

Three initial seeds:
- `arxiv-api` (http://export.arxiv.org/api): Atom feed search/metadata
- `api-openalex` (https://api.openalex.org): 240M+ scholarly works graph
- `api-crossref` (https://api.crossref.org): DOI metadata + citations

Each description includes the polite-pool convention so agents can discover it from `nyxid catalog show <slug>` without leaving NyxID.

Tests:
- `public_service_seeds_have_unique_slugs_and_no_collision_with_default_seeds`
- `arxiv_public_seed_is_present_and_unauthenticated`

Motivation: agents working on academic / open-problem domains (e.g. literature staleness checks against erdosproblems / RESEARCH_BOARD targets, citation graph mining) need these sources first-class. Today they have to use `service add --custom` per machine and lose the audit trail. Seeding them in catalog gives one-line `nyxid service add arxiv-api` everywhere.
edb9f9d to d1934e0
kaiweijw left a comment
Direction makes sense — having arxiv-api / api-openalex / api-crossref as one-liner catalog entries is a real ergonomic win over service add --custom, and the is_truly_no_auth path in unified_key_service.rs:523 was already designed for exactly this shape. Code is clean, additive, idempotent, CI green, builds and tests pass locally for me too.
Three things I'd like fixed before this lands. Two are correctness, one is a description-only fix that affects how this gets evaluated.
1. arXiv URL should be https://, not http://
http://export.arxiv.org/api leaks every search_query=... to any on-path observer. arXiv supports https on the same host. Since the headline benefit framing is observability/audit, sending the request itself in the clear undercuts the value prop. One-character fix:
`base_url: "https://export.arxiv.org/api",`
2. The "AI Services dialog" claim is wrong as written; pick a fix
These seeds will not appear in the web /keys AI Services dialog as currently filtered. frontend/src/hooks/use-keys.ts:36 (useCatalog) calls /catalog, which routes to catalog_service::list_catalog (backend/src/services/catalog_service.rs:217), whose $or requires:
requires_user_credential: true
OR requires_user_credential: { $exists: false }
OR provider_config_id: { $ne: null }
The new seeds set requires_user_credential: false AND provider_config_id: None, so they're excluded. They're only reachable via:
- `/catalog?include_all=true` (CLI wizard, `catalog-grid.tsx:138`)
- `/catalog/{slug}` direct lookup (which is what `nyxid service add arxiv-api` uses, i.e. the path you actually tested)
Two options:
- (a) Make the claim true. Add a fourth `$or` clause to `list_catalog`: `{ "auth_method": "none", "service_category": "internal", "provider_config_id": null }`. I checked: all 28 existing provider-backed seeds still match via the existing `provider_config_id != null` clause, so no double-counting. Add a unit test that calls `list_catalog` against a seeded fixture and asserts `arxiv-api` appears.
- (b) Drop the claim. Edit the PR description to say these surface via `nyxid catalog list --all` and `nyxid service add <slug>`, not the web dialog.
I'd prefer (a) since it makes the feature actually do what the description promises.
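To make the exclusion concrete, here is a minimal in-memory sketch of the filter logic described above. It is not the real `list_catalog` query: `ServiceRow`, both predicate functions, and the two sample rows are hypothetical stand-ins, and the provider-backed row uses made-up values. It only models the three existing `$or` clauses and the proposed fourth clause from option (a).

```rust
/// Hypothetical, simplified stand-in for the fields of a catalog row that
/// the filter reads. This is not the real DownstreamService type.
#[allow(dead_code)]
#[derive(Debug)]
struct ServiceRow {
    slug: &'static str,
    /// `None` models a stored document where the field does not exist.
    requires_user_credential: Option<bool>,
    provider_config_id: Option<&'static str>,
    auth_method: &'static str,
    service_category: &'static str,
}

/// Models the three existing `$or` clauses listed in the review.
fn matches_current_filter(row: &ServiceRow) -> bool {
    row.requires_user_credential == Some(true)
        || row.requires_user_credential.is_none()
        || row.provider_config_id.is_some()
}

/// Models option (a): the existing clauses plus the proposed fourth clause.
fn matches_with_public_clause(row: &ServiceRow) -> bool {
    matches_current_filter(row)
        || (row.auth_method == "none"
            && row.service_category == "internal"
            && row.provider_config_id.is_none())
}

/// A row shaped like the new arXiv seed, using the field values from the PR.
fn arxiv_seed() -> ServiceRow {
    ServiceRow {
        slug: "arxiv-api",
        requires_user_credential: Some(false),
        provider_config_id: None,
        auth_method: "none",
        service_category: "internal",
    }
}

/// A provider-backed row; every value here is illustrative only.
fn provider_backed_example() -> ServiceRow {
    ServiceRow {
        slug: "provider-backed-example",
        requires_user_credential: Some(true),
        provider_config_id: Some("hypothetical-provider-id"),
        auth_method: "bearer",
        service_category: "internal",
    }
}
```

Under these assumptions, `arxiv_seed()` fails `matches_current_filter` and passes `matches_with_public_clause`, while the provider-backed row passes both through the existing clauses.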
3. The audit-trail framing in the PR description is incorrect
> currently via a per-machine `service add --custom` which loses the audit trail
This isn't true. --custom services are routed through the same /api/v1/proxy/{service_id}/{path} handler, which calls audit_personal_routing / audit_org_routing on every request (handlers/proxy.rs:179, 217). Custom and catalog-seeded services produce identical audit entries. The audit benefit only holds vs. raw curl bypassing NyxID entirely — which is a real benefit, just a different one.
Suggest reframing the motivation as: (i) ergonomics — no per-user --base-url boilerplate; (ii) single-source-of-truth for the URL/description; (iii) admin-managed default_request_headers propagating polite-pool config to every agent without per-machine setup. The polite-pool argument is the strongest one and is well-suited to default_request_headers — though note the current PR sets default_request_headers: None, so an admin still needs to populate it post-seed. Worth either adding a comment about this in the seed table or filing a follow-up.
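As an illustration of point (iii), the sketch below shows one way admin-set default headers could be folded into an outgoing request. Both functions are hypothetical and are not NyxID code. The precedence rule (a header the caller already set wins over the default) is an assumption for the example, and `None` stands for the `default_request_headers: None` state the PR currently seeds.

```rust
use std::collections::HashMap;

/// Hypothetical helper: builds a polite-pool style User-Agent value in the
/// `app/version (mailto:contact)` shape.
fn polite_user_agent(app: &str, version: &str, contact: &str) -> String {
    format!("{}/{} (mailto:{})", app, version, contact)
}

/// Hypothetical merge step: copies each default header into the request
/// unless the caller already set it. Assumes the request map uses
/// lowercase header names as keys.
fn apply_default_headers(
    defaults: Option<&HashMap<String, String>>,
    request: &mut HashMap<String, String>,
) {
    if let Some(defaults) = defaults {
        for (name, value) in defaults {
            request
                .entry(name.to_ascii_lowercase())
                .or_insert_with(|| value.clone());
        }
    }
}
```

With `defaults` set to `None`, the request map is left untouched, which matches the point that an admin still has to populate the headers after seeding.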
Smaller, optional
- The `service_category: "internal"` / `created_by: "system"` / `visibility: "public"` choices follow existing convention but aren't called out; one comment line in the seed struct would help future maintainers reason about it.
- Parallel seed table vs threading `Option<&str>` through `DefaultServiceSeed`: fine choice, your justification in the PR holds. Just be aware that future capability/header backfill mechanisms will need a parallel path too.
Happy to approve once #1 and #2 (either option) are in. #3 is description-only.
Summary
Adds three first-class catalog entries for unauthenticated public academic APIs:
- `arxiv-api` (http://export.arxiv.org/api): Atom feed search/metadata, 2.5M+ papers
- `api-openalex` (https://api.openalex.org): 240M+ scholarly works, authors, citations
- `api-crossref` (https://api.crossref.org): DOI metadata + citation graph (~150M works)

These have no `ProviderConfig` to bind to. The implementation introduces a parallel `DEFAULT_PUBLIC_SERVICE_SEEDS` table and a second seed loop in `seed_default_services` that produces `DownstreamService` rows with `provider_config_id: None`, `auth_method: "none"`, `requires_user_credential: false`, and no `ServiceProviderRequirement`. `build_catalog_entry` already tolerates `provider: None` and returns `requires_credential: false`, so these surface in the AI Services dialog as one-click no-auth services. Verified locally: `nyxid service add arxiv-api` becomes a one-shot operation, no `--custom` boilerplate required.

Why route public APIs through NyxID at all?
The proxy injects nothing on these calls. The benefit is operational:
- … `User-Agent: app/version (mailto:you@example.com)` header for higher rate limits, an admin can set it once via `service update --default-header` and every agent honours it without code changes.
- `nyxid catalog show arxiv-api` returns a description that explains the no-auth policy and the polite-pool convention, so agents working on a new academic-data task discover the source from inside NyxID.

Motivation
I'm using NyxID to broker external APIs for an outreach pipeline that targets open mathematical conjectures (https://github.com/the-omega-institute/automath). The pipeline routinely hits arXiv to scan recent math.NT/CO/AG/DS papers against a research board, currently via a per-machine `service add --custom` which loses the audit trail and doesn't propagate the polite-pool header to other agents that need the same source. Seeding these as catalog entries makes them as ergonomic as `api-github` or `api-reddit` already are.

The same argument extends to any agent doing literature work, citation mining, or paper deduplication.
Implementation notes
- … `DEFAULT_SERVICE_SEEDS` rather than threading `Option<&str>` through the existing `provider_slug` field. This isolates the no-provider path from the existing 28 provider-backed seeds, so the audit / SPR / token-exchange logic stays unchanged for the credentialled cases.
- … `DEFAULT_SERVICE_SEEDS`.
- `DownstreamService` doesn't have a separate `documentation_url` field; happy to follow up with a separate PR adding that field if desirable.

Test plan
- `cargo check -p nyxid` clean
- `cargo fmt -p nyxid` clean
- `cargo clippy -p nyxid` clean
- `cargo test -p nyxid services::provider_service::tests::public_service_seeds_have_unique_slugs_and_no_collision_with_default_seeds` ✅
- `cargo test -p nyxid services::provider_service::tests::arxiv_public_seed_is_present_and_unauthenticated` ✅
- `nyxid service add arxiv-api` (after deleting prior `--custom` arxiv-api) → `nyxid proxy request arxiv-api '/query?search_query=cat:math.NT&max_results=5' -m GET` returns Atom feed end-to-end.

Follow-up ideas
- … `documentation_url` to `DownstreamService` and re-fold the URLs out of the descriptions.
- … `DEFAULT_SERVICE_SEEDS` with a new `semanticscholar` provider).