feat(catalog): seed unauthenticated public APIs (arXiv, OpenAlex, Crossref) by AlyciaBHZ · Pull Request #592 · ChronoAIProject/NyxID

AlyciaBHZ · 2026-04-30T13:08:10Z

Summary

Adds three first-class catalog entries for unauthenticated public academic APIs:

arxiv-api (http://export.arxiv.org/api) — Atom feed search/metadata, 2.5M+ papers
api-openalex (https://api.openalex.org) — 240M+ scholarly works, authors, citations
api-crossref (https://api.crossref.org) — DOI metadata + citation graph (~150M works)

These have no ProviderConfig to bind to. The implementation introduces a parallel DEFAULT_PUBLIC_SERVICE_SEEDS table and a second seed loop in seed_default_services that produces DownstreamService rows with provider_config_id: None, auth_method: "none", requires_user_credential: false, and no ServiceProviderRequirement.
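As a rough illustration of the row shape described above, the sketch below maps a seed entry to a no-auth service row. The struct names and every field other than provider_config_id, auth_method, and requires_user_credential are assumptions made for illustration; the real types in NyxID likely carry more fields.

```rust
/// Hypothetical seed entry; the real DEFAULT_PUBLIC_SERVICE_SEEDS element type may differ.
#[derive(Debug, Clone, Copy)]
pub struct PublicServiceSeed {
    pub slug: &'static str,
    pub base_url: &'static str,
    pub description: &'static str,
}

/// Simplified stand-in for a DownstreamService row (illustrative subset of fields).
#[derive(Debug, Clone, PartialEq)]
pub struct DownstreamService {
    pub slug: String,
    pub base_url: String,
    pub description: String,
    pub provider_config_id: Option<String>,
    pub auth_method: String,
    pub requires_user_credential: bool,
}

/// Builds the no-provider, no-credential row described in the summary.
pub fn seed_to_service(seed: &PublicServiceSeed) -> DownstreamService {
    DownstreamService {
        slug: seed.slug.to_string(),
        base_url: seed.base_url.to_string(),
        description: seed.description.to_string(),
        // No ProviderConfig to bind to.
        provider_config_id: None,
        // The proxy injects no credentials for these services.
        auth_method: "none".to_string(),
        requires_user_credential: false,
    }
}
```

Keeping the three no-auth fields fixed inside the conversion function, rather than in each seed, means a new public seed cannot accidentally opt into a credentialled path.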

build_catalog_entry already tolerates provider: None and returns requires_credential: false, so these surface in the AI Services dialog as one-click no-auth services. Verified locally — nyxid service add arxiv-api becomes a one-shot operation, no --custom boilerplate required.

Why route public APIs through NyxID at all?

The proxy injects nothing on these calls. The benefit is operational:

Centralised audit logging — every academic-source call appears in NyxID's request log alongside authenticated proxy traffic, instead of scattered curl logs across machines.
Single point for polite-pool configuration — when arXiv/OpenAlex/Crossref recommend a User-Agent: app/version (mailto:you@example.com) header for higher rate limits, an admin can set it once via service update --default-header and every agent honours it without code changes.
Future rate-limit / circuit-breaker hooks — if NyxID later adds per-service rate limiters, public APIs benefit identically to credentialled ones.
Discoverability — nyxid catalog show arxiv-api returns a description that explains the no-auth policy and the polite-pool convention, so agents working on a new academic-data task discover the source from inside NyxID.
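To make the polite-pool point concrete, here is a minimal sketch of building the User-Agent value in the format quoted above and applying admin-set default headers to an outgoing request. The function names are hypothetical, and the merge rule (a header already set by the caller wins over the default) is an assumption, not a statement about how NyxID's proxy behaves.

```rust
use std::collections::BTreeMap;

/// Formats a polite-pool style User-Agent: `app/version (mailto:address)`.
pub fn polite_user_agent(app: &str, version: &str, contact_email: &str) -> String {
    format!("{}/{} (mailto:{})", app, version, contact_email)
}

/// Adds each default header to the request unless the caller already set it.
/// Assumption: caller-supplied headers take precedence. Header names are
/// compared exactly here; a real implementation would compare case-insensitively.
pub fn apply_default_headers(
    defaults: &BTreeMap<String, String>,
    request: &mut BTreeMap<String, String>,
) {
    for (name, value) in defaults {
        request
            .entry(name.clone())
            .or_insert_with(|| value.clone());
    }
}
```

With one default entry for User-Agent, every request routed through the service would carry the same contact string without each agent setting it.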

Motivation

I'm using NyxID to broker external APIs for an outreach pipeline that targets open mathematical conjectures (https://github.com/the-omega-institute/automath). The pipeline routinely hits arXiv to scan recent math.NT/CO/AG/DS papers against a research board — currently via a per-machine service add --custom which loses the audit trail and doesn't propagate the polite-pool header to other agents that need the same source. Seeding these as catalog entries makes them as ergonomic as api-github or api-reddit already are.

The same argument extends to any agent doing literature work, citation mining, or paper deduplication.

Implementation notes

New seed table is distinct from DEFAULT_SERVICE_SEEDS rather than threading Option<&str> through the existing provider_slug field. This isolates the no-provider path from the existing 28 provider-backed seeds, so the audit / SPR / token-exchange logic stays unchanged for the credentialled cases.
Slug uniqueness is enforced by a unit test that also asserts no collision with DEFAULT_SERVICE_SEEDS.
Descriptions inline the official documentation URLs (and polite-pool convention where applicable) since DownstreamService doesn't have a separate documentation_url field; happy to follow up with a separate PR adding that field if desirable.
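The slug-uniqueness check mentioned above could look roughly like the following. This is a sketch over plain slug lists, not the actual unit test in the PR; find_slug_conflict is a hypothetical helper name.

```rust
use std::collections::HashSet;

/// Returns the first public slug that is either duplicated within `public`
/// or already present in `default`; `None` means no conflict.
pub fn find_slug_conflict<'a>(public: &[&'a str], default: &[&'a str]) -> Option<&'a str> {
    // Start from the provider-backed slugs so collisions with them are caught too.
    let mut seen: HashSet<&str> = default.iter().copied().collect();
    // `insert` returns false when the slug was already in the set.
    public.iter().copied().find(|slug| !seen.insert(*slug))
}
```

A test built on this would assert that the result is `None` for the two real seed tables.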

Test plan

cargo check -p nyxid clean
cargo fmt -p nyxid clean
cargo clippy -p nyxid clean
cargo test -p nyxid services::provider_service::tests::public_service_seeds_have_unique_slugs_and_no_collision_with_default_seeds ✅
cargo test -p nyxid services::provider_service::tests::arxiv_public_seed_is_present_and_unauthenticated ✅
Locally exercised: nyxid service add arxiv-api (after deleting prior --custom arxiv-api) → nyxid proxy request arxiv-api '/query?search_query=cat:math.NT&max_results=5' -m GET returns Atom feed end-to-end.

Follow-up ideas

Add documentation_url to DownstreamService and move the URLs out of the descriptions into it.
Add Semantic Scholar (requires API key, would go in DEFAULT_SERVICE_SEEDS with a new semanticscholar provider).
Add ORCID public API (read-only) here under the same no-auth path.

…ssref)

Adds DEFAULT_PUBLIC_SERVICE_SEEDS + a parallel seed loop in seed_default_services for catalog entries that don't bind to any ProviderConfig. Resulting DownstreamService rows have:
- provider_config_id: None
- auth_method: "none"
- requires_user_credential: false
- no ServiceProviderRequirement

build_catalog_entry already tolerates `provider: None` and emits `requires_credential: false`, so these surface in the AI Services dialog as one-click no-auth services. The proxy injects nothing; the benefit is centralised audit logging and a single place to manage polite-pool / rate-limit headers across agents that hit the same public source.

Three initial seeds:
- `arxiv-api` (http://export.arxiv.org/api): Atom feed search/metadata
- `api-openalex` (https://api.openalex.org): 240M+ scholarly works graph
- `api-crossref` (https://api.crossref.org): DOI metadata + citations

Each description includes the polite-pool convention so agents can discover it from `nyxid catalog show <slug>` without leaving NyxID.

Tests:
- `public_service_seeds_have_unique_slugs_and_no_collision_with_default_seeds`
- `arxiv_public_seed_is_present_and_unauthenticated`

Motivation: agents working on academic / open-problem domains (e.g. literature staleness checks against erdosproblems / RESEARCH_BOARD targets, citation graph mining) need these sources first-class. Today they have to use `service add --custom` per machine and lose the audit trail. Seeding them in catalog gives one-line `nyxid service add arxiv-api` everywhere.

kaiweijw

Direction makes sense — having arxiv-api / api-openalex / api-crossref as one-liner catalog entries is a real ergonomic win over service add --custom, and the is_truly_no_auth path in unified_key_service.rs:523 was already designed for exactly this shape. Code is clean, additive, idempotent, CI green, builds and tests pass locally for me too.

Three things I'd like fixed before this lands. Two are correctness, one is a description-only fix that affects how this gets evaluated.

1. arXiv URL should be `https://`, not `http://`

http://export.arxiv.org/api leaks every search_query=... to any on-path observer. arXiv supports https on the same host. Since the headline benefit framing is observability/audit, sending the request itself in the clear undercuts the value prop. One-character fix:

base_url: "https://export.arxiv.org/api",
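A cheap guard against this regressing would be a check that every seed's base URL uses TLS. The sketch below works on plain (slug, base_url) pairs and is only an illustration; insecure_seed_slugs is a hypothetical helper, not something in the PR.

```rust
/// Returns the slugs whose base URL does not start with `https://`.
pub fn insecure_seed_slugs<'a>(seeds: &[(&'a str, &'a str)]) -> Vec<&'a str> {
    seeds
        .iter()
        .filter(|(_, base_url)| !base_url.starts_with("https://"))
        .map(|(slug, _)| *slug)
        .collect()
}
```

Asserting that this returns an empty list for the public seed table would flag the current `http://` arXiv entry and any future one like it.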

2. The "AI Services dialog" claim is wrong as written — pick a fix

These seeds will not appear in the web /keys AI Services dialog as currently filtered. frontend/src/hooks/use-keys.ts:36 (useCatalog) calls /catalog, which routes to catalog_service::list_catalog (backend/src/services/catalog_service.rs:217), whose $or requires:

requires_user_credential: true
  OR requires_user_credential: { $exists: false }
  OR provider_config_id: { $ne: null }

The new seeds set requires_user_credential: false AND provider_config_id: None, so they're excluded. They're only reachable via:

/catalog?include_all=true (CLI wizard, catalog-grid.tsx:138)
/catalog/{slug} direct lookup (which is what nyxid service add arxiv-api uses — i.e. the path you actually tested)
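To spell out why the seeds fall through, here is the $or condition above modelled as a plain Rust predicate. This is a simplified in-memory stand-in, not the actual query code in catalog_service.rs; ServiceDoc is a hypothetical type, and an absent field is modelled as `None`.

```rust
/// Minimal model of the two fields the default catalog filter inspects.
#[derive(Debug, Clone)]
pub struct ServiceDoc {
    /// `None` models a document where the field does not exist.
    pub requires_user_credential: Option<bool>,
    /// `None` models a null provider_config_id.
    pub provider_config_id: Option<String>,
}

/// Mirrors the three-clause `$or` quoted above.
pub fn matches_default_catalog_filter(doc: &ServiceDoc) -> bool {
    doc.requires_user_credential == Some(true)
        || doc.requires_user_credential.is_none()
        || doc.provider_config_id.is_some()
}
```

A public seed has `requires_user_credential: Some(false)` and `provider_config_id: None`, so all three clauses are false and the row is filtered out.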

Two options:

(a) Make the claim true. Add a fourth $or clause to list_catalog: { "auth_method": "none", "service_category": "internal", "provider_config_id": null }. I checked — all 28 existing provider-backed seeds still match via the existing provider_config_id != null clause, so no double-counting. Add a unit test that calls list_catalog against a seeded fixture and asserts arxiv-api appears.
(b) Drop the claim. Edit the PR description to say these surface via nyxid catalog list --all and nyxid service add <slug>, not the web dialog.

I'd prefer (a) since it makes the feature actually do what the description promises.
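For option (a), the proposed fourth clause can be sketched as its own predicate, to be OR-ed with the existing ones. As before this is an in-memory model with a hypothetical CatalogCandidate type, not the real query; a null or missing provider_config_id is modelled as `None`.

```rust
/// Minimal model of the fields the proposed clause inspects.
#[derive(Debug, Clone)]
pub struct CatalogCandidate {
    pub auth_method: String,
    pub service_category: String,
    pub provider_config_id: Option<String>,
}

/// Mirrors the proposed clause:
/// { "auth_method": "none", "service_category": "internal", "provider_config_id": null }
pub fn matches_public_no_auth_clause(doc: &CatalogCandidate) -> bool {
    doc.auth_method == "none"
        && doc.service_category == "internal"
        && doc.provider_config_id.is_none()
}
```

Because the clause requires a null provider_config_id, it cannot match any provider-backed seed, which is consistent with the no-double-counting observation above.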

3. The audit-trail framing in the PR description is incorrect

currently via a per-machine service add --custom which loses the audit trail

This isn't true. --custom services are routed through the same /api/v1/proxy/{service_id}/{path} handler, which calls audit_personal_routing / audit_org_routing on every request (handlers/proxy.rs:179, 217). Custom and catalog-seeded services produce identical audit entries. The audit benefit only holds vs. raw curl bypassing NyxID entirely — which is a real benefit, just a different one.

Suggest reframing the motivation as: (i) ergonomics — no per-user --base-url boilerplate; (ii) single-source-of-truth for the URL/description; (iii) admin-managed default_request_headers propagating polite-pool config to every agent without per-machine setup. The polite-pool argument is the strongest one and is well-suited to default_request_headers — though note the current PR sets default_request_headers: None, so an admin still needs to populate it post-seed. Worth either adding a comment about this in the seed table or filing a follow-up.

Smaller, optional

The service_category: "internal" / created_by: "system" / visibility: "public" choices follow existing convention but aren't called out — one comment line in the seed struct would help future maintainers reason about it.
Parallel seed table vs threading Option<&str> through DefaultServiceSeed — fine choice, your justification in the PR holds. Just be aware that future capability/header backfill mechanisms will need a parallel path too.

Happy to approve once #1 and #2 (either option) are in. #3 is description-only.

AlyciaBHZ force-pushed the add-public-academic-catalog branch from edb9f9d to d1934e0 on April 30, 2026 13:21

kaiweijw requested changes May 4, 2026

AlyciaBHZ wants to merge 1 commit into ChronoAIProject:main from AlyciaBHZ:add-public-academic-catalog
