feat(catalog): seed unauthenticated public APIs (arXiv, OpenAlex, Crossref) #592

Open
AlyciaBHZ wants to merge 1 commit into ChronoAIProject:main from AlyciaBHZ:add-public-academic-catalog

@AlyciaBHZ
Summary

Adds three first-class catalog entries for unauthenticated public academic APIs:

  • arxiv-api (http://export.arxiv.org/api) — Atom feed search/metadata, 2.5M+ papers
  • api-openalex (https://api.openalex.org) — 240M+ scholarly works, authors, citations
  • api-crossref (https://api.crossref.org) — DOI metadata + citation graph (~150M works)

These have no ProviderConfig to bind to. The implementation introduces a parallel DEFAULT_PUBLIC_SERVICE_SEEDS table and a second seed loop in seed_default_services that produces DownstreamService rows with provider_config_id: None, auth_method: "none", requires_user_credential: false, and no ServiceProviderRequirement.

build_catalog_entry already tolerates provider: None and returns requires_credential: false, so these surface in the AI Services dialog as one-click no-auth services. Verified locally — nyxid service add arxiv-api becomes a one-shot operation, no --custom boilerplate required.
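
For readers skimming the thread, the shape described above can be sketched roughly as follows. This is a standalone model rather than the code in the diff: `PublicServiceSeed`, `seed_public_services`, and the in-memory `Vec` standing in for the database are assumptions for illustration. Only the field values (`provider_config_id: None`, `auth_method: "none"`, `requires_user_credential: false`) and the three slugs and base URLs come from the summary.

```rust
// Hypothetical row shape: field names follow the PR description,
// not necessarily the real NyxID struct.
#[derive(Debug, Clone, PartialEq)]
struct DownstreamService {
    slug: String,
    base_url: String,
    provider_config_id: Option<String>,
    auth_method: String,
    requires_user_credential: bool,
}

// Hypothetical seed entry for a service with no provider to bind to.
struct PublicServiceSeed {
    slug: &'static str,
    base_url: &'static str,
}

// Slugs and base URLs as listed in the summary above.
const DEFAULT_PUBLIC_SERVICE_SEEDS: &[PublicServiceSeed] = &[
    PublicServiceSeed { slug: "arxiv-api", base_url: "http://export.arxiv.org/api" },
    PublicServiceSeed { slug: "api-openalex", base_url: "https://api.openalex.org" },
    PublicServiceSeed { slug: "api-crossref", base_url: "https://api.crossref.org" },
];

/// Inserts every public seed whose slug is not already present and
/// returns how many rows were added. Re-running it adds nothing.
fn seed_public_services(existing: &mut Vec<DownstreamService>) -> usize {
    let mut inserted = 0;
    for seed in DEFAULT_PUBLIC_SERVICE_SEEDS {
        if existing.iter().any(|row| row.slug == seed.slug) {
            continue; // already seeded, keep the existing row untouched
        }
        existing.push(DownstreamService {
            slug: seed.slug.to_string(),
            base_url: seed.base_url.to_string(),
            provider_config_id: None,
            auth_method: "none".to_string(),
            requires_user_credential: false,
        });
        inserted += 1;
    }
    inserted
}
```

Calling `seed_public_services` twice on the same collection should add rows only on the first call, which is the idempotency property the real seed loop is expected to have.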

Why route public APIs through NyxID at all?

The proxy injects nothing on these calls. The benefit is operational:

  • Centralised audit logging — every academic-source call appears in NyxID's request log alongside authenticated proxy traffic, instead of scattered curl logs across machines.
  • Single point for polite-pool configuration — when arXiv/OpenAlex/Crossref recommend a User-Agent: app/version (mailto:you@example.com) header for higher rate limits, an admin can set it once via service update --default-header and every agent honours it without code changes.
  • Future rate-limit / circuit-breaker hooks — if NyxID later adds per-service rate limiters, public APIs benefit identically to credentialled ones.
  • Discoverability: nyxid catalog show arxiv-api returns a description that explains the no-auth policy and the polite-pool convention, so agents working on a new academic-data task discover the source from inside NyxID.

Motivation

I'm using NyxID to broker external APIs for an outreach pipeline that targets open mathematical conjectures (https://github.com/the-omega-institute/automath). The pipeline routinely hits arXiv to scan recent math.NT/CO/AG/DS papers against a research board — currently via a per-machine service add --custom which loses the audit trail and doesn't propagate the polite-pool header to other agents that need the same source. Seeding these as catalog entries makes them as ergonomic as api-github or api-reddit already are.

The same argument extends to any agent doing literature work, citation mining, or paper deduplication.

Implementation notes

  • New seed table is distinct from DEFAULT_SERVICE_SEEDS rather than threading Option<&str> through the existing provider_slug field. This isolates the no-provider path from the existing 28 provider-backed seeds, so the audit / SPR / token-exchange logic stays unchanged for the credentialled cases.
  • Slug uniqueness is enforced by a unit test that also asserts no collision with DEFAULT_SERVICE_SEEDS.
  • Descriptions inline the official documentation URLs (and polite-pool convention where applicable) since DownstreamService doesn't have a separate documentation_url field; happy to follow up with a separate PR adding that field if desirable.
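
As an illustration of the slug-uniqueness check mentioned in the notes above, here is a minimal sketch of the logic such a test might exercise. The helper name `first_slug_conflict` is hypothetical, and the slices stand in for the slugs of the two seed tables; the actual test in the PR may be structured differently.

```rust
use std::collections::HashSet;

/// Returns the first slug in `public` that is either repeated within
/// `public` or already used by `default`; `None` means no conflict.
fn first_slug_conflict<'a>(public: &[&'a str], default: &[&'a str]) -> Option<&'a str> {
    let reserved: HashSet<&str> = default.iter().copied().collect();
    let mut seen: HashSet<&str> = HashSet::new();
    for &slug in public {
        // `insert` returns false when the slug was already seen.
        if reserved.contains(slug) || !seen.insert(slug) {
            return Some(slug);
        }
    }
    None
}
```

A unit test would then assert that the helper returns `None` when given the slugs of both seed tables.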

Test plan

  • cargo check -p nyxid clean
  • cargo fmt -p nyxid clean
  • cargo clippy -p nyxid clean
  • cargo test -p nyxid services::provider_service::tests::public_service_seeds_have_unique_slugs_and_no_collision_with_default_seeds
  • cargo test -p nyxid services::provider_service::tests::arxiv_public_seed_is_present_and_unauthenticated
  • Locally exercised: nyxid service add arxiv-api (after deleting prior --custom arxiv-api) → nyxid proxy request arxiv-api '/query?search_query=cat:math.NT&max_results=5' -m GET returns Atom feed end-to-end.

Follow-up ideas

  • Add documentation_url to DownstreamService and re-fold the URLs out of the descriptions.
  • Add Semantic Scholar (requires API key, would go in DEFAULT_SERVICE_SEEDS with a new semanticscholar provider).
  • Add ORCID public API (read-only) here under the same no-auth path.

…ssref)

Adds DEFAULT_PUBLIC_SERVICE_SEEDS + a parallel seed loop in
seed_default_services for catalog entries that don't bind to any
ProviderConfig. Resulting DownstreamService rows have:

  - provider_config_id: None
  - auth_method: "none"
  - requires_user_credential: false
  - no ServiceProviderRequirement

build_catalog_entry already tolerates `provider: None` and emits
`requires_credential: false`, so these surface in the AI Services
dialog as one-click no-auth services. The proxy injects nothing — the
benefit is centralised audit logging and a single place to manage
polite-pool / rate-limit headers across agents that hit the same
public source.

Three initial seeds:
- `arxiv-api` (http://export.arxiv.org/api): Atom feed search/metadata
- `api-openalex` (https://api.openalex.org): 240M+ scholarly works graph
- `api-crossref` (https://api.crossref.org): DOI metadata + citations

Each description includes the polite-pool convention so agents can
discover it from `nyxid catalog show <slug>` without leaving NyxID.

Tests:
- `public_service_seeds_have_unique_slugs_and_no_collision_with_default_seeds`
- `arxiv_public_seed_is_present_and_unauthenticated`

Motivation: agents working on academic / open-problem domains (e.g.
literature staleness checks against erdosproblems / RESEARCH_BOARD
targets, citation graph mining) need these sources first-class. Today
they have to use `service add --custom` per machine and lose the
audit trail. Seeding them in catalog gives one-line `nyxid service add
arxiv-api` everywhere.

@AlyciaBHZ force-pushed the add-public-academic-catalog branch from edb9f9d to d1934e0 on April 30, 2026 13:21
Collaborator

@kaiweijw kaiweijw left a comment

Direction makes sense — having arxiv-api / api-openalex / api-crossref as one-liner catalog entries is a real ergonomic win over service add --custom, and the is_truly_no_auth path in unified_key_service.rs:523 was already designed for exactly this shape. Code is clean, additive, idempotent, CI green, builds and tests pass locally for me too.

Three things I'd like fixed before this lands. Two are correctness, one is a description-only fix that affects how this gets evaluated.

1. arXiv URL should be https://, not http://

http://export.arxiv.org/api leaks every search_query=... to any on-path observer. arXiv supports https on the same host. Since the headline benefit framing is observability/audit, sending the request itself in the clear undercuts the value prop. One-character fix:

base_url: "https://export.arxiv.org/api",
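
If it helps guard against regressions, a small check along these lines could sit next to the slug-uniqueness test. It is only a sketch: `insecure_base_urls` is a hypothetical helper, and the slice stands in for the `base_url` values of the seed table.

```rust
/// Returns every base URL that does not use the https scheme.
fn insecure_base_urls<'a>(base_urls: &[&'a str]) -> Vec<&'a str> {
    base_urls
        .iter()
        .copied()
        .filter(|url| !url.starts_with("https://"))
        .collect()
}
```

A test would assert that the returned list is empty for the seeded URLs, so a future `http://` entry fails CI instead of shipping.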

2. The "AI Services dialog" claim is wrong as written — pick a fix

These seeds will not appear in the web /keys AI Services dialog as currently filtered. frontend/src/hooks/use-keys.ts:36 (useCatalog) calls /catalog, which routes to catalog_service::list_catalog (backend/src/services/catalog_service.rs:217), whose $or requires:

requires_user_credential: true
  OR requires_user_credential: { $exists: false }
  OR provider_config_id: { $ne: null }

The new seeds set requires_user_credential: false AND provider_config_id: None, so they're excluded. They're only reachable via:

  • /catalog?include_all=true (CLI wizard, catalog-grid.tsx:138)
  • /catalog/{slug} direct lookup (which is what nyxid service add arxiv-api uses — i.e. the path you actually tested)

Two options:

  • (a) Make the claim true. Add a fourth $or clause to list_catalog: { "auth_method": "none", "service_category": "internal", "provider_config_id": null }. I checked: all 28 existing provider-backed seeds still match via the existing provider_config_id != null clause, so no double-counting. Add a unit test that calls list_catalog against a seeded fixture and asserts arxiv-api appears.
  • (b) Drop the claim. Edit the PR description to say these surface via nyxid catalog list --all and nyxid service add <slug>, not the web dialog.

I'd prefer (a) since it makes the feature actually do what the description promises.
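
To make the filter behaviour concrete, here is a plain-Rust model of the `$or` as quoted above, plus the proposed fourth clause. It is not the MongoDB query itself: `CatalogDoc` and both function names are hypothetical, `None` on `requires_user_credential` models a missing field, and `None` on `provider_config_id` models null.

```rust
// Hypothetical in-memory view of the fields the catalog filter reads.
struct CatalogDoc {
    requires_user_credential: Option<bool>, // None = field absent ($exists: false)
    provider_config_id: Option<&'static str>, // None = null
    auth_method: &'static str,
    service_category: &'static str,
}

/// The three existing clauses of the `$or`.
fn matches_current_filter(doc: &CatalogDoc) -> bool {
    doc.requires_user_credential == Some(true)
        || doc.requires_user_credential.is_none()
        || doc.provider_config_id.is_some()
}

/// Option (a): the existing clauses plus the proposed no-auth clause.
fn matches_extended_filter(doc: &CatalogDoc) -> bool {
    matches_current_filter(doc)
        || (doc.auth_method == "none"
            && doc.service_category == "internal"
            && doc.provider_config_id.is_none())
}
```

A document shaped like the new seeds (`requires_user_credential: Some(false)`, `provider_config_id: None`, `auth_method: "none"`, `service_category: "internal"`) fails `matches_current_filter` and passes `matches_extended_filter`, while any provider-backed document passes both through the `provider_config_id` clause.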

3. The audit-trail framing in the PR description is incorrect

currently via a per-machine service add --custom which loses the audit trail

This isn't true. --custom services are routed through the same /api/v1/proxy/{service_id}/{path} handler, which calls audit_personal_routing / audit_org_routing on every request (handlers/proxy.rs:179, 217). Custom and catalog-seeded services produce identical audit entries. The audit benefit only holds vs. raw curl bypassing NyxID entirely — which is a real benefit, just a different one.

Suggest reframing the motivation as: (i) ergonomics — no per-user --base-url boilerplate; (ii) single-source-of-truth for the URL/description; (iii) admin-managed default_request_headers propagating polite-pool config to every agent without per-machine setup. The polite-pool argument is the strongest one and is well-suited to default_request_headers — though note the current PR sets default_request_headers: None, so an admin still needs to populate it post-seed. Worth either adding a comment about this in the seed table or filing a follow-up.
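
For the polite-pool point, a rough sketch of what populating and applying such a default could look like. Everything here is an assumption for illustration: the helper names are hypothetical, the header map is a plain `HashMap`, and the precedence rule (headers on the request override service defaults) is a guess rather than a statement about how NyxID merges `default_request_headers`. Header names are also compared case-sensitively here, which a real proxy would likely not do.

```rust
use std::collections::HashMap;

/// Formats a User-Agent in the `app/version (mailto:address)` shape
/// used by the polite-pool convention.
fn polite_user_agent(app: &str, version: &str, contact: &str) -> String {
    format!("{app}/{version} (mailto:{contact})")
}

/// Starts from the service-level defaults (if any) and lets headers set
/// on the individual request take precedence (assumed precedence rule).
fn merge_default_headers(
    defaults: Option<&HashMap<String, String>>,
    request: &HashMap<String, String>,
) -> HashMap<String, String> {
    let mut merged = defaults.cloned().unwrap_or_default();
    merged.extend(request.iter().map(|(k, v)| (k.clone(), v.clone())));
    merged
}
```

With `defaults` set to `None`, as in the current seed, the merged map contains only what the caller sent, which is why an admin still has to populate the header after seeding.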

Smaller, optional

  • The service_category: "internal" / created_by: "system" / visibility: "public" choices follow existing convention but aren't called out — one comment line in the seed struct would help future maintainers reason about it.
  • Parallel seed table vs threading Option<&str> through DefaultServiceSeed — fine choice, your justification in the PR holds. Just be aware that future capability/header backfill mechanisms will need a parallel path too.

Happy to approve once #1 and #2 (either option) are in. #3 is description-only.
