openalex-ingest

Code to pull data from external sources into S3. For more, see https://openalex.org.

Please send all bug reports and feature requests to support@openalex.org.

Repository Harvester

The repository harvester (repositories.py) pulls metadata from ~6,000 OAI-PMH endpoints daily and saves records to S3 for processing by the OpenAlex pipeline.

Architecture (Simplified January 2026)

The harvester uses a simple, massively parallel approach:

  1. Daily Job: One scheduled job harvests ALL endpoints
  2. Parallelization: 100 concurrent threads with per-host rate limiting (max 3 concurrent requests per host; see the sketch below this list)
  3. Health Tracking: Each endpoint's status is recorded after every harvest attempt
  4. Runtime: ~15 minutes for all ~6,000 endpoints
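
Conceptually, the fan-out looks like the sketch below: one thread pool sized to MAX_WORKERS, plus a per-host semaphore that caps in-flight requests to any single host at MAX_PER_HOST. This is an illustration only; harvest_endpoint, _lock_for, and harvest_all are hypothetical names, not functions from repositories.py.

import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

MAX_WORKERS = 100
MAX_PER_HOST = 3

_host_locks = {}  # hostname -> Semaphore

def _lock_for(url):
    host = urlparse(url).netloc
    # dict.setdefault is atomic in CPython, so all threads share one
    # semaphore per host
    return _host_locks.setdefault(host, threading.BoundedSemaphore(MAX_PER_HOST))

def harvest_endpoint(url):
    with _lock_for(url):  # at most MAX_PER_HOST concurrent requests per host
        ...  # hypothetical: page through OAI-PMH, write records to S3

def harvest_all(urls):
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        list(pool.map(harvest_endpoint, urls))  # force evaluation, surface errors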

This replaced a complex 4-tier system that scheduled endpoints separately based on "reliability" tiers. The old system was:

  • Overly complex with 4 separate scheduled jobs
  • Based on stale data (retry_interval was never updated)
  • Slow due to sequential processing

Usage

# Daily job (recommended): harvest all endpoints in parallel
python repositories.py --all-endpoints --n_threads 100

# Harvest a specific endpoint
python repositories.py --endpoint-id abc123

# Custom date range
python repositories.py --all-endpoints --start-date 2026-01-01 --end-date 2026-01-15
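
Under the hood, a date-bounded harvest boils down to OAI-PMH ListRecords requests with from/until parameters. A minimal sketch of one such request (assuming the common oai_dc metadata format; the real repositories.py may request a different prefix, and a full harvest must also follow resumptionToken paging):

import requests

def list_records_page(base_url, start_date, end_date, metadata_prefix="oai_dc"):
    # Fetch one page of an OAI-PMH ListRecords response; a full harvest
    # re-issues the request with the resumptionToken until none is returned.
    resp = requests.get(base_url, params={
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": start_date,    # e.g. "2026-01-01"
        "until": end_date,     # e.g. "2026-01-15"
    }, timeout=15)
    resp.raise_for_status()
    return resp.text  # XML: records are under <ListRecords><record> elements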

Health Tracking

The harvester tracks endpoint health with these database columns:

Column               Description
last_health_status   Status from last attempt: success, blocked, timeout, connection_error, malformed, or oai_error
last_health_check    Timestamp of last harvest attempt
last_response_time   Response time in seconds
last_error_message   Error details if harvest failed
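
After every attempt, the harvester writes these columns back in a single update. A sketch of what that write could look like (assumptions: PostgreSQL via psycopg2, a table named endpoint with an id column; the real schema and code may differ):

import time
import psycopg2

def record_health(db_url, endpoint_id, status, started_at, error=None):
    # status values match the table above: success, blocked, timeout,
    # connection_error, malformed, oai_error
    elapsed = time.monotonic() - started_at
    with psycopg2.connect(db_url) as conn, conn.cursor() as cur:
        cur.execute(
            "UPDATE endpoint "
            "SET last_health_status = %s, last_health_check = now(), "
            "last_response_time = %s, last_error_message = %s "
            "WHERE id = %s",
            (status, elapsed, error, endpoint_id),
        )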

Legacy Columns (DO NOT USE)

The following columns are historical artifacts from the old tiering system. They are NOT used by the current harvester and may be removed in a future migration:

Column           Was Used For                     Current Status
retry_interval   Exponential backoff scheduling   IGNORED - never updated
retry_at         Scheduling retries               IGNORED - never checked
is_core          Prioritizing "core" endpoints    IGNORED - all endpoints treated equally

DO NOT add new logic that depends on these columns. They exist only for historical context and to avoid a breaking migration.

Database Migration

Before running the new harvester, apply the migration to add health tracking columns:

psql $DATABASE_URL -f migrations/001_add_endpoint_health_columns.sql
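
If you want to sanity-check the migration before applying it, it should amount to adding the four health columns described above. A sketch of the expected shape (the table name endpoint and the column types are assumptions; the authoritative version is migrations/001_add_endpoint_health_columns.sql):

ALTER TABLE endpoint
    ADD COLUMN IF NOT EXISTS last_health_status text,
    ADD COLUMN IF NOT EXISTS last_health_check timestamptz,
    ADD COLUMN IF NOT EXISTS last_response_time double precision,
    ADD COLUMN IF NOT EXISTS last_error_message text;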

Heroku Scheduler

Replace the four old scheduled jobs with a single daily job:

Old (remove these):

  • python repositories.py --core-endpoints
  • python repositories.py --reliable-endpoints
  • python repositories.py --other-endpoints
  • python repositories.py --abandoned-endpoints (monthly)

New (add this):

  • python repositories.py --all-endpoints --n_threads 100 (daily)

Configuration

Settings are defined at the top of repositories.py:

MAX_WORKERS = 100       # Total concurrent threads
MAX_PER_HOST = 3        # Max concurrent requests per host
REQUEST_TIMEOUT = 15    # Seconds before giving up
BATCH_SIZE = 5000       # Records per S3 file
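
BATCH_SIZE determines how harvested records are chunked into S3 objects. A sketch of that batching (the boto3 calls are standard, but the bucket and key layout here are made up for illustration, not OpenAlex's actual layout):

import json
import boto3

BATCH_SIZE = 5000
s3 = boto3.client("s3")

def write_batches(records, bucket, endpoint_id):
    # One S3 object per BATCH_SIZE-record chunk.
    for i in range(0, len(records), BATCH_SIZE):
        chunk = records[i:i + BATCH_SIZE]
        key = f"repositories/{endpoint_id}/batch_{i // BATCH_SIZE:05d}.json"
        s3.put_object(Bucket=bucket, Key=key,
                      Body=json.dumps(chunk).encode("utf-8"))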
