Official command-line interface for OpenAlex. Download work metadata and full-text content (PDFs, TEI XML) in bulk.
Note: This package was formerly known as `openalex-content-downloader`. If you have that installed, please switch to `openalex-official`.
pip install openalex-official
# Download metadata for works matching a filter
openalex download \
--api-key YOUR_API_KEY \
--output ./results \
--filter "topics.id:T10325"
# Download metadata + PDFs
openalex download \
--api-key YOUR_API_KEY \
--output ./results \
--filter "topics.id:T10325" \
--content pdf
# Download metadata + PDFs + TEI XML
openalex download \
--api-key YOUR_API_KEY \
--output ./results \
--filter "topics.id:T10325" \
--content pdf,xml
# Download specific works by ID or DOI
openalex download \
--api-key YOUR_API_KEY \
--output ./results \
--ids "W2741809807,10.1038/nature12373"
# Download from a list of IDs via stdin
cat work_ids.txt | openalex download \
--api-key YOUR_API_KEY \
--output ./results \
--stdin
# Download to S3
openalex download \
--api-key YOUR_API_KEY \
--storage s3 \
--s3-bucket my-bucket \
--s3-prefix openalex/ \
--filter "topics.id:T12345"
# Check API key status
openalex status --api-key YOUR_API_KEY
- Metadata-first approach - JSON metadata is always saved; content files are optional
- High-throughput async downloads - Configurable concurrency for millions of works
- Automatic checkpointing - Resume interrupted downloads without re-downloading
- Adaptive rate limiting - Automatically adjusts to API conditions
- Multiple storage backends - Local filesystem or S3
- Progress tracking - Rich terminal UI with live stats, or headless logging
- Flexible filtering - Use any OpenAlex filter syntax
- Multiple input modes - Filter, explicit IDs, or piped stdin
- DOI support - Auto-detects and resolves DOIs to OpenAlex work IDs
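The DOI auto-detection mentioned above can be pictured with a small sketch. The patterns below are assumptions made for illustration (a work ID as `W` followed by digits, a DOI as `10.<registrant>/<suffix>`); the CLI's actual detection logic is not documented here and may differ. `classify_identifier` and `split_ids` are hypothetical helper names.

```python
import re

# Assumed patterns, for illustration only; the CLI's real rules may differ.
WORK_ID_RE = re.compile(r"^W\d+$")
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def classify_identifier(value: str) -> str:
    """Return 'work_id', 'doi', or 'unknown' for a single input token."""
    token = value.strip()
    if WORK_ID_RE.match(token):
        return "work_id"
    if DOI_RE.match(token):
        return "doi"
    return "unknown"


def split_ids(raw: str) -> list:
    """Split a comma-separated --ids style string into (token, kind) pairs."""
    tokens = [part.strip() for part in raw.split(",") if part.strip()]
    return [(token, classify_identifier(token)) for token in tokens]


if __name__ == "__main__":
    for token, kind in split_ids("W2741809807,10.1038/nature12373"):
        print(f"{token}: {kind}")
```

A pre-check like this can help catch malformed entries in a long ID list before spending API credits on them.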
Download work metadata and optionally content (PDFs, TEI XML).
| Option | Description | Default |
|---|---|---|
| `--api-key` | OpenAlex API key (required) | `$OPENALEX_API_KEY` |
| `--output`, `-o` | Output directory | `./openalex-downloads` |
| `--storage` | Storage backend: `local` or `s3` | `local` |
| `--s3-bucket` | S3 bucket name | - |
| `--s3-prefix` | S3 key prefix | `""` |
| `--filter` | OpenAlex filter string | None (all works) |
| `--ids` | Comma-separated work IDs or DOIs | - |
| `--stdin` | Read work IDs/DOIs from stdin | `false` |
| `--content` | Content to download: `pdf`, `xml`, or `pdf,xml` | None (metadata only) |
| `--nested` | Use nested folder structure (`W##/##/`) | `false` |
| `--workers` | Concurrent download workers (1-200) | `50` |
| `--resume/--no-resume` | Resume from checkpoint | `true` |
| `--fresh` | Ignore checkpoint, start fresh | `false` |
| `--quiet`, `-q` | Minimal output (log file only) | `false` |
| `--verbose`, `-v` | Extra debug output | `false` |
Check API key status and credit information.
| Option | Description |
|---|---|
| `--api-key` | OpenAlex API key (required) |
# Recent articles
--filter "publication_year:>2020,type:article"
# Specific topic
--filter "topics.id:T12345"
# From a specific institution
--filter "authorships.institutions.id:I123456789"
# Open access only
--filter "open_access.is_oa:true"
# Combined filters
--filter "publication_year:2023,type:article,open_access.is_oa:true"
See OpenAlex filter documentation for all available filters.
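When filters are assembled programmatically, the comma-separated `key:value` form shown above can be generated from a mapping. This is a minimal sketch; `build_filter` is a hypothetical helper and not part of the CLI, and it only joins pairs without validating field names against the OpenAlex API.

```python
def build_filter(conditions: dict) -> str:
    """Join key/value pairs into the comma-separated form passed to --filter.

    Booleans are lowercased to match the `open_access.is_oa:true` style.
    """
    parts = []
    for key, value in conditions.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}:{value}")
    return ",".join(parts)


if __name__ == "__main__":
    print(build_filter({
        "publication_year": 2023,
        "type": "article",
        "open_access.is_oa": True,
    }))
```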
By default, files are saved flat in the output directory. Metadata is always saved as JSON:
output/
├── W2741809807.json # metadata (always saved)
├── W2741809807.pdf # content (if --content pdf)
├── W2741809807.tei.xml # content (if --content xml)
├── W1234567890.json
└── .openalex-checkpoint.json
For large downloads (>10,000 files), use --nested to organize files in a nested structure that avoids filesystem issues:
output/
├── W27/
│ └── 41/
│ ├── W2741809807.json
│ └── W2741809807.pdf
├── W12/
│ └── 34/
│ └── W1234567890.json
└── .openalex-checkpoint.json
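Post-processing scripts often need to locate a file in the nested layout. The sketch below infers the rule from the example tree (first three characters of the ID, then the next two); that inference is an assumption, and `nested_path` is a hypothetical helper, not a function exported by the package.

```python
from pathlib import PurePosixPath


def nested_path(work_id: str, suffix: str = ".json") -> PurePosixPath:
    """Map a work ID to its assumed location under --nested.

    Inferred from the example tree: W2741809807 -> W27/41/W2741809807.json
    """
    if len(work_id) < 5:
        raise ValueError(f"work ID too short for nested layout: {work_id!r}")
    return PurePosixPath(work_id[:3]) / work_id[3:5] / f"{work_id}{suffix}"


if __name__ == "__main__":
    print(nested_path("W2741809807"))
    print(nested_path("W2741809807", ".pdf"))
```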
When downloading by DOI, files are named using the DOI (with / replaced by _):
output/
├── 10.1038_nature12373.json
└── 10.1038_nature12373.pdf
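The DOI naming rule is simple enough to reproduce when matching downloaded files back to an input list. A minimal sketch, assuming only the documented `/` to `_` replacement applies; `doi_to_filename` is a hypothetical helper name.

```python
def doi_to_filename(doi: str, suffix: str = ".json") -> str:
    """Build the on-disk name for a DOI download by replacing '/' with '_'."""
    return doi.replace("/", "_") + suffix


if __name__ == "__main__":
    print(doi_to_filename("10.1038/nature12373"))
    print(doi_to_filename("10.1038/nature12373", ".pdf"))
```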
The downloader automatically saves progress to .openalex-checkpoint.json in the output directory. If interrupted, run the same command again to resume.
To start fresh and ignore the checkpoint:
openalex download --api-key KEY --output ./data --fresh
All activity is logged to openalex-download.log in the output directory, regardless of terminal mode.
The download speed is typically limited by network bandwidth, not the tool or API. On a typical home connection (~400 Mbps), expect ~10-15 files/sec (~1M files/day). To achieve higher throughput, deploy from a cloud environment.
Performance scaling:
| Environment | Bandwidth | Workers | Expected Rate |
|---|---|---|---|
| Home connection | 400 Mbps | 50 | ~10-15 files/sec |
| Cloud VM (standard) | 1-5 Gbps | 100-150 | ~30-50 files/sec |
| Cloud VM (high-perf) | 10+ Gbps | 200-300 | ~60+ files/sec |
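The daily figures follow directly from the per-second rates in the table. The sketch below does that arithmetic and estimates how long a job might take; it assumes the rate is sustained around the clock, which real runs may not achieve, and both function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def files_per_day(files_per_sec: float) -> float:
    """Convert a sustained files/sec rate into files/day."""
    return files_per_sec * SECONDS_PER_DAY


def days_needed(total_files: int, files_per_sec: float) -> float:
    """Estimate wall-clock days to download total_files at a sustained rate."""
    if files_per_sec <= 0:
        raise ValueError("files_per_sec must be positive")
    return total_files / files_per_day(files_per_sec)


if __name__ == "__main__":
    # Home connection range from the table: 10-15 files/sec.
    print(f"{files_per_day(10):,.0f} to {files_per_day(15):,.0f} files/day")
    # Ten million files at a cloud-VM rate of 50 files/sec.
    print(f"{days_needed(10_000_000, 50):.1f} days")
```

At 10 to 15 files/sec this gives 864,000 to 1,296,000 files per day, consistent with the "~1M files/day" estimate.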
Recommendations for large-scale downloads:
- Run from cloud - Deploy on AWS EC2, GCP, or Azure VMs with high network bandwidth. Instances close to Cloudflare edge locations will have lower latency.
- Increase workers - Use `--workers 150` or higher to saturate available bandwidth. Monitor with verbose mode to find the optimal setting.
- Use S3 storage - For very large downloads, stream directly to S3 instead of local disk:

      openalex download \
        --api-key KEY \
        --storage s3 \
        --s3-bucket my-corpus \
        --workers 200

- Parallelize across machines - For the full corpus, run multiple instances with different filters (e.g., by publication year) on separate machines.
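One way to split work by publication year is to generate one command per year and hand each to a different machine. The sketch below only builds command strings from flags documented above; giving each year its own output directory is an assumption made here so that checkpoints do not collide, and `commands_by_year` is a hypothetical helper.

```python
import shlex


def commands_by_year(first_year: int, last_year: int,
                     output_root: str = "./results",
                     extra_filter: str = "") -> list:
    """Return one `openalex download` command string per publication year."""
    commands = []
    for year in range(first_year, last_year + 1):
        year_filter = f"publication_year:{year}"
        if extra_filter:
            year_filter += "," + extra_filter
        argv = [
            "openalex", "download",
            "--api-key", "YOUR_API_KEY",
            # Assumption: a separate directory per year keeps checkpoints apart.
            "--output", f"{output_root}/{year}",
            "--filter", year_filter,
        ]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    for command in commands_by_year(2021, 2023, extra_filter="type:article"):
        print(command)
```

Each printed line can then be run on its own machine, and rerunning the same line resumes that year's download from its checkpoint.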
We plan to add more commands to the CLI, including:
- CSV/JSON export of search results
- More entity types beyond works
Have a feature request? Open an issue.
- Python 3.9+
- OpenAlex API key with sufficient credits
Full documentation: docs.openalex.org
MIT