OpenAlex Official CLI

Official command-line interface for OpenAlex. Download work metadata and full-text content (PDFs, TEI XML) in bulk.

Note: This package was formerly known as openalex-content-downloader. If you have that installed, please switch to openalex-official.

Installation

pip install openalex-official

Quick Start

# Download metadata for works matching a filter
openalex download \
  --api-key YOUR_API_KEY \
  --output ./results \
  --filter "topics.id:T10325"

# Download metadata + PDFs
openalex download \
  --api-key YOUR_API_KEY \
  --output ./results \
  --filter "topics.id:T10325" \
  --content pdf

# Download metadata + PDFs + TEI XML
openalex download \
  --api-key YOUR_API_KEY \
  --output ./results \
  --filter "topics.id:T10325" \
  --content pdf,xml

# Download specific works by ID or DOI
openalex download \
  --api-key YOUR_API_KEY \
  --output ./results \
  --ids "W2741809807,10.1038/nature12373"

# Download from a list of IDs via stdin
cat work_ids.txt | openalex download \
  --api-key YOUR_API_KEY \
  --output ./results \
  --stdin

# Download to S3
openalex download \
  --api-key YOUR_API_KEY \
  --storage s3 \
  --s3-bucket my-bucket \
  --s3-prefix openalex/ \
  --filter "topics.id:T12345"

# Check API key status
openalex status --api-key YOUR_API_KEY

Features

Metadata-first approach - JSON metadata is always saved; content files are optional
High-throughput async downloads - Configurable concurrency for millions of works
Automatic checkpointing - Resume interrupted downloads without re-downloading
Adaptive rate limiting - Automatically adjusts to API conditions
Multiple storage backends - Local filesystem or S3
Progress tracking - Rich terminal UI with live stats, or headless logging
Flexible filtering - Use any OpenAlex filter syntax
Multiple input modes - Filter, explicit IDs, or piped stdin
DOI support - Auto-detects and resolves DOIs to OpenAlex work IDs

CLI Reference

`openalex download`

Download work metadata and optionally content (PDFs, TEI XML).

Option	Description	Default
`--api-key`	OpenAlex API key (required)	`$OPENALEX_API_KEY`
`--output`, `-o`	Output directory	`./openalex-downloads`
`--storage`	Storage backend: `local` or `s3`	`local`
`--s3-bucket`	S3 bucket name	-
`--s3-prefix`	S3 key prefix	`""`
`--filter`	OpenAlex filter string	None (all works)
`--ids`	Comma-separated work IDs or DOIs	-
`--stdin`	Read work IDs/DOIs from stdin	`false`
`--content`	Content to download: `pdf`, `xml`, or `pdf,xml`	None (metadata only)
`--nested`	Use nested folder structure (W##/##/)	`false`
`--workers`	Concurrent download workers (1-200)	`50`
`--resume/--no-resume`	Resume from checkpoint	`true`
`--fresh`	Ignore checkpoint, start fresh	`false`
`--quiet`, `-q`	Minimal output (log file only)	`false`
`--verbose`, `-v`	Extra debug output	`false`

`openalex status`

Check API key status and credit information.

Option	Description
`--api-key`	OpenAlex API key (required)

Filter Examples

# Recent articles
--filter "publication_year:>2020,type:article"

# Specific topic
--filter "topics.id:T12345"

# From a specific institution
--filter "authorships.institutions.id:I123456789"

# Open access only
--filter "open_access.is_oa:true"

# Combined filters
--filter "publication_year:2023,type:article,open_access.is_oa:true"

See OpenAlex filter documentation for all available filters.

File Organization

By default, files are saved flat in the output directory. Metadata is always saved as JSON:

output/
├── W2741809807.json     # metadata (always saved)
├── W2741809807.pdf      # content (if --content pdf)
├── W2741809807.tei.xml  # content (if --content xml)
├── W1234567890.json
└── .openalex-checkpoint.json

For large downloads (>10,000 files), use --nested to organize files in a nested structure that avoids filesystem issues:

output/
├── W27/
│   └── 41/
│       ├── W2741809807.json
│       └── W2741809807.pdf
├── W12/
│   └── 34/
│       └── W1234567890.json
└── .openalex-checkpoint.json

When downloading by DOI, files are named using the DOI (with / replaced by _):

output/
├── 10.1038_nature12373.json
└── 10.1038_nature12373.pdf

Checkpointing

The downloader automatically saves progress to .openalex-checkpoint.json in the output directory. If interrupted, run the same command again to resume.

To start fresh and ignore the checkpoint:

openalex download --api-key KEY --output ./data --fresh

Logging

All activity is logged to openalex-download.log in the output directory, regardless of terminal mode.

High-Throughput Deployment

The download speed is typically limited by network bandwidth, not the tool or API. On a typical home connection (~400 Mbps), expect ~10-15 files/sec (~1M files/day). To achieve higher throughput, deploy from a cloud environment.

Performance scaling:

Environment	Bandwidth	Workers	Expected Rate
Home connection	400 Mbps	50	~10-15 files/sec
Cloud VM (standard)	1-5 Gbps	100-150	~30-50 files/sec
Cloud VM (high-perf)	10+ Gbps	200-300	~60+ files/sec

Recommendations for large-scale downloads:

Run from cloud - Deploy on AWS EC2, GCP, or Azure VMs with high network bandwidth. Instances close to Cloudflare edge locations will have lower latency.
Increase workers - Use --workers 150 or higher to saturate available bandwidth. Monitor with verbose mode to find the optimal setting.

Use S3 storage - For very large downloads, stream directly to S3 instead of local disk:

openalex download \
  --api-key KEY \
  --storage s3 \
  --s3-bucket my-corpus \
  --workers 200

Parallelize across machines - For the full corpus, run multiple instances with different filters (e.g., by publication year) on separate machines.

Roadmap

We plan to add more commands to the CLI, including:

CSV/JSON export of search results
More entity types beyond works

Have a feature request? Open an issue.

Requirements

Python 3.9+
OpenAlex API key with sufficient credits

Documentation

Full documentation: docs.openalex.org

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.github/workflows		.github/workflows
examples		examples
src/openalex_cli		src/openalex_cli
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OpenAlex Official CLI

Installation

Quick Start

Features

CLI Reference

`openalex download`

`openalex status`

Filter Examples

File Organization

Checkpointing

Logging

High-Throughput Deployment

Roadmap

Requirements

Documentation

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

OpenAlex Official CLI

Installation

Quick Start

Features

CLI Reference

openalex download

openalex status

Filter Examples

File Organization

Checkpointing

Logging

High-Throughput Deployment

Roadmap

Requirements

Documentation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`openalex download`

`openalex status`

Packages