Helpers to Build the Theorem Search Dataset
arXiTeX is a Python package for constructing a large-scale corpus of mathematical theorems and their informal, natural-language representations ("slogans").
It is designed to support:
- Theorem-level semantic search
- AI proof assistants
- Mathematical retrieval and reasoning systems
- Dataset construction for ML research
Our work powers the theorem search system described in our paper and demo:
- Paper: https://arxiv.org/abs/2602.05216
- Demo: https://huggingface.co/spaces/uw-math-ai/theorem-search
- Dataset subset: https://huggingface.co/datasets/uw-math-ai/theorem-search-dataset
The full internal dataset contains:
- 9M+ theorems
- From ~700,000 papers
- Sourced from arxiv (primarily), ProofWiki, The Stacks Project, and many more
However, only ~15% of arXiv papers use permissive licenses (CC BY, CC BY-SA, CC0). For that reason, we cannot distribute the full dataset.
Instead, arXiTeX provides the tooling to rebuild the dataset yourself.
arXiTeX provides helpers to:
- Catalog papers and metadata
- Parse LaTeX sources into structured theorem objects
- Extract theorem bodies, proofs, labels, and references
- Validate parsed theorems
It does NOT:
- Provide database code
- Perform upserts
- Manage infrastructure
We hosted our database on AWS internally. For reference implementation scripts, see our repo: https://github.com/uw-math-ai/TheoremSearch/tree/main/pipeline
Install directly from GitHub:
pip install git+https://github.com/uw-math-ai/arXiTeX.git
We recommend installing inside a virtual environment.
Use paper_catalog to iterate through papers in batches.
from arXiTeX import paper_catalog
from arXiTeX.types import Paper
CATEGORIES = ["stat.ML", "math"]
for paper_batch in paper_catalog(
categories=CATEGORIES,
batch_size=100,
):
for paper in paper_batch:
print(paper.id, paper.title)This yields batches of structured Paper objects.
We use SemanticScholar for citation counts. Therefore, the generator works best with a SemanticScholar API key (https://www.semanticscholar.org/product/api):
In a .env:
SEMANTIC_SCHOLAR_API_KEY=...
Otherwise, please consider using a tiny batch_size (e.g. 1 or 2) and adding delays in your paper_catalog loop.
You can parse either:
- A local LaTeX source directory
- An arXiv ID (automatically downloaded)
from arXiTeX import parse_paper
from arXiTeX.types import Theorem, TheoremValidationLevel
theorems_from_local: list[Theorem] = parse_paper(
paper_path="path/to/paper",
validation_level=TheoremValidationLevel.Paper,
timeout=10,
)
theorems_from_arxiv: list[Theorem] = parse_paper(
arxiv_id="2109.06451"
)Each call returns a list of validated Theorem objects.
We use pydantic for schema validation and Python typing for clarity
and safety.
class Paper(BaseModel):
id: str
title: str
license: Optional[str]
authors: List[str]
link: str
abstract: str
journal_ref: Optional[str]
categories: List[str]
citations: Optional[int]Notes:
authorsare formatted as "First Middle Last"categories[0]is the primary categorylicenseis stored as a URL when availablecitationsmay be None if unavailable
class Theorem(BaseModel):
type: TheoremType
ref: str
note: Optional[str]
label: Optional[str]
body: str
proof: Optional[str]Notes:
typeis eitherTheorem,Lemma,Proposition, orCorollarynoteis an optional theorem title or captionlabelis an internal LaTeX label (i.e\label{...})bodyis the raw LaTeX theorem body with simple macros resolvedproofis the raw LaTeX proof (if present) with simple macros resolved
We welcome issues and pull requests. If you are building theorem-level search, proof assistants, or structured math datasets, we would love to collaborate.