A tool that automatically processes textbooks to generate structured question-answer datasets for machine learning and reinforcement learning training.
- Extract Text: Converts PDF textbooks to text format
- Process Pages: Splits content into manageable chunks (3-5 pages)
- Generate Questions: Uses AI models to create questions from each chunk
- Verify Solutions: Feeds the source pages back to the model with each question and checks that it reproduces the same answer
- Output Dataset: Saves structured JSON files with questions, solutions, and metadata
The tool supports multiple difficulty levels (high school → PhD) and various AI models (OpenAI, Anthropic, DeepSeek, etc.).
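The chunking and verification steps above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: `group_pages`, `normalize`, `verify`, and the `answer_fn` callback are hypothetical names, and comparing answers by normalized string equality is an assumption about how "same answer" might be checked.

```python
from typing import Callable, Iterator


def group_pages(pages: list[str], pages_per_group: int = 3) -> Iterator[str]:
    """Yield consecutive groups of pages, each joined into one text chunk."""
    if pages_per_group < 1:
        raise ValueError("pages_per_group must be at least 1")
    for start in range(0, len(pages), pages_per_group):
        yield "\n".join(pages[start:start + pages_per_group])


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(answer.lower().split())


def verify(chunk: str, question: str, solution: str,
           answer_fn: Callable[[str, str], str]) -> bool:
    """Re-ask the question against the same chunk and compare the answers.

    answer_fn stands in for a model call (hypothetical): it receives the
    chunk and the question and returns the model's answer as a string.
    """
    return normalize(answer_fn(chunk, question)) == normalize(solution)


def fake_model(chunk: str, question: str) -> str:
    """Stand-in for a real model call; always answers "42"."""
    return " 42 "


if __name__ == "__main__":
    pages = [f"page {i}" for i in range(1, 8)]
    chunks = list(group_pages(pages, pages_per_group=3))
    print(len(chunks))  # 7 pages in groups of 3 -> 3 chunks
    print(verify(chunks[0], "What is 6 * 7?", "42", fake_model))
```

Passing the model call in as a function keeps the sketch runnable without API keys. A real verifier would likely need a more forgiving comparison than exact string equality, for example for numeric or symbolic answers.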
- Install dependencies:
uv sync

- Set up API keys:
export OPENAI_API_KEY='your-key'
export ANTHROPIC_API_KEY='your-key'
export DEEPSEEK_API_KEY='your-key'
export DEEPINFRA_API_KEY='your-key'

# Generate questions from textbooks
uv run textbooks-to-rl --help
# Run with custom settings
uv run textbooks-to-rl \
--model "Qwen/QwQ-32B" \
--output-dir "my_questions" \
--difficulty undergrad \
--verbose

Place PDF files in textbooks/pdfs/ or process them:

uv run python scripts/process_pdfs.py

# Basic generation
uv run textbooks-to-rl
# With custom options
uv run textbooks-to-rl \
--pages-per-group 5 \
--batch-size 50 \
--questions-per-chunk 8 \
--difficulty grad \
--no-verify

uv run python scripts/filter.py \
--folders generated_questions \
--output-dir filtered_results \
--model gpt-4o-mini

| Option | Description | Default |
|---|---|---|
| `--model` | AI model for generation | `Qwen/QwQ-32B` |
| `--output-dir` | Output directory | `generated_questions` |
| `--textbooks-dir` | Textbooks directory | `textbooks/txt` |
| `--pages-per-group` | Pages per processing group | `3` |
| `--batch-size` | Parallel batch size | `100` |
| `--questions-per-chunk` | Questions per chunk | `10` |
| `--difficulty` | Question difficulty level | `undergrad` |
| `--no-verify` | Skip solution verification | `False` |
| `--verbose` | Enable debug logging | `False` |
# Install with dev dependencies
uv sync --dev
# Run code quality checks
uv run ruff check src/ --fix
uv run pytest