This project benchmarks different AI models on Advent of Code challenges. It automatically fetches puzzles, generates solutions using various LLMs via OpenRouter, and validates them against the example answers.
The benchmark follows this flow for each model and day:
- Fetch puzzle - Uses `advent-of-code-data` to get the puzzle text and example input/answers
- Generate solution - Sends the puzzle to the LLM via OpenRouter, asking it to write a Python function
- Lint & type check - Runs `ruff` and `ty` on the generated code; if issues are found, asks the LLM to fix them (up to 3 attempts)
- Test - Executes the code directly and compares its output against the example answer
- Retry on failure - If the test fails, sends the error back to the LLM for another attempt (up to 10 attempts); see the sketch after this list
- Part B - If Part A passes, proceeds to Part B using the same process
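The generate, test, and retry steps above can be pictured roughly as follows. This is a minimal sketch: `generate` and `execute` are hypothetical callables standing in for the real OpenRouter client and code-execution step, not the project's actual implementation.

```python
from typing import Callable

# Minimal sketch of the generate -> test -> retry loop described above.
# `generate` and `execute` are hypothetical stand-ins for the real
# OpenRouter client and code-execution step in the benchmark package.
def solve_part(
    puzzle_text: str,
    example_input: str,
    example_answer: str,
    generate: Callable[[str], str],      # prompt -> Python source code
    execute: Callable[[str, str], str],  # (code, example input) -> printed output
    max_attempts: int = 10,
) -> str | None:
    prompt = puzzle_text
    for _ in range(max_attempts):
        code = generate(prompt)
        output = execute(code, example_input)
        if output == example_answer:
            return code  # example answer matched; Part B would proceed from here
        # Feed the mismatch back to the model and try again.
        prompt = (
            f"{puzzle_text}\n\nYour previous solution printed {output!r}, "
            f"but the example answer is {example_answer!r}. Please fix the code."
        )
    return None  # exhausted all attempts
```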
All models run in parallel using async/await, with a live progress table showing status, attempts, time, and cost. Each model operates in isolation - no shared files or state between parallel runs.
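A minimal sketch of that parallel layout, assuming a hypothetical `run_day` coroutine that performs the fetch/generate/test cycle for one model and day (model names here are illustrative OpenRouter-style identifiers):

```python
import asyncio

async def run_day(model: str, day: int) -> str:
    # Placeholder for the real fetch -> generate -> lint -> test -> retry cycle.
    await asyncio.sleep(0.1)
    return f"{model} day {day}: ok"

async def run_model(model: str, days: list[int]) -> list[str]:
    # Days for one model run sequentially; models run concurrently.
    return [await run_day(model, day) for day in days]

async def main() -> None:
    models = ["openai/gpt-4o", "anthropic/claude-3.5-sonnet"]
    results = await asyncio.gather(*(run_model(m, [1, 2, 3]) for m in models))
    for model, outcomes in zip(models, results):
        print(model, outcomes)

if __name__ == "__main__":
    asyncio.run(main())
```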
This project relies on the advent-of-code-data package to fetch puzzles and verify answers. It requires your AoC session cookie to access puzzle data.
To get your session cookie:
- Log in to adventofcode.com
- Open browser dev tools (F12) → Application → Cookies
- Copy the value of the `session` cookie
The package caches puzzle data locally, so subsequent runs don't hit the AoC servers.
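As an illustration of what the benchmark gets from `advent-of-code-data`, a puzzle and its example data can be fetched roughly like this (based on aocd's `Puzzle` API; exact attributes may vary between versions):

```python
from aocd.models import Puzzle

# Fetch a puzzle; aocd reads the AOC_SESSION environment variable and
# caches the downloaded data locally.
puzzle = Puzzle(year=2024, day=1)

example = puzzle.examples[0]      # example block parsed from the puzzle page
print(example.input_data)         # example input text
print(example.answer_a)           # expected Part A answer for the example
print(len(puzzle.input_data))     # your personal puzzle input
```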
This project uses uv for package management.
- Clone the repository:

  ```bash
  git clone https://github.com/japborst/aoc-ai-benchmark.git
  cd aoc-ai-benchmark
  ```

- Install dependencies:

  ```bash
  uv sync
  ```
The benchmark requires API keys for the AI models and an Advent of Code session token.
- Create a `.env` file:

  ```bash
  cp .env.example .env
  ```

- Fill in the environment variables (see the example below):

  - `AOC_SESSION`: Your Advent of Code session cookie
  - `OPENROUTER_API_KEY`: Your OpenRouter API key (provides access to multiple LLM providers)
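A filled-in `.env` might look like this (values are placeholders):

```
AOC_SESSION=your-aoc-session-cookie-value
OPENROUTER_API_KEY=your-openrouter-api-key
```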
Run the benchmark using the `benchmark.runner` module:

```bash
uv run python -m benchmark.runner [OPTIONS]
```

| Option | Description |
|---|---|
| `-m, --model TEXT` | Run only a specific model (e.g., `openai/gpt-4o`) |
| `-d, --days TEXT` | Days to run: single (`3`), range (`1-5`), or list (`1,3,5`) |
| `-r, --runs INT` | Number of runs per model/day (default: 1) |
| `-a, --attempts INT` | Max attempts per part (default: 10) |
| `-v, --verbose` | Show generated code during the run |
| `--stop-on-fail` | Stop a model when it fails a day (default: continue) |
| `--debug/--no-debug` | Save generated code to `debug/` (default: enabled) |
```bash
# Run full benchmark for all models
uv run python -m benchmark.runner

# Run specific days
uv run python -m benchmark.runner --days 5
uv run python -m benchmark.runner --days 1-5
uv run python -m benchmark.runner --days 1,3,5,7

# Run single model
uv run python -m benchmark.runner --model openai/gpt-4o

# Stop models on first failure (saves cost on hard problems)
uv run python -m benchmark.runner --stop-on-fail

# Verbose output
uv run python -m benchmark.runner --days 1 --verbose

# By default, generated code is stored in debug/. This can be disabled.
uv run python -m benchmark.runner --no-debug
```

Results are saved incrementally as JSON files in `results/` with timestamps. If interrupted (Ctrl+C), all completed days are preserved.
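The incremental-saving idea can be sketched as follows; field and file names here are illustrative, not the exact schema used by `benchmark/results.py`:

```python
import json
import time
from pathlib import Path

# Write each completed (model, run, day) result to its own timestamped file,
# so an interrupted benchmark only loses the day currently in progress.
def save_result(model: str, run: int, day: int, outcome: dict,
                results_dir: str = "results") -> Path:
    Path(results_dir).mkdir(exist_ok=True)
    record = {"model": model, "run": run, "day": day, **outcome}
    filename = f"{int(time.time())}_{model.replace('/', '_')}_run{run}_day{day}.json"
    path = Path(results_dir) / filename
    path.write_text(json.dumps(record, indent=2))
    return path
```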
Use the plotter to visualize benchmark results:
```bash
uv run python -m benchmark.plotter
```

This auto-loads all JSON files from `results/`, deduplicates by (model, run, day) (keeping the latest), and generates:
- An overview of days solved, total time, and total cost per model (sorted by performance)
- A line chart showing cumulative days solved over time - a "race" visualization
- Total cost and cost efficiency (days solved per dollar)
- Partial runs: Run subsets of models/days anytime. The plotter combines all files automatically.
- Re-running: If you re-run a benchmark for the same model/day, the newer result supersedes the old one (see the sketch below).
- Deleting invalid runs: Simply delete the JSON file from `results/`.
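The dedup rule mentioned above can be sketched like this, assuming one JSON record per file with hypothetical `model`/`run`/`day` fields (the real schema lives in `benchmark/results.py`):

```python
import json
from pathlib import Path

# Keep only the newest record per (model, run, day), by walking result files
# in order of modification time so later runs overwrite earlier ones.
def load_latest(results_dir: str = "results") -> dict[tuple, dict]:
    latest: dict[tuple, dict] = {}
    for path in sorted(Path(results_dir).glob("*.json"), key=lambda p: p.stat().st_mtime):
        record = json.loads(path.read_text())
        key = (record["model"], record["run"], record["day"])
        latest[key] = record
    return latest
```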
```
aoc/
├── aoc/            # Solution code (for manual testing)
├── benchmark/
│   ├── config.py   # Model list, prompts, constants
│   ├── models.py   # OpenAI/OpenRouter async API client
│   ├── runner.py   # Main benchmark orchestration
│   ├── results.py  # Result tracking and persistence
│   ├── plotter.py  # Visualization generation
│   └── prices.yaml # Token pricing for cost calculation
├── results/        # JSON benchmark results
├── plots/          # Generated visualizations
└── debug/          # Debug output (generated code per attempt)
```


