This project benchmarks different AI models on Advent of Code challenges. It automatically fetches puzzles, generates solutions using various LLMs via OpenRouter, and validates them against the example answers.
The benchmark follows this flow for each model and day:
- Fetch puzzle - Uses `advent-of-code-data` to get the puzzle text and example input/answers
- Generate solution - Sends the puzzle to the LLM via OpenRouter, asking it to write a Python function
- Lint & type check - Runs `ruff` and `ty` on the generated code; if issues are found, asks the LLM to fix them (up to 3 attempts)
- Test - Executes the code directly and compares its output against the example answer
- Retry on failure - If the test fails, sends the error back to the LLM for another attempt (up to 10 attempts); see the sketch after this list
- Part B - If Part A passes, proceeds to Part B using the same process
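The generate, test, and retry steps above can be pictured roughly as follows. This is a minimal sketch: `generate` and `execute` are hypothetical callables standing in for the real OpenRouter client and code-execution step, not the project's actual implementation.

```python
from typing import Callable

# Minimal sketch of the generate -> test -> retry loop described above.
# `generate` and `execute` are hypothetical stand-ins for the real
# OpenRouter client and code-execution step in the benchmark package.
def solve_part(
    puzzle_text: str,
    example_input: str,
    example_answer: str,
    generate: Callable[[str], str],      # prompt -> Python source code
    execute: Callable[[str, str], str],  # (code, example input) -> printed output
    max_attempts: int = 10,
) -> str | None:
    prompt = puzzle_text
    for _ in range(max_attempts):
        code = generate(prompt)
        output = execute(code, example_input)
        if output == example_answer:
            return code  # example answer matched; Part B would proceed from here
        # Feed the mismatch back to the model and try again.
        prompt = (
            f"{puzzle_text}\n\nYour previous solution printed {output!r}, "
            f"but the example answer is {example_answer!r}. Please fix the code."
        )
    return None  # exhausted all attempts
```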
All models run in parallel using async/await, with a live progress table showing status, attempts, time, and cost. Each model operates in isolation - no shared files or state between parallel runs.
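A minimal sketch of that parallel layout, assuming a hypothetical `run_day` coroutine that performs the fetch/generate/test cycle for one model and day (model names here are illustrative OpenRouter-style identifiers):

```python
import asyncio

async def run_day(model: str, day: int) -> str:
    # Placeholder for the real fetch -> generate -> lint -> test -> retry cycle.
    await asyncio.sleep(0.1)
    return f"{model} day {day}: ok"

async def run_model(model: str, days: list[int]) -> list[str]:
    # Days for one model run sequentially; models run concurrently.
    return [await run_day(model, day) for day in days]

async def main() -> None:
    models = ["openai/gpt-4o", "anthropic/claude-3.5-sonnet"]
    results = await asyncio.gather(*(run_model(m, [1, 2, 3]) for m in models))
    for model, outcomes in zip(models, results):
        print(model, outcomes)

if __name__ == "__main__":
    asyncio.run(main())
```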
This project relies on the advent-of-code-data package to fetch puzzles and verify answers. It requires your AoC session cookie to access puzzle data.
To get your session cookie:
- Log in to adventofcode.com
- Open browser dev tools (F12) → Application → Cookies
- Copy the value of the `session` cookie
The package caches puzzle data locally, so subsequent runs don't hit the AoC servers.
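As an illustration of what the benchmark gets from `advent-of-code-data`, a puzzle and its example data can be fetched roughly like this (based on aocd's `Puzzle` API; exact attributes may vary between versions):

```python
from aocd.models import Puzzle

# Fetch a puzzle; aocd reads the AOC_SESSION environment variable and
# caches the downloaded data locally.
puzzle = Puzzle(year=2024, day=1)

example = puzzle.examples[0]      # example block parsed from the puzzle page
print(example.input_data)         # example input text
print(example.answer_a)           # expected Part A answer for the example
print(len(puzzle.input_data))     # your personal puzzle input
```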
This project uses uv for package management.
- Clone the repository:

  ```bash
  git clone https://github.com/japborst/aoc-ai-benchmark.git
  cd aoc-ai-benchmark
  ```

- Install dependencies:

  ```bash
  uv sync
  ```
The benchmark requires API keys for the AI models and an Advent of Code session token.
- Create a `.env` file:

  ```bash
  cp .env.example .env
  ```

- Fill in the environment variables (see the example below):

  - `AOC_SESSION`: Your Advent of Code session cookie
  - `OPENROUTER_API_KEY`: Your OpenRouter API key (provides access to multiple LLM providers)
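A filled-in `.env` might look like this (values are placeholders):

```
AOC_SESSION=your-aoc-session-cookie-value
OPENROUTER_API_KEY=your-openrouter-api-key
```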
Run the benchmark using the `benchmark.runner` module:

```bash
uv run python -m benchmark.runner [OPTIONS]
```

| Option | Description |
|---|---|
| `-m, --model TEXT` | Run only a specific model (e.g., `openai/gpt-4o`) |
| `-d, --days TEXT` | Days to run: single (`3`), range (`1-5`), or list (`1,3,5`) |
| `-r, --runs INT` | Number of runs per model/day (default: 1) |
| `-a, --attempts INT` | Max attempts per part (default: 10) |
| `-v, --verbose` | Show generated code during the run |
| `--stop-on-fail` | Stop a model when it fails a day (default: continue) |
| `--debug/--no-debug` | Save generated code to `debug/` (default: enabled) |
```bash
# Run full benchmark for all models
uv run python -m benchmark.runner

# Run specific days
uv run python -m benchmark.runner --days 5
uv run python -m benchmark.runner --days 1-5
uv run python -m benchmark.runner --days 1,3,5,7

# Run single model
uv run python -m benchmark.runner --model openai/gpt-4o

# Stop models on first failure (saves cost on hard problems)
uv run python -m benchmark.runner --stop-on-fail

# Verbose output
uv run python -m benchmark.runner --days 1 --verbose

# By default, generated code is stored in debug/. This can be disabled.
uv run python -m benchmark.runner --no-debug
```

Results are saved incrementally as JSON files in `results/` with timestamps. If interrupted (Ctrl+C), all completed days are preserved.
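The incremental-saving idea can be sketched as follows; field and file names here are illustrative, not the exact schema used by `benchmark/results.py`:

```python
import json
import time
from pathlib import Path

# Write each completed (model, run, day) result to its own timestamped file,
# so an interrupted benchmark only loses the day currently in progress.
def save_result(model: str, run: int, day: int, outcome: dict,
                results_dir: str = "results") -> Path:
    Path(results_dir).mkdir(exist_ok=True)
    record = {"model": model, "run": run, "day": day, **outcome}
    filename = f"{int(time.time())}_{model.replace('/', '_')}_run{run}_day{day}.json"
    path = Path(results_dir) / filename
    path.write_text(json.dumps(record, indent=2))
    return path
```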
Use the plotter to visualize benchmark results:
```bash
uv run python -m benchmark.plotter
```

This auto-loads all JSON files from `results/`, deduplicates by (model, run, day) (keeping the latest), and generates:
- An overview of days solved, total time, and total cost per model (sorted by performance)
- A line chart showing cumulative days solved over time - a "race" visualization
- Total cost and cost efficiency (days solved per dollar)
- Partial runs: Run subsets of models/days anytime. The plotter combines all files automatically.
- Re-running: If you re-run a benchmark for the same model/day, the newer result supersedes the old one (see the sketch below).
- Deleting invalid runs: Simply delete the JSON file from `results/`.
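The dedup rule mentioned above can be sketched like this, assuming one JSON record per file with hypothetical `model`/`run`/`day` fields (the real schema lives in `benchmark/results.py`):

```python
import json
from pathlib import Path

# Keep only the newest record per (model, run, day), by walking result files
# in order of modification time so later runs overwrite earlier ones.
def load_latest(results_dir: str = "results") -> dict[tuple, dict]:
    latest: dict[tuple, dict] = {}
    for path in sorted(Path(results_dir).glob("*.json"), key=lambda p: p.stat().st_mtime):
        record = json.loads(path.read_text())
        key = (record["model"], record["run"], record["day"])
        latest[key] = record
    return latest
```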
```
aoc/
├── aoc/            # Solution code (for manual testing)
├── benchmark/
│   ├── config.py   # Model list, prompts, constants
│   ├── models.py   # OpenAI/OpenRouter async API client
│   ├── runner.py   # Main benchmark orchestration
│   ├── results.py  # Result tracking and persistence
│   ├── plotter.py  # Visualization generation
│   └── prices.yaml # Token pricing for cost calculation
├── results/        # JSON benchmark results
├── plots/          # Generated visualizations
└── debug/          # Debug output (generated code per attempt)
```


