Advent of Code - AI Benchmark

This project benchmarks different AI models on Advent of Code challenges. It automatically fetches puzzles, generates solutions using various LLMs via OpenRouter, and validates them against the example answers.

How It Works

The benchmark follows this flow for each model and day:

  1. Fetch puzzle - Uses advent-of-code-data to get the puzzle text and example input/answers
  2. Generate solution - Sends the puzzle to the LLM via OpenRouter, asking it to write a Python function
  3. Lint & type check - Runs ruff and ty on the generated code; if any issues are found, asks the LLM to fix them (up to 3 attempts)
  4. Test - Executes the code directly and compares output against the example answer
  5. Retry on failure - If the test fails, sends the error back to the LLM for another attempt (up to 10 attempts)
  6. Part B - If Part A passes, proceeds to Part B using the same process

All models run in parallel using async/await, with a live progress table showing status, attempts, time, and cost. Each model operates in isolation: no files or state are shared between parallel runs.
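The retry loop and parallel fan-out described above can be sketched as follows. This is a minimal, self-contained sketch: `attempt_solution` is a deterministic stub standing in for the real generate/lint/test cycle, and all function names here are illustrative, not the project's actual API.

```python
import asyncio

MAX_ATTEMPTS = 10  # matches the per-part retry limit described above

async def attempt_solution(model: str, day: int, attempt: int) -> bool:
    # Stand-in for "generate, lint, test"; a real run would call the LLM here.
    await asyncio.sleep(0)           # yield to the event loop, as a real API call would
    return attempt >= (day % 3) + 1  # deterministic stub: some days need more tries

async def run_model_day(model: str, day: int) -> dict:
    for attempt in range(1, MAX_ATTEMPTS + 1):
        if await attempt_solution(model, day, attempt):
            return {"model": model, "day": day, "attempts": attempt, "passed": True}
        # on failure, a real run would feed the error back into the next prompt
    return {"model": model, "day": day, "attempts": MAX_ATTEMPTS, "passed": False}

async def run_benchmark(models: list[str], days: list[int]) -> list[dict]:
    # every model/day pair is an independent task; nothing is shared between them
    return await asyncio.gather(*(run_model_day(m, d) for m in models for d in days))

results = asyncio.run(run_benchmark(["model-a", "model-b"], [1, 2]))
```

Because asyncio.gather preserves input order, the live progress table can map each result back to its model/day pair by position.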

Advent of Code Data

This project relies on the advent-of-code-data package to fetch puzzles and verify answers. It requires your AoC session cookie to access puzzle data.

To get your session cookie:

  1. Log in to adventofcode.com
  2. Open browser dev tools (F12) → Application → Cookies
  3. Copy the value of the session cookie

The package caches puzzle data locally, so subsequent runs don't hit the AoC servers.

Installation

This project uses uv for package management.

  1. Clone the repository:

    git clone https://github.com/japborst/aoc-ai-benchmark.git
    cd aoc-ai-benchmark
  2. Install dependencies:

    uv sync

Configuration

The benchmark requires API keys for the AI models and an Advent of Code session token.

  1. Create a .env file:

    cp .env.example .env
  2. Fill in the environment variables:

    • AOC_SESSION: Your Advent of Code session cookie
    • OPENROUTER_API_KEY: Your OpenRouter API key (provides access to multiple LLM providers)
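A filled-in .env might look like this (placeholder values shown, not real credentials):

```shell
# .env
AOC_SESSION=your-aoc-session-cookie
OPENROUTER_API_KEY=sk-or-your-openrouter-key
```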

Usage

Run the benchmark using the benchmark.runner module:

uv run python -m benchmark.runner [OPTIONS]

Options

Option Description
-m, --model TEXT Run only a specific model (e.g., openai/gpt-4o)
-d, --days TEXT Days to run: single (3), range (1-5), or list (1,3,5)
-r, --runs INT Number of runs per model/day (default: 1)
-a, --attempts INT Max attempts per part (default: 10)
-v, --verbose Show generated code during run
--stop-on-fail Stop a model when it fails a day (default: continue)
--debug/--no-debug Save generated code to debug/ (default: enabled)
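The --days syntax (single day, range, or comma-separated list) could be parsed with a helper along these lines. This is a hypothetical sketch; the actual CLI may parse the value differently.

```python
def parse_days(spec: str) -> list[int]:
    """Parse a --days value: a single day ("3"), a range ("1-5"), or a list ("1,3,5")."""
    if "," in spec:
        # comma-separated list of individual days
        return [int(part) for part in spec.split(",")]
    if "-" in spec:
        # inclusive range, e.g. "1-5" -> [1, 2, 3, 4, 5]
        start, end = spec.split("-")
        return list(range(int(start), int(end) + 1))
    return [int(spec)]
```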

Examples

# Run full benchmark for all models
uv run python -m benchmark.runner

# Run specific days
uv run python -m benchmark.runner --days 5
uv run python -m benchmark.runner --days 1-5
uv run python -m benchmark.runner --days 1,3,5,7

# Run single model
uv run python -m benchmark.runner --model openai/gpt-4o

# Stop models on first failure (saves cost on hard problems)
uv run python -m benchmark.runner --stop-on-fail

# Verbose output
uv run python -m benchmark.runner --days 1 --verbose

# By default, generated code is saved to debug/. This can be disabled:
uv run python -m benchmark.runner --no-debug

Benchmark Results

Results are saved incrementally as JSON files in results/ with timestamps. If interrupted (Ctrl+C), all completed days are preserved.
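The incremental-save behavior can be sketched like this: each finished day is flushed to its own timestamped file immediately, so an interrupt loses nothing that already completed. The filename scheme here is hypothetical, not necessarily what the runner actually writes.

```python
import json
import tempfile
import time
from pathlib import Path

def save_result(results_dir: Path, model: str, day: int, record: dict) -> Path:
    # Write one file per completed model/day as soon as it finishes, so a
    # Ctrl+C mid-run preserves every day that already ran.
    results_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    path = results_dir / f"{stamp}_{model.replace('/', '_')}_day{day}.json"
    path.write_text(json.dumps(record, indent=2))
    return path

out = save_result(Path(tempfile.mkdtemp()), "openai/gpt-4o", 5,
                  {"passed": True, "attempts": 2})
```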

Generating Plots

Use the plotter to visualize benchmark results:

uv run python -m benchmark.plotter

This auto-loads all JSON files from results/, deduplicates by (model, run, day) keeping only the latest result, and generates:

Summary

Overview of days solved, total time, and total cost per model (sorted by performance):


Time vs Days Solved

Line chart showing cumulative days solved over time - a "race" visualization:


Cost Analysis

Total cost and cost efficiency (days solved per dollar):


Managing Results

  • Partial runs: Run subsets of models/days anytime. The plotter combines all files automatically.
  • Re-running: If you re-run a benchmark for the same model/day, the newer result supersedes the old one.
  • Deleting invalid runs: Simply delete the JSON file from results/.
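The supersede-on-rerun behavior amounts to a keep-the-latest deduplication pass. A minimal sketch, assuming each record carries a sortable "timestamp" field (a hypothetical schema, not necessarily the project's actual JSON layout):

```python
def dedupe(records: list[dict]) -> list[dict]:
    # Keep only the newest record per (model, run, day): sorting oldest-first
    # means later records overwrite earlier ones in the dict.
    latest = {}
    for rec in sorted(records, key=lambda r: r["timestamp"]):
        latest[(rec["model"], rec["run"], rec["day"])] = rec
    return list(latest.values())
```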

Project Structure

aoc/
├── aoc/                    # Solution code (for manual testing)
├── benchmark/
│   ├── config.py          # Model list, prompts, constants
│   ├── models.py          # OpenAI/OpenRouter async API client
│   ├── runner.py          # Main benchmark orchestration
│   ├── results.py         # Result tracking and persistence
│   ├── plotter.py         # Visualization generation
│   └── prices.yaml        # Token pricing for cost calculation
├── results/               # JSON benchmark results
├── plots/                 # Generated visualizations
└── debug/                 # Debug output (generated code per attempt)
