SWT-bench image build throughput tracker (historical source of truth) #530

@simonrosenberg

Description

This issue serves as the historical source of truth for SWT-bench image build throughput. It tracks every workflow run that built at least 50 images, with careful accounting of truly built vs. skipped images.

Build throughput over time

| Date | Status | SDK SHA | Benchmarks SHA | Images Built | Images Skipped | Build Duration | Throughput (img/h) | Run |
|------|--------|---------|----------------|--------------|----------------|----------------|--------------------|-----|
| 2026-02-25 | ✅ completed | bde31c4 | 7d81e21 | 225 | 0 | 5h57m | 37.8 | #22410410373 |
| 2026-02-27 | ✅ completed | cefaebf | 744df22 | 364 | 69 | 5h29m | 66.4 | #22508380588 |
| 2026-02-27 | ✅ completed | cefaebf | 744df22 | 379 | 54 | 5h12m | 72.9 | #22508424484 |
| 2026-02-27 | ✅ completed | cefaebf | 744df22 | 382 | 51 | 5h03m | 75.6 | #22508430612 |
| 2026-03-04 | ✅ completed | eab666f | 09c9fb1 | 50 | 0 | 55m18s | 54.2 | #22670483404 |
| 2026-03-04 | ✅ completed | 3c2bce2 | 863df2a | 50 | 0 | 1h00m | 50.0 | #22671307073 |
| 2026-03-05 | 🟡 cancelled | 63565b9 | 981a95a | 219 | 18 | 5h58m | 36.7 | #22715409658 |
| 2026-03-06 | 🟡 cancelled | 8e3e306 | 1f29020 | 275 | 0 | 5h59m | 46.0 | #22747565501 |
| 2026-03-11 | 🟡 cancelled | 3081956 | 4d786be | 246 | 34 | 5h55m | 41.6 | #22951682281 |
| 2026-03-11 | ✅ completed | 3081956 | 4d786be | 145 | 288 | 4h08m | 35.1 | #22967516108 |
| 2026-03-11 | 🟡 cancelled | 46f3d78 | e6f6da4 | 262 | 56 | 5h59m | 43.8 | #22974200370 |
| 2026-03-12 | 🟡 cancelled | b498a69 | 36cb585 | 196 | 14 | 3h02m | 64.6 | #22985176798 |
| 2026-03-12 | 🟡 cancelled | aa9df69 | 49a94d1 | 112 | 1 | 1h32m | 73.0 | #23001334237 |
| 2026-03-12 | 🟡 cancelled | aa9df69 | c614f83 | 59 | 119 | 56m40s | 62.5 | #23005078511 |
| 2026-03-12 | 🟡 cancelled | d0c1a39 | 355829e | 158 | 12 | 2h48m | 56.4 | #23008709458 |
| 2026-03-12 | 🟡 cancelled | 447aa91 | 22ea7b5 | 68 | 0 | 2h55m | 23.3 | #23028385721 |
| 2026-03-13 | 🟡 cancelled | 8e8223b | 277cb51 | 201 | 0 | 2h54m | 69.3 | #23033839708 |
| 2026-03-13 | ✅ completed | d94cdf8 | 2579041 | 161 | 272 | 2h55m | 55.2 | #23035485709 |
| 2026-03-13 | ✅ completed | 8e8223b | 039aebd | 230 | 203 | 4h28m | 51.5 | #23043936501 |
| 2026-03-16 | 🟡 cancelled | d129025 | e326a67 | 409 | 0 | 9h56m | 41.2 | #23164396524 |
| 2026-03-17 | 🟡 cancelled | c34cb27 | bedaad9 | 343 | 7 | 9h58m | 34.4 | #23198836788 |
| 2026-03-17 | 🟡 cancelled | c34cb27 | 76b2dbc | 348 | 1 | 9h57m | 35.0 | #23198839061 |
| 2026-03-17 | 🟡 cancelled | d2d5d47 | e326a67 | 314 | 35 | 9h57m | 31.6 | #23202392035 |
| 2026-03-18 | ✅ completed | 2d027b4 | e2132d1 | 392 | 41 | 9h20m | 42.0 | #23226296458 |

Reading the table

  • Images Built = images that were actually built and pushed to GHCR by this run (excludes skipped).
  • Images Skipped = images that already existed in GHCR and were not rebuilt.
  • Build Duration = wall-clock time spent building (from tqdm elapsed or the build summary timer; excludes job setup).
  • Throughput (img/h) = images_built / build_duration_hours — the measured build rate during this run.
  • Status: ✅ completed, ❌ failed (hit timeout), 🟡 cancelled (hit timeout or manually cancelled).
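The throughput column is plain arithmetic over the two preceding columns. As a quick sanity check, here is the first 2026-02-27 row recomputed (a minimal sketch, not part of the workflow code):

```python
def throughput_img_per_hour(images_built: int, duration_seconds: int) -> float:
    """Measured build rate: images actually built per wall-clock hour of building."""
    return images_built / (duration_seconds / 3600)

# First 2026-02-27 run: 364 images truly built in 5h29m
duration_s = 5 * 3600 + 29 * 60
print(round(throughput_img_per_hour(364, duration_s), 1))  # → 66.4, matching the table
```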

Caveats on throughput comparisons across runs:

  • Throughput is non-linear at scale: runs that build fewer images may not hit disk pressure or BuildKit prune events, inflating their measured throughput relative to full 433-image runs.
  • Follow-up runs with high skip counts (e.g., 272/288 skips) only built images missed by prior cancelled runs. They build a biased subset (often the slower-to-build images left over), so their throughput may not be representative.
  • Throughput varies within a single run: profiling of run #23382357696 showed 61.3 img/h pre-prune dropping to 25-28 img/h during prune batches.

Key observations

  • Throughput regressed significantly. Late February runs (SDK cefaebf) measured 66-76 img/h on 364-382 images. Mid-March runs measured 31-42 img/h on 314-409 images.
  • Many cancelled runs: The job timeout (6h, later increased to 10h) is routinely hit before all images finish, requiring follow-up runs to complete the remaining images.
  • One extreme outlier: Run #23028385721 on 2026-03-12 (SDK 447aa91) measured only 23.3 img/h on 68 images.
  • Feb 27 parallel runs: Three runs with the same SDK/benchmarks SHA launched concurrently. The "skipped" images in each were built by sibling runs first. Each independently built 364–382 images.

Methodology

Why counting "real builds" is non-trivial

The build workflow checks GHCR for each image before building it (remote_image_exists() check). If the image already exists, it's skipped. A run can report "Built: 433/433" while only truly building a fraction — the rest were skipped because they were already pushed by a prior or concurrent run. If we don't distinguish, a mostly-skipping run looks blazingly fast when it isn't.
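The actual remote_image_exists() implementation is not shown in this issue; a hedged sketch of such a check, using `docker manifest inspect` (which exits 0 only when the manifest is present, one common way to probe a registry without pulling), might look like:

```python
import subprocess

def remote_image_exists(image_ref: str, runner=subprocess.run) -> bool:
    """Return True if the image manifest already exists in the registry.

    Hypothetical sketch only: the workflow's real implementation may differ.
    The `runner` parameter is injectable so the function can be exercised
    without a registry or a Docker daemon.
    """
    result = runner(
        ["docker", "manifest", "inspect", image_ref],
        capture_output=True, text=True,
    )
    return result.returncode == 0
```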

How real build counts were determined

The log format changed over time, so two different extraction strategies are needed:

Newer runs (mid-March+): The build summary line directly reports separate counts:

Done in 33633.7s. Built=392 Skipped=41 Failed=0 Retried=6 Throughput=46.3 images/hour

The progress bar also distinguishes built from skipped: 🛠 392 ⏭ 41 ❌ 0 🏃 0

Older runs (Feb–early March): The summary lumps everything together:

Done. Built=433  Failed=0  Manifest=...

The progress bar shows ✅ 433 ❌ 0 🏃 0, where the ✅ count includes both built AND skipped. However, each skipped image produces a distinct log line:

Image ghcr.io/openhands/eval-agent-server:010e847-sweb.eval.x86_64.sympy_1776_sympy-24213-source-minimal already exists. Skipping build.

So for old-format runs: real_built = reported_total - count("already exists. Skipping build."). This is verified: the count of skip messages equals the count of unique images skipped (no double-counting).
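A minimal sketch of that correction, applied to a fabricated two-skip excerpt in the old log format described above:

```python
import re

# Fabricated excerpt mimicking the old log format (tags are placeholders).
log_excerpt = """\
Image ghcr.io/openhands/eval-agent-server:tag-a already exists. Skipping build.
Image ghcr.io/openhands/eval-agent-server:tag-b already exists. Skipping build.
Done. Built=433  Failed=0
"""

# Count distinct skip messages, then subtract from the lumped total.
skip_count = len(re.findall(r"already exists\. Skipping build\.", log_excerpt))
reported_total = int(re.search(r"Done\.\s+Built=(\d+)", log_excerpt).group(1))
real_built = reported_total - skip_count
print(skip_count, real_built)  # → 2 431
```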

Cancelled runs: The last tqdm progress bar line gives the final built/skipped counts at the time of cancellation.

How build duration was measured

  • New-format runs: The exact build process time from the summary (Done in Xs.).
  • Old-format and cancelled runs: The tqdm elapsed time from the last progress bar line (e.g., [5:29:27<00:00, 45.65s/it]), which excludes job setup overhead (checkout, docker login, uv install — typically ~1–2 minutes).
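Extracting that elapsed field can be sketched as follows (same idea as the script's parse_tqdm_elapsed, combined with the bracket regex):

```python
import re

def tqdm_elapsed_seconds(line: str):
    """Extract the elapsed time ('H:MM:SS' or 'MM:SS') from a tqdm bar line,
    i.e. the part between '[' and '<'. Returns None if no bar is present."""
    m = re.search(r"\[(\d+(?::\d+){1,2})<", line)
    if not m:
        return None
    seconds = 0
    for part in m.group(1).split(":"):  # works for both H:MM:SS and MM:SS
        seconds = seconds * 60 + int(part)
    return seconds

print(tqdm_elapsed_seconds("100%|████| 433/433 [5:29:27<00:00, 45.65s/it]"))  # → 19767
```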

Script

Prerequisites: gh CLI authenticated with access to OpenHands/benchmarks. Run with:

python3 analyze_swtbench_builds.py 2>stderr.log > results.json
analyze_swtbench_builds.py:
#!/usr/bin/env python3
"""Analyze swtbench image build workflow runs to determine real build counts vs skipped.

Downloads CI logs for each run via GitHub API, parses them to distinguish
truly-built images from skipped ones (already in GHCR), and computes
per-image build time and throughput.

Handles two log formats:
- New (mid-March+): summary line has separate Built/Skipped counts
- Old (Feb-early March): summary lumps built+skipped; we count skip log messages
"""

import json
import re
import subprocess
import sys

REPO = "OpenHands/benchmarks"
TOTAL_IMAGES = 433


def run_cmd(cmd, timeout=120):
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
    return result.stdout.strip()


def get_all_runs():
    cmd = f'''gh run list --repo {REPO} --workflow build-swtbench-images.yml --limit 100 \
      --json databaseId,conclusion,createdAt,updatedAt,headSha,displayTitle,status'''
    output = run_cmd(cmd)
    all_runs = json.loads(output)
    return [r for r in all_runs
            if r['conclusion'] not in ('skipped', '') and r['status'] == 'completed']


def get_job_id(run_id):
    cmd = f'gh run view {run_id} --repo {REPO} --json jobs'
    output = run_cmd(cmd)
    if not output:
        return None, None, None
    data = json.loads(output)
    for job in data.get('jobs', []):
        if 'build' in job['name'].lower():
            return job['databaseId'], job.get('startedAt'), job.get('completedAt')
    if data.get('jobs'):
        j = data['jobs'][0]
        return j['databaseId'], j.get('startedAt'), j.get('completedAt')
    return None, None, None


def get_job_logs(job_id):
    cmd = f'gh api repos/{REPO}/actions/jobs/{job_id}/logs'
    return run_cmd(cmd, timeout=60)


def extract_sdk_sha(title):
    m = re.search(r'SDK:\s*([a-f0-9]+)', title)
    return m.group(1) if m else "unknown"


def parse_duration_seconds(start_str, end_str):
    if not start_str or not end_str:
        return None
    try:
        from datetime import datetime
        start = datetime.fromisoformat(start_str.replace('Z', '+00:00'))
        end = datetime.fromisoformat(end_str.replace('Z', '+00:00'))
        return (end - start).total_seconds()
    except Exception:
        return None


def parse_tqdm_elapsed(elapsed_str):
    """Parse tqdm elapsed time like '2:55:21' or '44:16' to seconds."""
    parts = elapsed_str.split(':')
    if len(parts) == 3:
        return int(parts[0]) * 3600 + int(parts[1]) * 60 + int(parts[2])
    elif len(parts) == 2:
        return int(parts[0]) * 60 + int(parts[1])
    return None


def analyze_logs(logs):
    """Analyze logs to determine real built count, skipped count, and build timing.

    Tries four methods in order of preference:
    1. New summary line: "Done in Xs. Built=N Skipped=M Failed=K"
    2. Old summary line + skip message count: "Done. Built=N Failed=M" minus skip count
    3. New progress bar (for cancelled runs): 🛠 N  ⏭ M  ❌ K  🏃 R
    4. Old progress bar (for cancelled runs): ✅ N  ❌ K  🏃 R minus skip count
    """
    result = {
        'real_built': None, 'skipped': None, 'failed': None,
        'build_seconds': None, 'tqdm_elapsed': None, 'method': 'unknown',
    }
    if not logs:
        return result

    # Count "already exists. Skipping build." messages (needed for old format)
    skip_count = len(re.findall(r'already exists\. Skipping build\.', logs))

    # Extract tqdm elapsed time from the LAST progress bar line
    tqdm_matches = re.findall(r'\[(\d+:\d+(?::\d+)?)<', logs)
    if tqdm_matches:
        result['tqdm_elapsed'] = parse_tqdm_elapsed(tqdm_matches[-1])

    # METHOD 1: New format summary
    new_summary = re.search(
        r'Done in ([\d.]+)s\.\s+Built=(\d+)\s+Skipped=(\d+)\s+Failed=(\d+)', logs)
    if new_summary:
        result['build_seconds'] = float(new_summary.group(1))
        result['real_built'] = int(new_summary.group(2))
        result['skipped'] = int(new_summary.group(3))
        result['failed'] = int(new_summary.group(4))
        result['method'] = 'new_summary'
        return result

    # METHOD 2: Old format summary + skip count
    old_summary = re.search(r'Done\.\s+Built=(\d+)\s+Failed=(\d+)', logs)
    if old_summary:
        total_reported = int(old_summary.group(1))
        result['real_built'] = total_reported - skip_count
        result['skipped'] = skip_count
        result['failed'] = int(old_summary.group(2))
        result['method'] = 'old_summary_minus_skips'
        if result['tqdm_elapsed']:
            result['build_seconds'] = result['tqdm_elapsed']
        return result

    # METHOD 3: New progress bar (cancelled runs)
    new_progress = re.findall(r'🛠\s*(\d+)\s+⏭\s*(\d+)\s+❌\s*(\d+)\s+🏃\s*(\d+)', logs)
    if new_progress:
        last = new_progress[-1]
        result['real_built'] = int(last[0])
        result['skipped'] = int(last[1])
        result['failed'] = int(last[2])
        result['method'] = 'new_progress_bar'
        if result['tqdm_elapsed']:
            result['build_seconds'] = result['tqdm_elapsed']
        return result

    # METHOD 4: Old progress bar (cancelled runs)
    old_progress = re.findall(r'✅\s*(\d+)\s+❌\s*(\d+)\s+🏃\s*(\d+)', logs)
    if old_progress:
        last = old_progress[-1]
        result['real_built'] = int(last[0]) - skip_count
        result['skipped'] = skip_count
        result['failed'] = int(last[1])
        result['method'] = 'old_progress_bar_minus_skips'
        if result['tqdm_elapsed']:
            result['build_seconds'] = result['tqdm_elapsed']
        return result

    return result


def format_duration(seconds):
    if seconds is None:
        return "N/A"
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    return f"{h}h{m:02d}m" if h > 0 else f"{m}m{s:02d}s"


def main():
    print("Fetching workflow runs...", file=sys.stderr)
    runs = get_all_runs()
    print(f"Found {len(runs)} non-skipped completed runs", file=sys.stderr)

    results = []
    for i, run in enumerate(runs):
        run_id = run['databaseId']
        title = run['displayTitle']
        conclusion = run['conclusion']
        created = run['createdAt']

        # Skip instant failures (< 2 min)
        duration_total = parse_duration_seconds(run['createdAt'], run['updatedAt'])
        if duration_total is not None and duration_total < 120:
            continue

        print(f"  [{i+1}/{len(runs)}] Analyzing run {run_id} ({conclusion})...",
              file=sys.stderr)

        sdk_sha = extract_sdk_sha(title)
        benchmarks_sha = run['headSha'][:7]

        job_id, job_start, job_end = get_job_id(run_id)
        if not job_id:
            continue

        logs = get_job_logs(job_id)
        if not logs:
            continue

        analysis = analyze_logs(logs)

        if analysis['real_built'] is None or analysis['real_built'] < 50:
            print(f"    Built {analysis['real_built']} images (< 50), skipping",
                  file=sys.stderr)
            continue

        build_secs = analysis.get('build_seconds')
        if build_secs is None:
            build_secs = parse_duration_seconds(job_start, job_end)

        real_built = analysis['real_built']
        throughput = None
        if real_built > 0 and build_secs:
            throughput = real_built / (build_secs / 3600)

        results.append({
            'run_id': run_id,
            'date': created[:10],
            'conclusion': conclusion,
            'sdk_sha': sdk_sha[:7],
            'benchmarks_sha': benchmarks_sha,
            'real_built': real_built,
            'skipped': analysis['skipped'],
            'failed': analysis.get('failed', 0),
            'build_seconds': build_secs,
            'build_duration_str': format_duration(build_secs),
            'throughput_img_per_hour': round(throughput, 1) if throughput else None,
            'method': analysis['method'],
        })

        print(f"    Built={real_built} Skipped={analysis['skipped']} "
              f"Duration={format_duration(build_secs)} "
              f"Throughput={round(throughput, 1) if throughput else 'N/A'} img/h "
              f"Method={analysis['method']}", file=sys.stderr)

    results.sort(key=lambda x: x['date'] + str(x['run_id']))
    json.dump(results, sys.stdout, indent=2)


if __name__ == '__main__':
    main()

Verifying skip counts

To independently verify that skip counts are not double-counted for any run:

JOB_ID=$(gh run view <RUN_ID> --repo OpenHands/benchmarks --json jobs --jq '.jobs[0].databaseId')
LOGS=$(gh api repos/OpenHands/benchmarks/actions/jobs/$JOB_ID/logs)

# Total skip messages
echo "$LOGS" | grep -c 'already exists. Skipping build.'

# Unique images skipped (should match the above)
echo "$LOGS" | grep 'already exists. Skipping build.' | sed 's/.*Image //' | sed 's/ already.*//' | sort -u | wc -l
