Description
This issue serves as the historical source of truth for SWT-bench image build throughput. It tracks every workflow run that built at least 50 images, with careful accounting of truly built vs. skipped images.
Build throughput over time
| Date | Status | SDK SHA | Benchmarks SHA | Images Built | Images Skipped | Build Duration | Throughput (img/h) | Run |
|---|---|---|---|---|---|---|---|---|
| 2026-02-25 | ❌ | bde31c4 | 7d81e21 | 225 | 0 | 5h57m | 37.8 | #22410410373 |
| 2026-02-27 | ✅ | cefaebf | 744df22 | 364 | 69 | 5h29m | 66.4 | #22508380588 |
| 2026-02-27 | ✅ | cefaebf | 744df22 | 379 | 54 | 5h12m | 72.9 | #22508424484 |
| 2026-02-27 | ✅ | cefaebf | 744df22 | 382 | 51 | 5h03m | 75.6 | #22508430612 |
| 2026-03-04 | ✅ | eab666f | 09c9fb1 | 50 | 0 | 55m18s | 54.2 | #22670483404 |
| 2026-03-04 | ✅ | 3c2bce2 | 863df2a | 50 | 0 | 1h00m | 50.0 | #22671307073 |
| 2026-03-05 | 🟡 cancelled | 63565b9 | 981a95a | 219 | 18 | 5h58m | 36.7 | #22715409658 |
| 2026-03-06 | 🟡 cancelled | 8e3e306 | 1f29020 | 275 | 0 | 5h59m | 46.0 | #22747565501 |
| 2026-03-11 | 🟡 cancelled | 3081956 | 4d786be | 246 | 34 | 5h55m | 41.6 | #22951682281 |
| 2026-03-11 | ✅ | 3081956 | 4d786be | 145 | 288 | 4h08m | 35.1 | #22967516108 |
| 2026-03-11 | 🟡 cancelled | 46f3d78 | e6f6da4 | 262 | 56 | 5h59m | 43.8 | #22974200370 |
| 2026-03-12 | 🟡 cancelled | b498a69 | 36cb585 | 196 | 14 | 3h02m | 64.6 | #22985176798 |
| 2026-03-12 | 🟡 cancelled | aa9df69 | 49a94d1 | 112 | 1 | 1h32m | 73.0 | #23001334237 |
| 2026-03-12 | 🟡 cancelled | aa9df69 | c614f83 | 59 | 119 | 56m40s | 62.5 | #23005078511 |
| 2026-03-12 | 🟡 cancelled | d0c1a39 | 355829e | 158 | 12 | 2h48m | 56.4 | #23008709458 |
| 2026-03-12 | 🟡 cancelled | 447aa91 | 22ea7b5 | 68 | 0 | 2h55m | 23.3 | #23028385721 |
| 2026-03-13 | 🟡 cancelled | 8e8223b | 277cb51 | 201 | 0 | 2h54m | 69.3 | #23033839708 |
| 2026-03-13 | ✅ | d94cdf8 | 2579041 | 161 | 272 | 2h55m | 55.2 | #23035485709 |
| 2026-03-13 | ✅ | 8e8223b | 039aebd | 230 | 203 | 4h28m | 51.5 | #23043936501 |
| 2026-03-16 | 🟡 cancelled | d129025 | e326a67 | 409 | 0 | 9h56m | 41.2 | #23164396524 |
| 2026-03-17 | 🟡 cancelled | c34cb27 | bedaad9 | 343 | 7 | 9h58m | 34.4 | #23198836788 |
| 2026-03-17 | 🟡 cancelled | c34cb27 | 76b2dbc | 348 | 1 | 9h57m | 35.0 | #23198839061 |
| 2026-03-17 | 🟡 cancelled | d2d5d47 | e326a67 | 314 | 35 | 9h57m | 31.6 | #23202392035 |
| 2026-03-18 | ✅ | 2d027b4 | e2132d1 | 392 | 41 | 9h20m | 42.0 | #23226296458 |
Reading the table
- Images Built = images that were actually built and pushed to GHCR by this run (excludes skipped).
- Images Skipped = images that already existed in GHCR and were not rebuilt.
- Build Duration = wall-clock time spent building (from tqdm elapsed or the build summary timer; excludes job setup).
- Throughput (img/h) = images_built / build_duration_hours: the measured build rate during this run.
- Status: ✅ completed, ❌ failed (hit timeout), 🟡 cancelled (hit timeout or manually cancelled).
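The throughput formula can be sanity-checked against any row of the table, e.g. the 2026-03-18 run (392 images in 9h20m):

```python
# Reproduce the throughput figure for the 2026-03-18 row of the table.
built = 392
duration_seconds = 9 * 3600 + 20 * 60  # 9h20m
throughput = built / (duration_seconds / 3600)
print(round(throughput, 1))  # 42.0, matching the table
```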
Caveats on throughput comparisons across runs:
- Throughput is non-linear at scale: runs that build fewer images may not hit disk pressure or BuildKit prune events, inflating their measured throughput relative to full 433-image runs.
- Follow-up runs with high skip counts (e.g., 272/288 skips) only built images missed by prior cancelled runs. They build a biased subset (often the slower-to-build images left over), so their throughput may not be representative.
- Throughput varies within a single run: profiling of run #23382357696 showed 61.3 img/h pre-prune dropping to 25-28 img/h during prune batches.
Key observations
- Throughput regressed significantly: late-February runs (SDK cefaebf) measured 66-76 img/h on 364-382 images, while mid-March runs measured 31-42 img/h on 314-409 images.
- Many cancelled runs: the job timeout (6h, later increased to 10h) is routinely hit before all images finish, requiring follow-up runs to complete the remaining images.
- One extreme outlier: run #23028385721 on 2026-03-12 (SDK 447aa91) measured only 23.3 img/h on 68 images.
- Feb 27 parallel runs: three runs with the same SDK/benchmarks SHAs launched concurrently, so the "skipped" images in each were built first by a sibling run. Each independently built 364–382 images.
Methodology
Why counting "real builds" is non-trivial
The build workflow checks GHCR for each image before building it (remote_image_exists() check). If the image already exists, it's skipped. A run can report "Built: 433/433" while only truly building a fraction — the rest were skipped because they were already pushed by a prior or concurrent run. If we don't distinguish, a mostly-skipping run looks blazingly fast when it isn't.
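The 2026-03-11 follow-up run #22967516108 from the table illustrates the distortion: it finished 433 images in 4h08m, but 288 of them were skips.

```python
# Run #22967516108: 433 images "done" in 4h08m, but only 145 real builds.
reported_total = 433
skipped = 288
duration_hours = (4 * 3600 + 8 * 60) / 3600  # 4h08m

apparent = reported_total / duration_hours          # naive reading of the log
real = (reported_total - skipped) / duration_hours  # skip-corrected

print(round(apparent, 1))  # 104.8 img/h -- looks blazingly fast
print(round(real, 1))      # 35.1 img/h -- the table's figure
```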
How real build counts were determined
The log format changed over time, so two different extraction strategies are needed:
Newer runs (mid-March+): The build summary line directly reports separate counts:
Done in 33633.7s. Built=392 Skipped=41 Failed=0 Retried=6 Throughput=46.3 images/hour
The progress bar also distinguishes built from skipped: 🛠 392 ⏭ 41 ❌ 0 🏃 0
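That summary line (from run #23226296458) can be parsed with a short regex. Note that recomputing throughput from Built alone reproduces the table's 42.0 img/h, while the log's own 46.3 figure matches (Built + Skipped) over the same duration, which is exactly the distortion this methodology corrects for:

```python
import re

summary = ("Done in 33633.7s. Built=392 Skipped=41 Failed=0 "
           "Retried=6 Throughput=46.3 images/hour")
m = re.search(r"Done in ([\d.]+)s\.\s+Built=(\d+)\s+Skipped=(\d+)\s+Failed=(\d+)",
              summary)
secs = float(m.group(1))
built, skipped, failed = (int(g) for g in m.groups()[1:])

print(round(built / (secs / 3600), 1))              # 42.0 -> real throughput
print(round((built + skipped) / (secs / 3600), 1))  # 46.3 -> the log's own figure
```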
Older runs (Feb–early March): The summary lumps everything together:
Done. Built=433 Failed=0 Manifest=...
The progress bar shows ✅ 433 ❌ 0 🏃 0 where ✅ includes both built AND skipped. However, each skipped image produces a distinct log line:
Image ghcr.io/openhands/eval-agent-server:010e847-sweb.eval.x86_64.sympy_1776_sympy-24213-source-minimal already exists. Skipping build.
So for old-format runs: real_built = reported_total - count("already exists. Skipping build."). This is verified: the count of skip messages equals the count of unique images skipped (no double-counting).
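The old-format correction can be exercised on a synthetic log fragment (the regexes mirror the ones in the script below):

```python
import re

# Synthetic old-format log: the summary lumps built and skipped together.
log = (
    "Image ghcr.io/openhands/eval-agent-server:a already exists. Skipping build.\n"
    "Image ghcr.io/openhands/eval-agent-server:b already exists. Skipping build.\n"
    "Done. Built=433 Failed=0 Manifest=...\n"
)

reported = int(re.search(r"Done\.\s+Built=(\d+)", log).group(1))
skips = len(re.findall(r"already exists\. Skipping build\.", log))
real_built = reported - skips
print(real_built)  # 431
```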
Cancelled runs: The last tqdm progress bar line gives the final built/skipped counts at the time of cancellation.
How build duration was measured
- New-format runs: the exact build-process time from the summary (Done in Xs.).
- Old-format and cancelled runs: the tqdm elapsed time from the last progress bar line (e.g., [5:29:27<00:00, 45.65s/it]), which excludes job setup overhead (checkout, docker login, uv install; typically ~1–2 minutes).
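Converting the tqdm elapsed stamp to seconds uses the same logic as parse_tqdm_elapsed in the script; a compact variant:

```python
import re

def tqdm_elapsed_seconds(line: str) -> int:
    """Extract the elapsed stamp from a tqdm line such as
    '433/433 [5:29:27<00:00, 45.65s/it]' and convert it to seconds."""
    elapsed = re.search(r"\[(\d+(?::\d+)+)<", line).group(1)
    seconds = 0
    for part in elapsed.split(":"):  # handles both M:SS and H:MM:SS
        seconds = seconds * 60 + int(part)
    return seconds

print(tqdm_elapsed_seconds("433/433 [5:29:27<00:00, 45.65s/it]"))  # 19767
```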
Script
Prerequisites: gh CLI authenticated with access to OpenHands/benchmarks. Run with:
```shell
python3 analyze_swtbench_builds.py 2>stderr.log > results.json
```
analyze_swtbench_builds.py:
```python
#!/usr/bin/env python3
"""Analyze swtbench image build workflow runs to determine real build counts vs skipped.

Downloads CI logs for each run via GitHub API, parses them to distinguish
truly-built images from skipped ones (already in GHCR), and computes
per-image build time and throughput.

Handles two log formats:
- New (mid-March+): summary line has separate Built/Skipped counts
- Old (Feb-early March): summary lumps built+skipped; we count skip log messages
"""
import json
import re
import subprocess
import sys
from datetime import datetime

REPO = "OpenHands/benchmarks"
TOTAL_IMAGES = 433


def run_cmd(cmd, timeout=120):
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True,
                            timeout=timeout)
    return result.stdout.strip()


def get_all_runs():
    cmd = f'''gh run list --repo {REPO} --workflow build-swtbench-images.yml --limit 100 \
        --json databaseId,conclusion,createdAt,updatedAt,headSha,displayTitle,status'''
    output = run_cmd(cmd)
    all_runs = json.loads(output)
    return [r for r in all_runs
            if r['conclusion'] not in ('skipped', '') and r['status'] == 'completed']


def get_job_id(run_id):
    cmd = f'gh run view {run_id} --repo {REPO} --json jobs'
    output = run_cmd(cmd)
    if not output:
        return None, None, None
    data = json.loads(output)
    for job in data.get('jobs', []):
        if 'build' in job['name'].lower():
            return job['databaseId'], job.get('startedAt'), job.get('completedAt')
    if data.get('jobs'):
        j = data['jobs'][0]
        return j['databaseId'], j.get('startedAt'), j.get('completedAt')
    return None, None, None


def get_job_logs(job_id):
    cmd = f'gh api repos/{REPO}/actions/jobs/{job_id}/logs'
    return run_cmd(cmd, timeout=60)


def extract_sdk_sha(title):
    m = re.search(r'SDK:\s*([a-f0-9]+)', title)
    return m.group(1) if m else "unknown"


def parse_duration_seconds(start_str, end_str):
    if not start_str or not end_str:
        return None
    try:
        start = datetime.fromisoformat(start_str.replace('Z', '+00:00'))
        end = datetime.fromisoformat(end_str.replace('Z', '+00:00'))
        return (end - start).total_seconds()
    except Exception:
        return None


def parse_tqdm_elapsed(elapsed_str):
    """Parse tqdm elapsed time like '2:55:21' or '44:16' to seconds."""
    parts = elapsed_str.split(':')
    if len(parts) == 3:
        return int(parts[0]) * 3600 + int(parts[1]) * 60 + int(parts[2])
    elif len(parts) == 2:
        return int(parts[0]) * 60 + int(parts[1])
    return None


def analyze_logs(logs):
    """Analyze logs to determine real built count, skipped count, and build timing.

    Tries four methods in order of preference:
    1. New summary line: "Done in Xs. Built=N Skipped=M Failed=K"
    2. Old summary line + skip message count: "Done. Built=N Failed=M" minus skip count
    3. New progress bar (for cancelled runs): 🛠 N ⏭ M ❌ K 🏃 R
    4. Old progress bar (for cancelled runs): ✅ N ❌ K 🏃 R minus skip count
    """
    result = {
        'real_built': None, 'skipped': None, 'failed': None,
        'build_seconds': None, 'tqdm_elapsed': None, 'method': 'unknown',
    }
    if not logs:
        return result
    # Count "already exists. Skipping build." messages (needed for old format)
    skip_count = len(re.findall(r'already exists\. Skipping build\.', logs))
    # Extract tqdm elapsed time from the LAST progress bar line
    tqdm_matches = re.findall(r'\[(\d+:\d+(?::\d+)?)<', logs)
    if tqdm_matches:
        result['tqdm_elapsed'] = parse_tqdm_elapsed(tqdm_matches[-1])
    # METHOD 1: New format summary
    new_summary = re.search(
        r'Done in ([\d.]+)s\.\s+Built=(\d+)\s+Skipped=(\d+)\s+Failed=(\d+)', logs)
    if new_summary:
        result['build_seconds'] = float(new_summary.group(1))
        result['real_built'] = int(new_summary.group(2))
        result['skipped'] = int(new_summary.group(3))
        result['failed'] = int(new_summary.group(4))
        result['method'] = 'new_summary'
        return result
    # METHOD 2: Old format summary + skip count
    old_summary = re.search(r'Done\.\s+Built=(\d+)\s+Failed=(\d+)', logs)
    if old_summary:
        total_reported = int(old_summary.group(1))
        result['real_built'] = total_reported - skip_count
        result['skipped'] = skip_count
        result['failed'] = int(old_summary.group(2))
        result['method'] = 'old_summary_minus_skips'
        if result['tqdm_elapsed']:
            result['build_seconds'] = result['tqdm_elapsed']
        return result
    # METHOD 3: New progress bar (cancelled runs)
    new_progress = re.findall(r'🛠\s*(\d+)\s+⏭\s*(\d+)\s+❌\s*(\d+)\s+🏃\s*(\d+)', logs)
    if new_progress:
        last = new_progress[-1]
        result['real_built'] = int(last[0])
        result['skipped'] = int(last[1])
        result['failed'] = int(last[2])
        result['method'] = 'new_progress_bar'
        if result['tqdm_elapsed']:
            result['build_seconds'] = result['tqdm_elapsed']
        return result
    # METHOD 4: Old progress bar (cancelled runs)
    old_progress = re.findall(r'✅\s*(\d+)\s+❌\s*(\d+)\s+🏃\s*(\d+)', logs)
    if old_progress:
        last = old_progress[-1]
        result['real_built'] = int(last[0]) - skip_count
        result['skipped'] = skip_count
        result['failed'] = int(last[1])
        result['method'] = 'old_progress_bar_minus_skips'
        if result['tqdm_elapsed']:
            result['build_seconds'] = result['tqdm_elapsed']
        return result
    return result


def format_duration(seconds):
    if seconds is None:
        return "N/A"
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    return f"{h}h{m:02d}m" if h > 0 else f"{m}m{s:02d}s"


def main():
    print("Fetching workflow runs...", file=sys.stderr)
    runs = get_all_runs()
    print(f"Found {len(runs)} non-skipped completed runs", file=sys.stderr)
    results = []
    for i, run in enumerate(runs):
        run_id = run['databaseId']
        title = run['displayTitle']
        conclusion = run['conclusion']
        created = run['createdAt']
        # Skip instant failures (< 2 min)
        duration_total = parse_duration_seconds(run['createdAt'], run['updatedAt'])
        if duration_total is not None and duration_total < 120:
            continue
        print(f"  [{i+1}/{len(runs)}] Analyzing run {run_id} ({conclusion})...",
              file=sys.stderr)
        sdk_sha = extract_sdk_sha(title)
        benchmarks_sha = run['headSha'][:7]
        job_id, job_start, job_end = get_job_id(run_id)
        if not job_id:
            continue
        logs = get_job_logs(job_id)
        if not logs:
            continue
        analysis = analyze_logs(logs)
        if analysis['real_built'] is None or analysis['real_built'] < 50:
            print(f"    Built {analysis['real_built']} images (< 50), skipping",
                  file=sys.stderr)
            continue
        build_secs = analysis.get('build_seconds')
        if build_secs is None:
            build_secs = parse_duration_seconds(job_start, job_end)
        real_built = analysis['real_built']
        throughput = None
        if real_built > 0 and build_secs:
            throughput = real_built / (build_secs / 3600)
        results.append({
            'run_id': run_id,
            'date': created[:10],
            'conclusion': conclusion,
            'sdk_sha': sdk_sha[:7],
            'benchmarks_sha': benchmarks_sha,
            'real_built': real_built,
            'skipped': analysis['skipped'],
            'failed': analysis.get('failed', 0),
            'build_seconds': build_secs,
            'build_duration_str': format_duration(build_secs),
            'throughput_img_per_hour': round(throughput, 1) if throughput else None,
            'method': analysis['method'],
        })
        print(f"    Built={real_built} Skipped={analysis['skipped']} "
              f"Duration={format_duration(build_secs)} "
              f"Throughput={round(throughput, 1) if throughput else 'N/A'} img/h "
              f"Method={analysis['method']}", file=sys.stderr)
    results.sort(key=lambda x: x['date'] + str(x['run_id']))
    json.dump(results, sys.stdout, indent=2)


if __name__ == '__main__':
    main()
```

Verifying skip counts
To independently verify that skip counts are not double-counted for any run:
```shell
JOB_ID=$(gh run view <RUN_ID> --repo OpenHands/benchmarks --json jobs --jq '.jobs[0].databaseId')
LOGS=$(gh api repos/OpenHands/benchmarks/actions/jobs/$JOB_ID/logs)

# Total skip messages
echo "$LOGS" | grep -c 'already exists. Skipping build.'

# Unique images skipped (should match the above)
echo "$LOGS" | grep 'already exists. Skipping build.' | sed 's/.*Image //' | sed 's/ already.*//' | sort -u | wc -l
```
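The same check can be done in Python when the shell pipeline is inconvenient (the log text here is a synthetic sample; real input would be whatever gh api returned):

```python
import re

def skip_counts_consistent(logs: str) -> bool:
    """True when the number of 'already exists' skip messages equals the
    number of distinct images named in them, i.e. no double-counting."""
    images = re.findall(r"Image (\S+) already exists\. Skipping build\.", logs)
    return len(images) == len(set(images))

sample = (
    "Image ghcr.io/openhands/eval-agent-server:a already exists. Skipping build.\n"
    "Image ghcr.io/openhands/eval-agent-server:b already exists. Skipping build.\n"
)
print(skip_counts_consistent(sample))      # True: two messages, two images
print(skip_counts_consistent(sample * 2))  # False: each image counted twice
```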