
Add swt-bench results for GPT-5.4 #715

Merged
neubig merged 1 commit into main from
eval/GPT-5.4/swt-bench-20260318-152107
Mar 30, 2026

Conversation

@all-hands-bot
Collaborator

Evaluation Results

Model: GPT-5.4
Benchmark: swt-bench
Agent Version: v1.14.0

Results

  • Accuracy: 37.6%
  • Total Cost: $237.67
  • Average Instance Cost: $0.55
  • Total Duration: 101870s (1697.8m)
  • Average Instance Runtime: 235s

Report Summary

  • Total instances: 433
  • Submitted instances: 433
  • Resolved instances: 163
  • Unresolved instances: 49
  • Empty patch instances: 0
  • Error instances: 221

Additional Metadata

  • Mean coverage: 0.8017
  • Mean coverage delta: 0.5565
  • completed_instances: 212
  • unstopped_instances: 0
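The headline averages above follow directly from the report totals. A minimal arithmetic check (variable names are illustrative, not from the report schema):

```python
# Values copied from the report above.
total = 433
resolved = 163
total_cost = 237.67
total_duration_s = 101870

accuracy = resolved / total            # reported "Accuracy"
avg_cost = total_cost / total          # reported "Average Instance Cost"
avg_runtime = total_duration_s / total # reported "Average Instance Runtime"

print(f"accuracy:    {accuracy:.1%}")   # 37.6%
print(f"avg cost:    ${avg_cost:.2f}")  # $0.55
print(f"avg runtime: {avg_runtime:.0f}s")  # 235s
```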

This PR was automatically created by the evaluation pipeline.

@github-actions

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: Complete all model × benchmark pairs
  21 models × 5 benchmarks = 105 pairs
  (each pair requires all 3 metrics: score, cost_per_instance, average_runtime)

Incomplete Pairs (5):
  Qwen3.5-Flash:
    - swe-bench (all metrics)
    - swe-bench-multimodal (all metrics)
    - swt-bench (all metrics)
    - gaia (all metrics)
  Qwen3-Coder-Next:
    - swt-bench (all metrics)

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬜ 95.24%
  Complete: 100 / 105 pairs
============================================================

✅ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 42
  Passed: 42
  Failed: 0

============================================================
VALIDATION PASSED
============================================================

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.

@juanmichelini
Collaborator

@OpenHands please look at the full archive of the swt-bench run at https://results.eval.all-hands.dev/swtbench/litellm_proxy-openai-gpt-5-4/23226253018/results.tar.gz. It got 221 error instances; can you find the reason and report it here in a comment?

@openhands-ai

openhands-ai bot commented Mar 18, 2026

I'm on it! juanmichelini can track my progress at all-hands.dev

Collaborator

Analysis of 221 Error Instances in SWT-bench GPT-5.4 Evaluation

After downloading and analyzing the full results archive, the 221 error instances are caused by SWT-bench evaluation infrastructure failures, not by model/agent issues. The agent successfully produced patches for 220 out of 221 error instances (only 1 had an empty patch: django__django-13513).

Root Cause: Evaluation Harness Docker Environment Failures

The "error" classification comes from the SWT-bench test evaluation phase (which applies patches and runs tests in Docker containers), not from the agent inference phase. The evidence is overwhelming:

Error rate by repository:

Repository                    Errors / Total   Error Rate
sphinx-doc__sphinx                28 / 28        100%
scikit-learn__scikit-learn        24 / 24        100%
astropy__astropy                  17 / 17        100%
pydata__xarray                    15 / 15        100%
pytest-dev__pytest                15 / 15        100%
pylint-dev__pylint                 6 / 6         100%
psf__requests                      4 / 4         100%
mwaskom__seaborn                   2 / 2         100%
pallets__flask                     1 / 1         100%
matplotlib__matplotlib            23 / 32         72%
django__django                    86 / 216        40%
sympy__sympy                       0 / 73          0%

9 out of 12 repositories have a 100% error rate — every single instance from those repos errored during evaluation. Only sympy__sympy had a 0% error rate.
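A per-repository tally like the one above can be sketched as follows. This assumes each record in output.swtbench.jsonl carries an "instance_id" such as "sphinx-doc__sphinx-8721" and a "resolved" field that is null when the harness produced no test result; those field names are an assumption, not confirmed against the archive's schema.

```python
import json
from collections import Counter

def error_rates(jsonl_lines):
    """Tally (errors, total) per repository from swt-bench result JSONL lines.

    Assumed record shape: {"instance_id": "<repo>-<issue>", "resolved": <bool|null>};
    a null "resolved" is treated as a harness error (no test result produced).
    """
    errors, totals = Counter(), Counter()
    for line in jsonl_lines:
        rec = json.loads(line)
        repo = rec["instance_id"].rsplit("-", 1)[0]  # strip the issue number
        totals[repo] += 1
        if rec.get("resolved") is None:              # no result => error instance
            errors[repo] += 1
    return {repo: (errors[repo], totals[repo]) for repo in totals}

# Usage (against the extracted archive):
#   with open("output.swtbench.jsonl") as f:
#       for repo, (e, t) in error_rates(f).items():
#           print(f"{repo:30s} {e:3d} / {t:3d}  {e / t:4.0%}")
```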

Django Version-Specific Failures

Even within Django, the errors are strongly correlated with specific version ranges (i.e., specific Docker environment images):

Django Issue Range   Errors / Total   Error Rate
<11000                    0 / 6           0%
11000-11999               0 / 39          0%
12000-12999               2 / 20         10%
13000-13999              39 / 43         91%
14000-14999               0 / 34          0%
15000-15999              10 / 39         26%
16000-16999              32 / 32        100%
17000+                    3 / 3         100%
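The bucketing behind this table is simple: extract the issue number from the instance ID and group into 1000-issue ranges, which roughly track the per-version Docker environment images. A sketch (the helper name and the exact instance-ID format are assumptions):

```python
def django_bucket(instance_id):
    """Bucket a Django instance by issue-number range, matching the table above.

    Assumes IDs of the form "django__django-13513"; 1000-issue buckets
    approximate the version-specific evaluation environments.
    """
    n = int(instance_id.rsplit("-", 1)[1])
    if n < 11000:
        return "<11000"
    if n >= 17000:
        return "17000+"
    lo = (n // 1000) * 1000
    return f"{lo}-{lo + 999}"

print(django_bucket("django__django-13513"))  # 13000-13999
```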

Additional Evidence

  • The agent inference completed successfully for all 433 instances (all agent runs finished normally)
  • 220/221 error instances have non-empty model patches that look structurally valid
  • The evaluation used mode=prebaked-images with only 36 prebaked Docker images loaded for evaluation
  • The output.swtbench.jsonl entries for all instances have resolved: null — no test results were produced by the harness for error instances
  • Error instances have zero overlap with completed instances (clean partition of 212 + 221 = 433)

Conclusion

This is an evaluation infrastructure issue with the SWT-bench prebaked Docker evaluation images. The test execution harness failed for specific repository/version combinations. The actual model performance should be evaluated only on the 212 instances where evaluation completed successfully, yielding 163/212 = 76.9% resolve rate on successfully evaluated instances (reported as 163/433 = 37.6% overall due to the infrastructure errors).
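The two resolve-rate figures in the conclusion come straight from the report counts; as plain arithmetic:

```python
# Counts from the report summary above.
resolved, completed, total = 163, 212, 433

reported = resolved / total        # denominator includes harness errors
effective = resolved / completed   # only instances the harness evaluated

print(f"reported:  {reported:.1%}")   # 37.6%
print(f"effective: {effective:.1%}")  # 76.9%
```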

A re-run of the evaluation with fixed Docker images for the affected repos would likely improve the reported accuracy significantly.

@openhands-ai

openhands-ai bot commented Mar 18, 2026

Summary

The PR comment requested an analysis of why the GPT-5.4 SWT-bench evaluation had 221 error instances. I downloaded the full results archive from https://results.eval.all-hands.dev/swtbench/litellm_proxy-openai-gpt-5-4/23226253018/results.tar.gz, performed a thorough analysis, and posted the findings as a comment on PR #715.

What was done:

  • ✅ Downloaded and extracted the full results archive (~222MB)
  • ✅ Analyzed the output.report.json, output.jsonl, output.swtbench.jsonl, output.critic_attempt_*.jsonl, and eval.log files
  • ✅ Identified the root cause of the 221 error instances
  • ✅ Posted a detailed analysis comment on the PR: Add swt-bench results for GPT-5.4 #715 (comment)

Key findings reported in the comment:

The 221 errors are SWT-bench evaluation infrastructure failures (broken prebaked Docker environment images), not model/agent issues:

  • 9 of 12 repositories have 100% error rate — sphinx, scikit-learn, astropy, xarray, pytest, pylint, requests, seaborn, and flask all failed completely during the test evaluation phase
  • Only sympy__sympy had 0% error rate; Django had a version-correlated partial error rate (e.g., django 16000+ = 100% error, django 11000-11999 = 0%)
  • The agent successfully produced valid non-empty patches for 220/221 error instances — the inference phase worked fine
  • The actual resolve rate on successfully evaluated instances is 163/212 = 76.9%, not the reported 37.6%
  • A re-run with fixed Docker images would likely recover most of those 221 instances

No code changes were made — this was purely an investigative/reporting task as requested.

@juanmichelini
Collaborator

Contributor

@neubig left a comment


Looks good to me.

@neubig merged commit 408ab8a into main Mar 30, 2026
1 check passed
@neubig deleted the eval/GPT-5.4/swt-bench-20260318-152107 branch March 30, 2026 15:00