
Add swt-bench results for GPT-5.4 #715

Merged
neubig merged 1 commit into main from
eval/GPT-5.4/swt-bench-20260318-152107
Mar 30, 2026

Conversation

@all-hands-bot
Collaborator

Evaluation Results

Model: GPT-5.4
Benchmark: swt-bench
Agent Version: v1.14.0

Results

  • Accuracy: 37.6%
  • Total Cost: $237.67
  • Average Instance Cost: $0.55
  • Total Duration: 101870s (1697.8m)
  • Average Instance Runtime: 235s

Report Summary

  • Total instances: 433
  • Submitted instances: 433
  • Resolved instances: 163
  • Unresolved instances: 49
  • Empty patch instances: 0
  • Error instances: 221

Additional Metadata

  • Mean coverage: 0.8017
  • Mean coverage delta: 0.5565
  • completed_instances: 212
  • unstopped_instances: 0
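The headline averages above follow directly from the report totals. A minimal arithmetic check (variable names are illustrative, not from the report schema):

```python
# Values copied from the report above.
total = 433
resolved = 163
total_cost = 237.67
total_duration_s = 101870

accuracy = resolved / total            # reported "Accuracy"
avg_cost = total_cost / total          # reported "Average Instance Cost"
avg_runtime = total_duration_s / total # reported "Average Instance Runtime"

print(f"accuracy:    {accuracy:.1%}")   # 37.6%
print(f"avg cost:    ${avg_cost:.2f}")  # $0.55
print(f"avg runtime: {avg_runtime:.0f}s")  # 235s
```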

This PR was automatically created by the evaluation pipeline.

@github-actions

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: Complete all model × benchmark pairs
  21 models × 5 benchmarks = 105 pairs
  (each pair requires all 3 metrics: score, cost_per_instance, average_runtime)

Incomplete Pairs (5):
  Qwen3.5-Flash:
    - swe-bench (all metrics)
    - swe-bench-multimodal (all metrics)
    - swt-bench (all metrics)
    - gaia (all metrics)
  Qwen3-Coder-Next:
    - swt-bench (all metrics)

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬜ 95.24%
  Complete: 100 / 105 pairs
============================================================

✅ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 42
  Passed: 42
  Failed: 0

============================================================
VALIDATION PASSED
============================================================

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.

@juanmichelini
Collaborator

@OpenHands please look at the full archive of the swt-bench run at https://results.eval.all-hands.dev/swtbench/litellm_proxy-openai-gpt-5-4/23226253018/results.tar.gz. It got 221 error instances; can you find the reason and report it here in a comment?

@openhands-ai

openhands-ai bot commented Mar 18, 2026

I'm on it! juanmichelini can track my progress at all-hands.dev

Collaborator

Analysis of 221 Error Instances in SWT-bench GPT-5.4 Evaluation

After downloading and analyzing the full results archive, the 221 error instances are caused by SWT-bench evaluation infrastructure failures, not by model/agent issues. The agent successfully produced patches for 220 out of 221 error instances (only 1 had an empty patch: django__django-13513).

Root Cause: Evaluation Harness Docker Environment Failures

The "error" classification comes from the SWT-bench test evaluation phase (which applies patches and runs tests in Docker containers), not from the agent inference phase. The evidence is overwhelming:

Error rate by repository:

Repository                    Errors / Total   Error Rate
sphinx-doc__sphinx                28 / 28        100%
scikit-learn__scikit-learn        24 / 24        100%
astropy__astropy                  17 / 17        100%
pydata__xarray                    15 / 15        100%
pytest-dev__pytest                15 / 15        100%
pylint-dev__pylint                 6 / 6         100%
psf__requests                      4 / 4         100%
mwaskom__seaborn                   2 / 2         100%
pallets__flask                     1 / 1         100%
matplotlib__matplotlib            23 / 32         72%
django__django                    86 / 216        40%
sympy__sympy                       0 / 73          0%

9 out of 12 repositories have a 100% error rate — every single instance from those repos errored during evaluation. Only sympy__sympy had a 0% error rate.
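A per-repository tally like the one above can be sketched as follows. This assumes each record in output.swtbench.jsonl carries an "instance_id" such as "sphinx-doc__sphinx-8721" and a "resolved" field that is null when the harness produced no test result; those field names are an assumption, not confirmed against the archive's schema.

```python
import json
from collections import Counter

def error_rates(jsonl_lines):
    """Tally (errors, total) per repository from swt-bench result JSONL lines.

    Assumed record shape: {"instance_id": "<repo>-<issue>", "resolved": <bool|null>};
    a null "resolved" is treated as a harness error (no test result produced).
    """
    errors, totals = Counter(), Counter()
    for line in jsonl_lines:
        rec = json.loads(line)
        repo = rec["instance_id"].rsplit("-", 1)[0]  # strip the issue number
        totals[repo] += 1
        if rec.get("resolved") is None:              # no result => error instance
            errors[repo] += 1
    return {repo: (errors[repo], totals[repo]) for repo in totals}

# Usage (against the extracted archive):
#   with open("output.swtbench.jsonl") as f:
#       for repo, (e, t) in error_rates(f).items():
#           print(f"{repo:30s} {e:3d} / {t:3d}  {e / t:4.0%}")
```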

Django Version-Specific Failures

Even within Django, the errors are strongly correlated with specific version ranges (i.e., specific Docker environment images):

Django Issue Range   Errors / Total   Error Rate
<11000                    0 / 6           0%
11000-11999               0 / 39          0%
12000-12999               2 / 20         10%
13000-13999              39 / 43         91%
14000-14999               0 / 34          0%
15000-15999              10 / 39         26%
16000-16999              32 / 32        100%
17000+                    3 / 3         100%
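The bucketing behind this table is simple: extract the issue number from the instance ID and group into 1000-issue ranges, which roughly track the per-version Docker environment images. A sketch (the helper name and the exact instance-ID format are assumptions):

```python
def django_bucket(instance_id):
    """Bucket a Django instance by issue-number range, matching the table above.

    Assumes IDs of the form "django__django-13513"; 1000-issue buckets
    approximate the version-specific evaluation environments.
    """
    n = int(instance_id.rsplit("-", 1)[1])
    if n < 11000:
        return "<11000"
    if n >= 17000:
        return "17000+"
    lo = (n // 1000) * 1000
    return f"{lo}-{lo + 999}"

print(django_bucket("django__django-13513"))  # 13000-13999
```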

Additional Evidence

  • The agent inference completed successfully for all 433 instances (all agent runs finished normally)
  • 220/221 error instances have non-empty model patches that look structurally valid
  • The evaluation used mode=prebaked-images with only 36 prebaked Docker images loaded for evaluation
  • The output.swtbench.jsonl entries for all instances have resolved: null — no test results were produced by the harness for error instances
  • Error instances have zero overlap with completed instances (clean partition of 212 + 221 = 433)

Conclusion

This is an evaluation infrastructure issue with the SWT-bench prebaked Docker evaluation images. The test execution harness failed for specific repository/version combinations. The actual model performance should be evaluated only on the 212 instances where evaluation completed successfully, yielding 163/212 = 76.9% resolve rate on successfully evaluated instances (reported as 163/433 = 37.6% overall due to the infrastructure errors).
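The two resolve-rate figures in the conclusion come straight from the report counts; as plain arithmetic:

```python
# Counts from the report summary above.
resolved, completed, total = 163, 212, 433

reported = resolved / total        # denominator includes harness errors
effective = resolved / completed   # only instances the harness evaluated

print(f"reported:  {reported:.1%}")   # 37.6%
print(f"effective: {effective:.1%}")  # 76.9%
```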

A re-run of the evaluation with fixed Docker images for the affected repos would likely improve the reported accuracy significantly.

@openhands-ai

openhands-ai bot commented Mar 18, 2026

Summary

The PR comment requested an analysis of why the GPT-5.4 SWT-bench evaluation had 221 error instances. I downloaded the full results archive from https://results.eval.all-hands.dev/swtbench/litellm_proxy-openai-gpt-5-4/23226253018/results.tar.gz, performed a thorough analysis, and posted the findings as a comment on PR #715.

What was done:

  • ✅ Downloaded and extracted the full results archive (~222MB)
  • ✅ Analyzed the output.report.json, output.jsonl, output.swtbench.jsonl, output.critic_attempt_*.jsonl, and eval.log files
  • ✅ Identified the root cause of the 221 error instances
  • ✅ Posted a detailed analysis comment on the PR: Add swt-bench results for GPT-5.4 #715 (comment)

Key findings reported in the comment:

The 221 errors are SWT-bench evaluation infrastructure failures (broken prebaked Docker environment images), not model/agent issues:

  • 9 of 12 repositories have 100% error rate — sphinx, scikit-learn, astropy, xarray, pytest, pylint, requests, seaborn, and flask all failed completely during the test evaluation phase
  • Only sympy__sympy had 0% error rate; Django had a version-correlated partial error rate (e.g., django 16000+ = 100% error, django 11000-11999 = 0%)
  • The agent successfully produced valid non-empty patches for 220/221 error instances — the inference phase worked fine
  • The actual resolve rate on successfully evaluated instances is 163/212 = 76.9%, not the reported 37.6%
  • A re-run with fixed Docker images would likely recover most of those 221 instances

No code changes were made — this was purely an investigative/reporting task as requested.

@juanmichelini
Collaborator

Contributor

@neubig left a comment


Looks good to me.

@neubig merged commit 408ab8a into main Mar 30, 2026
1 check passed
@neubig deleted the eval/GPT-5.4/swt-bench-20260318-152107 branch March 30, 2026 15:00