Conversation
📊 Progress Report

✅ Schema Validation

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.
@OpenHands please look at the full archive of swebench https://results.eval.all-hands.dev/swtbench/litellm_proxy-openai-gpt-5-4/23226253018/results.tar.gz. It got 221 error instances; can you find the reason and report it here in a comment?
I'm on it! juanmichelini can track my progress at all-hands.dev
## Analysis of 221 Error Instances in the SWT-bench GPT-5.4 Evaluation

After downloading and analyzing the full results archive, the 221 error instances were caused by SWT-bench evaluation infrastructure failures, not by model or agent issues. The agent successfully produced patches for 220 of the 221 error instances (only one had an empty patch).

### Root Cause: Evaluation Harness Docker Environment Failures

The "error" classification comes from the SWT-bench test evaluation phase (which applies patches and runs tests in Docker containers), not from the agent inference phase. The evidence is clear.

Error rate by repository: 9 out of 12 repositories have a 100% error rate; every single instance from those repos errored during evaluation. Only Django shows a mixed pattern.

### Django Version-Specific Failures

Even within Django, the errors are strongly correlated with specific version ranges, i.e., with specific Docker environment images.
### Additional Evidence
### Conclusion

This is an evaluation infrastructure issue with the SWT-bench prebaked Docker evaluation images: the test execution harness failed for specific repository/version combinations. The actual model performance should be judged on the 212 instances where evaluation completed successfully, yielding a resolve rate of 163/212 = 76.9% (reported as 163/433 = 37.6% overall because of the infrastructure errors). Re-running the evaluation with fixed Docker images for the affected repos would likely improve the reported accuracy significantly.
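The two rates follow directly from the counts quoted above; a quick arithmetic check:

```python
errors = 221
resolved = 163
total = 433

evaluated = total - errors  # instances where evaluation completed
assert evaluated == 212

print(f"{resolved / evaluated:.1%}")  # resolve rate on evaluated instances: 76.9%
print(f"{resolved / total:.1%}")      # reported overall rate: 37.6%
```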
## Summary

The PR comment requested an analysis of why the GPT-5.4 SWT-bench evaluation had 221 error instances. I downloaded the full results archive from the linked URL and analyzed it.

### What was done
### Key findings reported in the comment

The 221 errors are SWT-bench evaluation infrastructure failures (broken prebaked Docker environment images), not model/agent issues.
No code changes were made; this was purely an investigative/reporting task, as requested.
Rerunning the evaluation here: https://github.com/OpenHands/software-agent-sdk/actions/runs/23651927409
## Evaluation Results

- Model: GPT-5.4
- Benchmark: swt-bench
- Agent Version: v1.14.0

### Results

### Report Summary

### Additional Metadata

0.8017340509484087 / 0.5565111313782735 / 2120

This PR was automatically created by the evaluation pipeline.