Summary
Running swebenchmultimodal with the acp-claude agent on claude-opus-4-6 produced an unexpectedly low submission rate: 36/102 instances (35%) submitted patches, and only 12 of those were resolved (11.8% of all instances).
Configuration
| Parameter | Value |
|---|---|
| Benchmark | swebenchmultimodal |
| Model | claude-opus-4-6 |
| Agent Type | acp-claude |
| Eval Limit | 500 (dataset has 102 instances) |
| SDK SHA | d129025974ddf82256a0095da53783c5289f437c |
| Correlation ID | FCC30896 |
Results (from Slack notification)
- Total instances: 102
- Submitted instances: 36
- Resolved instances: 12
- Unresolved instances: 24
- Empty patch instances: 0
- Error instances: 0
- Success rate: 12/102 (11.8%)
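As a quick sanity check, the counts above can be cross-checked against the percentages quoted in this issue. This is only an arithmetic sketch over the numbers from the Slack notification; the variable names are illustrative and do not come from the evaluation code.

```python
# Counts copied from the Slack notification above.
total = 102
submitted = 36
resolved = 12
unresolved = 24


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Instances that never produced a patch.
not_submitted = total - submitted

# Every submitted instance should be either resolved or unresolved.
assert resolved + unresolved == submitted

rates = {
    "submitted_pct": pct(submitted, total),
    "not_submitted_pct": pct(not_submitted, total),
    "resolved_of_total_pct": pct(resolved, total),
    "resolved_of_submitted_pct": pct(resolved, submitted),
}
print(not_submitted, rates)
```

Under these numbers, roughly a third of the submitted patches resolved their instance, so the headline 11.8% is driven mainly by the missing submissions rather than by patch quality.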
Investigation Links
GitHub Actions
- SDK workflow: https://github.com/OpenHands/software-agent-sdk/actions/runs/23164258999
- Evaluation workflow: https://github.com/OpenHands/evaluation/actions/runs/23164289982
- Image build workflow: https://github.com/OpenHands/benchmarks/actions/runs/23164344897
Datadog
- K8s Job logs: https://us5.datadoghq.com/logs?query=kube_namespace%3Aevaluation-jobs%20kube_job%3Aeval-23164289982-claude-4-6
K8s Job
- Job name: `eval-23164289982-claude-4-6`
- Namespace: `evaluation-jobs`
Observed Issues
1. "Remote conversation ended with error" (16 instances)
Multiple instances from these repos failed with runtime errors:
- `diegomura__react-pdf-*` (multiple instances)
- `chartjs__Chart.js-*` (multiple instances)
- `markedjs__marked-*`
Example error:
Instance diegomura__react-pdf-1552 failed (attempt 1/3): Conversation run failed for id=ab46982a-260e-4d5a-8142-4769d2f3f9ee: Remote conversation ended with error
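To see which repos account for these failures, the job logs could be grouped by repo. The sketch below assumes every failure line follows the same shape as the example above; the regex and the helper names (`repo_of`, `summarize_failures`) are hypothetical and not part of the SDK or the benchmarks repo.

```python
import re

# Pattern modelled on the single example line quoted in this issue (an assumption
# about the general log format, not a documented contract).
FAILURE_RE = re.compile(
    r"Instance (?P<instance>\S+) failed "
    r"\(attempt (?P<attempt>\d+)/(?P<max_attempts>\d+)\): "
    r"Conversation run failed for id=(?P<conversation_id>[0-9a-f-]+): "
    r"(?P<reason>.+)"
)


def repo_of(instance_id: str) -> str:
    """Map 'diegomura__react-pdf-1552' to 'diegomura/react-pdf'."""
    owner_repo, _, _number = instance_id.rpartition("-")
    return owner_repo.replace("__", "/", 1)


def summarize_failures(lines):
    """Group failed instance IDs by repo, counting each instance once across retries."""
    failed = {}
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            instance = match["instance"]
            failed.setdefault(repo_of(instance), set()).add(instance)
    return failed


SAMPLE = (
    "Instance diegomura__react-pdf-1552 failed (attempt 1/3): "
    "Conversation run failed for id=ab46982a-260e-4d5a-8142-4769d2f3f9ee: "
    "Remote conversation ended with error"
)
failures = summarize_failures([SAMPLE])
print({repo: len(ids) for repo, ids in failures.items()})
```

Collecting instance IDs in a set keeps the per-repo count comparable to the "16 instances" figure even when an instance is retried up to three times.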
2. Low Submission Rate
- 66/102 instances (65%) did not produce patches
- "Error instances: 0" suggests these completed without errors but didn't generate diffs
- Possible issue with ACP agent patch extraction or output handling
Questions
- Why did 66 instances complete without submitting patches?
- Is the ACP agent output being captured correctly for patch extraction?
- Are the runtime failures (react-pdf, Chart.js) infrastructure issues or ACP-specific?
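One way to start on the first two questions is to separate instances that are absent from the output entirely from those that are present with an empty patch. The sketch below assumes the run's predictions are available as JSONL with `instance_id` and `model_patch` fields, as in the standard SWE-bench predictions format; the actual output of this pipeline may use different field names, and the function name and sample records are hypothetical.

```python
import json


def classify_instances(expected_ids, jsonl_lines, patch_key="model_patch"):
    """Split expected instance IDs into: has patch, empty patch, missing from output."""
    with_patch, empty_patch = set(), set()
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines in the file
        record = json.loads(raw)
        patch = record.get(patch_key) or ""
        bucket = with_patch if patch.strip() else empty_patch
        bucket.add(record["instance_id"])
    # Anything expected but never written to the output at all.
    missing = set(expected_ids) - with_patch - empty_patch
    return with_patch, empty_patch, missing


# Hypothetical records for illustration only; not data from this run.
sample_lines = [
    json.dumps({"instance_id": "example__repo-1", "model_patch": "diff --git a/x b/x\n"}),
    json.dumps({"instance_id": "example__repo-2", "model_patch": ""}),
]
expected = ["example__repo-1", "example__repo-2", "example__repo-3"]
with_patch, empty_patch, missing = classify_instances(expected, sample_lines)
print(sorted(with_patch), sorted(empty_patch), sorted(missing))
```

Since the report shows "Empty patch instances: 0", the 66 non-submitting instances would presumably land in the `missing` bucket, which would point toward output handling rather than patch generation.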
Related
- This may be related to known swebenchmultimodal infra issues with large images (wp-calypso, react-pdf)
/cc @OpenHands/benchmarks-team