swebenchmultimodal ACP evaluation: Low submission rate (35%) with claude-opus-4-6 #523

@simonrosenberg

Summary

Running swebenchmultimodal with the acp-claude agent on claude-opus-4-6 resulted in an unexpectedly low submission rate: only 36/102 instances (35%) submitted patches, and only 12 of those were resolved (11.8% of all instances).

Configuration

| Parameter | Value |
| --- | --- |
| Benchmark | swebenchmultimodal |
| Model | claude-opus-4-6 |
| Agent Type | acp-claude |
| Eval Limit | 500 (dataset has 102 instances) |
| SDK SHA | d129025974ddf82256a0095da53783c5289f437c |
| Correlation ID | FCC30896 |

Results (from Slack notification)

  • Total instances: 102
  • Submitted instances: 36
  • Resolved instances: 12
  • Unresolved instances: 24
  • Empty patch instances: 0
  • Error instances: 0
  • Success rate: 12/102 (11.8%)
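As a quick sanity check on the numbers above, here is a minimal sketch that recomputes the rates and shows how many instances fall outside every reported bucket. The function name `summarize` is hypothetical, and the sketch assumes the submitted, empty-patch, and error buckets are disjoint, which the notification does not state.

```python
def summarize(total, submitted, resolved, empty_patch, errors):
    """Recompute headline rates from the counts in the notification.

    Assumption: submitted, empty_patch, and errors are disjoint buckets.
    """
    accounted = submitted + empty_patch + errors
    return {
        "submission_rate_pct": round(100 * submitted / total, 1),
        "resolve_rate_pct": round(100 * resolved / total, 1),
        "resolve_rate_of_submitted_pct": round(100 * resolved / submitted, 1),
        "unresolved": submitted - resolved,
        # Instances that appear in none of the reported buckets.
        "unaccounted": total - accounted,
    }


summary = summarize(total=102, submitted=36, resolved=12, empty_patch=0, errors=0)
print(summary)
```

With the reported counts, 66 instances are unaccounted for, which matches the number of instances that did not produce patches.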

Investigation Links

GitHub Actions

Datadog

K8s Job

  • Job name: eval-23164289982-claude-4-6
  • Namespace: evaluation-jobs

Observed Issues

1. "Remote conversation ended with error" (16 instances)

Multiple instances from these repos failed with runtime errors:

  • diegomura__react-pdf-* (multiple instances)
  • chartjs__Chart.js-* (multiple instances)
  • markedjs__marked-*

Example error:

Instance diegomura__react-pdf-1552 failed (attempt 1/3): Conversation run failed for id=ab46982a-260e-4d5a-8142-4769d2f3f9ee: Remote conversation ended with error
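To see which repos these failures cluster in, the job logs could be tallied with something like the sketch below. The regular expression is modelled only on the single example line above, so other log lines may use a different shape; `failures_by_repo` is a hypothetical helper, and the repo name is assumed to be the instance ID with its trailing `-<number>` removed.

```python
import re
from collections import Counter

# Pattern based on the one example line in this issue (an assumption, not a spec).
FAILURE_RE = re.compile(
    r"Instance (?P<instance>\S+) failed \(attempt \d+/\d+\): (?P<reason>.*)"
)
TARGET = "Remote conversation ended with error"


def failures_by_repo(lines):
    """Count distinct failing instances per repo for the TARGET error."""
    seen = set()
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if not match or not match["reason"].rstrip().endswith(TARGET):
            continue
        instance = match["instance"]
        if instance in seen:
            continue  # count each instance once, even across retries
        seen.add(instance)
        repo = instance.rsplit("-", 1)[0]  # drop the trailing issue number
        counts[repo] += 1
    return counts


example = (
    "Instance diegomura__react-pdf-1552 failed (attempt 1/3): "
    "Conversation run failed for id=ab46982a-260e-4d5a-8142-4769d2f3f9ee: "
    "Remote conversation ended with error"
)
print(failures_by_repo([example]))
```

Counting each instance once keeps retries (attempt 2/3, 3/3) from inflating the per-repo totals.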

2. Low Submission Rate

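  • 66/102 instances (65%) did not produce patches
  • The report shows "Error instances: 0" and "Empty patch instances: 0", so these 66 instances are not counted in any reported bucket (36 submitted + 0 empty + 0 error = 36 of 102)
  • This suggests they completed without a recorded error but no diff was captured
  • Possible issue with ACP agent patch extraction or output handling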

Questions

  1. Why did 66 instances complete without submitting patches?
  2. Is the ACP agent output being captured correctly for patch extraction?
  3. Are the runtime failures (react-pdf, Chart.js) infrastructure issues or ACP-specific?

Related

  • This may be related to known swebenchmultimodal infra issues with large images (wp-calypso, react-pdf)

/cc @OpenHands/benchmarks-team
