
Add swt-bench results for MiniMax-M2.7#720

Open
all-hands-bot wants to merge 5 commits into main from
eval/MiniMax-M2.7/swt-bench-20260323-215850

Conversation

@all-hands-bot
Collaborator

Evaluation Results

Model: MiniMax-M2.7
Benchmark: swt-bench
Agent Version: v1.14.0

Results

  • Accuracy: 34.6%
  • Total Cost: $0.00
  • Average Instance Cost: $0.00
  • Total Duration: 152430s (2540.5m)
  • Average Instance Runtime: 352s

Report Summary

  • Total instances: 433
  • Submitted instances: 433
  • Resolved instances: 150
  • Unresolved instances: 62
  • Empty patch instances: 0
  • Error instances: 221

Additional Metadata

  • Mean coverage: 0.8763
  • Mean coverage delta: 0.6674
  • Completed instances: 212
  • Unstopped instances: 0

This PR was automatically created by the evaluation pipeline.
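As a quick sanity check, the headline numbers above are internally consistent: 150 resolved out of 433 total instances yields the reported 34.6% accuracy, and resolved + unresolved + error instances account for every submitted instance. A minimal sketch (the dict keys below are illustrative, not the evaluation pipeline's actual report schema):

```python
# Sanity-check the reported swt-bench summary numbers.
# The dict mirrors the figures in the report above; the key names
# are illustrative, not the pipeline's real schema.
report = {
    "total_instances": 433,
    "resolved_instances": 150,
    "unresolved_instances": 62,
    "error_instances": 221,
}

# Accuracy is resolved / total.
accuracy = report["resolved_instances"] / report["total_instances"]
print(f"Accuracy: {accuracy:.1%}")  # Accuracy: 34.6%

# Every instance should fall into exactly one outcome bucket.
accounted = (report["resolved_instances"]
             + report["unresolved_instances"]
             + report["error_instances"])
assert accounted == report["total_instances"]
```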

@github-actions

github-actions bot commented Mar 23, 2026

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: Complete all model × benchmark pairs
  22 models × 5 benchmarks = 110 pairs
  (each pair requires all 3 metrics: score, cost_per_instance, average_runtime)

Incomplete Pairs (10):
  GPT-5.4:
    - swt-bench (all metrics)
  Qwen3.5-Flash:
    - swe-bench (all metrics)
    - swe-bench-multimodal (all metrics)
    - swt-bench (all metrics)
    - gaia (all metrics)
  Qwen3-Coder-Next:
    - swt-bench (all metrics)
  Minimax-2.7:
    - swe-bench (all metrics)
    - swe-bench-multimodal (all metrics)
    - commit0 (all metrics)
    - gaia (all metrics)

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬜ 90.91%
  Complete: 100 / 110 pairs
============================================================
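The progress line above is simply completed pairs over total pairs (100 / 110 ≈ 90.91%). A small sketch of how such a bar could be rendered; the 11-cell width is inferred from the bar shown, not a documented pipeline setting:

```python
# Render a completion bar like the one in the progress report.
# The 11-cell width is an assumption read off the report's bar.
def progress_bar(done: int, total: int, cells: int = 11) -> str:
    filled = done * cells // total
    return "⬛" * filled + "⬜" * (cells - filled)

done, total = 100, 110
print(progress_bar(done, total), f"{done / total:.2%}")
# ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬜ 90.91%
```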

✅ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 46
  Passed: 46
  Failed: 0

============================================================
VALIDATION PASSED
============================================================

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.

juanmichelini and others added 4 commits on March 23, 2026 at 19:02
Pricing from metadata.json:
Input (cache miss): $0.3/M tokens
Input (cache hit): $0.06/M tokens
Output: $1.2/M tokens
cache_write_price: ignored
New cost_per_instance: $0.1283

Co-authored-by: OpenHands Bot <openhands@all-hands.dev>
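The corrected per-instance cost in the commit above follows directly from the listed per-million-token prices. A hedged sketch of that arithmetic; the token counts in the example are made-up inputs for illustration, not this run's actual usage:

```python
# Compute an instance's cost from the MiniMax-M2.7 pricing listed
# in the commit message (dollars per 1M tokens). The token counts
# passed in the example are hypothetical, not real usage data.
PRICE_PER_M = {
    "input_cache_miss": 0.30,
    "input_cache_hit": 0.06,
    "output": 1.20,
}

def instance_cost(cache_miss: int, cache_hit: int, output: int) -> float:
    """Cost in dollars for one instance's token usage."""
    return (cache_miss * PRICE_PER_M["input_cache_miss"]
            + cache_hit * PRICE_PER_M["input_cache_hit"]
            + output * PRICE_PER_M["output"]) / 1_000_000

# Example: 200k cache-miss input, 500k cache-hit input, 50k output tokens.
print(f"${instance_cost(200_000, 500_000, 50_000):.4f}")  # $0.1500
```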
@juanmichelini
Collaborator

@juanmichelini juanmichelini requested review from neubig and removed request for juanmichelini March 25, 2026 21:33
