
Add swe-bench results for MiniMax-M2.7 #725

Open
all-hands-bot wants to merge 3 commits into main from
eval/MiniMax-M2.7/swe-bench-20260325-214158

Conversation

@all-hands-bot
Collaborator

Evaluation Results

Model: MiniMax-M2.7
Benchmark: swe-bench
Agent Version: v1.14.0

Results

  • Accuracy: 75.6%
  • Total Cost: $0.00
  • Average Instance Cost: $0.00
  • Total Duration: 264558s (4409.3m)
  • Average Instance Runtime: 529s

Report Summary

  • Total instances: 500
  • Submitted instances: 500
  • Resolved instances: 378
  • Unresolved instances: 121
  • Empty patch instances: 1
  • Error instances: 0

Additional Metadata

  • completed_instances: 499
  • schema_version: 2
  • unstopped_instances: 0
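The headline numbers above are internally consistent and can be cross-checked directly from the report fields. A minimal sanity check (all values copied from this PR body; nothing here is new data):

```python
# Cross-check the report's derived metrics from its raw counts.
total = 500            # Total instances
resolved = 378         # Resolved instances
total_duration_s = 264558

accuracy = resolved / total * 100        # reported as 75.6%
avg_runtime_s = total_duration_s / total # reported as 529s

print(f"Accuracy: {accuracy:.1f}%")                    # Accuracy: 75.6%
print(f"Avg instance runtime: {avg_runtime_s:.0f}s")   # Avg instance runtime: 529s
```

The resolved/unresolved/empty-patch split also sums to the submitted count (378 + 121 + 1 = 500).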

This PR was automatically created by the evaluation pipeline.

@github-actions

github-actions bot commented Mar 25, 2026

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: Complete all model × benchmark pairs
  22 models × 5 benchmarks = 110 pairs
  (each pair requires all 3 metrics: score, cost_per_instance, average_runtime)

Incomplete Pairs (10):
  GPT-5.4:
    - swt-bench (all metrics)
  Qwen3.5-Flash:
    - swe-bench (all metrics)
    - swe-bench-multimodal (all metrics)
    - swt-bench (all metrics)
    - gaia (all metrics)
  Qwen3-Coder-Next:
    - swt-bench (all metrics)
  Minimax-2.7:
    - swe-bench-multimodal (all metrics)
    - swt-bench (all metrics)
    - commit0 (all metrics)
    - gaia (all metrics)

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬜ 90.91%
  Complete: 100 / 110 pairs
============================================================

✅ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 46
  Passed: 46
  Failed: 0

============================================================
VALIDATION PASSED
============================================================

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.

juanmichelini and others added 2 commits March 27, 2026 13:19
Pricing from metadata.json:
Input (cache miss): $0.3/M tokens
Input (cache hit): $0.06/M tokens
Output: $1.2/M tokens
cache_write_price: ignored
New cost_per_instance: $0.1731

Co-authored-by: OpenHands Bot <openhands@all-hands.dev>
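Given the prices in the commit message, the per-instance cost is a straightforward weighted sum over token counts. A minimal sketch (the token volumes below are made-up placeholders, not figures from this run):

```python
def instance_cost(cache_miss_tokens: int, cache_hit_tokens: int, output_tokens: int) -> float:
    """Cost in USD at the prices quoted in the commit message."""
    PRICE_IN_MISS = 0.3 / 1e6   # $/token, input on cache miss
    PRICE_IN_HIT = 0.06 / 1e6   # $/token, input on cache hit
    PRICE_OUT = 1.2 / 1e6       # $/token, output
    return (cache_miss_tokens * PRICE_IN_MISS
            + cache_hit_tokens * PRICE_IN_HIT
            + output_tokens * PRICE_OUT)

# Hypothetical example volumes, purely illustrative:
print(f"${instance_cost(200_000, 1_000_000, 50_000):.4f}")  # $0.1800
```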
@juanmichelini juanmichelini requested a review from neubig March 27, 2026 16:43
@juanmichelini
Collaborator

Conversation Error Report

Archive: https://results.eval.all-hands.dev/swebench/litellm_proxy-minimax-MiniMax-M2-7/23463806447/results.tar.gz
Generated: 2026-03-24 01:31:13 UTC

Summary

Total conversations: 499
Conversations with errors: 9
Error rate: 1.8%
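The error rate follows directly from the two counts above (conversations with errors over total conversations):

```python
# Error rate as reported: conversations with errors / total conversations.
total_conversations = 499
conversations_with_errors = 9

error_rate = conversations_with_errors / total_conversations * 100
print(f"Error rate: {error_rate:.1f}%")  # Error rate: 1.8%
```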

Error Occurrences (sorted by count)

| Count | % | Error |
| --- | --- | --- |
| 3 | 0.6% | Tool 'grep' not found. Available: ['terminal', 'file_editor', 'task_tracker', 'finish', 'think'] |
| 1 | 0.2% | Failed to process archive: not a gzip file |
| 1 | 0.2% | Error executing tool 'terminal': Cannot use reset=True with is_input=True |
| 1 | 0.2% | Error validating args {"command": "view", "path": "/workspace/scikit-learn/sklearn/model_selection/_split.py", "view_range": [1167, 1225], "security_rity": "LOW", "summary": "View RepeatedKFold init"}... |
| 1 | 0.2% | MaxIterationsReached |
| 1 | 0.2% | Error validating args {"thought": "Let me understand the issue better. In 5.1.1:\n\nIn `_getconftestmodules`:\n- Old code used `directory.realpath().parts()` to iterate\n- It passed `conftestpath.real... |
| 1 | 0.2% | Error validating args {"command": "str_replace", "old_str": "# Expected: __in with [0] should be equivalent to __key = 0\nprint(\"\\n--- Verification ---\")\nif exact_items.count() != in_items.count()... |

Unique error types: 7

Orchestrator / Runtime Failures (from instance logs)

Instances that required retries or failed at the harness level.
has-conv = a conversation archive exists for the instance; no-conv = no conversation archive (the run failed completely).

| Instance | Retries | Status | Error |
| --- | --- | --- | --- |
| sphinx-doc__sphinx-9320 | 3 | has-conv | runtime_failure_count=3 |
| django__django-11885 | 1 | has-conv | runtime_failure_count=1 |
| matplotlib__matplotlib-26208 | 1 | has-conv | runtime_failure_count=1 |
| sphinx-doc__sphinx-8056 | 1 | has-conv | runtime_failure_count=1 |

Instances with runtime failures: 4

  • with conversation archive: 4
  • without conversation archive: 0
