
Add swe-bench results for MiniMax-M2.7 #725

Open
all-hands-bot wants to merge 3 commits into main from
eval/MiniMax-M2.7/swe-bench-20260325-214158

Conversation

@all-hands-bot
Collaborator

Evaluation Results

Model: MiniMax-M2.7
Benchmark: swe-bench
Agent Version: v1.14.0

Results

  • Accuracy: 75.6%
  • Total Cost: $0.00
  • Average Instance Cost: $0.00
  • Total Duration: 264558s (4409.3m)
  • Average Instance Runtime: 529s

Report Summary

  • Total instances: 500
  • Submitted instances: 500
  • Resolved instances: 378
  • Unresolved instances: 121
  • Empty patch instances: 1
  • Error instances: 0

Additional Metadata

  • completed_instances: 499
  • schema_version: 2
  • unstopped_instances: 0
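The headline numbers above are internally consistent and can be cross-checked directly from the report fields. A minimal sanity check (all values copied from this PR body; nothing here is new data):

```python
# Cross-check the report's derived metrics from its raw counts.
total = 500            # Total instances
resolved = 378         # Resolved instances
total_duration_s = 264558

accuracy = resolved / total * 100        # reported as 75.6%
avg_runtime_s = total_duration_s / total # reported as 529s

print(f"Accuracy: {accuracy:.1f}%")                    # Accuracy: 75.6%
print(f"Avg instance runtime: {avg_runtime_s:.0f}s")   # Avg instance runtime: 529s
```

The resolved/unresolved/empty-patch split also sums to the submitted count (378 + 121 + 1 = 500).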

This PR was automatically created by the evaluation pipeline.

@github-actions

github-actions bot commented Mar 25, 2026

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: Complete all model × benchmark pairs
  22 models × 5 benchmarks = 110 pairs
  (each pair requires all 3 metrics: score, cost_per_instance, average_runtime)

Incomplete Pairs (10):
  GPT-5.4:
    - swt-bench (all metrics)
  Qwen3.5-Flash:
    - swe-bench (all metrics)
    - swe-bench-multimodal (all metrics)
    - swt-bench (all metrics)
    - gaia (all metrics)
  Qwen3-Coder-Next:
    - swt-bench (all metrics)
  Minimax-2.7:
    - swe-bench-multimodal (all metrics)
    - swt-bench (all metrics)
    - commit0 (all metrics)
    - gaia (all metrics)

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬜ 90.91%
  Complete: 100 / 110 pairs
============================================================

✅ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 46
  Passed: 46
  Failed: 0

============================================================
VALIDATION PASSED
============================================================

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.

juanmichelini and others added 2 commits March 27, 2026 13:19
Pricing from metadata.json:
Input (cache miss): $0.3/M tokens
Input (cache hit): $0.06/M tokens
Output: $1.2/M tokens
cache_write_price: ignored
New cost_per_instance: $0.1731

Co-authored-by: OpenHands Bot <openhands@all-hands.dev>
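Given the prices in the commit message, the per-instance cost is a straightforward weighted sum over token counts. A minimal sketch (the token volumes below are made-up placeholders, not figures from this run):

```python
def instance_cost(cache_miss_tokens: int, cache_hit_tokens: int, output_tokens: int) -> float:
    """Cost in USD at the prices quoted in the commit message."""
    PRICE_IN_MISS = 0.3 / 1e6   # $/token, input on cache miss
    PRICE_IN_HIT = 0.06 / 1e6   # $/token, input on cache hit
    PRICE_OUT = 1.2 / 1e6       # $/token, output
    return (cache_miss_tokens * PRICE_IN_MISS
            + cache_hit_tokens * PRICE_IN_HIT
            + output_tokens * PRICE_OUT)

# Hypothetical example volumes, purely illustrative:
print(f"${instance_cost(200_000, 1_000_000, 50_000):.4f}")  # $0.1800
```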
@juanmichelini juanmichelini requested a review from neubig March 27, 2026 16:43
@juanmichelini
Collaborator

Conversation Error Report

Archive: https://results.eval.all-hands.dev/swebench/litellm_proxy-minimax-MiniMax-M2-7/23463806447/results.tar.gz
Generated: 2026-03-24 01:31:13 UTC

Summary

Total conversations: 499
Conversations with errors: 9
Error rate: 1.8%
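The error rate follows directly from the two counts above (conversations with errors over total conversations):

```python
# Error rate as reported: conversations with errors / total conversations.
total_conversations = 499
conversations_with_errors = 9

error_rate = conversations_with_errors / total_conversations * 100
print(f"Error rate: {error_rate:.1f}%")  # Error rate: 1.8%
```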

Error Occurrences (sorted by count)

| Count | % | Error |
| --- | --- | --- |
| 3 | 0.6% | Tool 'grep' not found. Available: ['terminal', 'file_editor', 'task_tracker', 'finish', 'think'] |
| 1 | 0.2% | Failed to process archive: not a gzip file |
| 1 | 0.2% | Error executing tool 'terminal': Cannot use reset=True with is_input=True |
| 1 | 0.2% | Error validating args {"command": "view", "path": "/workspace/scikit-learn/sklearn/model_selection/_split.py", "view_range": [1167, 1225], "security_rity": "LOW", "summary": "View RepeatedKFold init"}... |
| 1 | 0.2% | MaxIterationsReached |
| 1 | 0.2% | Error validating args {"thought": "Let me understand the issue better. In 5.1.1:\n\nIn `_getconftestmodules`:\n- Old code used `directory.realpath().parts()` to iterate\n- It passed `conftestpath.real... |
| 1 | 0.2% | Error validating args {"command": "str_replace", "old_str": "# Expected: __in with [0] should be equivalent to __key = 0\nprint(\"\\n--- Verification ---\")\nif exact_items.count() != in_items.count()... |

Unique error types: 7

Orchestrator / Runtime Failures (from instance logs)

Instances that required retries or failed at the harness level.
has-conv = a conversation archive exists for the instance; no-conv = no conversation archive (the run failed completely).

| Instance | Retries | Status | Error |
| --- | --- | --- | --- |
| sphinx-doc__sphinx-9320 | 3 | has-conv | runtime_failure_count=3 |
| django__django-11885 | 1 | has-conv | runtime_failure_count=1 |
| matplotlib__matplotlib-26208 | 1 | has-conv | runtime_failure_count=1 |
| sphinx-doc__sphinx-8056 | 1 | has-conv | runtime_failure_count=1 |

Instances with runtime failures: 4

  • with conversation archive: 4
  • without conversation archive: 0
