
Add swt-bench results for MiniMax-M2.7#720

Open
all-hands-bot wants to merge 5 commits into main from
eval/MiniMax-M2.7/swt-bench-20260323-215850

Conversation

@all-hands-bot
Collaborator

Evaluation Results

Model: MiniMax-M2.7
Benchmark: swt-bench
Agent Version: v1.14.0

Results

  • Accuracy: 34.6%
  • Total Cost: $0.00
  • Average Instance Cost: $0.00
  • Total Duration: 152430s (2540.5m)
  • Average Instance Runtime: 352s

Report Summary

  • Total instances: 433
  • Submitted instances: 433
  • Resolved instances: 150
  • Unresolved instances: 62
  • Empty patch instances: 0
  • Error instances: 221

Additional Metadata

  • Mean coverage: 0.8763
  • Mean coverage delta: 0.6674
  • Completed instances: 212
  • Unstopped instances: 0

This PR was automatically created by the evaluation pipeline.
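As a quick sanity check, the headline numbers above are internally consistent: 150 resolved out of 433 total instances yields the reported 34.6% accuracy, and resolved + unresolved + error instances account for every submitted instance. A minimal sketch (the dict keys below are illustrative, not the evaluation pipeline's actual report schema):

```python
# Sanity-check the reported swt-bench summary numbers.
# The dict mirrors the figures in the report above; the key names
# are illustrative, not the pipeline's real schema.
report = {
    "total_instances": 433,
    "resolved_instances": 150,
    "unresolved_instances": 62,
    "error_instances": 221,
}

# Accuracy is resolved / total.
accuracy = report["resolved_instances"] / report["total_instances"]
print(f"Accuracy: {accuracy:.1%}")  # Accuracy: 34.6%

# Every instance should fall into exactly one outcome bucket.
accounted = (report["resolved_instances"]
             + report["unresolved_instances"]
             + report["error_instances"])
assert accounted == report["total_instances"]
```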

@github-actions

github-actions bot commented Mar 23, 2026

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: Complete all model × benchmark pairs
  22 models × 5 benchmarks = 110 pairs
  (each pair requires all 3 metrics: score, cost_per_instance, average_runtime)

Incomplete Pairs (10):
  GPT-5.4:
    - swt-bench (all metrics)
  Qwen3.5-Flash:
    - swe-bench (all metrics)
    - swe-bench-multimodal (all metrics)
    - swt-bench (all metrics)
    - gaia (all metrics)
  Qwen3-Coder-Next:
    - swt-bench (all metrics)
  Minimax-2.7:
    - swe-bench (all metrics)
    - swe-bench-multimodal (all metrics)
    - commit0 (all metrics)
    - gaia (all metrics)

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬜ 90.91%
  Complete: 100 / 110 pairs
============================================================
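The progress line above is simply completed pairs over total pairs (100 / 110 ≈ 90.91%). A small sketch of how such a bar could be rendered; the 11-cell width is inferred from the bar shown, not a documented pipeline setting:

```python
# Render a completion bar like the one in the progress report.
# The 11-cell width is an assumption read off the report's bar.
def progress_bar(done: int, total: int, cells: int = 11) -> str:
    filled = done * cells // total
    return "⬛" * filled + "⬜" * (cells - filled)

done, total = 100, 110
print(progress_bar(done, total), f"{done / total:.2%}")
# ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬜ 90.91%
```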

✅ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 46
  Passed: 46
  Failed: 0

============================================================
VALIDATION PASSED
============================================================

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.

juanmichelini and others added 4 commits on March 23, 2026 at 19:02
Pricing from metadata.json:
Input (cache miss): $0.3/M tokens
Input (cache hit): $0.06/M tokens
Output: $1.2/M tokens
cache_write_price: ignored
New cost_per_instance: $0.1283

Co-authored-by: OpenHands Bot <openhands@all-hands.dev>
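The corrected per-instance cost in the commit above follows directly from the listed per-million-token prices. A hedged sketch of that arithmetic; the token counts in the example are made-up inputs for illustration, not this run's actual usage:

```python
# Compute an instance's cost from the MiniMax-M2.7 pricing listed
# in the commit message (dollars per 1M tokens). The token counts
# passed in the example are hypothetical, not real usage data.
PRICE_PER_M = {
    "input_cache_miss": 0.30,
    "input_cache_hit": 0.06,
    "output": 1.20,
}

def instance_cost(cache_miss: int, cache_hit: int, output: int) -> float:
    """Cost in dollars for one instance's token usage."""
    return (cache_miss * PRICE_PER_M["input_cache_miss"]
            + cache_hit * PRICE_PER_M["input_cache_hit"]
            + output * PRICE_PER_M["output"]) / 1_000_000

# Example: 200k cache-miss input, 500k cache-hit input, 50k output tokens.
print(f"${instance_cost(200_000, 500_000, 50_000):.4f}")  # $0.1500
```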
@juanmichelini
Collaborator

@juanmichelini juanmichelini requested review from neubig and removed request for juanmichelini March 25, 2026 21:33
