
Add gaia results for MiniMax-M2.7 #723

Open
all-hands-bot wants to merge 2 commits into main from eval/MiniMax-M2.7/gaia-20260325-214141


@all-hands-bot
Collaborator

Evaluation Results

Model: MiniMax-M2.7
Benchmark: gaia
Agent Version: v1.14.0

Results

  • Accuracy: 0.0%
  • Total Cost: $0.00
  • Average Instance Cost: $0.00
  • Total Duration: 0s (0.0m)
  • Average Instance Runtime: 0s

Report Summary

  • Total instances: 165
  • Submitted instances: 165
  • Resolved instances: 0
  • Unresolved instances: 0
  • Empty patch instances: 0
  • Error instances: 165

Additional Metadata

  • completed_instances: 0
  • incomplete_instances: 165

This PR was automatically created by the evaluation pipeline.
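The summary above reports 165 error instances out of 165 and zero completed instances, so the 0.0% accuracy and the zero cost and runtime figures likely reflect a failed run rather than a measured result. A minimal sketch of a guard that could flag such a report before results are published, assuming the summary is available as a dictionary with the field names shown here (the function name and the dictionary layout are hypothetical, not part of the pipeline):

```python
def all_instances_errored(report: dict) -> bool:
    """Return True when every instance in the report ended in an error.

    Assumption: `report` carries `total_instances` and `error_instances`
    as integers; these keys are illustrative, not the pipeline's schema.
    """
    total = report["total_instances"]
    return total > 0 and report["error_instances"] == total


# Values copied from the Report Summary in this PR.
report = {
    "total_instances": 165,
    "submitted_instances": 165,
    "resolved_instances": 0,
    "error_instances": 165,
}

if all_instances_errored(report):
    print("Run produced no usable results; skipping score publication.")
```

A check like this would stop a run with no completed instances from being recorded as a genuine 0.0% score.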

@github-actions

github-actions bot commented Mar 25, 2026

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: Complete all model × benchmark pairs
  22 models × 5 benchmarks = 110 pairs
  (each pair requires all 3 metrics: score, cost_per_instance, average_runtime)

Incomplete Pairs (10):
  GPT-5.4:
    - swt-bench (all metrics)
  Qwen3.5-Flash:
    - swe-bench (all metrics)
    - swe-bench-multimodal (all metrics)
    - swt-bench (all metrics)
    - gaia (all metrics)
  Qwen3-Coder-Next:
    - swt-bench (all metrics)
  Minimax-2.7:
    - swe-bench (all metrics)
    - swe-bench-multimodal (all metrics)
    - swt-bench (all metrics)
    - commit0 (all metrics)

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬜ 90.91%
  Complete: 100 / 110 pairs
============================================================
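The percentage in the progress report follows directly from the pair counts. A short sketch of the arithmetic, assuming a pair counts as complete only when all three metrics are present (the function name is hypothetical):

```python
def progress(models: int, benchmarks: int, incomplete_pairs: int) -> tuple[int, int, float]:
    """Return (total pairs, complete pairs, percent complete rounded to 2 places)."""
    total = models * benchmarks
    complete = total - incomplete_pairs
    return total, complete, round(100 * complete / total, 2)


# Numbers taken from the report: 22 models, 5 benchmarks, 10 incomplete pairs.
total, complete, percent = progress(22, 5, 10)
print(f"Complete: {complete} / {total} pairs ({percent}%)")
```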

❌ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 46
  Passed: 45
  Failed: 1

Errors:
  - /home/runner/work/openhands-index-results/openhands-index-results/results/MiniMax-M2.7/scores.json: Entry 0:
  • Field 'cost_per_instance': Input should be greater than 0 (got: 0.0)
  • Field 'average_runtime': Input should be greater than 0 (got: 0.0)

============================================================
VALIDATION FAILED
============================================================
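The two errors above indicate that `cost_per_instance` and `average_runtime` must be strictly positive, while the 0.0 score itself was not flagged. A minimal standard-library sketch of an equivalent check, assuming each entry in `scores.json` is a flat object with these field names (the actual validator and its schema are not shown in this PR, so the function below is illustrative only):

```python
# Fields the report says must be greater than 0. Other constraints in the
# real schema are unknown and deliberately not modeled here.
POSITIVE_FIELDS = ("cost_per_instance", "average_runtime")


def validate_entry(entry: dict) -> list[str]:
    """Return one error message per field that is missing or not > 0."""
    errors = []
    for field in POSITIVE_FIELDS:
        value = entry.get(field)
        if not isinstance(value, (int, float)) or value <= 0:
            errors.append(
                f"Field '{field}': Input should be greater than 0 (got: {value})"
            )
    return errors


# Hypothetical entry mirroring the values reported for MiniMax-M2.7 on gaia.
entry = {"score": 0.0, "cost_per_instance": 0.0, "average_runtime": 0.0}
for message in validate_entry(entry):
    print(message)
```

Because all 165 instances errored, the zero cost and runtime are most likely a symptom of the failed run, so rerunning the evaluation is probably the right fix rather than relaxing the constraint.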

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.



3 participants