42 commits
4898c4e
nemo-evaluator implementation
e-dobrowolska Feb 12, 2026
f118e3a
remove redundant VERSION file
e-dobrowolska Feb 12, 2026
82386bc
Revert fake_user_response.py to main, remove run_timeout= from callers
simonrosenberg Feb 25, 2026
2c962f2
Revert eval_ prefix on run_id to match main convention
simonrosenberg Feb 25, 2026
7eb5b73
Move benchmark identification to nemo_metadata.json
simonrosenberg Feb 25, 2026
698c685
Simplify prompt choices: drop relative_to(Path.cwd())
simonrosenberg Feb 25, 2026
1637184
Remove --conversation-timeout CLI arg (now env var CONVERSATION_TIMEOUT)
simonrosenberg Feb 25, 2026
5c902ae
Remove --skip-failed-samples feature entirely
simonrosenberg Feb 25, 2026
cb16dee
Replace uv fallback with sys.executable in eval scripts
simonrosenberg Feb 25, 2026
0c6494a
Minimize eval_infer.py diffs: only replace uv with sys.executable
simonrosenberg Feb 25, 2026
e26104b
Remove swebenchmultimodal modal changes (moved to PR #452)
simonrosenberg Feb 25, 2026
fdc8c11
Fix pre-commit issues: imports, formatting, undefined variable
simonrosenberg Feb 25, 2026
d1c8c5f
Sync swtbench split parameter with main
simonrosenberg Feb 26, 2026
2480654
Restore TargetType annotations in build_utils.py
simonrosenberg Feb 26, 2026
f835292
Move run_benchmark.py and generate_llm_config.py to nemo_evaluator
simonrosenberg Feb 26, 2026
6ec0d35
Make nemo_evaluator an optional dependency
simonrosenberg Feb 26, 2026
8c6cf37
Remove redundant package include patterns
simonrosenberg Feb 26, 2026
0ff388e
Move NeMo-specific logic from llm_config.py to generate_llm_config.py
simonrosenberg Feb 26, 2026
e445fa9
Merge main into nemo-evaluator, resolve conflicts with PR #456
simonrosenberg Mar 2, 2026
c6834eb
Restore --depth 1 shallow clone for reward hacking prevention
simonrosenberg Mar 2, 2026
673b855
Restore DelegateTool import and enable_delegation support
simonrosenberg Mar 2, 2026
fc03186
Restore summarize_instance calls for post-evaluation logging
simonrosenberg Mar 2, 2026
2963114
Restore _extract_answer_from_history to use FinishAction properly
simonrosenberg Mar 2, 2026
861b33b
Remove extraneous comment added by PR
simonrosenberg Mar 2, 2026
42990d7
Revert Multi-SWE-Bench check to use startswith (matches main)
simonrosenberg Mar 2, 2026
b0620fa
Refactor IMAGE_TAG_PREFIX: remove dirty fallback, complete migration
simonrosenberg Mar 2, 2026
ff7d185
Revert fake_user_response.py to match main
simonrosenberg Mar 2, 2026
008d6c0
Revert error tracing changes (moved to #467)
simonrosenberg Mar 2, 2026
3fafd8e
Restore --modal/--no-modal CLI flags in swebenchmultimodal eval
simonrosenberg Mar 2, 2026
2a7ff00
Rename image_exists to remote_image_exists, remove misleading local c…
simonrosenberg Mar 2, 2026
8880c55
Fix CI: update test for get_tools_for_preset removal, regenerate uv.lock
simonrosenberg Mar 2, 2026
0ec5a82
Fix pre-commit: remove extra blank line, exclude nemo_evaluator from …
simonrosenberg Mar 2, 2026
2e6ee1b
Update benchmarks/openagentsafety/run_infer.py
simonrosenberg Mar 2, 2026
7425fad
Update pyproject.toml
simonrosenberg Mar 2, 2026
d219e11
Refactor benchmark-specific args to data-driven BENCHMARK_INFER_PARAMS
simonrosenberg Mar 2, 2026
0de1771
Refactor eval cmd and env vars to data-driven tables
simonrosenberg Mar 2, 2026
0f88bb2
Fix syntax error in openagentsafety/run_infer.py and pre-commit forma…
simonrosenberg Mar 2, 2026
2faffac
Restore get_tools_for_preset() function and tool_preset support
simonrosenberg Mar 2, 2026
603a439
Fix test mock target for swebench get_tools_for_preset
simonrosenberg Mar 2, 2026
4ca4d3a
Merge main into nemo-evaluator
simonrosenberg Mar 3, 2026
4533ed8
Merge main into nemo-evaluator
simonrosenberg Mar 3, 2026
82a63ad
Revert redundant argparse default for --timeout in swebench eval
simonrosenberg Mar 3, 2026
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -36,4 +36,4 @@ repos:
       types: [python]
       pass_filenames: true
       always_run: false
-      exclude: ^legacy/
+      exclude: ^(legacy|nemo_evaluator)/
🔴 Critical: Excluding nemo_evaluator from pre-commit bypasses type checking and linting.

You've exempted this code from pyright and ruff. This means type errors and code quality issues will slip through. NeMo has its own conventions, but type safety and basic linting shouldn't be negotiable.

If NeMo's conventions conflict with your project's, document the specific conflicts and exclude only those rules. Blanket exclusion is how technical debt accumulates silently.
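If only specific rules conflict, ruff supports scoping exclusions per path instead of dropping the package from pre-commit entirely. A sketch of that narrower alternative (the rule codes here are placeholders, not NeMo's actual conflicts):

```toml
# pyproject.toml -- suppress only the conflicting rules for the package,
# keeping the rest of linting and type checking active.
[tool.ruff.lint.per-file-ignores]
"nemo_evaluator/**" = ["E501", "I001"]  # placeholder rule codes
```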

Empty file.
261 changes: 261 additions & 0 deletions nemo_evaluator/openhands_benchmarks/framework.yml
@@ -0,0 +1,261 @@
framework:
  name: openhands_benchmarks
  pkg_name: openhands_benchmarks
  full_name: OpenHands Benchmarks
  description: Multi-benchmark evaluation harness using the OpenHands agent framework.
  url: https://github.com/All-Hands-AI/openhands-agent-benchmarks

defaults:
  command: >-
    python3 -m nemo_evaluator.openhands_benchmarks.run_benchmark
    --model openai/{{target.api_endpoint.model_id}}
    --api-base-url {{target.api_endpoint.url}}
    {% if target.api_endpoint.api_key_name is not none %}--api-key-env {{target.api_endpoint.api_key_name}}{% endif %}
    --temperature {{config.params.temperature}}
    --top-p {{config.params.top_p}}
    --max-completion-tokens {{config.params.max_new_tokens}}
    --timeout {{config.params.request_timeout}}
    --max-retries {{config.params.max_retries}}
    --benchmark {{config.params.extra.benchmark}}
    {% if config.params.extra.dataset is defined and config.params.extra.dataset is not none %}--dataset {{config.params.extra.dataset}}{% endif %}
    {% if config.params.extra.split is defined and config.params.extra.split is not none %}--split {{config.params.extra.split}}{% endif %}
    --workspace {{config.params.extra.workspace}}
    --max-iterations {{config.params.extra.max_steps}}
    --num-workers {{config.params.parallelism}}
    --note {{config.type}}
    --output-dir {{config.output_dir}}
    --max-attempts {{config.params.extra.max_attempts}}
    --instance-max-retries {{config.params.extra.instance_max_retries}}
    {% if config.params.limit_samples is not none %}--n-limit {{config.params.limit_samples}}{% endif %}
    {% if config.params.extra.level is defined and config.params.extra.level is not none %}--level {{config.params.extra.level}}{% endif %}
    {% if config.params.extra.repo_split is defined and config.params.extra.repo_split is not none %}--repo-split {{config.params.extra.repo_split}}{% endif %}
    {% if config.params.extra.language is defined and config.params.extra.language is not none %}--language {{config.params.extra.language}}{% endif %}
    {% if config.params.extra.modal is defined and config.params.extra.modal is not none %}{% if config.params.extra.modal %}--modal{% else %}--no-modal{% endif %}{% endif %}

  config:
    params:
      limit_samples: null
      temperature: 0.6
      top_p: 1.0
      max_new_tokens: 64000
      request_timeout: 84000
      max_retries: 5
      parallelism: 1
      extra:
        workspace: docker
        max_steps: 100
        max_attempts: 3
        instance_max_retries: 3
  target:
    api_endpoint:
      adapter_config:
        mode: client  # disable adapters by default

evaluations:
  # SWE-bench variants
  - name: swebench-verified
    description: SWE-bench Verified - 500 human-validated GitHub issues
    defaults:
      config:
        type: swebench-verified
        supported_endpoint_types: [chat]
        params:
          extra:
            benchmark: swebench
            dataset: princeton-nlp/SWE-bench_Verified
            split: test

  - name: swebench-lite
    description: SWE-bench Lite - 300 curated GitHub issues
    defaults:
      config:
        type: swebench-lite
        supported_endpoint_types: [chat]
        params:
          extra:
            benchmark: swebench
            dataset: princeton-nlp/SWE-bench_Lite
            split: test

  - name: swebench-full
    description: SWE-bench Full - Complete dataset of GitHub issues
    defaults:
      config:
        type: swebench-full
        supported_endpoint_types: [chat]
        params:
          extra:
            benchmark: swebench
            dataset: princeton-nlp/SWE-bench
            split: test

  # GAIA benchmark
  - name: gaia
    description: GAIA - General AI Assistant benchmark for real-world tasks requiring reasoning, tool use, and web browsing
    defaults:
      config:
        type: gaia
        supported_endpoint_types: [chat]
        params:
          extra:
            benchmark: gaia
            dataset: gaia-benchmark/GAIA
            split: test
            level: "2023_all"

  # Commit0 benchmark
  - name: commit0
    description: Commit0 - Repository-level code generation benchmark
    defaults:
      config:
        type: commit0
        supported_endpoint_types: [chat]
        params:
          extra:
            benchmark: commit0
            dataset: wentingzhao/commit0_combined
            split: test
            repo_split: lite
            max_attempts: 1

  # Multi-SWE-bench (multilingual)
  - name: multiswebench-java
    description: Multi-SWE-bench Java - Multilingual SWE-bench for Java repositories
    defaults:
      config:
        type: multiswebench-java
        supported_endpoint_types: [chat]
        params:
          extra:
            benchmark: multiswebench
            dataset: bytedance-research/Multi-SWE-Bench
            split: java_verified
            language: java

  - name: multiswebench-python  # empty subset
    description: Multi-SWE-bench Python - Multilingual SWE-bench for Python repositories
    defaults:
      config:
        type: multiswebench-python
        supported_endpoint_types: [chat]
        params:
          extra:
            benchmark: multiswebench
            dataset: bytedance-research/Multi-SWE-Bench
            split: python_verified
            language: python

  - name: multiswebench-go
    description: Multi-SWE-bench Go - Multilingual SWE-bench for Go repositories
    defaults:
      config:
        type: multiswebench-go
        supported_endpoint_types: [chat]
        params:
          extra:
            benchmark: multiswebench
            dataset: bytedance-research/Multi-SWE-Bench
            split: go_verified
            language: go

  - name: multiswebench-c
    description: Multi-SWE-bench C - Multilingual SWE-bench for C repositories
    defaults:
      config:
        type: multiswebench-c
        supported_endpoint_types: [chat]
        params:
          extra:
            benchmark: multiswebench
            dataset: bytedance-research/Multi-SWE-Bench
            split: c_verified
            language: c

  - name: multiswebench-cpp
    description: Multi-SWE-bench C++ - Multilingual SWE-bench for C++ repositories
    defaults:
      config:
        type: multiswebench-cpp
        supported_endpoint_types: [chat]
        params:
          extra:
            benchmark: multiswebench
            dataset: bytedance-research/Multi-SWE-Bench
            split: cpp_verified
            language: cpp

  - name: multiswebench-js
    description: Multi-SWE-bench JavaScript - Multilingual SWE-bench for JavaScript repositories
    defaults:
      config:
        type: multiswebench-js
        supported_endpoint_types: [chat]
        params:
          extra:
            benchmark: multiswebench
            dataset: bytedance-research/Multi-SWE-Bench
            split: js_verified
            language: js

  - name: multiswebench-rust
    description: Multi-SWE-bench Rust - Multilingual SWE-bench for Rust repositories
    defaults:
      config:
        type: multiswebench-rust
        supported_endpoint_types: [chat]
        params:
          extra:
            benchmark: multiswebench
            dataset: bytedance-research/Multi-SWE-Bench
            split: rust_verified
            language: rust

  - name: multiswebench-ts
    description: Multi-SWE-bench TypeScript - Multilingual SWE-bench for TypeScript repositories
    defaults:
      config:
        type: multiswebench-ts
        supported_endpoint_types: [chat]
        params:
          extra:
            benchmark: multiswebench
            dataset: bytedance-research/Multi-SWE-Bench
            split: ts_verified
            language: ts

  # SWT-bench
  - name: swtbench
    description: SWT-bench - Software testing benchmark for test generation
    defaults:
      config:
        type: swtbench
        supported_endpoint_types: [chat]
        params:
          extra:
            benchmark: swtbench

  # SWE-bench Multimodal
  - name: swebench-multimodal
    description: SWE-bench Multimodal - GitHub issues with visual context
    defaults:
      config:
        type: swebench-multimodal
        supported_endpoint_types: [chat]
        params:
          extra:
            benchmark: swebenchmultimodal
            dataset: princeton-nlp/SWE-bench_Multimodal
            split: dev  # test split did not work

  # OpenAgentSafety benchmark
  - name: openagentsafety
    description: OpenAgentSafety - Safety evaluation benchmark for AI agents
    defaults:
      config:
        type: openagentsafety
        supported_endpoint_types: [chat]
        params:
          extra:
            benchmark: openagentsafety
            dataset: mgulavani/openagentsafety_full_updated_v3
            split: train
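The command template's handling of the tri-state `modal` option (unset, true, false) can be checked in isolation with Jinja2, the templating engine these `{% if %}` expressions imply (a sketch; the simplified `extra` context here stands in for `config.params.extra`):

```python
from jinja2 import Template  # third-party; assumed available in the eval env

# Same nested-if pattern as the --modal/--no-modal clause in framework.yml:
# key absent or None -> emit nothing, True -> --modal, False -> --no-modal.
tmpl = Template(
    "{% if extra.modal is defined and extra.modal is not none %}"
    "{% if extra.modal %}--modal{% else %}--no-modal{% endif %}"
    "{% endif %}"
)

print(repr(tmpl.render(extra={})))               # flag omitted entirely
print(repr(tmpl.render(extra={"modal": True})))  # '--modal'
print(repr(tmpl.render(extra={"modal": False})))  # '--no-modal'
```

The outer `is defined and is not none` guard is what lets benchmarks that never set `modal` render a clean command line.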
98 changes: 98 additions & 0 deletions nemo_evaluator/openhands_benchmarks/generate_llm_config.py
@@ -0,0 +1,98 @@
from __future__ import annotations
🟠 Important: No tests for new integration package.

You've added 3 new modules (generate_llm_config.py, run_benchmark.py, output.py) totaling ~400 lines with zero test coverage. This is an integration shim to NeMo - if it breaks, debugging will be painful.

Minimum viable tests:

  1. test_generate_config() - verify JSON output format, env var resolution, URL stripping
  2. test_build_infer_cmd() - verify command building for each benchmark type
  3. test_parse_output() - verify report.json parsing and accuracy calculation

"Testing and practice sometimes clash. Testing wins. Every single time." - If this breaks in production, you'll wish you had tests to reproduce the issue locally.

Add tests in nemo_evaluator/tests/ before merging.
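A minimal sketch of the first suggested test, exercising the URL-stripping rule. The helper mirrors the logic inside `generate_config` rather than importing it, so the function name here is illustrative:

```python
def strip_chat_suffix(api_base_url: str) -> str:
    # Mirrors generate_config: drop any trailing slash,
    # then the /chat/completions suffix for LiteLLM compatibility.
    base_url = api_base_url.rstrip("/")
    if base_url.endswith("/chat/completions"):
        base_url = base_url.removesuffix("/chat/completions")
    return base_url


def test_url_stripping() -> None:
    assert strip_chat_suffix("http://h/v1/chat/completions") == "http://h/v1"
    assert strip_chat_suffix("http://h/v1/chat/completions/") == "http://h/v1"
    # Already-bare base URLs pass through unchanged.
    assert strip_chat_suffix("http://h/v1") == "http://h/v1"


test_url_stripping()
print("ok")
```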


import argparse
import json
import os
from pathlib import Path


def generate_config(
    model: str,
    output_path: str,
    api_base_url: str | None = None,
    api_key_env: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    max_completion_tokens: int | None = None,
    timeout: int | None = None,
    max_retries: int | None = None,
) -> None:
    llm_config: dict[str, object] = {"model": model}

    if api_base_url:
        # Strip /chat/completions suffix for LiteLLM compatibility
        base_url = api_base_url.rstrip("/")
        if base_url.endswith("/chat/completions"):
            base_url = base_url.removesuffix("/chat/completions")
        llm_config["base_url"] = base_url
    if api_key_env:
        # Resolve env var name to actual API key
        api_key = os.environ.get(api_key_env, "")
        if not api_key:
            raise ValueError(
                f"Environment variable {api_key_env} is not set or empty. "
                f"Please set it with your API key."
            )
        llm_config["api_key"] = api_key
    if temperature is not None:
        llm_config["temperature"] = temperature
    if top_p is not None:
        llm_config["top_p"] = top_p
    if max_completion_tokens is not None:
        llm_config["max_output_tokens"] = max_completion_tokens
    if timeout is not None:
        llm_config["timeout"] = timeout
    if max_retries is not None:
        llm_config["num_retries"] = max_retries

    out_path = Path(output_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(llm_config, indent=2) + "\n", encoding="utf-8")

    print(f"Wrote LLM config to {str(out_path)}")


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Generate LLM config from CLI args",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )

    parser.add_argument("--model", type=str, required=True, help="Model name/id")
    parser.add_argument("--api-base-url", type=str, help="API base URL")
    parser.add_argument(
        "--api-key-env",
        type=str,
        help="Environment variable name containing the API key",
    )
    parser.add_argument("--temperature", type=float, help="Sampling temperature")
    parser.add_argument("--top-p", type=float, help="Nucleus sampling (top-p)")
    parser.add_argument(
        "--max-completion-tokens", type=int, help="Max completion tokens"
    )
    parser.add_argument("--timeout", type=int, help="API timeout in seconds")
    parser.add_argument("--max-retries", type=int, help="Max API call retries")
    parser.add_argument(
        "--output-path",
        type=str,
        required=True,
        help="Where to write the generated JSON config",
    )

    args = parser.parse_args()

    generate_config(
        model=args.model,
        output_path=args.output_path,
        api_base_url=args.api_base_url,
        api_key_env=args.api_key_env,
        temperature=args.temperature,
        top_p=args.top_p,
        max_completion_tokens=args.max_completion_tokens,
        timeout=args.timeout,
        max_retries=args.max_retries,
    )


if __name__ == "__main__":
    main()
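The shape of the JSON file the script writes can be previewed by assembling the same dict inline. A sketch for a typical invocation (model name and values here are illustrative, not defaults of the script):

```python
import json

# Keys match what generate_config emits; note that the CLI's
# --max-completion-tokens and --max-retries are renamed to the
# LiteLLM-style max_output_tokens and num_retries on output.
llm_config: dict[str, object] = {
    "model": "openai/my-model",        # illustrative
    "base_url": "http://localhost:8000/v1",  # /chat/completions already stripped
    "temperature": 0.6,
    "top_p": 1.0,
    "max_output_tokens": 64000,
    "timeout": 84000,
    "num_retries": 5,
}

print(json.dumps(llm_config, indent=2))
```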