Summary
Two Commit0 repos — cachetools and parsel — fail with 0/0 scores across almost every eval on every retry. The agents solve these repos correctly, but the evaluation harness's post-agent test command is broken.
Bug 1: cachetools — Missing PYTHONPATH=src for src-layout repo
What happens
cachetools uses a src-layout (src/cachetools/). The harness runs:
cd /workspace/cachetools && python -m pytest --json-report --json-report-file=report.json --continue-on-collection-errors tests/
Without PYTHONPATH=src, every test file fails at import:
ImportError: No module named 'cachetools'
Result: {"total": 0, "collected": 0} → ValueError: Report summary missing or empty 'total' field → retry → same failure.
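The import failure can be reproduced outside the harness. The following is a minimal sketch, not the harness code: it builds a throwaway src-layout repo in a temporary directory (the package name demo_srclayout_pkg and both helper functions are hypothetical) and checks whether the package imports from the repo root with and without PYTHONPATH=src.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def can_import(pkg: str, repo: Path, pythonpath: Optional[str] = None) -> bool:
    """Return True if `python -c "import <pkg>"` succeeds when run from `repo`."""
    # Start from the current environment but drop any inherited PYTHONPATH,
    # so the result depends only on the value passed in here.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONPATH"}
    if pythonpath is not None:
        env["PYTHONPATH"] = pythonpath
    result = subprocess.run(
        [sys.executable, "-c", f"import {pkg}"],
        cwd=repo,
        env=env,
        capture_output=True,
    )
    return result.returncode == 0


def demo() -> Tuple[bool, bool]:
    """Build a tiny src-layout repo and try the import both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        pkg_dir = repo / "src" / "demo_srclayout_pkg"  # hypothetical package
        pkg_dir.mkdir(parents=True)
        (pkg_dir / "__init__.py").write_text("")
        without_prefix = can_import("demo_srclayout_pkg", repo)
        with_prefix = can_import("demo_srclayout_pkg", repo, pythonpath="src")
        return without_prefix, with_prefix
```

The first value is the situation the harness is in: the repo root is on sys.path, but src/ is not, so the package is not importable unless it has been installed.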
Evidence
- Claude Code agent: Discovered the fix itself (PYTHONPATH=src python -m pytest tests/) and passed 215/215 tests internally. Harness scored 0/0.
- OH w/ subagents: Agent completed, harness scored 0/0. Retried 3 times, all identical.
- OH Vanilla: Agent completed, harness scored 0/0. Retried 3 times, all identical.
The dataset metadata already includes the field "src_dir": "src/cachetools/", but run_infer.py never reads it.
Root cause
In benchmarks/commit0/run_infer.py, evaluate_instance():
test_cmd = instance.data["test"]["test_cmd"]
test_dir = instance.data["test"]["test_dir"]
if test_cmd.strip() == "pytest":
    test_cmd = "python -m pytest"
full_test_cmd = f"cd {repo_path} && {test_cmd} --json-report --json-report-file=report.json --continue-on-collection-errors {test_dir} > test_output.txt 2>&1"
There is no PYTHONPATH setup for src-layout repos. The src_dir field is available in instance.data["test"] but unused.
Suggested fix
src_dir = instance.data["test"].get("src_dir", "")
env_prefix = ""
if src_dir and "src/" in src_dir:
    env_prefix = "PYTHONPATH=src "
full_test_cmd = f"cd {repo_path} && {env_prefix}{test_cmd} --json-report ... {test_dir} > test_output.txt 2>&1"
Alternatively, run pip install -e . before the test command, which handles any layout.
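For reference, the PYTHONPATH variant can be written as a small standalone function that is easy to unit-test. This is a sketch under assumptions: build_test_cmd is a hypothetical helper, not something that exists in run_infer.py, and it treats any src_dir beginning with "src/" as a src-layout repo. The pytest flags are the ones from the command shown under Root cause.

```python
def build_test_cmd(repo_path: str, test_cmd: str, test_dir: str, src_dir: str = "") -> str:
    """Assemble the post-agent test command (hypothetical helper)."""
    # Same exact-match substitution the harness currently performs.
    if test_cmd.strip() == "pytest":
        test_cmd = "python -m pytest"

    # Assumption: a src_dir that starts with "src/" indicates a src-layout
    # repo, so the package is importable once "src" is on PYTHONPATH.
    env_prefix = "PYTHONPATH=src " if src_dir.startswith("src/") else ""

    return (
        f"cd {repo_path} && {env_prefix}{test_cmd} "
        "--json-report --json-report-file=report.json "
        f"--continue-on-collection-errors {test_dir} > test_output.txt 2>&1"
    )
```

With the cachetools metadata, build_test_cmd("/workspace/cachetools", "pytest", "tests/", "src/cachetools/") yields a command that starts with cd /workspace/cachetools && PYTHONPATH=src python -m pytest. A repo whose src_dir does not start with "src/" gets no prefix.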
Bug 2: parsel — Bare pytest not on $PATH (exit code 127)
What happens
After the agent finishes, the harness runs the test command, which returns exit code 127 ("command not found"). No report.json is generated → FileNotFoundError.
The agents successfully run tests during their sessions using python -m pytest, but the bare pytest binary is not on $PATH in the post-agent evaluation context.
Evidence
All three agents, all retries, identical failure:
Test command exit code: 127
Test command failed with stderr:
Test command failed with stdout:
...
FileNotFoundError: [Errno 2] No such file or directory: 'report.json'
Agent results (never captured by harness):
- CC: 206 passed, 2 skipped (99%)
- OH Vanilla: 206 passed, 2 skipped (99%)
- OH Delegation: 154 passed, 52 failed (74%)
The harness confirms pytest is available at setup time ("Pytest available: pytest 9.0.2"), but the agent's pip install actions during execution alter the environment so the bare pytest binary is no longer resolvable.
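One way to confirm this in the post-agent context is to check both resolution paths separately. The sketch below is a hypothetical diagnostic, not part of the harness. It reports whether a pytest executable is found on $PATH and whether the pytest module is importable by the interpreter running the check. The second result only predicts python -m pytest if the harness invokes that same interpreter.

```python
import importlib.util
import shutil
from typing import Dict


def pytest_resolution() -> Dict[str, bool]:
    """Report how pytest can be reached from the current environment."""
    return {
        # True if a `pytest` executable is found by searching $PATH.
        "bare_pytest_on_path": shutil.which("pytest") is not None,
        # True if `import pytest` would succeed in this interpreter,
        # which is what `python -m pytest` relies on.
        "pytest_importable": importlib.util.find_spec("pytest") is not None,
    }
```

A result with bare_pytest_on_path False and pytest_importable True would match the behaviour described above: exit code 127 for bare pytest while python -m pytest still runs.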
Root cause
In benchmarks/commit0/run_infer.py, the pytest → python -m pytest substitution is an exact match:
if test_cmd.strip() == "pytest":
    test_cmd = "python -m pytest"
If test_cmd is anything other than exactly "pytest" (e.g., "pytest -x", or if the dataset provides a different format), the substitution doesn't fire. More fundamentally, the bare pytest binary can disappear from $PATH after agent execution, while python -m pytest keeps working as long as pytest is importable by that interpreter.
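To make the exact-match limitation concrete, the sketch below compares the current check with a prefix-based rewrite. Both function names are hypothetical. The regex only rewrites a command whose first token is exactly pytest, which is an assumption about the shape of test_cmd values in the dataset.

```python
import re

# Matches a leading bare "pytest" token: optional whitespace, then "pytest"
# followed by whitespace or end of string (so "pytest-cov" is not matched).
_BARE_PYTEST = re.compile(r"^\s*pytest(?=\s|$)")


def exact_match_rewrite(test_cmd: str) -> str:
    """Mirror of the current harness check: only the bare word is rewritten."""
    if test_cmd.strip() == "pytest":
        return "python -m pytest"
    return test_cmd


def prefix_rewrite(test_cmd: str) -> str:
    """Rewrite a leading bare `pytest`, keeping any arguments after it."""
    return _BARE_PYTEST.sub("python -m pytest", test_cmd, count=1)
```

With these definitions, exact_match_rewrite("pytest -x") returns the command unchanged, while prefix_rewrite("pytest -x") returns "python -m pytest -x". Commands that already go through an interpreter, such as "python3 -m pytest", are left alone by both.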
Suggested fix
Always use python -m pytest:
if "pytest" in test_cmd and "python -m pytest" not in test_cmd:
    test_cmd = test_cmd.replace("pytest", "python -m pytest", 1)
Impact
Both bugs are fully deterministic, so retries cannot fix them. The retry mechanism (up to 3 retries, with resource_factor escalating 1→2→4→8) wastes significant compute on a problem that is not resource-related.
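Independently of the two fixes, the harness could avoid escalating resources for failures of this kind. The following is only a sketch of the idea. is_deterministic_failure is a hypothetical helper, and treating "zero tests collected" as non-retryable is an assumption that may not hold for every repo.

```python
from typing import Optional


def is_deterministic_failure(exit_code: int, collected: Optional[int]) -> bool:
    """Guess whether retrying with more resources could change the outcome."""
    # Exit code 127 is the shell's "command not found"; more CPU or memory
    # will not put the missing binary back on $PATH.
    if exit_code == 127:
        return True
    # Assumption: a report that collected zero tests points to an import or
    # configuration problem rather than resource exhaustion.
    if collected == 0:
        return True
    return False
```

A retry loop could call this before escalating resource_factor and stop early when it returns True.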
Affected file
benchmarks/commit0/run_infer.py — evaluate_instance() method, test command construction section.