
[Bug] Commit0 evaluation harness fails to score cachetools and parsel due to test command issues #526

@VascoSch92


Summary

Two Commit0 repos — cachetools and parsel — fail with 0/0 scores across almost every eval on every retry. The agents solve these repos correctly, but the evaluation harness's post-agent test command is broken.

Bug 1: cachetools — Missing PYTHONPATH=src for src-layout repo

What happens

cachetools uses a src-layout (src/cachetools/). The harness runs:

cd /workspace/cachetools && python -m pytest --json-report --json-report-file=report.json --continue-on-collection-errors tests/

Without PYTHONPATH=src, every test file fails at import:

ImportError: No module named 'cachetools'

Result: {"total": 0, "collected": 0} → ValueError: Report summary missing or empty 'total' field → retry → same failure.

Evidence

  • Claude Code agent: Discovered the fix itself (PYTHONPATH=src python -m pytest tests/), passed 215/215 tests internally. Harness scored 0/0.
  • OH w/ subagents: Agent completed, harness scored 0/0. Retried 3 times, all identical.
  • OH Vanilla: Agent completed, harness scored 0/0. Retried 3 times, all identical.

The dataset metadata already includes the field "src_dir": "src/cachetools/", but run_infer.py never reads it.

Root cause

In benchmarks/commit0/run_infer.py, evaluate_instance():

test_cmd = instance.data["test"]["test_cmd"]
test_dir = instance.data["test"]["test_dir"]
if test_cmd.strip() == "pytest":
    test_cmd = "python -m pytest"
full_test_cmd = f"cd {repo_path} && {test_cmd} --json-report --json-report-file=report.json --continue-on-collection-errors {test_dir} > test_output.txt 2>&1"

There is no PYTHONPATH setup for src-layout repos. The src_dir field is available in instance.data["test"] but unused.

Suggested fix

src_dir = instance.data["test"].get("src_dir", "")
env_prefix = ""
if src_dir and "src/" in src_dir:
    env_prefix = "PYTHONPATH=src "
full_test_cmd = f"cd {repo_path} && {env_prefix}{test_cmd} --json-report ... {test_dir} > test_output.txt 2>&1"

Alternatively, run pip install -e . before the test command, which handles any layout.
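A minimal, self-contained sketch of the fixed command construction (the `build_full_test_cmd` helper and the dict shape are illustrative, not the actual run_infer.py API):

```python
# Hypothetical helper sketching the PYTHONPATH fix for src-layout repos.
def build_full_test_cmd(repo_path: str, test_data: dict) -> str:
    test_cmd = test_data["test_cmd"]
    test_dir = test_data["test_dir"]
    if test_cmd.strip() == "pytest":
        test_cmd = "python -m pytest"
    # Prepend PYTHONPATH=src when the dataset flags a src-layout repo.
    env_prefix = ""
    src_dir = test_data.get("src_dir", "")
    if src_dir and "src/" in src_dir:
        env_prefix = "PYTHONPATH=src "
    return (
        f"cd {repo_path} && {env_prefix}{test_cmd} "
        f"--json-report --json-report-file=report.json "
        f"--continue-on-collection-errors {test_dir} "
        f"> test_output.txt 2>&1"
    )

cmd = build_full_test_cmd(
    "/workspace/cachetools",
    {"test_cmd": "pytest", "test_dir": "tests/", "src_dir": "src/cachetools/"},
)
print(cmd)
```

Repos without a src_dir field get an empty prefix, so flat-layout repos are unaffected.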


Bug 2: parsel — Bare pytest not on $PATH (exit code 127)

What happens

After the agent finishes, the harness runs the test command, which returns exit code 127 ("command not found"). No report.json is generated → FileNotFoundError.

The agents successfully run tests during their sessions using python -m pytest, but the bare pytest binary is not on $PATH in the post-agent evaluation context.
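The PATH dependence can be demonstrated with a quick lookup sketch (the empty temp directory stands in for the stripped-down post-agent $PATH; this is an illustration, not harness code):

```python
import shutil
import tempfile

# A lookup against a PATH containing only an empty directory finds nothing;
# a shell reports this as exit code 127 ("command not found").
empty_dir = tempfile.mkdtemp()
print(shutil.which("pytest", path=empty_dir))  # -> None
```

`python -m pytest` sidesteps this entirely: it resolves pytest through the interpreter's site-packages rather than through $PATH.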

Evidence

All three agents, all retries, identical failure:

Test command exit code: 127
Test command failed with stderr:
Test command failed with stdout:
...
FileNotFoundError: [Errno 2] No such file or directory: 'report.json'

Agent results (never captured by harness):

  • CC: 206 passed, 2 skipped (99%)
  • OH Vanilla: 206 passed, 2 skipped (99%)
  • OH Delegation: 154 passed, 52 failed (74%)

The harness confirms pytest is available at setup time ("Pytest available: pytest 9.0.2"), but the agent's pip install actions during execution alter the environment so the bare pytest binary is no longer resolvable.

Root cause

In benchmarks/commit0/run_infer.py, the pytest → python -m pytest substitution is an exact match:

if test_cmd.strip() == "pytest":
    test_cmd = "python -m pytest"

If test_cmd is anything other than exactly "pytest" (e.g., "pytest -x", or if the dataset provides a different format), the substitution doesn't fire. More fundamentally, the bare pytest binary can disappear from $PATH after agent execution, while python -m pytest always works.

Suggested fix

Always use python -m pytest:

if "pytest" in test_cmd and "python -m pytest" not in test_cmd:
    test_cmd = test_cmd.replace("pytest", "python -m pytest", 1)
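As a quick sanity check of that substitution (the `normalize_test_cmd` wrapper is illustrative):

```python
# Illustrative wrapper around the suggested substring substitution.
def normalize_test_cmd(test_cmd: str) -> str:
    if "pytest" in test_cmd and "python -m pytest" not in test_cmd:
        test_cmd = test_cmd.replace("pytest", "python -m pytest", 1)
    return test_cmd

print(normalize_test_cmd("pytest"))            # -> python -m pytest
print(normalize_test_cmd("pytest -x"))         # -> python -m pytest -x
print(normalize_test_cmd("python -m pytest"))  # already normalized; unchanged
```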

Impact

Both bugs are 100% deterministic — retries cannot fix them. The retry mechanism (up to 3 retries with resource_factor escalation 1→2→4→8) wastes significant compute on a problem that is not resource-related.

Affected file

benchmarks/commit0/run_infer.py, evaluate_instance() method, test command construction section.
