
[Bug] Commit0 evaluation harness fails to score cachetools and parsel due to test command issues #526

@VascoSch92


Summary

Two Commit0 repos — cachetools and parsel — fail with 0/0 scores across almost every eval on every retry. The agents solve these repos correctly, but the evaluation harness's post-agent test command is broken.

Bug 1: cachetools — Missing PYTHONPATH=src for src-layout repo

What happens

cachetools uses a src-layout (src/cachetools/). The harness runs:

cd /workspace/cachetools && python -m pytest --json-report --json-report-file=report.json --continue-on-collection-errors tests/

Without PYTHONPATH=src, every test file fails at import:

ImportError: No module named 'cachetools'

Result: {"total": 0, "collected": 0} → ValueError: Report summary missing or empty 'total' field → retry → same failure.

Evidence

  • Claude Code agent: Discovered the fix itself (PYTHONPATH=src python -m pytest tests/), passed 215/215 tests internally. Harness scored 0/0.
  • OH w/ subagents: Agent completed, harness scored 0/0. Retried 3 times, all identical.
  • OH Vanilla: Agent completed, harness scored 0/0. Retried 3 times, all identical.

The dataset metadata already includes the field "src_dir": "src/cachetools/", but run_infer.py never reads it.

Root cause

In benchmarks/commit0/run_infer.py, evaluate_instance():

test_cmd = instance.data["test"]["test_cmd"]
test_dir = instance.data["test"]["test_dir"]
if test_cmd.strip() == "pytest":
    test_cmd = "python -m pytest"
full_test_cmd = f"cd {repo_path} && {test_cmd} --json-report --json-report-file=report.json --continue-on-collection-errors {test_dir} > test_output.txt 2>&1"

There is no PYTHONPATH setup for src-layout repos. The src_dir field is available in instance.data["test"] but unused.

Suggested fix

src_dir = instance.data["test"].get("src_dir", "")
env_prefix = ""
if src_dir and "src/" in src_dir:
    env_prefix = "PYTHONPATH=src "
full_test_cmd = f"cd {repo_path} && {env_prefix}{test_cmd} --json-report ... {test_dir} > test_output.txt 2>&1"

Alternatively, run pip install -e . before the test command, which handles any layout.
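A minimal, self-contained sketch of the fixed command construction (the `build_full_test_cmd` helper and the dict shape are illustrative, not the actual run_infer.py API):

```python
# Hypothetical helper sketching the PYTHONPATH fix for src-layout repos.
def build_full_test_cmd(repo_path: str, test_data: dict) -> str:
    test_cmd = test_data["test_cmd"]
    test_dir = test_data["test_dir"]
    if test_cmd.strip() == "pytest":
        test_cmd = "python -m pytest"
    # Prepend PYTHONPATH=src when the dataset flags a src-layout repo.
    env_prefix = ""
    src_dir = test_data.get("src_dir", "")
    if src_dir and "src/" in src_dir:
        env_prefix = "PYTHONPATH=src "
    return (
        f"cd {repo_path} && {env_prefix}{test_cmd} "
        f"--json-report --json-report-file=report.json "
        f"--continue-on-collection-errors {test_dir} "
        f"> test_output.txt 2>&1"
    )

cmd = build_full_test_cmd(
    "/workspace/cachetools",
    {"test_cmd": "pytest", "test_dir": "tests/", "src_dir": "src/cachetools/"},
)
print(cmd)
```

Repos without a src_dir field get an empty prefix, so flat-layout repos are unaffected.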


Bug 2: parsel — Bare pytest not on $PATH (exit code 127)

What happens

After the agent finishes, the harness runs the test command, which returns exit code 127 ("command not found"). No report.json is generated → FileNotFoundError.

The agents successfully run tests during their sessions using python -m pytest, but the bare pytest binary is not on $PATH in the post-agent evaluation context.
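The PATH dependence can be demonstrated with a quick lookup sketch (the empty temp directory stands in for the stripped-down post-agent $PATH; this is an illustration, not harness code):

```python
import shutil
import tempfile

# A lookup against a PATH containing only an empty directory finds nothing;
# a shell reports this as exit code 127 ("command not found").
empty_dir = tempfile.mkdtemp()
print(shutil.which("pytest", path=empty_dir))  # -> None
```

`python -m pytest` sidesteps this entirely: it resolves pytest through the interpreter's site-packages rather than through $PATH.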

Evidence

All three agents, all retries, identical failure:

Test command exit code: 127
Test command failed with stderr:
Test command failed with stdout:
...
FileNotFoundError: [Errno 2] No such file or directory: 'report.json'

Agent results (never captured by harness):

  • CC: 206 passed, 2 skipped (99%)
  • OH Vanilla: 206 passed, 2 skipped (99%)
  • OH Delegation: 154 passed, 52 failed (74%)

The harness confirms pytest is available at setup time ("Pytest available: pytest 9.0.2"), but the agent's pip install actions during execution alter the environment so the bare pytest binary is no longer resolvable.

Root cause

In benchmarks/commit0/run_infer.py, the pytest → python -m pytest substitution is an exact match:

if test_cmd.strip() == "pytest":
    test_cmd = "python -m pytest"

If test_cmd is anything other than exactly "pytest" (e.g., "pytest -x", or if the dataset provides a different format), the substitution doesn't fire. More fundamentally, the bare pytest binary can disappear from $PATH after agent execution, while python -m pytest always works.

Suggested fix

Always use python -m pytest:

if "pytest" in test_cmd and "python -m pytest" not in test_cmd:
    test_cmd = test_cmd.replace("pytest", "python -m pytest", 1)
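As a quick sanity check of that substitution (the `normalize_test_cmd` wrapper is illustrative):

```python
# Illustrative wrapper around the suggested substring substitution.
def normalize_test_cmd(test_cmd: str) -> str:
    if "pytest" in test_cmd and "python -m pytest" not in test_cmd:
        test_cmd = test_cmd.replace("pytest", "python -m pytest", 1)
    return test_cmd

print(normalize_test_cmd("pytest"))            # -> python -m pytest
print(normalize_test_cmd("pytest -x"))         # -> python -m pytest -x
print(normalize_test_cmd("python -m pytest"))  # already normalized; unchanged
```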

Impact

Both bugs are 100% deterministic — retries cannot fix them. The retry mechanism (up to 3 retries with resource_factor escalation 1→2→4→8) wastes significant compute on a problem that is not resource-related.

Affected file

benchmarks/commit0/run_infer.py, evaluate_instance() method, test command construction section.
