fix(commit0): evaluation harness fails to score cachetools and parsel due to test command issues by VascoSch92 · Pull Request #539 · OpenHands/benchmarks

VascoSch92 · 2026-03-19T08:10:59Z

This PR fixes issue #526.

I tested here: https://openhands-ai.slack.com/archives/C09QGUDQVTL/p1773907384946099

Here are the results:

cachetools

Before (bug): 0/0 — ImportError: No module named 'cachetools' due to missing PYTHONPATH=src
Now: 215/215 passed (100%) — exit code 0, all tests pass, report.json generated correctly

parsel

Before (bug): 0/0 — exit code 127 ("command not found") because bare pytest wasn't on $PATH, no report.json generated
Now: 155/208 passed (74.5%) — exit code 1 (some test failures, but the harness correctly ran tests and captured results)

The parsel failures (51 failed, 155 passed) are genuine agent implementation gaps, not harness bugs: the test command executed successfully and produced a proper report.
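To make the exit codes above concrete, here is a minimal sketch of how a harness might interpret them. The helper names `classify_exit_code` and `pass_rate` are hypothetical and not taken from this PR. The mapping assumes only the conventional shell meaning of 126/127 and pytest's documented exit codes 0, 1, and 5.

```python
def classify_exit_code(code: int) -> str:
    """Map a test-command exit code to a coarse outcome (hypothetical helper)."""
    if code == 0:
        return "all_passed"            # pytest ran and every test passed
    if code == 1:
        return "tests_failed"          # pytest ran; at least one test failed
    if code == 5:
        return "no_tests_collected"    # pytest ran but found no tests
    if code in (126, 127):
        return "command_not_runnable"  # shell: not executable / command not found
    return "other_error"


def pass_rate(passed: int, total: int) -> float:
    """Percentage of passing tests, rounded to one decimal; 0.0 when nothing ran."""
    return round(100 * passed / total, 1) if total else 0.0


print(classify_exit_code(127), pass_rate(0, 0))      # parsel before the fix
print(classify_exit_code(1), pass_rate(155, 208))    # parsel after the fix
print(classify_exit_code(0), pass_rate(215, 215))    # cachetools after the fix
```

Under this reading, exit code 1 for parsel is a successful harness run with failing tests, whereas 127 means the tests never started.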

all-hands-bot

🟡 Acceptable - Solves real problems (ImportError and PATH issues) but has some fragile heuristics that could break edge cases.

benchmarks/commit0/run_infer.py

all-hands-bot

🟡 Acceptable - Solves real production problems (cachetools, parsel) and tests are comprehensive. Two minor code clarity issues that don't block merging but would improve maintainability.

all-hands-bot · 2026-03-19T08:39:45Z

benchmarks/commit0/run_infer.py

+def normalize_pytest_cmd(test_cmd: str) -> str:
+    """Replace bare pytest/pytest3 with python -m pytest to avoid PATH/permission issues."""
+    if re.match(r"pytest\d?$", test_cmd.strip()) and "python -m pytest" not in test_cmd:
+        test_cmd = re.sub(r"\bpytest(\d?)", r"python -m pytest\1", test_cmd, count=1)


🟡 Suggestion: Logic redundancy - lacks "good taste"

You check if the stripped command is exactly pytest or pytest3 with re.match(r"pytest\d?$", ...), but then use word boundaries \b in the substitution. If you've already verified it's exactly "pytest", word boundaries are redundant.

Simpler approach:

    def normalize_pytest_cmd(test_cmd: str) -> str:
        """Replace bare pytest/pytest3 with python -m pytest to avoid PATH/permission issues."""
        stripped = test_cmd.strip()
        if stripped in ("pytest", "pytest3") and "python -m pytest" not in test_cmd:
            return test_cmd.replace(stripped, f"python -m {stripped}", 1)
        return test_cmd

Or if you need the regex flexibility:

    if re.match(r"^pytest\d?$", test_cmd.strip()):
        return re.sub(r"^pytest(\d?)$", r"python -m pytest\1", test_cmd.strip())
    return test_cmd

Current code works but mixes two different validation strategies unnecessarily.

all-hands-bot · 2026-03-19T08:39:45Z

benchmarks/commit0/run_infer.py

+def get_pythonpath_prefix(src_dir: str) -> str:
+    """Return PYTHONPATH env prefix for src-layout repos."""
+    if src_dir and src_dir.startswith("src"):
+        return "PYTHONPATH=src:$PYTHONPATH "


🟡 Suggestion: Misleading function signature

The function takes src_dir as a parameter but ignores its actual value and returns a hardcoded "PYTHONPATH=src:$PYTHONPATH ". This makes it look like the function would use different paths based on src_dir, but it only uses it as a boolean check.

Clearer alternatives:

Inline it (simplest):

env_prefix = "PYTHONPATH=src:$PYTHONPATH " if src_dir and src_dir.startswith("src") else ""

Rename to reflect actual behavior:

    def is_src_layout(src_dir: str) -> bool:
        """Check if repo uses src-layout requiring PYTHONPATH."""
        return bool(src_dir and src_dir.startswith("src"))

    # Then:
    env_prefix = "PYTHONPATH=src:$PYTHONPATH " if is_src_layout(src_dir) else ""

The current name get_pythonpath_prefix suggests it derives a path from the parameter, but it doesn't.

tests/test_commit0_run_infer.py

The regex anchor $ prevented matching 'pytest --assert=plain ...' (parsel). Changed to (\s|$) to match pytest followed by flags or end-of-string. Fixes parsel exit code 127 in commit0 evaluation. Ref: #526
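The anchor change described in this commit message can be illustrated with a small sketch. It is not the code merged in the PR: the name `normalize_pytest_cmd_sketch` and the exact pattern are assumptions, chosen to show how `(\s|$)` accepts both a bare `pytest` and `pytest` followed by flags, while leaving other commands alone. Like the quoted diff, it keeps the optional trailing digit.

```python
import re

# Command starts with pytest or pytest<digit>, followed by whitespace or end of string.
_BARE_PYTEST = re.compile(r"^pytest(\d?)(\s|$)")


def normalize_pytest_cmd_sketch(test_cmd: str) -> str:
    """Rewrite a leading bare pytest invocation to `python -m pytest` (illustrative only)."""
    stripped = test_cmd.strip()
    # \1 restores the optional digit, \2 restores the separator (or nothing at end of string).
    return _BARE_PYTEST.sub(r"python -m pytest\1\2", stripped, count=1)


for cmd in ("pytest", "pytest --assert=plain", "python -m pytest", "pytest-xdist foo"):
    print(repr(cmd), "->", repr(normalize_pytest_cmd_sketch(cmd)))
```

With the earlier `$`-only anchor, the second command would not have matched, which is consistent with the exit code 127 reported for parsel.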

The src_dir field is a top-level field in the dataset:

    {"test": {"test_cmd": "pytest", "test_dir": "tests/"}, "src_dir": "src/cachetools/"}

The previous code read from instance.data["test"].get("src_dir", "") which always returned "" because src_dir is not nested under "test". This caused the PYTHONPATH=src prefix to never be applied for cachetools.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
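The lookup bug described in this commit message can be reproduced with a plain dictionary. The sketch below is illustrative, not the PR's code: `instance_data` mirrors the dataset record quoted in the message, and `pythonpath_prefix` is a hypothetical helper that applies the same `startswith("src")` check shown in the reviewed diff.

```python
# Record shaped like the dataset entry quoted in the commit message.
instance_data = {
    "test": {"test_cmd": "pytest", "test_dir": "tests/"},
    "src_dir": "src/cachetools/",
}


def pythonpath_prefix(data: dict) -> str:
    """Return the env prefix for src-layout repos, reading src_dir at the top level."""
    src_dir = data.get("src_dir", "")
    return "PYTHONPATH=src:$PYTHONPATH " if src_dir.startswith("src") else ""


# Old lookup: src_dir is not nested under "test", so this always falls back to "".
nested_lookup = instance_data["test"].get("src_dir", "")
# Fixed lookup: read the top-level field.
top_level_lookup = instance_data.get("src_dir", "")

full_cmd = pythonpath_prefix(instance_data) + instance_data["test"]["test_cmd"]
print(repr(nested_lookup), repr(top_level_lookup))
print(full_cmd)
```

Because the nested lookup silently returns the default, the missing prefix shows up only later as an ImportError at test time.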

Your message

68feee6

VascoSch92 requested a review from all-hands-bot March 19, 2026 08:11

VascoSch92 changed the title ~~fix(commit0):~~ fix(commit0): evaluation harness fails to score cachetools and parsel due to test command issues Mar 19, 2026

all-hands-bot reviewed Mar 19, 2026

Address PR review: tighten pytest detection, fix PYTHONPATH, add tests

c9ae51f

VascoSch92 marked this pull request as ready for review March 19, 2026 08:36

all-hands-bot reviewed Mar 19, 2026

VascoSch92 requested a review from juanmichelini March 19, 2026 08:44

VascoSch92 and others added 2 commits March 19, 2026 11:42

fix: handle pytest with extra flags in normalize_pytest_cmd

7652655

VascoSch92 wants to merge 4 commits into main from vasco/fix-issue-526
