
feat: Martian benchmark harness — 5-phase evaluation pipeline#51

Draft
Nelson Spence (Fieldnote-Echo) wants to merge 8 commits into main from feat/martian-benchmark-harness

Conversation

@Fieldnote-Echo
Member

Summary

  • Adds a full 5-phase Martian benchmark harness for evaluating Grippy's review quality against known-vulnerability diffs
  • Includes indexed benchmark runner with lite toolkit fallback for resource-constrained environments
  • 33 tests covering the benchmark pipeline

Also includes 3 bug fixes (being cherry-picked to main separately):

  • fix: resolve tool + structured-output grammar conflict on local endpoints
  • fix: suppress output_schema for non-structured providers (Anthropic et al)
  • fix: preserve diff context in retry messages

Test plan

  • uv run pytest tests/test_bench*.py -v passes
  • Benchmark runner completes a sample evaluation cycle
  • Bug fixes verified independently on main after cherry-pick

🤖 Generated with Claude Code

try:
    imports = extract_imports(py_f)
except ValueError:
    continue

    model_id=os.environ.get("GRIPPY_MODEL_ID", cls.model_id),
    transport=os.environ.get("GRIPPY_TRANSPORT", cls.transport),
    base_url=os.environ.get("GRIPPY_BASE_URL", cls.base_url),
    api_key=os.environ.get("GRIPPY_API_KEY", cls.api_key),
Contributor

🔴 CRITICAL: Hardcoded secret detected in API key assignment

Confidence: 100%

A generic secret (likely API key) is assigned directly in code at line 37: 'api_key: str = "lm-studio"'. Even though this appears to be a placeholder or local default, committing API keys or secrets, even development ones, into tracked source is a critical security risk. If this value is ever used in production or cloud environments, it could be exfiltrated, leading to unauthorized access. This pattern is caught by secret-scanning tools and may cause repository blocks on some platforms.

Suggestion: Remove hardcoded secrets. Use environment variables or a secure configuration system and ensure 'api_key' is never checked into version control. If this is a fake/dev key, document that and enforce in code that it can never be used outside test/dev.

— Committed secrets, even fake ones, trip repository scanners instantly. Don't do this.
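One way to enforce the suggestion above, sketched with hypothetical names (the `GRIPPY_API_KEY` variable and `resolve_api_key` helper are illustrative, not the PR's actual code):

```python
import os

# Known placeholder values that must never reach production (assumed list).
_DEV_KEYS = {"lm-studio", "test-key"}

def resolve_api_key(env: str = "GRIPPY_API_KEY", allow_dev: bool = False) -> str:
    """Read the API key from the environment; refuse dev placeholders unless allowed."""
    key = os.environ.get(env, "")
    if not key:
        raise RuntimeError(f"{env} is not set; refusing to run without a key")
    if key in _DEV_KEYS and not allow_dev:
        raise RuntimeError(f"{env} holds a dev placeholder; set a real key")
    return key
```

Callers in test/dev CI would pass `allow_dev=True` explicitly, so a static default can never slip into a production path unnoticed.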

def fetch_all(config: BenchConfig | None = None) -> None:
    """Fetch diffs for all golden PRs."""
    config = config or BenchConfig()
    token = os.environ.get("GITHUB_TOKEN", "")
Contributor

🔴 CRITICAL: Potential secret detected in variable assignment

Confidence: 100%

An assignment 'token = ...' is present at line 71. This variable is used for GitHub API authentication. If this code includes a real token, or if a developer ever forgets to rely solely on environment variables, a valid credential could be exposed. Public or internal exposure of tokens grants attackers access privileges up to the token's scope.

Suggestion: Never assign tokens directly in code. Retrieve from secure sources (e.g., environment variables) and validate that no static strings or development tokens are ever committed.

— Even an empty or placeholder token field leads to accidental real-key commits down the road. Apply strict discipline.
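A minimal sketch of the environment-only discipline the comment asks for; the `github_token` helper name is hypothetical:

```python
import os

def github_token(required: bool = True) -> str:
    """Fetch GITHUB_TOKEN from the environment only; never from code or config files."""
    token = os.environ.get("GITHUB_TOKEN", "")
    if required and not token:
        raise RuntimeError("GITHUB_TOKEN is not set; export it before fetching diffs")
    return token
```

Making the missing-token case raise loudly (rather than defaulting to an empty string) prevents the "quietly add a literal token to unblock myself" failure mode the reviewer warns about.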

agent = create_reviewer(
    model_id=config.model_id,
    base_url=config.base_url,
    api_key=config.api_key,
Contributor

🔴 CRITICAL: Hardcoded API key assignment in code

Confidence: 100%

Line 82 of run_grippy.py (in creation of the reviewer agent) passes 'api_key' as a code argument. If this value is ever a real key (and not strictly environment-provided as intended), it will be embedded in the repository and deployment artifacts. This is a security risk, particularly for environments where audit or review keys have broad access.

Suggestion: Only pass API keys from secure configuration or environment, not by assignment or default argument. Add checks to prevent running with any static/test default key outside of regulated dev/test CI.

— Security reviewers panic when they see API key params assigned anywhere but config. Clean this up.

    transport=config.transport,
    model="text-embedding-qwen3-embedding-4b",
    base_url=config.base_url,
    api_key=config.api_key,
Contributor

🔴 CRITICAL: API key assignment for model access must not use static values

Confidence: 100%

At line 101, an API key is passed as a parameter in code that sets up an embedding model. Passing API keys or secrets as arguments that could be defaulted or hardcoded results in secrets being captured in config files, logs, or by accident. If a real key slips in here, a leak is inevitable.

Suggestion: Ensure API keys come exclusively from secure sources (e.g., environment variables) and are never set by positional or default argument in code. Consider explicit guardrails to detect if a default/fake key is used outside CI or dev-only execution.

— This is how real secrets sneak into repositories: misguided convenience for local testing. Resist the temptation.

agent = create_reviewer(
    model_id=config.model_id,
    base_url=config.base_url,
    api_key=config.api_key,
Contributor

🔴 CRITICAL: API key parameter in agent creation must avoid static assignment

Confidence: 100%

The code at line 205 assigns 'api_key' directly as a keyword argument when constructing the reviewer agent. Any static value (not sourced from the environment) is a possible accidental leak. Reviewing/testing harnesses often tempt developers to hardcode test creds, creating long-lived security debt.

Suggestion: Only ever source 'api_key' from OS environment or ephemeral config. Log a warning (or raise) if a default is used by accident in production or main.

— CI scripts set bad precedents. Allowing static keys here will eventually bite you.

    "title": "SQL injection risk",
    "description": "Query concatenates user input without parameterization.",
    "suggestion": "Use parameterized queries.",
    "evidence": "line 42: query = f'SELECT * FROM users WHERE id={user_id}'",
Contributor

🟠 HIGH: Test contains SQL injection risk in query template

Confidence: 85%

A unit test defines an 'evidence' string containing: 'query = f'SELECT * FROM users WHERE id={user_id}''. While this appears in test code and the danger is only theoretical here, this is an example of query construction via string formatting with unsanitized input, the classic SQL injection pattern. Even as a test artifact, this pattern can confuse code search tools that look for actual code risk, and could encourage mis-copying into production logic.

Suggestion: Clarify in the test (via comments or structure) that this is only test data, not real code. Ensure no test logic executes actual queries constructed this way, nor exports this pattern in code samples/examples.

— Even in a test data string, this pattern matches actual vuln signatures. Be explicit that this is inert.
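For contrast with the inert evidence string, the parameterized form the suggestion refers to looks like this (a self-contained sqlite3 sketch; table and data are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", (42, "ada"))

# Hostile input is bound as data, never interpolated into the SQL text.
user_id = "42 OR 1=1"
rows = conn.execute("SELECT * FROM users WHERE id = ?", (user_id,)).fetchall()
# The injection attempt matches no row; a genuine id of 42 still would.
```

The `?` placeholder keeps the query plan and the user value separate, so the string `42 OR 1=1` is compared as a value rather than parsed as SQL.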

Contributor

@github-actions github-actions bot left a comment

Grippy requests changes — FAIL (0/100)

@github-actions
Contributor

github-actions bot commented Mar 13, 2026

❌ Grippy Review — FAIL

Score: 75/100 | Findings: 6


Commit: e85d186

Implements dev-path harness for running Grippy against the
withmartian/code-review-benchmark 50-PR golden dataset. Five sequential steps:

1. fetch_diffs.py — GitHub API diff fetcher with disk caching
2. run_grippy.py — Grippy reviewer with frozen config, safe resume
3. extract.py — Candidate extraction (inline direct, general LLM)
4. judge.py — Martian judge prompt with greedy best-confidence matching
5. report.py — Metrics, per-repo breakdown, unified failure accounting

Key design decisions:
- Martian prompts vendored verbatim from pinned commit 012d682 with SHA-256 checksums
- BenchConfig frozen dataclass with full provenance stamping
- Benchmark comment formatter strips metadata to avoid biasing judge
- Anthropic imports are lazy (only at LLM call time)
- Golden comments vendored (50 PRs across 5 OSS repos)
- Unique failure deduplication across phases
- >10% failure rate warning blocks public claims

Design doc: docs/plans/2026-03-12-martian-benchmark-harness-design.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Coverage across all phases:
- config: frozen immutability, env override, stamp provenance, full SHA-256
- fetch: golden PR parsing, URL extraction (calcom/cal.com), diff caching
- run: inline detection, comment formatting, production parity
- extract: prompt verification, inline passthrough, general LLM extraction
- judge: greedy matching, metrics (perfect/partial/zero), custom judge fn
- report: unified failure accounting, unique dedup, >10% warning, clean path
- contracts: Grippy API surface, Anthropic SDK response shape, prompt checksums

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Claude Sonnet sometimes wraps JSON responses in ```json...``` code
fences. Both _llm_judge() and _llm_extract_default() now strip these
before parsing, matching the pattern in Martian's own implementation.

Also adds benchmarks/martian/output-*/ to .gitignore for smoke test
variant output directories (diff-only, indexed, etc).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
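The fence-stripping described in the commit above can be sketched as follows; the helper name `strip_json_fences` is illustrative, not necessarily what `_llm_judge()` and `_llm_extract_default()` call:

```python
import json
import re

# Matches an entire response wrapped in ``` or ```json fences.
_FENCE = re.compile(r"^```(?:json)?\s*\n(.*?)\n```\s*$", re.DOTALL)

def strip_json_fences(text: str) -> str:
    """Remove a surrounding ```json ... ``` fence before handing text to json.loads."""
    m = _FENCE.match(text.strip())
    return m.group(1) if m else text.strip()
```

Anchoring the pattern at both ends keeps legitimate fenced snippets inside the JSON payload untouched; only a fence wrapping the whole response is removed.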
…ints

LM Studio, Ollama, and vLLM cannot combine response_format (structured
output grammar) with tool-calling grammars in the same API request. Add
_LocalModel subclass that strips response_format when tools are present,
relying on system-prompt JSON instructions + retry layer validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t al)

Anthropic's API rejects requests where the compiled grammar from
GrippyReview's JSON schema exceeds the size limit. For providers where
structured_outputs=False (Anthropic, Google, Groq, Mistral), skip
output_schema — the prompt chain (output-schema.md) already provides
the full schema, and retry.py handles parsing + Pydantic validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the first LLM attempt fails validation (e.g. negative scores),
the retry message previously lost all PR diff context, causing the
model to hallucinate a generic review. Now retry messages include
the original PR context alongside the error feedback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
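The retry-context fix above amounts to rebuilding the retry prompt from both pieces; `build_retry_message` is a hypothetical name for the shape of that change:

```python
def build_retry_message(original_context: str, error: str) -> str:
    """Sketch: keep the PR diff context in the retry prompt alongside the
    validation error, so the model cannot drift into a generic review."""
    return (
        f"{original_context}\n\n"
        f"Your previous answer failed validation: {error}\n"
        "Re-emit the full review, fixing only the reported problem."
    )
```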
Mirrors the production review pipeline: builds CodebaseIndex + graph
context, falls back to lite toolkit (grep/read/list) when embedding
model is unavailable. Includes embedding model probe to avoid slow
retry loops on VRAM contention.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
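The probe-then-fallback behavior described above can be sketched like this; the function and return labels are illustrative, not the harness's actual API:

```python
def choose_toolkit(probe_embedding) -> str:
    """Sketch: probe the embedding endpoint once; on any failure fall back to
    the lite toolkit (grep/read/list) instead of retrying into VRAM contention."""
    try:
        probe_embedding("ping")  # one cheap call acts as a health check
        return "indexed"
    except Exception:
        return "lite"
```

Doing the probe once up front means a missing embedding model costs a single failed call rather than a slow retry loop inside every review.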
Contributor

@github-actions github-actions bot left a comment

Grippy requests changes — FAIL (25/100)

@github-actions github-actions bot dismissed their stale review March 13, 2026 18:48

Superseded by review of a8e1e37

detect-secrets flagged test-key strings in test_grippy_agent.py.
These are fake credentials for unit tests, not real secrets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@github-actions github-actions bot left a comment

Grippy requests changes — FAIL (75/100)

@github-actions github-actions bot dismissed their stale review March 13, 2026 18:59

Superseded by review of e85d186
