
feat: Martian benchmark harness — 5-phase evaluation pipeline#51

Draft
Nelson Spence (Fieldnote-Echo) wants to merge 8 commits into main from feat/martian-benchmark-harness

Conversation

@Fieldnote-Echo
Member

Summary

  • Adds a full 5-phase Martian benchmark harness for evaluating Grippy's review quality against known-vulnerability diffs
  • Includes indexed benchmark runner with lite toolkit fallback for resource-constrained environments
  • 33 tests covering the benchmark pipeline

Also includes 3 bug fixes (being cherry-picked to main separately):

  • fix: resolve tool + structured-output grammar conflict on local endpoints
  • fix: suppress output_schema for non-structured providers (Anthropic et al)
  • fix: preserve diff context in retry messages

Test plan

  • uv run pytest tests/test_bench*.py -v passes
  • Benchmark runner completes a sample evaluation cycle
  • Bug fixes verified independently on main after cherry-pick

🤖 Generated with Claude Code

try:
    imports = extract_imports(py_f)
except ValueError:
    continue

    model_id=os.environ.get("GRIPPY_MODEL_ID", cls.model_id),
    transport=os.environ.get("GRIPPY_TRANSPORT", cls.transport),
    base_url=os.environ.get("GRIPPY_BASE_URL", cls.base_url),
    api_key=os.environ.get("GRIPPY_API_KEY", cls.api_key),
Contributor

🔴 CRITICAL: Hardcoded secret detected in API key assignment

Confidence: 100%

A generic secret (likely API key) is assigned directly in code at line 37: 'api_key: str = "lm-studio"'. Even though this appears to be a placeholder or local default, committing API keys or secrets, even development ones, into tracked source is a critical security risk. If this value is ever used in production or cloud environments, it could be exfiltrated, leading to unauthorized access. This pattern is caught by secret-scanning tools and may cause repository blocks on some platforms.

Suggestion: Remove hardcoded secrets. Use environment variables or a secure configuration system and ensure 'api_key' is never checked into version control. If this is a fake/dev key, document that and enforce in code that it can never be used outside test/dev.

— Committed secrets, even fake ones, trip repository scanners instantly. Don't do this.
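One way to enforce the suggestion above, sketched with hypothetical names (the `GRIPPY_API_KEY` variable and `resolve_api_key` helper are illustrative, not the PR's actual code):

```python
import os

# Known placeholder values that must never reach production (assumed list).
_DEV_KEYS = {"lm-studio", "test-key"}

def resolve_api_key(env: str = "GRIPPY_API_KEY", allow_dev: bool = False) -> str:
    """Read the API key from the environment; refuse dev placeholders unless allowed."""
    key = os.environ.get(env, "")
    if not key:
        raise RuntimeError(f"{env} is not set; refusing to run without a key")
    if key in _DEV_KEYS and not allow_dev:
        raise RuntimeError(f"{env} holds a dev placeholder; set a real key")
    return key
```

Callers in test/dev CI would pass `allow_dev=True` explicitly, so a static default can never slip into a production path unnoticed.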

def fetch_all(config: BenchConfig | None = None) -> None:
    """Fetch diffs for all golden PRs."""
    config = config or BenchConfig()
    token = os.environ.get("GITHUB_TOKEN", "")
Contributor

🔴 CRITICAL: Potential secret detected in variable assignment

Confidence: 100%

An assignment 'token = ...' is present at line 71. This variable is used for GitHub API authentication. If this code includes a real token, or if a developer ever forgets to rely solely on environment variables, a valid credential could be exposed. Public or internal exposure of tokens grants attackers access privileges up to the token's scope.

Suggestion: Never assign tokens directly in code. Retrieve from secure sources (e.g., environment variables) and validate that no static strings or development tokens are ever committed.

— Even an empty or placeholder token field leads to accidental real-key commits down the road. Apply strict discipline.
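A minimal sketch of the environment-only discipline the comment asks for; the `github_token` helper name is hypothetical:

```python
import os

def github_token(required: bool = True) -> str:
    """Fetch GITHUB_TOKEN from the environment only; never from code or config files."""
    token = os.environ.get("GITHUB_TOKEN", "")
    if required and not token:
        raise RuntimeError("GITHUB_TOKEN is not set; export it before fetching diffs")
    return token
```

Making the missing-token case raise loudly (rather than defaulting to an empty string) prevents the "quietly add a literal token to unblock myself" failure mode the reviewer warns about.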

agent = create_reviewer(
    model_id=config.model_id,
    base_url=config.base_url,
    api_key=config.api_key,
Contributor

🔴 CRITICAL: Hardcoded API key assignment in code

Confidence: 100%

Line 82 of run_grippy.py (in creation of the reviewer agent) passes 'api_key' as a code argument. If this value is ever a real key (and not strictly environment-provided as intended), it will be embedded in the repository and deployment artifacts. This is a security risk, particularly for environments where audit or review keys have broad access.

Suggestion: Only pass API keys from secure configuration or environment, not by assignment or default argument. Add checks to prevent running with any static/test default key outside of regulated dev/test CI.

— Security reviewers panic when they see API key params assigned anywhere but config. Clean this up.

    transport=config.transport,
    model="text-embedding-qwen3-embedding-4b",
    base_url=config.base_url,
    api_key=config.api_key,
Contributor

🔴 CRITICAL: API key assignment for model access must not use static values

Confidence: 100%

At line 101, an API key is passed as a parameter in code that sets up an embedding model. Passing API keys or secrets as arguments that could be defaulted or hardcoded results in secrets being captured in config files, logs, or by accident. If a real key slips in here, a leak is inevitable.

Suggestion: Ensure API keys come exclusively from secure sources (e.g., environment variables) and are never set by positional or default argument in code. Consider explicit guardrails to detect if a default/fake key is used outside CI or dev-only execution.

— This is how real secrets sneak into repositories: misguided convenience for local testing. Resist the temptation.

agent = create_reviewer(
    model_id=config.model_id,
    base_url=config.base_url,
    api_key=config.api_key,
Contributor

🔴 CRITICAL: API key parameter in agent creation must avoid static assignment

Confidence: 100%

The code at line 205 assigns 'api_key' directly as a keyword argument when constructing the reviewer agent. Any static value (not sourced from the environment) is a possible accidental leak. Reviewing/testing harnesses often tempt developers to hardcode test creds, creating long-lived security debt.

Suggestion: Only ever source 'api_key' from OS environment or ephemeral config. Log a warning (or raise) if a default is used by accident in production or main.

— CI scripts set bad precedents. Allowing static keys here will eventually bite you.

    "title": "SQL injection risk",
    "description": "Query concatenates user input without parameterization.",
    "suggestion": "Use parameterized queries.",
    "evidence": "line 42: query = f'SELECT * FROM users WHERE id={user_id}'",
Contributor

🟠 HIGH: Test contains SQL injection risk in query template

Confidence: 85%

A unit test defines an 'evidence' string containing: 'query = f'SELECT * FROM users WHERE id={user_id}''. While this appears in test code and the danger is only theoretical here, this is an example of query construction via string formatting with unsanitized input, the classic SQL injection pattern. Even as a test artifact, this pattern can confuse code search tools that look for actual code risk, and could encourage mis-copying into production logic.

Suggestion: Clarify in the test (via comments or structure) that this is only test data, not real code. Ensure no test logic executes actual queries constructed this way, nor exports this pattern in code samples/examples.

— Even in a test data string, this pattern matches actual vuln signatures. Be explicit that this is inert.
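For contrast with the inert evidence string, the parameterized form the suggestion refers to looks like this (a self-contained sqlite3 sketch; table and data are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", (42, "ada"))

# Hostile input is bound as data, never interpolated into the SQL text.
user_id = "42 OR 1=1"
rows = conn.execute("SELECT * FROM users WHERE id = ?", (user_id,)).fetchall()
# The injection attempt matches no row; a genuine id of 42 still would.
```

The `?` placeholder keeps the query plan and the user value separate, so the string `42 OR 1=1` is compared as a value rather than parsed as SQL.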

Contributor

@github-actions github-actions bot left a comment

Grippy requests changes — FAIL (0/100)

@github-actions
Contributor

github-actions bot commented Mar 13, 2026

❌ Grippy Review — FAIL

Score: 75/100 | Findings: 6


Commit: e85d186

Implements dev-path harness for running Grippy against the
withmartian/code-review-benchmark 50-PR golden dataset. Five sequential steps:

1. fetch_diffs.py — GitHub API diff fetcher with disk caching
2. run_grippy.py — Grippy reviewer with frozen config, safe resume
3. extract.py — Candidate extraction (inline direct, general LLM)
4. judge.py — Martian judge prompt with greedy best-confidence matching
5. report.py — Metrics, per-repo breakdown, unified failure accounting

Key design decisions:
- Martian prompts vendored verbatim from pinned commit 012d682 with SHA-256 checksums
- BenchConfig frozen dataclass with full provenance stamping
- Benchmark comment formatter strips metadata to avoid biasing judge
- Anthropic imports are lazy (only at LLM call time)
- Golden comments vendored (50 PRs across 5 OSS repos)
- Unique failure deduplication across phases
- >10% failure rate warning blocks public claims

Design doc: docs/plans/2026-03-12-martian-benchmark-harness-design.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Coverage across all phases:
- config: frozen immutability, env override, stamp provenance, full SHA-256
- fetch: golden PR parsing, URL extraction (calcom/cal.com), diff caching
- run: inline detection, comment formatting, production parity
- extract: prompt verification, inline passthrough, general LLM extraction
- judge: greedy matching, metrics (perfect/partial/zero), custom judge fn
- report: unified failure accounting, unique dedup, >10% warning, clean path
- contracts: Grippy API surface, Anthropic SDK response shape, prompt checksums

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Claude Sonnet sometimes wraps JSON responses in ```json...``` code
fences. Both _llm_judge() and _llm_extract_default() now strip these
before parsing, matching the pattern in Martian's own implementation.

Also adds benchmarks/martian/output-*/ to .gitignore for smoke test
variant output directories (diff-only, indexed, etc).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
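The fence-stripping described in the commit above can be sketched as follows; the helper name `strip_json_fences` is illustrative, not necessarily what `_llm_judge()` and `_llm_extract_default()` call:

```python
import json
import re

# Matches an entire response wrapped in ``` or ```json fences.
_FENCE = re.compile(r"^```(?:json)?\s*\n(.*?)\n```\s*$", re.DOTALL)

def strip_json_fences(text: str) -> str:
    """Remove a surrounding ```json ... ``` fence before handing text to json.loads."""
    m = _FENCE.match(text.strip())
    return m.group(1) if m else text.strip()
```

Anchoring the pattern at both ends keeps legitimate fenced snippets inside the JSON payload untouched; only a fence wrapping the whole response is removed.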
…ints

LM Studio, Ollama, and vLLM cannot combine response_format (structured
output grammar) with tool-calling grammars in the same API request. Add
_LocalModel subclass that strips response_format when tools are present,
relying on system-prompt JSON instructions + retry layer validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t al)

Anthropic's API rejects requests where the compiled grammar from
GrippyReview's JSON schema exceeds the size limit. For providers where
structured_outputs=False (Anthropic, Google, Groq, Mistral), skip
output_schema — the prompt chain (output-schema.md) already provides
the full schema, and retry.py handles parsing + Pydantic validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the first LLM attempt fails validation (e.g. negative scores),
the retry message previously lost all PR diff context, causing the
model to hallucinate a generic review. Now retry messages include
the original PR context alongside the error feedback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
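The retry-context fix above amounts to rebuilding the retry prompt from both pieces; `build_retry_message` is a hypothetical name for the shape of that change:

```python
def build_retry_message(original_context: str, error: str) -> str:
    """Sketch: keep the PR diff context in the retry prompt alongside the
    validation error, so the model cannot drift into a generic review."""
    return (
        f"{original_context}\n\n"
        f"Your previous answer failed validation: {error}\n"
        "Re-emit the full review, fixing only the reported problem."
    )
```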
Mirrors the production review pipeline: builds CodebaseIndex + graph
context, falls back to lite toolkit (grep/read/list) when embedding
model is unavailable. Includes embedding model probe to avoid slow
retry loops on VRAM contention.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
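The probe-then-fallback behavior described above can be sketched like this; the function and return labels are illustrative, not the harness's actual API:

```python
def choose_toolkit(probe_embedding) -> str:
    """Sketch: probe the embedding endpoint once; on any failure fall back to
    the lite toolkit (grep/read/list) instead of retrying into VRAM contention."""
    try:
        probe_embedding("ping")  # one cheap call acts as a health check
        return "indexed"
    except Exception:
        return "lite"
```

Doing the probe once up front means a missing embedding model costs a single failed call rather than a slow retry loop inside every review.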
Contributor

@github-actions github-actions bot left a comment

Grippy requests changes — FAIL (25/100)

@github-actions github-actions bot dismissed their stale review March 13, 2026 18:48

Superseded by review of a8e1e37

detect-secrets flagged test-key strings in test_grippy_agent.py.
These are fake credentials for unit tests, not real secrets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@github-actions github-actions bot left a comment

Grippy requests changes — FAIL (75/100)

@github-actions github-actions bot dismissed their stale review March 13, 2026 18:59

Superseded by review of e85d186
