feat: Martian benchmark harness — 5-phase evaluation pipeline #51
Nelson Spence (Fieldnote-Echo) wants to merge 8 commits into main from
Conversation
model_id=os.environ.get("GRIPPY_MODEL_ID", cls.model_id),
transport=os.environ.get("GRIPPY_TRANSPORT", cls.transport),
base_url=os.environ.get("GRIPPY_BASE_URL", cls.base_url),
api_key=os.environ.get("GRIPPY_API_KEY", cls.api_key),
🔴 CRITICAL: Hardcoded secret detected in API key assignment
Confidence: 100%
A generic secret (likely an API key) is assigned directly in code at line 37: 'api_key: str = "lm-studio"'. Even though this appears to be a placeholder or local default, committing API keys or secrets, even development ones, into tracked source is a critical security risk. If this value is ever used in production or cloud environments, it could be exfiltrated, leading to unauthorized access. This pattern is caught by secret-scanning tools and may cause repository blocks on some platforms.
Suggestion: Remove hardcoded secrets. Use environment variables or a secure configuration system, and ensure 'api_key' is never checked into version control. If this is a fake/dev key, document that and enforce in code that it can never be used outside test/dev.
— Committed secrets, even fake ones, trip repository scanners instantly. Don't do this.
def fetch_all(config: BenchConfig | None = None) -> None:
    """Fetch diffs for all golden PRs."""
    config = config or BenchConfig()
    token = os.environ.get("GITHUB_TOKEN", "")
🔴 CRITICAL: Potential secret detected in variable assignment
Confidence: 100%
An assignment 'token = ...' is present at line 71. This variable is used for GitHub API authentication. If this code ever includes a real token, or if a developer forgets to rely solely on environment variables, a valid credential could be exposed. Public or internal exposure of tokens grants attackers access privileges up to the token's scope.
Suggestion: Never assign tokens directly in code. Retrieve from secure sources (e.g., environment variables) and validate that no static strings or development tokens are ever committed.
— Even an empty or placeholder token field leads to accidental real-key commits down the road. Apply strict discipline.
agent = create_reviewer(
    model_id=config.model_id,
    base_url=config.base_url,
    api_key=config.api_key,
🔴 CRITICAL: Hardcoded API key assignment in code
Confidence: 100%
Line 82 of run_grippy.py (in creation of the reviewer agent) passes 'api_key' as a code argument. If this value is ever a real key (and not strictly environment-provided as intended), it will be embedded in the repository and deployment artifacts. This is a security risk, particularly for environments where audit or review keys have broad access.
Suggestion: Only pass API keys from secure configuration or the environment, not by assignment or default argument. Add checks to prevent running with any static/test default key outside of regulated dev/test CI.
— Security reviewers panic when they see API key params assigned anywhere but config. Clean this up.
transport=config.transport,
model="text-embedding-qwen3-embedding-4b",
base_url=config.base_url,
api_key=config.api_key,
🔴 CRITICAL: API key assignment for model access must not use static values
Confidence: 100%
At line 101, an API key is passed as a parameter in code that sets up an embedding model. Passing API keys or secrets as arguments that can be defaulted or hardcoded results in secrets being captured in config files, in logs, or by accident. If a real key slips in here, a leak is inevitable.
Suggestion: Ensure API keys come exclusively from secure sources (e.g., environment variables) and are never set by positional or default argument in code. Consider explicit guardrails to detect if a default/fake key is used outside CI or dev-only execution.
— This is how real secrets sneak into repositories: misguided convenience for local testing. Resist the temptation.
agent = create_reviewer(
    model_id=config.model_id,
    base_url=config.base_url,
    api_key=config.api_key,
🔴 CRITICAL: API key parameter in agent creation must avoid static assignment
Confidence: 100%
The code at line 205 assigns 'api_key' directly as a keyword argument when constructing the reviewer agent. Any static value (not sourced from the environment) is a possible accidental leak. Reviewing/testing harnesses often tempt developers to hardcode test creds, creating long-lived security debt.
Suggestion: Only ever source 'api_key' from OS environment or ephemeral config. Log a warning (or raise) if a default is used by accident in production or main.
— CI scripts set bad precedents. Allowing static keys here will eventually bite you.
| "title": "SQL injection risk", | ||
| "description": "Query concatenates user input without parameterization.", | ||
| "suggestion": "Use parameterized queries.", | ||
| "evidence": "line 42: query = f'SELECT * FROM users WHERE id={user_id}'", |
🟠 HIGH: Test contains SQL injection risk in query template
Confidence: 85%
A unit test defines an 'evidence' string containing: 'query = f'SELECT * FROM users WHERE id={user_id}''. While this appears in test code and the danger is only theoretical here, it is an example of query construction via string formatting with unsanitized input, a classic SQL injection pattern. Even as a test artifact, this pattern can confuse code-search tools that look for actual code risk, and could encourage mis-copying into production logic.
Suggestion: Clarify in the test (via comments or structure) that this is only test data, not real code. Ensure no test logic executes actual queries constructed this way, nor exports this pattern in code samples/examples.
— Even in a test data string, this pattern matches actual vuln signatures. Be explicit that this is inert.
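For contrast with the flagged f-string pattern, here is the parameterized form the golden comment's "suggestion" field recommends, sketched against an in-memory SQLite database (table, column, and data are illustrative; the original code's actual query context is unknown):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

hostile = "1 OR 1=1"  # attacker-controlled input

# The flagged pattern would interpolate hostile directly:
#   f"SELECT * FROM users WHERE id={hostile}"  -> returns every row.
# Parameter binding instead treats the input as an inert value:
safe_rows = conn.execute("SELECT * FROM users WHERE id = ?", (hostile,)).fetchall()
good_rows = conn.execute("SELECT * FROM users WHERE id = ?", (1,)).fetchall()
```

The bound hostile string matches no integer id, so `safe_rows` comes back empty while a legitimate lookup still works.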
❌ Grippy Review — FAIL | Score: 75/100 | Findings: 6 | Commit: e85d186
Implements dev-path harness for running Grippy against the withmartian/code-review-benchmark 50-PR golden dataset. Five sequential steps:

1. fetch_diffs.py — GitHub API diff fetcher with disk caching
2. run_grippy.py — Grippy reviewer with frozen config, safe resume
3. extract.py — Candidate extraction (inline direct, general LLM)
4. judge.py — Martian judge prompt with greedy best-confidence matching
5. report.py — Metrics, per-repo breakdown, unified failure accounting

Key design decisions:

- Martian prompts vendored verbatim from pinned commit 012d682 with SHA-256 checksums
- BenchConfig frozen dataclass with full provenance stamping
- Benchmark comment formatter strips metadata to avoid biasing judge
- Anthropic imports are lazy (only at LLM call time)
- Golden comments vendored (50 PRs across 5 OSS repos)
- Unique failure deduplication across phases
- >10% failure rate warning blocks public claims

Design doc: docs/plans/2026-03-12-martian-benchmark-harness-design.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
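The five steps above run strictly in sequence. A minimal sketch of that ordering, with phase names taken from the filenames listed in the description; the runner function and its registry argument are hypothetical, not the harness's real entry points:

```python
# Phase names mirror the five scripts listed in the PR description.
PHASES = ["fetch_diffs", "run_grippy", "extract", "judge", "report"]

def run_pipeline(phases=PHASES, registry=None):
    """Run each phase in order, collecting a result per phase.
    `registry` maps a phase name to a callable; unmapped phases no-op."""
    results = {}
    for name in phases:
        step = (registry or {}).get(name, lambda: "ok")
        results[name] = step()
    return results
```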
Coverage across all phases:

- config: frozen immutability, env override, stamp provenance, full SHA-256
- fetch: golden PR parsing, URL extraction (calcom/cal.com), diff caching
- run: inline detection, comment formatting, production parity
- extract: prompt verification, inline passthrough, general LLM extraction
- judge: greedy matching, metrics (perfect/partial/zero), custom judge fn
- report: unified failure accounting, unique dedup, >10% warning, clean path
- contracts: Grippy API surface, Anthropic SDK response shape, prompt checksums

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Sonnet sometimes wraps JSON responses in ```json ... ``` code fences. Both _llm_judge() and _llm_extract_default() now strip these before parsing, matching the pattern in Martian's own implementation.

Also adds benchmarks/martian/output-*/ to .gitignore for smoke-test variant output directories (diff-only, indexed, etc.).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
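The fence stripping described above can be sketched with a single regex pass before `json.loads`. This is an illustrative standalone helper, not the actual body of `_llm_judge()` or `_llm_extract_default()`:

```python
import re

def strip_json_fences(text: str) -> str:
    """Remove a leading "json"-tagged (or bare) code fence and its
    trailing fence, leaving the raw JSON payload for parsing.
    Non-fenced input passes through unchanged."""
    text = text.strip()
    match = re.match(r"^```(?:json)?\s*\n(.*)\n```$", text, re.DOTALL)
    return match.group(1) if match else text
```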
…ints

LM Studio, Ollama, and vLLM cannot combine response_format (structured-output grammar) with tool-calling grammars in the same API request. Add _LocalModel subclass that strips response_format when tools are present, relying on system-prompt JSON instructions plus retry-layer validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
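The essence of the workaround is a request transform applied before dispatch. A sketch assuming an OpenAI-style request dict; the real fix is a `_LocalModel` subclass rather than this free function, whose name is hypothetical:

```python
def prepare_local_request(request: dict) -> dict:
    """Drop response_format when tools are present, since LM Studio,
    Ollama, and vLLM reject requests combining both grammars.
    Returns a copy; the caller's dict is left untouched."""
    request = dict(request)
    if request.get("tools") and "response_format" in request:
        request.pop("response_format")
    return request
```

With no tools in the request, structured output is left enabled, so only the conflicting combination is affected.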
…t al)

Anthropic's API rejects requests where the compiled grammar from GrippyReview's JSON schema exceeds the size limit. For providers where structured_outputs=False (Anthropic, Google, Groq, Mistral), skip output_schema — the prompt chain (output-schema.md) already provides the full schema, and retry.py handles parsing + Pydantic validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
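The provider gate can be sketched as a capability check when assembling agent kwargs. The provider set comes from the commit message above; the function and its return shape are assumptions about the harness's internals:

```python
# Providers whose compiled schema grammar would exceed the size limit.
NO_STRUCTURED_OUTPUT = {"anthropic", "google", "groq", "mistral"}

def agent_kwargs(provider: str, schema: dict) -> dict:
    """Omit output_schema for providers with structured_outputs=False;
    the prompt chain carries the schema and the retry layer validates."""
    kwargs = {}
    if provider.lower() not in NO_STRUCTURED_OUTPUT:
        kwargs["output_schema"] = schema
    return kwargs
```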
When the first LLM attempt fails validation (e.g. negative scores), the retry message previously lost all PR diff context, causing the model to hallucinate a generic review. Now retry messages include the original PR context alongside the error feedback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
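The retained-context retry message described above might look like the following sketch; the helper name and exact wording are illustrative, not the repo's actual prompt template:

```python
def build_retry_message(pr_context: str, error: str) -> str:
    """Compose a retry prompt that keeps the original PR diff context
    alongside the validation error, so the model cannot drift into a
    generic, context-free review."""
    return (
        f"{pr_context}\n\n"
        f"Your previous response failed validation: {error}\n"
        "Produce a corrected review for the diff above."
    )
```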
Mirrors the production review pipeline: builds CodebaseIndex + graph context, falls back to lite toolkit (grep/read/list) when embedding model is unavailable. Includes embedding model probe to avoid slow retry loops on VRAM contention. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
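The probe-then-fallback shape described above reduces to: try the embedding model once up front, and build the lite toolkit on failure instead of looping through slow retries. A minimal sketch with hypothetical callables standing in for the real probe and toolkit factories:

```python
def build_toolkit(probe_embedding, full_factory, lite_factory):
    """Probe the embedding model once; on any failure fall back to the
    lite (grep/read/list) toolkit rather than retrying under VRAM
    contention. Otherwise build the full indexed toolkit."""
    try:
        probe_embedding()
    except Exception:
        return lite_factory()
    return full_factory()
```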
Force-pushed from bdedb74 to a8e1e37
Superseded by review of a8e1e37
detect-secrets flagged test-key strings in test_grippy_agent.py. These are fake credentials for unit tests, not real secrets. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Superseded by review of e85d186
Summary
Also includes 3 bug fixes (being cherry-picked to main separately):
Test plan
`uv run pytest tests/test_bench*.py -v` passes

🤖 Generated with Claude Code