feat: output policy v2 — review-policy rewrite #77
Nelson Spence (Fieldnote-Echo) wants to merge 10 commits into main from
Conversation
…evidence posting gate

Three-set model: suppressed (narration/below-threshold/empty-evidence), scoring set (inline + summary-only), display set (capped for posting).
- Narration detection: 11 phrase patterns + empty/dismissive suggestion check
- Confidence threshold: severity-specific minimums with category penalty (governance/observability -15)
- Evidence grounding: empty evidence suppresses; non-empty but ungrounded moves to summary-only; grounded stays inline-eligible
- Delegates to parse_diff from rules/context.py — no duplicated diff parsing
- Score recomputation from scoring set with per-category caps
- Verdict derivation from recomputed score + mode (call-site authority)
- Display caps: 5/file, 20/review, 3 LOW — caps don't affect score

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
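A minimal sketch of the three-set partition this commit describes. The phrase list, per-severity minimums, and penalty values below are assumed placeholders (the real module has 11 narration patterns), and `Finding`/`partition` are illustrative names, not the actual output_policy.py API:

```python
from dataclasses import dataclass

# Assumed stand-ins for the real pattern list and thresholds.
NARRATION_PHRASES = ("overall the code", "no major issues", "consider reviewing")
MIN_CONFIDENCE = {"HIGH": 60, "MEDIUM": 70, "LOW": 80}
CATEGORY_PENALTY = {"governance": 15, "observability": 15}

@dataclass
class Finding:
    severity: str
    category: str
    confidence: int
    evidence: str
    title: str = ""
    suggestion: str = ""

def partition(findings, diff_text):
    """Split findings into suppressed / scoring / inline-eligible sets."""
    suppressed, summary_only, inline = [], [], []
    for f in findings:
        if any(p in f.title.lower() for p in NARRATION_PHRASES):
            suppressed.append(f)        # narration-as-finding: dropped
            continue
        effective = f.confidence - CATEGORY_PENALTY.get(f.category, 0)
        if effective < MIN_CONFIDENCE.get(f.severity, 80):
            suppressed.append(f)        # below severity threshold
            continue
        if not f.evidence.strip():
            suppressed.append(f)        # empty evidence: dropped
            continue
        if f.evidence not in diff_text:
            summary_only.append(f)      # ungrounded: scored, never inline
        else:
            inline.append(f)            # grounded: inline-eligible
    scoring = summary_only + inline     # score is computed from this set
    return suppressed, scoring, inline
```

The key invariant is that the scoring set (summary-only plus inline) is what feeds score recomputation, while display caps act only on the inline set afterwards.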
GrippyReview gains summary_only_findings (scored but not inline-eligible).

ReviewMeta gains: score_before_policy, verdict_before_policy, policy_bypassed, policy_bypass_reason, narration_suppressed_count, threshold_suppressed_count, display_capped_count.

All new fields have defaults — backward compatible with existing reviews.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
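A hedged sketch of what the new ReviewMeta fields might look like; the field names come from the commit message, but the types and default values are assumptions. All-default fields are what keep previously serialized reviews loadable without migration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReviewMeta:
    # Pre-policy snapshot (types assumed; None until the filter runs)
    score_before_policy: Optional[int] = None
    verdict_before_policy: Optional[str] = None
    # Fail-open bookkeeping
    policy_bypassed: bool = False
    policy_bypass_reason: Optional[str] = None
    # Suppression telemetry
    narration_suppressed_count: int = 0
    threshold_suppressed_count: int = 0
    display_capped_count: int = 0
```

Because every new field has a default, `ReviewMeta()` constructed from an old review's data still succeeds.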
retry.py: add mode/diff params to run_review(), call filter_review() after successful parse. Single fail-open owner — catches exceptions, sets policy_bypassed=True + reason, returns unfiltered review.
review.py: pass mode and diff through to run_review(), pass summary_only_findings + policy meta to post_review().
mcp_server.py: pass mode and diff to run_review() in audit_diff path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
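The single fail-open owner in retry.py could be sketched like this; `apply_output_policy` is a hypothetical helper name, and the `review.meta` attribute shape is assumed:

```python
import logging

log = logging.getLogger("grippy.retry")

def apply_output_policy(review, mode, diff, filter_review):
    """Run the policy filter; on any error, bypass it and return unfiltered.

    filter_review raises normally -- this wrapper is the only place that
    catches, so fail-open behavior has exactly one owner.
    """
    try:
        return filter_review(review, mode=mode, diff=diff)
    except Exception as exc:  # fail open: a policy bug must not kill the review
        log.exception("output policy failed; returning unfiltered review")
        review.meta.policy_bypassed = True
        review.meta.policy_bypass_reason = f"{type(exc).__name__}: {exc}"
        return review
```

Centralizing the try/except here means review.py and mcp_server.py never need their own error handling for the filter.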
github_review.py:
- build_review_comment() renders evidence as fenced code block
- format_summary_comment() gains summary_only_findings, policy_bypassed, display_capped_count params
- Summary-only findings rendered in collapsible details section
- Policy-bypassed warning annotation when filter_review() failed
- Display-capped count annotation when findings omitted for brevity
- post_review() passes new params through to summary

mcp_response.py:
- serialize_audit() includes summary_only_findings in output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New: tests/test_grippy_output_policy.py — 34 tests covering:
- Narration suppression (3), confidence threshold (5), evidence grounding (4)
- Three-set invariant (2), score recomputation (3), verdict derivation (4)
- Display caps (2), summary-only persistence (2), telemetry (5)
- Fail-open integration (1), diff helper (2), end-to-end (1)

Updated: test_grippy_retry.py — score assertions updated for recomputed scores (output policy recomputes 0-finding reviews to 100).
Updated: test_grippy_mcp_response.py — added summary_only_findings to expected serialize_audit() keys.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report ❌ Patch coverage is
tests/test_grippy_output_policy.py
Outdated
"title": "SQL injection risk",
"description": "User input interpolated into query.",
"suggestion": "Use parameterized queries.",
"evidence": 'f"SELECT * FROM {user_input}"',
🟠 HIGH: SQL injection risk: query built from interpolated input
Confidence: 95%
"evidence": 'f"SELECT * FROM {user_input}"',
Test case demonstrates unsafe construction of a SQL query using direct string interpolation with user-controlled input. If this pattern appears in production code, it would permit malicious inputs to execute arbitrary SQL. Even in tests, examples with vulnerable patterns can be adapted or misunderstood elsewhere.
Suggestion: Change the sample to use parameterized query placeholders (e.g., cursor.execute('SELECT ... WHERE q=?', (user_input,))) and clarify in test descriptions that string interpolation is unsafe.
— Test code should not illustrate unsafe query patterns, even as fixtures.
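The suggested fix can be illustrated concretely; this is a minimal, self-contained sqlite3 example of placeholder binding (the `docs` table and column are made up for the demonstration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (q TEXT)")
conn.execute("INSERT INTO docs VALUES (?)", ("hello",))

# A classic injection payload:
user_input = "hello'; DROP TABLE docs; --"

# Placeholder binding treats the whole string as data, never as SQL,
# so the payload matches nothing and executes nothing:
rows = conn.execute("SELECT * FROM docs WHERE q = ?", (user_input,)).fetchall()
```

With interpolation the same payload would alter the statement; with `?` binding the table survives and the query simply returns no rows.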
tests/test_grippy_output_policy.py
Outdated
@@ -10,3 +10,5 @@ def handle_request():
  user_input = request.args.get("q")
- old_code()
+ query = f"SELECT * FROM {user_input}"
🟠 HIGH: SQL injection risk: query built from interpolated input
Confidence: 95%
+ query = f"SELECT * FROM {user_input}"
Within the test diff, a query is constructed using f-string interpolation with user input (query = f"SELECT * FROM {user_input}"). Though present in test scaffolding, this propagates a dangerous anti-pattern. Developers may copy such patterns, leading to vulnerabilities in real code.
Suggestion: Update the test example to use a safe, parameterized pattern. Add a code comment noting the SQL injection risk of this construction.
— Unsafe SQL interpolation shouldn't appear unqualified, whether in real or test code.
tests/test_grippy_output_policy.py
Outdated
def test_grounded_evidence_inline(self) -> None:
    """10. Grounded evidence -> inline-eligible, no penalty."""
    f = _f(evidence='query = f"SELECT * FROM {user_input}"')
🟠 HIGH: SQL injection risk: query built from interpolated input
Confidence: 95%
f = _f(evidence='query = f"SELECT * FROM {user_input}"')
A test fixture demonstrates SQL query construction via variable interpolation, mirroring real-risk production vulnerabilities. This reinforces unsafe practices when tests are used as reference.
Suggestion: Where fixture patterns must illustrate risky code, add explicit comments (e.g., '# SQL injection - do NOT use in production'). Otherwise, prefer parameterized examples.
— Unsafe query patterns in test helpers risk normalization of dangerous code.
tests/test_grippy_output_policy.py
Outdated
def test_display_caps_dont_inflate_score(self) -> None:
    """13. 8 LOWs -> all 8 in scoring set -> display caps to 3 -> score unchanged."""
    findings = [
        _f(id=f"F-{i:03d}", severity="LOW", confidence=70, evidence="query user_input SELECT")
🟠 HIGH: SQL injection risk: query built from interpolated input
Confidence: 95%
_f(id=f"F-{i:03d}", severity="LOW", confidence=70, evidence="query user_input SELECT")
A bulk test creates several LOW-severity findings, all using the same interpolated query pattern that would be considered a SQL injection in actual application code. Tests modeling or instantiating these injectable patterns can unintentionally introduce vulnerabilities.
Suggestion: Ensure all test fixtures recommend or model parameterized SQL as the norm. If an unsafe pattern is needed for a negative test, clearly contaminate it with warnings.
— Test datasets shouldn't seed unsafe SQL building blocks, even for 'fake' findings.
tests/test_grippy_output_policy.py
Outdated
def test_summary_only_scored_not_posted(self) -> None:
    """14. Summary-only findings contribute to score but not inline posting."""
    grounded = _f(id="F-001", evidence='query = f"SELECT * FROM {user_input}"')
🟠 HIGH: SQL injection risk: query built from interpolated input
Confidence: 95%
grounded = _f(id="F-001", evidence='query = f"SELECT * FROM {user_input}"')
Direct unparameterized query construction with user input in a test case sets a poor example and may promote misuse in copied code.
Suggestion: Use parameterized SQL in all examples; explicitly flag intentionally unsafe lines as anti-patterns.
— If you must show the risky pattern, make sure to flag it with a prominent warning.
✅ Grippy Review — PASS | Score: 100/100 | Findings: 0 | Commit: e4c936d
Tests for the three new format_summary_comment params:
- policy_bypassed warning annotation
- display_capped_count omission annotation
- summary_only_findings collapsible section

Tests for build_review_comment evidence rendering:
- Non-empty evidence rendered as fenced code block
- Whitespace-only evidence produces no code block

Fixes quality-gate: github_review.py patch coverage was 25% (21 uncovered lines in new rendering paths), now covered by unit tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Suppresses false-positive rule engine findings on test fixture strings containing SQL injection patterns (evidence examples, diff fixtures). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
project-navi-bot
left a comment
All required CI checks passed. Auto-approved by navi-bot.
tests/test_grippy_output_policy.py
Outdated
id=f"F-{i:03d}",
severity="MEDIUM",
confidence=85,
evidence='query = f"SELECT * FROM {user_input}"',
🟠 HIGH: String-interpolated SQL evident in output policy test fixture
Confidence: 95%
evidence='query = f"SELECT * FROM {user_input}"',
In the test on line 329, an interpolated query string is set in a test Finding's evidence. Again, risky in application code; here, meant for policy filter coverage.
Suggestion: Reinforce in test or code comments that interpolated SQL is present intentionally and isn't proper for production.
— Test pattern triggers the correct policy filter; just ensure this NEVER migrates to real code paths.
tests/test_grippy_output_policy.py
Outdated
id=f"F-{i:03d}",
severity="LOW",
confidence=70,
evidence='query = f"SELECT * FROM {user_input}"',
🟠 HIGH: Unsafe SQL query interpolation in LOW finding example
Confidence: 95%
evidence='query = f"SELECT * FROM {user_input}"',
At line 347, a test generates several LOW severity Finding examples, but the evidence string represents unsafe SQL query composition with interpolated input. Even when used for test policy validation, this pattern is a real-world vulnerability.
Suggestion: Do not use or promote interpolated SQL in example code outside of clearly scoped, documented negative test cases.
— Exposure via test data, but important to show the anti-pattern is caught. Always document intent.
Superseded by review of 6f51356
project-navi-bot
left a comment
All required CI checks passed. Auto-approved by navi-bot.
Two fixes for the noise problem:

1. `# nogrip` now suppresses LLM findings (not just rule engine). The output policy parses nogrip pragmas from the diff and suppresses findings whose line range includes a nogrip line. Bare `# nogrip` suppresses all findings; targeted `# nogrip: SEC-001` only suppresses the matching rule_id.
2. Stale thread resolution no longer requires `isOutdated` from GitHub. Any absent (no longer in current findings), unresolved thread is resolved. This fixes persistent stale comments that never got cleaned up because the line didn't change between pushes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
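The pragma behavior described in fix 1 can be sketched roughly as follows; the regex, function names, and the line-indexed diff shape are assumptions, not the real output-policy implementation:

```python
import re

# Bare "# nogrip" suppresses everything on that line's range;
# "# nogrip: SEC-001" suppresses only the named rule.
NOGRIP_RE = re.compile(r"#\s*nogrip(?::\s*([A-Z0-9_-]+))?")

def nogrip_lines(added_lines):
    """Map line number -> suppressed rule_id ('*' means all rules)."""
    pragmas = {}
    for lineno, text in added_lines:
        m = NOGRIP_RE.search(text)
        if m:
            pragmas[lineno] = m.group(1) or "*"
    return pragmas

def is_suppressed(finding_start, finding_end, rule_id, pragmas):
    """True if the finding's line range covers a matching nogrip pragma."""
    for lineno, target in pragmas.items():
        if finding_start <= lineno <= finding_end and target in ("*", rule_id):
            return True
    return False
```

Applied to LLM findings as well as rule-engine findings, this is what lets a fixture line opt out of both detectors at once.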
project-navi-bot
left a comment
All required CI checks passed. Auto-approved by navi-bot.
    "> \u26a0\ufe0f **Output policy was bypassed due to an internal error."
    " Findings are unfiltered.**"
)
lines.append("")
🟡 MEDIUM: Malformed summary-only findings rendering code style
Confidence: 90%
#### {sev_emoji} {f.severity.value}: {f_title}
description, suggestion only -- omits confidence/grippy_note
The rendering logic for 'summary_only_findings' in the summary comment diverges from the format used for normal findings and omits key fields (e.g. confidence or grippy_note). This may cause confusion/inconsistency for consumers or break downstream automation expecting uniform markdown structure.
Suggestion: Make the rendering structure for summary-only findings match the normal finding comment (including confidence % and grippy_note). Consider extracting a shared formatting helper.
— Consistency in report formatting helps prevent confusion and bugs.
project-navi-bot
left a comment
All required CI checks passed. Auto-approved by navi-bot.
…ntity keys

- Evidence thin-token fail-open → fail-closed: <2 meaningful tokens routes to summary-only, not inline
- Split telemetry: threshold_suppressed_count → confidence_suppressed_count + evidence_suppressed_count; nogrip excluded from quality-judgment total
- Thread resolution: summary-only findings protect their threads from auto-resolve
- Identity key hash: markers now include sha256(title:severity:category)[:8] discriminator with conservative backward compat for old markers
- Dashboard count split: shows "N total (X inline, Y summary-only)"
- Nogrip acceptance test: proves PR #77 SQL fixture scenario → PASS

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
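The identity-key discriminator named in the commit (sha256 over title:severity:category, truncated to 8 hex chars) is small enough to sketch directly; the function name and the surrounding marker format are assumptions:

```python
import hashlib

def identity_key(title: str, severity: str, category: str) -> str:
    """8-hex-char discriminator embedded in a comment marker.

    Two findings that share title, severity, and category collapse to one
    thread; changing any of the three yields a different key.
    """
    raw = f"{title}:{severity}:{category}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:8]
```

Truncating to 8 characters keeps markers short while leaving collisions vanishingly unlikely at per-PR finding counts.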
project-navi-bot
left a comment
All required CI checks passed. Auto-approved by navi-bot.
tests/test_grippy_output_policy.py
Outdated
confidence=95,
line_start=11,
line_end=11,
evidence='query = f"SELECT * FROM {user_input}"',
🟠 HIGH: SQL injection risk in test: Interpolated user input in query
Confidence: 95%
query = f"SELECT * FROM {user_input}"
A test fixture on line 661 constructs a SQL query using direct string interpolation with user input (query = f"SELECT * FROM {user_input}"). Even though this occurs in a test context, it reinforces a dangerous pattern where user input is not parameterized, which can lead to SQL injection vulnerabilities in production code if copied or misunderstood by future maintainers.
Suggestion: Even within tests, use parameterized queries (e.g., cursor.execute(sql, params)) to demonstrate and reinforce safe patterns. Consider rewriting the fixture code to use parameter binding instead of direct interpolation.
— This would be critical in production. In tests, it's only HIGH, but the pattern is risky.
project-navi-bot
left a comment
All required CI checks passed. Auto-approved by navi-bot.
…idence

- fetch_thread_states: gh -F doesn't parse JSON arrays — switch to --input stdin with full query+variables payload
- Add # nogrip to SQL evidence string in nogrip acceptance test to prevent rule engine false positive (the exact PR #77 failure mode)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
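The stdin workaround amounts to serializing the full query+variables document yourself rather than relying on `gh api -F` field typing; the GraphQL query text and variable names here are illustrative, not the real fetch_thread_states payload:

```python
import json

# gh's -F flag coerces scalars but cannot express a JSON array variable,
# so we build the whole request body and pipe it via `--input -`.
payload = {
    "query": "query($ids: [ID!]!) { nodes(ids: $ids) { id } }",
    "variables": {"ids": ["THREAD_1", "THREAD_2"]},  # array survives intact
}
cmd = ["gh", "api", "graphql", "--input", "-"]  # "-" reads the body from stdin
body = json.dumps(payload)
# e.g. subprocess.run(cmd, input=body.encode(), capture_output=True)
```

Because the body is authored as one JSON document, array- and object-valued variables round-trip without any CLI type coercion.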
Summary
- output_policy.py: narration suppression, confidence gate, evidence posting gate, score recomputation, verdict derivation, display caps — three-set model (suppressed / scoring / display)
- summary_only_findings on GrippyReview, 7 telemetry fields on ReviewMeta (all backward-compatible defaults)
- retry.py owns fail-open; review.py + mcp_server.py pass mode/diff through
- mcp_response.py includes summary_only_findings

Why
Dogfood QA across PRs #71-#76 found 83% noise rate: 10/12 posted findings were narration-as-finding, scores were decorative (LLM-generated, not derived from findings), and 9 real issues were missed. Root cause:
confidence-filter.md defines a comprehensive post-generation filter, but none of it was implemented in code.

Three-set model
Score and verdict are computed from the scoring set (uncapped). Display caps only control what gets posted inline.
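As a sketch of that recomputation under assumed numbers (the PR mentions per-category caps but not the actual weights, so both values below are placeholders):

```python
SEVERITY_WEIGHT = {"HIGH": 15, "MEDIUM": 7, "LOW": 3}  # assumed deductions
CATEGORY_CAP = 30  # assumed: one noisy category can't sink the whole score

def recompute_score(scoring_set):
    """Score from the full scoring set, uncapped by display limits."""
    deductions = {}
    for f in scoring_set:
        weight = SEVERITY_WEIGHT.get(f.severity, 3)
        deductions[f.category] = deductions.get(f.category, 0) + weight
    total = sum(min(d, CATEGORY_CAP) for d in deductions.values())
    return max(0, 100 - total)
```

An empty scoring set recomputes to 100, which matches the 0-finding behavior noted in the test-update commit, and display caps never enter this function at all.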
Key design decisions
- mode is verdict authority (not review.audit_type — that's LLM output)
- parse_diff from rules/context.py reused — zero duplicated diff parsing
- Single fail-open owner: retry.py — filter_review() raises normally

Test plan
- test_grippy_output_policy.py (97% coverage on new module)

P1 (separate PR): line-range evidence grounding + trust boundary auto-escalation.
🤖 Generated with Claude Code