feat: output policy v2 — review-policy rewrite #77
Nelson Spence (Fieldnote-Echo) wants to merge 10 commits into main from
Conversation
…evidence posting gate

Three-set model: suppressed (narration/below-threshold/empty-evidence), scoring set (inline + summary-only), display set (capped for posting).
- Narration detection: 11 phrase patterns + empty/dismissive suggestion check
- Confidence threshold: severity-specific minimums with category penalty (governance/observability -15)
- Evidence grounding: empty evidence suppresses; non-empty but ungrounded moves to summary-only; grounded stays inline-eligible
- Delegates to parse_diff from rules/context.py — no duplicated diff parsing
- Score recomputation from scoring set with per-category caps
- Verdict derivation from recomputed score + mode (call-site authority)
- Display caps: 5/file, 20/review, 3 LOW — caps don't affect score

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
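A minimal sketch of the three-set partition this commit describes. The phrase list, per-severity minimums, and penalty values below are assumed placeholders (the real module has 11 narration patterns), and `Finding`/`partition` are illustrative names, not the actual output_policy.py API:

```python
from dataclasses import dataclass

# Assumed stand-ins for the real pattern list and thresholds.
NARRATION_PHRASES = ("overall the code", "no major issues", "consider reviewing")
MIN_CONFIDENCE = {"HIGH": 60, "MEDIUM": 70, "LOW": 80}
CATEGORY_PENALTY = {"governance": 15, "observability": 15}

@dataclass
class Finding:
    severity: str
    category: str
    confidence: int
    evidence: str
    title: str = ""
    suggestion: str = ""

def partition(findings, diff_text):
    """Split findings into suppressed / scoring / inline-eligible sets."""
    suppressed, summary_only, inline = [], [], []
    for f in findings:
        if any(p in f.title.lower() for p in NARRATION_PHRASES):
            suppressed.append(f)        # narration-as-finding: dropped
            continue
        effective = f.confidence - CATEGORY_PENALTY.get(f.category, 0)
        if effective < MIN_CONFIDENCE.get(f.severity, 80):
            suppressed.append(f)        # below severity threshold
            continue
        if not f.evidence.strip():
            suppressed.append(f)        # empty evidence: dropped
            continue
        if f.evidence not in diff_text:
            summary_only.append(f)      # ungrounded: scored, never inline
        else:
            inline.append(f)            # grounded: inline-eligible
    scoring = summary_only + inline     # score is computed from this set
    return suppressed, scoring, inline
```

The key invariant is that the scoring set (summary-only plus inline) is what feeds score recomputation, while display caps act only on the inline set afterwards.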
GrippyReview gains summary_only_findings (scored but not inline-eligible).

ReviewMeta gains: score_before_policy, verdict_before_policy, policy_bypassed, policy_bypass_reason, narration_suppressed_count, threshold_suppressed_count, display_capped_count.

All new fields have defaults — backward compatible with existing reviews.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
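A hedged sketch of what the new ReviewMeta fields might look like; the field names come from the commit message, but the types and default values are assumptions. All-default fields are what keep previously serialized reviews loadable without migration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReviewMeta:
    # Pre-policy snapshot (types assumed; None until the filter runs)
    score_before_policy: Optional[int] = None
    verdict_before_policy: Optional[str] = None
    # Fail-open bookkeeping
    policy_bypassed: bool = False
    policy_bypass_reason: Optional[str] = None
    # Suppression telemetry
    narration_suppressed_count: int = 0
    threshold_suppressed_count: int = 0
    display_capped_count: int = 0
```

Because every new field has a default, `ReviewMeta()` constructed from an old review's data still succeeds.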
retry.py: add mode/diff params to run_review(), call filter_review() after successful parse. Single fail-open owner — catches exceptions, sets policy_bypassed=True + reason, returns unfiltered review.
review.py: pass mode and diff through to run_review(), pass summary_only_findings + policy meta to post_review().
mcp_server.py: pass mode and diff to run_review() in audit_diff path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
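The single fail-open owner in retry.py could be sketched like this; `apply_output_policy` is a hypothetical helper name, and the `review.meta` attribute shape is assumed:

```python
import logging

log = logging.getLogger("grippy.retry")

def apply_output_policy(review, mode, diff, filter_review):
    """Run the policy filter; on any error, bypass it and return unfiltered.

    filter_review raises normally -- this wrapper is the only place that
    catches, so fail-open behavior has exactly one owner.
    """
    try:
        return filter_review(review, mode=mode, diff=diff)
    except Exception as exc:  # fail open: a policy bug must not kill the review
        log.exception("output policy failed; returning unfiltered review")
        review.meta.policy_bypassed = True
        review.meta.policy_bypass_reason = f"{type(exc).__name__}: {exc}"
        return review
```

Centralizing the try/except here means review.py and mcp_server.py never need their own error handling for the filter.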
github_review.py:
- build_review_comment() renders evidence as fenced code block
- format_summary_comment() gains summary_only_findings, policy_bypassed, display_capped_count params
- Summary-only findings rendered in collapsible details section
- Policy-bypassed warning annotation when filter_review() failed
- Display-capped count annotation when findings omitted for brevity
- post_review() passes new params through to summary

mcp_response.py:
- serialize_audit() includes summary_only_findings in output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New: tests/test_grippy_output_policy.py — 34 tests covering:
- Narration suppression (3), confidence threshold (5), evidence grounding (4)
- Three-set invariant (2), score recomputation (3), verdict derivation (4)
- Display caps (2), summary-only persistence (2), telemetry (5)
- Fail-open integration (1), diff helper (2), end-to-end (1)

Updated: test_grippy_retry.py — score assertions updated for recomputed scores (output policy recomputes 0-finding reviews to 100).
Updated: test_grippy_mcp_response.py — added summary_only_findings to expected serialize_audit() keys.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report ❌ Patch coverage is
tests/test_grippy_output_policy.py
Outdated
"title": "SQL injection risk",
"description": "User input interpolated into query.",
"suggestion": "Use parameterized queries.",
"evidence": 'f"SELECT * FROM {user_input}"',
🟠 HIGH: SQL injection risk: query built from interpolated input
Confidence: 95%
"evidence": 'f"SELECT * FROM {user_input}"',
Test case demonstrates unsafe construction of a SQL query using direct string interpolation with user-controlled input. If this pattern appears in production code, it would permit malicious inputs to execute arbitrary SQL. Even in tests, examples with vulnerable patterns can be adapted or misunderstood elsewhere.
Suggestion: Change the sample to use parameterized query placeholders (e.g., cursor.execute('SELECT ... WHERE q=?', (user_input,))) and clarify in test descriptions that string interpolation is unsafe.
— Test code should not illustrate unsafe query patterns, even as fixtures.
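The suggested fix can be illustrated concretely; this is a minimal, self-contained sqlite3 example of placeholder binding (the `docs` table and column are made up for the demonstration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (q TEXT)")
conn.execute("INSERT INTO docs VALUES (?)", ("hello",))

# A classic injection payload:
user_input = "hello'; DROP TABLE docs; --"

# Placeholder binding treats the whole string as data, never as SQL,
# so the payload matches nothing and executes nothing:
rows = conn.execute("SELECT * FROM docs WHERE q = ?", (user_input,)).fetchall()
```

With interpolation the same payload would alter the statement; with `?` binding the table survives and the query simply returns no rows.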
tests/test_grippy_output_policy.py
Outdated
@@ -10,3 +10,5 @@ def handle_request():
  user_input = request.args.get("q")
- old_code()
+ query = f"SELECT * FROM {user_input}"
🟠 HIGH: SQL injection risk: query built from interpolated input
Confidence: 95%
+ query = f"SELECT * FROM {user_input}"
Within the test diff, a query is constructed using f-string interpolation with user input (query = f"SELECT * FROM {user_input}"). Though present in test scaffolding, this propagates a dangerous anti-pattern. Developers may copy such patterns, leading to vulnerabilities in real code.
Suggestion: Update the test example to use a safe, parameterized pattern. Add a code comment noting the SQL injection risk of this construction.
— Unsafe SQL interpolation shouldn't appear unqualified, whether in real or test code.
tests/test_grippy_output_policy.py
Outdated
def test_grounded_evidence_inline(self) -> None:
    """10. Grounded evidence -> inline-eligible, no penalty."""
    f = _f(evidence='query = f"SELECT * FROM {user_input}"')
🟠 HIGH: SQL injection risk: query built from interpolated input
Confidence: 95%
f = _f(evidence='query = f"SELECT * FROM {user_input}"')
A test fixture demonstrates SQL query construction via variable interpolation, mirroring real-risk production vulnerabilities. This reinforces unsafe practices when tests are used as reference.
Suggestion: Where fixture patterns must illustrate risky code, add explicit comments (e.g., '# SQL injection - do NOT use in production'). Otherwise, prefer parameterized examples.
— Unsafe query patterns in test helpers risk normalization of dangerous code.
tests/test_grippy_output_policy.py
Outdated
def test_display_caps_dont_inflate_score(self) -> None:
    """13. 8 LOWs -> all 8 in scoring set -> display caps to 3 -> score unchanged."""
    findings = [
        _f(id=f"F-{i:03d}", severity="LOW", confidence=70, evidence="query user_input SELECT")
🟠 HIGH: SQL injection risk: query built from interpolated input
Confidence: 95%
_f(id=f"F-{i:03d}", severity="LOW", confidence=70, evidence="query user_input SELECT")
A bulk test creates several LOW-severity findings, all using the same interpolated query pattern that would be considered a SQL injection in actual application code. Tests modeling or instantiating these injectable patterns can unintentionally introduce vulnerabilities.
Suggestion: Ensure all test fixtures recommend or model parameterized SQL as the norm. If an unsafe pattern is needed for a negative test, clearly contaminate it with warnings.
— Test datasets shouldn't seed unsafe SQL building blocks, even for 'fake' findings.
tests/test_grippy_output_policy.py
Outdated
def test_summary_only_scored_not_posted(self) -> None:
    """14. Summary-only findings contribute to score but not inline posting."""
    grounded = _f(id="F-001", evidence='query = f"SELECT * FROM {user_input}"')
🟠 HIGH: SQL injection risk: query built from interpolated input
Confidence: 95%
grounded = _f(id="F-001", evidence='query = f"SELECT * FROM {user_input}"')
Direct unparameterized query construction with user input in a test case sets a poor example and may promote misuse in copied code.
Suggestion: Use parameterized SQL in all examples; explicitly flag intentionally unsafe lines as anti-patterns.
— If you must show the risky pattern, make sure to flag it with a prominent warning.
✅ Grippy Review — PASS | Score: 100/100 | Findings: 0 | Commit: e4c936d
Tests for the three new format_summary_comment params:
- policy_bypassed warning annotation
- display_capped_count omission annotation
- summary_only_findings collapsible section

Tests for build_review_comment evidence rendering:
- Non-empty evidence rendered as fenced code block
- Whitespace-only evidence produces no code block

Fixes quality-gate: github_review.py patch coverage was 25% (21 uncovered lines in new rendering paths), now covered by unit tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Suppresses false-positive rule engine findings on test fixture strings containing SQL injection patterns (evidence examples, diff fixtures). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
project-navi-bot
left a comment
All required CI checks passed. Auto-approved by navi-bot.
tests/test_grippy_output_policy.py
Outdated
id=f"F-{i:03d}",
severity="MEDIUM",
confidence=85,
evidence='query = f"SELECT * FROM {user_input}"',
🟠 HIGH: String-interpolated SQL evident in output policy test fixture
Confidence: 95%
evidence='query = f"SELECT * FROM {user_input}"',
In the test on line 329, an interpolated query string is set in a test Finding's evidence. Again, risky in application code; here, meant for policy filter coverage.
Suggestion: Reinforce in test or code comments that interpolated SQL is present intentionally and isn't proper for production.
— Test pattern triggers the correct policy filter; just ensure this NEVER migrates to real code paths.
tests/test_grippy_output_policy.py
Outdated
id=f"F-{i:03d}",
severity="LOW",
confidence=70,
evidence='query = f"SELECT * FROM {user_input}"',
🟠 HIGH: Unsafe SQL query interpolation in LOW finding example
Confidence: 95%
evidence='query = f"SELECT * FROM {user_input}"',
At line 347, a test generates several LOW severity Finding examples, but the evidence string represents unsafe SQL query composition with interpolated input. Even when used for test policy validation, this pattern is a real-world vulnerability.
Suggestion: Do not use or promote interpolated SQL in example code outside of clearly scoped, documented negative test cases.
— Exposure via test data, but important to show the anti-pattern is caught. Always document intent.
Superseded by review of 6f51356
project-navi-bot
left a comment
All required CI checks passed. Auto-approved by navi-bot.
Two fixes for the noise problem:

1. `# nogrip` now suppresses LLM findings (not just rule engine). The output policy parses nogrip pragmas from the diff and suppresses findings whose line range includes a nogrip line. Bare `# nogrip` suppresses all findings; targeted `# nogrip: SEC-001` only suppresses the matching rule_id.
2. Stale thread resolution no longer requires `isOutdated` from GitHub. Any absent (no longer in current findings), unresolved thread is resolved. This fixes persistent stale comments that never got cleaned up because the line didn't change between pushes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
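The pragma behavior described in fix 1 can be sketched roughly as follows; the regex, function names, and the line-indexed diff shape are assumptions, not the real output-policy implementation:

```python
import re

# Bare "# nogrip" suppresses everything on that line's range;
# "# nogrip: SEC-001" suppresses only the named rule.
NOGRIP_RE = re.compile(r"#\s*nogrip(?::\s*([A-Z0-9_-]+))?")

def nogrip_lines(added_lines):
    """Map line number -> suppressed rule_id ('*' means all rules)."""
    pragmas = {}
    for lineno, text in added_lines:
        m = NOGRIP_RE.search(text)
        if m:
            pragmas[lineno] = m.group(1) or "*"
    return pragmas

def is_suppressed(finding_start, finding_end, rule_id, pragmas):
    """True if the finding's line range covers a matching nogrip pragma."""
    for lineno, target in pragmas.items():
        if finding_start <= lineno <= finding_end and target in ("*", rule_id):
            return True
    return False
```

Applied to LLM findings as well as rule-engine findings, this is what lets a fixture line opt out of both detectors at once.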
project-navi-bot
left a comment
All required CI checks passed. Auto-approved by navi-bot.
    "> \u26a0\ufe0f **Output policy was bypassed due to an internal error."
    " Findings are unfiltered.**"
)
lines.append("")
🟡 MEDIUM: Malformed summary-only findings rendering code style
Confidence: 90%
#### {sev_emoji} {f.severity.value}: {f_title}
description, suggestion only -- omits confidence/grippy_note
The rendering logic for 'summary_only_findings' in the summary comment diverges from the format used for normal findings and omits key fields (e.g. confidence or grippy_note). This may cause confusion/inconsistency for consumers or break downstream automation expecting uniform markdown structure.
Suggestion: Make the rendering structure for summary-only findings match the normal finding comment (including confidence % and grippy_note). Consider extracting a shared formatting helper.
— Consistency in report formatting helps prevent confusion and bugs.
project-navi-bot
left a comment
All required CI checks passed. Auto-approved by navi-bot.
…ntity keys

- Evidence thin-token fail-open → fail-closed: <2 meaningful tokens routes to summary-only, not inline
- Split telemetry: threshold_suppressed_count → confidence_suppressed_count + evidence_suppressed_count; nogrip excluded from quality-judgment total
- Thread resolution: summary-only findings protect their threads from auto-resolve
- Identity key hash: markers now include sha256(title:severity:category)[:8] discriminator with conservative backward compat for old markers
- Dashboard count split: shows "N total (X inline, Y summary-only)"
- Nogrip acceptance test: proves PR #77 SQL fixture scenario → PASS

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
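The identity-key discriminator named in the commit (sha256 over title:severity:category, truncated to 8 hex chars) is small enough to sketch directly; the function name and the surrounding marker format are assumptions:

```python
import hashlib

def identity_key(title: str, severity: str, category: str) -> str:
    """8-hex-char discriminator embedded in a comment marker.

    Two findings that share title, severity, and category collapse to one
    thread; changing any of the three yields a different key.
    """
    raw = f"{title}:{severity}:{category}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:8]
```

Truncating to 8 characters keeps markers short while leaving collisions vanishingly unlikely at per-PR finding counts.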
project-navi-bot
left a comment
All required CI checks passed. Auto-approved by navi-bot.
tests/test_grippy_output_policy.py
Outdated
confidence=95,
line_start=11,
line_end=11,
evidence='query = f"SELECT * FROM {user_input}"',
🟠 HIGH: SQL injection risk in test: Interpolated user input in query
Confidence: 95%
query = f"SELECT * FROM {user_input}"
A test fixture on line 661 constructs a SQL query using direct string interpolation with user input (query = f"SELECT * FROM {user_input}"). Even though this occurs in a test context, it reinforces a dangerous pattern where user input is not parameterized, which can lead to SQL injection vulnerabilities in production code if copied or misunderstood by future maintainers.
Suggestion: Even within tests, use parameterized queries (e.g., cursor.execute(sql, params)) to demonstrate and reinforce safe patterns. Consider rewriting the fixture code to use parameter binding instead of direct interpolation.
— This would be critical in production. In tests, it's only HIGH, but the pattern is risky.
project-navi-bot
left a comment
All required CI checks passed. Auto-approved by navi-bot.
…idence

- fetch_thread_states: gh -F doesn't parse JSON arrays — switch to --input stdin with full query+variables payload
- Add # nogrip to SQL evidence string in nogrip acceptance test to prevent rule engine false positive (the exact PR #77 failure mode)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
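The stdin workaround amounts to serializing the full query+variables document yourself rather than relying on `gh api -F` field typing; the GraphQL query text and variable names here are illustrative, not the real fetch_thread_states payload:

```python
import json

# gh's -F flag coerces scalars but cannot express a JSON array variable,
# so we build the whole request body and pipe it via `--input -`.
payload = {
    "query": "query($ids: [ID!]!) { nodes(ids: $ids) { id } }",
    "variables": {"ids": ["THREAD_1", "THREAD_2"]},  # array survives intact
}
cmd = ["gh", "api", "graphql", "--input", "-"]  # "-" reads the body from stdin
body = json.dumps(payload)
# e.g. subprocess.run(cmd, input=body.encode(), capture_output=True)
```

Because the body is authored as one JSON document, array- and object-valued variables round-trip without any CLI type coercion.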
Summary
- output_policy.py: narration suppression, confidence gate, evidence posting gate, score recomputation, verdict derivation, display caps — three-set model (suppressed / scoring / display)
- summary_only_findings on GrippyReview, 7 telemetry fields on ReviewMeta (all backward-compatible defaults)
- retry.py owns fail-open; review.py + mcp_server.py pass mode/diff through
- mcp_response.py includes summary_only_findings

Why
Dogfood QA across PRs #71-#76 found 83% noise rate: 10/12 posted findings were narration-as-finding, scores were decorative (LLM-generated, not derived from findings), and 9 real issues were missed. Root cause:
confidence-filter.md defines a comprehensive post-generation filter, but none of it was implemented in code.

Three-set model
Score and verdict are computed from the scoring set (uncapped). Display caps only control what gets posted inline.
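As a sketch of that recomputation under assumed numbers (the PR mentions per-category caps but not the actual weights, so both values below are placeholders):

```python
SEVERITY_WEIGHT = {"HIGH": 15, "MEDIUM": 7, "LOW": 3}  # assumed deductions
CATEGORY_CAP = 30  # assumed: one noisy category can't sink the whole score

def recompute_score(scoring_set):
    """Score from the full scoring set, uncapped by display limits."""
    deductions = {}
    for f in scoring_set:
        weight = SEVERITY_WEIGHT.get(f.severity, 3)
        deductions[f.category] = deductions.get(f.category, 0) + weight
    total = sum(min(d, CATEGORY_CAP) for d in deductions.values())
    return max(0, 100 - total)
```

An empty scoring set recomputes to 100, which matches the 0-finding behavior noted in the test-update commit, and display caps never enter this function at all.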
Key design decisions
- mode is verdict authority (not review.audit_type — that's LLM output)
- parse_diff from rules/context.py reused — zero duplicated diff parsing
- Single fail-open owner: retry.py — filter_review() raises normally

Test plan
- test_grippy_output_policy.py (97% coverage on new module)

P1 (separate PR): line-range evidence grounding + trust boundary auto-escalation.
🤖 Generated with Claude Code