feat: output policy v2 — three-set model, evidence grounding, identity keys #78
Nelson Spence (Fieldnote-Echo) wants to merge 10 commits into main from
Conversation
…evidence posting gate

- Three-set model: suppressed (narration/below-threshold/empty-evidence), scoring set (inline + summary-only), display set (capped for posting)
- Narration detection: 11 phrase patterns + empty/dismissive suggestion check
- Confidence threshold: severity-specific minimums with category penalty (governance/observability -15)
- Evidence grounding: empty evidence suppresses; non-empty but ungrounded moves to summary-only; grounded stays inline-eligible
- Delegates to parse_diff from rules/context.py — no duplicated diff parsing
- Score recomputation from scoring set with per-category caps
- Verdict derivation from recomputed score + mode (call-site authority)
- Display caps: 5/file, 20/review, 3 LOW — caps don't affect score

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
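The three-set split in this commit can be sketched roughly as follows (a minimal sketch, not the actual output_policy.py: the set names mirror the commit message, and the predicate callables are hypothetical stand-ins for the real narration/threshold/evidence checks):

```python
from dataclasses import dataclass


@dataclass
class PolicyResult:
    suppressed: list  # narration / below-threshold / empty-evidence
    scoring: list     # inline + summary-only; this set drives the score
    display: list     # capped subset of scoring that actually gets posted


def split_findings(findings, is_narration, meets_threshold, has_evidence,
                   max_display=20):
    """Partition findings into the three sets (hypothetical predicates)."""
    suppressed, scoring = [], []
    for f in findings:
        if is_narration(f) or not meets_threshold(f) or not has_evidence(f):
            suppressed.append(f)
        else:
            scoring.append(f)
    # Display caps limit posting only — the score is computed from `scoring`.
    display = scoring[:max_display]
    return PolicyResult(suppressed, scoring, display)
```

The key invariant is that capping affects only `display`: trimming what gets posted never changes the recomputed score.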
GrippyReview gains summary_only_findings (scored but not inline-eligible).

ReviewMeta gains: score_before_policy, verdict_before_policy, policy_bypassed, policy_bypass_reason, narration_suppressed_count, threshold_suppressed_count, display_capped_count.

All new fields have defaults — backward compatible with existing reviews.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
retry.py: add mode/diff params to run_review(), call filter_review() after successful parse. Single fail-open owner — catches exceptions, sets policy_bypassed=True + reason, returns unfiltered review.

review.py: pass mode and diff through to run_review(), pass summary_only_findings + policy meta to post_review().

mcp_server.py: pass mode and diff to run_review() in audit_diff path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
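The single fail-open owner described above might look like this (a sketch under assumptions: filter_review and the policy_bypassed/policy_bypass_reason names come from the commit message; the review object and meta dict shape are illustrative):

```python
def apply_output_policy(review, mode, diff, filter_review):
    """Run the output policy; never let a policy bug kill the review.

    This is the single fail-open owner: any exception raised by
    filter_review() is caught here, recorded on the review meta, and the
    unfiltered review is returned instead of propagating the error.
    """
    try:
        return filter_review(review, mode=mode, diff=diff)
    except Exception as exc:  # fail-open: policy errors must not block reviews
        review.meta["policy_bypassed"] = True
        review.meta["policy_bypass_reason"] = f"{type(exc).__name__}: {exc}"
        return review
```

Keeping the try/except in exactly one place means callers (review.py, mcp_server.py) never need their own bypass handling.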
github_review.py:
- build_review_comment() renders evidence as fenced code block
- format_summary_comment() gains summary_only_findings, policy_bypassed, display_capped_count params
- Summary-only findings rendered in collapsible details section
- Policy-bypassed warning annotation when filter_review() failed
- Display-capped count annotation when findings omitted for brevity
- post_review() passes new params through to summary

mcp_response.py:
- serialize_audit() includes summary_only_findings in output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New: tests/test_grippy_output_policy.py — 34 tests covering:
- Narration suppression (3), confidence threshold (5), evidence grounding (4)
- Three-set invariant (2), score recomputation (3), verdict derivation (4)
- Display caps (2), summary-only persistence (2), telemetry (5)
- Fail-open integration (1), diff helper (2), end-to-end (1)

Updated: test_grippy_retry.py — score assertions updated for recomputed scores (output policy recomputes 0-finding reviews to 100).

Updated: test_grippy_mcp_response.py — added summary_only_findings to expected serialize_audit() keys.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests for the three new format_summary_comment params:
- policy_bypassed warning annotation
- display_capped_count omission annotation
- summary_only_findings collapsible section

Tests for build_review_comment evidence rendering:
- Non-empty evidence rendered as fenced code block
- Whitespace-only evidence produces no code block

Fixes quality-gate: github_review.py patch coverage was 25% (21 uncovered lines in new rendering paths), now covered by unit tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Suppresses false-positive rule engine findings on test fixture strings containing SQL injection patterns (evidence examples, diff fixtures).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes for the noise problem:

1. `# nogrip` now suppresses LLM findings (not just rule engine). The output policy parses nogrip pragmas from the diff and suppresses findings whose line range includes a nogrip line. Bare `# nogrip` suppresses all findings; targeted `# nogrip: SEC-001` only suppresses matching rule_id.

2. Stale thread resolution no longer requires `isOutdated` from GitHub. Any absent (no longer in current findings), unresolved thread is resolved. This fixes persistent stale comments that never got cleaned up because the line didn't change between pushes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
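The bare-vs-targeted pragma semantics could be implemented roughly like this (a sketch; the `SEC-001`-style rule-id pattern and the single-line matching are assumptions, the real policy matches against a finding's whole line range):

```python
import re

# Bare "# nogrip" or targeted "# nogrip: SEC-001" (rule-id format assumed).
NOGRIP_RE = re.compile(r"#\s*nogrip(?::\s*(?P<rule>[A-Z]+-\d+))?")


def nogrip_suppresses(line_text: str, finding_rule_id: str) -> bool:
    """True if a nogrip pragma on this line suppresses the given finding.

    Bare `# nogrip` suppresses every finding on the line; targeted
    `# nogrip: SEC-001` suppresses only findings with that rule_id.
    """
    m = NOGRIP_RE.search(line_text)
    if not m:
        return False
    target = m.group("rule")
    return target is None or target == finding_rule_id
```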
…ntity keys

- Evidence thin-token fail-open → fail-closed: <2 meaningful tokens routes to summary-only, not inline
- Split telemetry: threshold_suppressed_count → confidence_suppressed_count + evidence_suppressed_count; nogrip excluded from quality-judgment total
- Thread resolution: summary-only findings protect their threads from auto-resolve
- Identity key hash: markers now include sha256(title:severity:category)[:8] discriminator with conservative backward compat for old markers
- Dashboard count split: shows "N total (X inline, Y summary-only)"
- Nogrip acceptance test: proves PR #77 SQL fixture scenario → PASS

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
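The identity-key discriminator named in this commit, sha256(title:severity:category)[:8], amounts to the following (a minimal sketch; the `:`-joined payload order comes from the commit message, the function name is hypothetical):

```python
import hashlib


def identity_discriminator(title: str, severity: str, category: str) -> str:
    """8-hex-char discriminator appended to comment markers, so distinct
    findings at the same file/line/category no longer collide in dedup."""
    payload = f"{title}:{severity}:{category}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:8]
```

Because the hash covers title, severity, and category, two different findings on the same line get different markers, while re-posting the same finding remains stable across pushes.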
…idence

- fetch_thread_states: gh -F doesn't parse JSON arrays — switch to --input stdin with full query+variables payload
- Add # nogrip to SQL evidence string in nogrip acceptance test to prevent rule engine false positive (the exact PR #77 failure mode)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
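The stdin workaround looks roughly like this from Python (a sketch: `gh api graphql --input -` reads a JSON payload from stdin, which is what lets array variables through; the GraphQL query and variable names here are illustrative, not the real fetch_thread_states query):

```python
import json
import subprocess

# Illustrative query — `gh api graphql -F ids=...` cannot encode a JSON
# array variable, so the full query+variables payload goes via stdin.
QUERY = "query($ids: [ID!]!) { nodes(ids: $ids) { id } }"


def build_payload(thread_ids: list[str]) -> str:
    """Serialize the full query+variables payload for --input stdin."""
    return json.dumps({"query": QUERY, "variables": {"ids": thread_ids}})


def fetch_thread_states(thread_ids: list[str]) -> dict:
    proc = subprocess.run(
        ["gh", "api", "graphql", "--input", "-"],  # "-" = read from stdin
        input=build_payload(thread_ids),
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)
```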
project-navi-bot
left a comment
All required CI checks passed. Auto-approved by navi-bot.
@@ -408,6 +454,29 @@ def format_summary_comment(
    lines.append("</details>")
🟡 MEDIUM: Potential side effect: summary_only_findings may be None
Confidence: 95%
summary_only_count=len(summary_only_findings) if summary_only_findings else 0,
In format_summary_comment, summary_only_findings defaults to None, so any call to len(summary_only_findings) that lacks the 'if summary_only_findings' guard raises a TypeError when the argument is not explicitly passed.
Suggestion: Ensure summary_only_findings is always a list (never None), or update the check to assign 0 if it is None. For example, use 'len(summary_only_findings) if summary_only_findings else 0' or default to an empty list at function entry.
— If summary_only_findings is unexpectedly None, you get a TypeError. Worth hardening.
@@ -225,19 +243,24 @@ def build_review_comment(finding: Finding) -> dict[str, str | int]:
    # --- GitHub comment fetching ---
🔵 LOW: summary_only_findings defaults to None but is treated as a list
Confidence: 92%
summary_only_findings: list[Finding] | None = None,
The parameter summary_only_findings has a default of None in several functions ('def format_summary_comment', 'def post_review'), but is later treated as a list. Using None requires repeated None checks, and is risk-prone.
Suggestion: Default these parameters to an empty list if possible, or at the start of the function, assign summary_only_findings = summary_only_findings or [].
— Not wrong, but making the default an empty list would simplify downstream logic.
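The suggested hardening for both None-default findings is a one-line normalization at function entry (a sketch; the real format_summary_comment signature is abbreviated to the three new params):

```python
def format_summary_comment(*, summary_only_findings=None,
                           policy_bypassed=False, display_capped_count=0):
    """Sketch of the hardening; real signature abbreviated."""
    # Normalize once at entry: downstream code can assume a list, and
    # len(...) is safe without repeated `if summary_only_findings` guards.
    summary_only_findings = summary_only_findings or []
    summary_only_count = len(summary_only_findings)
    return summary_only_count
```

Keeping `None` as the public default preserves the "argument omitted" signal for callers while removing the TypeError risk internally.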
| @@ -189,6 +199,7 @@ class GrippyReview(BaseModel): | |||
| pr: PRMetadata | |||
🔵 LOW: Possible mutable default with summary_only_findings Field
Confidence: 88%
summary_only_findings: list[Finding] = Field(default_factory=list)
The summary_only_findings field is defined as Field(default_factory=list), making the default state an empty list. This is the correct approach to avoid mutable default arguments, but be careful if mutating this list in-place in code paths.
Suggestion: Avoid mutating the summary_only_findings list in-place on the review object; always assign a fresh list unless intentionally appending.
— Using default_factory for lists is good; just avoid accidental shared-reference pitfalls elsewhere.
✅ Grippy Review — PASS | Score: 91/100 | Findings: 3 | Delta: 3 new | Commit: 6f944fd
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6f944fd7a9
if diff:
    added_lines, nogrip_index = _parse_diff_context(diff)
else:
    added_lines, nogrip_index = None, {}
Treat empty diff as provided context
In filter_review, using if diff: sends empty-string diffs down the “no diff provided” branch, which bypasses both nogrip suppression and evidence grounding. This is reachable when the caller passes an empty diff string (for example after ignore filtering removes all file hunks), so non-empty-evidence findings become inline-eligible and can still affect verdicts even though there is no added code to ground against. Check diff is not None here so an empty diff is parsed as empty context instead of fail-open inline behavior.
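The fix Codex is suggesting amounts to distinguishing None from "" (a sketch; `_parse_diff_context` is the helper named in the diff hunk, stubbed here with an assumed minimal behavior):

```python
def _parse_diff_context(diff: str):
    """Stub of the real parser: returns (added_lines, nogrip_index)."""
    added = [ln[1:] for ln in diff.splitlines()
             if ln.startswith("+") and not ln.startswith("+++")]
    return added, {}


def diff_context(diff):
    # `if diff:` treats "" the same as None and skips nogrip/evidence
    # checks entirely. Checking `is not None` parses an empty diff as
    # *empty context*, so ungrounded findings get demoted instead of
    # failing open into inline-eligible.
    if diff is not None:
        return _parse_diff_context(diff)
    return None, {}
```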
matches = sum(1 for t in tokens if t in added_text)
return matches >= 2
Deduplicate evidence tokens before overlap counting
_evidence_is_grounded counts token matches with sum(1 for t in tokens if t in added_text), so repeated evidence tokens are double-counted. A finding with evidence like "query query" can satisfy the >= 2 match threshold even if only one distinct token overlaps with the diff, causing thin evidence to be treated as grounded and posted inline. Count unique normalized tokens (or intersect token sets) to enforce the intended “at least two tokens” requirement.
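Counting distinct overlapping tokens instead looks like this (a sketch of the suggested fix; the lowercase normalization and token shape are assumptions about the surrounding code):

```python
def evidence_is_grounded(evidence_tokens: list[str], added_text: str) -> bool:
    # Intersect the *set* of normalized tokens with the diff so repeated
    # evidence tokens ("query query") cannot double-count toward the
    # two-distinct-token grounding threshold.
    haystack = added_text.lower()
    distinct = {t.lower() for t in evidence_tokens}
    matches = sum(1 for t in distinct if t in haystack)
    return matches >= 2
```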
Codecov Report: ❌ Patch coverage is
Summary
- Output policy (output_policy.py): narration suppression, severity-specific confidence thresholds, evidence grounding (empty → suppress, thin → summary-only, grounded → inline), nogrip pragma support, score recomputation from scoring set, verdict derivation from mode
- Identity keys: sha256(title:severity:category)[:8] discriminator — prevents dedup collisions for distinct findings at same file/line/category
- Telemetry: confidence_suppressed_count, evidence_suppressed_count, nogrip_suppressed_count tracked independently. Nogrip excluded from quality-judgment total
- fetch_thread_states fixed to use --input stdin for JSON array variables
- Dashboard: Findings: N total (X inline, Y summary-only) when summary-only > 0

Test plan
🤖 Generated with Claude Code