Skip to content

feat: output policy v2 — three-set model, evidence grounding, identity keys#78

Open
Nelson Spence (Fieldnote-Echo) wants to merge 10 commits intomainfrom
feat/output-policy-v2-clean
Open

feat: output policy v2 — three-set model, evidence grounding, identity keys#78
Nelson Spence (Fieldnote-Echo) wants to merge 10 commits intomainfrom
feat/output-policy-v2-clean

Conversation

@Fieldnote-Echo
Copy link
Copy Markdown
Member

Summary

  • Output policy (output_policy.py): narration suppression, severity-specific confidence thresholds, evidence grounding (empty → suppress, thin → summary-only, grounded → inline), nogrip pragma support, score recomputation from scoring set, verdict derivation from mode
  • Three-set model: suppressed (discarded), scoring set (inline + summary-only), display set (capped). Score/verdict computed from scoring set, not display set
  • Identity keys: comment markers now include sha256(title:severity:category)[:8] discriminator — prevents dedup collisions for distinct findings at same file/line/category
  • Split telemetry: confidence_suppressed_count, evidence_suppressed_count, nogrip_suppressed_count tracked independently. Nogrip excluded from quality-judgment total
  • Thread resolution: summary-only findings protect their threads from auto-resolve. fetch_thread_states fixed to use stdin for JSON array variables
  • Dashboard: shows Findings: N total (X inline, Y summary-only) when summary-only > 0

Test plan

  • 39 output policy tests (narration, confidence, evidence, nogrip, three-set, score, verdict, caps, telemetry, fail-open, e2e)
  • Nogrip acceptance test proves PR feat: output policy v2 — review-policy rewrite #77 scenario (SQL in fixtures → PASS)
  • Thread resolution test: summary-only findings protect threads
  • Identity key tests: 4-tuple markers with backward compat
  • Full suite: 1181 passed, 0 failed
  • Lint, format, mypy clean
  • Rule engine: 0 findings on PR diff

🤖 Generated with Claude Code

…evidence posting gate

Three-set model: suppressed (narration/below-threshold/empty-evidence),
scoring set (inline + summary-only), display set (capped for posting).

- Narration detection: 11 phrase patterns + empty/dismissive suggestion check
- Confidence threshold: severity-specific minimums with category penalty
  (governance/observability -15)
- Evidence grounding: empty evidence suppresses; non-empty but ungrounded
  moves to summary-only; grounded stays inline-eligible
- Delegates to parse_diff from rules/context.py — no duplicated diff parsing
- Score recomputation from scoring set with per-category caps
- Verdict derivation from recomputed score + mode (call-site authority)
- Display caps: 5/file, 20/review, 3 LOW — caps don't affect score

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GrippyReview gains summary_only_findings (scored but not inline-eligible).
ReviewMeta gains: score_before_policy, verdict_before_policy,
policy_bypassed, policy_bypass_reason, narration_suppressed_count,
threshold_suppressed_count, display_capped_count.

All new fields have defaults — backward compatible with existing reviews.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
retry.py: add mode/diff params to run_review(), call filter_review()
after successful parse. Single fail-open owner — catches exceptions,
sets policy_bypassed=True + reason, returns unfiltered review.

review.py: pass mode and diff through to run_review(), pass
summary_only_findings + policy meta to post_review().

mcp_server.py: pass mode and diff to run_review() in audit_diff path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
github_review.py:
- build_review_comment() renders evidence as fenced code block
- format_summary_comment() gains summary_only_findings, policy_bypassed,
  display_capped_count params
- Summary-only findings rendered in collapsible details section
- Policy-bypassed warning annotation when filter_review() failed
- Display-capped count annotation when findings omitted for brevity
- post_review() passes new params through to summary

mcp_response.py:
- serialize_audit() includes summary_only_findings in output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New: tests/test_grippy_output_policy.py — 34 tests covering:
- Narration suppression (3), confidence threshold (5), evidence grounding (4)
- Three-set invariant (2), score recomputation (3), verdict derivation (4)
- Display caps (2), summary-only persistence (2), telemetry (5)
- Fail-open integration (1), diff helper (2), end-to-end (1)

Updated: test_grippy_retry.py — score assertions updated for recomputed
scores (output policy recomputes 0-finding reviews to 100).

Updated: test_grippy_mcp_response.py — added summary_only_findings to
expected serialize_audit() keys.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests for the three new format_summary_comment params:
- policy_bypassed warning annotation
- display_capped_count omission annotation
- summary_only_findings collapsible section

Tests for build_review_comment evidence rendering:
- Non-empty evidence rendered as fenced code block
- Whitespace-only evidence produces no code block

Fixes quality-gate: github_review.py patch coverage was 25% (21 uncovered
lines in new rendering paths), now covered by unit tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Suppresses false-positive rule engine findings on test fixture strings
containing SQL injection patterns (evidence examples, diff fixtures).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes for the noise problem:

1. `# nogrip` now suppresses LLM findings (not just rule engine).
   The output policy parses nogrip pragmas from the diff and suppresses
   findings whose line range includes a nogrip line. Bare `# nogrip`
   suppresses all findings; targeted `# nogrip: SEC-001` only
   suppresses matching rule_id.

2. Stale thread resolution no longer requires `isOutdated` from GitHub.
   Any absent (no longer in current findings), unresolved thread is
   resolved. This fixes persistent stale comments that never got cleaned
   up because the line didn't change between pushes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ntity keys

- Evidence thin-token fail-open → fail-closed: <2 meaningful tokens
  routes to summary-only, not inline
- Split telemetry: threshold_suppressed_count → confidence_suppressed_count
  + evidence_suppressed_count; nogrip excluded from quality-judgment total
- Thread resolution: summary-only findings protect their threads from
  auto-resolve
- Identity key hash: markers now include sha256(title:severity:category)[:8]
  discriminator with conservative backward compat for old markers
- Dashboard count split: shows "N total (X inline, Y summary-only)"
- Nogrip acceptance test: proves PR #77 SQL fixture scenario → PASS

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…idence

- fetch_thread_states: gh -F doesn't parse JSON arrays — switch to
  --input stdin with full query+variables payload
- Add # nogrip to SQL evidence string in nogrip acceptance test to
  prevent rule engine false positive (the exact PR #77 failure mode)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@project-navi-bot project-navi-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All required CI checks passed. Auto-approved by navi-bot.

Copy link
Copy Markdown
Collaborator

@project-navi-bot project-navi-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All required CI checks passed. Auto-approved by navi-bot.

@@ -408,6 +454,29 @@ def format_summary_comment(
lines.append("</details>")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MEDIUM: Potential side effect: summary_only_findings may be None

Confidence: 95%

summary_only_count=len(summary_only_findings) if summary_only_findings else 0,

In function format_summary_comment, summary_only_findings is checked with 'if summary_only_findings' before using len(summary_only_findings), but if it's None (the default), len() would raise a TypeError. This could lead to exceptions if the argument is not explicitly passed.

Suggestion: Ensure summary_only_findings is always a list (never None), or update the check to assign 0 if it is None. For example, use 'len(summary_only_findings) if summary_only_findings else 0' or default to an empty list at function entry.

— If summary_only_findings is unexpectedly None, you get a TypeError. Worth hardening.

@@ -225,19 +243,24 @@ def build_review_comment(finding: Finding) -> dict[str, str | int]:
# --- GitHub comment fetching ---
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 LOW: Default param for summary_only_findings is mutable None

Confidence: 92%

summary_only_findings: list[Finding] | None = None,

The parameter summary_only_findings has a default of None in several functions ('def format_summary_comment', 'def post_review'), but is later treated as a list. Using None requires repeated None checks, and is risk-prone.

Suggestion: Default these parameters to an empty list if possible, or at the start of the function, assign summary_only_findings = summary_only_findings or [].

— Not wrong, but making the default an empty list would simplify downstream logic.

@@ -189,6 +199,7 @@ class GrippyReview(BaseModel):
pr: PRMetadata
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 LOW: Possible mutable default with summary_only_findings Field

Confidence: 88%

summary_only_findings: list[Finding] = Field(default_factory=list)

The summary_only_findings field is defined as Field(default_factory=list), making the default state an empty list. This is the correct approach to avoid mutable default arguments, but be careful if mutating this list in-place in code paths.

Suggestion: Avoid mutating the summary_only_findings list in-place on the review object; always assign a fresh list unless intentionally appending.

— Use default_factory for lists-that's good-but avoid accidental shared reference pitfalls elsewhere.

@github-actions
Copy link
Copy Markdown
Contributor

✅ Grippy Review — PASS

Score: 91/100 | Findings: 3

Delta: 3 new


Commit: 6f944fd

Copy link
Copy Markdown
Collaborator

@project-navi-bot project-navi-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All required CI checks passed. Auto-approved by navi-bot.

Copy link
Copy Markdown
Collaborator

@project-navi-bot project-navi-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All required CI checks passed. Auto-approved by navi-bot.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6f944fd7a9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".

Comment on lines +363 to +366
if diff:
added_lines, nogrip_index = _parse_diff_context(diff)
else:
added_lines, nogrip_index = None, {}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Treat empty diff as provided context

In filter_review, using if diff: sends empty-string diffs down the “no diff provided” branch, which bypasses both nogrip suppression and evidence grounding. This is reachable when the caller passes an empty diff string (for example after ignore filtering removes all file hunks), so non-empty-evidence findings become inline-eligible and can still affect verdicts even though there is no added code to ground against. Check diff is not None here so an empty diff is parsed as empty context instead of fail-open inline behavior.

Useful? React with 👍 / 👎.

Comment on lines +165 to +166
matches = sum(1 for t in tokens if t in added_text)
return matches >= 2
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Deduplicate evidence tokens before overlap counting

_evidence_is_grounded counts token matches with sum(1 for t in tokens if t in added_text), so repeated evidence tokens are double-counted. A finding with evidence like "query query" can satisfy the >= 2 match threshold even if only one distinct token overlaps with the diff, causing thin evidence to be treated as grounded and posted inline. Count unique normalized tokens (or intersect token sets) to enforce the intended “at least two tokens” requirement.

Useful? React with 👍 / 👎.

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 15, 2026

Codecov Report

❌ Patch coverage is 96.96970% with 8 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/grippy/output_policy.py 97.15% 5 Missing ⚠️
src/grippy/github_review.py 95.58% 3 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Collaborator

@project-navi-bot project-navi-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All required CI checks passed. Auto-approved by navi-bot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants