fix(detect): Filter bright rectangles from TOC leak bbox matching #218

Draft
mlissner wants to merge 2 commits into detect-ssn-under-white-rect-20260418 from fix-toc-bbox-highlight-false-positive-20260418

@mlissner
Member

Summary

Fix false TOC leak detections on PDFs with colored highlight annotations (pink, yellow, gray highlights over text in academic papers, etc.).

Root cause: get_good_rectangles() returns all opaque filled rectangles, including colored highlights. Their bboxes were being added to redaction_bboxes before any color filtering, polluting the TOC leak matcher.

Fix: Filter rectangle bboxes by _is_dark_color on fill color before adding to redaction_bboxes. This excludes bright highlights while keeping dark rectangles (including applied redactions with no text underneath).
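
A minimal sketch of the idea behind the fix, not the project's actual code. The `FilledRect` type, the luminance formula, the `0.25` threshold, and the helper names `is_dark_color` and `collect_redaction_bboxes` are all assumptions made for illustration; the real `_is_dark_color` and `get_good_rectangles` may have different signatures and thresholds.

```python
from typing import NamedTuple, Optional

# Hypothetical types for illustration; the real code may represent
# rectangles and colors differently.
RGB = tuple[float, float, float]          # components in the range 0.0-1.0
BBox = tuple[float, float, float, float]  # (x0, y0, x1, y1)


class FilledRect(NamedTuple):
    bbox: BBox
    fill: Optional[RGB]  # None means the rectangle has no fill color


def is_dark_color(color: Optional[RGB], threshold: float = 0.25) -> bool:
    """Rough darkness check using a weighted sum of the RGB components.

    The weights are the common Rec. 709 luma coefficients; the threshold
    is an assumed value, not the one used by the project.
    """
    if color is None:
        return False
    r, g, b = color
    return 0.2126 * r + 0.7152 * g + 0.0722 * b <= threshold


def collect_redaction_bboxes(rects: list[FilledRect]) -> list[BBox]:
    """Keep only bboxes of dark-filled rectangles for TOC leak matching."""
    return [rect.bbox for rect in rects if is_dark_color(rect.fill)]


rects = [
    FilledRect((10, 10, 200, 24), (0.0, 0.0, 0.0)),   # black redaction bar
    FilledRect((10, 40, 200, 54), (1.0, 1.0, 0.0)),   # yellow highlight
    FilledRect((10, 70, 200, 84), (1.0, 0.75, 0.8)),  # pink highlight
    FilledRect((10, 100, 200, 114), (0.8, 0.8, 0.8)), # light gray highlight
]
kept = collect_redaction_bboxes(rects)
print(kept)
```

Because the check looks only at fill color, a black bar is kept whether or not any text remains underneath it, which is the property the fix relies on for applied redactions.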

Note: The initial suggestion was to move bbox collection after the pixmap filter, but that would have broken TOC leak detection for properly applied redactions (text removed, black bar left) since those have no text and don't survive the pixmap pipeline.

Test plan

  • New test test_highlight_annotations_toc_no_results with a real academic paper PDF
  • Existing test_toc_leak (7-page synthetic) still passes — applied redaction on page 1 still detected
  • Existing test_toc_leak_real (JOSH MERRITT) still passes
  • All 42 tests pass
  • mypy and pre-commit pass
  • Merge after feat(detect): Detect SSNs hidden under white rectangles #217 lands

🤖 Generated with Claude Code

mlissner and others added 2 commits April 18, 2026 09:37
Colored highlight annotations (pink, yellow, gray) pass through
get_good_rectangles but aren't redactions. Their bboxes were
polluting the TOC leak matcher, causing false positives on PDFs
with highlighted text and TOC entries.

Now only dark rectangles (checked via _is_dark_color on fill color)
contribute bboxes for TOC matching. This preserves detection of
applied redactions (text removed, black bar left) which have dark
fill colors even though no text survives the pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ives

Academic paper with colored highlight annotations and TOC entries.
Verifies that highlighted text doesn't trigger false TOC leak
detections.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>