
feat(detect): Detect TOC/bookmark entries that leak redacted content #212

Open
mlissner wants to merge 3 commits into detect-image-redactions-20260407 from 2-detect-toc-bookmark-leaks-20260407

@mlissner mlissner commented Apr 7, 2026

Summary

  • Detect PDF bookmarks that reveal redacted heading content
  • Adds get_redaction_bboxes() to collect all redaction-shaped objects (rectangles, images, annotations, cross-hatches, X-replacement text) in a single pass — shared between bad-redaction detection and TOC leak detection for performance
  • get_toc_leaks() spatially matches bookmark targets against redaction bboxes (20pt y-tolerance), then checks if the bookmark title contains words absent from the page text
  • Synthetic 7-page test PDF covers all redaction types: applied redaction, black bar, dark image, X-replacement, unapplied Redact, dark highlight, and a no-redaction control
  • Real-world test with the JOSH MERRITT declaration
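
The matching described in the summary can be sketched roughly as follows. This is only an illustration, not the code in this PR: `BBox`, `Bookmark`, `find_toc_leaks`, and the dictionary inputs are hypothetical names and shapes, and the real `get_toc_leaks()` may tokenize text and compare positions differently. Only the 20pt y-tolerance and the "title words absent from the page text" check come from the description above.

```python
import re
from dataclasses import dataclass

Y_TOLERANCE = 20.0  # points; the tolerance stated in the PR summary


@dataclass(frozen=True)
class BBox:
    """Hypothetical redaction-shaped box in page coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class Bookmark:
    """Hypothetical bookmark: title plus target page and y position."""
    title: str
    page: int
    y: float


def _words(text):
    # Lowercased word set; an assumed tokenization, not the PR's.
    return {w.lower() for w in re.findall(r"\w+", text)}


def find_toc_leaks(bookmarks, bboxes_by_page, text_by_page,
                   y_tolerance=Y_TOLERANCE):
    """Return (bookmark, missing_words) for bookmarks that land near a
    redaction box and whose title has words not found in the page text."""
    leaks = []
    for bm in bookmarks:
        # Spatial match: is the bookmark target vertically within
        # y_tolerance of any redaction box on the same page?
        near = any(
            box.y0 - y_tolerance <= bm.y <= box.y1 + y_tolerance
            for box in bboxes_by_page.get(bm.page, [])
        )
        if not near:
            continue
        # Content check: title words that the visible page text lacks.
        missing = _words(bm.title) - _words(text_by_page.get(bm.page, ""))
        if missing:
            leaks.append((bm, sorted(missing)))
    return leaks


# Made-up example data: page 0 has an X-replaced heading under a box,
# page 1 is an unredacted control.
bookmarks = [
    Bookmark("Declaration of Jane Doe", page=0, y=100.0),
    Bookmark("Introduction", page=1, y=50.0),
]
bboxes_by_page = {0: [BBox(72.0, 95.0, 300.0, 110.0)]}
text_by_page = {0: "Declaration of XXXX XXX", 1: "Introduction"}

for bm, missing in find_toc_leaks(bookmarks, bboxes_by_page, text_by_page):
    print(bm.title, missing)
```

Collecting the boxes once per page and passing them in mirrors the stated reason for `get_redaction_bboxes()`: the same list can feed both bad-redaction detection and TOC leak detection without a second pass over the page.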

Fixes #2

Test plan

🤖 Generated with Claude Code

mlissner and others added 3 commits April 12, 2026 09:24
When someone redacts a heading but forgets to sanitize the PDF
bookmarks, the bookmark still contains the original text.

Adds get_redaction_bboxes() to collect all redaction-shaped objects
(rectangles, images, annotations, cross-hatches, X-replacement text)
in a single pass. get_toc_leaks() spatially matches bookmark targets
against these bboxes, then checks if the bookmark title contains
words absent from the page text.

Also adds X-replacement text locations to the bbox collection so
TOC leaks are detected even when the only redaction evidence is
replaced characters.

Fixes #2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Synthetic 7-page PDF covering all redaction types (applied
redaction, black bar, dark image, X-replacement, unapplied Redact,
dark highlight, and a no-redaction control). Each page is
self-documenting.

Also adds real-world court filing where JOSH MERRITT's name leaks
through the bookmark despite being X'd out and covered by images.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mlissner mlissner force-pushed the 2-detect-toc-bookmark-leaks-20260407 branch from 4bb99e9 to 72eedab Compare April 12, 2026 16:26