fix(detect): Filter bright rectangles from TOC leak bbox matching #218
Draft
mlissner wants to merge 2 commits into
Colored highlight annotations (pink, yellow, gray) pass get_good_rectangles but aren't redactions. Their bboxes were polluting the TOC leak matcher, causing false positives on PDFs with highlighted text and TOC entries.

Now only dark rectangles (checked via _is_dark_color on fill color) contribute bboxes for TOC matching. This preserves detection of applied redactions (text removed, black bar left), which have dark fill colors even though no text survives the pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ives

Academic paper with colored highlight annotations and TOC entries. Verifies that highlighted text doesn't trigger false TOC leak detections.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Fix false TOC leak detections on PDFs with colored highlight annotations (pink, yellow, gray highlights over text in academic papers, etc.).
Root cause: `get_good_rectangles()` returns all opaque filled rectangles, including colored highlights. Their bboxes were being added to `redaction_bboxes` before any color filtering, polluting the TOC leak matcher.

Fix: Filter rectangle bboxes by `_is_dark_color` on fill color before adding to `redaction_bboxes`. This excludes bright highlights while keeping dark rectangles (including applied redactions with no text underneath).

Note: The initial suggestion was to move bbox collection after the pixmap filter, but that would have broken TOC leak detection for properly applied redactions (text removed, black bar left), since those have no text and don't survive the pixmap pipeline.
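For reviewers, here is a minimal sketch of the filtering idea described above. It is not the code in this PR: the `Rect` type, the `is_dark_fill` helper, the luminance formula, and the `0.3` threshold are all assumptions made for illustration, and the real `_is_dark_color` may use a different test entirely.

```python
from typing import List, NamedTuple, Optional, Tuple

BBox = Tuple[float, float, float, float]
RGB = Tuple[float, float, float]  # components assumed to be in the range 0..1


class Rect(NamedTuple):
    """Hypothetical stand-in for a rectangle that passed the 'good rectangle' check."""
    bbox: BBox
    fill: Optional[RGB]


def is_dark_fill(fill: Optional[RGB], threshold: float = 0.3) -> bool:
    """Illustrative darkness test (assumed, not the project's _is_dark_color).

    Uses relative luminance with Rec. 709 weights; anything below the
    threshold counts as dark. Rectangles without a fill are never dark.
    """
    if fill is None:
        return False
    r, g, b = fill
    luminance = 0.2126 * r + 0.7152 * g + 0.0722 * b
    return luminance < threshold


def collect_redaction_bboxes(rects: List[Rect]) -> List[BBox]:
    """Keep only bboxes of dark rectangles, so bright highlights are excluded."""
    return [rect.bbox for rect in rects if is_dark_fill(rect.fill)]


rects = [
    Rect(bbox=(10, 10, 100, 20), fill=(0.0, 0.0, 0.0)),    # black redaction bar
    Rect(bbox=(10, 30, 100, 40), fill=(1.0, 1.0, 0.0)),    # yellow highlight
    Rect(bbox=(10, 50, 100, 60), fill=(1.0, 0.75, 0.8)),   # pink highlight
    Rect(bbox=(10, 70, 100, 80), fill=(0.5, 0.5, 0.5)),    # gray highlight
]
redaction_bboxes = collect_redaction_bboxes(rects)
```

Because the filter looks only at fill color, a black bar left behind by an applied redaction still contributes its bbox even when no text remains underneath, which is the case the note above is concerned with.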
Test plan
- `test_highlight_annotations_toc_no_results` with real academic paper PDF
- `test_toc_leak` (7-page synthetic) still passes; the applied redaction on page 1 is still detected
- `test_toc_leak_real` (JOSH MERRITT) still passes

🤖 Generated with Claude Code