
feat(detect): Detect SSNs hidden under white rectangles #217

Open
mlissner wants to merge 5 commits into fix-crosshatch-and-control-chars-20260412 from detect-ssn-under-white-rect-20260418

Conversation

@mlissner
Member

Summary

Detect Social Security Numbers hidden under white rectangles, as seen in this bankruptcy form where form cell backgrounds cover the SSN text.

Two changes:

  1. PII bypass of pixmap filter: redactions containing PII patterns (currently only SSNs) skip the pixmap check entirely. White rectangles hiding SSNs render as non-unicolor because form grid lines cross them, which caused the pixmap filter to drop them incorrectly. Because the text passed the intersection checks (a rectangle was drawn on top of it) and matches a PII pattern, it should always be flagged.

  2. Occlusion threshold changed from > to >=: the white covering rectangles in this PDF are 8pt tall while the character bboxes are 10pt tall, giving exactly 80.0% occlusion. The strict > check rejected them.
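
The boundary case in change 2 can be worked through with a small sketch. This is not the code in xray/pdf_utils.py; the `Box` type, the `occluded_fraction` helper, and the 6pt character width are assumptions made for illustration. Only the 8pt and 10pt heights and the 80% threshold come from the description above.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in PDF points (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        # Clamp to zero so non-overlapping boxes yield an empty intersection.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def occluded_fraction(char: Box, cover: Box) -> float:
    """Fraction of the character bbox covered by the rectangle."""
    if char.area == 0:
        return 0.0
    inter = Box(
        max(char.x0, cover.x0),
        max(char.y0, cover.y0),
        min(char.x1, cover.x1),
        min(char.y1, cover.y1),
    )
    return inter.area / char.area


THRESHOLD = 0.8  # the 80% cutoff named in the PR description

# A 10pt-tall character bbox covered by an 8pt-tall white rectangle.
# The 6pt width is an assumed value; it cancels out of the ratio.
char_bbox = Box(0, 0, 6, 10)
white_rect = Box(0, 1, 6, 9)

fraction = occluded_fraction(char_bbox, white_rect)
strict = fraction > THRESHOLD      # old comparison
inclusive = fraction >= THRESHOLD  # new comparison
print(fraction, strict, inclusive)
```

With these inputs the fraction lands exactly on the threshold, so the strict comparison rejects the character and the inclusive one accepts it. Real bbox coordinates are floats, so whether a given PDF hits the boundary exactly depends on how its coordinates round.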

Test plan

🤖 Generated with Claude Code

mlissner and others added 2 commits April 18, 2026 08:56
PII-containing redactions (SSNs for now) bypass the pixmap filter.
White rectangles hiding SSNs render as non-unicolor due to form
grid lines, but the text is fully hidden and extractable.

Also changes the occlusion threshold from > to >= so characters
that are exactly 80% occluded are counted (the covering rectangles
in this PDF are slightly shorter than the character bboxes).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Real-world bankruptcy form where an SSN is covered by white form
cell backgrounds. The text is invisible but extractable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread xray/pdf_utils.py Outdated
Keeps the PII/pixmap split logic in its own function per review
feedback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
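
As a rough illustration of the split described in this commit, the sketch below keeps PII-bearing redactions without consulting the pixmap check and runs the check only on the rest. It is not the function added in xray/pdf_utils.py: `SSN_RE`, `contains_pii`, `filter_redactions`, the dict shape with a "text" key, and the dashed SSN format are all assumptions.

```python
import re
from typing import Callable

# Assumed SSN shape (NNN-NN-NNNN); the project's real pattern may differ.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def contains_pii(text: str) -> bool:
    """Return True if the text matches a known PII pattern (SSNs only here)."""
    return SSN_RE.search(text) is not None


def filter_redactions(
    redactions: list[dict],
    passes_pixmap_check: Callable[[dict], bool],
) -> list[dict]:
    """Keep PII redactions unconditionally; pixmap-check everything else."""
    kept = []
    for redaction in redactions:
        # `or` short-circuits, so the pixmap check never runs for PII matches.
        if contains_pii(redaction["text"]) or passes_pixmap_check(redaction):
            kept.append(redaction)
    return kept
```

Passing the pixmap check in as a callable keeps the sketch independent of any rendering library and makes the bypass easy to test with a stub.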
Comment thread xray/pdf_utils.py Outdated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mlissner mlissner marked this pull request as ready for review April 18, 2026 16:14
SSN detection bypasses the pixmap filter and is distinct from
normal TEXT_UNDER_RECTANGLE. Give it its own type so consumers
know a PII pattern was matched.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
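
A consumer might use the separate type along these lines. Only TEXT_UNDER_RECTANGLE is named in the commit message; the `FindingType` enum, the `SSN_UNDER_RECTANGLE` member name, and the finding dict shape are hypothetical stand-ins, since the actual name chosen in this PR is not shown here.

```python
from enum import Enum


class FindingType(str, Enum):
    TEXT_UNDER_RECTANGLE = "TEXT_UNDER_RECTANGLE"
    # Hypothetical name for the new PII-specific type.
    SSN_UNDER_RECTANGLE = "SSN_UNDER_RECTANGLE"


def matched_pii_pattern(finding: dict) -> bool:
    """True when the finding was flagged because a PII pattern matched."""
    return finding["type"] == FindingType.SSN_UNDER_RECTANGLE
```

Because the enum mixes in `str`, the comparison also works when the type arrives as a plain string, for example after a JSON round trip.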