fix(detect): Fix cross-hatch false positives and filter control chars by mlissner · Pull Request #216 · freelawproject/x-ray

mlissner · 2026-04-12T17:28:13Z

Summary

Fixes for false positives from garbled font encoding in this PDF, plus a new developer tool:

Cross-hatch minimum size: add a minimum 4pt width/height check to _is_x_hatch_drawing. Thin vertical margin lines (0.75pt wide) were passing the line-pair check and matching text on the page as "cross-hatched redactions."
Control character filter: filter text containing non-whitespace control characters (\x11, \x13, etc.) in filter_redactions_by_text. Real text never contains these; they indicate garbled font encoding that happened to avoid the U+FFFD replacement character.
tools/trim-pdf.py: new tool to extract specific pages from large PDFs with compression, for creating test assets that stay under the 5MB pre-commit limit. Supports comma-separated pages and ranges (e.g., --pages 0,5,10-15).
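
As a rough illustration of the minimum-size check in the first fix above, the sketch below rejects any bounding box narrower or shorter than 4pt. The `Box` type and the `is_large_enough_for_hatch` helper are hypothetical names for this example and are not taken from the x-ray source. It also assumes that both dimensions must meet the threshold.

```python
from typing import NamedTuple

# Threshold from the PR description; the constant name is hypothetical.
MIN_HATCH_SIZE_PT = 4.0


class Box(NamedTuple):
    """Hypothetical bounding box in PDF points."""

    x0: float
    y0: float
    x1: float
    y1: float


def is_large_enough_for_hatch(box: Box, min_size: float = MIN_HATCH_SIZE_PT) -> bool:
    """Return True only if the box is at least min_size wide AND tall.

    Assumption: both dimensions must pass, so a thin margin line
    (for example 0.75pt wide) is rejected however tall it is.
    """
    width = abs(box.x1 - box.x0)
    height = abs(box.y1 - box.y0)
    return width >= min_size and height >= min_size


# A 0.75pt-wide vertical margin line versus a plausible redaction box.
margin_line = Box(72.0, 50.0, 72.75, 700.0)
redaction_box = Box(100.0, 100.0, 160.0, 112.0)
```

Here `is_large_enough_for_hatch(margin_line)` is false and `is_large_enough_for_hatch(redaction_box)` is true, so only the latter would go on to the line-pair check.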

Test plan

test_thin_margin_lines_not_cross_hatches verifies both fixes with a real-world PDF
Cross-hatched redactions still detected on bad_cross_hatched_redactions.pdf (16 redactions)
All 40 tests pass
mypy and pre-commit pass
Merge after "perf: 12x speedup on image-heavy PDFs" (#215) lands

🤖 Generated with Claude Code

Two fixes for garbled font encoding false positives:

1. Add minimum 4pt width/height to _is_x_hatch_drawing. Thin vertical margin lines (0.75pt wide) were passing the line-pair check and being treated as cross-hatch redactions.
2. Filter text containing non-whitespace control characters (< 0x20, excluding tab/newline/return). These indicate garbled font encoding that happened to avoid U+FFFD.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
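
The control-character rule in this commit message (code points below 0x20, excluding tab, newline, and carriage return) could be expressed roughly as follows. This is a minimal sketch: `has_garbled_control_chars` is a hypothetical helper name, and the sketch does not show how filter_redactions_by_text actually applies the rule.

```python
# Whitespace control characters that legitimately appear in extracted text.
ALLOWED_CONTROL_CHARS = frozenset("\t\n\r")


def has_garbled_control_chars(text: str) -> bool:
    """Return True if text contains a control character below 0x20
    other than tab, newline, or carriage return."""
    return any(
        ord(ch) < 0x20 and ch not in ALLOWED_CONTROL_CHARS
        for ch in text
    )


# Example inputs: \x11 and \x13 are the characters named in the PR summary.
garbled = "Sm\x11th v. J\x13nes"
clean = "Smith v. Jones\n\tCase No. 1\r\n"
```

With these inputs, `has_garbled_control_chars(garbled)` is true and `has_garbled_control_chars(clean)` is false, so only the first string would be filtered out.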

Verifies that thin vertical margin lines (~0.75pt) aren't mistaken for cross-hatched redactions, and that garbled text with control characters is filtered.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Extracts specific pages with compression. Supports comma-separated page numbers and ranges (e.g., --pages 0,5,10-15). Useful when source PDFs exceed the 5MB pre-commit file size limit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
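
To make the --pages syntax concrete, here is one way a specification such as 0,5,10-15 might be expanded into page numbers. This is a sketch under stated assumptions, not the code in tools/trim-pdf.py: `parse_pages` is a hypothetical name, ranges are assumed to be inclusive, and page numbers are assumed to be zero-based because the example starts at 0.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec like "0,5,10-15" into a list of page numbers.

    Assumptions (not confirmed by the PR): ranges are inclusive on
    both ends, and pages keep the order in which they were given.
    """
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


selected = parse_pages("0,5,10-15")
```

Under these assumptions, `selected` holds pages 0, 5, and 10 through 15. The resulting list could then be passed to whichever PDF library performs the extraction and compression.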

mlissner and others added 2 commits April 12, 2026 10:27

mlissner marked this pull request as ready for review April 12, 2026 17:30

mlissner mentioned this pull request Apr 18, 2026

feat(detect): Detect SSNs hidden under white rectangles #217 (Open, 5 tasks)

mlissner wants to merge 3 commits into perf-optimize-image-detection-20260412 from fix-crosshatch-and-control-chars-20260412
