fix(detect): Fix cross-hatch false positives and filter control chars #216

Open
mlissner wants to merge 3 commits into perf-optimize-image-detection-20260412 from fix-crosshatch-and-control-chars-20260412

@mlissner mlissner commented Apr 12, 2026

Summary

Fixes for false positives from garbled font encoding in this PDF, plus a new developer tool:

  1. Cross-hatch minimum size: add a 4pt minimum width/height check to _is_x_hatch_drawing. Thin vertical margin lines (0.75pt wide) were passing the line-pair check and matching text on the page as "cross-hatched redactions."

  2. Control character filter: filter out text containing non-whitespace control characters (\x11, \x13, etc.) in filter_redactions_by_text. Real text never contains these; they indicate garbled font encoding that happened to avoid the U+FFFD replacement character.

  3. tools/trim-pdf.py: new tool that extracts specific pages from large PDFs with compression, for creating test assets that stay under the 5MB pre-commit limit. Supports comma-separated pages and ranges (e.g., --pages 0,5,10-15).
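
The minimum-size check from item 1 can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `Rect` type and the `meets_min_hatch_size` helper are hypothetical names, and treating the 4pt threshold as inclusive is an assumption.

```python
from dataclasses import dataclass

# Threshold from the PR description; whether the real check is inclusive is an assumption.
MIN_HATCH_SIZE_PT = 4.0


@dataclass(frozen=True)
class Rect:
    """Hypothetical bounding box in PDF points (not the project's real type)."""

    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return abs(self.x1 - self.x0)

    @property
    def height(self) -> float:
        return abs(self.y1 - self.y0)


def meets_min_hatch_size(rect: Rect, min_size: float = MIN_HATCH_SIZE_PT) -> bool:
    """Return True only if the box is at least min_size points in both dimensions."""
    return rect.width >= min_size and rect.height >= min_size


# A thin vertical margin line: 0.75pt wide, very tall.
margin_line = Rect(72.0, 72.0, 72.75, 720.0)
# A plausible cross-hatched box: 80pt wide, 12pt tall.
hatch_box = Rect(100.0, 200.0, 180.0, 212.0)

print(meets_min_hatch_size(margin_line))  # False
print(meets_min_hatch_size(hatch_box))    # True
```

Requiring both dimensions to clear the threshold is what rejects the margin line: it is tall enough but far too narrow to be a redaction box.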

Test plan

  • test_thin_margin_lines_not_cross_hatches verifies both fixes with a real-world PDF
  • Cross-hatched redactions still detected on bad_cross_hatched_redactions.pdf (16 redactions)
  • All 40 tests pass
  • mypy and pre-commit pass
  • Merge after #215 (perf: 12x speedup on image-heavy PDFs) lands

🤖 Generated with Claude Code

mlissner and others added 2 commits April 12, 2026 10:27
Two fixes for garbled font encoding false positives:

1. Add minimum 4pt width/height to _is_x_hatch_drawing. Thin
   vertical margin lines (0.75pt wide) were passing the line-pair
   check and being treated as cross-hatch redactions.

2. Filter text containing non-whitespace control characters
   (< 0x20, excluding tab/newline/return). These indicate garbled
   font encoding that happened to avoid U+FFFD.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
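
The control-character condition described in this commit (code points below 0x20, excluding tab, newline, and carriage return) could look something like the sketch below. The helper name `has_garbled_control_chars` is hypothetical; how filter_redactions_by_text actually applies the check is not shown in this PR page.

```python
# Control characters that legitimately appear in extracted text.
ALLOWED_CONTROL_CHARS = frozenset("\t\n\r")


def has_garbled_control_chars(text: str) -> bool:
    """Return True if text contains a C0 control character other than tab/newline/return."""
    return any(ord(ch) < 0x20 and ch not in ALLOWED_CONTROL_CHARS for ch in text)


print(has_garbled_control_chars("Plain text\twith a tab\n"))  # False
print(has_garbled_control_chars("Se\x11cret\x13"))            # True
```

Under this sketch, a candidate redaction whose underlying text returns True would be dropped as a likely font-encoding artifact.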
Verifies that thin vertical margin lines (~0.75pt) aren't mistaken
for cross-hatched redactions, and that garbled text with control
characters is filtered.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@mlissner mlissner marked this pull request as ready for review April 12, 2026 17:30

Extracts specific pages with compression. Supports comma-separated
page numbers and ranges (e.g., --pages 0,5,10-15). Useful when
source PDFs exceed the 5MB pre-commit file size limit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
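
A page specification such as `0,5,10-15` could be parsed along these lines. This is a minimal sketch, not the code in tools/trim-pdf.py: the `parse_pages` name is hypothetical, and it assumes ranges are inclusive at both ends and that page numbers are used exactly as typed.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec like "0,5,10-15" into a list of page numbers.

    Assumes ranges are inclusive at both ends (an assumption, not confirmed by the PR).
    """
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


print(parse_pages("0,5,10-15"))  # [0, 5, 10, 11, 12, 13, 14, 15]
```

The resulting list preserves the order given on the command line, so a caller can copy pages into the output PDF in that order.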