Skip to content

fix(detect): Filter map hatching and Word internal bookmarks#220

Open
mlissner wants to merge 3 commits into
fix-low-occlusion-threshold-20260418from
fix-map-hatching-and-word-bookmarks-20260418
Open

fix(detect): Filter map hatching and Word internal bookmarks#220
mlissner wants to merge 3 commits into
fix-low-occlusion-threshold-20260418from
fix-map-hatching-and-word-bookmarks-20260418

Conversation

@mlissner
Copy link
Copy Markdown
Member

@mlissner mlissner commented Apr 18, 2026

Summary

Fixes cross-hatch and TOC bookmark false positives from this 178-page filing (22 → 0 false detections) and this 555-page filing (1 → 0):

  1. Cross-hatch color check — only dark-colored line patterns qualify as redaction cross-hatching. Map/chart hatching uses colored lines (salmon, blue, gray).

  2. Cross-hatch line density cap — drawings with >100 lines are dense fill patterns (maps pack 776 lines into a 34x13pt area), not redaction X-hatching.

  3. Cross-hatch minimum size — raised from 4pt to 6pt. Character-sized decorative patterns (5x7pt) were passing; real cross-hatching is 17pt+ tall.

  4. Underscore bookmark filter — TOC entries starting with _ are Word internal bookmarks (_Hlk110177622, _References, _GoBack, etc.), not real headings.

Test plan

  • test_map_hatching_not_cross_hatches with real-world map page
  • test_small_crosshatch_decorations_no_results with real-world decorative patterns
  • test_word_internal_bookmarks_no_results with synthetic PDF
  • Real cross-hatch test still detects 16 redactions
  • All 46 tests pass
  • mypy and pre-commit pass
  • Merge after fix(detect): Lower occlusion threshold from 80% to 75% #219 lands

🤖 Generated with Claude Code

mlissner and others added 2 commits April 18, 2026 12:38
Three fixes:

1. Cross-hatch color check — only dark-colored line patterns qualify
   as redaction cross-hatching. Map/chart hatching uses colored lines.

2. Cross-hatch line density cap — drawings with >100 lines are dense
   fill patterns, not redaction X-hatching (which has ~1 X per 17pt).

3. Skip TOC entries starting with underscore — these are Word
   internal bookmarks (_Hlk, _Ref, _Toc, _GoBack, etc.), not real
   headings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Map hatching PDF extracted from a real court filing with geographic
charts. Word bookmark PDF is synthetic with _Hlk and _References
entries pointing at a dark rectangle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mlissner mlissner marked this pull request as ready for review April 18, 2026 19:40
Character-sized decorative patterns (5x7pt) were passing the
cross-hatch check. Real cross-hatched redactions are 17pt+ tall.
Raising the minimum to 6pt filters these while staying conservative.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mlissner mlissner mentioned this pull request Apr 21, 2026
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant