fix(detect): Filter map hatching and Word internal bookmarks #220
Open
mlissner wants to merge 3 commits into
Three fixes:
1. Cross-hatch color check: only dark-colored line patterns qualify as redaction cross-hatching. Map/chart hatching uses colored lines.
2. Cross-hatch line density cap: drawings with >100 lines are dense fill patterns, not redaction X-hatching (which has ~1 X per 17pt).
3. Skip TOC entries starting with an underscore: these are Word internal bookmarks (_Hlk, _Ref, _Toc, _GoBack, etc.), not real headings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
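The underscore rule in the third fix can be sketched as below. This is a minimal illustration, not the code from this PR: the function names are hypothetical, and the `(level, title, page)` tuple shape is an assumption about how the TOC entries are represented.

```python
from typing import List, Tuple

# Assumed entry shape: (level, title, page). Hypothetical, for illustration only.
TocEntry = Tuple[int, str, int]


def is_word_internal_bookmark(title: str) -> bool:
    """Return True when a TOC title starts with an underscore.

    Per the commit message, Word writes internal bookmarks such as
    _Hlk..., _Ref..., _Toc..., and _GoBack with a leading underscore.
    """
    return title.startswith("_")


def filter_toc(entries: List[TocEntry]) -> List[TocEntry]:
    """Drop entries whose title looks like a Word internal bookmark."""
    return [entry for entry in entries if not is_word_internal_bookmark(entry[1])]


toc = [
    (1, "Introduction", 1),
    (1, "_Hlk110177622", 2),
    (2, "_GoBack", 3),
    (1, "Argument", 4),
]
real_headings = filter_toc(toc)
```

Here `real_headings` keeps only the "Introduction" and "Argument" entries. A prefix test is deliberately blunt: it would also drop a genuine heading that happens to begin with an underscore, which seems an acceptable trade for this kind of detector.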
Map hatching PDF extracted from a real court filing with geographic charts. Word bookmark PDF is synthetic with _Hlk and _References entries pointing at a dark rectangle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Character-sized decorative patterns (5x7pt) were passing the cross-hatch check. Real cross-hatched redactions are 17pt+ tall. Raising the minimum size to 6pt filters out the decorative patterns while staying conservative.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
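Taken together, the three cross-hatch checks from these commits (color, line density, minimum size) could look roughly like the sketch below. It is not the PR's implementation. The `Drawing` class, the function names, and the darkness threshold are all assumptions, as is applying the 6pt minimum to the smaller dimension; only the 100-line cap and the 6pt minimum come from the commit messages.

```python
from dataclasses import dataclass
from typing import Tuple

MAX_HATCH_LINES = 100    # from the commits: >100 lines is a dense fill pattern
MIN_HATCH_SIZE_PT = 6.0  # from the commits: minimum size raised to 6pt
DARK_THRESHOLD = 0.3     # assumption: the commits do not say how "dark" is measured


@dataclass
class Drawing:
    """Hypothetical stand-in for one vector drawing found on a page."""
    width: float                            # points
    height: float                           # points
    line_count: int                         # number of line segments
    stroke_rgb: Tuple[float, float, float]  # each channel in 0.0-1.0


def is_dark(rgb: Tuple[float, float, float]) -> bool:
    # Assumption: "dark" means every channel is at or below the threshold,
    # so black passes while salmon, blue, and mid-gray do not.
    return max(rgb) <= DARK_THRESHOLD


def could_be_redaction_hatch(d: Drawing) -> bool:
    """Apply the size, density, and color filters in turn."""
    # Assumption: the minimum applies to the smaller of the two dimensions.
    if min(d.width, d.height) < MIN_HATCH_SIZE_PT:
        return False  # character-sized decoration, e.g. 5x7pt
    if d.line_count > MAX_HATCH_LINES:
        return False  # dense fill, e.g. map hatching
    return is_dark(d.stroke_rgb)


redaction = Drawing(120.0, 17.0, 14, (0.0, 0.0, 0.0))
map_fill = Drawing(34.0, 13.0, 776, (0.0, 0.0, 0.0))
salmon_hatch = Drawing(34.0, 13.0, 20, (0.98, 0.5, 0.45))
decoration = Drawing(5.0, 7.0, 4, (0.0, 0.0, 0.0))
```

With these made-up inputs, only `redaction` passes. The cheap geometric checks run first so that the color test is only reached for drawings that are already plausible in size and density.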
Summary
Fixes cross-hatch and TOC bookmark false positives from this 178-page filing (22 → 0 false detections) and this 555-page filing (1 → 0):
Cross-hatch color check: only dark-colored line patterns qualify as redaction cross-hatching. Map/chart hatching uses colored lines (salmon, blue, gray).
Cross-hatch line density cap: drawings with >100 lines are dense fill patterns (maps pack 776 lines into a 34x13pt area), not redaction X-hatching.
Cross-hatch minimum size: raised from 4pt to 6pt. Character-sized decorative patterns (5x7pt) were passing; real cross-hatching is 17pt+ tall.
Underscore bookmark filter: TOC entries starting with _ are Word internal bookmarks (_Hlk110177622, _References, _GoBack, etc.), not real headings.
Test plan
test_map_hatching_not_cross_hatches with real-world map page
test_small_crosshatch_decorations_no_results with real-world decorative patterns
test_word_internal_bookmarks_no_results with synthetic PDF
🤖 Generated with Claude Code