feat(detect): Detect dark Highlight annotations as bad redactions #208

Open
mlissner wants to merge 2 commits into filter-cmecf-header-stamps-20260407 from 185-detect-dark-highlight-annotations-20260407

@mlissner commented Apr 7, 2026

Summary

  • Detect dark Highlight annotations (type 8) used as makeshift redactions
  • Some documents use black highlight annotations to obscure text instead of proper redaction tools — the text remains fully readable underneath
  • Uses the _is_dark_color luminance check on the annotation's stroke color so that only dark highlights are flagged; legitimate bright-colored highlights (yellow, green, pink) are left alone
  • Tested against three real-world PDFs from a Lithuanian public procurement site
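
For readers unfamiliar with the approach, here is a minimal sketch of a luminance-based dark-highlight check. It is not the code in this PR: `luminance`, `is_dark_highlight`, and the `DARK_THRESHOLD` value are hypothetical names and numbers chosen for illustration, and the real `_is_dark_color` may weight channels or pick its cutoff differently. The sketch assumes stroke colors arrive as gray or RGB components in the 0..1 range.

```python
from typing import Optional, Sequence

HIGHLIGHT_TYPE = 8     # annotation type number cited in the summary
DARK_THRESHOLD = 0.3   # hypothetical cutoff, not taken from this PR


def luminance(color: Sequence[float]) -> float:
    """Approximate luminance of a gray (1 value) or RGB (3 values) color in 0..1."""
    if len(color) == 1:
        return color[0]
    if len(color) == 3:
        r, g, b = color
        # Rec. 709 channel weights; the real helper may use different ones.
        return 0.2126 * r + 0.7152 * g + 0.0722 * b
    raise ValueError(f"unsupported color: {color!r}")


def is_dark_highlight(annot_type: int, stroke: Optional[Sequence[float]]) -> bool:
    """Return True only for Highlight annotations whose stroke color is dark."""
    if annot_type != HIGHLIGHT_TYPE or not stroke:
        return False
    return luminance(stroke) < DARK_THRESHOLD


if __name__ == "__main__":
    print(is_dark_highlight(8, (0.0, 0.0, 0.0)))  # black highlight
    print(is_dark_highlight(8, (1.0, 1.0, 0.0)))  # yellow highlight
```

Under these assumptions a black or dark blue highlight is flagged, while yellow, green, and pink highlights fall well above the cutoff and are ignored. Annotations of any other type, or with no stroke color, are never flagged.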

Fixes #185

Test plan

  • New test test_dark_highlight_annotations verifies detection
  • Verified against the three PDFs: names, emails, and phone numbers were all detected
  • All 33 tests pass
  • mypy and pre-commit pass
  • Merge after feat(detect): Filter CM/ECF header stamps #207 lands

🤖 Generated with Claude Code

mlissner and others added 2 commits April 7, 2026 03:05
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Some documents use black or dark Highlight annotations to obscure
text instead of proper redaction tools. The text remains fully
readable and extractable underneath.

Uses _is_dark_color to only flag dark highlights — bright-colored
highlights (yellow, green, pink) are legitimate markup.

Fixes #185

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>