Usage
- Visit https://kiku-jw.github.io/DocStripper/
- Click "Upload Your Documents" or drag & drop files
- Choose cleaning mode:
  - Fast Clean: Instant rule-based cleaning
  - Smart Clean: AI-powered cleaning (requires WebGPU)
- Choose cleaning mode type:
  - Conservative (default): Safe defaults, preserves lists and tables
  - Aggressive: More aggressive cleaning with merge and whitespace normalization
- Configure cleaning options (optional - hidden in "Advanced Options" menu)
- Click "Start Cleaning"
- Review side-by-side preview (Original | Cleaned)
- Download or copy the cleaned results
Note: All individual cleaning options are hidden in the "Advanced Options" collapsible menu by default. Only the Fast/Smart Clean mode selector and Conservative/Aggressive mode selector are visible for a cleaner interface.
Fast Clean
- Speed: Instant
- Method: Rule-based pattern matching
- Best for: Standard documents with predictable patterns
Smart Clean
- Speed: Slower (depends on document size)
- Method: AI-powered with on-device LLM
- Requirements: WebGPU support, ~100-200 MB one-time download
- Best for: Complex documents with unusual patterns
- Mode-Aware: Conservative/Aggressive modes affect LLM prompts
  - Conservative: Cautious prompts that preserve structure
  - Aggressive: Thorough prompts that allow merging and normalization
- Post-Processing: After LLM processing, applies:
  - Dehyphenation (if enabled)
  - Merge broken lines (if enabled in Aggressive mode)
  - Whitespace normalization (if enabled in Aggressive mode)
Conservative mode provides safe defaults for most users:
- ✅ Removes noise (headers, footers, page numbers, duplicates)
- ✅ Dehyphenates broken words (safe: only lowercase continuations)
- ✅ Removes repeating headers/footers across pages (≥70% threshold)
- ✅ Preserves lists and tables
- ✅ Never merges lines or normalizes whitespace
Best for: Most documents, especially those with lists or tables
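The ≥70% rule for repeating headers/footers can be illustrated with a short sketch. It assumes the document is already split into pages and that only the first and last non-empty line of each page are header/footer candidates. The function name `find_repeating_lines` and both assumptions are hypothetical, chosen for illustration rather than taken from DocStripper's code.

```python
from collections import Counter


def find_repeating_lines(pages, threshold=0.7):
    """Return edge lines that occur on at least `threshold` of the pages.

    `pages` is a list of pages, each a list of text lines. Treating only the
    first and last non-empty line as candidates is an assumption of this sketch.
    """
    if not pages:
        return set()
    counts = Counter()
    for page in pages:
        lines = [line.strip() for line in page if line.strip()]
        if not lines:
            continue
        # A set avoids double counting when a page has a single line.
        counts.update({lines[0], lines[-1]})
    return {line for line, n in counts.items() if n / len(pages) >= threshold}


pages = [
    ["ACME Report", "Intro text", "Confidential"],
    ["ACME Report", "More text", "Confidential"],
    ["ACME Report", "Final text", "Internal"],
]
# "ACME Report" is on 3 of 3 pages; "Confidential" is on 2 of 3 (about 67%).
print(find_repeating_lines(pages))  # {'ACME Report'}
```

With three pages, a line seen on two of them falls just under the threshold, so only the line present on every page is flagged.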
Aggressive mode cleans more thoroughly:
- ✅ All Conservative features enabled
- ✅ Merges broken lines (with list/table protection)
- ✅ Normalizes whitespace (with table protection)
⚠️ Use with caution: may affect formatting in some documents
Best for: Simple text documents without complex formatting
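As a rough illustration of merging broken lines while protecting lists and tables, the sketch below joins a line to the previous one only when the previous line does not end a sentence and the next line starts with a lowercase letter. The list-marker pattern, the table heuristic (a `|` or tab in the line), and the function names are all assumptions made for this example, not DocStripper's actual rules.

```python
import re

# Assumed list-marker pattern: bullets, or "1." / "1)" style numbering.
LIST_ITEM = re.compile(r"^\s*([-*•·]|\d+[.)])\s+")
SENTENCE_END = (".", "!", "?", ":")


def looks_like_table_row(line):
    # Assumed heuristic: pipes or tabs suggest a table row.
    return "|" in line or "\t" in line


def merge_broken_lines(lines):
    merged = []
    for line in lines:
        prev = merged[-1] if merged else ""
        can_merge = bool(
            prev.strip()
            and line.strip()
            and not prev.rstrip().endswith(SENTENCE_END)
            and not LIST_ITEM.match(line)
            and not LIST_ITEM.match(prev)
            and not looks_like_table_row(line)
            and not looks_like_table_row(prev)
            and line.lstrip()[0].islower()
        )
        if can_merge:
            merged[-1] = prev.rstrip() + " " + line.strip()
        else:
            merged.append(line)
    return merged


sample = [
    "The quick brown fox",
    "jumps over the dog.",
    "- item one",
    "- item two",
    "Next paragraph",
]
print(merge_broken_lines(sample))
```

In this example only the first two lines are joined; the list items and the line after them are left alone.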
- Remove Empty Lines: Removes blank and whitespace-only lines
- Remove Page Numbers: Removes lines with only digits (1, 2, 3...), Roman numerals (I, II, III), or letters (A, B, C)
- Remove Headers/Footers: Removes common patterns (Page X of Y, Confidential, etc.)
- Remove Repeating Headers/Footers: Removes headers/footers that appear on ≥70% of pages (detected automatically)
- Remove Duplicates: Collapses consecutive identical lines
- Remove Punctuation Lines: Removes lines with only symbols (---, ***, ===) or single bullets (•, *, ·)
- Preserve Paragraph Spacing: Keeps one empty line between paragraphs
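Several of the rules above are simple line filters. The following sketch shows one way to express the page-number, punctuation-line, and duplicate rules. The regular expressions and the name `clean_lines` are illustrative assumptions, not the tool's actual patterns. Note that a Roman-numeral pattern this loose would also drop a legitimate line consisting only of a word such as "MIX".

```python
import re
from itertools import groupby

# Digits, Roman numerals, or a single letter alone on a line (assumed pattern).
PAGE_NUMBER = re.compile(r"^\s*(\d+|[IVXLCDM]+|[A-Za-z])\s*$")
# Lines made only of symbols such as ---, ***, === or a lone bullet (assumed pattern).
PUNCT_ONLY = re.compile(r"^\s*[-*=_•·]+\s*$")


def clean_lines(lines):
    kept = [
        line
        for line in lines
        if not PAGE_NUMBER.match(line) and not PUNCT_ONLY.match(line)
    ]
    # groupby groups consecutive equal items, so keeping one key per group
    # collapses runs of identical lines.
    return [key for key, _ in groupby(kept)]


lines = ["Title", "Title", "1", "---", "Body text", "II", "•", "End"]
print(clean_lines(lines))  # ['Title', 'Body text', 'End']
```

Because the filters run before duplicate collapsing, two identical lines separated only by a page number would end up adjacent and be collapsed as well. Whether that ordering is desirable depends on the document.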
All cleaning options are available in the "Advanced Options" menu:
- Remove Empty Lines: Removes blank and whitespace-only lines
- Remove Page Numbers: Removes lines with only digits, Roman numerals, or letters
- Remove Headers/Footers: Removes common patterns (includes repeating headers/footers)
- Remove Duplicates: Collapses consecutive identical lines
- Remove Punctuation Lines: Removes lines with only symbols or single bullets
- Preserve Paragraph Spacing: Keeps one empty line between paragraphs
- Dehyphenate: Safe dehyphenation - joins "auto-\nmatic" → "automatic" (only lowercase continuations)
- Merge Broken Lines: Merges lines broken mid-sentence (protects lists and tables) - Enabled in Aggressive mode
- Normalize Whitespace: Collapses multiple spaces, normalizes tabs (protects tables) - Enabled in Aggressive mode
- Keep Table Spacing: Preserves spacing in detected table blocks when normalizing whitespace
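The "only lowercase continuations" rule for dehyphenation can be sketched with a single regular expression. This is a minimal illustration under that one stated rule. The pattern and the function name `dehyphenate` are assumptions, and the sketch would also join genuinely hyphenated compounds such as "well-\nknown".

```python
import re

# A word character followed by "-" at a line break, then a lowercase letter.
HYPHEN_BREAK = re.compile(r"(\w)-\n([a-z])")


def dehyphenate(text):
    # Drop the hyphen and the newline, keeping the letters on both sides.
    return HYPHEN_BREAK.sub(r"\1\2", text)


print(dehyphenate("auto-\nmatic"))         # automatic
print(repr(dehyphenate("Smith-\nJones")))  # 'Smith-\nJones' (uppercase continuation is left alone)
```

Restricting the join to lowercase continuations avoids fusing proper names and headings that happen to follow a trailing hyphen.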
- Side-by-Side Preview: Compare Original | Cleaned text side-by-side
- Virtualization: Large files (>1MB) use virtualization for smooth scrolling
- Statistics: Detailed stats showing what was removed (lines, duplicates, headers, etc.)
- Download: Download cleaned file with _cleaned.txt suffix
- Copy: Copy cleaned text to clipboard with one click
Your cleaning preferences are automatically saved in your browser's localStorage:
- Cleaning mode (Fast/Smart)
- Cleaning mode type (Conservative/Aggressive)
- All checkbox settings
Settings persist across page reloads and browser sessions.
# Clean a single file
python tool.py document.txt
# Clean multiple files
python tool.py file1.txt file2.txt file3.docx
# Preview changes without modifying files
python tool.py --dry-run document.txt
# Undo last operation
python tool.py --undo
Supported file types:
- .txt - Plain text files
- .docx - Microsoft Word documents
- .pdf - PDF files
  - Web: Automatic (uses PDF.js library - no installation needed)
  - CLI: Requires pdftotext from poppler-utils
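For readers curious how a Python script can rely on pdftotext, here is a minimal sketch that checks for the binary and captures its output. How DocStripper actually invokes pdftotext is not stated here, so the helper names `pdftotext_command` and `extract_pdf_text` are hypothetical. The only pdftotext behavior relied on is that `-` as the output file writes the text to stdout.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path):
    # "-" as the output file tells pdftotext to write to stdout.
    return ["pdftotext", str(pdf_path), "-"]


def extract_pdf_text(pdf_path):
    # Fail early with a clear message if poppler-utils is not installed.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils")
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Checking with `shutil.which` first gives a clearer error than letting `subprocess.run` fail with `FileNotFoundError`.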
python tool.py [OPTIONS] [FILES...]
Options:
  -h, --help    Show help message
  --dry-run     Preview changes without modifying files
  --undo        Restore files from last operation
Note: The CLI version uses Conservative mode by default. Advanced features (merge lines, whitespace normalization) are available programmatically but disabled by default for safety.
python tool.py report.txt
python tool.py document1.txt document2.docx document3.pdf
python tool.py --dry-run important_document.txt
python tool.py --undo
- Original files are backed up with .bak extension
- Processed files replace originals
- Statistics are shown in console:
  - Lines removed
  - Duplicates collapsed
  - Empty lines removed
  - Headers/footers removed
  - Punctuation lines removed
  - Dehyphenated tokens
  - Repeating headers/footers removed
  - Merged lines (if enabled)
- Operation log saved to .strip-log
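The backup-then-replace behavior, and the restore that an undo implies, can be sketched as follows. This is an illustration only. It assumes the backup name is the original file name with .bak appended (for example report.txt.bak), which is not confirmed here, and the function names are hypothetical. It does not model the .strip-log file.

```python
from pathlib import Path


def backup_path(path):
    # Assumed naming: "report.txt" -> "report.txt.bak".
    path = Path(path)
    return path.with_name(path.name + ".bak")


def write_with_backup(path, cleaned_text):
    path = Path(path)
    # Copy the original bytes aside before overwriting the file.
    backup_path(path).write_bytes(path.read_bytes())
    path.write_text(cleaned_text, encoding="utf-8")


def restore_from_backup(path):
    path = Path(path)
    # Path.replace overwrites the destination and removes the backup.
    backup_path(path).replace(path)
```

A typical sequence would be `write_with_backup("report.txt", cleaned)` followed, if the result is unsatisfactory, by `restore_from_backup("report.txt")`.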
- Always test on copies first - Use --dry-run or test on copies
- Back up important files - The tool creates backups, but extra backups never hurt
- Review statistics - Check what was removed before finalizing
- Use appropriate mode - Fast Clean for simple documents, Smart Clean for complex ones
- Start with Conservative mode - It's safer and preserves formatting
- Use side-by-side preview - Always review changes before downloading
- Check lists and tables - Ensure they're preserved correctly, especially in Aggressive mode
See Troubleshooting Guide for common issues and solutions.
See SELF_TESTS.md for manual test steps and expected results.
Test fixtures are available in the examples/ directory:
- fixture1_headers_footers.txt - Headers/footers + page numbers
- fixture2_hyphenation.txt - Hyphenation + mid-sentence wraps
- fixture3_lists_tables.txt - Lists & pseudo-tables