FAQ
DocStripper is an AI-powered batch document cleaner that automatically removes noise from text documents, including page numbers, headers, footers, duplicate lines, and empty lines.
Yes! DocStripper is completely free and open-source (MIT License).
Absolutely! All processing happens entirely in your browser (web app) or on your computer (CLI). Your files never leave your device - no uploads, no server-side processing.
- .txt - Plain text files
- .docx - Microsoft Word documents
- .pdf - PDF files
  - Web: Automatic support (uses the PDF.js library)
  - CLI: Requires pdftotext from poppler-utils
No! The web application works entirely in your browser. Just visit the website and start using it.
- Chrome/Edge 113+ (recommended)
- Firefox 110+
- Safari 18+
- Fast Clean: Instant rule-based cleaning using pattern matching
- Smart Clean: AI-powered cleaning using an on-device LLM (requires WebGPU; slower but more intelligent)
  - Mode-aware: Conservative/Aggressive modes influence the LLM prompts
  - Post-processing: After LLM processing, applies dehyphenation, line merging, and whitespace normalization (if enabled)
  - Provides more intelligent cleaning for complex documents with unusual patterns
- Conservative Mode (default): Safe defaults that preserve lists and tables. Removes noise but never merges lines or normalizes whitespace. Recommended for most users.
- Aggressive Mode: Merges broken lines and normalizes whitespace while still protecting lists and tables. Use with caution on complex documents.
Dehyphenation safely joins words broken across lines. For example, "auto-\nmatic" becomes "automatic". The join is applied only when the continuation starts with a lowercase letter, which avoids false positives.
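The rule above can be sketched in a few lines of Python. This is a minimal illustration, not DocStripper's actual implementation: the function name `dehyphenate` and the exact regular expression are assumptions.

```python
import re

# Assumed pattern: a word character, a hyphen, a newline, then a lowercase letter.
# DocStripper's real rule may differ; this only illustrates the lowercase check.
_HYPHEN_BREAK = re.compile(r"(\w)-\n([a-z])")


def dehyphenate(text: str) -> str:
    """Join words split by a hyphen at a line break (hypothetical helper)."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)
```

With this sketch, `dehyphenate("auto-\nmatic")` yields `"automatic"`, while a break followed by an uppercase letter is left untouched.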
Headers or footers that appear on ≥70% of pages are automatically detected and removed. This helps clean documents where headers/footers aren't in the known pattern list but repeat across pages.
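One way to picture the 70% rule is the sketch below. It assumes the document has already been split into pages (each a list of lines) and that only the first and last non-empty line of each page are header/footer candidates. Both assumptions, and the function names, are illustrative rather than taken from DocStripper.

```python
from collections import Counter


def find_repeating_edges(pages: list[list[str]], threshold: float = 0.7) -> set[str]:
    """Return edge lines that occur on at least `threshold` of all pages."""
    if not pages:
        return set()
    counts: Counter = Counter()
    for page in pages:
        lines = [ln.strip() for ln in page if ln.strip()]
        if not lines:
            continue
        # Use a set so a one-line page counts its line only once.
        counts.update({lines[0], lines[-1]})
    return {line for line, n in counts.items() if n / len(pages) >= threshold}


def strip_repeating_edges(pages: list[list[str]], threshold: float = 0.7) -> list[list[str]]:
    """Drop detected header/footer lines from every page.

    Simplification: any line equal to a detected header/footer is removed,
    even if it appears in the page body.
    """
    repeated = find_repeating_edges(pages, threshold)
    return [[ln for ln in page if ln.strip() not in repeated] for page in pages]
```

In this sketch a line that appears at the top of three out of four pages (75%) is treated as a header, while page-specific lines are kept.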
In Aggressive mode, lines broken mid-sentence are merged back together. For example:
This is a sentence
broken across lines.
becomes:
This is a sentence broken across lines.
Lists and tables are protected from merging to preserve formatting.
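A rough sketch of line merging with list and table protection follows. The heuristics here are assumptions made for illustration: a line is merged into the previous one only when the previous line does not end with closing punctuation and the current line starts with a lowercase letter. DocStripper's actual criteria are not documented in this FAQ.

```python
import re

# Assumed markers for list items: "-", "*", "•", or "1." / "1)" style numbering.
_LIST_ITEM = re.compile(r"^\s*([-*•]|\d+[.)])\s+")


def is_protected(line: str) -> bool:
    """Treat list items and pipe- or tab-delimited rows as protected (assumption)."""
    return bool(_LIST_ITEM.match(line)) or "|" in line or "\t" in line


def merge_broken_lines(lines: list[str]) -> list[str]:
    """Merge lines that look like mid-sentence breaks (hypothetical helper)."""
    merged: list[str] = []
    for line in lines:
        prev = merged[-1] if merged else ""
        can_merge = bool(
            prev.strip()
            and line.strip()
            and not is_protected(prev)
            and not is_protected(line)
            and not prev.rstrip().endswith((".", "!", "?", ":", ";"))
            and line.lstrip()[0].islower()
        )
        if can_merge:
            merged[-1] = prev.rstrip() + " " + line.lstrip()
        else:
            merged.append(line)
    return merged
```

Applied to the two-line example above, this sketch returns the single joined sentence and leaves list items as separate lines.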
In Aggressive mode, whitespace normalization:
- Collapses multiple spaces into single spaces
- Normalizes tabs to spaces
- Trims trailing spaces
Table blocks are detected and protected to preserve table formatting.
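The three normalization steps, plus table protection, might look like the sketch below. The table check (a line containing a pipe character) is a deliberately simple assumption, and the function names are hypothetical.

```python
import re


def normalize_whitespace(line: str) -> str:
    """Apply the three steps: tabs to spaces, collapse runs of spaces, trim the end."""
    line = line.replace("\t", " ")
    line = re.sub(r" {2,}", " ", line)
    return line.rstrip()


def normalize_lines(lines: list[str]) -> list[str]:
    """Normalize every line except those that look like table rows.

    Assumption: a line containing "|" is part of a table and is left as-is.
    """
    return [ln if "|" in ln else normalize_whitespace(ln) for ln in lines]
```

Note that this sketch also collapses leading indentation to a single space, which may or may not match the real tool's behavior.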
Yes! The web app shows a side-by-side preview: Original | Cleaned. You can scroll through both versions to see exactly what was removed or changed.
Smart Clean downloads a small AI model (~100-200 MB) to run locally in your browser. This download happens only once and is cached for future use.
Yes! After the initial download, Smart Clean works completely offline.
If your browser does not support WebGPU, the tool automatically falls back to Fast Clean mode.
No! The CLI tool uses only the Python standard library, so no external dependencies are required (except the optional poppler-utils for PDF support).
- macOS: brew install poppler
- Ubuntu/Debian: sudo apt-get install poppler-utils
- Windows: Download from poppler-windows releases
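To confirm the installation from Python, a standard-library-only check like the following can help. It is a sketch of how a script could call pdftotext, not a description of how the DocStripper CLI does it. The function names are hypothetical, while `-layout` and the `-` output argument (write to stdout) are standard pdftotext options.

```python
import shutil
import subprocess
from typing import Optional


def pdftotext_available() -> bool:
    """Return True if the pdftotext executable is found on PATH."""
    return shutil.which("pdftotext") is not None


def extract_pdf_text(pdf_path: str) -> Optional[str]:
    """Extract text with pdftotext, or return None if it is missing or fails."""
    if not pdftotext_available():
        return None
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout if result.returncode == 0 else None
```

If `pdftotext_available()` returns False after installing, the executable is most likely not on your PATH.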
Yes! Use python tool.py --undo to restore files from the last operation.
Yes! Original files are automatically backed up with a .bak extension before processing.
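The backup-and-restore idea can be sketched as follows. The naming scheme (appending `.bak` to the full file name, for example `report.txt.bak`) and the function names are assumptions; the CLI's own bookkeeping for `--undo` may work differently.

```python
import shutil
from pathlib import Path


def backup_path(path: Path) -> Path:
    """Assumed naming: append ".bak" to the original file name."""
    return path.with_name(path.name + ".bak")


def backup_file(path: Path) -> Path:
    """Copy the original next to itself before it is modified."""
    target = backup_path(path)
    shutil.copy2(path, target)
    return target


def restore_backup(path: Path) -> bool:
    """Restore the original from its backup; return False if no backup exists."""
    source = backup_path(path)
    if not source.exists():
        return False
    shutil.copy2(source, path)
    return True
```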
Fast Clean uses pattern matching to identify and remove:
- Page numbers (lines with only digits)
- Headers/footers (common patterns)
- Duplicate consecutive lines
- Empty lines
- Punctuation-only lines
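A simplified version of these rules is sketched below. The regular expressions and the function name are assumptions, and the header/footer patterns are left out because this FAQ does not list them.

```python
import re

_PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")      # line containing only digits
_PUNCT_ONLY = re.compile(r"^\s*[^\w\s]+\s*$")  # line containing only punctuation


def fast_clean(lines: list[str]) -> list[str]:
    """Rule-based cleaning sketch covering four of the five rules above."""
    cleaned: list[str] = []
    for line in lines:
        if not line.strip():
            continue  # empty line
        if _PAGE_NUMBER.match(line):
            continue  # page number
        if _PUNCT_ONLY.match(line):
            continue  # punctuation-only line
        if cleaned and line == cleaned[-1]:
            continue  # duplicate of the previously kept line
        cleaned.append(line)
    return cleaned
```

Because every check is a cheap pattern match on a single line, this kind of pass runs in time linear in the document length, which is consistent with Fast Clean being described as instant.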
Smart Clean uses an on-device Large Language Model (LLM) to intelligently analyze each line and decide whether to keep, remove, or modify it based on context.
Two-stage process:
- LLM Processing: The AI model analyzes the text based on:
  - Your selected cleaning options (removeEmptyLines, removePageNumbers, etc.)
  - The Conservative/Aggressive mode (affects prompt instructions)
    - Conservative mode: Cautious prompts that preserve structure
    - Aggressive mode: Thorough prompts that allow merging and normalization
- Post-Processing: After LLM processing, applies:
  - Dehyphenation (if enabled)
  - Merge broken lines (if enabled in Aggressive mode)
  - Whitespace normalization (if enabled in Aggressive mode)
This ensures consistent behavior between Fast Clean and Smart Clean modes.
Yes! Both modes support customizable cleaning options. In Smart Clean mode, content covered by an unchecked option is preserved in the output.
Note: Individual cleaning options are hidden in the "Advanced Options" collapsible menu by default. You can expand it to configure specific options, but the default settings work well for most documents.
There is no hard file size limit, but very large files (over 10 MB) may take longer to process, especially in Smart Clean mode.
- Check that your browser meets the minimum supported version (Chrome/Edge 113+, Firefox 110+, or Safari 18+)
- Ensure JavaScript is enabled
- Try disabling browser extensions
- Clear browser cache
- Slow processing is normal for large documents
- Consider using Fast Clean for very large files
- Ensure WebGPU is enabled in your browser
- Check the file path is correct
- Ensure you're in the right directory
- Use absolute paths if relative paths don't work
- Ensure poppler-utils is installed
- Check that pdftotext is in your PATH
- Verify the PDF file is not corrupted
See our Contributing Guide for details. We welcome:
- Bug reports
- Feature requests
- Code contributions
- Documentation improvements
Use the GitHub Issues page and select the "Bug Report" template.
Absolutely! Use the Feature Request template.
- Check GitHub Discussions
- Open an Issue
- Visit the Product Hunt page