DocStripper automatically removes noise from text documents. Remove page numbers, headers/footers, duplicate lines, and empty lines from .txt, .docx, and .pdf files. Choose between Fast Clean (instant) or Smart Clean (AI-powered). Works entirely in your browser - 100% private, no uploads, no sign-ups.
🌐 Try it online at https://kikuai-lab.github.io/DocStripper/. No installation needed!
📦 Latest Release: v2.1.0 — UX enhancements & distribution ready
- ⚡ Fast Clean — Instant rule-based cleaning
- 🤖 Smart Clean (Beta) — AI-powered cleaning with on-device LLM
- 🎚️ 4 Cleaning Temperaments — Gentle (safe), Moderate, Thorough, Aggressive
- ⚙️ WebWorker Processing — Large files processed in background (no UI freezing)
- 🔄 Side-by-Side Preview — Compare Original | Cleaned
- 💾 Settings Persistence — Your preferences are saved automatically
- 🔒 100% Private — All processing happens in your browser, works completely offline
- 📡 Works Offline Badge — Visual indicator that everything stays on your device
- 📊 Real-time Statistics — See exactly what was removed
- 📥 Batch Download (ZIP) — Download multiple cleaned files at once
- 🎨 Dark Theme — Toggle between light and dark themes
- 📱 Mobile Responsive — Works great on mobile devices
- Visit https://kikuai-lab.github.io/DocStripper/
- Upload your files
- Choose Fast Clean (instant) or Smart Clean (AI-powered)
- Adjust Cleaning Temperament slider: Gentle (recommended), Moderate, Thorough, or Aggressive
- Click "Start Cleaning"
- Download or copy the cleaned results
Option 1: PyPI (Recommended)
pip install docstripper
docstripper document.txt
Option 2: Homebrew (macOS)
brew tap KikuAI-Lab/docstripper
brew install docstripper
docstripper document.txt
Option 3: Manual Installation
git clone https://github.com/KikuAI-Lab/DocStripper.git
cd DocStripper
python tool.py document.txt
See INSTALL.md for detailed installation instructions.
# Clean a file
python tool.py document.txt
# Clean multiple files
python tool.py file1.txt file2.txt file3.docx
# Preview changes (dry-run)
python tool.py --dry-run document.txt
# Undo last operation
python tool.py --undo
# Pipe stdin to stdout (no file writes)
cat input.pdf | python tool.py - --stdout > output.txt
# Keep headers/footers if needed
python tool.py --keep-headers input.pdf --stdout
Before:
Page 1 of 10
Confidential - Internal Use Only
Executive Summary
This is auto-
matic text processing.
Important content here.
Important content here.
1
2
3
After (Gentle Mode):
Executive Summary
This is automatic text processing.
Important content here.
Key Changes:
- ✅ Page numbers removed
- ✅ Headers/footers removed
- ✅ Repeating headers removed
- ✅ Duplicates collapsed
- ✅ Hyphenation fixed
- ✅ Empty lines removed
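The changes listed above can be approximated with a few regular expressions. The following is a minimal sketch, not DocStripper's actual implementation: the patterns and the function name `gentle_clean` are assumptions made for illustration, and for simplicity it drops every blank line and only collapses consecutive duplicates.

```python
import re

# Hypothetical patterns; the real tool's rules may differ.
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")
HEADER_FOOTER = re.compile(r"^\s*(page \d+ of \d+|confidential\b.*)\s*$", re.IGNORECASE)
PUNCT_ONLY = re.compile(r"^\s*[-*=_]{3,}\s*$")


def gentle_clean(text: str) -> str:
    # Rejoin words hyphenated across a line break: "auto-\nmatic" -> "automatic".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    kept: list[str] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # empty line
        if PAGE_NUMBER.match(line) or HEADER_FOOTER.match(line) or PUNCT_ONLY.match(line):
            continue  # page number, known header/footer, or punctuation-only line
        if kept and kept[-1] == line:
            continue  # consecutive duplicate
        kept.append(line)
    return "\n".join(kept)


sample = (
    "Page 1 of 10\nConfidential - Internal Use Only\nExecutive Summary\n"
    "This is auto-\nmatic text processing.\n"
    "Important content here.\nImportant content here.\n1\n2\n3\n"
)
print(gentle_clean(sample))
```

A bare-digit rule like `PAGE_NUMBER` can also remove legitimate numeric lines, which is one reason a real cleaner would likely need more context than this sketch uses.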
Gentle (Recommended - Default)
- ✅ Page numbers (1, 2, 3...)
- ✅ Headers/footers ("Page X of Y", "Confidential", etc.)
- ✅ Repeating headers/footers across pages
- ✅ Duplicate lines
- ✅ Empty lines
- ✅ Punctuation-only lines (---, ***, ===)
- ✅ Hyphenation fixed (auto-\nmatic → automatic)
- ✅ Preserves paragraph spacing
- ❌ Line merging disabled (preserves formatting)
- ❌ Whitespace normalization disabled
- ❌ Unicode normalization disabled
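Repeating headers and footers differ from fixed patterns such as "Page X of Y" because their text is not known in advance. One common approach is to count how many pages each line appears on. The sketch below assumes the text has already been split into pages (for example on form-feed characters) and uses a hypothetical `min_ratio` threshold; it is not taken from DocStripper's source.

```python
import math
from collections import Counter


def drop_repeating_lines(pages: list[str], min_ratio: float = 0.6) -> list[str]:
    """Remove lines that occur on at least min_ratio of the pages (assumed threshold)."""
    counts: Counter[str] = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    # Require at least two pages so single-page documents are left alone.
    threshold = max(2, math.ceil(min_ratio * len(pages)))
    repeated = {line for line, n in counts.items() if n >= threshold}
    return [
        "\n".join(line for line in page.splitlines() if line.strip() not in repeated)
        for page in pages
    ]


pages = ["ACME Report\nIntro text", "ACME Report\nMore text", "ACME Report\nFinal text"]
print(drop_repeating_lines(pages))
```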
Moderate
- All Gentle features plus:
- ✅ Merges broken lines (protects lists and tables)
- ✅ Preserves paragraph spacing
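Merging broken lines while protecting lists and tables can be sketched as follows. The heuristics here are assumptions for illustration only: a line is treated as a list item if it starts with a bullet or a number, and as a table row if it contains a pipe or a gap of two or more spaces.

```python
import re

# Hypothetical heuristics; the real tool may detect lists and tables differently.
LIST_ITEM = re.compile(r"^\s*([-*•]|\d+[.)])\s+")
TABLE_ROW = re.compile(r"\S\s{2,}\S|\|")
SENTENCE_END = re.compile(r"[.!?:;]$")


def merge_broken_lines(text: str) -> str:
    out: list[str] = []
    for line in text.splitlines():
        prev = out[-1] if out else ""
        can_merge = bool(
            prev.strip()
            and line.strip()
            and not SENTENCE_END.search(prev.rstrip())
            and not LIST_ITEM.match(prev)
            and not LIST_ITEM.match(line)
            and not TABLE_ROW.search(prev)
            and not TABLE_ROW.search(line)
        )
        if can_merge:
            # Previous line ended mid-sentence: join with a single space.
            out[-1] = prev.rstrip() + " " + line.lstrip()
        else:
            out.append(line)
    return "\n".join(out)


text = (
    "This sentence was broken\nacross two lines.\n\n"
    "- item one\n- item two\nName    Qty\nApple   3"
)
print(merge_broken_lines(text))
```

This sketch would also merge a short heading into the paragraph that follows it, so a fuller version would need a heading check as well.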
Thorough
- All Moderate features plus:
- ✅ Normalizes whitespace (protects tables)
- ✅ Normalizes Unicode punctuation (smart quotes, dashes → ASCII)
- ✅ Preserves paragraph spacing (better readability)
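The two Thorough-level normalizations can be illustrated with the standard library alone. The character mapping and the table heuristic (two or more column gaps) below are assumptions, not DocStripper's actual rules.

```python
import re

# Hypothetical mapping from typographic punctuation to ASCII.
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # ellipsis
    "\u00a0": " ",                  # non-breaking space
})


def normalize_unicode(text: str) -> str:
    return text.translate(PUNCT_MAP)


def normalize_whitespace(line: str) -> str:
    # Assume a line with two or more wide gaps is a table row and keep its spacing.
    if len(re.findall(r"\S\s{2,}(?=\S)", line)) >= 2:
        return line.rstrip()
    return re.sub(r"[ \t]+", " ", line).strip()


print(normalize_unicode("\u201cSmart\u201d \u2013 quotes\u2019"))
print(normalize_whitespace("Hello   world  "))
print(normalize_whitespace("Name    Qty   Price"))
```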
Aggressive
- All Thorough features plus:
- ✅ Normalizes Unicode punctuation
- ❌ Paragraph spacing not preserved (blank lines between paragraphs are removed for more compact output)
- `--no-merge-lines`: disable merging broken lines
- `--no-dehyphenate`: disable de-hyphenation across line breaks
- `--no-normalize-ws`: disable whitespace normalization
- `--no-normalize-unicode`: disable Unicode punctuation normalization
- `--keep-headers`: keep headers/footers/page numbers
- `--stdout`: write cleaned text to stdout instead of modifying files (supports `-` for stdin)
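To show how opt-out flags like these typically map onto cleaning options, here is a small `argparse` sketch. Only the flag names come from this README; the parser structure and the derived variable names are hypothetical and may not match how `tool.py` is written.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the flag names are taken from the README.
    parser = argparse.ArgumentParser(prog="docstripper")
    parser.add_argument("files", nargs="*", help="input files, or - for stdin")
    flags = {
        "--no-merge-lines": "disable merging broken lines",
        "--no-dehyphenate": "disable de-hyphenation across line breaks",
        "--no-normalize-ws": "disable whitespace normalization",
        "--no-normalize-unicode": "disable Unicode punctuation normalization",
        "--keep-headers": "keep headers/footers/page numbers",
        "--stdout": "write cleaned text to stdout",
    }
    for flag, help_text in flags.items():
        parser.add_argument(flag, action="store_true", help=help_text)
    return parser


args = build_parser().parse_args(["--no-merge-lines", "--stdout", "-"])
merge_lines = not args.no_merge_lines   # a "no" flag switches a feature off
read_stdin = args.files == ["-"]
print(merge_lines, args.stdout, read_stdin)
```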
Protection Features:
- ✅ Lists are never merged or broken
- ✅ Tables preserve spacing
- ✅ Content headers never removed
| Format | Status | Notes |
|---|---|---|
| .txt | ✅ Full | UTF-8, Latin-1 |
| .docx | ✅ Basic | Text extraction only (Web + CLI) |
| .pdf | ✅ Basic | Text extraction only (Web + CLI). Web uses PDF.js automatically. CLI requires pdftotext (poppler-utils) |
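Supporting both UTF-8 and Latin-1 for .txt files is usually done with a decode fallback. The function below is a sketch of that idea under the assumption that UTF-8 is tried first; the name `decode_text` is illustrative and not part of DocStripper.

```python
def decode_text(raw: bytes) -> str:
    """Decode as UTF-8, falling back to Latin-1 (assumed order)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1")


print(decode_text("café".encode("utf-8")))
print(decode_text("café".encode("latin-1")))
```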
PDF Support:
- macOS: brew install poppler
- Ubuntu/Debian: sudo apt-get install poppler-utils
- Windows: Download from poppler-windows releases
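Before processing PDFs from a script, it can help to check that `pdftotext` is on the PATH. The helper below is a sketch under that assumption: it calls `pdftotext` directly with its `-layout` option and `-` as the output target (stdout), and the function name `pdf_to_text` is hypothetical rather than part of DocStripper.

```python
import shutil
import subprocess


def pdf_to_text(path: str) -> str:
    """Extract text from a PDF with pdftotext; raise if the binary is missing."""
    exe = shutil.which("pdftotext")
    if exe is None:
        raise RuntimeError("pdftotext not found; install poppler first")
    # "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run([exe, "-layout", path, "-"], capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `pdf_to_text("input.pdf")` returns the extracted text, which can then be passed through the cleaning steps.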
- Modern web browser (Chrome, Firefox, Safari, Edge)
- No installation or dependencies required
- Works completely offline after first load
- Python 3.9+ (for CLI tool)
- PDF support (optional): pdftotext from poppler-utils
See GitHub Releases for release notes and changelog.
MIT License — see LICENSE.txt for details.
Contributions are welcome! See Contributing Guide for guidelines.
Made with ❤️ for clean documents
- 📰 Blog & Updates: t.me/kiku_AI
- 💬 Discord: discord.gg/4Kxs97JvsU
- 💼 LinkedIn: linkedin.com/in/kiku-jw
- 🌐 About.me: about.me/kiku_jw