🧹 DocStripper

AI-powered batch document cleaner — Remove noise from text documents automatically

DocStripper removes page numbers, headers/footers, duplicate lines, and empty lines from .txt, .docx, and .pdf files. Choose between Fast Clean (instant, rule-based) and Smart Clean (AI-powered, beta). The web app runs entirely in your browser: 100% private, no uploads, no sign-ups.

🌐 Try it online: https://kikuai-lab.github.io/DocStripper/ (no installation needed)

📦 Latest Release: v2.1.0 — UX enhancements & distribution ready

✨ Features

⚡ Fast Clean — Instant rule-based cleaning
🤖 Smart Clean (Beta) — AI-powered cleaning with on-device LLM
🎚️ 4 Cleaning Temperaments — Gentle (safe), Moderate, Thorough, Aggressive
⚙️ WebWorker Processing — Large files processed in background (no UI freezing)
🔄 Side-by-Side Preview — Compare Original | Cleaned
💾 Settings Persistence — Your preferences are saved automatically
🔒 100% Private — All processing happens in your browser, works completely offline
📡 Works Offline Badge — Visual indicator that everything stays on your device
📊 Real-time Statistics — See exactly what was removed
📥 Batch Download (ZIP) — Download multiple cleaned files at once
🎨 Dark Theme — Toggle between light and dark themes
📱 Mobile Responsive — Works great on mobile devices

🎯 Quick Start

Web App (Recommended)

1. Visit https://kikuai-lab.github.io/DocStripper/
2. Upload your files
3. Choose Fast Clean (instant) or Smart Clean (AI-powered)
4. Adjust the Cleaning Temperament slider: Gentle (recommended), Moderate, Thorough, or Aggressive
5. Click "Start Cleaning"
6. Download or copy the cleaned results

CLI Tool

Installation Options

Option 1: PyPI (Recommended)

pip install docstripper
docstripper document.txt

Option 2: Homebrew (macOS)

brew tap KikuAI-Lab/docstripper
brew install docstripper
docstripper document.txt

Option 3: Manual Installation

git clone https://github.com/KikuAI-Lab/DocStripper.git
cd DocStripper
python tool.py document.txt

See INSTALL.md for detailed installation instructions.

Usage

# Clean a file
python tool.py document.txt

# Clean multiple files
python tool.py file1.txt file2.txt file3.docx

# Preview changes (dry-run)
python tool.py --dry-run document.txt

# Undo last operation
python tool.py --undo
 
# Pipe stdin to stdout (no file writes)
cat input.pdf | python tool.py - --stdout > output.txt

# Keep headers/footers if needed
python tool.py --keep-headers input.pdf --stdout
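If you drive the CLI from a Python script, the commands above can be assembled programmatically. The sketch below is only an illustration: `build_command` is a hypothetical helper, not part of DocStripper, and it uses only the flags shown in the usage examples (`--dry-run`, `--keep-headers`, `--stdout`). It assumes `tool.py` is in the current directory.

```python
import sys
from typing import List, Sequence


def build_command(
    paths: Sequence[str],
    dry_run: bool = False,
    keep_headers: bool = False,
    to_stdout: bool = False,
) -> List[str]:
    """Assemble an argv list for tool.py (hypothetical helper)."""
    cmd = [sys.executable, "tool.py"]
    if dry_run:
        cmd.append("--dry-run")
    if keep_headers:
        cmd.append("--keep-headers")
    if to_stdout:
        cmd.append("--stdout")
    cmd.extend(paths)  # input files go last, as in the examples above
    return cmd


if __name__ == "__main__":
    # Print the command rather than run it, so the sketch has no side effects.
    print(" ".join(build_command(["file1.txt", "file2.txt"], dry_run=True)))
```

To actually run the command, pass the list to `subprocess.run(cmd, check=True)`.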

📖 Example

Before:

Page 1 of 10
Confidential - Internal Use Only
Executive Summary
This is auto-
matic text processing.
Important content here.
Important content here.

More content.
1
2
3

After (Gentle Mode):

Executive Summary
This is automatic text processing.
Important content here.
More content.

Key Changes:

✅ Page numbers removed
✅ Headers/footers removed
✅ Repeating headers removed
✅ Duplicates collapsed
✅ Hyphenation fixed
✅ Empty lines removed
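The changes listed above can be approximated with a few regular expressions. This is a minimal sketch of rule-based cleaning, not DocStripper's actual code: the patterns and the `gentle_clean` name are assumptions made for illustration, and the sample text is modeled on the example above. Note that a rule like "a line containing only digits is a page number" can also remove legitimate numeric content.

```python
import re

# Illustrative patterns (assumptions, not the tool's real rules).
PAGE_NUMBER = re.compile(r"^\d{1,4}$")                # a line that is only a number
PAGE_X_OF_Y = re.compile(r"^page \d+ of \d+$", re.I)  # "Page 1 of 10"
CONFIDENTIAL = re.compile(r"^confidential\b", re.I)   # "Confidential - Internal Use Only"
PUNCT_ONLY = re.compile(r"^[-*=_]+$")                 # "---", "***", "==="


def gentle_clean(text: str) -> str:
    # Join words hyphenated across a line break: "auto-\nmatic" -> "automatic".
    text = re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)
    cleaned = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # drop empty lines
        if PAGE_NUMBER.match(stripped) or PAGE_X_OF_Y.match(stripped):
            continue
        if CONFIDENTIAL.match(stripped) or PUNCT_ONLY.match(stripped):
            continue
        if cleaned and cleaned[-1] == line:
            continue  # collapse consecutive duplicate lines
        cleaned.append(line)
    return "\n".join(cleaned)


SAMPLE = """Page 1 of 10
Confidential - Internal Use Only
Executive Summary
This is auto-
matic text processing.
Important content here.
Important content here.

More content.
1
2
3"""

if __name__ == "__main__":
    print(gentle_clean(SAMPLE))
```

Run on `SAMPLE`, the sketch prints the four lines shown under "After (Gentle Mode)".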

🎨 What Gets Removed?

Cleaning Temperaments

Gentle (Recommended - Default)

✅ Page numbers (1, 2, 3...)
✅ Headers/footers ("Page X of Y", "Confidential", etc.)
✅ Repeating headers/footers across pages
✅ Duplicate lines
✅ Empty lines
✅ Punctuation-only lines (---, ***, ===)
✅ Hyphenation fixed (auto-\nmatic → automatic)
✅ Preserves paragraph spacing
❌ Line merging disabled (preserves formatting)
❌ Whitespace normalization disabled
❌ Unicode normalization disabled
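Detecting headers and footers that repeat across pages needs page boundaries. One possible approach, sketched below, assumes pages are separated by form-feed characters (`\f`), which is what `pdftotext` emits between pages. The function name and the `min_pages` threshold are hypothetical; only the first and last non-empty line of each page is considered, so matching lines inside the body are left alone.

```python
from collections import Counter


def strip_repeating_edges(text: str, min_pages: int = 2) -> str:
    """Remove first/last lines that recur on at least `min_pages` pages."""
    pages = [page.splitlines() for page in text.split("\f")]

    # Count on how many pages each line appears as the first or last line.
    counts = Counter()
    for lines in pages:
        body = [ln.strip() for ln in lines if ln.strip()]
        if body:
            counts.update({body[0], body[-1]})  # set: once per page
    repeated = {ln for ln, n in counts.items() if n >= min_pages}

    cleaned_pages = []
    for lines in pages:
        idx = [i for i, ln in enumerate(lines) if ln.strip()]
        drop = set()
        if idx and lines[idx[0]].strip() in repeated:
            drop.add(idx[0])
        if idx and lines[idx[-1]].strip() in repeated:
            drop.add(idx[-1])
        cleaned_pages.append(
            "\n".join(ln for i, ln in enumerate(lines) if i not in drop)
        )
    return "\f".join(cleaned_pages)
```

A single-page document is never changed by this sketch, because no line can repeat across pages.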

Moderate

All Gentle features plus:
✅ Merges broken lines (protects lists and tables)
✅ Preserves paragraph spacing
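As a rough illustration of merging broken lines while protecting lists and tables, the sketch below joins a line onto the previous one only when the previous line does not end a sentence, the new line starts with a lowercase letter, and neither line looks like a list item or table row. The heuristics and names are assumptions, not the tool's real rules.

```python
import re

# Heuristics chosen for illustration only.
LIST_ITEM = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")  # "- item", "1. item", "2) item"
TABLE_ROW = re.compile(r"\t|\S {2,}\S|\|")           # tabs, column gaps, or pipes
SENTENCE_END = re.compile(r"[.!?:;]$")


def merge_broken_lines(text: str) -> str:
    merged = []
    for line in text.splitlines():
        prev = merged[-1] if merged else ""
        protected = (
            LIST_ITEM.match(line) or TABLE_ROW.search(line)
            or LIST_ITEM.match(prev) or TABLE_ROW.search(prev)
        )
        can_merge = (
            prev.strip()
            and line.strip()
            and not protected
            and not SENTENCE_END.search(prev.rstrip())
            and line.lstrip()[:1].islower()
        )
        if can_merge:
            merged[-1] = prev.rstrip() + " " + line.lstrip()
        else:
            merged.append(line)
    return "\n".join(merged)
```

Blank lines are never merged across, so paragraph spacing survives.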

Thorough

All Moderate features plus:
✅ Normalizes whitespace (protects tables)
✅ Normalizes Unicode punctuation (smart quotes, dashes → ASCII)
✅ Preserves paragraph spacing (better readability)
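Whitespace and Unicode punctuation normalization could look roughly like the following. The character map and the table heuristic (a tab, or at least two column gaps of two or more spaces) are assumptions for this sketch; DocStripper's own mapping may differ.

```python
import re

# Map common typographic characters to ASCII (illustrative subset).
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",  # single smart quotes
    "\u201c": '"', "\u201d": '"',  # double smart quotes
    "\u2013": "-", "\u2014": "-",  # en and em dashes
    "\u2026": "...",               # ellipsis
    "\u00a0": " ",                 # non-breaking space
})

COLUMN_GAP = re.compile(r"\S {2,}(?=\S)")


def looks_like_table(line: str) -> bool:
    # A tab, or two or more wide gaps, suggests aligned columns.
    return "\t" in line or len(COLUMN_GAP.findall(line)) >= 2


def normalize(text: str) -> str:
    out = []
    for line in text.splitlines():
        line = line.translate(PUNCT_MAP)
        if looks_like_table(line):
            out.append(line.rstrip())  # keep column spacing intact
        else:
            out.append(re.sub(r"[ \t]+", " ", line).strip())
    return "\n".join(out)
```

Empty lines pass through unchanged, which matches the "preserves paragraph spacing" behavior described for this level.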

Aggressive

All Thorough features, with one change:
❌ Paragraph spacing is removed (more compact output)

CLI Flags (cleaning steps are ON by default)

--no-merge-lines — disable merging broken lines
--no-dehyphenate — disable de-hyphenation across line breaks
--no-normalize-ws — disable whitespace normalization
--no-normalize-unicode — disable Unicode punctuation normalization
--keep-headers — keep headers/footers/page numbers
--stdout — write cleaned text to stdout instead of modifying files (supports - for stdin)
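"Defaults ON" means each cleaning step runs unless its `--no-*` flag is passed. One common way to model that in Python is `argparse` with `action="store_false"`, sketched below. Whether `tool.py` uses `argparse` at all is an assumption; the flag names come from the list above, and the `dest` names are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="docstripper")
    parser.add_argument("files", nargs="*", help="input files, or - for stdin")
    # store_false gives each feature a default of True; the flag turns it off.
    parser.add_argument("--no-merge-lines", dest="merge_lines", action="store_false")
    parser.add_argument("--no-dehyphenate", dest="dehyphenate", action="store_false")
    parser.add_argument("--no-normalize-ws", dest="normalize_ws", action="store_false")
    parser.add_argument("--no-normalize-unicode", dest="normalize_unicode",
                        action="store_false")
    # These two are opt-in, so they default to False.
    parser.add_argument("--keep-headers", action="store_true")
    parser.add_argument("--stdout", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--no-dehyphenate", "report.pdf"])
    print(args.dehyphenate, args.merge_lines)
```

With this layout, code downstream can read plain booleans such as `args.merge_lines` without caring how the flag was spelled.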

Protection Features:

✅ Lists are never merged or broken
✅ Tables preserve spacing
✅ Content headers never removed

🛠️ Supported Formats

| Format | Status | Notes |
|--------|--------|-------|
| `.txt` | ✅ Full | UTF-8, Latin-1 |
| `.docx` | ✅ Basic | Text extraction only (Web + CLI) |
| `.pdf` | ✅ Basic | Text extraction only (Web + CLI). Web uses PDF.js automatically. CLI requires `pdftotext` (poppler-utils) |
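Supporting both UTF-8 and Latin-1 for `.txt` files is typically done with a decode fallback. The sketch below shows one such approach; it is an assumption about how the two encodings might be handled, and `decode_text` is a hypothetical name.

```python
def decode_text(data: bytes) -> str:
    """Decode bytes as UTF-8, falling back to Latin-1."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never raises.
        return data.decode("latin-1")
```

Trying UTF-8 first matters: Latin-1 accepts any byte sequence, so the reverse order would silently garble UTF-8 input.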

PDF Support:

macOS: brew install poppler
Ubuntu/Debian: sudo apt-get install poppler-utils
Windows: Download from poppler-windows releases
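To check that `pdftotext` is installed and usable, a script can call it directly. The sketch below is not taken from `tool.py`; `pdf_to_text` is a hypothetical helper. It relies on two documented `pdftotext` behaviors: passing `-` as the output file writes the text to stdout, and `-enc UTF-8` selects the output encoding.

```python
import shutil
import subprocess


def pdf_to_text(pdf_path: str, binary: str = "pdftotext") -> str:
    """Extract text from a PDF by shelling out to pdftotext."""
    if shutil.which(binary) is None:
        raise RuntimeError(f"{binary} not found; install poppler first")
    result = subprocess.run(
        [binary, "-enc", "UTF-8", pdf_path, "-"],  # "-" means write to stdout
        capture_output=True,
        check=True,  # raise CalledProcessError if extraction fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `pdf_to_text("input.pdf")` returns the extracted text, with pages separated by form-feed characters.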

🔧 Requirements

Web App

Modern web browser (Chrome, Firefox, Safari, Edge)
No installation or dependencies required
Works completely offline after first load

CLI Tool

Python 3.9+
PDF support (optional): pdftotext from poppler-utils

📝 Changelog

See GitHub Releases for release notes and changelog.

📝 License

MIT License — see LICENSE.txt for details.

🤝 Contributing

Contributions are welcome! See Contributing Guide for guidelines.

Made with ❤️ for clean documents

⭐ Star this repo | 🌐 Try online | 🚀 Product Hunt | 🐛 Report Bug

💝 Support

Support this project and help keep it free:

☕ Support on Gumroad | ☕ Buy Me a Coffee | 🙏 Thanks.dev | 💚 Ko-fi

🔗 Connect

📰 Blog & Updates: t.me/kiku_AI
💬 Discord: discord.gg/4Kxs97JvsU
💼 LinkedIn: linkedin.com/in/kiku-jw
🌐 About.me: about.me/kiku_jw

