🧹 DocStripper

AI-powered batch document cleaner — Remove noise from text documents automatically with Fast or Smart Clean modes

DocStripper removes page numbers, headers and footers, duplicate lines, and empty lines from .txt, .docx, and .pdf files. Choose between Fast Clean (instant, rule-based) and Smart Clean (AI-powered, using an on-device LLM). The web app runs entirely in your browser: no uploads and no sign-ups, so your documents stay private. It suits students, researchers, and anyone working with scanned documents or PDFs.

🌐 Try it online → No installation needed!

Web App Features:

⚡ Fast Clean — Instant rule-based cleaning
🤖 Smart Clean (Beta) — AI-powered cleaning with on-device LLM
- Requires WebGPU support (available in recent versions of Chrome and Edge; support in other browsers varies)
- One-time download of ~100-200 MB (model weights)
- Works offline after first load
- Fully customizable via cleaning options

✨ Features

Web Application

🚀 Fast Clean — Rule-based cleaning (instant)
🤖 Smart Clean (Beta) — AI-powered cleaning using on-device LLM (WebLLM)
⚙️ Customizable Options — Configure what gets removed
🔒 100% Private — All processing happens in your browser
📊 Real-time Statistics — See exactly what was removed
📥 Download & Copy — Download cleaned files or copy to clipboard
🎨 Dark Theme — Toggle between light and dark themes

CLI Tool

🚀 Fast & Lightweight — Uses only Python stdlib, no external packages
🔒 Privacy-First — All processing happens offline
📊 Dry-Run Mode — Preview changes before applying
🔄 Undo Support — Restore files from backups
🌍 Cross-Platform — Works on Windows, macOS, and Linux
📚 Multiple Formats — Supports .txt, .docx, and .pdf files

🎯 Quick Start

Installation

git clone https://github.com/kiku-jw/DocStripper.git
cd DocStripper

Usage

# Clean a single file
python tool.py document.txt

# Clean multiple files
python tool.py file1.txt file2.txt file3.docx

# Preview changes (dry-run)
python tool.py --dry-run document.txt

# Undo last operation
python tool.py --undo
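
The dry-run and undo commands above could work together roughly as follows. This is only a sketch: the function names `apply_cleaning` and `undo` and the `.bak` backup suffix are assumptions made for illustration, since this README does not say how or where tool.py stores its backups.

```python
from pathlib import Path
import shutil


def apply_cleaning(path, cleaned_text, dry_run=False):
    """Write cleaned_text to path and return the number of lines removed."""
    path = Path(path)
    original = path.read_text(encoding="utf-8")
    removed = len(original.splitlines()) - len(cleaned_text.splitlines())
    if dry_run:
        return removed  # report only; the file is left untouched
    backup = path.with_name(path.name + ".bak")  # hypothetical backup name
    shutil.copy2(path, backup)  # keep a copy so the change can be undone
    path.write_text(cleaned_text, encoding="utf-8")
    return removed


def undo(path):
    """Restore path from its .bak backup; return False if none exists."""
    path = Path(path)
    backup = path.with_name(path.name + ".bak")
    if not backup.exists():
        return False
    backup.replace(path)  # overwrites the cleaned file on every platform
    return True
```

Writing the backup before touching the original means an interrupted run still leaves a recoverable copy.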

📖 Example

Before:

Page 1 of 10
Confidential

Important content here.
Important content here.

1
2
3

Page 2 of 10

After:

Important content here.

🎨 What Gets Removed?

Page numbers — Lines with only digits (1, 2, 3...), Roman numerals (I, II, III), or letters (A, B, C)
Headers/Footers — Common patterns like "Page X of Y", "Confidential", "DRAFT", "INTERNAL USE ONLY"
Duplicate lines — Consecutive identical lines
Empty lines — Whitespace-only lines (optional: preserve paragraph spacing)
Punctuation lines — Lines with only symbols (---, ***, ===)
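
The five rules above can be sketched in a few lines of Python. The regular expressions and the `clean_lines` function are illustrative assumptions, not the patterns tool.py actually uses. The Roman-numeral pattern in particular is deliberately loose and would also match uppercase words made only of those letters, such as "MIX".

```python
import re

# Assumed patterns for illustration; the real tool may differ.
PAGE_NUMBER = re.compile(r"^(\d+|[IVXLCDM]+|[A-Z])$")
HEADER_FOOTER = re.compile(
    r"^(page \d+ of \d+|confidential|draft|internal use only)$",
    re.IGNORECASE,
)
PUNCTUATION_ONLY = re.compile(r"^[^\w\s]+$")


def clean_lines(lines, keep_paragraph_breaks=False):
    """Return the lines that survive the removal rules."""
    kept = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Empty line: optionally keep one blank line between paragraphs.
            if keep_paragraph_breaks and kept and kept[-1] != "":
                kept.append("")
            continue
        if PAGE_NUMBER.match(line) or HEADER_FOOTER.match(line):
            continue
        if PUNCTUATION_ONLY.match(line):
            continue
        if kept and kept[-1] == line:
            continue  # consecutive duplicate
        kept.append(line)
    while kept and kept[-1] == "":
        kept.pop()  # no trailing blank line
    return kept


sample = [
    "Page 1 of 10", "Confidential", "",
    "Important content here.", "Important content here.", "",
    "1", "2", "3", "", "Page 2 of 10",
]
print(clean_lines(sample))  # ['Important content here.']
```

Run on the Before text from the Example section, the sketch yields the single line shown under After.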

🛠️ Supported Formats

| Format | Status | Notes |
|--------|--------|-------|
| `.txt` | ✅ Full | UTF-8, Latin-1 |
| `.docx` | ✅ Basic | Text extraction only |
| `.pdf` | ✅ Basic | Requires `pdftotext` (poppler-utils) |

PDF Support Installation:

macOS: brew install poppler
Ubuntu/Debian: sudo apt-get install poppler-utils
Windows: Download from poppler-windows releases
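
For a sense of how the three formats can be read with only the standard library plus the `pdftotext` binary, here is a minimal sketch. The function names are hypothetical and the real tool.py may take a different approach. The .docx reader assumes the main document part is at the conventional path `word/document.xml`.

```python
import subprocess
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path

# WordprocessingML namespace used by .docx document parts.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_txt(path):
    """Decode as UTF-8, falling back to Latin-1 (which accepts any byte)."""
    data = Path(path).read_bytes()
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")


def read_docx(path):
    """Return one line per paragraph; formatting is dropped."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def read_pdf(path):
    """Run pdftotext; '-' as the output name sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", str(path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

`read_pdf` raises `FileNotFoundError` when `pdftotext` is not installed, so callers may want to catch that and point the user at the installation steps above.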

📊 Command Line Options

python tool.py [OPTIONS] [FILES...]

Options:
  -h, --help     Show help message
  --dry-run      Preview changes without modifying files
  --undo         Restore files from last operation
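
An `argparse` parser that accepts the options listed above might look like the following. It is a sketch of the documented interface only, with `build_parser` as a hypothetical name. It is not taken from the tool.py source.

```python
import argparse


def build_parser():
    """Build a parser for the documented options (-h is added by argparse)."""
    parser = argparse.ArgumentParser(
        prog="tool.py",
        description="Remove noise from text documents.",
    )
    parser.add_argument(
        "files", nargs="*", metavar="FILES",
        help="files to clean (.txt, .docx, .pdf)",
    )
    parser.add_argument(
        "--dry-run", action="store_true",
        help="preview changes without modifying files",
    )
    parser.add_argument(
        "--undo", action="store_true",
        help="restore files from last operation",
    )
    return parser


args = build_parser().parse_args(["--dry-run", "document.txt"])
print(args.dry_run, args.files)  # True ['document.txt']
```

`FILES` uses `nargs="*"` so that `--undo` can be given without naming any file.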

🔧 Requirements

Python 3.9+
PDF support (optional): pdftotext from poppler-utils
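
To check both requirements before a run, a short script along these lines would do. `check_environment` is a hypothetical helper, not part of tool.py.

```python
import shutil
import sys


def check_environment():
    """Report whether the documented requirements are met."""
    return {
        # The README asks for Python 3.9 or newer.
        "python_ok": sys.version_info >= (3, 9),
        # PDF support is optional and needs pdftotext on the PATH.
        "pdf_ok": shutil.which("pdftotext") is not None,
    }


print(check_environment())
```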

📝 License

This project is licensed under the MIT License — see the LICENSE.txt file for details.

🤝 Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

Made with ❤️ for clean documents

⭐ Star this repo | 🌐 Try online | 🚀 Product Hunt | 🐛 Report Bug

💝 Support

Support this project and help keep it free:

☕ Buy Me a Coffee | 🙏 Thanks.dev | 💚 Ko-fi

🔗 Connect

📰 Blog & Updates: t.me/kiku_blog
💬 Discord: discord.gg/4Kxs97JvsU
💼 LinkedIn: linkedin.com/in/kiku-jw
🌐 About.me: about.me/kiku_jw
