Releases: KikuAI-Lab/DocStripper
v2.1.0 - UX & Distribution Release
🎯 Major Improvements
Cleaning Temperament Slider (Web)
Replaced "Conservative" and "Aggressive" radio buttons with an intuitive 4-mode slider:
- Gentle (Recommended): Safe defaults, preserves formatting
- Moderate: Balanced cleaning with line merging
- Thorough: Complete cleaning with Unicode normalization, preserves paragraph spacing
- Aggressive: Maximum cleaning, removes paragraph spacing for compact output
Privacy-First UX
- 🔒 Onboarding Tooltip: First-time visitors see a privacy message ("Everything works offline, nothing sent")
- 📡 Works Offline Badge: Visual indicator in the UI corner
- Privacy-Friendly Analytics: Integrated Plausible.io (no cookies, GDPR-compliant)
Performance Improvements
- WebWorker Integration: Large files (>100KB or >5000 lines) are now processed in a background thread
- No UI Freezing: Long PDF processing no longer blocks the interface
- Smart Mode Selection: Automatically uses WebWorker for Fast Clean on large files
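The size check behind this behavior can be sketched as follows. The web app itself is JavaScript, so this Python version is only an illustration; the function name is hypothetical, and reading "100KB" as 100 × 1024 bytes is an assumption.

```python
# Thresholds taken from the release notes; 1 KB = 1024 bytes is an assumption.
LARGE_FILE_BYTES = 100 * 1024
LARGE_FILE_LINES = 5000


def should_use_background_worker(text: str) -> bool:
    """Return True when the text crosses either size threshold."""
    size_bytes = len(text.encode("utf-8"))
    line_count = text.count("\n") + 1 if text else 0
    return size_bytes > LARGE_FILE_BYTES or line_count > LARGE_FILE_LINES
```

Either condition alone is enough, so a short file with very long lines and a long file with very short lines would both be routed to the worker.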
Enhanced Feedback
- ✅ ZIP Download Notifications: Shows "X files cleaned" after batch download
- Settings Restoration Toast: Visual confirmation when previous settings are restored
- Support Snackbar: Non-intrusive "Buy a coffee" message after cleaning (once per session)
Distribution Ready
- Homebrew Tap: Created `kiku-jw/homebrew-docstripper` repository
- PyPI Package: Ready for distribution (`pip install docstripper`)
- Installation Guides: Comprehensive docs in `INSTALL.md` and `PUBLISH_GUIDE.md`
Web Interface Enhancements
- Normalize Unicode Option: Added checkbox in Advanced Options (11 options total)
- Gumroad Support Link: Added to floating support button, snackbar, and README
- Improved Mobile Layout: Better spacing and alignment on small screens
🔧 Technical Changes
Web (docs/assets/app.js)
- Refactored cleaning mode selection to temperament slider system
- Implemented `updateTemperamentFromValue()` and `applyTemperamentDefaults()` methods
- Added `cleaner.worker.js` WebWorker for background processing
- Enhanced `normalizeUnicode` option (converts smart quotes, dashes to ASCII)
- Improved settings persistence and restoration UI feedback
CSS (docs/assets/style.css)
- Added responsive styles for temperament slider
- Onboarding tooltip animations and responsive adjustments
- Works offline badge styles with mobile hiding
- Support snackbar animations
New Files
- `docs/assets/cleaner.worker.js`: WebWorker implementation for cleaning
- `docstripper.rb`: Homebrew formula
- `INSTALL.md`: Installation instructions for all methods
- `PUBLISH_GUIDE.md`: Distribution guide
- `HOMEBREW_TAP_SETUP.md`: Homebrew tap setup instructions
🐛 Bug Fixes
- Fixed Thorough vs Aggressive mode differentiation (paragraph spacing preservation)
- Improved temperament slider step values (0, 33, 66, 100 for 4 distinct modes)
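One way to turn a slider position into one of the four modes is to snap to the nearest step. This is a minimal Python sketch of that idea, not the actual `updateTemperamentFromValue()` code (which is JavaScript); the function name and the nearest-step rule are assumptions.

```python
# Step values and mode names from the release notes.
MODE_STEPS = [(0, "Gentle"), (33, "Moderate"), (66, "Thorough"), (100, "Aggressive")]


def temperament_from_value(value: int) -> str:
    """Snap a 0-100 slider value to the closest of the four modes."""
    clamped = max(0, min(100, value))
    _, name = min(MODE_STEPS, key=lambda pair: abs(pair[0] - clamped))
    return name
```

With steps at 0, 33, 66 and 100, every slider position resolves to exactly one mode, which is what keeps the four modes distinct.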
📊 Cleaning Modes Comparison
| Feature | Gentle | Moderate | Thorough | Aggressive |
|---|---|---|---|---|
| Page numbers | ✅ | ✅ | ✅ | ✅ |
| Headers/footers | ✅ | ✅ | ✅ | ✅ |
| Duplicates | ✅ | ✅ | ✅ | ✅ |
| Hyphenation fix | ✅ | ✅ | ✅ | ✅ |
| Line merging | ❌ | ✅ | ✅ | ✅ |
| Whitespace norm | ❌ | ❌ | ✅ | ✅ |
| Unicode norm | ❌ | ❌ | ✅ | ✅ |
| Paragraph spacing | ✅ | ✅ | ✅ | ❌ |
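The comparison table can also be read as a set of option presets. The sketch below encodes it as data in Python; the class and field names are hypothetical and only mirror the table rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CleaningOptions:
    # Defaults correspond to the "Gentle" column of the table.
    remove_page_numbers: bool = True
    remove_headers_footers: bool = True
    remove_duplicates: bool = True
    fix_hyphenation: bool = True
    merge_lines: bool = False
    normalize_whitespace: bool = False
    normalize_unicode: bool = False
    keep_paragraph_spacing: bool = True


TEMPERAMENTS = {
    "Gentle": CleaningOptions(),
    "Moderate": CleaningOptions(merge_lines=True),
    "Thorough": CleaningOptions(
        merge_lines=True, normalize_whitespace=True, normalize_unicode=True
    ),
    "Aggressive": CleaningOptions(
        merge_lines=True,
        normalize_whitespace=True,
        normalize_unicode=True,
        keep_paragraph_spacing=False,
    ),
}
```

Written this way, Thorough and Aggressive differ only in `keep_paragraph_spacing`, which matches the differentiation fix listed under Bug Fixes.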
🛠️ Installation
Web
No installation needed: https://kiku-jw.github.io/DocStripper/
CLI - Homebrew (macOS)
brew tap kiku-jw/docstripper
brew install docstripper
docstripper document.txt
CLI - PyPI (All Platforms)
pip install docstripper
docstripper document.txt
CLI - Manual
git clone https://github.com/kiku-jw/DocStripper.git
cd DocStripper
python tool.py document.txt
🙏 Support
If DocStripper saves you time, consider supporting the project.
📚 Documentation
- Updated README with cleaning temperament descriptions
- Wiki Usage guide updated
- Comprehensive installation guide (`INSTALL.md`)
Full Changelog: See GitHub Commits
DocStripper v2.0.0 - Quality Release
🎯 Major Improvements
Cleaning Pipeline v1 (Critical Upgrade)
Unified, production-ready cleaning logic with smart defaults enabled:
- Line Merging: Automatically merges lines that were broken mid-sentence (protects lists, tables, headers)
- De-hyphenation: Fixes words split across line breaks (auto-\nmatic → automatic)
- Header/Footer Removal: Removes page numbers, "Page X of Y", and repeating headers/footers across pages
- Whitespace Normalization: Collapses multiple spaces, normalizes tabs (protects tables)
- Unicode Normalization: Converts smart quotes and dashes to ASCII equivalents
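Two of these steps are easy to illustrate in isolation. The following is a minimal Python sketch of de-hyphenation and Unicode normalization; the regex, the lowercase-only rule, and the exact set of replaced characters are assumptions, not DocStripper's actual rules.

```python
import re

# Smart punctuation mapped to ASCII; the exact character set is an assumption.
ASCII_EQUIVALENTS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
})


def dehyphenate(text: str) -> str:
    """Join a word split as 'auto-' + newline + 'matic' back into one word."""
    # Only lowercase letters on both sides, so list bullets and
    # "well-" followed by a capitalized name are left alone.
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)


def normalize_unicode(text: str) -> str:
    """Replace smart quotes and dashes with their ASCII equivalents."""
    return text.translate(ASCII_EQUIVALENTS)
```

For example, `dehyphenate("auto-\nmatic")` returns `"automatic"`, matching the example in the list above.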
Protection Mechanisms
- Lists: Never merged (bullet points, numbered lists)
- Tables: Detected and preserved (spacing maintained)
- Headers: Protected from being merged with content
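The protection rules can be pictured as a guard that runs before any two lines are joined. This Python sketch uses deliberately crude heuristics (bullet or number prefixes for lists, pipes, tabs or wide gaps for tables, all-caps for headers); all of them are assumptions for illustration only.

```python
import re

LIST_ITEM = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+")
TABLE_ROW = re.compile(r"\|| {3,}\S|\t")  # pipes, wide gaps or tabs suggest columns
SENTENCE_END = (".", "!", "?", ":")


def is_protected(line: str) -> bool:
    """True for blank lines, list items, table-like rows and all-caps headers."""
    stripped = line.strip()
    return (
        not stripped
        or LIST_ITEM.match(line) is not None
        or TABLE_ROW.search(stripped) is not None
        or stripped.isupper()
    )


def merge_broken_lines(lines):
    """Join a line onto the previous one only when neither is protected."""
    merged = []
    for line in lines:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and not is_protected(prev)
            and not is_protected(line)
            and not prev.rstrip().endswith(SENTENCE_END)
        ):
            merged[-1] = prev.rstrip() + " " + line.strip()
        else:
            merged.append(line)
    return merged
```

Checking both the previous and the current line is what keeps a header from absorbing the paragraph that follows it.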
CLI Enhancements
- New Flags: `--no-merge-lines`, `--no-dehyphenate`, `--no-normalize-ws`, `--no-normalize-unicode`, `--keep-headers`
- stdin/stdout Support: Pipe documents through DocStripper: `cat file.pdf | tool.py - --stdout > clean.txt`
- All cleaning options ON by default (can be disabled via flags)
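"On by default, off via flag" maps naturally onto `argparse`'s `store_false` action. The sketch below shows how flags like these could be declared; it is not the actual `tool.py` source, and the `dest` names and the `build_parser` helper are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="tool.py")
    parser.add_argument("files", nargs="+", help='input files, or "-" for stdin')
    parser.add_argument("--stdout", action="store_true",
                        help="write cleaned text to stdout")
    # store_false gives each option a default of True; the flag turns it off.
    parser.add_argument("--no-merge-lines", dest="merge_lines", action="store_false")
    parser.add_argument("--no-dehyphenate", dest="dehyphenate", action="store_false")
    parser.add_argument("--no-normalize-ws", dest="normalize_ws", action="store_false")
    parser.add_argument("--no-normalize-unicode", dest="normalize_unicode",
                        action="store_false")
    parser.add_argument("--keep-headers", dest="remove_headers", action="store_false")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With this layout, a script that passes no flags gets every cleaning step, which is consistent with existing invocations continuing to work.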
Web UI Improvements
- Brief Statistics Line: Shows "Merged X lines, Dehyphenated Y tokens..." in results summary
- Consistent Options: Web checkboxes match CLI flags exactly
- Clear List Button: Quickly reset and start over
Bug Fixes
- Fixed header/footer merging issue: headers no longer get merged with content during line merging
- Improved pattern recognition for multilingual headers (Russian "Страница X из Y")
📊 What Gets Cleaned (Default Behavior)
Conservative Mode (Recommended)
✅ Page numbers (1, 2, 3...)
✅ Headers/footers ("Page X of Y", "Confidential", etc.)
✅ Repeating headers/footers across pages
✅ Duplicate lines
✅ Empty lines
✅ Punctuation-only lines (---, ***, ===)
✅ Hyphenation fixed (auto-\nmatic → automatic)
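Most of the items above are per-line filters. A minimal Python sketch of such a filter follows; the regular expressions are assumptions (including the Russian "Страница X из Y" variant mentioned under Bug Fixes), and repeating-header detection across pages is left out.

```python
import re

# A bare number on its own line; note this would also drop a line like "2024".
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")
PAGE_X_OF_Y = re.compile(
    r"^\s*(?:Page|Страница)\s+\d+\s+(?:of|из)\s+\d+\s*$", re.IGNORECASE
)
PUNCT_ONLY = re.compile(r"^\s*[-*=_]{3,}\s*$")


def filter_lines(lines):
    """Drop page numbers, 'Page X of Y', rule lines, blanks and duplicates."""
    kept, seen = [], set()
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # empty line
        if PAGE_NUMBER.match(line) or PAGE_X_OF_Y.match(line) or PUNCT_ONLY.match(line):
            continue
        if stripped in seen:
            continue  # duplicate of an earlier line
        seen.add(stripped)
        kept.append(line)
    return kept
```

Treating any repeated line as a duplicate is the simplest possible rule; a real cleaner would likely be more selective to avoid removing legitimately repeated text.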
Aggressive Mode
All Conservative features plus:
✅ Merges broken lines (protects lists and tables)
✅ Normalizes whitespace (protects tables)
🛠️ Migration Guide
CLI
No breaking changes. Existing scripts continue to work, but now benefit from improved cleaning by default.
To disable specific features:
python tool.py --no-merge-lines --no-dehyphenate document.txt
Web
No changes required. Default settings are optimal for most users. Toggle "Advanced Options" to customize.
📝 Technical Details
- Cleaning Order: De-hyphenation → Line Merging → Whitespace Normalization → Unicode Normalization → Line Filtering
- Shared Logic: Web (JavaScript) and CLI (Python) implement identical cleaning rules
- Performance: Optimized for large documents (tested up to 500+ pages)
- Memory: Efficient streaming for CLI, page-wise processing for web
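The cleaning order matters because later stages can destroy the patterns earlier stages look for. The Python sketch below uses two simplified stand-in stages (not the real implementations) to show why de-hyphenation has to run before line merging.

```python
import re
from functools import reduce


def dehyphenate(text: str) -> str:
    # Stand-in: rejoin "auto-" + newline + "matic".
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)


def merge_lines(text: str) -> str:
    # Stand-in: turn a single newline into a space unless the previous
    # character ends a sentence or the newline is part of a blank line.
    return re.sub(r"(?<![.!?:\n])\n(?!\n)", " ", text)


def run_pipeline(text: str, stages) -> str:
    """Apply each stage to the output of the previous one, in order."""
    return reduce(lambda acc, stage: stage(acc), stages, text)


if __name__ == "__main__":
    sample = "auto-\nmatic cleanup\nof text."
    print(run_pipeline(sample, [dehyphenate, merge_lines]))
    print(run_pipeline(sample, [merge_lines, dehyphenate]))
```

In the documented order the sample becomes `automatic cleanup of text.`; with the stages swapped, the newline is already gone when de-hyphenation runs and the result is `auto- matic cleanup of text.`.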
🙏 Credits
Based on competitor analysis and best practices from:
- PyPDF, PyMuPDF (PDF extraction)
- Unstructured, Docling (document processing)
- Document Cleaner (cleaning heuristics)
📚 Documentation
- Updated README with CLI flags and examples
- Wiki Usage guide updated with stdin/stdout examples
- Cleaning specification document added
Full Changelog: See GitHub Commits
v1.3.0 - PDF Support in Web Version
🎉 PDF Support Added to Web Version
✨ Major Feature
📄 PDF File Support - PDF files are now supported in both web and CLI versions!
- Web: Automatic PDF support using PDF.js library (no installation needed)
- CLI: PDF support via poppler-utils (as before)
🚀 New Features
- PDF file upload and processing in web application
- Automatic PDF text extraction with line structure preservation
- PDF.js library integration (v3.11.174) from CDN
- Improved text extraction algorithm that preserves line breaks
📝 Changes
- Add PDF.js library from CDN
- Implement `extractTextFromPDF` function using PDF.js
- Update `readTextFile` to handle PDF files
- Update file input to accept PDF files
- Improve PDF text extraction to preserve line structure
- Update documentation (README, Wiki) to reflect PDF support
🔧 Technical Details
- Web: Uses PDF.js library automatically (no installation needed)
- CLI: Requires pdftotext from poppler-utils (as before)
- PDF extraction preserves line structure by grouping text items by Y position
- Compatible with Fast Clean and Smart Clean modes
- Fully tested with various PDF formats
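Grouping text items by Y position can be sketched as follows. The real code runs in JavaScript on PDF.js text items; the `(x, y, text)` tuples, the function name, and the tolerance value here are simplified stand-ins chosen for illustration.

```python
from collections import defaultdict


def lines_from_text_items(items, y_tolerance=2.0):
    """Rebuild lines from (x, y, text) tuples; y grows upward as in PDF space."""
    rows = defaultdict(list)
    for x, y, text in items:
        # Bucket nearby baselines together so small jitter stays on one line.
        rows[round(y / y_tolerance)].append((x, text))
    # Highest y first (top of the page), then left to right within a row.
    ordered = sorted(rows.items(), key=lambda row: -row[0])
    return [" ".join(text for _, text in sorted(cells)) for _, cells in ordered]
```

Fixed-size buckets are the simplest approach; two items whose baselines fall on either side of a bucket boundary would be split, so a production version would more likely compare each item against the previous line's Y value.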
📚 Documentation Updates
- Updated README.md with PDF support information
- Updated Wiki pages (Home, Usage, FAQ, Installation)
- Clarified differences between Web and CLI PDF support
✅ Testing
- Tested PDF extraction with various PDF files
- Verified compatibility with Fast Clean mode
- Verified compatibility with Smart Clean mode
- Confirmed proper line structure preservation
🎯 Compatibility
- Breaking Changes: None
- Backward Compatibility: All existing features remain unchanged
- Browser Support: All modern browsers with JavaScript enabled
Try it now: https://kiku-jw.github.io/DocStripper/
v1.2.0 - Smart Clean with AI
🎉 Major Update: Smart Clean with AI
✨ New Features
- 🤖 Smart Clean (Beta) - AI-powered cleaning using on-device LLM (WebLLM)
  - WebGPU-based inference for fast processing
  - Dynamic prompt generation based on user settings
  - Automatic fallback to Fast Clean if WebGPU unavailable
  - Progress tracking and cancellation support
  - Batch processing for large files (parallel chunk processing)
  - Adaptive chunk sizing for optimal performance
- 🏷️ Mode Badges - Visual indicators showing which cleaning mode was used
- 📈 Enhanced Statistics - Detailed breakdown of what was removed
🚀 Improvements
- ⚡ Performance optimization with parallel batch processing
- 🎯 Better error handling with fallback mechanisms
- 📊 Adaptive chunking based on document length
- 🔧 Settings integration - cleaning options customize AI behavior
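Adaptive chunking combined with parallel batch processing can be sketched like this in Python. The thresholds, chunk sizes, and function names are hypothetical, and a thread pool stands in for whatever batching the web app actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_size_for(total_lines: int) -> int:
    # Hypothetical thresholds: longer documents get larger chunks.
    if total_lines <= 200:
        return 50
    if total_lines <= 2000:
        return 100
    return 200


def split_into_chunks(lines):
    size = chunk_size_for(len(lines))
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def clean_in_parallel(lines, clean_chunk, max_workers=4):
    """Clean chunks concurrently and stitch the results back in order."""
    chunks = split_into_chunks(lines)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        cleaned = list(pool.map(clean_chunk, chunks))  # map preserves chunk order
    return [line for chunk in cleaned for line in chunk]
```

Because `pool.map` returns results in submission order, the output keeps the original document order even when chunks finish at different times.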
📝 Full Changelog
See CHANGELOG.md for complete details.
Try it now: https://kiku-jw.github.io/DocStripper/
v1.0.0 - Initial Release
🎉 Initial Release
DocStripper - Batch document cleaner CLI tool
Features
- 🚀 Fast & Lightweight - Uses only Python stdlib, no external packages
- 🔒 Privacy-First - All processing happens offline
- 📊 Dry-Run Mode - Preview changes before applying
- 🔄 Undo Support - Restore files from backups
- 🌍 Cross-Platform - Works on Windows, macOS, and Linux
- 📚 Multiple Formats - Supports .txt, .docx, and .pdf files
What Gets Removed
- Page numbers
- Headers/Footers
- Duplicate lines
- Empty lines
See README.md for usage instructions.