Releases: KikuAI-Lab/DocStripper

v2.1.0 - UX & Distribution Release

03 Nov 18:51

🎯 Major Improvements

Cleaning Temperament Slider (Web)

Replaced "Conservative" and "Aggressive" radio buttons with an intuitive 4-mode slider:

  • Gentle (Recommended): Safe defaults, preserves formatting
  • Moderate: Balanced cleaning with line merging
  • Thorough: Complete cleaning with Unicode normalization, preserves paragraph spacing
  • Aggressive: Maximum cleaning, removes paragraph spacing for compact output

Privacy-First UX

  • 🔒 Onboarding Tooltip: First-time visitors see a privacy message ("Everything works offline, nothing sent")
  • 📡 Works Offline Badge: Visual indicator in the UI corner
  • Privacy-Friendly Analytics: Integrated Plausible.io (no cookies, GDPR-compliant)

Performance Improvements

  • WebWorker Integration: Large files (>100KB or >5000 lines) are now processed in a background thread
  • No UI Freezing: Long PDF processing no longer blocks the interface
  • Smart Mode Selection: Automatically uses WebWorker for Fast Clean on large files
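
The size check described above can be sketched as follows. This is an illustration in Python rather than the web app's JavaScript; the function name and the reading of "100KB" as 100 × 1024 bytes are assumptions, not taken from app.js.

```python
# Thresholds from the release note; "100KB" is assumed to mean 100 * 1024 bytes.
SIZE_THRESHOLD_BYTES = 100 * 1024
LINE_THRESHOLD = 5000


def should_use_worker(text: str) -> bool:
    """Return True when the input is large enough to clean off the main thread.

    Hypothetical helper: either limit being exceeded is enough.
    """
    size_bytes = len(text.encode("utf-8"))
    line_count = text.count("\n") + 1
    return size_bytes > SIZE_THRESHOLD_BYTES or line_count > LINE_THRESHOLD
```

Either condition alone triggers background processing, so a short file with many lines and a long single-line file are both moved off the main thread.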

Enhanced Feedback

  • ✅ ZIP Download Notifications: Shows "X files cleaned" after batch download
  • Settings Restoration Toast: Visual confirmation when previous settings are restored
  • Support Snackbar: Non-intrusive "Buy a coffee" message after cleaning (once per session)

Distribution Ready

  • Homebrew Tap: Created kiku-jw/homebrew-docstripper repository
  • PyPI Package: Ready for distribution (pip install docstripper)
  • Installation Guides: Comprehensive docs in INSTALL.md and PUBLISH_GUIDE.md

Web Interface Enhancements

  • Normalize Unicode Option: Added checkbox in Advanced Options (11 options total)
  • Gumroad Support Link: Added to floating support button, snackbar, and README
  • Improved Mobile Layout: Better spacing and alignment on small screens

🔧 Technical Changes

Web (docs/assets/app.js)

  • Refactored cleaning mode selection into a temperament slider system
  • Implemented updateTemperamentFromValue() and applyTemperamentDefaults() methods
  • Added cleaner.worker.js WebWorker for background processing
  • Enhanced normalizeUnicode option (converts smart quotes, dashes to ASCII)
  • Improved settings persistence and restoration UI feedback
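
As a rough illustration of what the normalizeUnicode option does, here is a minimal Python sketch. The character table is an assumption; the real option may cover more (or different) characters.

```python
# Assumed mapping of typographic characters to ASCII; not copied from app.js.
ASCII_EQUIVALENTS = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})


def normalize_unicode(text: str) -> str:
    """Replace smart quotes and dashes with their ASCII equivalents."""
    return text.translate(ASCII_EQUIVALENTS)
```

A translation table keeps the replacement to a single pass over the text, which matters on large documents.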

CSS (docs/assets/style.css)

  • Added responsive styles for temperament slider
  • Added onboarding tooltip animations and responsive adjustments
  • Added "Works Offline" badge styles (badge is hidden on mobile)
  • Added support snackbar animations

New Files

  • docs/assets/cleaner.worker.js: WebWorker implementation for cleaning
  • docstripper.rb: Homebrew formula
  • INSTALL.md: Installation instructions for all methods
  • PUBLISH_GUIDE.md: Distribution guide
  • HOMEBREW_TAP_SETUP.md: Homebrew tap setup instructions

🐛 Bug Fixes

  • Fixed Thorough vs Aggressive mode differentiation (paragraph spacing preservation)
  • Improved temperament slider step values (0, 33, 66, 100 for 4 distinct modes)
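
The four step values map onto the four modes. A minimal Python sketch of that mapping, assuming intermediate values snap to the nearest step (the release note does not say how updateTemperamentFromValue() handles them):

```python
# Step values from the release note, paired with the mode names listed above.
TEMPERAMENT_STEPS = {0: "Gentle", 33: "Moderate", 66: "Thorough", 100: "Aggressive"}


def temperament_from_value(value: int) -> str:
    """Return the mode whose step is closest to a 0-100 slider value.

    Snapping to the nearest step is an assumption for illustration.
    """
    nearest = min(TEMPERAMENT_STEPS, key=lambda step: abs(step - value))
    return TEMPERAMENT_STEPS[nearest]
```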

📊 Cleaning Modes Comparison

Feature Gentle Moderate Thorough Aggressive
Page numbers
Headers/footers
Duplicates
Hyphenation fix
Line merging
Whitespace norm
Unicode norm
Paragraph spacing

🛠️ Installation

Web

No installation needed: https://kiku-jw.github.io/DocStripper/

CLI - Homebrew (macOS)

brew tap kiku-jw/docstripper
brew install docstripper
docstripper document.txt

CLI - PyPI (All Platforms)

pip install docstripper
docstripper document.txt

CLI - Manual

git clone https://github.com/kiku-jw/DocStripper.git
cd DocStripper
python tool.py document.txt

🙏 Support

If DocStripper saves you time, consider supporting the project.

📚 Documentation

  • Updated README with cleaning temperament descriptions
  • Wiki Usage guide updated
  • Comprehensive installation guide (INSTALL.md)

Full Changelog: See GitHub Commits

DocStripper v2.0.0 - Quality Release

03 Nov 17:25

🎯 Major Improvements

Cleaning Pipeline v1 (Critical Upgrade)

Unified, production-ready cleaning logic with smart defaults enabled:

  • Line Merging: Automatically merges lines broken mid-sentence (protects lists, tables, headers)
  • De-hyphenation: Fixes words split across line breaks (auto-\nmatic → automatic)
  • Header/Footer Removal: Removes page numbers, "Page X of Y", and repeating headers/footers across pages
  • Whitespace Normalization: Collapses multiple spaces, normalizes tabs (protects tables)
  • Unicode Normalization: Converts smart quotes and dashes to ASCII equivalents
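
De-hyphenation of the auto-\nmatic → automatic kind can be sketched with one regular expression. This is a minimal Python illustration, not the tool's actual rule; the pattern and the function name are assumptions.

```python
import re

# A hyphen directly before a line break, with a lowercase letter on each side.
_HYPHEN_BREAK = re.compile(r"(?<=[a-z])-\n(?=[a-z])")


def dehyphenate(text: str) -> tuple[str, int]:
    """Join words split across line breaks.

    Returns the cleaned text and the number of joins, which could feed a
    summary such as "Dehyphenated Y tokens".
    """
    return _HYPHEN_BREAK.subn("", text)
```

A pattern this simple also joins genuine compounds that happen to break at the hyphen ("well-\nknown" becomes "wellknown"), so a real implementation likely needs extra checks.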

Protection Mechanisms

  • Lists: Never merged (bullet points, numbered lists)
  • Tables: Detected and preserved (spacing maintained)
  • Headers: Protected from being merged with content
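
One way to picture these protections is a merge loop that refuses to join list items or table-like rows. The Python sketch below is illustrative only: the detection patterns are assumptions, and header protection is only approximated by requiring a continuation line to start with a lowercase letter.

```python
import re

# Assumed heuristics, not the tool's actual detection rules.
_LIST_ITEM = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+")   # bullets, "1." or "1)"
_TABLE_ROW = re.compile(r"\S(?:\t| {2,})\S|\|")             # column gaps or pipes
_SENTENCE_END = re.compile(r"[.!?:;]$")


def merge_broken_lines(lines: list[str]) -> list[str]:
    """Merge a line into the previous one only when it looks like a continuation."""
    merged: list[str] = []
    for line in lines:
        stripped = line.strip()
        previous = merged[-1] if merged else ""
        can_merge = bool(
            previous.strip()
            and stripped
            and not _SENTENCE_END.search(previous.rstrip())
            and not _LIST_ITEM.match(previous)
            and not _LIST_ITEM.match(line)
            and not _TABLE_ROW.search(previous)
            and not _TABLE_ROW.search(line)
            and stripped[0].islower()
        )
        if can_merge:
            merged[-1] = previous.rstrip() + " " + stripped
        else:
            merged.append(line)
    return merged
```

With this sketch, a sentence broken across two lines is joined, while consecutive bullet items and aligned table rows are left untouched.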

CLI Enhancements

  • New Flags: --no-merge-lines, --no-dehyphenate, --no-normalize-ws, --no-normalize-unicode, --keep-headers
  • stdin/stdout Support: Pipe documents through DocStripper: cat file.pdf | python tool.py - --stdout > clean.txt
  • All cleaning options ON by default (can be disabled via flags)
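
Because every cleaning step is on by default and the flags only switch steps off, the command line maps naturally onto argparse's store_false actions. The sketch below is a guess at the shape of such a parser. The flag names come from the list above, but the dest names and the rest of the layout are assumptions, not tool.py's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: each --no-* flag turns off a step that defaults to on."""
    parser = argparse.ArgumentParser(prog="tool.py")
    parser.add_argument("files", nargs="+", help="input files, or '-' to read stdin")
    parser.add_argument("--no-merge-lines", dest="merge_lines", action="store_false")
    parser.add_argument("--no-dehyphenate", dest="dehyphenate", action="store_false")
    parser.add_argument("--no-normalize-ws", dest="normalize_ws", action="store_false")
    parser.add_argument("--no-normalize-unicode", dest="normalize_unicode", action="store_false")
    parser.add_argument("--keep-headers", dest="remove_headers", action="store_false")
    parser.add_argument("--stdout", action="store_true", help="write cleaned text to stdout")
    return parser
```

Parsing `--no-merge-lines - --stdout` with this sketch yields merge_lines=False while every other step stays enabled.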

Web UI Improvements

  • Brief Statistics Line: Shows "Merged X lines, Dehyphenated Y tokens..." in results summary
  • Consistent Options: Web checkboxes match CLI flags exactly
  • Clear List Button: Quickly reset and start over

Bug Fixes

  • Fixed header/footer merging issue: headers no longer get merged with content during line merging
  • Improved pattern recognition for multilingual headers (Russian "Страница X из Y")
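
Page-marker detection that covers both the English and Russian forms might look like the following Python sketch. The patterns are assumptions made for illustration; the tool's real patterns are not shown in these notes.

```python
import re

# Assumed patterns for lines that consist only of a page marker.
_PAGE_MARKERS = [
    re.compile(r"^\s*\d{1,4}\s*$"),                                       # bare page number
    re.compile(r"^\s*page\s+\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE),       # "Page X of Y"
    re.compile(r"^\s*страница\s+\d+(\s+из\s+\d+)?\s*$", re.IGNORECASE),   # "Страница X из Y"
]


def is_page_marker(line: str) -> bool:
    """Return True when the whole line is a page number or page-count marker."""
    return any(pattern.match(line) for pattern in _PAGE_MARKERS)
```

Anchoring each pattern to the full line keeps ordinary sentences that merely mention a page or a number from being removed.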

📊 What Gets Cleaned (Default Behavior)

Conservative Mode (Recommended)

✅ Page numbers (1, 2, 3...)
✅ Headers/footers ("Page X of Y", "Confidential", etc.)
✅ Repeating headers/footers across pages
✅ Duplicate lines
✅ Empty lines
✅ Punctuation-only lines (---, ***, ===)
✅ Hyphenation fixed (auto-\nmatic → automatic)

Aggressive Mode

All Conservative features plus:
✅ Merges broken lines (protects lists and tables)
✅ Normalizes whitespace (protects tables)

🛠️ Migration Guide

CLI

No breaking changes. Existing scripts continue to work, but now benefit from improved cleaning by default.

To disable specific features:

python tool.py --no-merge-lines --no-dehyphenate document.txt

Web

No changes required. Default settings are optimal for most users. Toggle "Advanced Options" to customize.

📝 Technical Details

  • Cleaning Order: De-hyphenation → Line Merging → Whitespace Normalization → Unicode Normalization → Line Filtering
  • Shared Logic: Web (JavaScript) and CLI (Python) implement identical cleaning rules
  • Performance: Optimized for large documents (tested on documents of 500+ pages)
  • Memory: Efficient streaming for CLI, page-wise processing for web
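
The cleaning order can be pictured as a list of stages applied in sequence. The Python sketch below uses deliberately simplified stand-in stages (all of them assumptions) to show one plausible reason for the order: once lines are merged, the line break that marks a hyphenated split is gone, so de-hyphenation has to run first.

```python
import re
from typing import Callable

Stage = Callable[[str], str]


def dehyphenate(text: str) -> str:
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)


def merge_lines(text: str) -> str:
    # Join a break that is followed by a lowercase letter and not preceded
    # by sentence-ending punctuation or another line break.
    return re.sub(r"(?<![.!?:;\n])\n(?=[a-z])", " ", text)


def normalize_whitespace(text: str) -> str:
    return re.sub(r"[ \t]{2,}", " ", text)


def normalize_unicode(text: str) -> str:
    return text.translate(str.maketrans({"\u2019": "'", "\u201c": '"', "\u201d": '"'}))


def filter_lines(text: str) -> str:
    # Drop empty lines and bare page numbers.
    kept = [ln for ln in text.split("\n") if ln.strip() and not ln.strip().isdigit()]
    return "\n".join(kept)


# Order taken from the "Cleaning Order" bullet; stage names are hypothetical.
PIPELINE: list[tuple[str, Stage]] = [
    ("dehyphenate", dehyphenate),
    ("merge_lines", merge_lines),
    ("normalize_ws", normalize_whitespace),
    ("normalize_unicode", normalize_unicode),
    ("filter_lines", filter_lines),
]


def clean(text: str, disabled: frozenset[str] = frozenset()) -> str:
    """Run every enabled stage in pipeline order."""
    for name, stage in PIPELINE:
        if name not in disabled:
            text = stage(text)
    return text
```

Swapping the first two stages turns "auto-\nmatic" into "auto- matic", which the de-hyphenation pattern can no longer repair.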

🙏 Credits

Based on competitor analysis and best practices from:

  • PyPDF, PyMuPDF (PDF extraction)
  • Unstructured, Docling (document processing)
  • Document Cleaner (cleaning heuristics)

📚 Documentation

  • Updated README with CLI flags and examples
  • Wiki Usage guide updated with stdin/stdout examples
  • Cleaning specification document added

Full Changelog: See GitHub Commits

v1.3.0 - PDF Support in Web Version

01 Nov 15:30

🎉 PDF Support Added to Web Version

✨ Major Feature

📄 PDF File Support - PDF files are now supported in both web and CLI versions!

  • Web: Automatic PDF support using PDF.js library (no installation needed)
  • CLI: PDF support via poppler-utils (as before)

🚀 New Features

  • PDF file upload and processing in web application
  • Automatic PDF text extraction with line structure preservation
  • PDF.js library integration (v3.11.174) from CDN
  • Improved text extraction algorithm that preserves line breaks

📝 Changes

  • Add PDF.js library from CDN
  • Implement extractTextFromPDF function using PDF.js
  • Update readTextFile to handle PDF files
  • Update file input to accept PDF files
  • Improve PDF text extraction to preserve line structure
  • Update documentation (README, Wiki) to reflect PDF support

🔧 Technical Details

  • Web: Uses PDF.js library automatically (no installation needed)
  • CLI: Requires pdftotext from poppler-utils (as before)
  • PDF extraction preserves line structure by grouping text items by Y position
  • Compatible with Fast Clean and Smart Clean modes
  • Fully tested with various PDF formats
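
Grouping text items by Y position can be sketched as below. This Python illustration assumes each extracted item carries its text plus x/y coordinates (in PDF.js these come from the item's transform matrix); the TextItem type, the tolerance value, and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    text: str
    x: float
    y: float


def group_into_lines(items: list[TextItem], y_tolerance: float = 2.0) -> list[str]:
    """Rebuild lines by clustering items whose y coordinates nearly match.

    PDF coordinates start at the bottom-left corner, so a larger y is higher
    on the page. Items are visited top to bottom, then ordered left to right.
    """
    lines: list[list[TextItem]] = []
    for item in sorted(items, key=lambda i: -i.y):
        if lines and abs(lines[-1][0].y - item.y) <= y_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    return [" ".join(i.text for i in sorted(line, key=lambda i: i.x)) for line in lines]
```

The tolerance absorbs small baseline differences between fragments on the same visual line; a value that is too large would merge adjacent lines in tightly set text.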

📚 Documentation Updates

  • Updated README.md with PDF support information
  • Updated Wiki pages (Home, Usage, FAQ, Installation)
  • Clarified differences between Web and CLI PDF support

✅ Testing

  • Tested PDF extraction with various PDF files
  • Verified compatibility with Fast Clean mode
  • Verified compatibility with Smart Clean mode
  • Confirmed proper line structure preservation

🎯 Compatibility

  • Breaking Changes: None
  • Backward Compatibility: All existing features remain unchanged
  • Browser Support: All modern browsers with JavaScript enabled

Try it now: https://kiku-jw.github.io/DocStripper/

v1.2.0 - Smart Clean with AI

31 Oct 19:56

🎉 Major Update: Smart Clean with AI

✨ New Features

  • 🤖 Smart Clean (Beta) - AI-powered cleaning using on-device LLM (WebLLM)

    • WebGPU-based inference for fast processing
    • Dynamic prompt generation based on user settings
    • Automatic fallback to Fast Clean if WebGPU unavailable
    • Progress tracking and cancellation support
    • Batch processing for large files (parallel chunk processing)
    • Adaptive chunk sizing for optimal performance
  • 🏷️ Mode Badges - Visual indicators showing which cleaning mode was used

  • 📈 Enhanced Statistics - Detailed breakdown of what was removed

🚀 Improvements

  • ⚡ Performance optimization with parallel batch processing
  • 🎯 Better error handling with fallback mechanisms
  • 📊 Adaptive chunking based on document length
  • 🔧 Settings integration - cleaning options customize AI behavior

📝 Full Changelog

See CHANGELOG.md for complete details.


Try it now: https://kiku-jw.github.io/DocStripper/

v1.0.0 - Initial Release

31 Oct 19:56

🎉 Initial Release

DocStripper - Batch document cleaner CLI tool

Features

  • 🚀 Fast & Lightweight - Uses only Python stdlib, no external packages
  • 🔒 Privacy-First - All processing happens offline
  • 📊 Dry-Run Mode - Preview changes before applying
  • 🔄 Undo Support - Restore files from backups
  • 🌍 Cross-Platform - Works on Windows, macOS, and Linux
  • 📚 Multiple Formats - Supports .txt, .docx, and .pdf files

What Gets Removed

  • Page numbers
  • Headers/Footers
  • Duplicate lines
  • Empty lines

See README.md for usage instructions.