03 Nov 18:51

kiku-jw

a3c296f

v2.1.0 - UX & Distribution Release (Latest)

DocStripper v2.1.0 - UX & Distribution Release

🎯 Major Improvements

Cleaning Temperament Slider (Web)

Replaced "Conservative" and "Aggressive" radio buttons with an intuitive 4-mode slider:

Gentle (Recommended): Safe defaults, preserves formatting
Moderate: Balanced cleaning with line merging
Thorough: Complete cleaning with Unicode normalization, preserves paragraph spacing
Aggressive: Maximum cleaning, removes paragraph spacing for compact output

Privacy-First UX

🔒 Onboarding Tooltip: First-time visitors see a privacy message ("Everything works offline, nothing sent")
📡 Works Offline Badge: Visual indicator in the UI corner
Privacy-Friendly Analytics: Integrated Plausible.io (no cookies, GDPR-compliant)

Performance Improvements

WebWorker Integration: Large files (>100KB or >5000 lines) are now processed in a background thread
No UI Freezing: Long PDF processing no longer blocks the interface
Smart Mode Selection: Automatically uses WebWorker for Fast Clean on large files
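
The size check described above can be sketched as follows. This is an illustration in Python, not the actual JavaScript in `docs/assets/app.js`; the function and constant names are hypothetical, and treating 1 KB as 1024 bytes is an assumption.

```python
# Thresholds quoted in the release notes; the names are hypothetical.
MAX_INLINE_BYTES = 100 * 1024  # ">100KB", assuming 1 KB = 1024 bytes
MAX_INLINE_LINES = 5000        # ">5000 lines"


def should_use_background_worker(text: str) -> bool:
    """Return True when the input is large enough to hand off to a worker."""
    size_bytes = len(text.encode("utf-8"))
    line_count = text.count("\n") + 1 if text else 0
    return size_bytes > MAX_INLINE_BYTES or line_count > MAX_INLINE_LINES
```

Either condition alone is enough to trigger background processing, which matches the "or" in the release note.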

Enhanced Feedback

✅ ZIP Download Notifications: Shows "X files cleaned" after batch download
Settings Restoration Toast: Visual confirmation when previous settings are restored
Support Snackbar: Non-intrusive "Buy a coffee" message after cleaning (once per session)

Distribution Ready

Homebrew Tap: Created kiku-jw/homebrew-docstripper repository
PyPI Package: Ready for distribution (pip install docstripper)
Installation Guides: Comprehensive docs in INSTALL.md and PUBLISH_GUIDE.md

Web Interface Enhancements

Normalize Unicode Option: Added checkbox in Advanced Options (11 options total)
Gumroad Support Link: Added to floating support button, snackbar, and README
Improved Mobile Layout: Better spacing and alignment on small screens

🔧 Technical Changes

Web (`docs/assets/app.js`)

Refactored cleaning mode selection into a temperament slider system
Implemented updateTemperamentFromValue() and applyTemperamentDefaults() methods
Added cleaner.worker.js WebWorker for background processing
Enhanced normalizeUnicode option (converts smart quotes, dashes to ASCII)
Improved settings persistence and restoration UI feedback

CSS (`docs/assets/style.css`)

Added responsive styles for temperament slider
Onboarding tooltip animations and responsive adjustments
Works offline badge styles with mobile hiding
Support snackbar animations

New Files

docs/assets/cleaner.worker.js: WebWorker implementation for cleaning
docstripper.rb: Homebrew formula
INSTALL.md: Installation instructions for all methods
PUBLISH_GUIDE.md: Distribution guide
HOMEBREW_TAP_SETUP.md: Homebrew tap setup instructions

🐛 Bug Fixes

Fixed Thorough vs Aggressive mode differentiation (paragraph spacing preservation)
Improved temperament slider step values (0, 33, 66, 100 for 4 distinct modes)
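
A minimal sketch of how the four step values could map to modes. The release notes name `updateTemperamentFromValue()` but do not show it, so the Python function below is a hypothetical stand-in, and snapping to the nearest step is an assumption.

```python
# Slider step values from the release notes, mapped to the four modes.
TEMPERAMENT_STEPS = {
    0: "gentle",
    33: "moderate",
    66: "thorough",
    100: "aggressive",
}


def temperament_from_value(value: int) -> str:
    """Snap a slider value (0-100) to the nearest defined step."""
    nearest = min(TEMPERAMENT_STEPS, key=lambda step: abs(step - value))
    return TEMPERAMENT_STEPS[nearest]
```

With exact step values the lookup is trivial; the nearest-step rule only matters if the slider can report intermediate values.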

📊 Cleaning Modes Comparison

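| Feature | Gentle | Moderate | Thorough | Aggressive |
|---|---|---|---|---|
| Page numbers removed | ✅ | ✅ | ✅ | ✅ |
| Headers/footers removed | ✅ | ✅ | ✅ | ✅ |
| Duplicates removed | ✅ | ✅ | ✅ | ✅ |
| Hyphenation fix | ✅ | ✅ | ✅ | ✅ |
| Line merging | ❌ | ✅ | ✅ | ✅ |
| Whitespace normalization | ❌ | ❌ | ✅ | ✅ |
| Unicode normalization | ❌ | ❌ | ✅ | ✅ |
| Paragraph spacing preserved | ✅ | ✅ | ✅ | ❌ |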
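
The comparison above can also be read as a set of per-mode defaults. The sketch below encodes it as a Python dictionary; the option names are hypothetical and do not necessarily match the keys used by `applyTemperamentDefaults()` in the web app.

```python
# Options that the comparison shows as enabled in every mode.
ALWAYS_ON = ("page_numbers", "headers_footers", "duplicates", "dehyphenate")

# Options that differ between modes, as listed in the comparison.
MODE_DEFAULTS = {
    "gentle": {
        "merge_lines": False, "normalize_whitespace": False,
        "normalize_unicode": False, "keep_paragraph_spacing": True,
    },
    "moderate": {
        "merge_lines": True, "normalize_whitespace": False,
        "normalize_unicode": False, "keep_paragraph_spacing": True,
    },
    "thorough": {
        "merge_lines": True, "normalize_whitespace": True,
        "normalize_unicode": True, "keep_paragraph_spacing": True,
    },
    "aggressive": {
        "merge_lines": True, "normalize_whitespace": True,
        "normalize_unicode": True, "keep_paragraph_spacing": False,
    },
}


def options_for(mode: str) -> dict:
    """Build the full option set for a mode name."""
    options = {name: True for name in ALWAYS_ON}
    options.update(MODE_DEFAULTS[mode])
    return options
```

In this encoding, Thorough and Aggressive differ only in `keep_paragraph_spacing`, which is the distinction the bug fix in this release addresses.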

🛠️ Installation

Web

No installation needed: https://kiku-jw.github.io/DocStripper/

CLI - Homebrew (macOS)

brew tap kiku-jw/docstripper
brew install docstripper
docstripper document.txt

CLI - PyPI (All Platforms)

pip install docstripper
docstripper document.txt

CLI - Manual

git clone https://github.com/kiku-jw/DocStripper.git
cd DocStripper
python tool.py document.txt

🙏 Support

If DocStripper saves you time, consider supporting the project:

☕ Buy a coffee on Gumroad

📚 Documentation

Updated README with cleaning temperament descriptions
Wiki Usage guide updated
Comprehensive installation guide (INSTALL.md)

Full Changelog: See GitHub Commits

03 Nov 17:25

kiku-jw

v2.0.0

39292ec

DocStripper v2.0.0 - Quality Release

🎯 Major Improvements

Cleaning Pipeline v1 (Critical Upgrade)

Unified, production-ready cleaning logic with smart defaults enabled:

Line Merging: Automatically merges lines broken mid-sentence (protects lists, tables, headers)
De-hyphenation: Fixes words split across line breaks (auto-\nmatic → automatic)
Header/Footer Removal: Removes page numbers, "Page X of Y", and repeating headers/footers across pages
Whitespace Normalization: Collapses multiple spaces, normalizes tabs (protects tables)
Unicode Normalization: Converts smart quotes and dashes to ASCII equivalents
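
Two of the steps above, de-hyphenation and Unicode normalization, can be illustrated with a short sketch. This is a simplified approximation and not the code from `tool.py`; the function names and the exact character map are assumptions.

```python
import re

# A hyphen at end of line, with word characters on both sides of the break.
_HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")

# Smart quotes and dashes mapped to ASCII equivalents (assumed subset).
_ASCII_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
})


def dehyphenate(text: str) -> str:
    """Rejoin words that were split across a line break."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


def normalize_unicode(text: str) -> str:
    """Replace smart quotes and dashes with ASCII characters."""
    return text.translate(_ASCII_MAP)
```

A naive rule like this also joins genuine compounds that happen to break at the hyphen (for example "well-" / "known"), so a production implementation would likely need extra checks.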

Protection Mechanisms

Lists: Never merged (bullet points, numbered lists)
Tables: Detected and preserved (spacing maintained)
Headers: Protected from being merged with content
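
One way to picture these protections is a merge loop that refuses to touch protected lines. The heuristics below (bullet and number prefixes, tabs or wide gaps for tables, all-caps for headers) are assumptions chosen for illustration; the release notes do not specify how DocStripper detects each case.

```python
import re

_LIST_ITEM = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+")
_TABLE_GAP = re.compile(r"\S\s{3,}\S")
_SENTENCE_END = re.compile(r"[.!?:;]\s*$")


def is_protected(line: str) -> bool:
    """Heuristically flag blank lines, list items, table rows, and headers."""
    stripped = line.strip()
    if not stripped:
        return True
    if _LIST_ITEM.match(line):
        return True
    if "\t" in line or _TABLE_GAP.search(line):
        return True
    return stripped.isupper()  # assumed header heuristic


def merge_broken_lines(lines):
    """Join a line onto the previous one unless either line is protected."""
    merged = []
    for line in lines:
        can_merge = (
            merged
            and not is_protected(merged[-1])
            and not is_protected(line)
            and not _SENTENCE_END.search(merged[-1])
        )
        if can_merge:
            merged[-1] = merged[-1].rstrip() + " " + line.strip()
        else:
            merged.append(line)
    return merged
```

For example, `merge_broken_lines(["This sentence was", "broken in two.", "- item one"])` joins the first two lines and leaves the list item alone.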

CLI Enhancements

New Flags: --no-merge-lines, --no-dehyphenate, --no-normalize-ws, --no-normalize-unicode, --keep-headers
stdin/stdout Support: Pipe documents through DocStripper: cat file.pdf | tool.py - --stdout > clean.txt
All cleaning options ON by default (can be disabled via flags)
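
The "on by default, off via flag" pattern can be sketched with `argparse`. The flag names come from the list above; the destination names, help text, and overall parser layout are assumptions and may differ from the real `tool.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser where every cleaning step defaults to enabled."""
    parser = argparse.ArgumentParser(prog="tool.py")
    parser.add_argument("files", nargs="+", help='input files, or "-" for stdin')
    parser.add_argument("--stdout", action="store_true",
                        help="write cleaned text to standard output")
    # store_false gives each option a default of True.
    parser.add_argument("--no-merge-lines", dest="merge_lines", action="store_false")
    parser.add_argument("--no-dehyphenate", dest="dehyphenate", action="store_false")
    parser.add_argument("--no-normalize-ws", dest="normalize_ws", action="store_false")
    parser.add_argument("--no-normalize-unicode", dest="normalize_unicode",
                        action="store_false")
    parser.add_argument("--keep-headers", dest="remove_headers", action="store_false")
    return parser
```

Parsing `["--no-merge-lines", "document.txt"]` with this parser leaves every other cleaning option enabled, which mirrors the default behavior described in these notes.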

Web UI Improvements

Brief Statistics Line: Shows "Merged X lines, Dehyphenated Y tokens..." in results summary
Consistent Options: Web checkboxes match CLI flags exactly
Clear List Button: Quickly reset and start over

Bug Fixes

Fixed header/footer merging issue: headers no longer get merged with content during line merging
Improved pattern recognition for multilingual headers (Russian "Страница X из Y")
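
As an illustration of multilingual page-marker matching, the sketch below recognizes bare page numbers plus the English and Russian "Page X of Y" forms mentioned in these notes. The regular expression is an assumption and is probably narrower than the patterns DocStripper actually ships.

```python
import re

# Bare page number, "Page X of Y", or Russian "Страница X из Y".
_PAGE_MARKER = re.compile(
    r"^\s*(?:page\s+\d+\s+of\s+\d+|страница\s+\d+\s+из\s+\d+|\d+)\s*$",
    re.IGNORECASE,
)


def is_page_marker(line: str) -> bool:
    """Return True if the whole line is a page number or page marker."""
    return bool(_PAGE_MARKER.match(line))
```

Anchoring the pattern to the full line keeps ordinary sentences that merely contain the word "page" from being removed.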

📊 What Gets Cleaned (Default Behavior)

Conservative Mode (Recommended)

✅ Page numbers (1, 2, 3...)
✅ Headers/footers ("Page X of Y", "Confidential", etc.)
✅ Repeating headers/footers across pages
✅ Duplicate lines
✅ Empty lines
✅ Punctuation-only lines (---, ***, ===)
✅ Hyphenation fixed (auto-\nmatic → automatic)
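
Detecting repeating headers and footers can be sketched as a frequency count across pages. The function names and the 60% threshold below are assumptions; the release notes do not state the rule DocStripper uses.

```python
from collections import Counter


def find_repeating_lines(pages, min_fraction=0.6):
    """Return lines that appear on at least min_fraction of pages (minimum 2)."""
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        for line in {ln.strip() for ln in page if ln.strip()}:
            counts[line] += 1
    threshold = max(2, int(len(pages) * min_fraction))
    return {line for line, n in counts.items() if n >= threshold}


def strip_repeating_lines(pages, min_fraction=0.6):
    """Drop lines identified as repeating headers or footers."""
    repeated = find_repeating_lines(pages, min_fraction)
    return [[ln for ln in page if ln.strip() not in repeated] for page in pages]
```

Given three pages that each start with "Confidential", this removes that line from every page and keeps the body text.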

Aggressive Mode

All Conservative features plus:
✅ Merges broken lines (protects lists and tables)
✅ Normalizes whitespace (protects tables)

🛠️ Migration Guide

CLI

No breaking changes. Existing scripts continue to work, but now benefit from improved cleaning by default.

To disable specific features:

python tool.py --no-merge-lines --no-dehyphenate document.txt

Web

No changes required. Default settings are optimal for most users. Toggle "Advanced Options" to customize.

📝 Technical Details

Cleaning Order: De-hyphenation → Line Merging → Whitespace Normalization → Unicode Normalization → Line Filtering
Shared Logic: Web (JavaScript) and CLI (Python) implement identical cleaning rules
Performance: Optimized for large documents (tested on documents of 500+ pages)
Memory: Efficient streaming for CLI, page-wise processing for web
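
The cleaning order listed above can be pictured as a list of stages applied in sequence. Each stage below is a deliberately simplified stand-in with a hypothetical name; only the ordering is taken from the release notes.

```python
import re

_ASCII_MAP = str.maketrans({"\u2018": "'", "\u2019": "'",
                            "\u201c": '"', "\u201d": '"',
                            "\u2013": "-", "\u2014": "-"})


def dehyphenate(text):
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def merge_lines(text):
    # Join a break when the previous line lacks end punctuation
    # and the next line starts with a lowercase letter.
    return re.sub(r"(?<![.!?:;\n])\n(?=[a-z])", " ", text)


def normalize_whitespace(text):
    return re.sub(r"[ \t]{2,}", " ", text)


def normalize_unicode(text):
    return text.translate(_ASCII_MAP)


def filter_lines(text):
    # Drop empty lines and bare page numbers.
    kept = [ln for ln in text.split("\n")
            if ln.strip() and not ln.strip().isdigit()]
    return "\n".join(kept)


# Order taken from the release notes.
PIPELINE = [dehyphenate, merge_lines, normalize_whitespace,
            normalize_unicode, filter_lines]


def clean(text):
    for stage in PIPELINE:
        text = stage(text)
    return text
```

Running de-hyphenation first matters here: once the split word is rejoined, the line-merging stage sees an ordinary line break rather than a trailing hyphen.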

🙏 Credits

Based on competitor analysis and best practices from:

PyPDF, PyMuPDF (PDF extraction)
Unstructured, Docling (document processing)
Document Cleaner (cleaning heuristics)

📚 Documentation

Updated README with CLI flags and examples
Wiki Usage guide updated with stdin/stdout examples
Cleaning specification document added

Full Changelog: See GitHub Commits

01 Nov 15:30

kiku-jw

v1.3.0

5da4433

v1.3.0 - PDF Support in Web Version

🎉 PDF Support Added to Web Version

✨ Major Feature

📄 PDF File Support - PDF files are now supported in both web and CLI versions!

Web: Automatic PDF support using PDF.js library (no installation needed)
CLI: PDF support via poppler-utils (as before)

🚀 New Features

PDF file upload and processing in web application
Automatic PDF text extraction with line structure preservation
PDF.js library integration (v3.11.174) from CDN
Improved text extraction algorithm that preserves line breaks

📝 Changes

Add PDF.js library from CDN
Implement extractTextFromPDF function using PDF.js
Update readTextFile to handle PDF files
Update file input to accept PDF files
Improve PDF text extraction to preserve line structure
Update documentation (README, Wiki) to reflect PDF support

🔧 Technical Details

Web: Uses PDF.js library automatically (no installation needed)
CLI: Requires pdftotext from poppler-utils (as before)
PDF extraction preserves line structure by grouping text items by Y position
Compatible with Fast Clean and Smart Clean modes
Fully tested with various PDF formats
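
Grouping text items by Y position can be sketched as follows. This is a Python illustration of the idea, not the `extractTextFromPDF` function from the web app; the `(text, x, y)` tuple shape, the tolerance value, and the assumption that larger Y means higher on the page are all hypothetical.

```python
def group_items_into_lines(items, y_tolerance=2.0):
    """Group (text, x, y) items into lines, top to bottom, left to right."""
    lines = []  # each entry: (y of the first item on the line, [(x, text), ...])
    # Assumes PDF-style coordinates where larger y is higher on the page.
    for text, x, y in sorted(items, key=lambda item: -item[2]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    # Within each line, order the fragments by x position.
    return [" ".join(text for _, text in sorted(row)) for _, row in lines]
```

The tolerance absorbs small baseline differences between fragments of the same visual line, which is what lets the extracted text keep its original line breaks.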

📚 Documentation Updates

Updated README.md with PDF support information
Updated Wiki pages (Home, Usage, FAQ, Installation)
Clarified differences between Web and CLI PDF support

✅ Testing

Tested PDF extraction with various PDF files
Verified compatibility with Fast Clean mode
Verified compatibility with Smart Clean mode
Confirmed proper line structure preservation

🎯 Compatibility

Breaking Changes: None
Backward Compatibility: All existing features remain unchanged
Browser Support: All modern browsers with JavaScript enabled

Try it now: https://kiku-jw.github.io/DocStripper/

31 Oct 19:56

kiku-jw

v1.2.0

bce03fc

v1.2.0 - Smart Clean with AI

🎉 Major Update: Smart Clean with AI

✨ New Features

🤖 Smart Clean (Beta) - AI-powered cleaning using on-device LLM (WebLLM)
- WebGPU-based inference for fast processing
- Dynamic prompt generation based on user settings
- Automatic fallback to Fast Clean if WebGPU unavailable
- Progress tracking and cancellation support
- Batch processing for large files (parallel chunk processing)
- Adaptive chunk sizing for optimal performance
🏷️ Mode Badges - Visual indicators showing which cleaning mode was used
📈 Enhanced Statistics - Detailed breakdown of what was removed

🚀 Improvements

⚡ Performance optimization with parallel batch processing
🎯 Better error handling with fallback mechanisms
📊 Adaptive chunking based on document length
🔧 Settings integration - cleaning options customize AI behavior

📝 Full Changelog

See CHANGELOG.md for complete details.

Try it now: https://kiku-jw.github.io/DocStripper/

31 Oct 19:56

kiku-jw

v1.0.0

9a206c6

v1.0.0 - Initial Release

🎉 Initial Release

DocStripper - Batch document cleaner CLI tool

Features

🚀 Fast & Lightweight - Uses only Python stdlib, no external packages
🔒 Privacy-First - All processing happens offline
📊 Dry-Run Mode - Preview changes before applying
🔄 Undo Support - Restore files from backups
🌍 Cross-Platform - Works on Windows, macOS, and Linux
📚 Multiple Formats - Supports .txt, .docx, and .pdf files

What Gets Removed

Page numbers
Headers/Footers
Duplicate lines
Empty lines

See README.md for usage instructions.

Releases: KikuAI-Lab/DocStripper
