kiku edited this page Nov 1, 2025 · 4 revisions

Usage Guide

Web Application

Quick Start

  1. Visit https://kiku-jw.github.io/DocStripper/
  2. Click "Upload Your Documents" or drag & drop files
  3. Choose cleaning mode:
    • Fast Clean: Instant rule-based cleaning
    • Smart Clean: AI-powered cleaning (requires WebGPU)
  4. Choose cleaning mode type:
    • Conservative (default): Safe defaults, preserves lists and tables
    • Aggressive: Also merges broken lines and normalizes whitespace
  5. Configure cleaning options (optional - hidden in "Advanced Options" menu)
  6. Click "Start Cleaning"
  7. Review side-by-side preview (Original | Cleaned)
  8. Download or copy the cleaned results

Note: Individual cleaning options are collapsed under the "Advanced Options" menu by default. Only the Fast/Smart Clean selector and the Conservative/Aggressive selector stay visible, which keeps the interface uncluttered.

Cleaning Modes

Fast Clean

  • Speed: Instant
  • Method: Rule-based pattern matching
  • Best for: Standard documents with predictable patterns

Smart Clean (Beta)

  • Speed: Slower (depends on document size)
  • Method: AI-powered with on-device LLM
  • Requirements: WebGPU support, ~100-200 MB one-time download
  • Best for: Complex documents with unusual patterns
  • Mode-Aware: Conservative/Aggressive modes affect LLM prompts
    • Conservative: Cautious prompts that preserve structure
    • Aggressive: Thorough prompts that allow merging and normalization
  • Post-Processing: After LLM processing, applies:
    • Dehyphenation (if enabled)
    • Merge broken lines (if enabled in Aggressive mode)
    • Whitespace normalization (if enabled in Aggressive mode)
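
The conditional post-processing described above can be sketched as a small pipeline. This is an illustration only, not DocStripper's source: the `PostOptions` class, the `PRESETS` table, and the `post_process` function are hypothetical names, and the sketch assumes the steps run in the order they are listed.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class PostOptions:
    # Hypothetical option flags; the defaults mirror Conservative mode.
    dehyphenate: bool = True
    merge_broken_lines: bool = False
    normalize_whitespace: bool = False

# Hypothetical presets for the two mode types.
PRESETS: Dict[str, PostOptions] = {
    "conservative": PostOptions(),
    "aggressive": PostOptions(merge_broken_lines=True, normalize_whitespace=True),
}

Step = Callable[[str], str]

def post_process(text: str, mode: str, steps: Dict[str, Step]) -> str:
    """Run only the enabled steps, in the order assumed from the list above."""
    options = PRESETS[mode]
    for name in ("dehyphenate", "merge_broken_lines", "normalize_whitespace"):
        if getattr(options, name):
            text = steps[name](text)
    return text
```

The step functions are passed in by the caller, so the sketch stays independent of how each cleaning step is actually implemented.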

Cleaning Mode Types

Conservative Mode (Default - Recommended)

Safe defaults for most users:

  • ✅ Removes noise (headers, footers, page numbers, duplicates)
  • ✅ Dehyphenates broken words (safe: only when the continuation starts with a lowercase letter)
  • ✅ Removes repeating headers/footers across pages (≥70% threshold)
  • ✅ Preserves lists and tables
  • ✅ Never merges lines or normalizes whitespace

Best for: Most documents, especially those with lists or tables
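
The ≥70% rule for repeating headers and footers could look roughly like the sketch below. It rests on assumptions that the guide does not confirm: pages are separated by form feed characters (`\f`), only the first and last non-blank line of each page are candidates, and the function name `remove_repeating_edges` is hypothetical.

```python
from collections import Counter

def remove_repeating_edges(text: str, threshold: float = 0.7) -> str:
    """Drop first/last page lines that recur on at least `threshold` of pages."""
    pages = [page.split("\n") for page in text.split("\f")]
    if len(pages) < 2:
        return text  # nothing can repeat across a single page

    def edge_indexes(page):
        # Indexes of the first and last non-blank line on a page.
        filled = [i for i, line in enumerate(page) if line.strip()]
        return {filled[0], filled[-1]} if filled else set()

    # Each page votes at most once for a given candidate line.
    counts = Counter()
    for page in pages:
        counts.update({page[i].strip() for i in edge_indexes(page)})

    repeated = {line for line, n in counts.items() if n / len(pages) >= threshold}

    cleaned = []
    for page in pages:
        # Remove only the edge occurrences, never matching lines in the body.
        drop = {i for i in edge_indexes(page) if page[i].strip() in repeated}
        cleaned.append("\n".join(l for i, l in enumerate(page) if i not in drop))
    return "\f".join(cleaned)
```

In this sketch a line that appears on two of three pages (about 67%) stays in place, while one that appears on all three is removed.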

Aggressive Mode

Adds line merging and whitespace normalization on top of Conservative mode:

  • ✅ All Conservative features enabled
  • ✅ Merges broken lines (with list/table protection)
  • ✅ Normalizes whitespace (with table protection)
  • ⚠️ Use with caution: may affect formatting in some documents

Best for: Simple text documents without complex formatting
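
Line merging with list and table protection might work along these lines. This is a sketch under assumptions, not the tool's algorithm: the heuristics (a line is a continuation if it starts with a lowercase letter and the previous line does not end in sentence punctuation; a table row has three or more cells separated by two or more spaces) and the names `merge_broken_lines`, `LIST_ITEM`, and `looks_like_table` are all hypothetical.

```python
import re

# Bullets, "1." / "1)" and "a." / "a)" style list markers.
LIST_ITEM = re.compile(r"^\s*(?:[-*•·]|\d+[.)]|[A-Za-z][.)])\s+")
SENTENCE_END = re.compile(r"[.!?:;]$")

def looks_like_table(line: str) -> bool:
    # Crude heuristic: a tab, or 3+ cells separated by runs of 2+ spaces.
    return "\t" in line or len(re.split(r" {2,}", line.strip())) >= 3

def merge_broken_lines(lines):
    merged = []
    for line in lines:
        prev = merged[-1] if merged else ""
        can_merge = bool(
            prev.strip()
            and line.strip()
            and not SENTENCE_END.search(prev.rstrip())
            and line.lstrip()[:1].islower()
            and not LIST_ITEM.match(prev)
            and not LIST_ITEM.match(line)
            and not looks_like_table(prev)
            and not looks_like_table(line)
        )
        if can_merge:
            merged[-1] = prev.rstrip() + " " + line.lstrip()
        else:
            merged.append(line)
    return merged
```

The sketch errs on the side of leaving lines alone: anything adjacent to a list item or a table-like row is never merged, which matches the caution advised for Aggressive mode.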

Cleaning Options

Basic Options

  • Remove Empty Lines: Removes blank and whitespace-only lines
  • Remove Page Numbers: Removes lines with only digits (1, 2, 3...), Roman numerals (I, II, III), or letters (A, B, C)
  • Remove Headers/Footers: Removes common patterns (Page X of Y, Confidential, etc.)
  • Remove Repeating Headers/Footers: Removes headers/footers that appear on ≥70% of pages (detected automatically)
  • Remove Duplicates: Collapses consecutive identical lines
  • Remove Punctuation Lines: Removes lines with only symbols (---, ***, ===) or single bullets (•, *, ·)
  • Preserve Paragraph Spacing: Keeps one empty line between paragraphs
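
Several of the basic options reduce to simple line filters. The sketch below illustrates empty-line, page-number, punctuation-line, and duplicate removal. The regular expressions and the `basic_clean` function are assumptions made for illustration, and the sketch ignores the Preserve Paragraph Spacing option.

```python
import re
from itertools import groupby

# A line that is only digits, an uppercase Roman numeral, or a single capital letter.
PAGE_NUMBER = re.compile(
    r"\d+"
    r"|(?=[MDCLXVI])M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
    r"|[A-Z]"
)
# A line made up only of rule characters or bullets (---, ***, ===, •).
PUNCT_ONLY = re.compile(r"[-*=_•·]+")

def basic_clean(lines):
    # Collapse runs of consecutive identical lines into one.
    collapsed = [line for line, _ in groupby(lines)]
    # Then drop blank lines, page markers, and punctuation-only lines.
    return [
        line
        for line in collapsed
        if line.strip()
        and not PAGE_NUMBER.fullmatch(line.strip())
        and not PUNCT_ONLY.fullmatch(line.strip())
    ]
```

The Roman numeral pattern is deliberately strict, so an ordinary word built from the same letters, such as "CIVIL", is kept.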

Advanced Options (Collapsible Menu)

All cleaning options are available in the "Advanced Options" menu:

  • Remove Empty Lines: Removes blank and whitespace-only lines
  • Remove Page Numbers: Removes lines with only digits, Roman numerals, or letters
  • Remove Headers/Footers: Removes common patterns (includes repeating headers/footers)
  • Remove Duplicates: Collapses consecutive identical lines
  • Remove Punctuation Lines: Removes lines with only symbols or single bullets
  • Preserve Paragraph Spacing: Keeps one empty line between paragraphs
  • Dehyphenate: Joins words split across a line break, e.g. "auto-\nmatic" → "automatic", only when the continuation starts with a lowercase letter
  • Merge Broken Lines: Merges lines broken mid-sentence (protects lists and tables) - Enabled in Aggressive mode
  • Normalize Whitespace: Collapses multiple spaces, normalizes tabs (protects tables) - Enabled in Aggressive mode
  • Keep Table Spacing: Preserves spacing in detected table blocks when normalizing whitespace
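
As a rough illustration of the Dehyphenate, Normalize Whitespace, and Keep Table Spacing options, here is a minimal sketch. The function names, the regular expression, and the table heuristic (three or more cells separated by two or more spaces) are assumptions, not the tool's actual rules.

```python
import re

# A letter, a hyphen at the end of a line, then a lowercase continuation.
HYPHEN_BREAK = re.compile(r"([A-Za-z])-\n([a-z])")

def dehyphenate(text: str) -> str:
    # Capitalized continuations (e.g. proper nouns) are left untouched.
    return HYPHEN_BREAK.sub(r"\1\2", text)

def normalize_whitespace(line: str, keep_table_spacing: bool = True) -> str:
    # Assumed heuristic: 3+ cells separated by runs of 2+ spaces form a table row.
    is_table_row = len(re.split(r" {2,}", line.strip())) >= 3
    if keep_table_spacing and is_table_row:
        return line
    # Collapse runs of spaces and tabs into one space; trim the line end.
    return re.sub(r"[ \t]+", " ", line).rstrip()
```

One limitation of this approach: a genuinely hyphenated compound that happens to break at the hyphen, such as "well-\nknown", would also be joined.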

Preview & Results

  • Side-by-Side Preview: Compare Original | Cleaned text side-by-side
  • Virtualization: Large files (>1 MB) use virtualization for smooth scrolling
  • Statistics: Detailed stats showing what was removed (lines, duplicates, headers, etc.)
  • Download: Download cleaned file with _cleaned.txt suffix
  • Copy: Copy cleaned text to clipboard with one click

Settings Persistence

Your cleaning preferences are automatically saved in your browser's localStorage:

  • Cleaning mode (Fast/Smart)
  • Cleaning mode type (Conservative/Aggressive)
  • All checkbox settings

Settings persist across page reloads and browser sessions.

CLI Tool

Basic Usage

# Clean a single file
python tool.py document.txt

# Clean multiple files
python tool.py file1.txt file2.txt file3.docx

# Preview changes without modifying files
python tool.py --dry-run document.txt

# Undo last operation
python tool.py --undo

Supported Formats

  • .txt - Plain text files
  • .docx - Microsoft Word documents
  • .pdf - PDF files
    • Web: Automatic (uses PDF.js library - no installation needed)
    • CLI: Requires pdftotext from poppler-utils
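
For reference, calling pdftotext from Python typically looks like the sketch below. How `tool.py` actually invokes it is not documented here, so the function names `pdftotext_command` and `pdf_to_text` are hypothetical. The only pdftotext behavior relied on is that `-` as the output file writes the text to standard output.

```python
import shutil
import subprocess

def pdftotext_command(pdf_path: str) -> list:
    # "-" as the output file makes pdftotext write to stdout.
    return ["pdftotext", pdf_path, "-"]

def pdf_to_text(pdf_path: str) -> str:
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils")
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

Checking for the binary up front gives a clearer error than letting `subprocess.run` fail with `FileNotFoundError`.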

Command Options

python tool.py [OPTIONS] [FILES...]

Options:
  -h, --help     Show help message
  --dry-run      Preview changes without modifying files
  --undo         Restore files from last operation

Note: The CLI uses Conservative mode by default. Advanced features (line merging, whitespace normalization) are available programmatically but are disabled by default for safety.
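
The option set above maps naturally onto `argparse`. The following sketch shows how such a parser could be declared. It is not taken from `tool.py`, and the `build_parser` function and the help strings are assumptions.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # -h/--help is added by argparse automatically.
    parser = argparse.ArgumentParser(
        prog="tool.py",
        description="Clean text, DOCX, and PDF documents.",
    )
    parser.add_argument("files", nargs="*", metavar="FILES",
                        help="files to clean")
    parser.add_argument("--dry-run", action="store_true",
                        help="preview changes without modifying files")
    parser.add_argument("--undo", action="store_true",
                        help="restore files from last operation")
    return parser
```

With `nargs="*"`, the parser accepts `--undo` on its own without any file arguments, as in the usage examples.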

Examples

Example 1: Clean a single document

python tool.py report.txt

Example 2: Clean multiple documents

python tool.py document1.txt document2.docx document3.pdf

Example 3: Preview before cleaning

python tool.py --dry-run important_document.txt

Example 4: Undo last operation

python tool.py --undo

Output

  • Original files are backed up with .bak extension
  • Processed files replace originals
  • Statistics are shown in console:
    • Lines removed
    • Duplicates collapsed
    • Empty lines removed
    • Headers/footers removed
    • Punctuation lines removed
    • Dehyphenated tokens
    • Repeating headers/footers removed
    • Merged lines (if enabled)
  • Operation log saved to .strip-log
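
The backup, replace, and undo cycle can be sketched as follows. The on-disk details are assumptions: the guide names the `.bak` extension and the `.strip-log` file but not their exact layout, so the appended `.bak` suffix, the JSON-lines log format, and the functions `apply_cleaning` and `undo_last` are hypothetical.

```python
import json
import shutil
from pathlib import Path

LOG_NAME = ".strip-log"  # file name from the guide; JSON-lines layout is assumed

def apply_cleaning(cleaned: dict, log_dir: Path) -> None:
    """Back up each file, overwrite it, and log the operation as one entry."""
    pairs = []
    for path, text in cleaned.items():
        path = Path(path)
        backup = path.with_name(path.name + ".bak")  # assumed naming scheme
        shutil.copy2(path, backup)
        path.write_text(text, encoding="utf-8")
        pairs.append([str(path), str(backup)])
    with (Path(log_dir) / LOG_NAME).open("a", encoding="utf-8") as log:
        log.write(json.dumps(pairs) + "\n")

def undo_last(log_dir: Path) -> None:
    """Restore every file touched by the most recent logged operation."""
    log_path = Path(log_dir) / LOG_NAME
    entries = log_path.read_text(encoding="utf-8").splitlines()
    for original, backup in json.loads(entries[-1]):
        shutil.copy2(backup, original)
    # Drop the undone entry so a second undo reaches the previous operation.
    log_path.write_text("".join(e + "\n" for e in entries[:-1]), encoding="utf-8")
```

Logging one entry per operation, rather than per file, lets a single undo restore every file from a multi-file run.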

Best Practices

  1. Test before committing - Use --dry-run or work on copies of your files
  2. Backup important files - The tool creates backups, but extra backups never hurt
  3. Review statistics - Check what was removed before finalizing
  4. Use the appropriate mode - Fast Clean for simple documents, Smart Clean for complex ones
  5. Start with Conservative mode - It's safer and preserves formatting
  6. Use side-by-side preview - Always review changes before downloading
  7. Check lists and tables - Ensure they're preserved correctly, especially in Aggressive mode

Troubleshooting

See Troubleshooting Guide for common issues and solutions.

Testing

See SELF_TESTS.md for manual test steps and expected results.

Test fixtures are available in the examples/ directory:

  • fixture1_headers_footers.txt - Headers/footers + page numbers
  • fixture2_hyphenation.txt - Hyphenation + mid-sentence wraps
  • fixture3_lists_tables.txt - Lists & pseudo-tables