Usage

Usage Guide

Web Application

Quick Start

1. Visit https://kiku-jw.github.io/DocStripper/
2. Click "Upload Your Documents" or drag and drop files
3. Choose a cleaning mode:
   - Fast Clean: Instant rule-based cleaning
   - Smart Clean: AI-powered cleaning (requires WebGPU)
4. Choose a cleaning mode type:
   - Conservative (default): Safe defaults, preserves lists and tables
   - Aggressive: Also merges broken lines and normalizes whitespace
5. Configure cleaning options (optional; hidden in the "Advanced Options" menu)
6. Click "Start Cleaning"
7. Review the side-by-side preview (Original | Cleaned)
8. Download or copy the cleaned results

Note: To keep the interface uncluttered, the individual cleaning options are collapsed into the "Advanced Options" menu by default. Only the Fast/Smart Clean selector and the Conservative/Aggressive selector are always visible.

Cleaning Modes

Fast Clean

Speed: Instant
Method: Rule-based pattern matching
Best for: Standard documents with predictable patterns

Smart Clean (Beta)

Speed: Slower (depends on document size)
Method: AI-powered with on-device LLM
Requirements: WebGPU support, ~100-200 MB one-time download
Best for: Complex documents with unusual patterns
Mode-Aware: Conservative/Aggressive modes affect LLM prompts
- Conservative: Cautious prompts that preserve structure
- Aggressive: Thorough prompts that allow merging and normalization
Post-Processing: After the LLM pass, these rule-based steps are applied:
- Dehyphenation (if enabled)
- Merge broken lines (if enabled in Aggressive mode)
- Whitespace normalization (if enabled in Aggressive mode)

Cleaning Mode Types

Conservative Mode (Default - Recommended)

Safe defaults for most users:

✅ Removes noise (headers, footers, page numbers, duplicates)
✅ Dehyphenates broken words (safe: only lowercase continuations)
✅ Removes repeating headers/footers across pages (≥70% threshold)
✅ Preserves lists and tables
✅ Never merges lines or normalizes whitespace

Best for: Most documents, especially those with lists or tables
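
The ≥70% threshold for repeating headers/footers can be illustrated with a minimal sketch. It is not DocStripper's actual implementation: the function name remove_repeating_edges is hypothetical, and it assumes pages are separated by form feed characters and that only the first and last non-empty line of each page are header/footer candidates.

```python
from collections import Counter


def remove_repeating_edges(text: str, threshold: float = 0.7) -> str:
    """Drop first/last page lines that recur on at least `threshold` of pages."""
    pages = text.split("\f")  # assumption: a form feed marks each page break
    if len(pages) < 2:
        return text  # a single page has nothing to compare against
    counts: Counter = Counter()
    for page in pages:
        lines = [ln.strip() for ln in page.splitlines() if ln.strip()]
        if lines:
            # A set counts each candidate once per page, even if header == footer.
            counts.update({lines[0], lines[-1]})
    repeated = {ln for ln, n in counts.items() if n / len(pages) >= threshold}
    cleaned = []
    for page in pages:
        # Simplification: a matching line is dropped wherever it sits on the page.
        kept = [ln for ln in page.splitlines() if ln.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return "\f".join(cleaned)
```

In this sketch a line seen on 3 of 3 pages is removed, while a line seen on 1 of 3 pages (about 33%) stays.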

Aggressive Mode

Adds line merging and whitespace normalization on top of Conservative mode:

✅ All Conservative features enabled
✅ Merges broken lines (with list/table protection)
✅ Normalizes whitespace (with table protection)
⚠️ Use with caution: may affect formatting in some documents

Best for: Simple text documents without complex formatting
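
Line merging with list/table protection could work along the following lines. This is a sketch under stated assumptions, not the tool's real logic: the names LIST_ITEM, TABLE_ROW, is_protected and merge_broken_lines are hypothetical, a "list item" is taken to be a line starting with a bullet or a number marker, and a "table row" is taken to be a line with a tab or three or more spaces between words.

```python
import re

# Assumption: bullets (-, *, •) or "1." / "1)" markers start a list item.
LIST_ITEM = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")
# Assumption: a tab or a run of 3+ spaces between words suggests a table row.
TABLE_ROW = re.compile(r"\S(?:\t| {3,})\S")


def is_protected(line: str) -> bool:
    """True for lines that look like list items or table rows."""
    return bool(LIST_ITEM.match(line) or TABLE_ROW.search(line))


def merge_broken_lines(text: str) -> str:
    """Join a line onto the previous one when it looks like a mid-sentence wrap."""
    merged: list[str] = []
    for line in text.splitlines():
        prev = merged[-1] if merged else ""
        can_merge = bool(
            prev.strip()
            and line.strip()
            and not prev.rstrip().endswith((".", "!", "?", ":", ";"))
            and line.lstrip()[0].islower()  # continuation starts in lowercase
            and not is_protected(prev)
            and not is_protected(line)
        )
        if can_merge:
            merged[-1] = prev.rstrip() + " " + line.lstrip()
        else:
            merged.append(line)
    return "\n".join(merged)
```

The heuristic is deliberately cautious: anything that resembles a list or table is left alone, which mirrors the protection described above at the cost of missing some genuine wraps.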

Cleaning Options

Basic Options

Remove Empty Lines: Removes blank and whitespace-only lines
Remove Page Numbers: Removes lines with only digits (1, 2, 3...), Roman numerals (I, II, III), or letters (A, B, C)
Remove Headers/Footers: Removes common patterns (Page X of Y, Confidential, etc.)
Remove Repeating Headers/Footers: Removes headers/footers that appear on ≥70% of pages (detected automatically)
Remove Duplicates: Collapses consecutive identical lines
Remove Punctuation Lines: Removes lines with only symbols (---, ***, ===) or single bullets (•, *, ·)
Preserve Paragraph Spacing: Keeps one empty line between paragraphs
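
Several of the basic options are simple line-level filters. The sketch below shows one way they could be combined; the patterns and the function name basic_clean are assumptions for illustration and may be stricter or looser than the rules DocStripper actually applies.

```python
import re

# Lines holding only digits, uppercase Roman numerals, or a single capital letter.
PAGE_NUMBER = re.compile(r"^\s*(?:\d+|[IVXLCDM]+|[A-Z])\s*$")
# Lines holding only a symbol run (---, ***, ===) or a single bullet.
PUNCT_ONLY = re.compile(r"^\s*(?:[-*=_]{3,}|[•*·])\s*$")


def basic_clean(text: str) -> str:
    """Apply page-number, punctuation-line, duplicate, and blank-line filters."""
    out: list[str] = []
    for line in text.splitlines():
        if PAGE_NUMBER.match(line) or PUNCT_ONLY.match(line):
            continue
        if out and line.strip() and line == out[-1]:
            continue  # collapse consecutive identical lines
        if not line.strip() and (not out or not out[-1].strip()):
            continue  # keep at most one empty line between paragraphs
        out.append(line)
    return "\n".join(out).strip("\n")
```

Note that a pattern this broad also removes a line consisting solely of "I" or "A", which is one reason to review the preview before downloading.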

Advanced Options (Collapsible Menu)

All cleaning options are available in the "Advanced Options" menu:

Remove Empty Lines: Removes blank and whitespace-only lines
Remove Page Numbers: Removes lines with only digits, Roman numerals, or letters
Remove Headers/Footers: Removes common patterns (includes repeating headers/footers)
Remove Duplicates: Collapses consecutive identical lines
Remove Punctuation Lines: Removes lines with only symbols or single bullets
Preserve Paragraph Spacing: Keeps one empty line between paragraphs
Dehyphenate: Safe dehyphenation - joins "auto-\nmatic" → "automatic" (only lowercase continuations)
Merge Broken Lines: Merges lines broken mid-sentence (protects lists and tables) - Enabled in Aggressive mode
Normalize Whitespace: Collapses multiple spaces, normalizes tabs (protects tables) - Enabled in Aggressive mode
Keep Table Spacing: Preserves spacing in detected table blocks when normalizing whitespace
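
The "only lowercase continuations" rule for dehyphenation can be expressed as a single regular expression. This is a minimal sketch, assuming the rule means "a letter, a hyphen, a line break, then a lowercase letter"; the names HYPHEN_BREAK and dehyphenate are hypothetical.

```python
import re

# A letter, a hyphen at end of line, then a lowercase letter on the next line.
HYPHEN_BREAK = re.compile(r"([A-Za-z])-\n([a-z])")


def dehyphenate(text: str) -> tuple[str, int]:
    """Join words split across lines; return the text and the number of joins."""
    return HYPHEN_BREAK.subn(r"\1\2", text)
```

For example, dehyphenate("auto-\nmatic") returns ("automatic", 1), while "Jean-\nPaul" is left untouched because the continuation is capitalized. A genuine compound such as "well-\nknown" would still be joined by this sketch, so the lowercase check reduces false joins rather than eliminating them.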

Preview & Results

Side-by-Side Preview: Compare Original | Cleaned text side-by-side
Virtualization: Large files (> 1 MB) use virtualization for smooth scrolling
Statistics: Detailed stats showing what was removed (lines, duplicates, headers, etc.)
Download: Download cleaned file with _cleaned.txt suffix
Copy: Copy cleaned text to clipboard with one click

Settings Persistence

Your cleaning preferences are automatically saved in your browser's localStorage:

Cleaning mode (Fast/Smart)
Cleaning mode type (Conservative/Aggressive)
All checkbox settings

Settings persist across page reloads and browser sessions.

CLI Tool

Basic Usage

# Clean a single file
python tool.py document.txt

# Clean multiple files
python tool.py file1.txt file2.txt file3.docx

# Preview changes without modifying files
python tool.py --dry-run document.txt

# Undo last operation
python tool.py --undo

Supported Formats

.txt - Plain text files
.docx - Microsoft Word documents
.pdf - PDF files
- Web: Automatic (uses PDF.js library - no installation needed)
- CLI: Requires pdftotext from poppler-utils
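
Before cleaning PDFs from the command line, it can help to confirm that pdftotext is installed. The sketch below uses only the Python standard library; the helper names are hypothetical, and the -layout flag is one reasonable choice rather than a confirmed detail of how tool.py invokes pdftotext.

```python
import shutil
import subprocess


def pdftotext_available() -> bool:
    """Return True if the pdftotext binary from poppler-utils is on PATH."""
    return shutil.which("pdftotext") is not None


def extract_pdf_text(pdf_path: str) -> str:
    """Extract text with pdftotext; the trailing '-' sends output to stdout."""
    if not pdftotext_available():
        raise RuntimeError("pdftotext not found; install poppler-utils")
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```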

Command Options

python tool.py [OPTIONS] [FILES...]

Options:
  -h, --help     Show help message
  --dry-run      Preview changes without modifying files
  --undo         Restore files from last operation

Note: The CLI uses Conservative mode by default. Line merging and whitespace normalization are available programmatically but are disabled by default for safety.

Examples

Example 1: Clean a single document

python tool.py report.txt

Example 2: Clean multiple documents

python tool.py document1.txt document2.docx document3.pdf

Example 3: Preview before cleaning

python tool.py --dry-run important_document.txt

Example 4: Undo last operation

python tool.py --undo

Output

Original files are backed up with a .bak extension
Processed files replace the originals
Statistics are shown in console:
- Lines removed
- Duplicates collapsed
- Empty lines removed
- Headers/footers removed
- Punctuation lines removed
- Dehyphenated tokens
- Repeating headers/footers removed
- Merged lines (if enabled)
Operation log saved to .strip-log
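
The backup-then-replace behaviour, and what --undo conceptually reverses, can be sketched as follows. This is an illustration only: the function names are hypothetical, the backup is assumed to be named by appending .bak to the full file name (report.txt.bak), and the .strip-log format is not modelled because it is not described here.

```python
import shutil
from pathlib import Path


def write_with_backup(path: Path, cleaned: str) -> Path:
    """Copy the original to <name>.bak, then overwrite it with cleaned text."""
    backup = path.with_name(path.name + ".bak")  # assumed naming scheme
    shutil.copy2(path, backup)  # copy2 also preserves timestamps
    path.write_text(cleaned, encoding="utf-8")
    return backup


def restore_from_backup(path: Path) -> None:
    """Move the .bak copy back over the processed file."""
    backup = path.with_name(path.name + ".bak")
    shutil.move(str(backup), str(path))
```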

Best Practices

Test first - Run with --dry-run or work on copies before cleaning the real files
Backup important files - The tool creates backups, but extra backups never hurt
Review statistics - Check what was removed before finalizing
Use the appropriate mode - Fast Clean for documents with predictable patterns, Smart Clean for complex ones
Start with Conservative mode - It's safer and preserves formatting
Use side-by-side preview - Always review changes before downloading
Check lists and tables - Ensure they're preserved correctly, especially in Aggressive mode

Troubleshooting

See the Troubleshooting Guide for common issues and solutions.

Testing

See SELF_TESTS.md for manual test steps and expected results.

Test fixtures are available in the examples/ directory:

fixture1_headers_footers.txt - Headers/footers + page numbers
fixture2_hyphenation.txt - Hyphenation + mid-sentence wraps
fixture3_lists_tables.txt - Lists & pseudo-tables
