kiku edited this page Nov 1, 2025 · 4 revisions

FAQ - Frequently Asked Questions

General Questions

What is DocStripper?

DocStripper is an AI-powered batch document cleaner that automatically removes noise from text documents, including page numbers, headers, footers, duplicate lines, and empty lines.

Is DocStripper free?

Yes! DocStripper is completely free and open-source (MIT License).

Is my data private?

Absolutely! All processing happens entirely in your browser (web app) or on your computer (CLI). Your files never leave your device - no uploads, no server-side processing.

What file formats are supported?

  • .txt - Plain text files
  • .docx - Microsoft Word documents
  • .pdf - PDF files
    • Web: Automatic support (uses PDF.js library)
    • CLI: Requires pdftotext from poppler-utils

Web Application

Do I need to install anything?

No! The web application works entirely in your browser. Just visit the website and start using it.

What browsers are supported?

  • Chrome/Edge 113+ (recommended)
  • Firefox 110+
  • Safari 18+

What's the difference between Fast Clean and Smart Clean?

  • Fast Clean: Instant rule-based cleaning using pattern matching
  • Smart Clean: AI-powered cleaning using an on-device LLM (requires WebGPU; slower, but handles more context)
    • Mode-aware: Conservative/Aggressive modes influence the LLM prompts
    • Post-processing: After LLM processing, applies dehyphenation, line merging, and whitespace normalization (if enabled)
    • Better suited to complex documents with unusual patterns that fixed rules miss

What's the difference between Conservative and Aggressive modes?

  • Conservative Mode (default): Safe defaults that preserve lists and tables. Removes noise but never merges lines or normalizes whitespace. Recommended for most users.
  • Aggressive Mode: Also merges broken lines and normalizes whitespace, while still protecting lists and tables. Use with caution on complex documents.

What does "Dehyphenate" do?

Dehyphenation joins words that were split across lines by an end-of-line hyphen. For example, "auto-\nmatic" becomes "automatic". It applies only when the continuation starts with a lowercase letter, to avoid false positives.
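The rule can be illustrated with a minimal sketch. This is not DocStripper's actual code; the function name dehyphenate and the regular expression are assumptions chosen to match the description above.

```python
import re

# A word character, a hyphen at the end of the line, then a lowercase
# letter at the start of the next line. (Hypothetical pattern.)
_HYPHEN_BREAK = re.compile(r"(\w)-\n([a-z])")


def dehyphenate(text: str) -> str:
    """Join words split by an end-of-line hyphen (lowercase continuation only)."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


print(dehyphenate("auto-\nmatic"))          # automatic
print(repr(dehyphenate("Anglo-\nSaxon")))   # 'Anglo-\nSaxon' (left unchanged)
```

A rule this simple cannot tell a line-break hyphen from a genuine compound: "well-\nknown" would become "wellknown". The lowercase check reduces false positives but does not eliminate them.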

What are "Repeating Headers/Footers"?

Headers or footers that appear on ≥70% of pages are automatically detected and removed. This helps clean documents where headers/footers aren't in the known pattern list but repeat across pages.
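One way such detection could work is sketched below. It assumes the document is already split into pages and that only the first and last non-blank line of each page are header/footer candidates; the function name remove_repeating_edges and these choices are illustrative, not taken from DocStripper's source.

```python
from collections import Counter


def remove_repeating_edges(pages: list[list[str]], threshold: float = 0.7) -> list[list[str]]:
    """Drop lines that open or close at least `threshold` of all pages."""
    if not pages:
        return []

    counts: Counter[str] = Counter()
    for page in pages:
        nonblank = [line.strip() for line in page if line.strip()]
        if nonblank:
            # A set, so a one-line page is counted only once.
            counts.update({nonblank[0], nonblank[-1]})

    repeated = {line for line, n in counts.items() if n / len(pages) >= threshold}

    # Simplification: a repeated line is removed wherever it occurs on a page.
    return [[line for line in page if line.strip() not in repeated] for page in pages]


pages = [
    ["ACME Report", "Intro text", "Confidential"],
    ["ACME Report", "More text", "Confidential"],
    ["ACME Report", "Final text", "Confidential"],
    ["Appendix", "Tables", "End"],
]
cleaned = remove_repeating_edges(pages)
# "ACME Report" and "Confidential" appear on 3 of 4 pages (75%), so both are removed.
```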

How does merging broken lines work?

In Aggressive mode, lines broken mid-sentence are merged back together. For example:

This is a sentence
broken across lines.

becomes:

This is a sentence broken across lines.

Lists and tables are protected from merging to preserve formatting.
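A minimal sketch of a merge heuristic follows. The exact rules DocStripper uses are not documented here, so the conditions below are assumptions: a line is joined to the previous one when the previous line does not end a sentence and the next line starts with a lowercase letter; list items and pipe-delimited table rows are skipped.

```python
import re

# Hypothetical markers for list items: "-", "*", "•", "1." or "1)".
_LIST_ITEM = re.compile(r"^\s*([-*•]|\d+[.)])\s+")
_SENTENCE_END = (".", "!", "?", ":", ";")


def _is_protected(line: str) -> bool:
    """Treat list items and pipe-delimited rows as protected (assumption)."""
    return bool(_LIST_ITEM.match(line)) or line.count("|") >= 2


def merge_broken_lines(lines: list[str]) -> list[str]:
    merged: list[str] = []
    for line in lines:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.strip()
            and line.strip()
            and not _is_protected(prev)
            and not _is_protected(line)
            and not prev.rstrip().endswith(_SENTENCE_END)
            and line.lstrip()[0].islower()
        ):
            # Join the continuation onto the previous line with one space.
            merged[-1] = prev.rstrip() + " " + line.lstrip()
        else:
            merged.append(line)
    return merged


print(merge_broken_lines(["This is a sentence", "broken across lines."]))
# ['This is a sentence broken across lines.']
```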

What does whitespace normalization do?

In Aggressive mode, whitespace normalization:

  • Collapses multiple spaces into single spaces
  • Normalizes tabs to spaces
  • Trims trailing spaces

Table blocks are detected and protected to preserve table formatting.
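The three steps can be sketched as follows. The table check here (two or more "|" characters on a line) is an assumption for illustration; DocStripper's real table detection may differ.

```python
import re


def looks_like_table_row(line: str) -> bool:
    """Hypothetical table check: a line with at least two pipe characters."""
    return line.count("|") >= 2


def normalize_whitespace(lines: list[str]) -> list[str]:
    out: list[str] = []
    for line in lines:
        if looks_like_table_row(line):
            out.append(line)                 # leave table alignment untouched
            continue
        line = line.replace("\t", " ")       # tabs become spaces
        line = re.sub(r" {2,}", " ", line)   # collapse runs of spaces
        out.append(line.rstrip())            # trim trailing spaces
    return out


print(normalize_whitespace(["a   b\tc   ", "| x  |  y |"]))
# ['a b c', '| x  |  y |']
```

Note that this sketch also collapses leading indentation to a single space, which a real implementation may want to preserve.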

Can I see what changed before downloading?

Yes! The web app shows a side-by-side preview: Original | Cleaned. You can scroll through both versions to see exactly what was removed or changed.

Why does Smart Clean require a download?

Smart Clean downloads a small AI model (~100-200 MB) to run locally in your browser. This download happens only once and is cached for future use.

Can I use Smart Clean offline?

Yes! After the initial download, Smart Clean works completely offline.

What if WebGPU is not available?

The tool will automatically fall back to Fast Clean mode.

CLI Tool

Do I need to install Python packages?

No! The CLI tool uses only the Python standard library, with no external Python packages required. PDF support additionally needs pdftotext from poppler-utils, which is optional.

How do I install PDF support?

  • macOS: brew install poppler
  • Ubuntu/Debian: sudo apt-get install poppler-utils
  • Windows: Download from poppler-windows releases

Can I undo changes?

Yes! Use python tool.py --undo to restore files from the last operation.

Are my original files backed up?

Yes! Original files are automatically backed up with a .bak extension before processing.
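For readers curious what a backup-and-restore scheme of this kind looks like, here is a minimal standard-library sketch. It assumes the backup name is formed by appending .bak to the full file name (notes.txt becomes notes.txt.bak); the actual naming and the helper names backup_file and restore_file are not confirmed by the tool's documentation.

```python
import shutil
from pathlib import Path


def backup_path(path: Path) -> Path:
    """Assumed naming scheme: append ".bak" to the full file name."""
    return path.with_name(path.name + ".bak")


def backup_file(path: Path) -> Path:
    """Copy the file (with metadata) to its backup location before changing it."""
    backup = backup_path(path)
    shutil.copy2(path, backup)
    return backup


def restore_file(path: Path) -> None:
    """Overwrite the file with its backup, undoing the last change."""
    shutil.copy2(backup_path(path), path)
```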

Technical Questions

How does Fast Clean work?

Fast Clean uses pattern matching to identify and remove:

  • Page numbers (lines with only digits)
  • Headers/footers (common patterns)
  • Duplicate consecutive lines
  • Empty lines
  • Punctuation-only lines
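The rules above can be sketched as a single pass over the lines. This is an illustration rather than the tool's real implementation: the function name fast_clean is hypothetical, and the "Page X of Y" pattern merely stands in for the known header/footer pattern list, which is not reproduced here.

```python
import re

_PAGE_NUMBER = re.compile(r"[0-9]+")
# Placeholder for the known header/footer patterns (assumption).
_HEADER_FOOTER_PATTERNS = [re.compile(r"Page \d+ of \d+", re.IGNORECASE)]


def fast_clean(lines: list[str]) -> list[str]:
    cleaned: list[str] = []
    prev_stripped = None
    for line in lines:
        stripped = line.strip()
        is_duplicate = stripped == prev_stripped   # same as the line directly above
        prev_stripped = stripped

        if not stripped:                           # empty line
            continue
        if _PAGE_NUMBER.fullmatch(stripped):       # digits only
            continue
        if any(p.fullmatch(stripped) for p in _HEADER_FOOTER_PATTERNS):
            continue
        if not any(ch.isalnum() for ch in stripped):   # punctuation only
            continue
        if is_duplicate:
            continue
        cleaned.append(line)
    return cleaned


print(fast_clean(["Title", "Title", "", "12", "Page 3 of 10", "---", "Body text"]))
# ['Title', 'Body text']
```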

How does Smart Clean work?

Smart Clean uses an on-device Large Language Model (LLM) to intelligently analyze each line and decide whether to keep, remove, or modify it based on context.

Two-stage process:

  1. LLM Processing: The AI model analyzes the text based on:

    • Your selected cleaning options (removeEmptyLines, removePageNumbers, etc.)
    • The Conservative/Aggressive mode (affects prompt instructions)
      • Conservative mode: Cautious prompts that preserve structure
      • Aggressive mode: Thorough prompts that allow merging and normalization
  2. Post-Processing: After LLM processing, applies:

    • Dehyphenation (if enabled)
    • Merge broken lines (if enabled in Aggressive mode)
    • Whitespace normalization (if enabled in Aggressive mode)

This ensures consistent behavior between Fast Clean and Smart Clean modes.
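The gating of the post-processing stage can be written out as a small decision function. The web app itself runs in the browser, so the Python below is only a language-neutral illustration; the Options fields and their defaults are assumptions, not the app's real setting names.

```python
from dataclasses import dataclass


@dataclass
class Options:
    # Hypothetical option names and defaults, for illustration only.
    mode: str = "conservative"        # or "aggressive"
    dehyphenate: bool = True
    merge_lines: bool = True
    normalize_whitespace: bool = True


def post_processing_steps(opts: Options) -> list[str]:
    """Return the post-processing steps that would run after the LLM stage."""
    steps: list[str] = []
    if opts.dehyphenate:
        steps.append("dehyphenate")
    aggressive = opts.mode == "aggressive"
    # Merging and normalization run only in Aggressive mode, and only if enabled.
    if aggressive and opts.merge_lines:
        steps.append("merge_lines")
    if aggressive and opts.normalize_whitespace:
        steps.append("normalize_whitespace")
    return steps


print(post_processing_steps(Options(mode="conservative")))
# ['dehyphenate']
print(post_processing_steps(Options(mode="aggressive")))
# ['dehyphenate', 'merge_lines', 'normalize_whitespace']
```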

Can I customize what gets removed?

Yes! Both modes support customizable cleaning options. In Smart Clean mode, content covered by an unchecked option is left in the output.

Note: Individual cleaning options are hidden in the "Advanced Options" collapsible menu by default. You can expand it to configure specific options, but the default settings work well for most documents.

What's the maximum file size?

There's no hard limit, but very large files (>10MB) may take longer to process, especially in Smart Clean mode.

Troubleshooting

The web app isn't working

  • Check your browser version (Chrome/Edge 113+, Firefox 110+, or Safari 18+)
  • Ensure JavaScript is enabled
  • Try disabling browser extensions
  • Clear browser cache

Smart Clean is slow

  • This is normal for large documents
  • Consider using Fast Clean for very large files
  • Ensure WebGPU is enabled in your browser

CLI says "file not found"

  • Check that the file path is correct
  • Ensure you're in the right directory
  • Use absolute paths if relative paths don't work

PDF processing fails

  • Ensure poppler-utils is installed
  • Check that pdftotext is in your PATH
  • Verify the PDF file is not corrupted
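To check the first two points from Python, the standard library's shutil.which reports whether an executable is on your PATH. The sketch below also shows one plausible way to call pdftotext (passing "-" as the output file sends the text to stdout); how DocStripper itself invokes the tool is an assumption, and the helper names are hypothetical.

```python
import shutil
import subprocess


def tool_available(name: str) -> bool:
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


def pdf_to_text(pdf_path: str) -> str:
    """Extract text from a PDF with pdftotext, failing clearly if it is missing."""
    if not tool_available("pdftotext"):
        raise RuntimeError("pdftotext not found on PATH; install poppler-utils")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],   # "-" writes the extracted text to stdout
        capture_output=True,
        text=True,
        check=True,                     # raise CalledProcessError on failure
    )
    return result.stdout


if __name__ == "__main__":
    print("pdftotext available:", tool_available("pdftotext"))
```

If pdftotext is installed but extraction raises CalledProcessError, the PDF itself is the more likely culprit (corrupted or encrypted).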

Contributing

How can I contribute?

See our Contributing Guide for details. We welcome:

  • Bug reports
  • Feature requests
  • Code contributions
  • Documentation improvements

Where can I report bugs?

Use the GitHub Issues page and select the "Bug Report" template.

Can I suggest features?

Absolutely! Use the Feature Request template.

Still have questions?
