FAQ

FAQ - Frequently Asked Questions

General Questions

What is DocStripper?

DocStripper is an AI-powered batch document cleaner that automatically removes noise from text documents, including page numbers, headers, footers, duplicate lines, and empty lines.

Is DocStripper free?

Yes! DocStripper is completely free and open-source (MIT License).

Is my data private?

Absolutely! All processing happens entirely in your browser (web app) or on your computer (CLI). Your files never leave your device - no uploads, no server-side processing.

What file formats are supported?

.txt - Plain text files
.docx - Microsoft Word documents
.pdf - PDF files
- Web: Automatic support (uses PDF.js library)
- CLI: Requires pdftotext from poppler-utils

Web Application

Do I need to install anything?

No! The web application works entirely in your browser. Just visit the website and start using it.

What browsers are supported?

Chrome/Edge 113+ (recommended)
Firefox 110+
Safari 18+

What's the difference between Fast Clean and Smart Clean?

Fast Clean: Instant rule-based cleaning using pattern matching
Smart Clean: AI-powered cleaning using an on-device LLM (requires WebGPU; slower but more intelligent)
- Mode-aware: Conservative/Aggressive modes influence the LLM prompts
- Post-processing: After LLM processing, applies dehyphenation, line merging, and whitespace normalization (if enabled)
- Provides more intelligent cleaning for complex documents with unusual patterns

What's the difference between Conservative and Aggressive modes?

Conservative Mode (default): Safe defaults that preserve lists and tables. Removes noise but never merges lines or normalizes whitespace. Recommended for most users.
Aggressive Mode: More thorough cleaning that merges broken lines and normalizes whitespace, but still protects lists and tables. Use with caution on complex documents.

What does "Dehyphenate" do?

Dehyphenation joins words that were split across a line break. For example, "auto-\nmatic" becomes "automatic". To avoid false positives, it applies only when the continuation line starts with a lowercase letter.
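
The rule above can be sketched in a few lines of Python. This is a minimal illustration, not DocStripper's actual code: the function name dehyphenate and the exact regular expression are assumptions chosen to match the behavior described.

```python
import re

# Assumed pattern (not taken from DocStripper's source): a hyphen followed by
# a newline, preceded by a letter and followed by a lowercase letter.
_HYPHEN_BREAK = re.compile(r"(?<=[A-Za-z])-\n(?=[a-z])")


def dehyphenate(text: str) -> str:
    """Join words split across a line break, e.g. 'auto-\\nmatic' -> 'automatic'."""
    return _HYPHEN_BREAK.sub("", text)


if __name__ == "__main__":
    print(dehyphenate("The process is auto-\nmatic."))
    print(dehyphenate("See the Anglo-\nSaxon chapter."))
```

Under this sketch, "Anglo-\nSaxon" is left alone because the continuation starts with an uppercase letter. A genuinely hyphenated compound such as "well-\nknown" would still be joined, which is a known limitation of a lowercase-only rule.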

What are "Repeating Headers/Footers"?

Headers or footers that appear on ≥70% of pages are automatically detected and removed. This helps clean documents where headers/footers aren't in the known pattern list but repeat across pages.
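
As a rough sketch of how a 70% threshold could work, the Python below counts how often each page's first and last non-empty line recurs across pages. The function names, the page representation (a list of lines per page), and the choice to look only at first/last lines are assumptions for illustration; the real detection may differ.

```python
from collections import Counter


def find_repeating_lines(pages, threshold=0.7):
    """Return first/last lines that occur on at least `threshold` of all pages."""
    if not pages:
        return set()
    counts = Counter()
    for page in pages:
        non_empty = [line.strip() for line in page if line.strip()]
        if not non_empty:
            continue
        # Use a set so a one-line page does not count the same line twice.
        counts.update({non_empty[0], non_empty[-1]})
    return {line for line, n in counts.items() if n / len(pages) >= threshold}


def strip_repeating(pages, threshold=0.7):
    """Drop every line whose stripped text matches a detected header/footer."""
    repeated = find_repeating_lines(pages, threshold)
    return [[line for line in page if line.strip() not in repeated] for page in pages]


if __name__ == "__main__":
    pages = [
        ["ACME Report", "Intro text", "Confidential"],
        ["ACME Report", "More text", "Confidential"],
        ["ACME Report", "Final", "Confidential"],
        ["Appendix", "Data", "End"],
    ]
    print(strip_repeating(pages))
```

In the sample, "ACME Report" and "Confidential" appear on 3 of 4 pages (75%), so both are removed, while the one-off lines on the last page are kept.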

How does "Merge broken lines" work?

In Aggressive mode, lines broken mid-sentence are merged back together. For example:

This is a sentence
broken across lines.

becomes:

This is a sentence broken across lines.

Lists and tables are protected from merging to preserve formatting.
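
A simplified Python sketch of this kind of merging follows. The heuristics here (a line is merged only if the previous line does not end in sentence punctuation and the next line starts lowercase, and anything that looks like a list item or table row is skipped) are assumptions for illustration, not the tool's actual rules.

```python
import re

# Assumed heuristics: bullets, numbered items, and pipe/tab rows are protected.
_LIST_ITEM = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")
_SENTENCE_END = re.compile(r"[.!?:;]$")


def is_protected(line):
    """True for lines that look like list items or table rows."""
    return bool(_LIST_ITEM.match(line)) or "|" in line or "\t" in line


def merge_broken_lines(text):
    merged = []
    for line in text.split("\n"):
        prev = merged[-1] if merged else ""
        can_merge = (
            bool(prev.strip())
            and bool(line.strip())
            and not is_protected(prev)
            and not is_protected(line)
            and not _SENTENCE_END.search(prev.rstrip())
            and line.lstrip()[:1].islower()
        )
        if can_merge:
            # Join with a single space, dropping whitespace around the break.
            merged[-1] = prev.rstrip() + " " + line.lstrip()
        else:
            merged.append(line)
    return "\n".join(merged)


if __name__ == "__main__":
    print(merge_broken_lines("This is a sentence\nbroken across lines."))
```

Blank lines act as paragraph boundaries in this sketch, so separate paragraphs are never joined.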

What does whitespace normalization do?

In Aggressive mode, whitespace normalization:

Collapses multiple spaces into single spaces
Normalizes tabs to spaces
Trims trailing spaces

Table blocks are detected and protected to preserve table formatting.
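
The three steps can be sketched in Python as below. The table check is a deliberately crude assumption (a pipe character, or at least two column gaps made of a tab or two-plus spaces); DocStripper's real table detection is not shown in this FAQ.

```python
import re

# Assumed column-gap pattern: text, then a tab or 2+ spaces, then text.
_COLUMN_GAP = re.compile(r"\S(?: {2,}|\t)\S")


def looks_like_table_row(line):
    """Crude, hypothetical table test used only for this illustration."""
    return "|" in line or len(_COLUMN_GAP.findall(line)) >= 2


def normalize_whitespace(text):
    out = []
    for line in text.split("\n"):
        if looks_like_table_row(line):
            out.append(line)  # leave table rows untouched
            continue
        line = line.replace("\t", " ")       # tabs -> spaces
        line = re.sub(r" {2,}", " ", line)   # collapse runs of spaces
        out.append(line.rstrip())            # trim trailing whitespace
    return "\n".join(out)


if __name__ == "__main__":
    print(normalize_whitespace("Hello   world  \nName    Age    City"))
```

Note that this sketch also collapses leading indentation to a single space, which may or may not match the tool's behavior.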

Can I see what changed before downloading?

Yes! The web app shows a side-by-side preview: Original | Cleaned. You can scroll through both versions to see exactly what was removed or changed.

Why does Smart Clean require a download?

Smart Clean downloads a small AI model (~100-200 MB) to run locally in your browser. This download happens only once and is cached for future use.

Can I use Smart Clean offline?

Yes! After the initial download, Smart Clean works completely offline.

What if WebGPU is not available?

The tool will automatically fall back to Fast Clean mode.

CLI Tool

Do I need to install Python packages?

No! The CLI tool uses only the Python standard library, so no external packages are required. The one optional dependency is poppler-utils, which is needed for PDF support.

How do I install PDF support?

macOS: brew install poppler
Ubuntu/Debian: sudo apt-get install poppler-utils
Windows: Download from poppler-windows releases

Can I undo changes?

Yes! Use python tool.py --undo to restore files from the last operation.

Are my original files backed up?

Yes! Original files are automatically backed up with a .bak extension before processing.
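
For readers curious what a backup-before-processing step can look like, here is a minimal Python sketch using only the standard library. The helper names and the exact backup naming (report.txt becomes report.txt.bak) are assumptions; the FAQ only states that a .bak extension is used.

```python
import shutil
from pathlib import Path


def backup_path_for(path):
    """Assumed naming scheme: append '.bak' to the full file name."""
    src = Path(path)
    return src.with_name(src.name + ".bak")


def backup_file(path):
    """Copy the file to its .bak sibling and return the backup path."""
    backup = backup_path_for(path)
    shutil.copy2(path, backup)  # copy2 also preserves timestamps
    return backup


def restore_file(path):
    """Overwrite the file with the contents of its .bak sibling."""
    shutil.copy2(backup_path_for(path), path)
```

Restoring by hand this way is only a fallback idea; the supported route is the --undo option described above.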

Technical Questions

How does Fast Clean work?

Fast Clean uses pattern matching to identify and remove:

Page numbers (lines with only digits)
Headers/footers (common patterns)
Duplicate consecutive lines
Empty lines
Punctuation-only lines
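
The rules listed above can be approximated with a handful of regular expressions. This Python sketch is illustrative only: the header/footer pattern is a placeholder (the tool's real pattern list is not reproduced in this FAQ), and the function name fast_clean is an assumption.

```python
import re

PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")        # lines with only digits
PUNCT_ONLY = re.compile(r"^\s*[^\w\s]+\s*$")    # lines with only punctuation
# Placeholder header/footer patterns, invented for this example.
HEADER_FOOTER = re.compile(
    r"^\s*(?:page \d+(?: of \d+)?|confidential)\s*$", re.IGNORECASE
)


def fast_clean(text):
    cleaned = []
    previous = None
    for line in text.split("\n"):
        duplicate = line == previous  # same as the line directly before it
        previous = line
        if duplicate or not line.strip():
            continue
        if PAGE_NUMBER.match(line) or PUNCT_ONLY.match(line) or HEADER_FOOTER.match(line):
            continue
        cleaned.append(line)
    return "\n".join(cleaned)


if __name__ == "__main__":
    sample = "Title\n\n12\nBody text\nBody text\n---\nPage 3 of 10\nEnd"
    print(fast_clean(sample))
```

Duplicates are compared against the immediately preceding input line, so two identical lines separated by other content are both kept.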

How does Smart Clean work?

Smart Clean uses an on-device Large Language Model (LLM) to intelligently analyze each line and decide whether to keep, remove, or modify it based on context.

Two-stage process:

1. LLM Processing: The AI model analyzes the text based on:
- Your selected cleaning options (removeEmptyLines, removePageNumbers, etc.)
- The Conservative/Aggressive mode (affects prompt instructions)
  - Conservative mode: Cautious prompts that preserve structure
  - Aggressive mode: Thorough prompts that allow merging and normalization
2. Post-Processing: After LLM processing, applies:
- Dehyphenation (if enabled)
- Merge broken lines (if enabled in Aggressive mode)
- Whitespace normalization (if enabled in Aggressive mode)

This ensures consistent behavior between Fast Clean and Smart Clean modes.

Can I customize what gets removed?

Yes! Both modes support customizable cleaning options. In Smart Clean mode, content covered by an unchecked option is left in the output.

Note: Individual cleaning options are hidden in the "Advanced Options" collapsible menu by default. You can expand it to configure specific options, but the default settings work well for most documents.

What's the maximum file size?

There's no hard limit, but very large files (>10MB) may take longer to process, especially in Smart Clean mode.

Troubleshooting

The web app isn't working

Check that your browser meets the supported versions (Chrome/Edge 113+, Firefox 110+, Safari 18+)
Ensure JavaScript is enabled
Try disabling browser extensions
Clear browser cache

Smart Clean is slow

This is normal for large documents
Consider using Fast Clean for very large files
Ensure WebGPU is enabled in your browser

CLI says "file not found"

Check that the file path is correct
Ensure you're in the right directory
Use absolute paths if relative paths don't work

PDF processing fails

Ensure poppler-utils is installed
Check that pdftotext is in your PATH
Verify the PDF file is not corrupted
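
To check the first two points quickly, a short Python snippet can look up pdftotext on your PATH and try a conversion. This is a diagnostic sketch, not part of DocStripper; the function names are made up for the example, and the CLI itself may invoke pdftotext with different arguments.

```python
import shutil
import subprocess


def pdftotext_path():
    """Return the full path to pdftotext, or None if it is not on PATH."""
    return shutil.which("pdftotext")


def pdf_to_text(pdf_path):
    """Extract text from a PDF by calling pdftotext and reading its stdout."""
    exe = pdftotext_path()
    if exe is None:
        raise RuntimeError("pdftotext not found on PATH; install poppler-utils")
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        [exe, str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if pdftotext fails
    )
    return result.stdout


if __name__ == "__main__":
    print("pdftotext:", pdftotext_path() or "not found")
```

If the lookup prints "not found", revisit the installation steps in the CLI Tool section. If pdftotext is found but the conversion raises an error, the PDF itself is the more likely culprit.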

Contributing

How can I contribute?

See our Contributing Guide for details. We welcome:

Bug reports
Feature requests
Code contributions
Documentation improvements

Where can I report bugs?

Use the GitHub Issues page and select the "Bug Report" template.

Can I suggest features?

Absolutely! Use the Feature Request template.

Still have questions?

Check GitHub Discussions
Open an Issue
Visit the Product Hunt page
