API

API Documentation

Web Application API

DocStripper Class

Main class for rule-based document cleaning.

const stripper = new DocStripper(options);

Options

{
  removeEmptyLines: boolean,        // Default: true
  removePageNumbers: boolean,        // Default: true
  removeHeadersFooters: boolean,    // Default: true
  removeDuplicates: boolean,         // Default: true
  removePunctuationLines: boolean,  // Default: true
  preserveParagraphSpacing: boolean  // Default: true
}

Methods

`processFile(file: File): Promise<Object>`

Processes a file and returns cleaned content with statistics.

Parameters:

file: File object (from file input)

Returns:

{
  fileName: string,
  content: string,
  stats: {
    originalLines: number,
    cleanedLines: number,
    removedLines: number,
    duplicatesRemoved: number,
    headersFootersRemoved: number,
    pageNumbersRemoved: number,
    punctuationLinesRemoved: number,
    emptyLinesRemoved: number
  }
}
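
As a hedged illustration of consuming the object described above, the sketch below formats the documented `stats` fields into a one-line summary. The helper name `summarizeStats` and the sample values are hypothetical; only the field names come from this page.

```javascript
// Hypothetical helper: formats the documented fields of a processFile() result.
function summarizeStats(result) {
  const { fileName, stats } = result;
  // Guard against empty input so the percentage never divides by zero.
  const percent = stats.originalLines > 0
    ? Math.round((stats.removedLines / stats.originalLines) * 100)
    : 0;
  return `${fileName}: ${stats.originalLines} -> ${stats.cleanedLines} lines ` +
    `(${stats.removedLines} removed, ${percent}%)`;
}

// Hand-written sample object (illustrative values, not real tool output).
const sampleResult = {
  fileName: 'report.txt',
  content: '...',
  stats: {
    originalLines: 200,
    cleanedLines: 150,
    removedLines: 50,
    duplicatesRemoved: 10,
    headersFootersRemoved: 12,
    pageNumbersRemoved: 8,
    punctuationLinesRemoved: 5,
    emptyLinesRemoved: 15
  }
};

console.log(summarizeStats(sampleResult));
// prints "report.txt: 200 -> 150 lines (50 removed, 25%)"
```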

SmartCleaner Class

AI-powered document cleaning using WebLLM.

const cleaner = new SmartCleaner();

Constructor

new SmartCleaner()

Creates a new SmartCleaner instance with default settings.

Settings

{
  removeEmptyLines: boolean,           // Default: true
  removePageNumbers: boolean,           // Default: true
  removeHeadersFooters: boolean,        // Default: true
  removeDuplicates: boolean,            // Default: true
  removePunctuationLines: boolean,     // Default: true
  preserveParagraphSpacing: boolean,    // Default: true
  dehyphenate: boolean,                 // Default: true
  mergeBrokenLines: boolean,           // Default: false (enabled in Aggressive mode)
  normalizeWhitespace: boolean,         // Default: false (enabled in Aggressive mode)
  keepTableSpacing: boolean,           // Default: true
  cleaningModeType: string              // 'conservative' or 'aggressive'
}
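
One way to assemble a settings object for either mode is sketched below. The defaults are copied from the listing above. The helper `buildSettings` is hypothetical, and it assumes the caller switches `mergeBrokenLines` and `normalizeWhitespace` on for Aggressive mode; this page does not say whether SmartCleaner does so on its own.

```javascript
// Defaults copied from the Settings listing above.
const DEFAULT_SETTINGS = Object.freeze({
  removeEmptyLines: true,
  removePageNumbers: true,
  removeHeadersFooters: true,
  removeDuplicates: true,
  removePunctuationLines: true,
  preserveParagraphSpacing: true,
  dehyphenate: true,
  mergeBrokenLines: false,
  normalizeWhitespace: false,
  keepTableSpacing: true
});

// Hypothetical helper: builds a settings object for the given mode.
function buildSettings(mode, overrides = {}) {
  if (mode !== 'conservative' && mode !== 'aggressive') {
    throw new TypeError(`Unknown cleaningModeType: ${mode}`);
  }
  const aggressive = mode === 'aggressive';
  return {
    ...DEFAULT_SETTINGS,
    // Assumption: these two options follow the mode unless overridden.
    mergeBrokenLines: aggressive,
    normalizeWhitespace: aggressive,
    ...overrides,
    cleaningModeType: mode
  };
}

// Intended use (not executed here): cleaner.setSettings(buildSettings('aggressive'));
console.log(buildSettings('aggressive', { dehyphenate: false }));
```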

Methods

`setSettings(settings: Object): void`

Updates the cleaning settings used by the AI model.

Parameters:

settings: Object with cleaning options and cleaningModeType

Note: cleaningModeType affects the LLM prompt:

'conservative': Cautious prompts that preserve structure
'aggressive': Thorough prompts that allow merging and normalization

`setProgressCallback(callback: Function): void`

Sets the callback that receives progress updates.

Parameters:

callback: Function(percent: number, message: string)
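
A callback with the documented `(percent, message)` shape might look like the sketch below. The factory `makeProgressReporter` is a hypothetical name, and the code assumes `percent` is on a 0 to 100 scale, which this page does not confirm; values are clamped to that range as a precaution.

```javascript
// Hypothetical factory: returns a callback with the documented
// (percent, message) signature that writes one formatted line per update.
function makeProgressReporter(sink = console.log) {
  return (percent, message) => {
    // Assumption: percent is on a 0-100 scale; clamp and round defensively.
    const clamped = Math.min(100, Math.max(0, Math.round(percent)));
    sink(`[${String(clamped).padStart(3)}%] ${message}`);
  };
}

// Collect updates in an array so the sketch runs without a SmartCleaner.
const progressLines = [];
const report = makeProgressReporter((line) => progressLines.push(line));
report(42.4, 'Cleaning text');
console.log(progressLines[0]); // prints "[ 42%] Cleaning text"

// Intended wiring (not executed here):
// cleaner.setProgressCallback(makeProgressReporter());
```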

`cleanText(text: string, settings: Object): Promise<Object>`

Cleans text using the AI model, then applies post-processing.

Parameters:

text: String to clean
settings: Cleaning options (includes cleaningModeType)

Returns: Promise resolving to:

{
  text: string,           // Cleaned text
  stats: {
    linesRemoved: number,
    duplicatesCollapsed: number,
    emptyLinesRemoved: number,
    headerFooterRemoved: number,
    punctuationLinesRemoved: number,
    dehyphenatedTokens: number,
    mergedLines: number
  }
}
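
The sketch below shows one possible way to call `cleanText` while keeping the original text if cleaning fails. The wrapper `cleanOrKeepOriginal` and the stub cleaner are hypothetical, and the code assumes failures surface as a rejected promise, which this page does not state.

```javascript
// Hypothetical wrapper: `cleaner` is any object exposing the documented
// cleanText(text, settings) method, for example a SmartCleaner instance.
async function cleanOrKeepOriginal(cleaner, text, settings) {
  try {
    const result = await cleaner.cleanText(text, settings);
    return { text: result.text, stats: result.stats, cleaned: true };
  } catch (error) {
    // Assumption: a failed run rejects; fall back to the untouched input.
    return { text, stats: null, cleaned: false, error };
  }
}

// Stand-in cleaner used only so this sketch runs without WebLLM.
const stubCleaner = {
  async cleanText(text) {
    return { text: text.trim(), stats: { linesRemoved: 0 } };
  }
};

cleanOrKeepOriginal(stubCleaner, '  hello  ', { cleaningModeType: 'conservative' })
  .then((outcome) => console.log(outcome.cleaned, JSON.stringify(outcome.text)));
```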

Processing Flow:

1. The LLM analyzes the text based on the mode and settings.
2. Post-processing applies:
- Dehyphenation (if enabled)
- Merge broken lines (if enabled)
- Whitespace normalization (if enabled)
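
The post-processing steps listed above can be approximated with simple regular expressions. The following is a minimal sketch, not SmartCleaner's actual implementation: the function names and the matching rules are assumptions chosen for illustration, and `keepTableSpacing` is ignored.

```javascript
// Join a word split by a trailing hyphen: "docu-\nment" -> "document".
function dehyphenate(text) {
  return text.replace(/([A-Za-z])-\n([a-z])/g, '$1$2');
}

// Join a line to the next one when it does not end in sentence punctuation
// and the next line starts with a lowercase letter. Blank lines are kept.
function mergeBrokenLines(text) {
  return text.replace(/([^\s.!?:;])\n(?=[a-z])/g, '$1 ');
}

// Collapse runs of spaces and tabs, then strip trailing spaces on each line.
function normalizeWhitespace(text) {
  return text.replace(/[ \t]+/g, ' ').replace(/ +$/gm, '');
}

// Apply the enabled steps in the order listed above.
function postProcess(text, settings) {
  let out = text;
  if (settings.dehyphenate) out = dehyphenate(out);
  if (settings.mergeBrokenLines) out = mergeBrokenLines(out);
  if (settings.normalizeWhitespace) out = normalizeWhitespace(out);
  return out;
}

const raw = 'The docu-\nment was  split\nacross lines.\nNew sentence.';
console.log(postProcess(raw, {
  dehyphenate: true,
  mergeBrokenLines: true,
  normalizeWhitespace: true
}));
// prints:
// The document was split across lines.
// New sentence.
```

In this sketch the order matters: merging before dehyphenation would turn "docu-" and "ment" into "docu- ment" rather than "document".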

`cancel(): void`

Cancels the ongoing cleaning operation.
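
A possible use of `cancel()` is to bound how long a cleaning run may take. The sketch below is a hypothetical helper, `cleanWithTimeout`, that calls `cancel()` after a deadline. How the promise returned by `cleanText` settles after cancellation is not documented here, so the helper simply passes that outcome through.

```javascript
// Hypothetical helper: `cleaner` is any object exposing the documented
// cleanText(text, settings) and cancel() methods.
function cleanWithTimeout(cleaner, text, settings, timeoutMs) {
  // Request cancellation if the run is still going after timeoutMs.
  const timer = setTimeout(() => cleaner.cancel(), timeoutMs);
  // Assumption: cleanText returns a native Promise, so finally() is available.
  // The timer is cleared whether the run resolves or rejects.
  return cleaner.cleanText(text, settings).finally(() => clearTimeout(timer));
}

// Intended use (not executed here):
// const result = await cleanWithTimeout(cleaner, text, settings, 60000);
```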

CLI API

Command Line Interface

python tool.py [OPTIONS] [FILES...]

Python API

from tool import DocStripper

stripper = DocStripper(
    remove_empty_lines=True,
    remove_page_numbers=True,
    remove_headers_footers=True,
    remove_duplicates=True,
    remove_punctuation_lines=True,
    preserve_paragraph_spacing=True
)

# Process text
cleaned_text, stats = stripper.process_text(text)

# Process file
cleaned_text, stats = stripper.process_file(file_path)

DocStripper Class

class DocStripper:
    def __init__(self, **options):
        """
        Initialize DocStripper with cleaning options.
        
        Options:
            remove_empty_lines: bool
            remove_page_numbers: bool
            remove_headers_footers: bool
            remove_duplicates: bool
            remove_punctuation_lines: bool
            preserve_paragraph_spacing: bool
        """
    
    def process_text(self, text: str) -> tuple[str, dict]:
        """
        Process text string.
        
        Returns:
            tuple: (cleaned_text, statistics)
        """
    
    def process_file(self, file_path: str) -> tuple[str, dict]:
        """
        Process file from disk.
        
        Returns:
            tuple: (cleaned_text, statistics)
        """

Integration Examples

JavaScript

// Rule-based cleaning
const file = document.getElementById('fileInput').files[0];
const stripper = new DocStripper({
  removeEmptyLines: true,
  removePageNumbers: true
});

const result = await stripper.processFile(file);
console.log(result.stats);

Python

from tool import DocStripper

stripper = DocStripper(
    remove_empty_lines=True,
    remove_page_numbers=True
)

cleaned, stats = stripper.process_file('document.txt')
print(f"Removed {stats['removed_lines']} lines")

Error Handling

Web Application

try {
  const result = await stripper.processFile(file);
} catch (error) {
  console.error('Processing failed:', error);
  // Handle error
}

CLI

from tool import DocStripper

stripper = DocStripper()
file_path = 'document.txt'

try:
    cleaned, stats = stripper.process_file(file_path)
except FileNotFoundError:
    print(f"File not found: {file_path}")
except Exception as e:
    print(f"Error: {e}")

Performance Considerations

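Fast Clean (rule-based, DocStripper): O(n), where n is the number of lines
Smart Clean (SmartCleaner): O(n) in the number of lines, plus the overhead of LLM inference
Large files: consider splitting the input into chunks before running Smart Clean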
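
The chunking suggestion for large files could be sketched as follows. Both helpers are hypothetical and split on a fixed line count, which is an assumption; the source does not prescribe a chunk size or strategy. Duplicates or headers that span a chunk boundary may go undetected with this approach.

```javascript
// Split text into chunks of at most maxLines lines each.
function chunkByLines(text, maxLines) {
  if (!Number.isInteger(maxLines) || maxLines < 1) {
    throw new RangeError('maxLines must be a positive integer');
  }
  const lines = text.split('\n');
  const chunks = [];
  for (let i = 0; i < lines.length; i += maxLines) {
    chunks.push(lines.slice(i, i + maxLines).join('\n'));
  }
  return chunks;
}

// Clean chunks one at a time and rejoin them. `cleaner` is any object
// exposing the documented cleanText(text, settings) method.
async function cleanInChunks(cleaner, text, settings, maxLines) {
  const cleaned = [];
  for (const chunk of chunkByLines(text, maxLines)) {
    const result = await cleaner.cleanText(chunk, settings);
    cleaned.push(result.text);
  }
  return cleaned.join('\n');
}

console.log(chunkByLines('a\nb\nc\nd\ne', 2));
// prints [ 'a\nb', 'c\nd', 'e' ]
```

Processing chunks sequentially keeps only one LLM request in flight at a time, which is the conservative choice when the model runs in the browser.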
