kiku edited this page Nov 1, 2025 · 2 revisions

API Documentation

Web Application API

DocStripper Class

Main class for rule-based document cleaning.

const stripper = new DocStripper(options);

Options

{
  removeEmptyLines: boolean,        // Default: true
  removePageNumbers: boolean,        // Default: true
  removeHeadersFooters: boolean,    // Default: true
  removeDuplicates: boolean,         // Default: true
  removePunctuationLines: boolean,  // Default: true
  preserveParagraphSpacing: boolean  // Default: true
}

Methods

processFile(file: File): Promise<Object>

Processes a file and returns cleaned content with statistics.

Parameters:

  • file: File object (from file input)

Returns:

{
  fileName: string,
  content: string,
  stats: {
    originalLines: number,
    cleanedLines: number,
    removedLines: number,
    duplicatesRemoved: number,
    headersFootersRemoved: number,
    pageNumbersRemoved: number,
    punctuationLinesRemoved: number,
    emptyLinesRemoved: number
  }
}

SmartCleaner Class

AI-powered document cleaning using WebLLM.

const cleaner = new SmartCleaner();

Constructor

new SmartCleaner()

Creates a new SmartCleaner instance with default settings.

Settings

{
  removeEmptyLines: boolean,           // Default: true
  removePageNumbers: boolean,           // Default: true
  removeHeadersFooters: boolean,        // Default: true
  removeDuplicates: boolean,            // Default: true
  removePunctuationLines: boolean,     // Default: true
  preserveParagraphSpacing: boolean,    // Default: true
  dehyphenate: boolean,                 // Default: true
  mergeBrokenLines: boolean,           // Default: false (enabled in Aggressive mode)
  normalizeWhitespace: boolean,         // Default: false (enabled in Aggressive mode)
  keepTableSpacing: boolean,           // Default: true
  cleaningModeType: string              // 'conservative' or 'aggressive'
}

Methods

setSettings(settings: Object): void

Updates the cleaning settings passed to the AI model.

Parameters:

  • settings: Object with cleaning options and cleaningModeType

Note: cleaningModeType affects the LLM prompt:

  • 'conservative': Cautious prompts that preserve structure
  • 'aggressive': Thorough prompts that allow merging and normalization
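
The relationship between cleaningModeType and the Aggressive-only defaults in the Settings table can be sketched as a preset lookup. This is a minimal illustration in Python rather than the web app's JavaScript. The helper name settings_for_mode and the preset logic are assumptions, not the actual SmartCleaner implementation. Only the setting names and default values come from the table above.

```python
# Hypothetical sketch: mapping cleaningModeType to the documented defaults.
# Setting names and defaults mirror the Settings table; the function is illustrative.

BASE_SETTINGS = {
    "removeEmptyLines": True,
    "removePageNumbers": True,
    "removeHeadersFooters": True,
    "removeDuplicates": True,
    "removePunctuationLines": True,
    "preserveParagraphSpacing": True,
    "dehyphenate": True,
    "mergeBrokenLines": False,     # enabled in Aggressive mode
    "normalizeWhitespace": False,  # enabled in Aggressive mode
    "keepTableSpacing": True,
}


def settings_for_mode(mode, overrides=None):
    """Return a settings dict for 'conservative' or 'aggressive' mode."""
    if mode not in ("conservative", "aggressive"):
        raise ValueError(f"unknown cleaningModeType: {mode!r}")
    settings = dict(BASE_SETTINGS, cleaningModeType=mode)
    if mode == "aggressive":
        settings["mergeBrokenLines"] = True
        settings["normalizeWhitespace"] = True
    # Explicit caller overrides win over the mode preset.
    settings.update(overrides or {})
    return settings
```

Under these assumptions, settings_for_mode("aggressive") turns on merging and whitespace normalization, while explicit overrides still take precedence.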

setProgressCallback(callback: Function): void

Sets the callback invoked with progress updates.

Parameters:

  • callback: Function(percent: number, message: string)

cleanText(text: string, settings: Object): Promise<Object>

Cleans text with the AI model, then applies post-processing.

Parameters:

  • text: String to clean
  • settings: Cleaning options (includes cleaningModeType)

Returns: Promise resolving to:

{
  text: string,           // Cleaned text
  stats: {
    linesRemoved: number,
    duplicatesCollapsed: number,
    emptyLinesRemoved: number,
    headerFooterRemoved: number,
    punctuationLinesRemoved: number,
    dehyphenatedTokens: number,
    mergedLines: number
  }
}

Processing Flow:

  1. LLM analyzes text based on mode and settings
  2. Post-processing applies:
    • Dehyphenation (if enabled)
    • Merge broken lines (if enabled)
    • Whitespace normalization (if enabled)
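
The post-processing stage above can be sketched as three small text transforms applied in order. The code below is a Python illustration under stated assumptions. The function names, the regular expressions, and the merge heuristic (join when the previous line does not end a sentence and the next line starts lowercase) are hypothetical and are not taken from the SmartCleaner source. Only the step order, the setting names, and the two stat names come from this page.

```python
import re


def dehyphenate(text):
    """Join words split by a hyphen at a line end, e.g. 'docu-' + 'ment'."""
    return re.subn(r"(\w)-\n(\w)", r"\1\2", text)  # (new_text, count)


def merge_broken_lines(text):
    """Merge a line into the previous one when the previous line does not
    end a sentence and the current line starts with a lowercase letter."""
    merged, count = [], 0
    for line in text.split("\n"):
        prev = merged[-1] if merged else ""
        ends_sentence = prev.rstrip().endswith((".", "!", "?", ":"))
        if prev and line and not ends_sentence and line[0].islower():
            merged[-1] = prev.rstrip() + " " + line
            count += 1
        else:
            merged.append(line)
    return "\n".join(merged), count


def normalize_whitespace(text):
    """Collapse runs of spaces and tabs and strip trailing whitespace."""
    return "\n".join(
        re.sub(r"[ \t]+", " ", line).rstrip() for line in text.split("\n")
    )


def post_process(text, settings):
    """Apply the enabled steps in the documented order."""
    stats = {"dehyphenatedTokens": 0, "mergedLines": 0}
    if settings.get("dehyphenate"):
        text, stats["dehyphenatedTokens"] = dehyphenate(text)
    if settings.get("mergeBrokenLines"):
        text, stats["mergedLines"] = merge_broken_lines(text)
    if settings.get("normalizeWhitespace"):
        text = normalize_whitespace(text)
    return text, stats
```

One caveat of this simple dehyphenation rule is that it also joins genuine compounds that happen to break at the hyphen (for example "well-" followed by "known"), which is one reason a real implementation may be more selective.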

cancel(): void

Cancels the ongoing cleaning operation.

CLI API

Command Line Interface

python tool.py [OPTIONS] [FILES...]

Python API

from tool import DocStripper

stripper = DocStripper(
    remove_empty_lines=True,
    remove_page_numbers=True,
    remove_headers_footers=True,
    remove_duplicates=True,
    remove_punctuation_lines=True,
    preserve_paragraph_spacing=True
)

# Process text
cleaned_text, stats = stripper.process_text(text)

# Process file
cleaned_text, stats = stripper.process_file(file_path)

DocStripper Class

class DocStripper:
    def __init__(self, **options):
        """
        Initialize DocStripper with cleaning options.
        
        Options:
            remove_empty_lines: bool
            remove_page_numbers: bool
            remove_headers_footers: bool
            remove_duplicates: bool
            remove_punctuation_lines: bool
            preserve_paragraph_spacing: bool
        """
    
    def process_text(self, text: str) -> tuple[str, dict]:
        """
        Process text string.
        
        Returns:
            tuple: (cleaned_text, statistics)
        """
    
    def process_file(self, file_path: str) -> tuple[str, dict]:
        """
        Process file from disk.
        
        Returns:
            tuple: (cleaned_text, statistics)
        """
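
To make the (cleaned_text, statistics) contract concrete, here is a minimal standalone sketch of a rule-based pass. It is not the DocStripper implementation. The function name process_text_sketch, the regular expressions, the choice to drop only consecutive duplicates, and the stats keys other than removed_lines are assumptions made for illustration. Header/footer detection and paragraph-spacing preservation are left out.

```python
import re

# Assumed patterns: a bare number, "Page 3", or "Page 3 of 10".
PAGE_NUMBER = re.compile(r"(?:page\s+)?\d+(?:\s+of\s+\d+)?", re.IGNORECASE)
# Lines made up only of punctuation, symbols, underscores, or inner spaces.
PUNCTUATION_ONLY = re.compile(r"[\W_]+")


def process_text_sketch(text, remove_empty_lines=True, remove_page_numbers=True,
                        remove_duplicates=True, remove_punctuation_lines=True):
    """Return (cleaned_text, stats) after a single pass over the lines."""
    lines = text.splitlines()
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            if not remove_empty_lines:
                kept.append(line)
            continue
        if remove_page_numbers and PAGE_NUMBER.fullmatch(stripped):
            continue
        if remove_punctuation_lines and PUNCTUATION_ONLY.fullmatch(stripped):
            continue
        # Assumption: only a repeat of the previously kept line is a duplicate.
        if remove_duplicates and kept and kept[-1].strip() == stripped:
            continue
        kept.append(line)
    stats = {
        "original_lines": len(lines),
        "cleaned_lines": len(kept),
        "removed_lines": len(lines) - len(kept),
    }
    return "\n".join(kept), stats
```

A bare-number rule like this also removes legitimate lines that consist of a single number, so a production cleaner would likely need positional or contextual checks.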

Integration Examples

JavaScript

// Rule-based cleaning
const file = document.getElementById('fileInput').files[0];
const stripper = new DocStripper({
  removeEmptyLines: true,
  removePageNumbers: true
});

const result = await stripper.processFile(file);
console.log(result.stats);

Python

from tool import DocStripper

stripper = DocStripper(
    remove_empty_lines=True,
    remove_page_numbers=True
)

cleaned, stats = stripper.process_file('document.txt')
print(f"Removed {stats['removed_lines']} lines")

Error Handling

Web Application

try {
  const result = await stripper.processFile(file);
} catch (error) {
  console.error('Processing failed:', error);
  // Handle error
}

Python API

try:
    cleaned, stats = stripper.process_file(file_path)
except FileNotFoundError:
    print(f"File not found: {file_path}")
except Exception as e:
    print(f"Error: {e}")

Performance Considerations

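  • Fast Clean (rule-based, DocStripper): O(n) time, where n is the number of lines
  • Smart Clean (AI-powered, SmartCleaner): O(n) in the number of lines, with additional overhead from LLM inference
  • Large files: Consider splitting the input into chunks before running Smart Clean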
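
The chunking suggestion can be sketched as follows. This is a hedged Python illustration. chunk_lines, clean_in_chunks, and the 4000-character limit are hypothetical, and clean_chunk stands in for whatever call performs the actual cleaning (for example a wrapper around Smart Clean). Chunks are split on line boundaries so that no line is cut in half.

```python
def chunk_lines(text, max_chars=4000):
    """Split text into chunks of whole lines, each at most max_chars long.
    A single line longer than max_chars becomes its own oversized chunk."""
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks


def clean_in_chunks(text, clean_chunk, max_chars=4000):
    """Clean each chunk independently and concatenate the results."""
    return "".join(clean_chunk(chunk) for chunk in chunk_lines(text, max_chars))
```

Because each chunk is cleaned in isolation, repeated headers, footers, or duplicates that straddle a chunk boundary may go undetected. Running a rule-based pass over the whole text first is one way to reduce that effect.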