API
kiku edited this page Nov 1, 2025 · 2 revisions
DocStripper
Main class for rule-based document cleaning.
const stripper = new DocStripper(options);

Options:
{
  removeEmptyLines: boolean,         // Default: true
  removePageNumbers: boolean,        // Default: true
  removeHeadersFooters: boolean,     // Default: true
  removeDuplicates: boolean,         // Default: true
  removePunctuationLines: boolean,   // Default: true
  preserveParagraphSpacing: boolean  // Default: true
}

processFile(file)
Processes a file and returns cleaned content with statistics.
Parameters:
- file: File object (from file input)
Returns:
{
  fileName: string,
  content: string,
  stats: {
    originalLines: number,
    cleanedLines: number,
    removedLines: number,
    duplicatesRemoved: number,
    headersFootersRemoved: number,
    pageNumbersRemoved: number,
    punctuationLinesRemoved: number,
    emptyLinesRemoved: number
  }
}

SmartCleaner
AI-powered document cleaning using WebLLM.
const cleaner = new SmartCleaner();

new SmartCleaner()
Creates a new SmartCleaner instance with default settings.
{
  removeEmptyLines: boolean,         // Default: true
  removePageNumbers: boolean,        // Default: true
  removeHeadersFooters: boolean,     // Default: true
  removeDuplicates: boolean,         // Default: true
  removePunctuationLines: boolean,   // Default: true
  preserveParagraphSpacing: boolean, // Default: true
  dehyphenate: boolean,              // Default: true
  mergeBrokenLines: boolean,         // Default: false (enabled in Aggressive mode)
  normalizeWhitespace: boolean,      // Default: false (enabled in Aggressive mode)
  keepTableSpacing: boolean,         // Default: true
  cleaningModeType: string           // 'conservative' or 'aggressive'
}

Updates the cleaning settings for the AI model.
Parameters:
- settings: Object with cleaning options and cleaningModeType
Note: cleaningModeType affects the LLM prompt:
- 'conservative': Cautious prompts that preserve structure
- 'aggressive': Thorough prompts that allow merging and normalization
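
The relationship between cleaningModeType and the two mode-dependent options can be sketched as follows. This is only an illustration: `settingsForMode` and `BASE_SETTINGS` are hypothetical names, not part of the SmartCleaner API, and the sketch assumes that switching to Aggressive mode enables exactly the two options marked "enabled in Aggressive mode" above.

```javascript
// Hypothetical helper (not part of the documented API).
// Defaults mirror the option comments in the settings object above.
const BASE_SETTINGS = Object.freeze({
  removeEmptyLines: true,
  removePageNumbers: true,
  removeHeadersFooters: true,
  removeDuplicates: true,
  removePunctuationLines: true,
  preserveParagraphSpacing: true,
  dehyphenate: true,
  mergeBrokenLines: false,
  normalizeWhitespace: false,
  keepTableSpacing: true,
});

function settingsForMode(mode) {
  if (mode !== 'conservative' && mode !== 'aggressive') {
    throw new Error(`Unknown cleaningModeType: ${mode}`);
  }
  const aggressive = mode === 'aggressive';
  return {
    ...BASE_SETTINGS,
    // Assumption: these two options simply follow the selected mode.
    mergeBrokenLines: aggressive,
    normalizeWhitespace: aggressive,
    cleaningModeType: mode,
  };
}
```

The resulting object has the same shape as the settings object documented above, so it could be passed wherever cleaning settings are expected.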

Sets a callback for progress updates.
Parameters:
- callback: Function(percent: number, message: string)
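
A callback matching the documented (percent, message) shape might look like the sketch below. `formatProgress` and `onProgress` are hypothetical names chosen for illustration, and clamping the percentage is a defensive assumption rather than documented behavior.

```javascript
// Hypothetical formatter for progress updates.
function formatProgress(percent, message) {
  // Clamp to 0..100 and round so the display never shows out-of-range values.
  const clamped = Math.min(100, Math.max(0, Math.round(percent)));
  return `[${String(clamped).padStart(3)}%] ${message}`;
}

// A callback with the documented signature: Function(percent, message).
const onProgress = (percent, message) => {
  console.log(formatProgress(percent, message));
};
```

`onProgress` would then be handed to the progress-callback setter described above.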

Cleans text using the AI model, then applies post-processing.
Parameters:
- text: String to clean
- settings: Cleaning options (includes cleaningModeType)
Returns: Promise resolving to:
{
  text: string, // Cleaned text
  stats: {
    linesRemoved: number,
    duplicatesCollapsed: number,
    emptyLinesRemoved: number,
    headerFooterRemoved: number,
    punctuationLinesRemoved: number,
    dehyphenatedTokens: number,
    mergedLines: number
  }
}

Processing Flow:
- LLM analyzes text based on mode and settings
- Post-processing applies:
  - Dehyphenation (if enabled)
  - Merge broken lines (if enabled)
  - Whitespace normalization (if enabled)
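
The post-processing stage of the flow above can be sketched as three small text passes. This is a minimal illustration under stated assumptions, not the actual SmartCleaner implementation: the function names, the regular expressions, and the heuristic for deciding that a line is "broken" are all hypothetical, and keepTableSpacing is not handled here.

```javascript
// Join words split by a hyphen at a line break: "docu-\nment" -> "document".
function dehyphenate(text) {
  let count = 0;
  const out = text.replace(/([A-Za-z]+)-\n([a-z]+)/g, (_, head, tail) => {
    count += 1;
    return head + tail;
  });
  return { text: out, count };
}

// Assumed heuristic: a line continues the previous one when the previous line
// is non-empty, does not end in sentence punctuation, and this line starts lowercase.
function mergeBrokenLines(text) {
  const merged = [];
  let count = 0;
  for (const line of text.split('\n')) {
    const prev = merged[merged.length - 1];
    const continues =
      prev !== undefined &&
      prev.trim() !== '' &&
      !/[.!?:;]$/.test(prev.trimEnd()) &&
      /^[a-z]/.test(line.trimStart());
    if (continues) {
      merged[merged.length - 1] = prev.trimEnd() + ' ' + line.trimStart();
      count += 1;
    } else {
      merged.push(line);
    }
  }
  return { text: merged.join('\n'), count };
}

// Collapse runs of spaces and tabs, and drop trailing whitespace on each line.
function normalizeWhitespace(text) {
  return text
    .split('\n')
    .map((line) => line.replace(/[ \t]+/g, ' ').trimEnd())
    .join('\n');
}

// Apply the passes in the documented order, each only if enabled.
function postProcess(text, settings) {
  const stats = { dehyphenatedTokens: 0, mergedLines: 0 };
  let out = text;
  if (settings.dehyphenate) {
    const result = dehyphenate(out);
    out = result.text;
    stats.dehyphenatedTokens = result.count;
  }
  if (settings.mergeBrokenLines) {
    const result = mergeBrokenLines(out);
    out = result.text;
    stats.mergedLines = result.count;
  }
  if (settings.normalizeWhitespace) {
    out = normalizeWhitespace(out);
  }
  return { text: out, stats };
}
```

Running dehyphenation first means a word split across a line break is rejoined before the merge heuristic looks at line endings, which is consistent with the order listed in the flow.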

Cancels the ongoing cleaning operation.

python tool.py [OPTIONS] [FILES...]

from tool import DocStripper

stripper = DocStripper(
    remove_empty_lines=True,
    remove_page_numbers=True,
    remove_headers_footers=True,
    remove_duplicates=True,
    remove_punctuation_lines=True,
    preserve_paragraph_spacing=True
)

# Process text
cleaned_text, stats = stripper.process_text(text)

# Process file
cleaned_text, stats = stripper.process_file(file_path)

class DocStripper:
    def __init__(self, **options):
        """
        Initialize DocStripper with cleaning options.

        Options:
            remove_empty_lines: bool
            remove_page_numbers: bool
            remove_headers_footers: bool
            remove_duplicates: bool
            remove_punctuation_lines: bool
            preserve_paragraph_spacing: bool
        """

    def process_text(self, text: str) -> tuple[str, dict]:
        """
        Process text string.

        Returns:
            tuple: (cleaned_text, statistics)
        """

    def process_file(self, file_path: str) -> tuple[str, dict]:
        """
        Process file from disk.

        Returns:
            tuple: (cleaned_text, statistics)
        """

// Rule-based cleaning
const file = document.getElementById('fileInput').files[0];
const stripper = new DocStripper({
  removeEmptyLines: true,
  removePageNumbers: true
});
const result = await stripper.processFile(file);
console.log(result.stats);

from tool import DocStripper

stripper = DocStripper(
    remove_empty_lines=True,
    remove_page_numbers=True
)
cleaned, stats = stripper.process_file('document.txt')
print(f"Removed {stats['removed_lines']} lines")

try {
  const result = await stripper.processFile(file);
} catch (error) {
  console.error('Processing failed:', error);
  // Handle error
}

try:
    cleaned, stats = stripper.process_file(file_path)
except FileNotFoundError:
    print(f"File not found: {file_path}")
except Exception as e:
    print(f"Error: {e}")

- Fast Clean: O(n), where n is the number of lines
- Smart Clean: O(n) with overhead from LLM processing
- Large files: Consider chunking for Smart Clean mode
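
One way to chunk a large document before Smart Clean is to split on line boundaries so that no line is cut in half. The sketch below is an assumption about how this could be done, not a documented feature: `chunkByLines` and its `maxChars` parameter are hypothetical.

```javascript
// Split text into chunks of at most maxChars characters, breaking only at
// line boundaries. A single line longer than maxChars becomes its own
// oversized chunk rather than being cut.
function chunkByLines(text, maxChars) {
  const chunks = [];
  let current = [];
  let size = 0;
  for (const line of text.split('\n')) {
    const cost = line.length + 1; // +1 for the newline that joins lines
    if (current.length > 0 && size + cost > maxChars) {
      chunks.push(current.join('\n'));
      current = [];
      size = 0;
    }
    current.push(line);
    size += cost;
  }
  if (current.length > 0) {
    chunks.push(current.join('\n'));
  }
  return chunks;
}
```

Joining the chunks with '\n' reproduces the original text, so cleaned chunks can be reassembled in order. Repeated headers or duplicate lines that fall in different chunks may not be recognized as repeats, so larger chunks are likely preferable when the model allows.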