Advanced PDF processing with AI-powered OCR, text extraction, and selectable text overlays using Ollama models
- AI-Powered OCR using Ollama models (llava, moondream, etc.)
- Modular Architecture with clear separation of concerns
- Multiple Output Formats:
  - SVG with selectable text overlays
  - Raw text extraction
  - JSON metadata
- Image Enhancement with multiple strategies
- Robust Error Handling with configurable retries
- Parallel Processing for batch operations
- CLI Interface with progress tracking
```
┌─────────────────────────────────────────────────────┐
│                  PDF OCR Processor                   │
├──────────────────┬──────────────────────────────────┤
│  ┌────────────┐  │  ┌───────────────────────────┐   │
│  │    PDF     │  │  │       OCRProcessor        │   │
│  │ Processor  │──┼─▶│  - Text extraction        │   │
│  └────────────┘  │  │  - Ollama integration     │   │
│                  │  └─────────────┬─────────────┘   │
│  ┌────────────┐  │  ┌─────────────▼─────────────┐   │
│  │   Image    │  │  │       SVG Generator       │   │
│  │  Enhancer  │──┼─▶│  - Text overlay           │   │
│  └────────────┘  │  │  - Searchable output      │   │
└──────────────────┴──────────────────────────────────┘
```
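The OCR stage sends each rendered page image to a local Ollama server. As a rough illustration of that interaction (not the project's actual code; the function name, prompt, and endpoint default are assumptions based on the standard Ollama REST API), a single-page request might look like this:

```python
import base64
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint


def ocr_page(image_path: str, model: str = "llava:7b") -> str:
    """Send one page image to a vision model and return the extracted text."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    response = requests.post(
        OLLAMA_URL,
        json={
            "model": model,
            "prompt": "Extract all text from this page, preserving reading order.",
            "images": [image_b64],
            "stream": False,
        },
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["response"]


# Example: print(ocr_page("page_001.png"))
```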
Requirements:

- Python 3.8+
- Ollama (for OCR processing)
- System dependencies:
```bash
# Ubuntu/Debian
sudo apt-get install -y tesseract-ocr poppler-utils

# macOS
brew install tesseract poppler
```
```bash
# Clone the repository
git clone https://github.com/wronai/ocr.git
cd ocr

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # Linux/macOS

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt  # For development
```

```bash
# Process a single PDF
python -m pdf_processor --input document.pdf --output output/
# Process all PDFs in a directory
python -m pdf_processor --input ./documents --output ./output --model llava:7b
# Show help
python -m pdf_processor --help
```

You can also drive the processor from Python:

```python
from pdf_processor import PDFProcessor
from pdf_processor.processing.pdf_processor import PDFProcessorConfig
# Configure the processor
config = PDFProcessorConfig(
input_path="document.pdf",
output_dir="./output",
ocr_model="llava:7b",
dpi=300,
max_workers=4
)
# Process a document
processor = PDFProcessor(config)
result = processor.process_pdf("document.pdf")
print(f"Processed {result['pages_processed']} pages")Create a config.yaml file:
Create a `config.yaml` file:

```yaml
# config.yaml
input_path: ./documents # Input file or directory
output_dir: ./output # Output directory
ocr_model: llava:7b # Ollama model to use
dpi: 300 # Image resolution
max_workers: 4 # Number of worker threads
timeout: 300 # Timeout in seconds
max_retries: 3 # Max retry attempts
log_level: INFO # Logging level
log_file: pdf_processor.log # Log file path
# Image enhancement strategies
enhancement_strategies:
- original # Keep original image
- grayscale # Convert to grayscale
- adaptive_threshold # Apply adaptive thresholding
- contrast_stretch # Stretch contrast
- sharpen # Sharpen image
- denoise # Remove noise
```

Environment variables:

```bash
export OLLAMA_HOST="http://localhost:11434"
export OLLAMA_MODEL="llava:7b"
export LOG_LEVEL="DEBUG"
```
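How these variables interact with `config.yaml` is up to the application; as a hedged sketch of the usual pattern (environment value first, then a default — the real precedence rules may differ):

```python
import os

# Hypothetical settings lookup; field names mirror the exports above.
ollama_host = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
ollama_model = os.environ.get("OLLAMA_MODEL", "llava:7b")
log_level = os.environ.get("LOG_LEVEL", "INFO")
```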
Additional CLI options:

```bash
# Process with specific DPI
python -m pdf_processor --input document.pdf --output output/ --dpi 400
# Limit number of pages to process
python -m pdf_processor --input document.pdf --output output/ --max-pages 10
# Use a specific enhancement strategy
python -m pdf_processor --input document.pdf --output output/ --enhance grayscale
# Process in verbose mode
python -m pdf_processor --input document.pdf --output output/ --verbose
```

Available enhancement strategies:

- `original`: Keep original image (fastest)
- `grayscale`: Convert to grayscale (good for text-heavy documents)
- `adaptive_threshold`: Apply adaptive thresholding (good for low-quality scans)
- `contrast_stretch`: Stretch contrast to improve readability
- `sharpen`: Apply sharpening filter
- `denoise`: Remove image noise
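These strategies roughly correspond to standard image transforms. A sketch of what a few of them might look like with Pillow (illustrative only; the project's `image_enhancement.py` may use different libraries and parameters):

```python
from PIL import Image, ImageFilter, ImageOps


def enhance(image: Image.Image, strategy: str) -> Image.Image:
    """Apply one enhancement strategy to a page image (illustrative sketch)."""
    if strategy == "original":
        return image
    if strategy == "grayscale":
        return ImageOps.grayscale(image)
    if strategy == "adaptive_threshold":
        # Simplified global threshold; a true adaptive threshold would use a
        # local window (e.g. OpenCV's adaptiveThreshold).
        gray = ImageOps.grayscale(image)
        return gray.point(lambda p: 255 if p > 127 else 0)
    if strategy == "contrast_stretch":
        return ImageOps.autocontrast(image)
    if strategy == "sharpen":
        return image.filter(ImageFilter.SHARPEN)
    if strategy == "denoise":
        return image.filter(ImageFilter.MedianFilter(size=3))
    raise ValueError(f"Unknown strategy: {strategy}")
```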
Project structure:

```
pdf_processor/
├── __init__.py                # Package initialization
├── cli.py                     # Command-line interface
├── config/                    # Configuration files
├── models/                    # Data models
│   ├── __init__.py
│   ├── ocr_result.py          # OCR result data structures
│   └── retry_config.py        # Retry configuration
├── processing/                # Core processing modules
│   ├── __init__.py
│   ├── image_enhancement.py   # Image processing
│   ├── ocr_processor.py       # OCR processing
│   ├── pdf_processor.py       # Main PDF processing
│   └── svg_generator.py       # SVG output generation
└── utils/                     # Utility functions
    ├── file_utils.py          # File operations
    ├── logging_utils.py       # Logging configuration
    └── validation_utils.py    # Input validation
```
To run the test suite:

```bash
# Install test dependencies
pip install -r requirements-dev.txt
# Run all tests
pytest
# Run tests with coverage report
pytest --cov=pdf_processor --cov-report=html
```
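As a starting point, a test can exercise the public API directly. This is a hypothetical example (the actual test layout and config internals may differ); the field names come from the usage example earlier in this README:

```python
# tests/test_config.py (hypothetical example)
from pdf_processor.processing.pdf_processor import PDFProcessorConfig


def test_config_round_trip(tmp_path):
    # Assumes config fields are exposed as attributes after construction.
    config = PDFProcessorConfig(
        input_path="document.pdf",
        output_dir=str(tmp_path),
        ocr_model="llava:7b",
        dpi=300,
        max_workers=4,
    )
    assert config.ocr_model == "llava:7b"
    assert config.dpi == 300
```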
Contributions are welcome! Please follow these steps:

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- The Ollama team for their amazing AI models
- The PyMuPDF team for excellent PDF processing
- All contributors who have helped improve this project
This project uses a script-based workflow for development tasks. All scripts are located in the scripts/ directory and can be run directly or via the Makefile.
- Clone the repository and navigate to the project directory:

  ```bash
  git clone https://github.com/wronai/ocr.git
  cd ocr
  ```

- Set up the development environment:

  ```bash
  make install-dev
  ```

  This will:
  - Create and activate a virtual environment
  - Install all development dependencies
  - Set up pre-commit hooks
Common development tasks:

```bash
# Run tests
make test
# Run tests with coverage
make test-cov
# Format code
make format
# Run linters
make lint
# Start development server
make dev-server
# Build documentation
make docs
make docs-serve  # Serve docs locally
```

All development and build scripts are located in the `scripts/` directory. See `scripts/README.md` for detailed documentation of each script.
```bash
# Build Docker image
make docker-build
# Start services with Docker Compose
make docker-run
# Stop services
make docker-stop
```
Contributions are welcome! Please ensure your code follows our coding standards and includes appropriate tests.
This project is licensed under the MIT License - see the LICENSE file for details.
See `CHANGELOG.md` for a list of changes in each version.
- Run the processor:

  ```bash
  python proc.py --model llava:7b --workers 4
  ```

- View results:
  - Open `output/*_complete.svg` in your browser
  - Check details in `output/processing_report.json`
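The selectable text in those SVG files comes from layering text elements over the rendered page image. A bare-bones sketch of that idea (illustrative only, not the project's actual `svg_generator.py`; the `words` structure is a hypothetical OCR output format):

```python
import base64
from typing import Dict, List
from xml.sax.saxutils import escape


def page_to_svg(image_path: str, words: List[Dict], width: int, height: int) -> str:
    """Render a page image plus an invisible, selectable text layer as SVG.

    `words` is a hypothetical list of {"text", "x", "y", "size"} dicts from
    the OCR step; the real generator's data model may differ.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    # Transparent text stays invisible but remains selectable and searchable.
    text_layer = "".join(
        f'<text x="{w["x"]}" y="{w["y"]}" font-size="{w["size"]}" '
        f'fill-opacity="0">{escape(w["text"])}</text>'
        for w in words
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f'<image href="data:image/png;base64,{image_b64}" '
        f'width="{width}" height="{height}"/>'
        f"{text_layer}</svg>"
    )
```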
Full documentation is available in the `docs/` directory.