MVP version of PDF deep processor #73

achillet3 · 2025-04-15T13:05:46Z

PDF Image Analysis Enhancement

Overview

This PR adds advanced image analysis capabilities to the PDF processor, enabling text extraction from images embedded within PDF files. This feature is particularly valuable for scanned documents, diagrams, charts, and other visual content.

Features

Added two image analysis options:
- SmolDocling (Open Source): Uses the SmolDocling model from IBM Research and Hugging Face for local image analysis, with no API access required
- MistralOCR (API-based): Leverages Mistral's OCR API for high-quality text extraction from images
Updated PDF processor to detect and analyze embedded images
Added configuration options to enable/disable image analysis and select the analyzer type
Implemented batch processing for efficient handling of multiple images
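
The batch processing mentioned above could look roughly like the following. This is a minimal sketch, not the code in this PR: `batched`, `analyze_in_batches`, `fake_analyze`, and the default `batch_size` are hypothetical names and values chosen for illustration. It only shows splitting the images into fixed-size chunks and keeping the results in input order.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def analyze_in_batches(
    images: Sequence[bytes],
    analyze_batch: Callable[[Sequence[bytes]], List[str]],
    batch_size: int = 4,  # assumed default, not taken from the PR
) -> List[str]:
    """Run `analyze_batch` on each chunk and return the texts in input order."""
    texts: List[str] = []
    for chunk in batched(images, batch_size):
        texts.extend(analyze_batch(chunk))
    return texts


def fake_analyze(chunk: Sequence[bytes]) -> List[str]:
    """Hypothetical stand-in for an analyzer: reports each image's size."""
    return [f"{len(image)} bytes" for image in chunk]


results = analyze_in_batches([b"a", b"bb", b"ccc"], fake_analyze, batch_size=2)
print(results)  # ['1 bytes', '2 bytes', '3 bytes']
```

Passing the analyzer in as a callable keeps the chunking logic independent of whether the work happens in a local model or behind an API.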

Implementation Details

Created SmolDoclingImageAnalyzer class for local image analysis using transformers
Created MistralOCRImageAnalyzer class for API-based image analysis
Added configuration parameters in PDFProcessor to control image analysis behavior
Updated documentation with configuration examples and usage instructions
Added comprehensive unit tests for both analyzer implementations
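
One way the two analyzer classes could share an interface and be selected by name is sketched below. The class names `SmolDoclingImageAnalyzer` and `MistralOCRImageAnalyzer` come from this PR; the `ImageAnalyzer` base class, the `analyze` method, the `api_key` argument, and `create_image_analyzer` are assumptions made for the example, and the method bodies are placeholders rather than real model or API calls.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence, Type


class ImageAnalyzer(ABC):
    """Hypothetical common interface; the PR does not show its base class."""

    @abstractmethod
    def analyze(self, images: Sequence[bytes]) -> List[str]:
        """Return one extracted-text string per input image."""


class SmolDoclingImageAnalyzer(ImageAnalyzer):
    def analyze(self, images: Sequence[bytes]) -> List[str]:
        # Placeholder: a real implementation would run the local model here.
        return ["" for _ in images]


class MistralOCRImageAnalyzer(ImageAnalyzer):
    def __init__(self, api_key: str = "") -> None:
        self.api_key = api_key  # assumed constructor argument

    def analyze(self, images: Sequence[bytes]) -> List[str]:
        # Placeholder: a real implementation would call the OCR API here.
        return ["" for _ in images]


# Maps the configuration values to analyzer classes.
_ANALYZERS: Dict[str, Type[ImageAnalyzer]] = {
    "smoldocling": SmolDoclingImageAnalyzer,
    "mistral": MistralOCRImageAnalyzer,
}


def create_image_analyzer(analyzer_type: str) -> ImageAnalyzer:
    """Instantiate the analyzer registered under `analyzer_type`."""
    try:
        analyzer_cls = _ANALYZERS[analyzer_type.lower()]
    except KeyError:
        options = ", ".join(sorted(_ANALYZERS))
        raise ValueError(
            f"Unknown analyzer {analyzer_type!r}; expected one of: {options}"
        ) from None
    return analyzer_cls()


analyzer = create_image_analyzer("smoldocling")
print(type(analyzer).__name__)  # SmolDoclingImageAnalyzer
```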

Configuration

dispatcher_config:
  processor_config:
    PDFProcessor:
      - analyze_images: true  # Enable image analysis
      - image_analyzer_type: "smoldocling"  # Options: "smoldocling" or "mistral"
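
For illustration, the sketch below shows how these two options might be read once the YAML above has been parsed into a Python dictionary. `PDFImageSettings`, `read_pdf_image_settings`, and the fallback defaults are hypothetical and are not taken from this PR; the dictionary literal mirrors the YAML structure shown above, where `PDFProcessor` holds a list of single-key entries.

```python
from dataclasses import dataclass
from typing import Any, Dict

SUPPORTED_ANALYZERS = ("smoldocling", "mistral")


@dataclass
class PDFImageSettings:
    analyze_images: bool = False  # assumed default
    image_analyzer_type: str = "smoldocling"  # assumed default


def read_pdf_image_settings(config: Dict[str, Any]) -> PDFImageSettings:
    """Merge the PDFProcessor list entries and validate the analyzer type."""
    entries = (
        config.get("dispatcher_config", {})
        .get("processor_config", {})
        .get("PDFProcessor", [])
    )
    merged: Dict[str, Any] = {}
    for entry in entries:
        merged.update(entry)  # each entry is a single-key mapping

    settings = PDFImageSettings(
        analyze_images=bool(merged.get("analyze_images", False)),
        image_analyzer_type=str(
            merged.get("image_analyzer_type", "smoldocling")
        ).lower(),
    )
    if settings.image_analyzer_type not in SUPPORTED_ANALYZERS:
        raise ValueError(
            f"image_analyzer_type must be one of {SUPPORTED_ANALYZERS}, "
            f"got {settings.image_analyzer_type!r}"
        )
    return settings


# The YAML above, as a YAML loader would return it.
parsed = {
    "dispatcher_config": {
        "processor_config": {
            "PDFProcessor": [
                {"analyze_images": True},
                {"image_analyzer_type": "smoldocling"},
            ]
        }
    }
}

settings = read_pdf_image_settings(parsed)
print(settings.analyze_images, settings.image_analyzer_type)
```

Validating the analyzer name at load time surfaces a typo in the configuration before any PDF is processed.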

Testing

Added unit tests for both SmolDocling and MistralOCR analyzers
Tests include mocked implementations to avoid actual API calls or model loading during testing
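
The mocking approach described above can be sketched with `unittest.mock` from the standard library. This is not the test code from this PR: `ApiImageAnalyzer`, its injected `client`, and the `extract_text` method are hypothetical names used to show how a test can replace the remote call with a `Mock` and still check the analyzer's behavior.

```python
from typing import List, Sequence
from unittest.mock import Mock


class ApiImageAnalyzer:
    """Hypothetical analyzer that delegates OCR to an injected client."""

    def __init__(self, client) -> None:
        self.client = client

    def analyze(self, images: Sequence[bytes]) -> List[str]:
        # One client call per image; results keep the input order.
        return [self.client.extract_text(image) for image in images]


def test_analyzer_uses_client_without_network() -> None:
    # The Mock stands in for the real API client, so no request is sent.
    client = Mock()
    client.extract_text.side_effect = ["first page", "second page"]

    analyzer = ApiImageAnalyzer(client)
    texts = analyzer.analyze([b"img-1", b"img-2"])

    assert texts == ["first page", "second page"]
    assert client.extract_text.call_count == 2
    client.extract_text.assert_any_call(b"img-1")
    client.extract_text.assert_any_call(b"img-2")


test_analyzer_uses_client_without_network()
```

Injecting the client through the constructor is one design that makes this kind of substitution straightforward; patching a module-level client with `unittest.mock.patch` would be an alternative.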

MVP version of PDF deep processor

c55fce2

fabnemEPFL mentioned this pull request Jun 19, 2025

Make a deeper PDF processor #65

Open
