MVP version of PDF deep processor #73

Open
wants to merge 1 commit into
base: master

achillet3

PDF Image Analysis Enhancement

Overview

This PR adds advanced image analysis capabilities to the PDF processor, enabling text extraction from images embedded within PDF files. This feature is particularly valuable for scanned documents, diagrams, charts, and other visual content.

Features

  • Added two image analysis options:
    • SmolDocling (Open Source): Uses the SmolDocling model from IBM Research and Hugging Face for local image analysis, with no external API required
    • MistralOCR (API-based): Leverages Mistral's OCR API for high-quality text extraction from images
  • Updated PDF processor to detect and analyze embedded images
  • Added configuration options to enable/disable image analysis and select the analyzer type
  • Implemented batch processing for efficient handling of multiple images
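The batch processing mentioned above could look roughly like the sketch below. It is not the PR's code: `batched`, `analyze_in_batches`, the `analyze_batch` callable, and the default batch size of 4 are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def analyze_in_batches(
    images: Sequence[bytes],
    analyze_batch: Callable[[Sequence[bytes]], List[str]],
    batch_size: int = 4,  # assumed default, not taken from the PR
) -> List[str]:
    """Run analyze_batch over the images one batch at a time, keeping order."""
    results: List[str] = []
    for batch in batched(images, batch_size):
        results.extend(analyze_batch(batch))
    return results
```

Passing the analyzer in as a callable keeps the batching logic independent of whether the backend is a local model or a remote API.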

Implementation Details

  • Created SmolDoclingImageAnalyzer class for local image analysis using transformers
  • Created MistralOCRImageAnalyzer class for API-based image analysis
  • Added configuration parameters in PDFProcessor to control image analysis behavior
  • Updated documentation with configuration examples and usage instructions
  • Added comprehensive unit tests for both analyzer implementations
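One way the two analyzers could share an interface and be selected by type is sketched below. This is an assumption about the design, not the PR's implementation: `ImageAnalyzer`, `LocalAnalyzerStub`, `ApiAnalyzerStub`, and `make_analyzer` are hypothetical stand-ins, and the stubs return placeholder strings instead of loading a model or calling an API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence, Type


class ImageAnalyzer(ABC):
    """Hypothetical common interface: image bytes in, extracted text out."""

    @abstractmethod
    def analyze(self, images: Sequence[bytes]) -> List[str]:
        ...


class LocalAnalyzerStub(ImageAnalyzer):
    """Stand-in for a local, model-based analyzer (no model is loaded here)."""

    def analyze(self, images: Sequence[bytes]) -> List[str]:
        return [f"local:{len(image)} bytes" for image in images]


class ApiAnalyzerStub(ImageAnalyzer):
    """Stand-in for an API-based analyzer (no request is sent here)."""

    def analyze(self, images: Sequence[bytes]) -> List[str]:
        return [f"api:{len(image)} bytes" for image in images]


# Keys mirror the documented image_analyzer_type options.
ANALYZERS: Dict[str, Type[ImageAnalyzer]] = {
    "smoldocling": LocalAnalyzerStub,
    "mistral": ApiAnalyzerStub,
}


def make_analyzer(analyzer_type: str) -> ImageAnalyzer:
    """Instantiate the analyzer registered under analyzer_type."""
    try:
        return ANALYZERS[analyzer_type]()
    except KeyError:
        raise ValueError(f"unknown image_analyzer_type: {analyzer_type!r}") from None
```

A registry like this lets the processor stay unaware of which backend is in use; adding a third analyzer would only mean adding one dictionary entry.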

Configuration

dispatcher_config:
  processor_config:
    PDFProcessor:
      - analyze_images: true  # Enable image analysis
      - image_analyzer_type: "smoldocling"  # Options: "smoldocling" or "mistral"

Testing

  • Added unit tests for both SmolDocling and MistralOCR analyzers
  • Tests include mocked implementations to avoid actual API calls or model loading during testing
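A minimal illustration of the mocking approach described above, using only the standard library. `OCRAnalyzer` and its injected `client` with an `ocr` method are hypothetical; they do not reflect the PR's classes or the real Mistral client API.

```python
import unittest
from unittest.mock import MagicMock


class OCRAnalyzer:
    """Hypothetical analyzer that delegates OCR to an injected client."""

    def __init__(self, client):
        self.client = client

    def analyze(self, image: bytes) -> str:
        response = self.client.ocr(image)
        return response["text"].strip()


class OCRAnalyzerTest(unittest.TestCase):
    def test_returns_text_without_network(self):
        # The mock replaces the real client, so no request is ever sent.
        client = MagicMock()
        client.ocr.return_value = {"text": "  Invoice 42\n"}

        analyzer = OCRAnalyzer(client)

        self.assertEqual(analyzer.analyze(b"fake-png"), "Invoice 42")
        client.ocr.assert_called_once_with(b"fake-png")
```

Injecting the client through the constructor is what makes the mock easy to substitute; the same pattern would apply to a model loader for the local analyzer.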
