Advanced PDF layout analysis engine for extracting figures, tables, and structured content from complex engineering documents using computer vision and machine learning.

Magnet-AI/Quanta

Quanta

Advanced PDF Layout Analysis Engine

Python 3.8+ · License: MIT · Code style: black


Quanta is a PDF layout analysis engine that extracts figures, tables, and structured content from PDF documents using computer vision and machine learning.

🎯 Problem Solved: Complex engineering documents often lose critical visual information (figures, diagrams, technical drawings) when parsed by traditional PDF tools. Quanta targets accurate detection and extraction of visual elements from technical and engineering PDFs with intricate layouts, multi-column designs, and embedded graphics.

Debug overlay showing detected layout elements: columns (blue), text blocks (green), figures (red), and tables (yellow)

✨ Features

  • πŸ” Multi-column Layout Detection - Automatically identifies and processes complex multi-column layouts
  • πŸ“Š Intelligent Table Recognition (Mistral OCR) - Extracts tables and text with high accuracy via Mistral Document OCR
  • πŸ–ΌοΈ Figure Extraction (Custom) - Identifies and extracts figures, diagrams, and images using custom algorithms
  • πŸ“ Text Block Analysis (Mistral + Heuristics) - Uses Mistral OCR output and in-house grouping for reading order
  • 🏷️ Caption Linking - Automatically links captions to their corresponding figures and tables
  • 🎯 High Accuracy - Advanced algorithms ensure reliable content extraction
  • ⚑ Fast Processing - Optimized for speed and efficiency
  • πŸ› οΈ Easy Integration - Simple API for integration into existing workflows
  • πŸ”§ Debug Mode - Visualize layout analysis with overlay images

🚀 Quick Start

Install via PyPI

pip install quanta-pdf

Basic Usage (Python)

from quanta import extract_document

result = extract_document("document.pdf", "output/")
print(f"Pages: {len(result['pages'])}")

Command Line Interface

quanta --input document.pdf --output output/

If you want Mistral OCR tables/text, set MISTRAL_API_KEY first (see below).

Environment configuration (.env)

To enable Mistral OCR for tables and text blocks, set your API key. You can either export it or place it in a .env file at your project root.

# Option A: environment variable
export MISTRAL_API_KEY="your-mistral-api-key"

# Option B: .env file (same directory where you run the code)
echo "MISTRAL_API_KEY=your-mistral-api-key" > .env

The library loads .env automatically; the CLI also picks it up when run from that directory.
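Before a long run, it can help to confirm that the key is actually visible to the process. The sketch below is a minimal illustration, assuming a plain KEY=value .env format; `load_env_file` and `mistral_key_available` are hypothetical helpers written for this example, not part of Quanta, and the library's own loader may behave differently.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Hypothetical helper: copy KEY=value pairs from a .env file into os.environ.

    Variables that are already exported win over the file, so Option A
    overrides Option B when both are present.
    """
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    loaded = {}
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


def mistral_key_available():
    """Return True when MISTRAL_API_KEY is set to a non-empty value."""
    return bool(os.environ.get("MISTRAL_API_KEY"))
```

Calling `load_env_file()` followed by `mistral_key_available()` gives a quick yes/no check before invoking the extractor.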

📖 Documentation

Core Concepts

Layout Analysis Pipeline

The engine runs a seven-stage pipeline:

  1. PDF Rendering - Converts PDF pages to high-resolution images
  2. Column Detection - Identifies multi-column layouts using whitespace analysis
  3. Text Extraction - Extracts and groups text blocks
  4. Figure Detection - Identifies figures using vector clustering and image analysis
  5. Table & Text Recognition (Mistral OCR) - Leverages Mistral Document OCR to extract tables (CSV) and text blocks
  6. Caption Linking - Links captions to their corresponding figures/tables
  7. Reading Order - Determines proper reading sequence
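To make the caption linking stage (step 6) concrete, here is a minimal sketch of one plausible approach: pair each caption with the nearest figure or table region that overlaps it horizontally. The box format `(x0, y0, x1, y1)`, the function names, and the `max_gap` threshold are all assumptions for illustration; the README does not describe the engine's actual linking rule.

```python
def horizontal_overlap(a, b):
    """Width of the overlap between two boxes given as (x0, y0, x1, y1)."""
    return max(0.0, min(a[2], b[2]) - max(a[0], b[0]))


def vertical_gap(a, b):
    """Vertical distance between two boxes; 0 when they overlap vertically."""
    return max(0.0, max(a[1], b[1]) - min(a[3], b[3]))


def link_captions(captions, regions, max_gap=50.0):
    """Map each caption index to the index of the closest region, or None.

    A region qualifies only if it overlaps the caption horizontally and lies
    within max_gap units above or below it. The threshold is illustrative.
    """
    links = {}
    for ci, caption in enumerate(captions):
        best, best_gap = None, None
        for ri, region in enumerate(regions):
            if horizontal_overlap(caption, region) <= 0:
                continue
            gap = vertical_gap(caption, region)
            if gap <= max_gap and (best is None or gap < best_gap):
                best, best_gap = ri, gap
        links[ci] = best
    return links
```

Requiring horizontal overlap keeps a caption in one column from attaching to a figure in the neighboring column, which matters for the multi-column layouts this engine targets.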

Mathematical Foundations

Column Detection Algorithm:

  • Uses whitespace valley analysis to identify column boundaries
  • Applies Gaussian smoothing to detect consistent vertical gaps
  • Implements adaptive thresholding for varying document layouts
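The three bullets above can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not the engine's code: `profile[x]` is assumed to hold the amount of ink in pixel column x, and the names `find_column_gaps`, `sigma`, `ratio`, and `min_width` are hypothetical.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel covering about three sigma each side."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(profile, sigma=2.0):
    """Convolve the profile with a Gaussian, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(profile)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * profile[j]
        out.append(acc)
    return out


def find_column_gaps(profile, sigma=2.0, ratio=0.1, min_width=5):
    """Return (start, end) x-ranges of interior whitespace valleys."""
    if not profile:
        return []
    smoothed = smooth(profile, sigma)
    # Adaptive threshold: a fraction of this page's mean ink density.
    threshold = ratio * (sum(smoothed) / len(smoothed))
    gaps, start = [], None
    for x, value in enumerate(smoothed):
        if value <= threshold:
            if start is None:
                start = x
        elif start is not None:
            gaps.append((start, x - 1))
            start = None
    # A valley still open at the right edge is a page margin and is never
    # appended; the left margin (start == 0) and narrow valleys are dropped here.
    return [(s, e) for s, e in gaps if s > 0 and e - s + 1 >= min_width]
```

Smoothing before thresholding suppresses narrow gaps between words, so only gaps that persist across the page height survive as candidate column boundaries.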

Table/Text Extraction:

  • Uses Mistral Document OCR to obtain markdown-like structured output
  • Parses tables into CSV files and groups text into blocks
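As a rough illustration of the second bullet, the sketch below converts pipe-delimited markdown tables into CSV text using only the standard library. It assumes the OCR output uses ordinary markdown table syntax; `markdown_tables_to_csv` is a hypothetical name, and escaped pipes inside cells are not handled.

```python
import csv
import io


def markdown_tables_to_csv(markdown):
    """Return one CSV string per pipe-delimited table found in the text."""
    tables, rows = [], []

    def flush():
        # Emit the rows collected so far as one CSV document.
        if rows:
            buffer = io.StringIO()
            csv.writer(buffer, lineterminator="\n").writerows(rows)
            tables.append(buffer.getvalue())
            rows.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            cells = [c.strip() for c in stripped[1:-1].split("|")]
            # Skip the header separator row, e.g. |---|:---:|
            if all(c and set(c) <= set("-: ") for c in cells):
                continue
            rows.append(cells)
        else:
            flush()  # any non-table line ends the current table
    flush()
    return tables
```

Using `csv.writer` rather than joining on commas means cells that contain commas or quotes are escaped correctly.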

Figure Detection:

  • Vector clustering using DBSCAN algorithm
  • Aspect ratio analysis to distinguish figures from tables
  • Image XObject extraction for embedded graphics
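The first two bullets can be illustrated with a small pure-Python DBSCAN over 2-D points (for example, the centers of vector drawing paths), followed by a bounding box and aspect ratio per cluster. This is a sketch for intuition only: the point representation, `eps`, and `min_samples` are assumptions, and the cutoff the engine uses to separate figures from tables is not documented here, so none is shown.

```python
import math


def dbscan(points, eps, min_samples):
    """Label each 2-D point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Includes the point itself, matching the usual min_samples convention.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:
                queue.extend(nbrs)  # j is a core point, so keep expanding
    return labels


def cluster_boxes(points, labels):
    """Bounding box (x0, y0, x1, y1) for every non-noise cluster."""
    boxes = {}
    for (x, y), label in zip(points, labels):
        if label == -1:
            continue
        x0, y0, x1, y1 = boxes.get(label, (x, y, x, y))
        boxes[label] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes


def aspect_ratio(box):
    """Width divided by height, guarding against zero-height boxes."""
    width, height = box[2] - box[0], box[3] - box[1]
    return width / height if height > 0 else float("inf")
```

DBSCAN suits this task because the number of figures on a page is not known in advance and stray marks fall out naturally as noise.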

API Reference (package)

extract_document(input_pdf: str | Path, output_dir: str | Path) -> dict

Process a PDF document and extract structured content.

Parameters:

  • input_pdf: Path to the input PDF file
  • output_dir: Directory to save extracted content

Returns:

  • Dict[str, Any]: Processing results containing figures, tables, and metadata

Example:

from quanta import extract_document
result = extract_document("research_paper.pdf", "output/")
print(result["summary_path"])  # JSON summary path

🎯 Use Cases

Engineering & Technical Documents

  • Technical Drawings: Extract engineering diagrams and CAD drawings
  • Specification Sheets: Parse technical specifications and data tables
  • Engineering Reports: Process complex multi-column technical reports
  • Manufacturing Docs: Extract assembly instructions and part diagrams

Academic Research

  • Extract figures and tables from research papers
  • Analyze document structure and layout
  • Process large collections of academic PDFs

Document Digitization

  • Convert PDF documents to structured data
  • Extract content for database storage
  • Prepare documents for text analysis

Content Management

  • Automatically categorize document content
  • Extract metadata and captions
  • Generate document summaries

Data Analysis

  • Extract tabular data from reports
  • Process financial documents
  • Analyze technical specifications

🔧 Advanced Configuration

Custom Parameters

from pdf_layout_engine import process_pdf

# Custom processing parameters
config = {
    'min_figure_area': 1000,
    'table_detection_threshold': 0.7,
    'column_detection_sensitivity': 0.8
}

result = process_pdf("document.pdf", "output/", config=config)

Debug Mode

Enable debug mode to visualize the layout analysis process:

python main.py --debug

This generates overlay images showing:

  • 🟦 Blue rectangles: Column boundaries
  • 🟢 Green rectangles: Text blocks
  • 🟥 Red rectangles: Figures
  • 🟡 Yellow rectangles: Tables

Output Structure

Results are organized per page under the PDF name inside output/.

Example:

output/<pdf_name>/
├── page_01/
│   ├── figures/
│   │   └── figure_01.png
│   ├── tables/
│   │   └── table_01.csv          # tables saved as CSV only (no table PNGs)
│   ├── text/
│   │   └── text_blocks.txt       # text blocks from Mistral OCR
│   └── page_01.png               # full page image
├── page_02/
│   └── ...
├── page_XX_debug_overlay.png     # debug overlay for each processed page (at root)
└── summary.json                  # high-level summary (counts, filenames)

Key points:

  • Tables are saved as CSV files only (no table images).
  • Figures are cropped from the page using custom detection and saved as PNGs.
  • Text blocks (from Mistral OCR) are written to text/text_blocks.txt per page.
  • A full-page PNG is saved in each page_XX/ directory.
  • Debug overlays (page_XX_debug_overlay.png) are saved at the PDF root inside output/<pdf_name>/.
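Given the layout described above, downstream code can index the extracted files with the standard library alone. The sketch below is one way to do it; `collect_outputs` is a hypothetical helper, and it relies only on the directory and file naming shown in this section.

```python
from pathlib import Path


def collect_outputs(pdf_output_dir):
    """Index figure PNGs and table CSVs by page directory name."""
    root = Path(pdf_output_dir)
    index = {}
    for page_dir in sorted(root.glob("page_*")):
        if not page_dir.is_dir():
            continue  # skips page_XX_debug_overlay.png files at the root
        index[page_dir.name] = {
            "figures": sorted(p.name for p in (page_dir / "figures").glob("*.png")),
            "tables": sorted(p.name for p in (page_dir / "tables").glob("*.csv")),
        }
    return index
```

For example, `collect_outputs("output/<pdf_name>")` would return a dict such as `{"page_01": {"figures": ["figure_01.png"], "tables": ["table_01.csv"]}}` for the tree shown above.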

📊 Performance

Current Benchmarks

  • Processing Speed: ~2-5 seconds per page
  • Current Accuracy: ~80% for figures and tables
  • Memory Usage: ~200MB for typical documents
  • Supported Formats: PDF 1.4 - PDF 2.0

🚧 Active Development

We're currently fine-tuning our base models to improve accuracy. The engine is in active development with regular updates to enhance detection performance. We're working towards achieving 90%+ accuracy through:

  • Model fine-tuning on engineering document datasets
  • Improved preprocessing pipelines
  • Enhanced feature extraction algorithms
  • Community feedback integration

Optimization Tips

  • Use high-resolution rendering for better accuracy
  • Adjust parameters based on document type
  • Process pages in parallel for batch operations
  • Use debug mode to tune detection parameters
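For batch operations, one simple pattern is to fan work out with `concurrent.futures`. The sketch below parallelizes across documents rather than pages, because the documented API takes a whole PDF. `run_batch` is a hypothetical helper; the worker is passed in as a parameter, and whether `extract_document` is safe to call from several threads at once is an assumption that should be verified before relying on this.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(pdf_paths, output_dir, worker, max_workers=4):
    """Run worker(pdf_path, output_dir) for each PDF and collect outcomes.

    Returns (results, errors): two dicts keyed by PDF path. One failing
    document does not stop the rest of the batch.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, path, output_dir): path for path in pdf_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going; report the failure per file
                errors[path] = exc
    return results, errors
```

A call might look like `run_batch(["a.pdf", "b.pdf"], "output/", extract_document)`. Threads fit here when most of the time is spent waiting on the OCR API; for CPU-bound rendering, a process pool may be the better choice.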

πŸ–ΌοΈ Examples

Debug Overlay Analysis

Debug overlay showing detected layout elements: columns (blue), text blocks (green), figures (red), and tables (yellow)

Engineering Document Processing

Complex engineering document with multi-column layout and technical drawings

Extracted Figure

Automatically extracted figure from PDF document

Extracted Table

Automatically extracted table with preserved formatting

Multi-Page Analysis

Consistent layout analysis across multiple pages of technical documents

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=src

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Built with PyMuPDF for PDF processing
  • Uses OpenCV for computer vision operations
  • Inspired by research in document layout analysis

Made with ❤️ for the open source community
