Quanta

Advanced PDF Layout Analysis Engine

Quanta is a PDF layout analysis engine that automatically extracts figures, tables, and structured content from PDF documents using computer vision and machine learning techniques.

🎯 Problem Solved: Complex engineering documents often lose critical visual information (figures, diagrams, technical drawings) when parsed by traditional PDF tools. This engine targets the accurate detection and extraction of visual elements from technical and engineering PDFs with intricate layouts, multi-column designs, and embedded graphics.

Debug overlay showing detected layout elements: columns (blue), text blocks (green), figures (red), and tables (yellow)

✨ Features

🔍 Multi-column Layout Detection - Automatically identifies and processes complex multi-column layouts
📊 Intelligent Table Recognition (Mistral OCR) - Extracts tables and text with high accuracy via Mistral Document OCR
🖼️ Figure Extraction (Custom) - Identifies and extracts figures, diagrams, and images using custom algorithms
📝 Text Block Analysis (Mistral + Heuristics) - Uses Mistral OCR output and in-house grouping for reading order
🏷️ Caption Linking - Automatically links captions to their corresponding figures and tables
🎯 Accuracy - Roughly 80% for figures and tables in current benchmarks (see Performance)
⚡ Fast Processing - About 2-5 seconds per page in current benchmarks
🛠️ Easy Integration - Simple API for integration into existing workflows
🔧 Debug Mode - Visualize layout analysis with overlay images

🚀 Quick Start

Install via PyPI

pip install quanta-pdf

Basic Usage (Python)

from quanta import extract_document

result = extract_document("document.pdf", "output/")
print(f"Pages: {len(result['pages'])}")

Command Line Interface

quanta --input document.pdf --output output/

To extract tables and text with Mistral OCR, set MISTRAL_API_KEY first (see Environment configuration below).

Environment configuration (.env)

To enable Mistral OCR for tables and text blocks, set your API key. You can either export it or place it in a .env file at your project root.

# Option A: environment variable
export MISTRAL_API_KEY="your-mistral-api-key"

# Option B: .env file (same directory where you run the code)
echo "MISTRAL_API_KEY=your-mistral-api-key" > .env

The library loads .env automatically; the CLI also picks it up when run from that directory.
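
Because Mistral OCR is optional, it can be useful to check for the key before starting a long run. The following is a minimal standard-library sketch, not part of Quanta: `read_dotenv` and `mistral_key_available` are hypothetical helper names, and the parser only understands simple KEY=value lines.

```python
import os
from pathlib import Path


def read_dotenv(path=".env"):
    """Parse simple KEY=value lines from a .env file (illustrative, not a full parser)."""
    values = {}
    env_file = Path(path)
    if not env_file.is_file():
        return values
    for line in env_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def mistral_key_available(dotenv_path=".env", environ=None):
    """Return True if MISTRAL_API_KEY is set in the environment or in the .env file."""
    environ = os.environ if environ is None else environ
    if environ.get("MISTRAL_API_KEY"):
        return True
    return bool(read_dotenv(dotenv_path).get("MISTRAL_API_KEY"))
```

Calling `mistral_key_available()` from the directory where you run Quanta tells you whether the Mistral-backed table and text extraction has a key to work with.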

📖 Documentation

Core Concepts

Layout Analysis Pipeline

The engine runs a seven-stage pipeline:

PDF Rendering - Converts PDF pages to high-resolution images
Column Detection - Identifies multi-column layouts using whitespace analysis
Text Extraction - Extracts and groups text blocks
Figure Detection - Identifies figures using vector clustering and image analysis
Table & Text Recognition (Mistral OCR) - Leverages Mistral Document OCR to extract tables (CSV) and text blocks
Caption Linking - Links captions to their corresponding figures/tables
Reading Order - Determines proper reading sequence
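
To make the Caption Linking stage more concrete, here is a rough sketch of one plausible heuristic: attach each caption to the nearest figure or table that it overlaps horizontally. This is an illustration under assumptions, not Quanta's actual implementation. The (x0, y0, x1, y1) box format, the function names, and the `max_gap` cutoff are all hypothetical.

```python
def horizontal_overlap(a, b):
    """Width of the horizontal overlap between two (x0, y0, x1, y1) boxes."""
    return max(0.0, min(a[2], b[2]) - max(a[0], b[0]))


def vertical_gap(a, b):
    """Vertical distance between two boxes; 0 if they overlap vertically."""
    return max(0.0, max(a[1], b[1]) - min(a[3], b[3]))


def link_captions(captions, elements, max_gap=40.0):
    """Map each caption index to the index of the closest element, or None."""
    links = {}
    for ci, caption in enumerate(captions):
        best, best_gap = None, max_gap
        for ei, element in enumerate(elements):
            # A caption must share some horizontal extent with its element
            if horizontal_overlap(caption, element) <= 0:
                continue
            gap = vertical_gap(caption, element)
            if gap <= best_gap:
                best, best_gap = ei, gap
        links[ci] = best
    return links
```

Captions farther than `max_gap` from every candidate stay unlinked (`None`), which avoids attaching stray "Figure N" mentions in body text to a distant figure.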

Mathematical Foundations

Column Detection Algorithm:

Uses whitespace valley analysis to identify column boundaries
Applies Gaussian smoothing to detect consistent vertical gaps
Implements adaptive thresholding for varying document layouts
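
The three steps above can be sketched in plain Python. This is a simplified illustration, not the engine's code: it assumes the page has already been binarized into rows of 0/1 ink values, and the names `find_column_gutters`, `sigma`, `rel_threshold`, and `min_width` are hypothetical.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian weights covering about three standard deviations."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(profile, sigma):
    """Convolve the profile with a Gaussian, clamping indices at the page edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(profile)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * profile[j]
        out.append(acc)
    return out


def find_column_gutters(ink_rows, sigma=2.0, rel_threshold=0.2, min_width=5):
    """Return (start, end) x-ranges of interior whitespace valleys."""
    width = len(ink_rows[0])
    # Vertical projection: how much ink each pixel column contains
    profile = [sum(row[x] for row in ink_rows) for x in range(width)]
    smoothed = smooth(profile, sigma)
    # Adaptive threshold: a fraction of this page's mean ink density
    threshold = rel_threshold * (sum(smoothed) / width)
    gutters, start = [], None
    for x, value in enumerate(smoothed + [float("inf")]):  # sentinel closes a trailing run
        if value <= threshold:
            if start is None:
                start = x
        elif start is not None:
            # Keep wide valleys only, and drop the left and right page margins
            if x - start >= min_width and start > 0 and x < width:
                gutters.append((start, x - 1))
            start = None
    return gutters
```

Smoothing before thresholding keeps isolated gaps between words from registering as column boundaries, and tying the threshold to the page's mean ink density lets the same settings work on both dense and sparse pages.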

Table/Text Extraction:

Uses Mistral Document OCR to obtain markdown-like structured output
Parses tables into CSV files and groups text into blocks
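
As an illustration of the table step, the sketch below turns markdown pipe tables into CSV text with the standard library. It assumes the OCR output uses conventional `| a | b |` rows with a `|---|---|` separator, which is an assumption about the "markdown-like" format. `markdown_tables_to_csv` is a hypothetical helper, not a Quanta function.

```python
import csv
import io


def split_cells(line):
    """Split a '| a | b |' row into stripped cell strings."""
    return [cell.strip() for cell in line.strip()[1:-1].split("|")]


def is_separator(line):
    """True for a header separator row such as |---|:---:|."""
    cells = split_cells(line)
    return all(cell and "-" in cell and set(cell) <= set("-:") for cell in cells)


def markdown_tables_to_csv(markdown):
    """Return one CSV string per pipe table found in the markdown text."""
    tables, rows = [], []
    for line in markdown.splitlines() + [""]:  # trailing blank flushes the last table
        stripped = line.strip()
        if len(stripped) >= 2 and stripped.startswith("|") and stripped.endswith("|"):
            if not is_separator(stripped):
                rows.append(split_cells(stripped))
        elif rows:
            buffer = io.StringIO()
            # csv.writer quotes cells that contain commas or quotes
            csv.writer(buffer, lineterminator="\n").writerows(rows)
            tables.append(buffer.getvalue())
            rows = []
    return tables
```

Using `csv.writer` rather than joining cells with commas matters for engineering tables, where cells such as "Bolt, M6" contain commas themselves.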

Figure Detection:

Vector clustering using DBSCAN algorithm
Aspect ratio analysis to distinguish figures from tables
Image XObject extraction for embedded graphics
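
The clustering idea can be sketched as follows: group the bounding boxes of vector drawing primitives by proximity, merge each cluster into one candidate region, then inspect its aspect ratio. To stay self-contained this uses a small from-scratch DBSCAN (a library implementation such as scikit-learn's would be the usual choice). The box format, `eps`, `min_samples`, and the helper names are assumptions, not Quanta's actual parameters.

```python
import math


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_samples:
                queue.extend(more)  # j is a core point, keep expanding
    return labels


def cluster_boxes(boxes, eps=30.0, min_samples=3):
    """Group (x0, y0, x1, y1) boxes by center proximity; one union box per cluster."""
    centers = [((x0 + x1) / 2, (y0 + y1) / 2) for x0, y0, x1, y1 in boxes]
    labels = dbscan(centers, eps, min_samples)
    merged = {}
    for box, label in zip(boxes, labels):
        if label == -1:
            continue
        if label not in merged:
            merged[label] = list(box)
        else:
            m = merged[label]
            m[0], m[1] = min(m[0], box[0]), min(m[1], box[1])
            m[2], m[3] = max(m[2], box[2]), max(m[3], box[3])
    return [tuple(m) for _, m in sorted(merged.items())]


def aspect_ratio(box):
    """Width divided by height of a box; infinity for a zero-height box."""
    x0, y0, x1, y1 = box
    height = y1 - y0
    return (x1 - x0) / height if height else float("inf")
```

DBSCAN suits this task because the number of figures on a page is not known in advance, and isolated strokes such as rules or underlines fall out as noise instead of forming spurious regions.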

API Reference (package)

`extract_document(input_pdf: str | Path, output_dir: str | Path) -> dict`

Process a PDF document and extract structured content.

Parameters:

input_pdf: Path to the input PDF file
output_dir: Directory to save extracted content

Returns:

dict: Processing results containing figures, tables, and metadata, including a "pages" entry and a "summary_path" entry that points to the JSON summary

Example:

from quanta import extract_document
result = extract_document("research_paper.pdf", "output/")
print(result["summary_path"])  # JSON summary path
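
Since `result["summary_path"]` points to a JSON file, the summary can be loaded with the standard library. A small sketch follows. `load_summary` is a hypothetical helper, and the summary's exact keys are not documented here, so the code only assumes that the file is valid JSON.

```python
import json
from pathlib import Path


def load_summary(result):
    """Read the JSON summary referenced by result["summary_path"]."""
    summary_path = Path(result["summary_path"])
    with summary_path.open(encoding="utf-8") as handle:
        return json.load(handle)


# Typical use after extraction (requires quanta-pdf to be installed):
# from quanta import extract_document
# result = extract_document("research_paper.pdf", "output/")
# summary = load_summary(result)
# print(sorted(summary))  # inspect the top-level keys
```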

🎯 Use Cases

Engineering & Technical Documents

Technical Drawings: Extract engineering diagrams and CAD drawings
Specification Sheets: Parse technical specifications and data tables
Engineering Reports: Process complex multi-column technical reports
Manufacturing Docs: Extract assembly instructions and part diagrams

Academic Research

Extract figures and tables from research papers
Analyze document structure and layout
Process large collections of academic PDFs

Document Digitization

Convert PDF documents to structured data
Extract content for database storage
Prepare documents for text analysis

Content Management

Automatically categorize document content
Extract metadata and captions
Generate document summaries

Data Analysis

Extract tabular data from reports
Process financial documents
Analyze technical specifications

🔧 Advanced Configuration

Custom Parameters

from pdf_layout_engine import process_pdf

# Custom processing parameters
config = {
    'min_figure_area': 1000,
    'table_detection_threshold': 0.7,
    'column_detection_sensitivity': 0.8
}

result = process_pdf("document.pdf", "output/", config=config)

Debug Mode

Enable debug mode to visualize the layout analysis process:

python main.py --debug

This generates overlay images showing:

🟦 Blue rectangles: Column boundaries
🟢 Green rectangles: Text blocks
🟥 Red rectangles: Figures
🟡 Yellow rectangles: Tables

Output Structure

Results are organized per page under the PDF name inside output/.

Example:

output/<pdf_name>/
├── page_01/
│   ├── figures/
│   │   └── figure_01.png
│   ├── tables/
│   │   └── table_01.csv          # tables saved as CSV only (no table PNGs)
│   ├── text/
│   │   └── text_blocks.txt       # text blocks from Mistral OCR
│   └── page_01.png               # full page image
├── page_02/
│   └── ...
├── page_XX_debug_overlay.png     # debug overlay for each processed page (at root)
└── summary.json                  # high-level summary (counts, filenames)

Key points:

Tables are saved as CSV files only (no table images).
Figures are cropped from the page using custom detection and saved as PNGs.
Text blocks (from Mistral OCR) are written to text/text_blocks.txt per page.
A full-page PNG is saved in each page_XX/ directory.
Debug overlays (page_XX_debug_overlay.png) are saved at the PDF root inside output/<pdf_name>/.
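
Given the layout above, downstream code can gather the per-page outputs with `pathlib`. The sketch below relies only on the directory and file names shown in the tree. `collect_outputs` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def collect_outputs(pdf_output_dir):
    """Map each page_XX directory name to its figure, table, and text files."""
    root = Path(pdf_output_dir)
    pages = {}
    # The is_dir() check skips the page_XX_debug_overlay.png files at the root
    for page_dir in sorted(p for p in root.glob("page_*") if p.is_dir()):
        pages[page_dir.name] = {
            "figures": sorted(page_dir.glob("figures/*.png")),
            "tables": sorted(page_dir.glob("tables/*.csv")),
            "text": page_dir / "text" / "text_blocks.txt",
        }
    return pages
```

For example, `collect_outputs("output/research_paper")["page_01"]["tables"]` would list the CSV files extracted from the first page, assuming that PDF was processed into `output/`.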

📊 Performance

Current Benchmarks

Processing Speed: ~2-5 seconds per page
Current Accuracy: ~80% for figures and tables
Memory Usage: ~200MB for typical documents
Supported Formats: PDF 1.4 - PDF 2.0

🚧 Active Development

We're currently fine-tuning our base models to improve accuracy. The engine is in active development with regular updates to enhance detection performance. We're working towards achieving 90%+ accuracy through:

Model fine-tuning on engineering document datasets
Improved preprocessing pipelines
Enhanced feature extraction algorithms
Community feedback integration

Optimization Tips

Use high-resolution rendering for better accuracy
Adjust parameters based on document type
Process pages in parallel for batch operations
Use debug mode to tune detection parameters
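
For batch operations, one option is to run several documents concurrently. The documented API works per document, so this sketch parallelizes across PDFs and not across pages. `process_batch` is a hypothetical helper that takes the processing function as an argument (in practice, `extract_document`). Threads are an assumption that fits network-bound OCR calls; CPU-heavy rendering may benefit more from processes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_batch(pdf_paths, output_dir, process, max_workers=4):
    """Run process(pdf, output_dir) for each PDF concurrently.

    Returns (results, errors), both keyed by the PDF path as a string.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(process, str(path), str(output_dir)): str(path)
            for path in pdf_paths
        }
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going if one document fails
                errors[path] = exc
    return results, errors


# Typical use (requires quanta-pdf to be installed):
# from quanta import extract_document
# results, errors = process_batch(["a.pdf", "b.pdf"], "output/", extract_document)
```

Collecting errors separately means one malformed PDF does not abort the rest of the batch.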

🖼️ Examples

Debug Overlay Analysis

Debug overlay showing detected layout elements: columns (blue), text blocks (green), figures (red), and tables (yellow)

Engineering Document Processing

Complex engineering document with multi-column layout and technical drawings

Extracted Figure

Automatically extracted figure from PDF document

Extracted Table

Automatically extracted table with preserved formatting

Multi-Page Analysis

Consistent layout analysis across multiple pages of technical documents

👥 Contributors

Developers & Maintainers:

@soovittt - Core Developer
@Manushpm8 - Core Developer
@Magnet-AI - Organization

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=src

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with PyMuPDF for PDF processing
Uses OpenCV for computer vision operations
Inspired by research in document layout analysis

📞 Support

📧 Email: [email protected]
🐛 Issues: GitHub Issues
💬 Discussions: GitHub Discussions

Made with ❤️ for the open source community
