A PDF layout analysis engine that automatically extracts figures, tables, and structured content from PDF documents using computer vision and machine learning.
Problem Solved: Complex engineering documents often lose critical visual information (figures, diagrams, technical drawings) when being parsed by traditional PDF tools. This engine specifically addresses the challenge of accurately detecting and extracting visual elements from technical and engineering PDFs that contain intricate layouts, multi-column designs, and embedded graphics.
Debug overlay showing detected layout elements: columns (blue), text blocks (green), figures (red), and tables (yellow)
- Multi-column Layout Detection - Automatically identifies and processes complex multi-column layouts
- Intelligent Table Recognition (Mistral OCR) - Extracts tables and text via Mistral Document OCR
- Figure Extraction (Custom) - Identifies and extracts figures, diagrams, and images using custom algorithms
- Text Block Analysis (Mistral + Heuristics) - Uses Mistral OCR output and in-house grouping for reading order
- Caption Linking - Automatically links captions to their corresponding figures and tables
- Accuracy - Currently about 80% for figures and tables
- Fast Processing - About 2-5 seconds per page
- Easy Integration - Simple API for integration into existing workflows
- Debug Mode - Visualize layout analysis with overlay images
pip install quanta-pdf

from quanta import extract_document
result = extract_document("document.pdf", "output/")
print(f"Pages: {len(result['pages'])}")

quanta --input document.pdf --output output/

If you want Mistral OCR tables/text, set MISTRAL_API_KEY first (see below).
To enable Mistral OCR for tables and text blocks, set your API key. You can either export it or place it in a .env file at your project root.
# Option A: environment variable
export MISTRAL_API_KEY="your-mistral-api-key"
# Option B: .env file (same directory where you run the code)
echo "MISTRAL_API_KEY=your-mistral-api-key" > .env

The library loads .env automatically; the CLI also picks it up when run from that directory.
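As a rough illustration of what resolving the key involves, the sketch below reads KEY=value pairs from a .env file and falls back to it when the environment variable is not set. The helper names `read_env_file` and `get_mistral_key` are hypothetical and not part of the library, which handles this on its own; the precedence shown (an exported variable wins over .env) is also an assumption.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=value lines; blank lines and # comments are skipped."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_mistral_key(env_path=".env"):
    """Return the key from the environment, else from .env, else None."""
    # Assumption: an exported variable takes precedence over the .env file.
    from_env = os.environ.get("MISTRAL_API_KEY")
    return from_env or read_env_file(env_path).get("MISTRAL_API_KEY")
```

A check like `if get_mistral_key() is None` before starting a long batch can surface a missing key early.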
The engine runs a multi-stage pipeline, in this order:
- PDF Rendering - Converts PDF pages to high-resolution images
- Column Detection - Identifies multi-column layouts using whitespace analysis
- Text Extraction - Extracts and groups text blocks
- Figure Detection - Identifies figures using vector clustering and image analysis
- Table & Text Recognition (Mistral OCR) - Leverages Mistral Document OCR to extract tables (CSV) and text blocks
- Caption Linking - Links captions to their corresponding figures/tables
- Reading Order - Determines proper reading sequence
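The final stage can be pictured with a small sketch. One common approach on multi-column pages, shown here under assumptions, assigns each text block to the column that contains its horizontal center and then reads columns left to right, top to bottom. The `Block` type and `reading_order` function are illustrative names; the engine's own heuristics may differ.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def reading_order(blocks, column_bounds):
    """Sort blocks column by column (left to right), then top to bottom.

    column_bounds is a list of (left, right) x-ranges, one per column.
    """
    def column_index(block):
        center = (block.x0 + block.x1) / 2
        for i, (left, right) in enumerate(column_bounds):
            if left <= center <= right:
                return i
        # Blocks outside every column are placed after the last column.
        return len(column_bounds)

    return sorted(blocks, key=lambda b: (column_index(b), b.y0, b.x0))
```

Full-width elements such as titles that span several columns would need separate handling; this sketch does not attempt it.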
Column Detection Algorithm:
- Uses whitespace valley analysis to identify column boundaries
- Applies Gaussian smoothing to detect consistent vertical gaps
- Implements adaptive thresholding for varying document layouts
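The three steps above can be sketched in plain Python. This is a minimal illustration, not the engine's implementation: it assumes a binarized page given as rows of 0/1 values, and the parameter names (`sigma`, `rel_threshold`, `min_width`) and their defaults are made up for the example.

```python
import math


def gaussian_kernel(sigma):
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(profile, sigma):
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(profile)
    out = []
    for x in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            # Clamp indices at the page edges (edge replication).
            idx = min(max(x + k - radius, 0), n - 1)
            acc += w * profile[idx]
        out.append(acc)
    return out


def find_column_gaps(ink, sigma=2.0, rel_threshold=0.2, min_width=5):
    """Return (start, end) x-ranges of interior whitespace valleys.

    ink is a list of rows, each a list of 0/1 values (1 = dark pixel).
    """
    width = len(ink[0])
    # Vertical projection profile: amount of ink in each pixel column.
    profile = [sum(row[x] for row in ink) for x in range(width)]
    smoothed = smooth(profile, sigma)
    # Adaptive threshold: a fraction of this page's mean ink density.
    threshold = rel_threshold * (sum(smoothed) / width)
    gaps, start = [], None
    # The trailing infinity closes any run still open at the right edge.
    for x, value in enumerate(smoothed + [float("inf")]):
        if value < threshold and start is None:
            start = x
        elif value >= threshold and start is not None:
            # Skip page margins: keep only valleys with ink on both sides.
            if start > 0 and x < width and x - start >= min_width:
                gaps.append((start, x))
            start = None
    return gaps
```

Each returned range marks a candidate gutter between columns; the column boundaries are the spans between consecutive gaps.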
Table/Text Extraction:
- Uses Mistral Document OCR to obtain markdown-like structured output
- Parses tables into CSV files and groups text into blocks
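To make the table step concrete, here is a hedged sketch of turning markdown pipe tables into CSV text with the standard library. It assumes the OCR output uses conventional `| a | b |` rows with a `| --- |` separator line; the function names are hypothetical, and escaped pipes inside cells are not handled.

```python
import csv
import io
import re


def markdown_tables(markdown):
    """Yield each pipe table in the text as a list of rows (lists of cells)."""
    table = []
    # The extra empty line flushes a table that ends at the end of the text.
    for line in markdown.splitlines() + [""]:
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            cells = [c.strip() for c in stripped[1:-1].split("|")]
            # Skip the header separator row, e.g. | --- | :---: |
            if all(re.fullmatch(r":?-+:?", c) for c in cells):
                continue
            table.append(cells)
        elif table:
            yield table
            table = []


def table_to_csv(rows):
    """Serialize rows to CSV text, quoting cells only where needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Everything outside the tables is left for the text-block grouping step.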
Figure Detection:
- Vector clustering using DBSCAN algorithm
- Aspect ratio analysis to distinguish figures from tables
- Image XObject extraction for embedded graphics
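The clustering idea can be illustrated with a small, self-contained DBSCAN over the centers of vector drawing primitives. This is a sketch under assumptions: the `eps` and `min_samples` values are placeholders, primitives are represented as `(x0, y0, x1, y1)` boxes, and the source does not state what aspect-ratio cutoff separates figures from tables, so only the ratio itself is computed.

```python
import math


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # A point counts as its own neighbor, as in the usual convention.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


def cluster_boxes(boxes, eps=30.0, min_samples=3):
    """Group primitives by center distance; return one bounding box per cluster."""
    centers = [((x0 + x1) / 2, (y0 + y1) / 2) for x0, y0, x1, y1 in boxes]
    labels = dbscan(centers, eps, min_samples)
    merged = {}
    for (x0, y0, x1, y1), label in zip(boxes, labels):
        if label == -1:
            continue  # isolated strokes are treated as noise
        mx0, my0, mx1, my1 = merged.get(label, (x0, y0, x1, y1))
        merged[label] = (min(mx0, x0), min(my0, y0), max(mx1, x1), max(my1, y1))
    return [merged[k] for k in sorted(merged)]


def aspect_ratio(box):
    """Width divided by height of an (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    return (x1 - x0) / (y1 - y0)
```

Each merged box is a figure candidate; any aspect-ratio rule applied to it would need tuning per document type.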
Process a PDF document and extract structured content.
Parameters:
input_pdf: Path to the input PDF file
output_dir: Directory to save extracted content
Returns:
Dict[str, Any]: Processing results containing figures, tables, and metadata
Example:
from quanta import extract_document
result = extract_document("research_paper.pdf", "output/")
print(result["summary_path"])  # JSON summary path

- Technical Drawings: Extract engineering diagrams and CAD drawings
- Specification Sheets: Parse technical specifications and data tables
- Engineering Reports: Process complex multi-column technical reports
- Manufacturing Docs: Extract assembly instructions and part diagrams
- Extract figures and tables from research papers
- Analyze document structure and layout
- Process large collections of academic PDFs
- Convert PDF documents to structured data
- Extract content for database storage
- Prepare documents for text analysis
- Automatically categorize document content
- Extract metadata and captions
- Generate document summaries
- Extract tabular data from reports
- Process financial documents
- Analyze technical specifications
from pdf_layout_engine import process_pdf
# Custom processing parameters
config = {
    'min_figure_area': 1000,
    'table_detection_threshold': 0.7,
    'column_detection_sensitivity': 0.8
}
result = process_pdf("document.pdf", "output/", config=config)

Enable debug mode to visualize the layout analysis process:
python main.py --debug

This generates overlay images showing:
- Blue rectangles: Column boundaries
- Green rectangles: Text blocks
- Red rectangles: Figures
- Yellow rectangles: Tables
Results are organized per page under the PDF name inside output/.
Example:
output/<pdf_name>/
├── page_01/
│   ├── figures/
│   │   └── figure_01.png
│   ├── tables/
│   │   └── table_01.csv          # tables saved as CSV only (no table PNGs)
│   ├── text/
│   │   └── text_blocks.txt       # text blocks from Mistral OCR
│   └── page_01.png               # full page image
├── page_02/
│   └── ...
├── page_XX_debug_overlay.png     # debug overlay for each processed page (at root)
└── summary.json                  # high-level summary (counts, filenames)
Key points:
- Tables are saved as CSV files only (no table images).
- Figures are cropped from the page using custom detection and saved as PNGs.
- Text blocks (from Mistral OCR) are written to text/text_blocks.txt per page.
- A full-page PNG is saved in each page_XX/ directory.
- Debug overlays (page_XX_debug_overlay.png) are saved at the PDF root inside output/<pdf_name>/.
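Given that layout, downstream code can gather the per-page artifacts with `pathlib`. The sketch below is an assumption-laden helper, not a library function: `collect_outputs` is a hypothetical name, and it relies only on the directory and file names shown above.

```python
from pathlib import Path


def collect_outputs(pdf_output_dir):
    """Map each page_XX directory name to its figure, table, and text files."""
    root = Path(pdf_output_dir)
    pages = {}
    # The is_dir() check skips page_XX_debug_overlay.png files at the root,
    # which also match the page_* pattern.
    for page_dir in sorted(p for p in root.glob("page_*") if p.is_dir()):
        text_file = page_dir / "text" / "text_blocks.txt"
        pages[page_dir.name] = {
            "figures": sorted((page_dir / "figures").glob("*.png")),
            "tables": sorted((page_dir / "tables").glob("*.csv")),
            "text": text_file if text_file.exists() else None,
        }
    return pages
```

For example, `collect_outputs("output/document")["page_01"]["tables"]` would list the CSV files extracted from the first page, assuming the PDF was named document.pdf.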
- Processing Speed: ~2-5 seconds per page
- Current Accuracy: ~80% for figures and tables
- Memory Usage: ~200MB for typical documents
- Supported Formats: PDF 1.4 - PDF 2.0
We're currently fine-tuning our base models to improve accuracy. The engine is in active development with regular updates to enhance detection performance. We're working towards achieving 90%+ accuracy through:
- Model fine-tuning on engineering document datasets
- Improved preprocessing pipelines
- Enhanced feature extraction algorithms
- Community feedback integration
- Use high-resolution rendering for better accuracy
- Adjust parameters based on document type
- Process pages in parallel for batch operations
- Use debug mode to tune detection parameters
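For the batch tip, the library's own parallelism options are not described here, so the following sketch parallelizes at the document level instead: it runs a caller-supplied worker once per PDF in a thread pool and collects failures separately. `process_batch` and its parameters are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_batch(pdf_paths, worker, max_workers=4):
    """Run worker(pdf_path) for every path; return (results, errors) dicts."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, path): path for path in pdf_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going if one document fails
                errors[path] = exc
    return results, errors
```

A call such as `process_batch(paths, lambda p: extract_document(p, "output/"))` would be one way to use it. Threads suit work that mostly waits on the OCR service; CPU-heavy rendering may benefit more from a process pool.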
Debug overlay showing detected layout elements: columns (blue), text blocks (green), figures (red), and tables (yellow)
Developers & Maintainers:
- @soovittt - Core Developer
- @Manushpm8 - Core Developer
- @Magnet-AI - Organization
We welcome contributions! Please see our Contributing Guide for details.
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest
# Run with coverage
pytest --cov=src

This project is licensed under the MIT License - see the LICENSE file for details.
- Built with PyMuPDF for PDF processing
- Uses OpenCV for computer vision operations
- Inspired by research in document layout analysis
- Email: [email protected]
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Made with ❤️ for the open source community




