Python API

Python bindings via PyO3. Requires Rust toolchain for building from source.

Install

pip install maturin
maturin develop --release

Usage

import pdf_inspector

# Full processing: detect + extract + convert to Markdown
result = pdf_inspector.process_pdf("document.pdf")
print(result.pdf_type)      # "text_based", "scanned", "image_based", "mixed"
print(result.confidence)     # 0.0 - 1.0
print(result.page_count)     # number of pages
print(result.markdown)       # Markdown string or None

# Process specific pages only
result = pdf_inspector.process_pdf("document.pdf", pages=[1, 3, 5])

# Process from bytes (no filesystem needed)
with open("document.pdf", "rb") as f:
    result = pdf_inspector.process_pdf_bytes(f.read())

# Fast detection only (no text extraction)
result = pdf_inspector.detect_pdf("document.pdf")
if result.pdf_type == "text_based":
    print("Can extract locally!")
else:
    print(f"Pages needing OCR: {result.pages_needing_ocr}")

# Plain text extraction
text = pdf_inspector.extract_text("document.pdf")

# Positioned text items with font info
items = pdf_inspector.extract_text_with_positions("document.pdf")
for item in items[:5]:
    print(f"'{item.text}' at ({item.x:.0f}, {item.y:.0f}) size={item.font_size}")

API reference

Function	Description
`process_pdf(path, pages=None)`	Full processing (detect + extract + markdown)
`process_pdf_bytes(data, pages=None)`	Full processing from bytes
`detect_pdf(path)`	Fast detection only (returns PdfResult)
`detect_pdf_bytes(data)`	Fast detection from bytes
`classify_pdf(path)`	Lightweight classification (returns PdfClassification)
`classify_pdf_bytes(data)`	Lightweight classification from bytes
`extract_text(path)`	Plain text extraction
`extract_text_bytes(data)`	Plain text extraction from bytes
`extract_text_with_positions(path, pages=None)`	Text with X/Y coords and font info
`extract_text_with_positions_bytes(data, pages=None)`	Text with positions from bytes
`extract_text_in_regions(path, page_regions)`	Extract text in bounding-box regions
`extract_text_in_regions_bytes(data, page_regions)`	Region extraction from bytes

Types

PdfResult fields: pdf_type, markdown, page_count, processing_time_ms, pages_needing_ocr, title, confidence, is_complex_layout, pages_with_tables, pages_with_columns, has_encoding_issues

PdfClassification fields: pdf_type, page_count, pages_needing_ocr (0-indexed), confidence

TextItem fields: text, x, y, width, height, font, font_size, page, is_bold, is_italic, item_type

RegionText fields: text, needs_ocr

PageRegionTexts fields: page (0-indexed), regions (list of RegionText)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Python API

Install

Usage

API reference

Types

FilesExpand file tree

python.md

Latest commit

History

python.md

File metadata and controls

Python API

Install

Usage

API reference

Types