QuickDoc - Open Source Document & AI Processor

An intelligent, configurable microservice for document text extraction, AI-powered summarization, text embedding, and token counting. Built with FastAPI and designed for scalability, QuickDoc lets you enable only the features you need to optimize resource usage.

License: MIT | Python 3.10+ | FastAPI | Docker


Features

Document Processing

  • Multi-format Support: PDF, DOCX, ODT, RTF, Markdown, EPUB, and images (JPG, PNG, BMP, TIFF)
  • Page-by-Page Extraction: Extract text from PDFs page by page or as a complete document
  • Chapter-by-Chapter Extraction: Extract text from EPUBs chapter by chapter or as a complete book
  • Intelligent OCR: Automatic text extraction from scanned PDFs and images using PaddleOCR
  • Configurable Processing: Enable/disable specific document types to save resources

AI Utility Services

  • Text Summarization: Advanced summarization with configurable quality levels using transformer models
  • Document Embeddings: Convert PDFs/EPUBs to page-by-page embeddings with intelligent chunking
  • Text Embeddings: Generate semantic embeddings for texts and documents
  • Token Counting: Accurate token counting for Llama 3, Mistral, and Gemini models
  • Async Processing: Non-blocking AI operations with queue management

Configuration & Resource Management

  • Modular Features: Enable only the services you need
  • Resource Optimization: Conditional model loading based on configuration
  • Configurable Models: Choose your preferred AI models via environment variables
  • Production Ready: Docker support with health checks and proper logging

Quick Start

Option 1: Docker (Recommended)

  1. Clone the repository

    git clone https://github.com/digitaldrreamer/quickdoc.git
    cd quickdoc
  2. Configure your deployment

    cp env.example .env
    # Edit .env to enable/disable features as needed
  3. Start with Docker Compose

    docker-compose up --build
  4. Test the service

    # Test document extraction
    echo "Hello, QuickDoc!" > test.md
    curl -F "file=@test.md" http://localhost:8005/extract
    
    # Check service status
    curl http://localhost:8005/health

The service will be available at http://localhost:8005, with interactive documentation at http://localhost:8005/docs. You can change the port in .env.

Option 2: Manual Installation

Prerequisites

System Dependencies:

# Ubuntu/Debian
sudo apt-get update && sudo apt-get install -y \
    pandoc poppler-utils libmagic1 tesseract-ocr \
    libgl1-mesa-glx libglib2.0-0 libsm6 libxext6 libgomp1

# macOS
brew install pandoc poppler libmagic tesseract

Python Setup:

# Clone and setup
git clone https://github.com/digitaldrreamer/quickdoc.git
cd quickdoc

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Configuration

# Copy and configure environment
cp env.example .env
# Edit .env file with your preferences

Run the Service

# Development
python -m uvicorn app.main:app --host 0.0.0.0 --port 8005 --reload

# Production
gunicorn app.main:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8005

Configuration

QuickDoc is highly configurable through environment variables. Copy env.example to .env and customize:

Core Features

# Enable/disable major components
ENABLE_SUMMARIZATION=true          # AI text summarization
ENABLE_EMBEDDING_MODEL=true        # Text embedding generation  
ENABLE_TOKEN_COUNTING=true         # Token counting for various models
ENABLE_DOCUMENT_PROCESSING=true    # Document text extraction

Document Processing

# Fine-grained document type control
ENABLE_PDF_PROCESSING=true         # PDF text extraction & OCR
ENABLE_DOCX_PROCESSING=true        # Word/ODT/RTF processing
ENABLE_IMAGE_OCR=true              # Image text extraction
ENABLE_MARKDOWN_PROCESSING=true    # Markdown processing

AI Models

# Specify which models to use
SUMMARIZATION_MODEL=google/flan-t5-small    # Hugging Face model for summarization
EMBEDDING_MODEL=all-MiniLM-L6-v2           # Sentence transformer model

Resource Optimization Examples

Minimal Deployment (Text extraction only):

ENABLE_SUMMARIZATION=false
ENABLE_EMBEDDING_MODEL=false
ENABLE_TOKEN_COUNTING=false
ENABLE_IMAGE_OCR=false

PDF-only Service:

ENABLE_DOCX_PROCESSING=false
ENABLE_IMAGE_OCR=false
ENABLE_MARKDOWN_PROCESSING=false

AI-only Service (no document processing):

ENABLE_DOCUMENT_PROCESSING=false

API Documentation

Document Conversion Endpoints

POST /extract

Extract text from documents.

curl -X POST -F "file=@document.pdf" http://localhost:8005/extract

Response:

{
  "text": "Extracted text content...",
  "filename": "document.pdf",
  "file_type": ".pdf",
  "character_count": 1250,
  "metrics": {
    "processing_duration_ms": 150.2,
    "memory_usage_mb": 75.3,
    "processing_method": "pdfminer"
  }
}
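
On the client side, the response can be loaded into a small typed structure. The sketch below assumes only the fields shown in the sample response; `ExtractResult` and `parse_extract_response` are hypothetical names, and the fallbacks for a missing `metrics` object are a defensive assumption.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractResult:
    """Typed view of the fields shown in the sample /extract response."""

    text: str
    filename: str
    file_type: str
    character_count: int
    processing_method: str
    processing_duration_ms: float


def parse_extract_response(payload: str) -> ExtractResult:
    """Parse the JSON body returned by /extract."""
    data = json.loads(payload)
    metrics = data.get("metrics", {})
    return ExtractResult(
        text=data["text"],
        filename=data["filename"],
        file_type=data["file_type"],
        character_count=data["character_count"],
        processing_method=metrics.get("processing_method", "unknown"),
        processing_duration_ms=float(metrics.get("processing_duration_ms", 0.0)),
    )
```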

POST /convert-to-pdf

Convert documents to PDF format.

AI Service Endpoints

POST /ai/embed/text

Generate embeddings for text.

curl -X POST -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "normalize": true}' \
  http://localhost:8005/ai/embed/text
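
The same call can be made from Python with the standard library. In this sketch, `build_json_request` and `post_json` are hypothetical helpers; because the response schema of this endpoint is not documented here, the decoded JSON is returned as-is rather than unpacked into assumed fields.

```python
import json
import urllib.request


def build_json_request(
    path: str, payload: dict, base_url: str = "http://localhost:8005"
) -> urllib.request.Request:
    """Build a POST request carrying `payload` as a JSON body."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(
    path: str,
    payload: dict,
    base_url: str = "http://localhost:8005",
    timeout: float = 60.0,
) -> dict:
    """Send the request and return the decoded JSON response unchanged."""
    request = build_json_request(path, payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `post_json("/ai/embed/text", {"text": "Hello world", "normalize": True})` sends the same body as the curl command above.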

POST /ai/embed/document

Extract text from document and generate embeddings.

curl -X POST -F "file=@document.pdf" http://localhost:8005/ai/embed/document

POST /embed/document

Convert PDF/EPUB to page-by-page embeddings with intelligent chunking.

curl -X POST -F "file=@document.pdf" -F "chunking_strategy=semantic" http://localhost:8005/embed/document

Response:

{
  "success": true,
  "filename": "document.pdf",
  "file_type": ".pdf",
  "chunks": [
    {
      "chunk_id": "document.pdf_page_1_chunk_0",
      "text": "Chapter 1: Introduction...",
      "embedding": [0.123, -0.456, 0.789, ...],
      "metadata": {
        "page_number": 1,
        "chunk_index": 0,
        "char_count": 1250,
        "word_count": 200,
        "contains_headers": true,
        "semantic_boundary": "paragraph"
      }
    }
  ],
  "stats": {
    "total_chunks": 45,
    "total_pages": 20,
    "embedding_dimensions": 384,
    "processing_time_ms": 2340.5
  }
}
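
A common use for this response is similarity search over the returned chunks. The sketch below ranks chunks by cosine similarity against a query vector. It assumes the query embedding was produced by the same embedding model as the chunks; `cosine_similarity` and `top_chunks` are hypothetical helpers that are not part of QuickDoc.

```python
import math
from collections.abc import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_chunks(
    response: dict, query_embedding: Sequence[float], k: int = 3
) -> list[tuple[float, str]]:
    """Return up to `k` (score, chunk_id) pairs, best match first."""
    scored = [
        (cosine_similarity(chunk["embedding"], query_embedding), chunk["chunk_id"])
        for chunk in response["chunks"]
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

For large documents a vector index would be more efficient than this linear scan, but the scoring logic is the same.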

POST /ai/summarize

Summarize text with configurable quality.

curl -X POST -H "Content-Type: application/json" \
  -d '{"text": "Long text to summarize...", "max_length": 150, "quality": "high"}' \
  http://localhost:8005/ai/summarize

POST /ai/tokens/count/{model}

Count tokens for specific models (llama3, mistral, gemini).

curl -X POST -H "Content-Type: application/json" \
  -d '{"text": "Text to count tokens for"}' \
  http://localhost:8005/ai/tokens/count/llama3

Utility Endpoints

  • GET /health - Service health check with feature status
  • GET / - API overview and available endpoints
  • GET /docs - Interactive API documentation (Swagger UI)
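
Because models may take a while to load at startup, a client or deployment script may want to wait for GET /health before sending work. The following is a minimal polling sketch; `http_ok` and `wait_until_healthy` are hypothetical names, the retry counts are arbitrary, and only the HTTP status is checked since the body of the health response is not documented here.

```python
import time
import urllib.request
from collections.abc import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


def wait_until_healthy(
    url: str = "http://localhost:8005/health",
    attempts: int = 30,
    delay: float = 2.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no need to wait after the final attempt
    return False
```

The `probe` and `sleep` parameters exist so the loop can be exercised in tests without a running service.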

Deployment

Docker Deployment

Basic deployment:

docker-compose up -d

With custom configuration:

# Create custom .env file
cp env.example .env
# Edit .env with your settings
docker-compose up -d

Production Deployment

Using Docker with resource limits:

version: '3.8'
services:
  quickdoc:
    image: quickdoc:latest
    ports:
      - "8005:8005"
    env_file: .env
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '1.0'
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8005/health"]
      interval: 30s
      timeout: 10s
      retries: 3

Environment Variables for Production:

# Resource optimization
MAX_FILE_SIZE_MB=50
SUMMARIZATION_TIMEOUT=600
MAX_SUMMARIZATION_QUEUE_SIZE=50
LOG_LEVEL=WARNING

# Security (if using external APIs)
HUGGING_FACE_HUB_TOKEN=your_secure_token

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: quickdoc
spec:
  replicas: 3
  selector:
    matchLabels:
      app: quickdoc
  template:
    metadata:
      labels:
        app: quickdoc
    spec:
      containers:
      - name: quickdoc
        image: quickdoc:latest
        ports:
        - containerPort: 8005
        env:
        - name: ENABLE_SUMMARIZATION
          value: "true"
        - name: ENABLE_EMBEDDING_MODEL
          value: "true"
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8005
          initialDelaySeconds: 30
          periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: quickdoc-service
spec:
  selector:
    app: quickdoc
  ports:
  - port: 80
    targetPort: 8005
  type: LoadBalancer

Development

Setting up Development Environment

# Clone and setup
git clone https://github.com/digitaldrreamer/quickdoc.git
cd quickdoc

# Setup virtual environment
python -m venv venv
source venv/bin/activate

# Install development dependencies
pip install -r requirements.txt

# Run in development mode
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8005

Running Tests

# Run the test suite
python -m pytest

# Test specific endpoints
python test_pdf_endpoints.py
python test_enhanced_pdf.py

Code Quality

The project follows the Google Python Style Guide and includes:

  • Type hints throughout the codebase
  • Comprehensive error handling
  • Structured logging
  • Resource tracking and metrics
  • Async/await patterns for scalability

Technology Stack

  • Framework: FastAPI 0.104+
  • AI/ML:
    • Transformers 4.41+ (summarization)
    • Sentence Transformers 2.7+ (embeddings)
    • PaddleOCR 2.7+ (OCR)
  • Document Processing:
    • PDFMiner.six (PDF text extraction)
    • Pandoc (document conversion)
    • PyMuPDF (PDF rendering)
  • Infrastructure:
    • Docker & Docker Compose
    • Uvicorn/Gunicorn (ASGI servers)
    • Pydantic (configuration & validation)

Contributing

We welcome contributions! Please see our contributing guidelines:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes with tests
  4. Follow the coding standards: Google Python Style Guide
  5. Submit a pull request

Development Guidelines

  • Add type hints to all functions
  • Include docstrings for public methods
  • Write tests for new features
  • Update documentation for API changes
  • Use Better Comments style for inline comments
  • Use descriptive branch names so the purpose of a branch is clear at first glance

License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

  • PaddleOCR for excellent OCR capabilities
  • Hugging Face Transformers for state-of-the-art NLP models
  • FastAPI for being so excellent
  • The open source community for inspiration and tools

Support

  • Documentation: Check the /docs endpoint when the service is running
  • Issues: Please report bugs via GitHub Issues
  • Discussions: Start discussions for feature requests and questions