QuickDoc - Open Source Document & AI Processor

An intelligent, configurable microservice for document text extraction, AI-powered summarization, text embedding, and token counting. Built with FastAPI and designed for scalability, QuickDoc lets you enable only the features you need to optimize resource usage.

Features

Document Processing

Multi-format Support: PDF, DOCX, ODT, RTF, Markdown, EPUB, and images (JPG, PNG, BMP, TIFF)
Page-by-Page Extraction: Extract text from PDFs page by page or as a complete document
Chapter-by-Chapter Extraction: Extract text from EPUBs chapter by chapter or as a complete book
Intelligent OCR: Automatic text extraction from scanned PDFs and images using PaddleOCR
Configurable Processing: Enable/disable specific document types to save resources

AI Utility Services

Text Summarization: Advanced summarization with configurable quality levels using transformer models
Document Embeddings: Convert PDFs/EPUBs to page-by-page embeddings with intelligent chunking
Text Embeddings: Generate semantic embeddings for texts and documents
Token Counting: Accurate token counting for Llama 3, Mistral, and Gemini models
Async Processing: Non-blocking AI operations with queue management

Configuration & Resource Management

Modular Features: Enable only the services you need
Resource Optimization: Conditional model loading based on configuration
Configurable Models: Choose your preferred AI models via environment variables
Production Ready: Docker support with health checks and proper logging

Quick Start

Option 1: Docker (Recommended)

Clone the repository

git clone https://github.com/digitaldrreamer/quickdoc.git
cd quickdoc

Configure your deployment

cp env.example .env
# Edit .env to enable/disable features as needed

Start with Docker Compose
```
docker-compose up --build
```

Test the service

# Test document extraction
echo "Hello, QuickDoc!" > test.md
curl -F "file=@test.md" http://localhost:8005/extract

# Check service status
curl http://localhost:8005/health

The service will be available at http://localhost:8005, with interactive documentation at http://localhost:8005/docs. You can change the port in .env.
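The same upload can be made from Python without third-party packages. The sketch below is a minimal client under stated assumptions: the helper names `build_multipart` and `extract_text` are hypothetical and not part of QuickDoc, and the base URL assumes the default port used in this README.

```python
import json
import urllib.request
import uuid

BASE_URL = "http://localhost:8005"  # assumed default port from this README


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract_text(filename: str, data: bytes, base_url: str = BASE_URL) -> dict:
    """POST a file to /extract and return the decoded JSON response."""
    body, content_type = build_multipart("file", filename, data)
    request = urllib.request.Request(
        f"{base_url}/extract",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

With the service running, `extract_text("test.md", b"Hello, QuickDoc!\n")` should mirror the curl call above.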

Option 2: Manual Installation

Prerequisites

System Dependencies:

# Ubuntu/Debian
sudo apt-get update && sudo apt-get install -y \
    pandoc poppler-utils libmagic1 tesseract-ocr \
    libgl1-mesa-glx libglib2.0-0 libsm6 libxext6 libgomp1

# macOS
brew install pandoc poppler libmagic tesseract

Python Setup:

# Clone and setup
git clone https://github.com/digitaldrreamer/quickdoc.git
cd quickdoc

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Configuration

# Copy and configure environment
cp env.example .env
# Edit .env file with your preferences

Run the Service

# Development
python -m uvicorn app.main:app --host 0.0.0.0 --port 8005 --reload

# Production
gunicorn app.main:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8005

Configuration

QuickDoc is configured through environment variables. Copy env.example to .env and adjust the settings described below:

Core Features

# Enable/disable major components
ENABLE_SUMMARIZATION=true          # AI text summarization
ENABLE_EMBEDDING_MODEL=true        # Text embedding generation  
ENABLE_TOKEN_COUNTING=true         # Token counting for various models
ENABLE_DOCUMENT_PROCESSING=true    # Document text extraction

Document Processing

# Fine-grained document type control
ENABLE_PDF_PROCESSING=true         # PDF text extraction & OCR
ENABLE_DOCX_PROCESSING=true        # Word/ODT/RTF processing
ENABLE_IMAGE_OCR=true              # Image text extraction
ENABLE_MARKDOWN_PROCESSING=true    # Markdown processing

AI Models

# Specify which models to use
SUMMARIZATION_MODEL=google/flan-t5-small    # Hugging Face model for summarization
EMBEDDING_MODEL=all-MiniLM-L6-v2           # Sentence transformer model

Resource Optimization Examples

Minimal Deployment (Text extraction only):

ENABLE_SUMMARIZATION=false
ENABLE_EMBEDDING_MODEL=false
ENABLE_TOKEN_COUNTING=false
ENABLE_IMAGE_OCR=false

PDF-only Service:

ENABLE_DOCX_PROCESSING=false
ENABLE_IMAGE_OCR=false
ENABLE_MARKDOWN_PROCESSING=false

AI-only Service (no document processing):

ENABLE_DOCUMENT_PROCESSING=false
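To illustrate how flags like these can drive conditional model loading, here is a minimal sketch. It is not QuickDoc's actual settings code: the `FeatureFlags` class, the `env_flag` helper, the set of accepted truthy strings, and the choice to default every flag to enabled are all assumptions made for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping

TRUTHY = {"1", "true", "yes", "on"}  # assumed set of accepted "enabled" values


def env_flag(env: Mapping[str, str], name: str, default: bool = True) -> bool:
    """Read a boolean flag; unset variables fall back to the default."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


@dataclass(frozen=True)
class FeatureFlags:
    summarization: bool
    embedding: bool
    token_counting: bool
    document_processing: bool

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "FeatureFlags":
        return cls(
            summarization=env_flag(env, "ENABLE_SUMMARIZATION"),
            embedding=env_flag(env, "ENABLE_EMBEDDING_MODEL"),
            token_counting=env_flag(env, "ENABLE_TOKEN_COUNTING"),
            document_processing=env_flag(env, "ENABLE_DOCUMENT_PROCESSING"),
        )

    def models_to_load(self) -> list[str]:
        """Only features backed by a model contribute to startup cost."""
        names = []
        if self.summarization:
            names.append("summarization")
        if self.embedding:
            names.append("embedding")
        return names
```

Under these assumptions, the minimal deployment above yields an empty `models_to_load()` list, so no transformer model would be loaded at startup.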

API Documentation

Document Conversion Endpoints

`POST /extract`

Extract text from documents.

curl -X POST -F "file=@document.pdf" http://localhost:8005/extract

Response:

{
  "text": "Extracted text content...",
  "filename": "document.pdf",
  "file_type": ".pdf",
  "character_count": 1250,
  "metrics": {
    "processing_duration_ms": 150.2,
    "memory_usage_mb": 75.3,
    "processing_method": "pdfminer"
  }
}
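A client may want to turn this response into a typed object. The sketch below parses the fields shown in the example response; the `ExtractResult` class and `parse_extract_response` function are hypothetical names, and treating `metrics` as optional is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractResult:
    text: str
    filename: str
    file_type: str
    character_count: int
    duration_ms: float
    method: str


def parse_extract_response(payload: str) -> ExtractResult:
    """Map the /extract JSON body onto a small typed record."""
    data = json.loads(payload)
    metrics = data.get("metrics", {})  # assumed optional for robustness
    return ExtractResult(
        text=data["text"],
        filename=data["filename"],
        file_type=data["file_type"],
        character_count=int(data["character_count"]),
        duration_ms=float(metrics.get("processing_duration_ms", 0.0)),
        method=metrics.get("processing_method", "unknown"),
    )
```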

`POST /convert-to-pdf`

Convert documents to PDF format.

AI Service Endpoints

`POST /ai/embed/text`

Generate embeddings for text.

curl -X POST -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "normalize": true}' \
  http://localhost:8005/ai/embed/text
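The response shape for this endpoint is not shown here, so the following sketch works on plain lists of floats rather than on a response object. It assumes that `normalize` refers to L2 normalization, which is a common convention but not confirmed by this README; under that assumption, cosine similarity between two normalized embeddings is just their dot product. The function names are illustrative.

```python
import math
from typing import Sequence


def l2_normalize(vector: Sequence[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, computed as the dot product of unit vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    unit_a, unit_b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(unit_a, unit_b))
```

If the service already returns unit-length vectors, the normalization step is redundant and the dot product alone is enough.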

`POST /ai/embed/document`

Extract text from document and generate embeddings.

curl -X POST -F "file=@document.pdf" http://localhost:8005/ai/embed/document

`POST /embed/document`

Convert PDF/EPUB to page-by-page embeddings with intelligent chunking.

curl -X POST -F "file=@document.pdf" -F "chunking_strategy=semantic" http://localhost:8005/embed/document

Response:

{
  "success": true,
  "filename": "document.pdf",
  "file_type": ".pdf",
  "chunks": [
    {
      "chunk_id": "document.pdf_page_1_chunk_0",
      "text": "Chapter 1: Introduction...",
      "embedding": [0.123, -0.456, 0.789, ...],
      "metadata": {
        "page_number": 1,
        "chunk_index": 0,
        "char_count": 1250,
        "word_count": 200,
        "contains_headers": true,
        "semantic_boundary": "paragraph"
      }
    }
  ],
  "stats": {
    "total_chunks": 45,
    "total_pages": 20,
    "embedding_dimensions": 384,
    "processing_time_ms": 2340.5
  }
}
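Downstream code usually needs the chunks regrouped by page and a sanity check on vector size. The sketch below relies only on the fields shown in the example response; the function name `chunks_by_page` is hypothetical, and rejecting chunks whose embedding length differs from `stats.embedding_dimensions` is a design choice for the example rather than documented service behavior.

```python
from collections import defaultdict
from typing import Any


def chunks_by_page(response: dict[str, Any]) -> dict[int, list[str]]:
    """Group chunk texts by page number, in chunk order, validating dimensions."""
    expected_dims = response["stats"]["embedding_dimensions"]
    ordered = sorted(
        response["chunks"],
        key=lambda c: (c["metadata"]["page_number"], c["metadata"]["chunk_index"]),
    )
    pages: dict[int, list[str]] = defaultdict(list)
    for chunk in ordered:
        if len(chunk["embedding"]) != expected_dims:
            raise ValueError(f"unexpected embedding size in {chunk['chunk_id']}")
        pages[chunk["metadata"]["page_number"]].append(chunk["text"])
    return dict(pages)
```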

`POST /ai/summarize`

Summarize text with configurable quality.

curl -X POST -H "Content-Type: application/json" \
  -d '{"text": "Long text to summarize...", "max_length": 150, "quality": "high"}' \
  http://localhost:8005/ai/summarize

`POST /ai/tokens/count/{model}`

Count tokens for specific models (llama3, mistral, gemini).

curl -X POST -H "Content-Type: application/json" \
  -d '{"text": "Text to count tokens for"}' \
  http://localhost:8005/ai/tokens/count/llama3
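Because the model name is part of the URL path, a client can validate it before sending anything. The following sketch builds the request with the standard library; `build_token_count_request` is a hypothetical helper, the base URL assumes the default port, and the response format is not documented here, so the sketch stops at constructing the request.

```python
import json
import urllib.request

SUPPORTED_MODELS = ("llama3", "mistral", "gemini")  # as listed in this README


def build_token_count_request(
    text: str, model: str, base_url: str = "http://localhost:8005"
) -> urllib.request.Request:
    """Build a POST request for /ai/tokens/count/{model}."""
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model: {model!r}")
    return urllib.request.Request(
        f"{base_url}/ai/tokens/count/{model}",
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends the same request as the curl example.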

Utility Endpoints

GET /health - Service health check with feature status
GET / - API overview and available endpoints
GET /docs - Interactive API documentation (Swagger UI)

Deployment

Docker Deployment

Basic deployment:

docker-compose up -d

With custom configuration:

# Create custom .env file
cp env.example .env
# Edit .env with your settings
docker-compose up -d

Production Deployment

Using Docker with resource limits:

version: '3.8'
services:
  quickdoc:
    image: quickdoc:latest
    ports:
      - "8005:8005"
    env_file: .env
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '1.0'
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8005/health"]
      interval: 30s
      timeout: 10s
      retries: 3
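A deploy script can run a similar probe from outside the container before routing traffic. This is a minimal sketch, not part of QuickDoc: `http_ok` and `wait_until_healthy` are hypothetical helpers, the default values simply echo the healthcheck settings above, and the loop stops at the first success rather than reproducing Docker's own health state machine.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 10.0) -> bool:
    """Return True for a 2xx response; errors and timeouts count as failures."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(
    url: str,
    retries: int = 3,
    interval: float = 30.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe up to `retries` times, pausing `interval` seconds between attempts."""
    for attempt in range(retries):
        if probe(url):
            return True
        if attempt < retries - 1:
            sleep(interval)
    return False
```

For example, `wait_until_healthy("http://localhost:8005/health")` returns `False` if all three probes fail.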

Environment Variables for Production:

# Resource optimization
MAX_FILE_SIZE_MB=50
SUMMARIZATION_TIMEOUT=600
MAX_SUMMARIZATION_QUEUE_SIZE=50
LOG_LEVEL=WARNING

# Security (if using external APIs)
HUGGING_FACE_HUB_TOKEN=your_secure_token
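To make the roles of `MAX_SUMMARIZATION_QUEUE_SIZE` and `SUMMARIZATION_TIMEOUT` concrete, the sketch below shows one way a bounded, time-limited job runner could behave. It is an illustration under assumptions, not QuickDoc's implementation: the `BoundedRunner` and `QueueFullError` names are hypothetical, and running one job at a time and treating the timeout as seconds are both guesses.

```python
import asyncio
from typing import Awaitable, Callable


class QueueFullError(RuntimeError):
    """Raised when the number of pending jobs has reached the limit."""


class BoundedRunner:
    def __init__(self, max_pending: int, timeout_s: float) -> None:
        self._max_pending = max_pending  # e.g. MAX_SUMMARIZATION_QUEUE_SIZE
        self._timeout_s = timeout_s      # e.g. SUMMARIZATION_TIMEOUT (assumed seconds)
        self._pending = 0
        self._lock = asyncio.Lock()      # assumption: jobs run one at a time

    async def submit(self, job: Callable[[], Awaitable[str]]) -> str:
        """Reject when full; otherwise wait for a turn and run with a timeout."""
        if self._pending >= self._max_pending:
            raise QueueFullError("summarization queue is full")
        self._pending += 1
        try:
            async with self._lock:
                return await asyncio.wait_for(job(), timeout=self._timeout_s)
        finally:
            self._pending -= 1
```

Rejecting work immediately when the queue is full keeps memory bounded and lets the caller retry later, while the timeout stops a single long input from blocking every job behind it.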

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: quickdoc
spec:
  replicas: 3
  selector:
    matchLabels:
      app: quickdoc
  template:
    metadata:
      labels:
        app: quickdoc
    spec:
      containers:
      - name: quickdoc
        image: quickdoc:latest
        ports:
        - containerPort: 8005
        env:
        - name: ENABLE_SUMMARIZATION
          value: "true"
        - name: ENABLE_EMBEDDING_MODEL
          value: "true"
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8005
          initialDelaySeconds: 30
          periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: quickdoc-service
spec:
  selector:
    app: quickdoc
  ports:
  - port: 80
    targetPort: 8005
  type: LoadBalancer

Development

Setting up Development Environment

# Clone and setup
git clone https://github.com/digitaldrreamer/quickdoc.git
cd quickdoc

# Setup virtual environment
python3 -m venv venv
source venv/bin/activate

# Install development dependencies
pip install -r requirements.txt

# Run in development mode
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8005

Running Tests

# Run the test suite
python -m pytest

# Test specific endpoints
python test_pdf_endpoints.py
python test_enhanced_pdf.py

Code Quality

The project follows the Google Python Style Guide and includes:

Type hints throughout the codebase
Comprehensive error handling
Structured logging
Resource tracking and metrics
Async/await patterns for scalability

Technology Stack

Framework: FastAPI 0.104+
AI/ML:
- Transformers 4.41+ (summarization)
- Sentence Transformers 2.7+ (embeddings)
- PaddleOCR 2.7+ (OCR)
Document Processing:
- PDFMiner.six (PDF text extraction)
- Pandoc (document conversion)
- PyMuPDF (PDF rendering)
Infrastructure:
- Docker & Docker Compose
- Uvicorn/Gunicorn (ASGI servers)
- Pydantic (configuration & validation)

Contributing

We welcome contributions! To contribute, follow these steps:

Fork the repository
Create a feature branch: git checkout -b feature/amazing-feature
Make your changes with tests
Follow the coding standards: Google Python Style Guide
Submit a pull request

Development Guidelines

Add type hints to all functions
Include docstrings for public methods
Write tests for new features
Update documentation for API changes
Use Better Comments style for inline comments
Use descriptive branch names so the purpose of a change is clear at a glance

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

PaddleOCR for excellent OCR capabilities
Hugging Face Transformers for state-of-the-art NLP models
FastAPI for being so excellent
The open source community for inspiration and tools

Support

Documentation: Check the /docs endpoint while the service is running
Issues: Please report bugs via GitHub Issues
Discussions: Start discussions for feature requests and questions
