This repository was archived by the owner on Nov 15, 2025. It is now read-only.

Migrate to Local Multi-Model Architecture (BLIP-2 + Whisper + CLIP + Phi-3.5) #36

@thewildofficial

Description


Problem

Currently, the project uses Google's Gemini 2.5 Flash API for Vision Language Model (VLM) analysis, which violates our constraint of not using external APIs. This dependency:

  • Requires API keys and external network calls
  • Creates privacy/security concerns
  • Adds latency and costs
  • Makes the system dependent on external services

Current Implementation:

  • src/media/vlm_analyzer.py uses google-generativeai SDK
  • requirements.txt includes google-generativeai==0.8.3
  • Configuration requires GEMINI_API_KEY environment variable
  • Model: gemini-2.5-flash or gemini-2.5-flash-lite

Revised Solution: Multi-Model Local Architecture

After evaluating hardware constraints (6GB VRAM available), we will use specialized lightweight models that collectively provide multimodal capabilities:

🏗️ Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                    Backend Services Layer                    │
├─────────────────────────────────────────────────────────────┤
│  Image Analysis  │  Video Analysis  │  Audio Processing    │
│    Service       │     Service      │      Service         │
└────────┬─────────┴────────┬─────────┴──────────┬───────────┘
         │                  │                     │
    ┌────▼─────┐      ┌────▼─────┐         ┌────▼─────┐
    │  BLIP-2  │      │   CLIP   │         │ Whisper  │
    │ Caption  │      │ Embedder │         │  Base    │
    │ (~1.5GB) │      │ (~500MB) │         │ (~1.5GB) │
    └──────────┘      └──────────┘         └──────────┘
                            │
                    ┌───────▼────────┐
                    │ Phi-3.5-mini   │
                    │ 128K Context   │
                    │   (~2.5GB)     │
                    └────────────────┘
Total VRAM: ~5.5-6GB ✅

📦 Model Selection & Justification

1️⃣ Image Understanding: BLIP-2 (~1.5GB VRAM)

  • Model: Salesforce/blip2-opt-2.7b or blip2-flan-t5-xl
  • Purpose: Image captioning, visual question answering
  • Capabilities:
    • Detailed natural language image descriptions
    • Scene understanding and object detection
    • Visual question answering
    • Context-aware captioning
  • Performance: ~2-3s per image on GPU
  • Output: Rich natural language descriptions

2️⃣ Image Embeddings: CLIP ViT-B-32 (~500MB VRAM)

  • Purpose: Semantic embeddings for clustering & search
  • Why Keep: Already implemented, lightweight, excellent quality
  • Performance: ~100ms per image
  • Output: 512-dim normalized vectors
  • Note: No changes needed to existing clustering logic!
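One property worth noting for the clustering path: because CLIP's 512-dim vectors are L2-normalized, cosine similarity reduces to a plain dot product, which is what makes large-scale nearest-neighbor search cheap. A stdlib-only illustration:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (what CLIP embeddings already are)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # For L2-normalized vectors, cosine similarity is just the dot product.
    return sum(x * y for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
print(round(cosine(a, b), 4))  # → 0.96
```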

3️⃣ Audio Transcription: Whisper Base (~1.5GB VRAM)

  • Model: openai/whisper-base or whisper-small
  • Purpose: Speech-to-text, multilingual transcription
  • Capabilities:
    • Support for 99 languages
    • Automatic language detection
    • Timestamp generation
    • Robust to background noise
  • Performance: ~1× real-time (1 min audio in ~60s)
  • Output: Transcription + timestamps + language detection
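A hedged sketch of the transcription step using the `openai-whisper` package (`whisper.load_model` / `model.transcribe` are its real API). The `format_segments` helper and the output schema are illustrative additions for turning Whisper's segment list into the "transcription + timestamps + language" output described above.

```python
def format_segments(segments) -> str:
    """Render Whisper segments as '[MM:SS] text' lines.

    Each segment is a dict with 'start' (seconds) and 'text' keys,
    matching the openai-whisper result format.
    """
    lines = []
    for seg in segments:
        m, s = divmod(int(seg["start"]), 60)
        lines.append(f"[{m:02d}:{s:02d}] {seg['text'].strip()}")
    return "\n".join(lines)

def transcribe(audio_path: str) -> dict:
    """Sketch: transcribe audio with openai-whisper. The import is
    deferred so the helper above works without the package installed."""
    import whisper  # pip install openai-whisper

    model = whisper.load_model("base")       # ~1.5GB VRAM in FP16
    result = model.transcribe(audio_path)    # automatic language detection
    return {
        "language": result["language"],
        "text": result["text"],
        "timestamps": format_segments(result["segments"]),
    }
```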

4️⃣ Text Analysis & Reasoning: Phi-3.5-mini (~2.5GB VRAM) ⭐

  • Model: microsoft/Phi-3.5-mini-instruct (3.8B parameters)
  • Purpose:
    • Extract structured metadata from BLIP captions
    • Analyze Whisper transcriptions semantically
    • Directory structure reasoning & file organization
    • Semantic tag generation
    • Long document understanding

Why Phi-3.5-mini fits:

  • 128K context - process entire documents/file collections in one pass
  • Quantization friendly - FP8/INT8 inference with minimal quality loss
  • Resource efficient - runs on consumer laptops with 6GB VRAM
  • Better metadata extraction - understands context, unlike purely statistical approaches
  • Strong for its size - competitive with Llama 3.1 8B on several reasoning benchmarks despite far fewer parameters

Performance: ~50-100 tokens/s on 6GB GPU
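Since the deployment below serves Phi-3.5-mini through vLLM's OpenAI-compatible API, the metadata-extraction step can be a plain HTTP call. This stdlib-only sketch assumes the `phi35-server:8003` endpoint from the compose file; the prompt wording, JSON schema, and `extract_json` helper are illustrative.

```python
import json
import urllib.request

def extract_json(text: str) -> dict:
    """Pull the first {...} object out of a model reply, tolerating
    surrounding prose or code fences."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in reply")
    return json.loads(text[start : end + 1])

def phi35_metadata(caption: str, endpoint: str = "http://phi35-server:8003/v1") -> dict:
    """Sketch: ask Phi-3.5-mini (served by vLLM's OpenAI-compatible API)
    to turn a BLIP-2 caption into structured metadata."""
    prompt = (
        "Extract metadata from this image caption as JSON with keys "
        '"subjects", "setting", "tags" (lists of strings).\n'
        f"Caption: {caption}\nJSON:"
    )
    body = json.dumps({
        "model": "microsoft/Phi-3.5-mini-instruct",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    }).encode()
    req = urllib.request.Request(
        f"{endpoint}/chat/completions", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["choices"][0]["message"]["content"]
    return extract_json(reply)
```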


🔄 Implementation Pipeline

Image Analysis Workflow

Image → BLIP-2 Caption → Phi-3.5 Metadata Extraction → Structured JSON
  ↓
CLIP Embedding → Clustering & Search

Video Analysis Workflow

Video → Keyframes → BLIP-2 Captions (per frame)
                         ↓
              Phi-3.5 Temporal Analysis → Scene Understanding
                         ↓
              CLIP Embeddings → Video-level Embedding
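The keyframe-extraction step above is unspecified; one simple baseline is uniform sampling (scene-change detection would be an alternative). A sketch of the index selection, which works with any decoder (e.g. OpenCV) that can seek to a frame number:

```python
def keyframe_indices(total_frames: int, n_keyframes: int) -> list[int]:
    """Pick evenly spaced frame indices: the center frame of each of
    n equal-length bins, a simple uniform-sampling keyframe strategy."""
    n = min(n_keyframes, total_frames)
    return [int((i + 0.5) * total_frames / n) for i in range(n)]

print(keyframe_indices(300, 5))  # → [30, 90, 150, 210, 270]
```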

Audio Analysis Workflow

Audio → Whisper Transcription → Phi-3.5 Semantic Analysis
                                      ↓
                              Tags + Summary + Classification

File Organization Workflow

Files → Extract Metadata (BLIP/Whisper) → Phi-3.5 128K Context
                                                ↓
                                   Analyze entire collection
                                                ↓
                                   Suggest directory structure
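The key enabler for this workflow is packing every file's extracted metadata into one prompt that fits Phi-3.5's 128K window. A sketch of the prompt builder; the per-file `{"path", "summary"}` schema and instruction wording are illustrative assumptions.

```python
def collection_prompt(files: list[dict]) -> str:
    """Build a single prompt covering every file's extracted metadata,
    so Phi-3.5's 128K context can reason over the whole collection.

    Each entry is assumed to look like {"path": ..., "summary": ...},
    where the summary came from BLIP-2 or Whisper.
    """
    lines = [
        "Suggest a directory structure for these files.",
        "Reply as JSON mapping new directory paths to lists of original paths.",
        "",
    ]
    for f in files:
        lines.append(f"- {f['path']}: {f['summary']}")
    return "\n".join(lines)
```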

📊 VRAM Budget Breakdown

Model               VRAM (FP16)   VRAM (FP8/INT8)   Purpose
CLIP ViT-B-32       ~500MB        ~300MB            Embeddings
BLIP-2 OPT-2.7B     ~1.5GB        ~1GB              Image captions
Whisper Base        ~1.5GB        ~1GB              Audio transcription
Phi-3.5-mini        ~2.5GB        ~2GB              Text reasoning
Total               ~6GB          ~4.5GB            ✅ Within budget

Optimization: quantized (FP8/INT8) variants free up memory for larger batches or concurrent requests.


🐳 Docker Deployment Architecture

version: '3.8'

services:
  app:
    build: .
    ports:
      - "8080:8080"
    depends_on:
      - blip-server
      - whisper-server
      - phi35-server
    environment:
      - BLIP_ENDPOINT=http://blip-server:8001
      - WHISPER_ENDPOINT=http://whisper-server:8002
      - PHI35_ENDPOINT=http://phi35-server:8003
      - DATABASE_URL=postgresql://user:pass@postgres:5432/mammothbox

  # BLIP-2 Image Captioning Service
  blip-server:
    build: ./services/blip
    ports:
      - "8001:8001"
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
        limits:
          memory: 4G

  # Whisper Audio Transcription Service
  whisper-server:
    build: ./services/whisper
    ports:
      - "8002:8002"
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
        limits:
          memory: 4G

  # Phi-3.5-mini Text Reasoning Service
  phi35-server:
    image: vllm/vllm-openai:latest
    ports:
      - "8003:8003"
    environment:
      - MODEL_NAME=microsoft/Phi-3.5-mini-instruct
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    command:
      - --model
      - microsoft/Phi-3.5-mini-instruct
      - --max-model-len
      - "131072"
      - --port
      - "8003"
      - --dtype
      - auto
      - --gpu-memory-utilization
      - "0.4"

  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: mammothbox
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
    volumes:
      - postgres_data:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

volumes:
  postgres_data:
  model_cache:
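On the `app` side, the three `*_ENDPOINT` variables from the compose file can be resolved with a small helper. The localhost defaults (for running outside Docker) and the `/caption` path are hypothetical; only the environment variable names come from the compose file above.

```python
import os

def service_url(name: str, path: str) -> str:
    """Resolve a backend service URL from the compose environment
    (BLIP_ENDPOINT, WHISPER_ENDPOINT, PHI35_ENDPOINT), falling back
    to localhost defaults for running outside Docker."""
    defaults = {
        "BLIP": "http://localhost:8001",
        "WHISPER": "http://localhost:8002",
        "PHI35": "http://localhost:8003",
    }
    base = os.environ.get(f"{name}_ENDPOINT", defaults[name]).rstrip("/")
    return f"{base}/{path.lstrip('/')}"
```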

📋 Migration Roadmap

See the individual phase issues for detailed tasks.


✅ Success Criteria

  • ✅ No external API calls required
  • ✅ All models run locally on 6GB VRAM
  • ✅ Image analysis accuracy >90% of Gemini baseline
  • ✅ Audio transcription accuracy >95% for clear speech
  • ✅ Inference time <5s per media file
  • ✅ Total memory footprint ≤6GB VRAM
  • ✅ File organization suggestions >80% user acceptance
  • ✅ All existing tests passing
  • ✅ 128K context enables whole-collection analysis

🎯 Key Advantages

  1. Hardware Optimized: Precisely fits 6GB VRAM budget
  2. Modular Architecture: Independent microservices, easy to debug/scale
  3. Best-in-Class Models: Each model is optimized for its specific task
  4. Future-Proof: Can upgrade individual models without system rewrite
  5. No External Dependencies: Fully local, private, no API costs
  6. Production Ready: All models battle-tested and widely adopted
  7. Exceptional Context: Phi-3.5's 128K tokens enables sophisticated reasoning
  8. Keep What Works: CLIP embeddings/clustering unchanged!
