This repository was archived by the owner on Nov 15, 2025. It is now read-only.
Migrate to Local Multi-Model Architecture (BLIP-2 + Whisper + CLIP + Phi-3.5) #36
Problem
Currently, the project uses Google's Gemini 2.5 Flash API for Vision Language Model (VLM) analysis, which violates our constraint of not using external APIs. This dependency:
- Requires API keys and external network calls
- Creates privacy/security concerns
- Adds latency and costs
- Makes the system dependent on external services
Current Implementation:
- src/media/vlm_analyzer.py uses the google-generativeai SDK
- requirements.txt includes google-generativeai==0.8.3
- Configuration requires the GEMINI_API_KEY environment variable
- Model: gemini-2.5-flash or gemini-2.5-flash-lite
Revised Solution: Multi-Model Local Architecture
After evaluating hardware constraints (6GB VRAM available), we will use specialized lightweight models that collectively provide multimodal capabilities:
🏗️ Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│ Backend Services Layer │
├─────────────────────────────────────────────────────────────┤
│ Image Analysis │ Video Analysis │ Audio Processing │
│ Service │ Service │ Service │
└────────┬─────────┴────────┬─────────┴──────────┬───────────┘
│ │ │
┌────▼─────┐ ┌────▼─────┐ ┌────▼─────┐
│ BLIP-2 │ │ CLIP │ │ Whisper │
│ Caption │ │ Embedder │ │ Base │
│ (~1.5GB) │ │ (~500MB) │ │ (~1.5GB) │
└──────────┘ └──────────┘ └──────────┘
│
┌───────▼────────┐
│ Phi-3.5-mini │
│ 128K Context │
│ (~2.5GB) │
└────────────────┘
Total VRAM: ~5.5-6GB ✅
📦 Model Selection & Justification
1️⃣ Image Understanding: BLIP-2 (~1.5GB VRAM)
- Model: Salesforce/blip2-opt-2.7b or Salesforce/blip2-flan-t5-xl
- Purpose: Image captioning, visual question answering
- Capabilities:
- Detailed natural language image descriptions
- Scene understanding and object detection
- Visual question answering
- Context-aware captioning
- Performance: ~2-3s per image on GPU
- Output: Rich natural language descriptions
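As a concrete sketch, the captioning step reduces to a processor/model pair from Hugging Face transformers. The function names and generation settings below are illustrative assumptions, not the project's actual service code; heavy dependencies are imported lazily so the module can be imported without torch/transformers installed.

```python
# Sketch: BLIP-2 image captioning. Model id comes from this issue
# (Salesforce/blip2-opt-2.7b); loader/caption helpers are hypothetical glue.

BLIP2_MODEL_ID = "Salesforce/blip2-opt-2.7b"

def load_blip2(device: str = "cuda"):
    """Load processor + model once at service startup (fp16 to fit ~1.5GB VRAM)."""
    import torch  # lazy: only needed on the inference path
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    processor = Blip2Processor.from_pretrained(BLIP2_MODEL_ID)
    model = Blip2ForConditionalGeneration.from_pretrained(
        BLIP2_MODEL_ID, torch_dtype=torch.float16
    ).to(device)
    model.eval()
    return processor, model

def caption_image(processor, model, image_path: str, device: str = "cuda") -> str:
    """Return a natural-language caption for a single image."""
    import torch
    from PIL import Image

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device, torch.float16)
    with torch.no_grad():
        ids = model.generate(**inputs, max_new_tokens=60)
    return processor.decode(ids[0], skip_special_tokens=True).strip()
```

In the microservice, `load_blip2` would run once at startup and `caption_image` per request, which is where the ~2-3s per image figure applies.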
2️⃣ Image Embeddings: CLIP ViT-B-32 (~500MB VRAM)
- Purpose: Semantic embeddings for clustering & search
- Why Keep: Already implemented, lightweight, excellent quality
- Performance: ~100ms per image
- Output: 512-dim normalized vectors
- Note: No changes needed to existing clustering logic!
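The CLIP branch stays as-is; for reference, the "512-dim normalized vectors" contract can be sketched as below. The normalization helper is pure; `embed_image` is hypothetical glue around the openai/CLIP package (in the real service the model would be loaded once, not per call).

```python
# Sketch of the CLIP embedding step (unchanged by this migration).
import numpy as np

def l2_normalize(vec: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length so cosine similarity == dot product."""
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def embed_image(image_path: str, device: str = "cuda") -> np.ndarray:
    """Return a 512-dim unit-norm ViT-B/32 embedding for one image."""
    import clip   # lazy: heavy deps only needed on the inference path
    import torch
    from PIL import Image

    model, preprocess = clip.load("ViT-B/32", device=device)  # load once in practice
    with torch.no_grad():
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        features = model.encode_image(image)[0].cpu().numpy()
    return l2_normalize(features.astype(np.float32))
```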
3️⃣ Audio Transcription: Whisper Base (~1.5GB VRAM)
- Model:
openai/whisper-baseorwhisper-small - Purpose: Speech-to-text, multilingual transcription
- Capabilities:
- 99 language support
- Automatic language detection
- Timestamp generation
- Robust to background noise
- Performance: ~1× real-time (1 min audio in ~60s)
- Output: Transcription + timestamps + language detection
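A minimal sketch of that output contract, using the openai-whisper package (the dict shape and the `format_timestamp` helper are illustrative assumptions, not the service's final schema):

```python
# Sketch: Whisper transcription with text + language + per-segment timestamps.

def format_timestamp(seconds: float) -> str:
    """Render seconds as H:MM:SS for segment boundaries."""
    s = int(seconds)
    return f"{s // 3600}:{(s % 3600) // 60:02d}:{s % 60:02d}"

def transcribe(audio_path: str, model_name: str = "base") -> dict:
    """Return text, detected language, and timestamped segments."""
    import whisper  # lazy: heavy dependency, only needed at inference time

    model = whisper.load_model(model_name)  # in the service: load once at startup
    result = model.transcribe(audio_path)   # language auto-detected by default
    return {
        "language": result["language"],
        "text": result["text"].strip(),
        "segments": [
            {"start": format_timestamp(seg["start"]),
             "end": format_timestamp(seg["end"]),
             "text": seg["text"].strip()}
            for seg in result["segments"]
        ],
    }
```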
4️⃣ Text Analysis & Reasoning: Phi-3.5-mini (~2.5GB VRAM) ⭐
- Model:
microsoft/Phi-3.5-mini-instruct(3.8B parameters) - Purpose:
- Extract structured metadata from BLIP captions
- Analyze Whisper transcriptions semantically
- Directory structure reasoning & file organization
- Semantic tag generation
- Long document understanding
Why Phi-3.5-mini is Perfect:
- ✅ 128K context - Process entire documents/file collections in one pass
- ✅ Quantization-aware - FP8 inference with minimal quality loss
- ✅ Resource efficient - Runs on consumer laptops with 6GB VRAM
- ✅ Better metadata extraction - Understands context vs. purely statistical approaches
- ✅ Outperforms larger models - Beats Llama 3.1 8B on complex reasoning despite fewer parameters
Performance: ~50-100 tokens/s on 6GB GPU
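Since the Phi-3.5 service runs behind vLLM's OpenAI-compatible API (see the compose file below), the backend can call it with plain HTTP. The endpoint constant matches the compose environment; the metadata schema (tags/category/description) and helper names are assumptions for illustration.

```python
# Sketch: extracting structured metadata from a BLIP caption via Phi-3.5.
import json
import urllib.request

PHI35_ENDPOINT = "http://phi35-server:8003"  # PHI35_ENDPOINT in docker-compose

def build_metadata_prompt(caption: str) -> str:
    """Ask for strict JSON so the response can be parsed mechanically."""
    return (
        "Extract metadata from this image caption.\n"
        f"Caption: {caption}\n"
        'Reply with JSON only: {"tags": [...], "category": "...", "description": "..."}'
    )

def parse_json_reply(reply: str) -> dict:
    """Tolerate models that wrap JSON in prose: take the outermost {...} span."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in model reply")
    return json.loads(reply[start:end + 1])

def extract_metadata(caption: str) -> dict:
    """POST to vLLM's OpenAI-compatible chat endpoint and parse the reply."""
    body = json.dumps({
        "model": "microsoft/Phi-3.5-mini-instruct",
        "messages": [{"role": "user", "content": build_metadata_prompt(caption)}],
        "temperature": 0.0,  # deterministic output for structured extraction
    }).encode()
    req = urllib.request.Request(
        f"{PHI35_ENDPOINT}/v1/chat/completions", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["choices"][0]["message"]["content"]
    return parse_json_reply(reply)
```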
🔄 Implementation Pipeline
Image Analysis Workflow
Image → BLIP-2 Caption → Phi-3.5 Metadata Extraction → Structured JSON
↓
CLIP Embedding
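The image workflow above can be sketched as one orchestration function. The caption/embed/metadata steps are injected as callables so the pipeline can be exercised with stubs; the names are illustrative, not the project's actual interfaces.

```python
# Sketch: fan-out of the image workflow (BLIP-2 -> Phi-3.5, plus CLIP branch).
from typing import Callable

def analyze_image(
    image_path: str,
    caption_fn: Callable[[str], str],    # BLIP-2 captioning service
    embed_fn: Callable[[str], list],     # CLIP embedding service
    metadata_fn: Callable[[str], dict],  # Phi-3.5 metadata extraction
) -> dict:
    """Caption the image, derive structured metadata from the caption,
    and compute the CLIP embedding independently; merge into one record."""
    caption = caption_fn(image_path)
    return {
        "path": image_path,
        "caption": caption,
        "metadata": metadata_fn(caption),   # structured JSON from the caption
        "embedding": embed_fn(image_path),  # independent CLIP branch
    }
```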
Video Analysis Workflow
Video → Keyframes → BLIP-2 Captions (per frame)
↓
Phi-3.5 Temporal Analysis → Scene Understanding
↓
CLIP Embeddings → Video-level Embedding
Audio Analysis Workflow
Audio → Whisper Transcription → Phi-3.5 Semantic Analysis
↓
Tags + Summary + Classification
File Organization Workflow
Files → Extract Metadata (BLIP/Whisper) → Phi-3.5 128K Context
↓
Analyze entire collection
↓
Suggest directory structure
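The "analyze entire collection" step relies on fitting all extracted metadata into Phi-3.5's 128K window. A sketch of that packing, with a simple token-budget guard (the ~4-characters-per-token heuristic and the prompt wording are assumptions for illustration):

```python
# Sketch: one-pass file-organization prompt for the 128K context window.

MAX_CONTEXT_TOKENS = 131_072   # --max-model-len in the vLLM service config
RESERVED_OUTPUT_TOKENS = 4_096  # headroom for the model's reply

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def build_organization_prompt(files: list) -> str:
    """files: [{"path": ..., "summary": ...}] -> single organization prompt."""
    lines = [f'- {f["path"]}: {f["summary"]}' for f in files]
    prompt = (
        "Given these files and their extracted metadata, suggest a directory\n"
        "structure that groups them semantically. Reply as JSON mapping\n"
        "directory -> list of file paths.\n\n" + "\n".join(lines)
    )
    budget = MAX_CONTEXT_TOKENS - RESERVED_OUTPUT_TOKENS
    if estimate_tokens(prompt) > budget:
        raise ValueError("collection too large for one pass; split into batches")
    return prompt
```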
📊 VRAM Budget Breakdown
| Model | VRAM (FP16) | VRAM (FP8/INT8) | Purpose |
|---|---|---|---|
| CLIP ViT-B-32 | ~500MB | ~300MB | Embeddings |
| BLIP-2 OPT-2.7B | ~1.5GB | ~1GB | Image captions |
| Whisper Base | ~1.5GB | ~1GB | Audio transcription |
| Phi-3.5-mini (FP8) | ~2.5GB | ~2GB | Text reasoning |
| Total | ~6GB | ~4.5GB | ✅ Within budget |
Optimization: Can use quantized versions to free up memory for larger batches or concurrent requests.
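The budget arithmetic above can be double-checked mechanically. The figures below are the table's estimates; the helper just verifies that a chosen precision mix stays under the 6GB cap (the quantized total comes to ~4.3GB, which the table rounds to ~4.5GB).

```python
# Sanity check for the VRAM budget table (values in MB, from the table above).

VRAM_MB = {  # name: (fp16_mb, quantized_mb)
    "clip-vit-b-32": (500, 300),
    "blip2-opt-2.7b": (1500, 1000),
    "whisper-base": (1500, 1000),
    "phi-3.5-mini": (2500, 2000),
}

def total_vram_mb(quantized: frozenset = frozenset()) -> int:
    """Sum the budget, using the quantized column for the named models."""
    return sum(q if name in quantized else fp16
               for name, (fp16, q) in VRAM_MB.items())

def fits_budget(budget_mb: int = 6144, quantized: frozenset = frozenset()) -> bool:
    """True if the selected mix fits in the 6GB (6144MB) VRAM budget."""
    return total_vram_mb(quantized) <= budget_mb
```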
🐳 Docker Deployment Architecture
```yaml
version: '3.8'
services:
  app:
    build: .
    ports:
      - "8080:8080"
    depends_on:
      - blip-server
      - whisper-server
      - phi35-server
    environment:
      - BLIP_ENDPOINT=http://blip-server:8001
      - WHISPER_ENDPOINT=http://whisper-server:8002
      - PHI35_ENDPOINT=http://phi35-server:8003
      - DATABASE_URL=postgresql://user:pass@postgres:5432/mammothbox

  # BLIP-2 Image Captioning Service
  blip-server:
    build: ./services/blip
    ports:
      - "8001:8001"
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
        limits:
          memory: 4G

  # Whisper Audio Transcription Service
  whisper-server:
    build: ./services/whisper
    ports:
      - "8002:8002"
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
        limits:
          memory: 4G

  # Phi-3.5-mini Text Reasoning Service
  phi35-server:
    image: vllm/vllm-openai:latest
    ports:
      - "8003:8003"
    environment:
      - MODEL_NAME=microsoft/Phi-3.5-mini-instruct
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    command:
      - --model
      - microsoft/Phi-3.5-mini-instruct
      - --max-model-len
      - "131072"
      - --port
      - "8003"
      - --dtype
      - auto
      - --gpu-memory-utilization
      - "0.4"

  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: mammothbox
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
    volumes:
      - postgres_data:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

volumes:
  postgres_data:
  model_cache:
```

📋 Migration Roadmap
See individual phase issues for detailed tasks:
- Phase 1: Setup Multi-Model Infrastructure (BLIP-2 + Whisper + Phi-3.5) #43
- Phase 2: Migrate Image Analysis to BLIP-2 + Phi-3.5 #44
- Phase 3: Implement Audio Processing (Whisper + Phi-3.5) #45
- Phase 4: Intelligent File Organization with Phi-3.5 (128K Context) #46
✅ Success Criteria
- ✅ No external API calls required
- ✅ All models run locally on 6GB VRAM
- ✅ Image analysis accuracy >90% of Gemini baseline
- ✅ Audio transcription accuracy >95% for clear speech
- ✅ Inference time <5s per media file
- ✅ Total memory footprint ≤6GB VRAM
- ✅ File organization suggestions >80% user acceptance
- ✅ All existing tests passing
- ✅ 128K context enables whole-collection analysis
🎯 Key Advantages
- Hardware Optimized: Precisely fits 6GB VRAM budget
- Modular Architecture: Independent microservices, easy to debug/scale
- Best-in-Class Models: Each model is optimized for its specific task
- Future-Proof: Can upgrade individual models without system rewrite
- No External Dependencies: Fully local, private, no API costs
- Production Ready: All models battle-tested and widely adopted
- Exceptional Context: Phi-3.5's 128K tokens enables sophisticated reasoning
- Keep What Works: CLIP embeddings/clustering unchanged!
📚 References
- BLIP-2: https://github.com/salesforce/LAVIS/tree/main/projects/blip2
- CLIP: https://github.com/openai/CLIP
- Whisper: https://github.com/openai/whisper
- Phi-3.5: https://huggingface.co/microsoft/Phi-3.5-mini-instruct
- vLLM: https://docs.vllm.ai/
- Sentence Transformers: https://www.sbert.net/