This repository was archived by the owner on Nov 15, 2025. It is now read-only.

Phase 2: Migrate Image Analysis to BLIP-2 + Phi-3.5 #44

@thewildofficial

Description

Overview

Replace Gemini API in image analysis with local BLIP-2 + Phi-3.5-mini pipeline while keeping existing CLIP embeddings.

Parent Issue: #36
Depends On: #43 (Phase 1 infrastructure)
Estimated Time: 1 week


Current State

  • src/media/vlm_analyzer.py uses google-generativeai SDK
  • Individual image analysis via Gemini 2.5 Flash API
  • Cluster labeling with multiple images
  • CLIP embeddings for clustering (via embedder.py)

Migration Strategy

New Pipeline

Image ──→ BLIP-2 Caption ──→ Phi-3.5 Metadata Extraction ──→ VLMMetadata
  └─────→ CLIP Embedding (unchanged, feeds clustering)

Implementation Tasks

Code Refactoring

  • Refactor VLMAnalyzer.__init__():
    • Remove _init_gemini() method
    • Add BLIP and Phi-3.5 client initialization
    • Keep CLIP embedder reference
  • Update analyze_individual_image():
    • Call BLIP-2 for rich caption
    • Call Phi-3.5 to extract structured metadata
    • Keep color palette extraction logic
    • Return VLMMetadata with new model attribution
  • Update label_cluster():
    • Get BLIP captions for representative images
    • Use Phi-3.5 to analyze commonalities
    • Return ClusterMetadata
  • Remove all google-generativeai imports
  • Remove fallback CLIP classification (no longer needed)

Phi-3.5 Prompt Engineering

  • Design prompt for metadata extraction from BLIP captions
  • Implement structured JSON output parsing
  • Add retry logic for invalid JSON responses
  • Test prompt variations for accuracy

Example Prompt:

METADATA_EXTRACTION_PROMPT = """Analyze this image caption and extract structured metadata.

Caption: "{caption}"

Extract and return ONLY valid JSON:
{{
    "primary_category": "<main subject: animals|people|vehicles|nature|food|architecture|objects|abstract>",
    "tags": ["<5-15 descriptive tags covering subjects, activities, attributes, composition>"],
    "detected_objects": [
        {{"name": "<object>", "confidence": <0.0-1.0>}}
    ],
    "scene_type": "<portrait|landscape|still_life|action|abstract|indoor|outdoor>",
    "suggested_cluster_name": "<2-4 words for grouping similar images>"
}}

Be specific and descriptive. Use diverse tags. Estimate confidence based on caption clarity."""

Integration with BLIP-2 Service

  • Implement _get_blip_caption(image_path: str) -> str
  • Add error handling for BLIP service unavailable
  • Add image preprocessing if needed
  • Handle batch requests for clusters
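A minimal sketch of the retry/error-handling shape for `_get_blip_caption`, with the actual BLIP client call injected as `caption_fn`. The exception name, retry count, and backoff are illustrative assumptions, not the real client API:

```python
import time

class BLIPServiceUnavailable(Exception):
    """Raised when the BLIP endpoint cannot produce a caption after retries."""

def get_blip_caption(image_path, caption_fn, retries=3, backoff_s=0.0):
    """Call caption_fn(image_path), retrying transient failures with backoff."""
    last_err = None
    for attempt in range(retries):
        try:
            caption = caption_fn(image_path)
            if caption and caption.strip():
                return caption.strip()
            last_err = ValueError("empty caption")  # treat blank output as a failure
        except ConnectionError as err:
            last_err = err
        time.sleep(backoff_s * (2 ** attempt))  # exponential backoff between attempts
    raise BLIPServiceUnavailable(f"BLIP caption failed for {image_path}: {last_err}")
```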

Integration with Phi-3.5 Service

  • Implement _extract_metadata_from_caption(caption: str) -> dict
  • Add JSON validation and parsing
  • Handle malformed responses gracefully
  • Implement fallback for parsing failures
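One way to sketch the graceful-parsing bullets above. The required keys come from the prompt template; the fallback defaults and fence-stripping heuristic are assumptions:

```python
import json

REQUIRED_KEYS = {"primary_category", "tags", "detected_objects",
                 "scene_type", "suggested_cluster_name"}

def parse_metadata_response(raw: str) -> dict:
    """Parse a Phi-3.5 response into a metadata dict, falling back to a
    minimal structure when the JSON is malformed or incomplete."""
    fallback = {
        "primary_category": "objects",
        "tags": [],
        "detected_objects": [],
        "scene_type": "abstract",
        "suggested_cluster_name": "uncategorized",
    }
    try:
        text = raw.strip()
        # Models sometimes wrap JSON in markdown fences; strip them first.
        if text.startswith("```"):
            text = text.strip("`").lstrip("json").strip()
        data = json.loads(text)
    except (json.JSONDecodeError, AttributeError):
        return fallback
    if not isinstance(data, dict):
        return fallback
    if not REQUIRED_KEYS.issubset(data):
        # Keep whatever valid fields came back, fill the rest from defaults.
        return {**fallback, **{k: v for k, v in data.items() if k in REQUIRED_KEYS}}
    return data
```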

Keep Existing CLIP Integration

  • Ensure clustering logic remains unchanged
  • Validate embedding quality matches previous results
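The embedding-quality check could be sketched as a cosine-similarity comparison against stored baseline vectors. Since CLIP itself is untouched, drift should be near zero; the threshold and data layout here are illustrative:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def embeddings_match(baseline, current, threshold=0.999):
    """True when every image's new embedding is near-identical to its
    baseline vector (keyed by image path in both dicts)."""
    return all(
        cosine_similarity(baseline[k], current[k]) >= threshold
        for k in baseline
    )
```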

Configuration Updates

  • Remove gemini_api_key from settings.py
  • Remove gemini_model configuration
  • Add blip_endpoint and phi35_endpoint settings
  • Update .env.example with new variables
  • Update documentation
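A sketch of what the settings change might look like. The real settings.py may use pydantic `BaseSettings` instead of a dataclass, and the env var names and default ports are assumptions:

```python
import os
from dataclasses import dataclass, field

@dataclass
class Settings:
    # Removed: gemini_api_key, gemini_model
    blip_endpoint: str = field(
        default_factory=lambda: os.getenv("BLIP_ENDPOINT", "http://localhost:8001"))
    phi35_endpoint: str = field(
        default_factory=lambda: os.getenv("PHI35_ENDPOINT", "http://localhost:8002"))
```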

Testing

  • Update tests/unit/test_vlm_analyzer.py:
    • Remove Gemini API mocks
    • Add BLIP/Phi-3.5 endpoint mocks
    • Test caption generation flow
    • Test metadata extraction accuracy
    • Test cluster labeling with multiple images
    • Test error handling (service unavailable)
  • Add integration tests with real services
  • Benchmark accuracy vs. Gemini baseline (target: >90%)
  • Performance testing (target: <5s per image)
  • Test with diverse image types:
    • Portraits, landscapes, objects
    • Indoor/outdoor scenes
    • Abstract images
    • Low-quality images
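The mock setup above might look like this. A stub `analyze` coroutine stands in for the real `VLMAnalyzer.analyze_individual_image` so the pattern is self-contained; only the two-stage call order is taken from the pipeline design:

```python
import asyncio
from unittest.mock import AsyncMock

async def analyze(blip_client, phi35_client, image_path):
    """Stand-in for analyze_individual_image: caption first, then extract."""
    caption = await blip_client.generate_caption(image_path)
    metadata_json = await phi35_client.generate(prompt=caption)
    return caption, metadata_json

def test_caption_flow():
    blip = AsyncMock()
    blip.generate_caption.return_value = "a red bicycle leaning on a wall"
    phi = AsyncMock()
    phi.generate.return_value = '{"primary_category": "objects"}'

    caption, meta = asyncio.run(analyze(blip, phi, "bike.jpg"))

    assert caption == "a red bicycle leaning on a wall"
    blip.generate_caption.assert_called_once_with("bike.jpg")
    phi.generate.assert_called_once()  # caption was fed into Phi-3.5
```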

Quality Validation

  • Run analysis on test dataset (100+ images)
  • Compare metadata quality with Gemini results
  • Validate tag diversity and relevance
  • Check cluster name suggestions for clarity
  • User acceptance testing

Documentation

  • Update docs/technical_specification.md:
    • Replace Gemini architecture with BLIP+Phi-3.5
    • Document new pipeline flow
    • Update performance characteristics
  • Update docs/CONFIGURATION.md
  • Remove Gemini API setup instructions
  • Add BLIP and Phi-3.5 configuration guide
  • Document prompt templates

Cleanup

  • Remove google-generativeai from requirements.txt
  • Remove unused Gemini-related code
  • Update imports across codebase
  • Clean up environment variable references

Example Implementation

# src/media/vlm_analyzer.py (refactored)

import json
import logging
import time
from typing import List

from src.models.model_client import BLIPClient, Phi35Client
from src.config.settings import get_settings          # import path illustrative
from src.media.embedder import MediaEmbedder           # import path illustrative
from src.media.models import VLMMetadata, ClusterMetadata  # import path illustrative

logger = logging.getLogger(__name__)

class VLMAnalyzer:
    def __init__(self):
        self.settings = get_settings()
        self.blip_client = BLIPClient(self.settings.blip_endpoint)
        self.phi35_client = Phi35Client(self.settings.phi35_endpoint)
        self.embedder = MediaEmbedder()  # Keep CLIP!
    
    async def analyze_individual_image(
        self, 
        image_path: str
    ) -> VLMMetadata:
        """Analyze image using BLIP-2 + Phi-3.5 pipeline."""
        start_time = time.time()
        
        # Stage 1: Get BLIP-2 caption
        caption = await self.blip_client.generate_caption(image_path)
        logger.info(f"BLIP caption: {caption}")
        
        # Stage 2: Extract structured metadata with Phi-3.5
        prompt = self.METADATA_EXTRACTION_PROMPT.format(caption=caption)
        metadata_json = await self.phi35_client.generate(
            prompt=prompt,
            max_tokens=500,
            temperature=0.3,  # Low temp for consistency
            response_format={"type": "json_object"}
        )
        metadata = json.loads(metadata_json)
        
        # Stage 3: Extract color palette (existing logic)
        colors = self._extract_colors(image_path)
        
        processing_time = (time.time() - start_time) * 1000
        
        return VLMMetadata(
            primary_category=metadata['primary_category'],
            tags=metadata['tags'],
            description=caption,
            detected_objects=metadata['detected_objects'],
            scene_type=metadata['scene_type'],
            color_palette=colors,
            suggested_cluster_name=metadata['suggested_cluster_name'],
            vlm_model="blip2+phi3.5",
            fallback_used=False,
            processing_time_ms=processing_time
        )
    
    async def label_cluster(
        self,
        representative_images: List[str],
        total_images: int
    ) -> ClusterMetadata:
        """Label cluster using multi-image analysis."""
        
        # Get BLIP captions for representative images
        captions = []
        for img_path in representative_images[:5]:
            caption = await self.blip_client.generate_caption(img_path)
            captions.append(caption)
        
        # Use Phi-3.5 to find commonalities
        prompt = f"""Analyze this cluster of {total_images} similar images, based on captions from {len(captions)} representatives:

{chr(10).join([f'{i+1}. {cap}' for i, cap in enumerate(captions)])}

Provide concise cluster analysis as JSON:
{{
    "cluster_name": "<2-4 words describing this group>",
    "description": "<what these images have in common>",
    "tags": ["<5-10 relevant tags>"],
    "primary_category": "<main category>"
}}"""
        
        result = await self.phi35_client.generate(
            prompt=prompt,
            max_tokens=300,
            response_format={"type": "json_object"}
        )
        
        cluster_info = json.loads(result)
        
        return ClusterMetadata(
            cluster_name=cluster_info['cluster_name'],
            description=cluster_info['description'],
            tags=cluster_info['tags'],
            primary_category=cluster_info['primary_category'],
            vlm_model="blip2+phi3.5",
            images_analyzed=len(captions),
            total_images=total_images
        )

Acceptance Criteria

  • ✅ All Gemini API calls removed
  • ✅ Image analysis works with BLIP-2 + Phi-3.5
  • ✅ Cluster labeling maintains quality
  • ✅ CLIP embeddings/clustering unchanged
  • ✅ All tests passing (20/20 from current suite)
  • ✅ Processing time: <5s per image end-to-end
  • ✅ Metadata accuracy: >90% vs Gemini baseline
  • ✅ No external API dependencies
  • ✅ Structured JSON output consistent
  • ✅ Documentation updated

Rollback Plan

If issues arise:

  1. Keep Gemini code in separate branch
  2. Feature flag to toggle between old/new implementation
  3. A/B test with subset of users
  4. Monitor quality metrics before full rollover
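Step 2's feature flag could be as simple as an environment toggle; the variable name and helper functions here are hypothetical, with the analyzer classes injected so nothing project-internal is assumed:

```python
import os

def use_local_pipeline() -> bool:
    """Toggle between the new BLIP+Phi-3.5 path and the legacy Gemini branch."""
    return os.getenv("USE_LOCAL_VLM", "true").lower() in ("1", "true", "yes")

def get_analyzer(local_cls, gemini_cls):
    """Pick the analyzer implementation at startup based on the flag."""
    return local_cls() if use_local_pipeline() else gemini_cls()
```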
