This repository was archived by the owner on Nov 15, 2025. It is now read-only.
Phase 2: Migrate Image Analysis to BLIP-2 + Phi-3.5 #44
Overview
Replace Gemini API in image analysis with local BLIP-2 + Phi-3.5-mini pipeline while keeping existing CLIP embeddings.
Parent Issue: #36
Depends On: #43 (Phase 1 infrastructure)
Estimated Time: 1 week
Current State
- `src/media/vlm_analyzer.py` uses the `google-generativeai` SDK
- Individual image analysis via Gemini 2.5 Flash API
- Cluster labeling with multiple images
- CLIP embeddings for clustering (via `embedder.py`)
Migration Strategy
New Pipeline
Image → BLIP-2 Caption → Phi-3.5 Metadata Extraction → VLMMetadata
  ↓
CLIP Embedding → Clustering (unchanged)
Implementation Tasks
Code Refactoring
- Refactor `VLMAnalyzer.__init__()`:
  - Remove `_init_gemini()` method
  - Add BLIP and Phi-3.5 client initialization
  - Keep CLIP embedder reference
- Update `analyze_individual_image()`:
  - Call BLIP-2 for rich caption
  - Call Phi-3.5 to extract structured metadata
  - Keep color palette extraction logic
  - Return `VLMMetadata` with new model attribution
- Update `label_cluster()`:
  - Get BLIP captions for representative images
  - Use Phi-3.5 to analyze commonalities
  - Return `ClusterMetadata`
- Remove all `google-generativeai` imports
- Remove fallback CLIP classification (no longer needed)
Phi-3.5 Prompt Engineering
- Design prompt for metadata extraction from BLIP captions
- Implement structured JSON output parsing
- Add retry logic for invalid JSON responses
- Test prompt variations for accuracy
Example Prompt:

```python
METADATA_EXTRACTION_PROMPT = """Analyze this image caption and extract structured metadata.

Caption: "{caption}"

Extract and return ONLY valid JSON:
{{
  "primary_category": "<main subject: animals|people|vehicles|nature|food|architecture|objects|abstract>",
  "tags": ["<5-15 descriptive tags covering subjects, activities, attributes, composition>"],
  "detected_objects": [
    {{"name": "<object>", "confidence": <0.0-1.0>}}
  ],
  "scene_type": "<portrait|landscape|still_life|action|abstract|indoor|outdoor>",
  "suggested_cluster_name": "<2-4 words for grouping similar images>"
}}

Be specific and descriptive. Use diverse tags. Estimate confidence based on caption clarity."""
```

Integration with BLIP-2 Service
- Implement `_get_blip_caption(image_path: str) -> str`
- Add error handling for when the BLIP service is unavailable
- Add image preprocessing if needed
- Handle batch requests for clusters
Integration with Phi-3.5 Service
- Implement `_extract_metadata_from_caption(caption: str) -> dict`
- Add JSON validation and parsing
- Handle malformed responses gracefully
- Implement fallback for parsing failures
Keep Existing CLIP Integration
- Ensure clustering logic remains unchanged
- Validate embedding quality matches previous results
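One lightweight way to validate embedding quality: compare old and new CLIP vectors pairwise by cosine similarity over a sample set. A sketch in plain Python; the 0.999 threshold is an assumption (embeddings should be bit-identical if the CLIP path is truly untouched):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

def embeddings_match(old, new, threshold=0.999):
    """True if every new CLIP embedding points the same way as its old counterpart."""
    return all(cosine(o, n) >= threshold for o, n in zip(old, new))
```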
Configuration Updates
- Remove `gemini_api_key` from `settings.py`
- Remove `gemini_model` configuration
- Add `blip_endpoint` and `phi35_endpoint` settings
- Update `.env.example` with new variables
- Update documentation
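A hedged sketch of the new settings; the environment-variable names and default ports are assumptions, not decisions made in this issue:

```python
import os
from dataclasses import dataclass

@dataclass
class ModelSettings:
    # New local-model endpoints replacing gemini_api_key / gemini_model.
    blip_endpoint: str = os.getenv("BLIP_ENDPOINT", "http://localhost:8001")
    phi35_endpoint: str = os.getenv("PHI35_ENDPOINT", "http://localhost:8002")
```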
Testing
- Update `tests/unit/test_vlm_analyzer.py`:
  - Remove Gemini API mocks
  - Add BLIP/Phi-3.5 endpoint mocks
  - Test caption generation flow
  - Test metadata extraction accuracy
  - Test cluster labeling with multiple images
  - Test error handling (service unavailable)
- Add integration tests with real services
- Benchmark accuracy vs. Gemini baseline (target: >90%)
- Performance testing (target: <5s per image)
- Test with diverse image types:
  - Portraits, landscapes, objects
  - Indoor/outdoor scenes
  - Abstract images
  - Low-quality images
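A sketch of how the BLIP/Phi-3.5 endpoint mocks might look using `unittest.mock.AsyncMock`; the client method names follow the example implementation in this issue, and the canned caption/metadata values are illustrative:

```python
import asyncio
import json
from unittest.mock import AsyncMock

async def demo_mocked_pipeline():
    """Drive the caption → metadata flow with mocked BLIP/Phi-3.5 clients."""
    blip = AsyncMock()
    blip.generate_caption.return_value = "a golden retriever running on a beach"

    phi35 = AsyncMock()
    phi35.generate.return_value = json.dumps({
        "primary_category": "animals",
        "tags": ["dog", "beach", "running"],
        "detected_objects": [{"name": "dog", "confidence": 0.9}],
        "scene_type": "outdoor",
        "suggested_cluster_name": "dogs at the beach",
    })

    caption = await blip.generate_caption("photo.jpg")
    metadata = json.loads(await phi35.generate(prompt=f"Caption: {caption}"))
    return caption, metadata

caption, metadata = asyncio.run(demo_mocked_pipeline())
```

In the real test suite these mocks would be injected into `VLMAnalyzer` in place of the live clients, so no service needs to be running for unit tests.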
Quality Validation
- Run analysis on test dataset (100+ images)
- Compare metadata quality with Gemini results
- Validate tag diversity and relevance
- Check cluster name suggestions for clarity
- User acceptance testing
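For comparing metadata quality against the Gemini baseline, a simple tag-overlap metric is one option. A sketch; what Jaccard score counts as "matching" is left open here:

```python
def tag_jaccard(old_tags, new_tags):
    """Jaccard overlap between baseline (Gemini) and new (BLIP+Phi-3.5) tag sets."""
    a = {t.lower() for t in old_tags}
    b = {t.lower() for t in new_tags}
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```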
Documentation
- Update `docs/technical_specification.md`:
  - Replace Gemini architecture with BLIP+Phi-3.5
  - Document new pipeline flow
  - Update performance characteristics
- Update `docs/CONFIGURATION.md`:
  - Remove Gemini API setup instructions
  - Add BLIP and Phi-3.5 configuration guide
  - Document prompt templates
Cleanup
- Remove `google-generativeai` from `requirements.txt`
- Remove unused Gemini-related code
- Update imports across codebase
- Clean up environment variable references
Example Implementation
```python
# src/media/vlm_analyzer.py (refactored)
import json
import logging
import time
from typing import List

from src.models.model_client import BLIPClient, Phi35Client
# get_settings, MediaEmbedder, VLMMetadata, and ClusterMetadata come from
# the existing codebase (settings.py, embedder.py, and the metadata models).

logger = logging.getLogger(__name__)


class VLMAnalyzer:
    def __init__(self):
        self.settings = get_settings()
        self.blip_client = BLIPClient(self.settings.blip_endpoint)
        self.phi35_client = Phi35Client(self.settings.phi35_endpoint)
        self.embedder = MediaEmbedder()  # Keep CLIP!

    async def analyze_individual_image(
        self,
        image_path: str
    ) -> VLMMetadata:
        """Analyze image using the BLIP-2 + Phi-3.5 pipeline."""
        start_time = time.time()

        # Stage 1: Get BLIP-2 caption
        caption = await self.blip_client.generate_caption(image_path)
        logger.info(f"BLIP caption: {caption}")

        # Stage 2: Extract structured metadata with Phi-3.5
        prompt = self.METADATA_EXTRACTION_PROMPT.format(caption=caption)
        metadata_json = await self.phi35_client.generate(
            prompt=prompt,
            max_tokens=500,
            temperature=0.3,  # Low temperature for consistency
            response_format={"type": "json_object"}
        )
        metadata = json.loads(metadata_json)

        # Stage 3: Extract color palette (existing logic)
        colors = self._extract_colors(image_path)

        processing_time = (time.time() - start_time) * 1000

        return VLMMetadata(
            primary_category=metadata['primary_category'],
            tags=metadata['tags'],
            description=caption,
            detected_objects=metadata['detected_objects'],
            scene_type=metadata['scene_type'],
            color_palette=colors,
            suggested_cluster_name=metadata['suggested_cluster_name'],
            vlm_model="blip2+phi3.5",
            fallback_used=False,
            processing_time_ms=processing_time
        )

    async def label_cluster(
        self,
        representative_images: List[str],
        total_images: int
    ) -> ClusterMetadata:
        """Label a cluster using multi-image analysis."""
        # Get BLIP captions for representative images
        captions = []
        for img_path in representative_images[:5]:
            caption = await self.blip_client.generate_caption(img_path)
            captions.append(caption)

        # Use Phi-3.5 to find commonalities
        prompt = f"""Analyze these {total_images} similar images from {len(captions)} representatives:

{chr(10).join([f'{i+1}. {cap}' for i, cap in enumerate(captions)])}

Provide concise cluster analysis as JSON:
{{
  "cluster_name": "<2-4 words describing this group>",
  "description": "<what these images have in common>",
  "tags": ["<5-10 relevant tags>"],
  "primary_category": "<main category>"
}}"""

        result = await self.phi35_client.generate(
            prompt=prompt,
            max_tokens=300,
            response_format={"type": "json_object"}
        )
        cluster_info = json.loads(result)

        return ClusterMetadata(
            cluster_name=cluster_info['cluster_name'],
            description=cluster_info['description'],
            tags=cluster_info['tags'],
            primary_category=cluster_info['primary_category'],
            vlm_model="blip2+phi3.5",
            images_analyzed=len(captions),
            total_images=total_images
        )
```

Acceptance Criteria
- ✅ All Gemini API calls removed
- ✅ Image analysis works with BLIP-2 + Phi-3.5
- ✅ Cluster labeling maintains quality
- ✅ CLIP embeddings/clustering unchanged
- ✅ All tests passing (20/20 from current suite)
- ✅ Processing time: <5s per image end-to-end
- ✅ Metadata accuracy: >90% vs Gemini baseline
- ✅ No external API dependencies
- ✅ Structured JSON output consistent
- ✅ Documentation updated
Rollback Plan
If issues arise:
- Keep Gemini code in separate branch
- Feature flag to toggle between old/new implementation
- A/B test with subset of users
- Monitor quality metrics before full rollout
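The feature-flag toggle might be sketched as follows; the `USE_LOCAL_VLM` variable name and the analyzer names returned here are illustrative, not from the codebase:

```python
import os

def use_local_vlm() -> bool:
    """Feature flag: route analysis through BLIP-2 + Phi-3.5 unless rolled back."""
    return os.getenv("USE_LOCAL_VLM", "true").lower() in ("1", "true", "yes")

def get_analyzer_name() -> str:
    # Flipping USE_LOCAL_VLM=false would restore the Gemini path kept on its branch.
    return "VLMAnalyzer (BLIP-2 + Phi-3.5)" if use_local_vlm() else "GeminiVLMAnalyzer (legacy)"
```

An environment-driven flag keeps the rollback a config change rather than a deploy, which pairs well with the A/B testing item above.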
Resources
- BLIP-2 Paper: https://arxiv.org/abs/2301.12597
- Phi-3.5 Technical Report: https://arxiv.org/abs/2404.14219
- Prompt Engineering Guide: https://www.promptingguide.ai/