This repository was archived by the owner on Nov 15, 2025. It is now read-only.
Phase 2: Migrate Image Analysis to BLIP-2 + Phi-3.5 #44
Overview
Replace Gemini API in image analysis with local BLIP-2 + Phi-3.5-mini pipeline while keeping existing CLIP embeddings.
Parent Issue: #36
Depends On: #43 (Phase 1 infrastructure)
Estimated Time: 1 week
Current State
- `src/media/vlm_analyzer.py` uses the `google-generativeai` SDK
- Individual image analysis via Gemini 2.5 Flash API
- Cluster labeling with multiple images
- CLIP embeddings for clustering (via `embedder.py`)
Migration Strategy
New Pipeline
Image → BLIP-2 Caption → Phi-3.5 Metadata Extraction → VLMMetadata
  ↓
CLIP Embedding → Clustering (unchanged)
Implementation Tasks
Code Refactoring
- Refactor `VLMAnalyzer.__init__()`:
  - Remove `_init_gemini()` method
  - Add BLIP and Phi-3.5 client initialization
  - Keep CLIP embedder reference
- Update `analyze_individual_image()`:
  - Call BLIP-2 for rich caption
  - Call Phi-3.5 to extract structured metadata
  - Keep color palette extraction logic
  - Return `VLMMetadata` with new model attribution
- Update `label_cluster()`:
  - Get BLIP captions for representative images
  - Use Phi-3.5 to analyze commonalities
  - Return `ClusterMetadata`
- Remove all `google-generativeai` imports
- Remove fallback CLIP classification (no longer needed)
Phi-3.5 Prompt Engineering
- Design prompt for metadata extraction from BLIP captions
- Implement structured JSON output parsing
- Add retry logic for invalid JSON responses
- Test prompt variations for accuracy
Example Prompt:

```python
METADATA_EXTRACTION_PROMPT = """Analyze this image caption and extract structured metadata.

Caption: "{caption}"

Extract and return ONLY valid JSON:
{{
  "primary_category": "<main subject: animals|people|vehicles|nature|food|architecture|objects|abstract>",
  "tags": ["<5-15 descriptive tags covering subjects, activities, attributes, composition>"],
  "detected_objects": [
    {{"name": "<object>", "confidence": <0.0-1.0>}}
  ],
  "scene_type": "<portrait|landscape|still_life|action|abstract|indoor|outdoor>",
  "suggested_cluster_name": "<2-4 words for grouping similar images>"
}}

Be specific and descriptive. Use diverse tags. Estimate confidence based on caption clarity."""
```

Integration with BLIP-2 Service
- Implement `_get_blip_caption(image_path: str) -> str`
- Add error handling for when the BLIP service is unavailable
- Add image preprocessing if needed
- Handle batch requests for clusters
Integration with Phi-3.5 Service
- Implement `_extract_metadata_from_caption(caption: str) -> dict`
- Add JSON validation and parsing
- Handle malformed responses gracefully
- Implement fallback for parsing failures
Keep Existing CLIP Integration
- Ensure clustering logic remains unchanged
- Validate embedding quality matches previous results
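One lightweight way to validate embedding quality: compare old and new CLIP vectors pairwise by cosine similarity over a sample set. A sketch in plain Python; the 0.999 threshold is an assumption (embeddings should be bit-identical if the CLIP path is truly untouched):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

def embeddings_match(old, new, threshold=0.999):
    """True if every new CLIP embedding points the same way as its old counterpart."""
    return all(cosine(o, n) >= threshold for o, n in zip(old, new))
```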
Configuration Updates
- Remove `gemini_api_key` from `settings.py`
- Remove `gemini_model` configuration
- Add `blip_endpoint` and `phi35_endpoint` settings
- Update `.env.example` with new variables
- Update documentation
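A hedged sketch of the new settings; the environment-variable names and default ports are assumptions, not decisions made in this issue:

```python
import os
from dataclasses import dataclass

@dataclass
class ModelSettings:
    # New local-model endpoints replacing gemini_api_key / gemini_model.
    blip_endpoint: str = os.getenv("BLIP_ENDPOINT", "http://localhost:8001")
    phi35_endpoint: str = os.getenv("PHI35_ENDPOINT", "http://localhost:8002")
```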
Testing
- Update `tests/unit/test_vlm_analyzer.py`:
  - Remove Gemini API mocks
  - Add BLIP/Phi-3.5 endpoint mocks
  - Test caption generation flow
  - Test metadata extraction accuracy
  - Test cluster labeling with multiple images
  - Test error handling (service unavailable)
- Add integration tests with real services
- Benchmark accuracy vs. Gemini baseline (target: >90%)
- Performance testing (target: <5s per image)
- Test with diverse image types:
  - Portraits, landscapes, objects
  - Indoor/outdoor scenes
  - Abstract images
  - Low-quality images
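A sketch of how the BLIP/Phi-3.5 endpoint mocks might look using `unittest.mock.AsyncMock`; the client method names follow the example implementation in this issue, and the canned caption/metadata values are illustrative:

```python
import asyncio
import json
from unittest.mock import AsyncMock

async def demo_mocked_pipeline():
    """Drive the caption → metadata flow with mocked BLIP/Phi-3.5 clients."""
    blip = AsyncMock()
    blip.generate_caption.return_value = "a golden retriever running on a beach"

    phi35 = AsyncMock()
    phi35.generate.return_value = json.dumps({
        "primary_category": "animals",
        "tags": ["dog", "beach", "running"],
        "detected_objects": [{"name": "dog", "confidence": 0.9}],
        "scene_type": "outdoor",
        "suggested_cluster_name": "dogs at the beach",
    })

    caption = await blip.generate_caption("photo.jpg")
    metadata = json.loads(await phi35.generate(prompt=f"Caption: {caption}"))
    return caption, metadata

caption, metadata = asyncio.run(demo_mocked_pipeline())
```

In the real test suite these mocks would be injected into `VLMAnalyzer` in place of the live clients, so no service needs to be running for unit tests.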
Quality Validation
- Run analysis on test dataset (100+ images)
- Compare metadata quality with Gemini results
- Validate tag diversity and relevance
- Check cluster name suggestions for clarity
- User acceptance testing
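For comparing metadata quality against the Gemini baseline, a simple tag-overlap metric is one option. A sketch; what Jaccard score counts as "matching" is left open here:

```python
def tag_jaccard(old_tags, new_tags):
    """Jaccard overlap between baseline (Gemini) and new (BLIP+Phi-3.5) tag sets."""
    a = {t.lower() for t in old_tags}
    b = {t.lower() for t in new_tags}
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```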
Documentation
- Update `docs/technical_specification.md`:
  - Replace Gemini architecture with BLIP+Phi-3.5
  - Document new pipeline flow
  - Update performance characteristics
- Update `docs/CONFIGURATION.md`:
  - Remove Gemini API setup instructions
  - Add BLIP and Phi-3.5 configuration guide
  - Document prompt templates
Cleanup
- Remove `google-generativeai` from `requirements.txt`
- Remove unused Gemini-related code
- Update imports across codebase
- Clean up environment variable references
Example Implementation
```python
# src/media/vlm_analyzer.py (refactored)
import json
import logging
import time
from typing import List

from src.models.model_client import BLIPClient, Phi35Client
# get_settings, MediaEmbedder, VLMMetadata, and ClusterMetadata come from
# the existing codebase (settings.py, embedder.py, and the metadata models).

logger = logging.getLogger(__name__)


class VLMAnalyzer:
    def __init__(self):
        self.settings = get_settings()
        self.blip_client = BLIPClient(self.settings.blip_endpoint)
        self.phi35_client = Phi35Client(self.settings.phi35_endpoint)
        self.embedder = MediaEmbedder()  # Keep CLIP!

    async def analyze_individual_image(
        self,
        image_path: str
    ) -> VLMMetadata:
        """Analyze image using the BLIP-2 + Phi-3.5 pipeline."""
        start_time = time.time()

        # Stage 1: Get BLIP-2 caption
        caption = await self.blip_client.generate_caption(image_path)
        logger.info(f"BLIP caption: {caption}")

        # Stage 2: Extract structured metadata with Phi-3.5
        prompt = self.METADATA_EXTRACTION_PROMPT.format(caption=caption)
        metadata_json = await self.phi35_client.generate(
            prompt=prompt,
            max_tokens=500,
            temperature=0.3,  # Low temperature for consistency
            response_format={"type": "json_object"}
        )
        metadata = json.loads(metadata_json)

        # Stage 3: Extract color palette (existing logic)
        colors = self._extract_colors(image_path)

        processing_time = (time.time() - start_time) * 1000

        return VLMMetadata(
            primary_category=metadata['primary_category'],
            tags=metadata['tags'],
            description=caption,
            detected_objects=metadata['detected_objects'],
            scene_type=metadata['scene_type'],
            color_palette=colors,
            suggested_cluster_name=metadata['suggested_cluster_name'],
            vlm_model="blip2+phi3.5",
            fallback_used=False,
            processing_time_ms=processing_time
        )

    async def label_cluster(
        self,
        representative_images: List[str],
        total_images: int
    ) -> ClusterMetadata:
        """Label a cluster using multi-image analysis."""
        # Get BLIP captions for representative images
        captions = []
        for img_path in representative_images[:5]:
            caption = await self.blip_client.generate_caption(img_path)
            captions.append(caption)

        # Use Phi-3.5 to find commonalities
        prompt = f"""Analyze these {total_images} similar images from {len(captions)} representatives:

{chr(10).join([f'{i+1}. {cap}' for i, cap in enumerate(captions)])}

Provide concise cluster analysis as JSON:
{{
  "cluster_name": "<2-4 words describing this group>",
  "description": "<what these images have in common>",
  "tags": ["<5-10 relevant tags>"],
  "primary_category": "<main category>"
}}"""

        result = await self.phi35_client.generate(
            prompt=prompt,
            max_tokens=300,
            response_format={"type": "json_object"}
        )
        cluster_info = json.loads(result)

        return ClusterMetadata(
            cluster_name=cluster_info['cluster_name'],
            description=cluster_info['description'],
            tags=cluster_info['tags'],
            primary_category=cluster_info['primary_category'],
            vlm_model="blip2+phi3.5",
            images_analyzed=len(captions),
            total_images=total_images
        )
```

Acceptance Criteria
- ✅ All Gemini API calls removed
- ✅ Image analysis works with BLIP-2 + Phi-3.5
- ✅ Cluster labeling maintains quality
- ✅ CLIP embeddings/clustering unchanged
- ✅ All tests passing (20/20 from current suite)
- ✅ Processing time: <5s per image end-to-end
- ✅ Metadata accuracy: >90% vs Gemini baseline
- ✅ No external API dependencies
- ✅ Structured JSON output consistent
- ✅ Documentation updated
Rollback Plan
If issues arise:
- Keep Gemini code in separate branch
- Feature flag to toggle between old/new implementation
- A/B test with subset of users
- Monitor quality metrics before full rollout
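The feature-flag toggle might be sketched as follows; the `USE_LOCAL_VLM` variable name and the analyzer names returned here are illustrative, not from the codebase:

```python
import os

def use_local_vlm() -> bool:
    """Feature flag: route analysis through BLIP-2 + Phi-3.5 unless rolled back."""
    return os.getenv("USE_LOCAL_VLM", "true").lower() in ("1", "true", "yes")

def get_analyzer_name() -> str:
    # Flipping USE_LOCAL_VLM=false would restore the Gemini path kept on its branch.
    return "VLMAnalyzer (BLIP-2 + Phi-3.5)" if use_local_vlm() else "GeminiVLMAnalyzer (legacy)"
```

An environment-driven flag keeps the rollback a config change rather than a deploy, which pairs well with the A/B testing item above.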
Resources
- BLIP-2 Paper: https://arxiv.org/abs/2301.12597
- Phi-3.5 Technical Report: https://arxiv.org/abs/2404.14219
- Prompt Engineering Guide: https://www.promptingguide.ai/