Description
Scope check
- This is core LLM communication (not application logic)
- This benefits most users (not just my use case)
- This can't be solved in application code with current RubyLLM
- I read the Contributing Guide
Due diligence
- I searched existing issues
- I checked the documentation
What problem does this solve?
Currently, RubyLLM only supports text-based embeddings across all providers. However, Google's Vertex AI offers multimodal embedding capabilities through its multimodalembedding model, which can generate embeddings for:
- Images (for visual search, similarity matching)
- Videos (for video content analysis)
- Combined text + image/video (for rich semantic understanding)
Users who need to:
- Build visual search systems
- Compare image/video similarity
- Create multimodal RAG (Retrieval-Augmented Generation) systems
- Generate embeddings for mixed media content
...are currently unable to leverage these capabilities through RubyLLM.
Proposed solution
Extend the existing Embedding.embed API to accept optional image and video parameters:
# Text + Image
RubyLLM.embed(
  "A red sports car",
  image: File.read('car.jpg'),
  model: 'multimodalembedding',
  provider: :vertexai
)

# Video with GCS URI
RubyLLM.embed(
  "Product demo video",
  video: 'gs://my-bucket/demo.mp4',
  model: 'multimodalembedding',
  provider: :vertexai
)

# Image-only (no text required)
RubyLLM.embed(
  image: image_data,
  model: 'multimodalembedding',
  provider: :vertexai
)

# Text only
RubyLLM.embed(
  "A blue sports car",
  model: 'multimodalembedding',
  provider: :vertexai
)

Implementation approach:
- Add image: and video: parameters to the Provider#embed method
- Implement multimodal payload rendering in the VertexAI::Embeddings module (see the sketch after this list):
  - Support base64-encoded image data
  - Support video as base64 or GCS URIs (gs://...)
  - Handle optional text for pure image/video embeddings
- Standardize the render_embedding_payload signature across all providers
- Return structured embeddings: { text: [...], image: [...], video: [...] }
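To make the approach concrete, here is a minimal sketch of what the VertexAI hooks could look like. It reuses the render_embedding_payload name mentioned above; the parse_embedding_response and media_part names, the response-object shape, and the exact Vertex AI field names (instances with text, image.bytesBase64Encoded / image.gcsUri, video.gcsUri; predictions with textEmbedding, imageEmbedding, videoEmbeddings) are assumptions to be checked against the real provider code and API docs, not the final implementation:

require 'base64'

# Sketch only: hook names and response shape are assumptions.
module RubyLLM
  module Providers
    module VertexAI
      module Embeddings
        # Build the :predict request body for the multimodalembedding model.
        # Text is optional when an image or video is supplied.
        def render_embedding_payload(text, image: nil, video: nil)
          instance = {}
          instance[:text] = text if text
          instance[:image] = media_part(image) if image
          instance[:video] = media_part(video) if video
          { instances: [instance] }
        end

        # Vertex AI returns textEmbedding / imageEmbedding / videoEmbeddings
        # per prediction; expose them as the structured hash described above.
        def parse_embedding_response(response)
          prediction = response.body['predictions'].first
          {
            text: prediction['textEmbedding'],
            image: prediction['imageEmbedding'],
            video: prediction['videoEmbeddings']&.map { |seg| seg['embedding'] }
          }.compact
        end

        private

        # GCS URIs pass through untouched; raw bytes are base64 encoded.
        def media_part(source)
          if source.is_a?(String) && source.start_with?('gs://')
            { gcsUri: source }
          else
            { bytesBase64Encoded: Base64.strict_encode64(source) }
          end
        end
      end
    end
  end
end

With hooks along these lines, the examples above would work unchanged, and existing text-only calls keep their current behaviour because image and video simply default to nil.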
Why this belongs in RubyLLM
- Feature parity: Vertex AI already supports this; RubyLLM should expose it.
- Unified API: users expect to reach every provider feature through one interface.
- Real demand: visual search, RAG with images, and content moderation all need this.
I have a working implementation ready to submit as a PR if there's interest!
Thanks for the review 😊