[FEATURE] Add multimodal embedding support (image and video) #529

@Ndunge-Makau

Scope check

  • This is core LLM communication (not application logic)
  • This benefits most users (not just my use case)
  • This can't be solved in application code with current RubyLLM
  • I read the Contributing Guide

Due diligence

  • I searched existing issues
  • I checked the documentation

What problem does this solve?

RubyLLM currently supports only text embeddings, across all providers. Google's Vertex AI, however, offers multimodal embeddings through its multimodalembedding model, which can generate embeddings for:

  • Images (for visual search, similarity matching)
  • Videos (for video content analysis)
  • Combined text + image/video (for rich semantic understanding)

As a result, RubyLLM users currently cannot:

  • Build visual search systems
  • Compare image/video similarity
  • Create multimodal RAG (Retrieval-Augmented Generation) systems
  • Generate embeddings for mixed media content

Proposed solution

Extend the existing Embedding.embed API to accept optional image and video parameters:

# Text + Image
RubyLLM.embed(
  "A red sports car",
  image: File.binread('car.jpg'), # binread avoids text-mode/encoding issues with binary data
  model: 'multimodalembedding',
  provider: :vertexai
)

# Video with GCS URI
RubyLLM.embed(
  "Product demo video",
  video: 'gs://my-bucket/demo.mp4',
  model: 'multimodalembedding',
  provider: :vertexai
)

# Image-only (no text required)
RubyLLM.embed(
  image: image_data,
  model: 'multimodalembedding',
  provider: :vertexai
)

# Text only
RubyLLM.embed(
  "A blue sports car",
  model: 'multimodalembedding',
  provider: :vertexai
)
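
The four call shapes above imply that text becomes optional as long as some input is supplied. A minimal sketch of how that argument handling might look is below; the module name MultimodalEmbedArgs and its normalize method are hypothetical and are not part of RubyLLM.

```ruby
# Hypothetical sketch: illustrates the optional-input rule only,
# it is not RubyLLM's actual implementation.
module MultimodalEmbedArgs
  # Returns a hash containing only the inputs that were supplied.
  # Raises ArgumentError when no input is given at all.
  def self.normalize(text = nil, image: nil, video: nil)
    inputs = { text: text, image: image, video: video }.compact
    raise ArgumentError, 'provide at least one of text, image:, or video:' if inputs.empty?

    inputs
  end
end

# Text + image keeps both keys; image-only drops the text key.
MultimodalEmbedArgs.normalize('A red sports car', image: 'raw-bytes')
MultimodalEmbedArgs.normalize(image: 'raw-bytes')
```

Keeping text as an optional positional argument would leave existing text-only calls to RubyLLM.embed unchanged.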

Implementation approach:

  • Add image: and video: parameters to the Provider#embed method
  • Implement multimodal payload rendering in the VertexAI::Embeddings module:
    • Support base64-encoded image data
    • Support video as base64 or GCS URIs (gs://...)
    • Handle optional text for pure image/video embeddings
  • Standardize render_embedding_payload signature across all providers
  • Return structured embeddings: { text: [...], image: [...], video: [...] }
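
To make the payload and response steps above concrete, here is a rough sketch. It assumes the Vertex AI multimodal embeddings request uses the instance fields bytesBase64Encoded and gcsUri, and that predictions come back with textEmbedding, imageEmbedding, and videoEmbeddings keys; those names should be checked against the Vertex AI reference. The module MultimodalPayloadSketch and its methods are hypothetical and do not reflect the implementation mentioned in this issue.

```ruby
# Hypothetical helper module; names are illustrative, not RubyLLM's API.
module MultimodalPayloadSketch
  module_function

  # Builds the request body for a single embedding instance.
  # Field names are assumed from the Vertex AI REST API.
  def render_embedding_payload(text = nil, image: nil, video: nil)
    instance = {}
    instance[:text] = text if text
    instance[:image] = media_part(image) if image
    instance[:video] = media_part(video) if video
    raise ArgumentError, 'text, image:, or video: is required' if instance.empty?

    { instances: [instance] }
  end

  # A gs:// string is passed by reference; anything else is treated as raw bytes.
  def media_part(data)
    if data.start_with?('gs://')
      { gcsUri: data }
    else
      # pack('m0') is core Ruby's strict Base64 encoding (no line breaks).
      { bytesBase64Encoded: [data].pack('m0') }
    end
  end

  # Maps one prediction hash to the { text:, image:, video: } shape,
  # dropping modalities that were not requested.
  def parse_prediction(prediction)
    {
      text: prediction['textEmbedding'],
      image: prediction['imageEmbedding'],
      video: prediction['videoEmbeddings']&.map { |segment| segment['embedding'] }
    }.compact
  end
end

payload = MultimodalPayloadSketch.render_embedding_payload(
  'Product demo video',
  video: 'gs://my-bucket/demo.mp4'
)
```

Under these assumptions the video key holds one vector per video segment rather than a single vector, which is worth deciding on explicitly when settling the return shape.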

Why this belongs in RubyLLM

  • Feature parity: Vertex AI already supports this, so RubyLLM should be able to expose it.
  • Unified API: Users expect all provider features through one interface.
  • Real demand: Visual search, RAG with images, and content moderation all need this.

I have a working implementation ready to submit as a PR if there's interest!

Thanks for the review 😊

Labels: enhancement (New feature or request)