[FEATURE] Add multimodal embedding support (image and video)

### Scope check

- [x] This is **core LLM communication** (not application logic)
- [x] This **benefits most users** (not just my use case)
- [x] This **can't be solved in application code** with current RubyLLM
- [x] I read the [Contributing Guide](https://github.com/crmne/ruby_llm/blob/main/CONTRIBUTING.md)

### Due diligence

- [x] I searched existing issues
- [x] I checked the documentation

### What problem does this solve?

Currently, RubyLLM only supports text-based embeddings across all providers. However, Google's Vertex AI offers multimodal embeddings through its `multimodalembedding` model, which can generate embeddings for:

- Images (for visual search, similarity matching)
- Videos (for video content analysis)
- Combined text + image/video (for rich semantic understanding)

Through RubyLLM, users currently cannot use these capabilities to:
- Build visual search systems
- Compare image/video similarity
- Create multimodal RAG (Retrieval-Augmented Generation) systems
- Generate embeddings for mixed media content

### Proposed solution

Extend the existing `Embedding.embed` API (called through `RubyLLM.embed` in the examples below) to accept optional `image:` and `video:` keyword arguments:

```ruby
# Text + Image
RubyLLM.embed(
  "A red sports car",
  image: File.binread('car.jpg'), # binread avoids newline/encoding conversion on binary data
  model: 'multimodalembedding',
  provider: :vertexai
)

# Video with GCS URI
RubyLLM.embed(
  "Product demo video",
  video: 'gs://my-bucket/demo.mp4',
  model: 'multimodalembedding',
  provider: :vertexai
)

# Image only (no text required); image_data holds the raw image bytes
RubyLLM.embed(
  image: image_data,
  model: 'multimodalembedding',
  provider: :vertexai
)

# Text only
RubyLLM.embed(
  "A blue sports car",
  model: 'multimodalembedding',
  provider: :vertexai
)
```

Implementation approach:

- Add `image:` and `video:` parameters to the `Provider#embed` method
- Implement multimodal payload rendering in the `VertexAI::Embeddings` module:
  - Support base64-encoded image data
  - Support video as base64 data or GCS URIs (`gs://...`)
  - Handle optional text for pure image/video embeddings
- Standardize the `render_embedding_payload` signature across all providers
- Return structured embeddings: `{ text: [...], image: [...], video: [...] }`
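
To make the approach above concrete, here is a rough sketch of how the payload rendering and response parsing could look. It is not the actual implementation: the module name `MultimodalEmbeddingSketch` and the helpers `media_part` and `parse_embedding_response` are hypothetical, and the request/response field names (`bytesBase64Encoded`, `gcsUri`, `textEmbedding`, `imageEmbedding`, `videoEmbeddings`) reflect my reading of the Vertex AI multimodal embeddings REST API and should be verified against the current docs. The sketch also assumes that any string starting with `gs://` is a GCS reference and everything else is raw bytes.

```ruby
require 'base64'

# Hypothetical sketch only; names and field layout are assumptions.
module MultimodalEmbeddingSketch
  module_function

  # A gs:// URI is passed by reference; anything else is treated as raw bytes
  # and base64-encoded.
  def media_part(data)
    return { gcsUri: data } if data.start_with?('gs://')

    { bytesBase64Encoded: Base64.strict_encode64(data) }
  end

  # Builds a single-instance request body. Text is optional, but at least one
  # of text, image, or video must be present.
  def render_embedding_payload(text = nil, image: nil, video: nil)
    raise ArgumentError, 'provide text, image, or video' unless text || image || video

    instance = {}
    instance[:text] = text if text
    instance[:image] = media_part(image) if image
    instance[:video] = media_part(video) if video
    { instances: [instance] }
  end

  # Normalizes the first prediction into { text:, image:, video: }, dropping
  # modalities that were not returned. Video is assumed to come back as a list
  # of segments, each carrying its own embedding.
  def parse_embedding_response(body)
    prediction = body.fetch('predictions').first
    {
      text: prediction['textEmbedding'],
      image: prediction['imageEmbedding'],
      video: prediction['videoEmbeddings']&.map { |segment| segment['embedding'] }
    }.compact
  end
end
```

Keeping `text` positional and `image:`/`video:` as keywords would let existing text-only callers stay unchanged, which is one reason the signature is sketched this way.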


### Why this belongs in RubyLLM

- Feature parity: VertexAI already supports this; RubyLLM should be able to expose it.
- Unified API: Users expect to reach all provider features through one interface.
- Real demand: Visual search, RAG with images, and content moderation all need this.


I have a working implementation ready to submit as a PR if there's interest!

Thanks for the review 😊
