Replies: 3 comments
-
Thank you for looking into this, very much appreciated! @ngxson should be able to give you some insights into …
-
I think our implementation of Qwen 2.5 VL has been subtly broken in some way (#13694). Needs a detailed investigation of where numerical results start to diverge.
-
I've done a sanity check using the official llama.cpp Qwen 2.5 VL implementation:

First 10 ViT values: -1.080078 -1.655273 1.696289 -0.018097 1.216431 0.605103 -3.717308 0.730347 -4.380524 2.823242

HF / PyTorch Qwen 2.5 VL:

First 10 ViT values: -0.9140625 -1.625 1.3828125 -0.32226562 1.0625 0.49609375 -3.5625 0.40039062 -4.375 2.984375

The ViT results show very similar values to what we see in jina-embeddings-v4, and very similar discrepancies between the llama.cpp and PyTorch implementations 🤔 Not sure if it helps in any way, but at least I can confirm it's not limited to jina-embeddings-v4 ...
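In case it's useful, a tiny standalone check like the one below quantifies the gap between the two dumps above (max absolute difference and cosine similarity over the ten printed values). It's plain C++ with the numbers copied from this comment, nothing llama.cpp-specific:

```cpp
// compare_vit.cpp - standalone sketch: quantify the gap between the two
// 10-value ViT outputs printed above (llama.cpp vs HF/PyTorch).
// Build: g++ -std=c++17 compare_vit.cpp -o compare_vit
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // First 10 ViT values from llama.cpp (copied from the dump above).
    std::vector<double> a = {-1.080078, -1.655273, 1.696289, -0.018097, 1.216431,
                              0.605103, -3.717308, 0.730347, -4.380524, 2.823242};
    // First 10 ViT values from the HF/PyTorch reference.
    std::vector<double> b = {-0.9140625, -1.625, 1.3828125, -0.32226562, 1.0625,
                              0.49609375, -3.5625, 0.40039062, -4.375, 2.984375};

    double max_abs_diff = 0.0, dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        max_abs_diff = std::max(max_abs_diff, std::fabs(a[i] - b[i]));
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    const double cos_sim = dot / (std::sqrt(na) * std::sqrt(nb));

    printf("max |a-b|  = %f\n", max_abs_diff);
    printf("cosine sim = %f\n", cos_sim);
    return 0;
}
```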
-
Hey folks!
I'm working on getting multimodal embeddings working with jina-embeddings-v4 (based on Qwen 2.5 VL) through the llama.cpp server. I've hit an issue with mtmd inconsistencies and was hoping someone might have insights on this, or suggestions on how to proceed.

What I'm trying to do
I'm implementing token-level embeddings (no pooling) for a retrieval system using jina-embeddings-v4.
This model is based on Qwen 2.5 VL but was further trained for embedding tasks (supports both text and image).
To get it working in llama.cpp, we merged the jina-embeddings-v4 weights back into the Qwen 2.5 VL architecture.
The setup uses the llama.cpp server, processing prompts like `<|im_start|>user\n<__image__>Describe the image.<|im_end|>\n`. On the llama.cpp side, we're not applying any pooling or normalization - just extracting the raw token embeddings.
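To be concrete about what "no pooling or normalization" means here: we read each token's hidden state straight from the context after decode. A minimal sketch is below - the helper name is just illustrative, and it assumes a context created with embeddings enabled, pooling type `LLAMA_POOLING_TYPE_NONE`, and a batch where every token requested an output (model loading and decoding are omitted):

```cpp
// raw_token_embd.cpp - sketch only: collect per-token embeddings after a
// successful llama_decode(), with pooling disabled (LLAMA_POOLING_TYPE_NONE).
// Assumes the context was created with embeddings = true and that every token
// in the batch had its output flag set.
#include <cstddef>
#include <vector>
#include "llama.h"

// Copy the raw (unpooled, unnormalized) embedding of each of the n_tokens
// outputs into a flat buffer of size n_tokens * n_embd.
static std::vector<float> collect_token_embeddings(llama_context * ctx,
                                                   int32_t n_tokens,
                                                   int32_t n_embd) {
    std::vector<float> out;
    out.reserve((std::size_t) n_tokens * n_embd);
    for (int32_t i = 0; i < n_tokens; ++i) {
        const float * embd = llama_get_embeddings_ith(ctx, i);
        if (embd == nullptr) {
            continue; // token i had no output requested
        }
        out.insert(out.end(), embd, embd + n_embd);
    }
    return out;
}
```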
The issue
Here's what's got me scratching my head: everything seems to work perfectly until the vision encoder kicks in.
What's working:
Where it diverges:
Right after mtmd_encode_chunk() runs, the vision encoder outputs start differing significantly from what I get with the Python/HuggingFace implementation.
llama.cpp vision encoder output:
Python reference:
They're in the same ballpark but consistently different across all values.
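For anyone wanting to reproduce the dump: the values come from reading the output buffer right after the image chunk is encoded. A rough sketch of such a probe is below; it assumes the current mtmd C API (`mtmd_encode_chunk()` running the vision encoder for an image chunk, `mtmd_get_output_embd()` exposing the resulting embeddings), and the helper itself is only illustrative, so double-check the exact names and signatures against your checkout:

```cpp
// vit_dump.cpp - sketch only: print the first few vision-encoder outputs
// right after encoding an image chunk, to compare against the HF reference.
// Assumes mtmd_encode_chunk() / mtmd_get_output_embd() from tools/mtmd/mtmd.h.
#include <cstdio>
#include "mtmd.h"

// Encode one image chunk and dump its first n_print output values.
static int dump_image_chunk(mtmd_context * mctx, const mtmd_input_chunk * chunk,
                            int n_print = 10) {
    if (mtmd_input_chunk_get_type(chunk) != MTMD_INPUT_CHUNK_TYPE_IMAGE) {
        return 0; // nothing to do for text chunks
    }
    if (mtmd_encode_chunk(mctx, chunk) != 0) {
        fprintf(stderr, "mtmd_encode_chunk failed\n");
        return -1;
    }
    const float * embd = mtmd_get_output_embd(mctx);
    printf("First %d ViT values:", n_print);
    for (int i = 0; i < n_print; ++i) {
        printf(" %f", embd[i]);
    }
    printf("\n");
    return 0;
}
```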
Debugging steps taken
I added some debug output to mtmd_helper_eval_chunk_single().

The processing flow:
process_chunk()
The fact that text embeddings match perfectly makes me think this isn't a fundamental model loading or quantization issue.
And since the image embeddings respond to different text contexts, the attention mechanism seems to be working.
What I'm wondering:
I'm happy to run more tests or provide additional debugging info if that would help figure this out.
Really appreciate any insights you might have!
My implementation changes
I've modified the update_slots() function to capture embeddings during multimodal processing. Here are the key changes I made:
Image processing section (around the LLAMA_TOKEN_NULL check):
Pre-image text embedding capture (in the batch processing loop):
I also modified send_embedding() to assemble the complete multimodal embedding sequence by combining pre-image text embeddings, image embeddings, and post-image text embeddings.
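The assembly itself is just concatenation in prompt order. The actual patch is in the fork linked below; a trimmed-down, hypothetical illustration of that step (my own type and function names, not the server code) could look like this:

```cpp
// assemble_embd.cpp - sketch only: stitch the captured segments back together
// in prompt order (pre-image text, image, post-image text). Each segment is a
// list of per-token embedding vectors of the same dimensionality.
#include <vector>

using token_embd_t   = std::vector<float>;        // one embedding per token
using embd_segment_t = std::vector<token_embd_t>; // a run of consecutive tokens

// Concatenate the three captured segments into the full, unpooled sequence.
static embd_segment_t assemble_multimodal_embeddings(const embd_segment_t & pre_image_text,
                                                     const embd_segment_t & image,
                                                     const embd_segment_t & post_image_text) {
    embd_segment_t full;
    full.reserve(pre_image_text.size() + image.size() + post_image_text.size());
    full.insert(full.end(), pre_image_text.begin(),  pre_image_text.end());
    full.insert(full.end(), image.begin(),           image.end());
    full.insert(full.end(), post_image_text.begin(), post_image_text.end());
    return full;
}
```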
Environment details:

Let me know if there's any other info that would be useful!
Our code can be found here if anyone wants to take a look: https://github.com/jina-ai/llama.cpp