Replies: 3 comments
-
Thank you for looking into this, very much appreciated! @ngxson should be able to give you some insights into …
-
I think our implementation of Qwen 2.5 VL has been subtly broken in some way (#13694). Needs a detailed investigation of where numerical results start to diverge.
-
I've done a sanity check using the official llama.cpp Qwen 2.5 VL implementation:

First 10 ViT values: -1.080078 -1.655273 1.696289 -0.018097 1.216431 0.605103 -3.717308 0.730347 -4.380524 2.823242

HF / PyTorch Qwen 2.5 VL:

First 10 ViT values: -0.9140625 -1.625 1.3828125 -0.32226562 1.0625 0.49609375 -3.5625 0.40039062 -4.375 2.984375

The ViT results show very similar values to what we see in jina-embeddings-v4, and very similar discrepancies between the llama.cpp and PyTorch implementations 🤔 Not sure if it helps in any way, but at least I can confirm it's not limited to jina-embeddings-v4 ...
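In case it's useful, a tiny standalone check like the one below quantifies the gap between the two dumps above (max absolute difference and cosine similarity over the ten printed values). It's plain C++ with the numbers copied from this comment, nothing llama.cpp-specific:

```cpp
// compare_vit.cpp - standalone sketch: quantify the gap between the two
// 10-value ViT outputs printed above (llama.cpp vs HF/PyTorch).
// Build: g++ -std=c++17 compare_vit.cpp -o compare_vit
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // First 10 ViT values from llama.cpp (copied from the dump above).
    std::vector<double> a = {-1.080078, -1.655273, 1.696289, -0.018097, 1.216431,
                              0.605103, -3.717308, 0.730347, -4.380524, 2.823242};
    // First 10 ViT values from the HF/PyTorch reference.
    std::vector<double> b = {-0.9140625, -1.625, 1.3828125, -0.32226562, 1.0625,
                              0.49609375, -3.5625, 0.40039062, -4.375, 2.984375};

    double max_abs_diff = 0.0, dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        max_abs_diff = std::max(max_abs_diff, std::fabs(a[i] - b[i]));
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    const double cos_sim = dot / (std::sqrt(na) * std::sqrt(nb));

    printf("max |a-b|  = %f\n", max_abs_diff);
    printf("cosine sim = %f\n", cos_sim);
    return 0;
}
```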
-
Hey folks!
I'm working on getting multimodal embeddings working with jina-embeddings-v4 (based on Qwen 2.5 VL) through the llama.cpp server. I've hit an issue with mtmd inconsistencies and was hoping someone might have insights on this, or suggestions on how to proceed.

What I'm trying to do
I'm implementing token-level embeddings (no pooling) for a retrieval system using jina-embeddings-v4.
This model is based on Qwen 2.5 VL but was further trained for embedding tasks (supports both text and image).
To get it working in llama.cpp, we merged the jina-embeddings-v4 weights back into the Qwen 2.5 VL architecture.
The setup uses the llama.cpp server, processing prompts like `<|im_start|>user\n<__image__>Describe the image.<|im_end|>\n`. On the llama.cpp side, we're not applying any pooling or normalization - just extracting the raw token embeddings.
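To be concrete about what "no pooling or normalization" means here: we read each token's hidden state straight from the context after decode. A minimal sketch is below - the helper name is just illustrative, and it assumes a context created with embeddings enabled, pooling type `LLAMA_POOLING_TYPE_NONE`, and a batch where every token requested an output (model loading and decoding are omitted):

```cpp
// raw_token_embd.cpp - sketch only: collect per-token embeddings after a
// successful llama_decode(), with pooling disabled (LLAMA_POOLING_TYPE_NONE).
// Assumes the context was created with embeddings = true and that every token
// in the batch had its output flag set.
#include <cstddef>
#include <vector>
#include "llama.h"

// Copy the raw (unpooled, unnormalized) embedding of each of the n_tokens
// outputs into a flat buffer of size n_tokens * n_embd.
static std::vector<float> collect_token_embeddings(llama_context * ctx,
                                                   int32_t n_tokens,
                                                   int32_t n_embd) {
    std::vector<float> out;
    out.reserve((std::size_t) n_tokens * n_embd);
    for (int32_t i = 0; i < n_tokens; ++i) {
        const float * embd = llama_get_embeddings_ith(ctx, i);
        if (embd == nullptr) {
            continue; // token i had no output requested
        }
        out.insert(out.end(), embd, embd + n_embd);
    }
    return out;
}
```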
The issue
Here's what's got me scratching my head: everything seems to work perfectly until the vision encoder kicks in.
What's working:
Where it diverges:
Right after mtmd_encode_chunk() runs, the vision encoder outputs start differing significantly from what I get with the Python/HuggingFace implementation.
llama.cpp vision encoder output:
Python reference:
They're in the same ballpark but consistently different across all values.
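For anyone wanting to reproduce the dump: the values come from reading the output buffer right after the image chunk is encoded. A rough sketch of such a probe is below; it assumes the current mtmd C API (`mtmd_encode_chunk()` running the vision encoder for an image chunk, `mtmd_get_output_embd()` exposing the resulting embeddings), and the helper itself is only illustrative, so double-check the exact names and signatures against your checkout:

```cpp
// vit_dump.cpp - sketch only: print the first few vision-encoder outputs
// right after encoding an image chunk, to compare against the HF reference.
// Assumes mtmd_encode_chunk() / mtmd_get_output_embd() from tools/mtmd/mtmd.h.
#include <cstdio>
#include "mtmd.h"

// Encode one image chunk and dump its first n_print output values.
static int dump_image_chunk(mtmd_context * mctx, const mtmd_input_chunk * chunk,
                            int n_print = 10) {
    if (mtmd_input_chunk_get_type(chunk) != MTMD_INPUT_CHUNK_TYPE_IMAGE) {
        return 0; // nothing to do for text chunks
    }
    if (mtmd_encode_chunk(mctx, chunk) != 0) {
        fprintf(stderr, "mtmd_encode_chunk failed\n");
        return -1;
    }
    const float * embd = mtmd_get_output_embd(mctx);
    printf("First %d ViT values:", n_print);
    for (int i = 0; i < n_print; ++i) {
        printf(" %f", embd[i]);
    }
    printf("\n");
    return 0;
}
```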
Debugging steps taken
I added some debug output to mtmd_helper_eval_chunk_single().

The processing flow:
process_chunk()
The fact that text embeddings match perfectly makes me think this isn't a fundamental model loading or quantization issue.
And since the image embeddings respond to different text contexts, the attention mechanism seems to be working.
What I'm wondering:
I'm happy to run more tests or provide additional debugging info if that would help figure this out.
Really appreciate any insights you might have!
My implementation changes
I've modified the update_slots() function to capture embeddings during multimodal processing. Here are the key changes I made:
Image processing section (around the LLAMA_TOKEN_NULL check):
Pre-image text embedding capture (in the batch processing loop):
I also modified send_embedding() to assemble the complete multimodal embedding sequence by combining pre-image text embeddings, image embeddings, and post-image text embeddings.
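The assembly itself is just concatenation in prompt order. The actual patch is in the fork linked below; a trimmed-down, hypothetical illustration of that step (my own type and function names, not the server code) could look like this:

```cpp
// assemble_embd.cpp - sketch only: stitch the captured segments back together
// in prompt order (pre-image text, image, post-image text). Each segment is a
// list of per-token embedding vectors of the same dimensionality.
#include <vector>

using token_embd_t   = std::vector<float>;        // one embedding per token
using embd_segment_t = std::vector<token_embd_t>; // a run of consecutive tokens

// Concatenate the three captured segments into the full, unpooled sequence.
static embd_segment_t assemble_multimodal_embeddings(const embd_segment_t & pre_image_text,
                                                     const embd_segment_t & image,
                                                     const embd_segment_t & post_image_text) {
    embd_segment_t full;
    full.reserve(pre_image_text.size() + image.size() + post_image_text.size());
    full.insert(full.end(), pre_image_text.begin(),  pre_image_text.end());
    full.insert(full.end(), image.begin(),           image.end());
    full.insert(full.end(), post_image_text.begin(), post_image_text.end());
    return full;
}
```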
Environment details:

Let me know if there's any other info that would be useful!
Our code can be found here if anyone wants to take a look: https://github.com/jina-ai/llama.cpp