### Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
### Feature Description
I am running a VLM such as Qwen2.5-VL with an inference request that contains multiple images and a single prompt, so the model can do video-like understanding.
However, the log and the code show that llama.cpp does not do real batch processing for multiple images. Multi-image, single-prompt (video-like) inference is a real use case in automotive and robotics, where good TTFT and end-to-end latency matter, so it would be valuable to implement batch processing even if it means a bigger cgraph. The relevant check in clip.cpp:
```cpp
bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_image_f32_batch * imgs_c_ptr, float * vec) {
    const clip_image_f32_batch & imgs = *imgs_c_ptr;
    int batch_size = imgs.entries.size();
    // TODO @ngxson : implement batch size > 1 as a loop
    // we don't need true batching support because the cgraph will gonna be big anyway
    if (batch_size != 1) {
        return false; // only support batch size of 1
    }
```
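For reference, here is a minimal sketch of what the TODO's "loop" fallback might look like. It is only an illustration, not the actual implementation: it assumes it lives in clip.cpp next to the function above (where `entries` is visible) and that the clip.h helpers `clip_image_encode`, `clip_n_mmproj_embd`, and a per-image token-count helper like `clip_n_output_tokens` exist under these names; signatures and the batch struct layout should be verified against the current tree.

```cpp
// Hypothetical helper, not the actual implementation: encode a batch by
// looping over its entries and writing each image's embeddings at the
// correct offset in the caller-provided output buffer.
static bool clip_image_batch_encode_looped(clip_ctx * ctx, int n_threads,
                                           const clip_image_f32_batch * imgs_c_ptr, float * vec) {
    const clip_image_f32_batch & imgs = *imgs_c_ptr;
    const int n_embd = clip_n_mmproj_embd(ctx); // embedding size per output token

    float * out = vec;
    for (const auto & entry : imgs.entries) {   // assumes entries holds owning image pointers
        clip_image_f32 * img = entry.get();
        if (!clip_image_encode(ctx, n_threads, img, out)) {
            return false;
        }
        // advance past the tokens produced for this image
        out += (size_t) clip_n_output_tokens(ctx, img) * n_embd;
    }
    return true;
}
```

Note that this only removes the `batch_size != 1` early return; each image is still encoded in its own cgraph, so by itself it would not give the TTFT/E2E improvement this request is about. That would require the encoder to accept several images in a single (bigger) cgraph, as the upstream comment anticipates.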
### Motivation
llama-server log of an inference request with 2 images and 1 prompt:
```shell
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 1365, n_keep = 0, n_prompt_tokens = 29
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 18, n_tokens = 18, progress = 0.620690
slot update_slots: id 0 | task 0 | kv cache rm [18, end)
srv process_chun: processing image...
encoding image slice...
image slice encoded in 600 ms
decoding image batch 1/1, n_tokens_batch = 208
image decoded (batch 1/1) in 4 ms
srv process_chun: image processed in 604 ms
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 22, n_tokens = 3, progress = 0.758621
slot update_slots: id 0 | task 0 | kv cache rm [22, end)
srv process_chun: processing image...
encoding image slice...
image slice encoded in 470 ms
decoding image batch 1/1, n_tokens_batch = 208
image decoded (batch 1/1) in 3 ms
srv process_chun: image processed in 473 ms
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 29, n_tokens = 6, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 29, n_tokens = 6
slot release: id 0 | task 0 | stop processing: n_past = 50, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 2124.96 ms / 29 tokens ( 73.27 ms per token, 13.65 tokens per second)
eval time = 610.92 ms / 22 tokens ( 27.77 ms per token, 36.01 tokens per second)
```
### Possible Implementation
No response