Feature Request: to enable real batch for multiple images input of VLM #14530

Open
@alexhegit

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

I am running a VLM such as Qwen2.5-VL with inference requests that contain multiple images and a single prompt, so that the model can do video-like understanding.

However, the log and the code show that llama.cpp does not support real batch processing for multiple images. Multi-image, single-prompt (video-like) inference is a real use case for automotive and robotics applications, which need good TTFT and E2E latency. So it would be valuable to implement true batch processing, even if it results in a bigger cgraph.
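For reference, here is a sketch of what a multi-image, single-prompt request looks like against the server's OpenAI-compatible `/v1/chat/completions` endpoint; the model name and image URLs below are placeholders:

```json
{
  "model": "qwen2.5-vl",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe what happens across these frames." },
        { "type": "image_url", "image_url": { "url": "https://example.com/frame1.jpg" } },
        { "type": "image_url", "image_url": { "url": "https://example.com/frame2.jpg" } }
      ]
    }
  ]
}
```

Each `image_url` part currently gets encoded one at a time, as the code below shows.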

```cpp
bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_image_f32_batch * imgs_c_ptr, float * vec) {
    const clip_image_f32_batch & imgs = *imgs_c_ptr;
    int batch_size = imgs.entries.size();

    // TODO @ngxson : implement batch size > 1 as a loop
    //                we don't need true batching support because the cgraph will gonna be big anyway
    if (batch_size != 1) {
        return false; // only support batch size of 1
    }
```
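Even before true batching lands, the TODO above could be resolved as a loop: encode each entry independently and write its embedding at the appropriate offset of the output buffer. The sketch below illustrates only that looping pattern, using hypothetical stand-in types (`fake_image`, `fake_image_batch`, `encode_one`) rather than the real clip.cpp API:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the clip.cpp image types, for illustration only.
struct fake_image       { std::vector<float> pixels; };
struct fake_image_batch { std::vector<fake_image> entries; };

// Stand-in single-image encoder: here it just copies pixels through.
static bool encode_one(const fake_image & img, float * vec) {
    for (size_t i = 0; i < img.pixels.size(); ++i) {
        vec[i] = img.pixels[i];
    }
    return true;
}

// Instead of rejecting batch_size != 1, encode each entry in turn,
// writing entry i's embedding at offset i * n_embd in `vec`.
static bool batch_encode_as_loop(const fake_image_batch & imgs, size_t n_embd, float * vec) {
    for (size_t i = 0; i < imgs.entries.size(); ++i) {
        if (!encode_one(imgs.entries[i], vec + i * n_embd)) {
            return false;
        }
    }
    return true;
}
```

This keeps each cgraph small (one image per encode) while removing the hard `batch_size != 1` failure; true batching across images would go further and build one graph for all entries.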

Motivation

llama-server log of an inference request with 2 images and 1 prompt:


```shell
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 1365, n_keep = 0, n_prompt_tokens = 29
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 18, n_tokens = 18, progress = 0.620690
slot update_slots: id  0 | task 0 | kv cache rm [18, end)
srv  process_chun: processing image...
encoding image slice...
image slice encoded in 600 ms
decoding image batch 1/1, n_tokens_batch = 208
image decoded (batch 1/1) in 4 ms
srv  process_chun: image processed in 604 ms
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 22, n_tokens = 3, progress = 0.758621
slot update_slots: id  0 | task 0 | kv cache rm [22, end)
srv  process_chun: processing image...
encoding image slice...
image slice encoded in 470 ms
decoding image batch 1/1, n_tokens_batch = 208
image decoded (batch 1/1) in 3 ms
srv  process_chun: image processed in 473 ms
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 29, n_tokens = 6, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 29, n_tokens = 6
slot      release: id  0 | task 0 | stop processing: n_past = 50, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =    2124.96 ms /    29 tokens (   73.27 ms per token,    13.65 tokens per second)
       eval time =     610.92 ms /    22 tokens (   27.77 ms per token,    36.01 tokens per second)
```

Possible Implementation

No response
