
[Feature]: AutoDeploy: setup of multi-modal AD input processor in trtllm-serve #9281

@lucaslie

Description

🚀 The feature, motivation and pitch

I ran a quick test, and it seems that our trtllm-serve integration for multi-modal models is missing a few features needed to support all VLMs out-of-the-box.

In particular, we rely on a wrapper around HF's multi-modal input processor that is currently not hooked up to trtllm-serve's handling of multi-modal inputs. trtllm-serve assumes that a custom input processor for multi-modal data, built on TRT-LLM's base class, is available; we instead just re-use HF's input processor.

best case scenario --> we can hook our generic input processor, which wraps HF's input processor, into TRT-LLM's base class interface (rough sketch below)
worst case scenario --> we have to manually write an input processor for each VLM we enable
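
For illustration, here is a rough sketch of what the best-case hookup could look like. All names below (the adapter class, its call signature, the returned dict) are placeholders made up for this sketch, not existing TRT-LLM or AutoDeploy interfaces; the point is simply that one generic adapter could forward trtllm-serve's multi-modal inputs to the HF processor AutoDeploy already uses, instead of one hand-written input processor per VLM.

# Sketch only: class and method names are hypothetical placeholders, not
# actual TRT-LLM/AutoDeploy APIs.
from typing import Any, Dict, Optional


class HFInputProcessorAdapter:
    """Adapts an HF AutoProcessor to a TRT-LLM-style input processor interface."""

    def __init__(self, hf_processor: Any):
        # e.g. transformers.AutoProcessor.from_pretrained(<model>)
        self.hf_processor = hf_processor

    def __call__(self, prompt: str, mm_data: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        # Delegate tokenization and image preprocessing to the HF processor;
        # the exact output contract expected by the serving layer is an assumption here.
        images = (mm_data or {}).get("image")
        return self.hf_processor(text=prompt, images=images, return_tensors="pt")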

Alternatives

No response

Additional context

Testing trtllm-serve with Qwen3-VL

1. Install the latest transformers version: pip install -U transformers~=4.57
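
Optionally verify which version actually gets picked up:

python -c "import transformers; print(transformers.__version__)"  # should report a 4.57.x release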

2. Apply this patch to avoid a name clash with the manual PyTorch workflow:

diff --git a/tensorrt_llm/_torch/models/modeling_qwen3_next.py b/tensorrt_llm/_torch/models/modeling_qwen3_next.py
index c6bac044f3..60dcc0b57a 100644
--- a/tensorrt_llm/_torch/models/modeling_qwen3_next.py
+++ b/tensorrt_llm/_torch/models/modeling_qwen3_next.py
@@ -319,7 +319,7 @@ class Qwen3NextConfig(PretrainedConfig):
         self.mlp_only_layers = mlp_only_layers
 
 
-AutoConfig.register("qwen3_next", Qwen3NextConfig)
+# AutoConfig.register("qwen3_next", Qwen3NextConfig)
 
 
 class Qwen3NextGate(nn.Module):
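
As a side note, instead of commenting the registration out, the clash could probably also be avoided by guarding the call. This is only a sketch (it assumes the duplicate registration surfaces as a ValueError from AutoConfig.register when transformers >= 4.57 already ships a "qwen3_next" config); the patch above is what was actually tested:

# Sketch of a guard instead of commenting the registration out; assumes the
# duplicate registration raises ValueError in recent transformers.
try:
    AutoConfig.register("qwen3_next", Qwen3NextConfig)
except ValueError:
    # transformers (>= ~4.57) already registers this model type natively
    pass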

3. Use qwen3_vl.yaml:

model: Qwen/Qwen3-VL-4B-Instruct
args:
  mode: transformers
  world_size: 1 # can also be > 1
  model_factory: AutoModelForImageTextToText
  max_input_len: 4096
  max_seq_len: 8192
prompt:
  batch_size: 4
  queries:
    - "How big is the universe? "
    - {"prompt": "In simple words and a single sentence, explain the concept of gravity: "}
    # see for chat template format: https://huggingface.co/docs/transformers/en/chat_templating_multimodal
    - - role: user
        content:
          - type: text
            text: How to fix slicing in golf?
    - - role: user
        content:
          - type: text
            text: Please describe the natural scenery you see in the following images
          - type: image
            url: https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png
          - type: image
            url: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png
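
For reference, the chat-format queries above follow HF's multi-modal chat-template convention from the linked docs; roughly, they resolve on the HF side like this (plain transformers usage for illustration, not AutoDeploy code):

# Illustration of how a chat-format query with images is processed by HF;
# standard transformers AutoProcessor usage, independent of AutoDeploy.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-4B-Instruct")
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Please describe the natural scenery you see in the following images"},
        {"type": "image", "url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"},
    ],
}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)  # input_ids plus pixel values, ready to be fed to the model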

4. Run example script:

python build_and_run_ad.py --yaml-extra qwen3_vl.yaml

5. Expected Output:

[11/18/2025-14:43:51] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Running example prompts...
Processed requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:14<00:00,  3.72s/it]
[11/18/2025-14:44:07] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [PROMPT 0] How big is the universe? :  What is its age?
Answer:
The universe is estimated to be **about 93 billion light-years** in diameter, stretching beyond what we can observe.

It is **approximately 13.8 billion years old**.

This estimate comes from observations of cosmic microwave background radiation and other cosmological data. Although the observable universe is only 93 billion light-years across (due to the expansion of space during the universe's lifespan), the total universe might be much larger—or even infinite.

And
[11/18/2025-14:44:07] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: :  Gravity is the invisible force that pulls everything with mass toward each other, making objects fall to the ground and keeping planets in orbit around stars.
[11/18/2025-14:44:07] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [PROMPT 2] <|im_start|>user
How to fix slicing in golf?<|im_end|>
<|im_start|>assistant
: Fixing a **slicing** golf shot — where the ball curves sharply to the right (for right-handed players) or left (for left-handed players) Spain — is a common issue for golfers of all levels. The good news is that slicing is **correctable** with the right technique, mindset, and practice. Here's a step-by-step guide to help you fix it:

---

## 🔍 1. **Understand the Cause of Your Slice**
Slicing is usually caused
[11/18/2025-14:44:07] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [PROMPT 3] <|im_start|>user
Please describe the natural scenery you see in the following images<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant
: Based on the two images provided, here is a description of the natural scenery in each:

**Image 1: A Stormy Sea**

This image captures a powerful and dramatic seascape under a heavy, overcast sky.
*   **Sky:** The sky is completely overcast with a thick blanketพระพุทธ of dark, gray clouds, suggesting an impending or ongoing storm.
*   **Sea:** The ocean is turbulent and wild. Large, crested waves are rolling powerfully, with white foam and

6. Spin up trtllm-serve

You can also spin up a trtllm-serve instance with

trtllm-serve serve Qwen/Qwen3-VL-4B-Instruct --backend _autodeploy --extra_llm_api_options qwen3_vl_extra.yaml

where qwen3_vl_extra.yaml is:

mode: transformers
model_factory: AutoModelForImageTextToText
max_input_len: 4096
max_seq_len: 8192

And send a request:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \ 
    -d '{
        "model": "Qwen/Qwen3-VL-4B-Instruct",
        "messages":[{
            "role": "system",
            "content": "You are a helpful assistant."
        }, {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text":"Tell me the difference between two images"      
                },
                {
                    "type":"image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
                    }
                },
                {
                    "type":"image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"
                    }
                }
            ]
        }],
        "max_tokens": 64,
        "temperature": 0
    }'
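
Equivalently, the same request can be sent with the openai Python client (assumes pip install openai and the trtllm-serve instance above on port 8000):

# Same request as the curl call above, sent with the openai client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-VL-4B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "text", "text": "Tell me the difference between two images"},
            {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"}},
            {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"}},
        ]},
    ],
    max_tokens=64,
    temperature=0,
)
print(resp.choices[0].message.content)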

Error Message:

INFO:     Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
[11/18/2025-14:45:53] [TRT-LLM] [E] Traceback (most recent call last):
  File "/home/lliebenwein/dev_local/TensorRT-LLM/tensorrt_llm/serve/openai_server.py", line 499, in openai_chat
    conversation, mm_coroutines, mm_placeholder_counts = parse_chat_messages_coroutines(request.messages, self.model_config, self.multimodal_server_config)
                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lliebenwein/dev_local/TensorRT-LLM/tensorrt_llm/serve/chat_utils.py", line 239, in parse_chat_messages_coroutines
    mm_data_tracker.add_data(mdata["modality"], mdata["data"])
  File "/home/lliebenwein/dev_local/TensorRT-LLM/tensorrt_llm/inputs/utils.py", line 470, in add_data
    placeholder = retrieve_multimodal_placeholder(self._model_type,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lliebenwein/dev_local/TensorRT-LLM/tensorrt_llm/inputs/utils.py", line 418, in retrieve_multimodal_placeholder
    raise TypeError(f"Unknown modality: {modality}")
TypeError: Unknown modality: image

/home/lliebenwein/dev_local/TensorRT-LLM/tensorrt_llm/serve/openai_server.py:563: RuntimeWarning: coroutine 'parse_chat_message_content_part.<locals>.load_image_async' was never awaited
  return self.create_error_response(str(e))
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
INFO:     127.0.0.1:41146 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
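
For context on the failure: judging from the traceback, the placeholder lookup is keyed on the model type known to TRT-LLM's multimodal input registry, which the generic AutoDeploy/HF path never populates. A purely illustrative reconstruction of the failing lookup (the real logic lives in tensorrt_llm/inputs/utils.py and differs in detail):

# Purely illustrative, inferred from the traceback above; not the real implementation.
PLACEHOLDERS = {
    # model_type -> {modality -> placeholder token}, i.e. the kind of entry a
    # hand-written input processor would provide (hypothetical values):
    # "qwen3_vl": {"image": "<|image_pad|>"},
}

def retrieve_multimodal_placeholder(model_type: str, modality: str) -> str:
    per_model = PLACEHOLDERS.get(model_type, {})
    if modality not in per_model:
        # This is the path hit for AutoDeploy-served VLMs today.
        raise TypeError(f"Unknown modality: {modality}")
    return per_model[modality]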

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Assignees: no one assigned
Labels: AutoDeploy<NV>, Multimodal, feature request
Project status: Backlog
Milestone: no milestone