
[Feature]: AutoDeploy: setup of multi-modal AD input processor in trtllm-serve #9281

@lucaslie

Description

🚀 The feature, motivation and pitch

I ran a quick test, and it seems that our trtllm-serve integration for multi-modal models is missing a few features needed to support all VLMs out-of-the-box.

In particular, we rely on a wrapper around HF's multi-modal input processor that is currently not hooked up to trtllm-serve's handling of multi-modal inputs. trtllm-serve assumes that a custom input processor for multi-modal data, built on TRT-LLM's base class, is available; we instead just re-use HF's input processor.

best case scenario --> we can hook our generic input processor, which wraps HF's input processor, into TRT-LLM's base class interface (rough sketch below)
worst case scenario --> we have to manually write an input processor for each VLM we enable
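
For illustration, here is a rough sketch of what the best-case hookup could look like. All names below (the adapter class, its call signature, the returned dict) are placeholders made up for this sketch, not existing TRT-LLM or AutoDeploy interfaces; the point is simply that one generic adapter could forward trtllm-serve's multi-modal inputs to the HF processor AutoDeploy already uses, instead of one hand-written input processor per VLM.

# Sketch only: class and method names are hypothetical placeholders, not
# actual TRT-LLM/AutoDeploy APIs.
from typing import Any, Dict, Optional


class HFInputProcessorAdapter:
    """Adapts an HF AutoProcessor to a TRT-LLM-style input processor interface."""

    def __init__(self, hf_processor: Any):
        # e.g. transformers.AutoProcessor.from_pretrained(<model>)
        self.hf_processor = hf_processor

    def __call__(self, prompt: str, mm_data: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        # Delegate tokenization and image preprocessing to the HF processor;
        # the exact output contract expected by the serving layer is an assumption here.
        images = (mm_data or {}).get("image")
        return self.hf_processor(text=prompt, images=images, return_tensors="pt")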

Alternatives

No response

Additional context

Testing trtllm-serve with Qwen3-VL

1. Install the latest transformers version: pip install -U transformers~=4.57
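
Optionally verify which version actually gets picked up:

python -c "import transformers; print(transformers.__version__)"  # should report a 4.57.x release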

2. Apply this patch to avoid a name clash with the manual PyTorch workflow:

diff --git a/tensorrt_llm/_torch/models/modeling_qwen3_next.py b/tensorrt_llm/_torch/models/modeling_qwen3_next.py
index c6bac044f3..60dcc0b57a 100644
--- a/tensorrt_llm/_torch/models/modeling_qwen3_next.py
+++ b/tensorrt_llm/_torch/models/modeling_qwen3_next.py
@@ -319,7 +319,7 @@ class Qwen3NextConfig(PretrainedConfig):
         self.mlp_only_layers = mlp_only_layers
 
 
-AutoConfig.register("qwen3_next", Qwen3NextConfig)
+# AutoConfig.register("qwen3_next", Qwen3NextConfig)
 
 
 class Qwen3NextGate(nn.Module):
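
As a side note, instead of commenting the registration out, the clash could probably also be avoided by guarding the call. This is only a sketch (it assumes the duplicate registration surfaces as a ValueError from AutoConfig.register when transformers >= 4.57 already ships a "qwen3_next" config); the patch above is what was actually tested:

# Sketch of a guard instead of commenting the registration out; assumes the
# duplicate registration raises ValueError in recent transformers.
try:
    AutoConfig.register("qwen3_next", Qwen3NextConfig)
except ValueError:
    # transformers (>= ~4.57) already registers this model type natively
    pass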

3. Use qwen3_vl.yaml:

model: Qwen/Qwen3-VL-4B-Instruct
args:
  mode: transformers
  world_size: 1 # can also be > 1
  model_factory: AutoModelForImageTextToText
  max_input_len: 4096
  max_seq_len: 8192
prompt:
  batch_size: 4
  queries:
    - "How big is the universe? "
    - {"prompt": "In simple words and a single sentence, explain the concept of gravity: "}
    # see for chat template format: https://huggingface.co/docs/transformers/en/chat_templating_multimodal
    - - role: user
        content:
          - type: text
            text: How to fix slicing in golf?
    - - role: user
        content:
          - type: text
            text: Please describe the natural scenery you see in the following images
          - type: image
            url: https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png
          - type: image
            url: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png
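
For reference, the chat-format queries above follow HF's multi-modal chat-template convention from the linked docs; roughly, they resolve on the HF side like this (plain transformers usage for illustration, not AutoDeploy code):

# Illustration of how a chat-format query with images is processed by HF;
# standard transformers AutoProcessor usage, independent of AutoDeploy.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-4B-Instruct")
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Please describe the natural scenery you see in the following images"},
        {"type": "image", "url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"},
    ],
}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)  # input_ids plus pixel values, ready to be fed to the model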

4. Run example script:

python build_and_run_ad.py --yaml-extra qwen3_vl.yaml

5. Expected Output:

[11/18/2025-14:43:51] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Running example prompts...
Processed requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:14<00:00,  3.72s/it]
[11/18/2025-14:44:07] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [PROMPT 0] How big is the universe? :  What is its age?
Answer:
The universe is estimated to be **about 93 billion light-years** in diameter, stretching beyond what we can observe.

It is **approximately 13.8 billion years old**.

This estimate comes from observations of cosmic microwave background radiation and other cosmological data. Although the observable universe is only 93 billion light-years across (due to the expansion of space during the universe's lifespan), the total universe might be much larger—or even infinite.

And
[11/18/2025-14:44:07] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: :  Gravity is the invisible force that pulls everything with mass toward each other, making objects fall to the ground and keeping planets in orbit around stars.
[11/18/2025-14:44:07] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [PROMPT 2] <|im_start|>user
How to fix slicing in golf?<|im_end|>
<|im_start|>assistant
: Fixing a **slicing** golf shot — where the ball curves sharply to the right (for right-handed players) or left (for left-handed players) Spain — is a common issue for golfers of all levels. The good news is that slicing is **correctable** with the right technique, mindset, and practice. Here's a step-by-step guide to help you fix it:

---

## 🔍 1. **Understand the Cause of Your Slice**
Slicing is usually caused
[11/18/2025-14:44:07] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [PROMPT 3] <|im_start|>user
Please describe the natural scenery you see in the following images<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant
: Based on the two images provided, here is a description of the natural scenery in each:

**Image 1: A Stormy Sea**

This image captures a powerful and dramatic seascape under a heavy, overcast sky.
*   **Sky:** The sky is completely overcast with a thick blanketพระพุทธ of dark, gray clouds, suggesting an impending or ongoing storm.
*   **Sea:** The ocean is turbulent and wild. Large, crested waves are rolling powerfully, with white foam and

6. Spin up trtllm-serve

You can also spin up a trtllm-serve instance with

trtllm-serve serve Qwen/Qwen3-VL-4B-Instruct --backend _autodeploy --extra_llm_api_options qwen3_vl_extra.yaml

where qwen3_vl_extra.yaml is:

mode: transformers
model_factory: AutoModelForImageTextToText
max_input_len: 4096
max_seq_len: 8192

And send a request:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \ 
    -d '{
        "model": "Qwen/Qwen3-VL-4B-Instruct",
        "messages":[{
            "role": "system",
            "content": "You are a helpful assistant."
        }, {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text":"Tell me the difference between two images"      
                },
                {
                    "type":"image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
                    }
                },
                {
                    "type":"image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"
                    }
                }
            ]
        }],
        "max_tokens": 64,
        "temperature": 0
    }'
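
Equivalently, the same request can be sent with the openai Python client (assumes pip install openai and the trtllm-serve instance above on port 8000):

# Same request as the curl call above, sent with the openai client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-VL-4B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "text", "text": "Tell me the difference between two images"},
            {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"}},
            {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"}},
        ]},
    ],
    max_tokens=64,
    temperature=0,
)
print(resp.choices[0].message.content)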

Error Message:

INFO:     Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
[11/18/2025-14:45:53] [TRT-LLM] [E] Traceback (most recent call last):
  File "/home/lliebenwein/dev_local/TensorRT-LLM/tensorrt_llm/serve/openai_server.py", line 499, in openai_chat
    conversation, mm_coroutines, mm_placeholder_counts = parse_chat_messages_coroutines(request.messages, self.model_config, self.multimodal_server_config)
                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lliebenwein/dev_local/TensorRT-LLM/tensorrt_llm/serve/chat_utils.py", line 239, in parse_chat_messages_coroutines
    mm_data_tracker.add_data(mdata["modality"], mdata["data"])
  File "/home/lliebenwein/dev_local/TensorRT-LLM/tensorrt_llm/inputs/utils.py", line 470, in add_data
    placeholder = retrieve_multimodal_placeholder(self._model_type,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lliebenwein/dev_local/TensorRT-LLM/tensorrt_llm/inputs/utils.py", line 418, in retrieve_multimodal_placeholder
    raise TypeError(f"Unknown modality: {modality}")
TypeError: Unknown modality: image

/home/lliebenwein/dev_local/TensorRT-LLM/tensorrt_llm/serve/openai_server.py:563: RuntimeWarning: coroutine 'parse_chat_message_content_part.<locals>.load_image_async' was never awaited
  return self.create_error_response(str(e))
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
INFO:     127.0.0.1:41146 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
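
For context on the failure: judging from the traceback, the placeholder lookup is keyed on the model type known to TRT-LLM's multimodal input registry, which the generic AutoDeploy/HF path never populates. A purely illustrative reconstruction of the failing lookup (the real logic lives in tensorrt_llm/inputs/utils.py and differs in detail):

# Purely illustrative, inferred from the traceback above; not the real implementation.
PLACEHOLDERS = {
    # model_type -> {modality -> placeholder token}, i.e. the kind of entry a
    # hand-written input processor would provide (hypothetical values):
    # "qwen3_vl": {"image": "<|image_pad|>"},
}

def retrieve_multimodal_placeholder(model_type: str, modality: str) -> str:
    per_model = PLACEHOLDERS.get(model_type, {})
    if modality not in per_model:
        # This is the path hit for AutoDeploy-served VLMs today.
        raise TypeError(f"Unknown modality: {modality}")
    return per_model[modality]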

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Assignees: no one assigned
Labels: AutoDeploy<NV>, Multimodal, feature request
Project status: Backlog
Milestone: no milestone