Description
🚀 The feature, motivation and pitch
I ran a quick test, and our trtllm-serve integration for multi-modal models seems to be missing a few features needed to support all VLMs out-of-the-box.
In particular, we rely on a wrapper for HF's multi-modal input processor that is currently not hooked up to trtllm-serve's handling of multi-modal inputs. trtllm-serve assumes that a custom input processor for multi-modal data, implementing TRT-LLM's base class, is available, whereas we just re-use HF's input processor. That leaves two options:
best case scenario --> we hook our generic input processor, which wraps HF's input processor, up to the TRT-LLM base class interface (a rough sketch is included after this list)
worst case scenario --> we have to manually write an input processor for each VLM we enable
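To make the best-case path concrete, here is a minimal sketch of what such a generic adapter could look like. The class name HFGenericInputProcessor, its call signature, and the return shape are assumptions made for illustration and are not the actual TRT-LLM input-processor interface; only the transformers AutoProcessor usage is standard HF.

# Illustrative sketch only: the adapter class and its signature are hypothetical
# stand-ins for TRT-LLM's input-processor base class; the AutoProcessor calls are plain HF.
from typing import Any, Dict, List, Optional, Tuple

from transformers import AutoProcessor


class HFGenericInputProcessor:
    """Hypothetical adapter exposing HF's multi-modal processor through a
    TRT-LLM-style input-processor call."""

    def __init__(self, model_name_or_path: str):
        # HF resolves the matching processor (tokenizer + image processor) per model.
        self.processor = AutoProcessor.from_pretrained(model_name_or_path)

    def __call__(
        self,
        prompt: str,
        multimodal_data: Optional[Dict[str, List[Any]]] = None,
    ) -> Tuple[List[int], Dict[str, Any]]:
        images = (multimodal_data or {}).get("image")
        encoded = self.processor(text=prompt, images=images, return_tensors="pt")
        token_ids = encoded["input_ids"][0].tolist()
        # Pass everything besides the token ids (pixel_values, image_grid_thw, ...)
        # through unchanged so the model forward can consume it.
        extra = {k: v for k, v in encoded.items() if k != "input_ids"}
        return token_ids, extra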
Alternatives
No response
Additional context
Testing trtllm-serve with Qwen3-VL
1. Install latest transformers version: pip install -U transformers~=4.57
2. Apply this patch to avoid a name clash with the manual PT workflow:
diff --git a/tensorrt_llm/_torch/models/modeling_qwen3_next.py b/tensorrt_llm/_torch/models/modeling_qwen3_next.py
index c6bac044f3..60dcc0b57a 100644
--- a/tensorrt_llm/_torch/models/modeling_qwen3_next.py
+++ b/tensorrt_llm/_torch/models/modeling_qwen3_next.py
@@ -319,7 +319,7 @@ class Qwen3NextConfig(PretrainedConfig):
self.mlp_only_layers = mlp_only_layers
-AutoConfig.register("qwen3_next", Qwen3NextConfig)
+# AutoConfig.register("qwen3_next", Qwen3NextConfig)
class Qwen3NextGate(nn.Module):
3. Use qwen3_vl.yaml:
model: Qwen/Qwen3-VL-4B-Instruct
args:
  mode: transformers
  world_size: 1 # can also be > 1
  model_factory: AutoModelForImageTextToText
  max_input_len: 4096
  max_seq_len: 8192
prompt:
  batch_size: 4
  queries:
    - "How big is the universe? "
    - {"prompt": "In simple words and a single sentence, explain the concept of gravity: "}
    # see for chat template format: https://huggingface.co/docs/transformers/en/chat_templating_multimodal
    - - role: user
        content:
          - type: text
            text: How to fix slicing in golf?
    - - role: user
        content:
          - type: text
            text: Please describe the natural scenery you see in the following images
          - type: image
            url: https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png
          - type: image
            url: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png
4. Run example script:
python build_and_run_ad.py --yaml-extra qwen3_vl.yaml
5. Expected Output:
[11/18/2025-14:43:51] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Running example prompts...
Processed requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:14<00:00, 3.72s/it]
[11/18/2025-14:44:07] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [PROMPT 0] How big is the universe? : What is its age?
Answer:
The universe is estimated to be **about 93 billion light-years** in diameter, stretching beyond what we can observe.
It is **approximately 13.8 billion years old**.
This estimate comes from observations of cosmic microwave background radiation and other cosmological data. Although the observable universe is only 93 billion light-years across (due to the expansion of space during the universe's lifespan), the total universe might be much largerβor even infinite.
And
[11/18/2025-14:44:07] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: : Gravity is the invisible force that pulls everything with mass toward each other, making objects fall to the ground and keeping planets in orbit around stars.
[11/18/2025-14:44:07] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [PROMPT 2] <|im_start|>user
How to fix slicing in golf?<|im_end|>
<|im_start|>assistant
: Fixing a **slicing** golf shot — where the ball curves sharply to the right (for right-handed players) or left (for left-handed players) Spain — is a common issue for golfers of all levels. The good news is that slicing is **correctable** with the right technique, mindset, and practice. Here's a step-by-step guide to help you fix it:
---
## 1. **Understand the Cause of Your Slice**
Slicing is usually caused
[11/18/2025-14:44:07] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [PROMPT 3] <|im_start|>user
Please describe the natural scenery you see in the following images<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant
: Based on the two images provided, here is a description of the natural scenery in each:
**Image 1: A Stormy Sea**
This image captures a powerful and dramatic seascape under a heavy, overcast sky.
* **Sky:** The sky is completely overcast with a thick blanket of dark, gray clouds, suggesting an impending or ongoing storm.
* **Sea:** The ocean is turbulent and wild. Large, crested waves are rolling powerfully, with white foam and
6. Spin up trtllm-serve
You can also spin up a trtllm-serve instance with
trtllm-serve serve Qwen/Qwen3-VL-4B-Instruct --backend _autodeploy --extra_llm_api_options qwen3_vl_extra.yaml
where qwen3_vl_extra.yaml is
mode: transformers
model_factory: AutoModelForImageTextToText
max_input_len: 4096
max_seq_len: 8192
And send a request:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-4B-Instruct",
    "messages": [{
      "role": "system",
      "content": "You are a helpful assistant."
    }, {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Tell me the difference between two images"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"
          }
        }
      ]
    }],
    "max_tokens": 64,
    "temperature": 0
  }'
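For scripting, the equivalent request can be sent with the openai Python client; this is just the standard OpenAI-compatible chat-completions call against the endpoint the server exposes at http://localhost:8000/v1.

# Same request via the openai Python client (assumes the default OpenAI-compatible
# endpoint served by trtllm-serve at http://localhost:8000/v1).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-4B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Tell me the difference between two images"},
                {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"}},
                {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"}},
            ],
        },
    ],
    max_tokens=64,
    temperature=0,
)
print(response.choices[0].message.content)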
Error Message:
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
[11/18/2025-14:45:53] [TRT-LLM] [E] Traceback (most recent call last):
File "/home/lliebenwein/dev_local/TensorRT-LLM/tensorrt_llm/serve/openai_server.py", line 499, in openai_chat
conversation, mm_coroutines, mm_placeholder_counts = parse_chat_messages_coroutines(request.messages, self.model_config, self.multimodal_server_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lliebenwein/dev_local/TensorRT-LLM/tensorrt_llm/serve/chat_utils.py", line 239, in parse_chat_messages_coroutines
mm_data_tracker.add_data(mdata["modality"], mdata["data"])
File "/home/lliebenwein/dev_local/TensorRT-LLM/tensorrt_llm/inputs/utils.py", line 470, in add_data
placeholder = retrieve_multimodal_placeholder(self._model_type,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lliebenwein/dev_local/TensorRT-LLM/tensorrt_llm/inputs/utils.py", line 418, in retrieve_multimodal_placeholder
raise TypeError(f"Unknown modality: {modality}")
TypeError: Unknown modality: image
/home/lliebenwein/dev_local/TensorRT-LLM/tensorrt_llm/serve/openai_server.py:563: RuntimeWarning: coroutine 'parse_chat_message_content_part.<locals>.load_image_async' was never awaited
return self.create_error_response(str(e))
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
INFO: 127.0.0.1:41146 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
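For context, the failure pattern suggested by the traceback looks roughly like the sketch below: the placeholder lookup appears to be keyed by model type, so a model type without a registered multi-modal placeholder map ends up rejecting an otherwise valid modality. This is a simplified illustration of that hypothesis, not the actual TRT-LLM source; the image placeholder token is taken from the Qwen output above.

# Simplified illustration of the suspected failure path (not the actual TRT-LLM code).
PLACEHOLDER_MAP = {
    "qwen2_vl": {"image": "<|image_pad|>"},
    # hypothesis: the transformers-mode "qwen3_vl" model type has no entry here,
    # so the lookup below falls through and reports the modality itself as unknown.
}


def retrieve_placeholder(model_type: str, modality: str) -> str:
    modality_map = PLACEHOLDER_MAP.get(model_type, {})
    if modality not in modality_map:
        raise TypeError(f"Unknown modality: {modality}")
    return modality_map[modality]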
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.