
Commit 6d4d179

[TRTLLM-5518] doc: Adding disaggregated serving section to models doc (#4877)
Signed-off-by: Patrice Castonguay <[email protected]>
1 parent e2bd01f commit 6d4d179

2 files changed: +229 -4 lines changed

examples/models/core/deepseek_v3/README.md

Lines changed: 117 additions & 4 deletions
@@ -27,8 +27,10 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
 - [ISL-128k-OSL-1024](#isl-128k-osl-1024)
 - [Evaluation](#evaluation)
 - [Serving](#serving)
-- [Use trtllm-serve](#use-trtllm-serve)
-- [Use tensorrtllm_backend for triton inference server (Experimental)](#use-tensorrtllm_backend-for-triton-inference-server-experimental)
+- [trtllm-serve](#trtllm-serve)
+- [Disaggregated Serving](#disaggregated-serving)
+- [Dynamo](#dynamo)
+- [tensorrtllm_backend for triton inference server (Experimental)](#tensorrtllm_backend-for-triton-inference-server-experimental)
 - [Advanced Usages](#advanced-usages)
 - [Multi-node](#multi-node)
 - [mpirun](#mpirun)
@@ -227,7 +229,7 @@ trtllm-eval --model <YOUR_MODEL_DIR> \
 ```

 ## Serving
-### Use trtllm-serve
+### trtllm-serve

 To serve the model using `trtllm-serve`:

@@ -278,8 +280,119 @@ curl http://localhost:8000/v1/completions \

 For DeepSeek-R1, use the model name `deepseek-ai/DeepSeek-R1`.

+### Disaggregated Serving

-### Use tensorrtllm_backend for triton inference server (Experimental)
+To serve the model in disaggregated mode, you should launch context and generation servers using `trtllm-serve`.
+
+For example, you can launch a single context server on port 8001 with:
+
+```bash
+export TRTLLM_USE_UCX_KVCACHE=1
+
+cat >./ctx-extra-llm-api-config.yml <<EOF
+print_iter_log: true
+enable_attention_dp: true
+EOF
+
+trtllm-serve \
+    deepseek-ai/DeepSeek-V3 \
+    --host localhost \
+    --port 8001 \
+    --backend pytorch \
+    --max_batch_size 161 \
+    --max_num_tokens 1160 \
+    --tp_size 8 \
+    --ep_size 8 \
+    --pp_size 1 \
+    --kv_cache_free_gpu_memory_fraction 0.95 \
+    --extra_llm_api_options ./ctx-extra-llm-api-config.yml &> output_ctx &
+```
+
+You can then launch two generation servers on ports 8002 and 8003 with:
+
+```bash
+export TRTLLM_USE_UCX_KVCACHE=1
+
+cat >./gen-extra-llm-api-config.yml <<EOF
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes:
+- 1
+- 2
+- 4
+- 8
+- 16
+- 32
+- 64
+- 128
+- 256
+- 384
+print_iter_log: true
+enable_attention_dp: true
+EOF
+
+for port in {8002..8003}; do \
+trtllm-serve \
+    deepseek-ai/DeepSeek-V3 \
+    --host localhost \
+    --port ${port} \
+    --backend pytorch \
+    --max_batch_size 161 \
+    --max_num_tokens 1160 \
+    --tp_size 8 \
+    --ep_size 8 \
+    --pp_size 1 \
+    --kv_cache_free_gpu_memory_fraction 0.95 \
+    --extra_llm_api_options ./gen-extra-llm-api-config.yml \
+    &> output_gen_${port} & \
+done
+```
+
+Finally, you can launch the disaggregated server, which will accept requests from the client
+and orchestrate them between the context and generation servers, with:
+
+```bash
+cat >./disagg-config.yml <<EOF
+hostname: localhost
+port: 8000
+backend: pytorch
+context_servers:
+  num_instances: 1
+  urls:
+    - "localhost:8001"
+generation_servers:
+  num_instances: 1
+  urls:
+    - "localhost:8002"
+EOF
+
+trtllm-serve disaggregated -c disagg-config.yml
+```
+
+To query the server, you can start with a `curl` command:
+```bash
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "deepseek-ai/DeepSeek-V3",
+        "prompt": "Where is New York?",
+        "max_tokens": 16,
+        "temperature": 0
+    }'
+```
+
+For DeepSeek-R1, use the model name `deepseek-ai/DeepSeek-R1`.
+
+Note that the optimal disaggregated serving configuration (e.g., tp/pp/ep mappings, number of ctx/gen instances, etc.) will depend
+on the request parameters, the number of concurrent requests, and the GPU type. It is recommended to experiment to identify the optimal
+settings for your specific use case.
+
+### Dynamo
+
+NVIDIA Dynamo is a high-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
+Dynamo supports TensorRT-LLM as one of its inference engines. For details on how to use TensorRT-LLM with Dynamo, please refer to [LLM Deployment Examples using TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md).
+
+### tensorrtllm_backend for triton inference server (Experimental)
 To serve the model using [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend.git), make sure the version is v0.19+ in which the pytorch path is added as an experimental feature.

 The model configuration file is located at https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/llmapi/tensorrt_llm/1/model.yaml

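One thing worth noting when adapting the example above: it launches two generation servers (ports 8002 and 8003), but the sample `disagg-config.yml` registers only the one on port 8002. As a minimal sketch (not taken from the diff above; it assumes the same hostnames and ports and simply extends the `generation_servers` section shown in the example), a config that registers both generation servers could look like this:

```bash
# Hypothetical variant of the disagg-config.yml from the example above:
# registers both generation servers launched on ports 8002 and 8003.
cat >./disagg-config.yml <<EOF
hostname: localhost
port: 8000
backend: pytorch
context_servers:
  num_instances: 1
  urls:
    - "localhost:8001"
generation_servers:
  num_instances: 2
  urls:
    - "localhost:8002"
    - "localhost:8003"
EOF
```

Running `trtllm-serve disaggregated -c disagg-config.yml` with this file should then let the disaggregated server use both generation instances.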
examples/models/core/qwen/README.md

Lines changed: 112 additions & 0 deletions
@@ -22,6 +22,9 @@ This document shows how to build and run a [Qwen](https://huggingface.co/Qwen) m
 - [Run a single inference](#run-a-single-inference)
 - [Evaluation](#evaluation)
 - [Serving](#serving)
+- [trtllm-serve](#trtllm-serve)
+- [Disaggregated Serving](#disaggregated-serving)
+- [Dynamo](#dynamo)
 - [Notes and Troubleshooting](#notes-and-troubleshooting)
 - [Credits](#credits)

@@ -648,6 +651,7 @@ trtllm-eval --model=Qwen3-30B-A3B/ --tokenizer=Qwen3-30B-A3B/ --backend=pytorch
 ```

 ### Serving
+#### trtllm-serve

 To serve the model using `trtllm-serve`:

@@ -695,7 +699,115 @@ curl http://localhost:8000/v1/completions \
 "temperature": 0
 }'
 ```
+#### Disaggregated Serving

+To serve the model in disaggregated mode, you should launch context and generation servers using `trtllm-serve`.
+
+For example, you can launch a single context server on port 8001 with:
+
+```bash
+export TRTLLM_USE_UCX_KVCACHE=1
+
+cat >./ctx-extra-llm-api-config.yml <<EOF
+print_iter_log: true
+enable_attention_dp: true
+EOF
+
+trtllm-serve \
+    Qwen3-30B-A3B/ \
+    --host localhost \
+    --port 8001 \
+    --backend pytorch \
+    --max_batch_size 161 \
+    --max_num_tokens 1160 \
+    --tp_size 1 \
+    --ep_size 1 \
+    --pp_size 1 \
+    --kv_cache_free_gpu_memory_fraction 0.8 \
+    --extra_llm_api_options ./ctx-extra-llm-api-config.yml &> output_ctx &
+```
+
+You can then launch two generation servers on ports 8002 and 8003 with:
+
+```bash
+export TRTLLM_USE_UCX_KVCACHE=1
+
+cat >./gen-extra-llm-api-config.yml <<EOF
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes:
+- 1
+- 2
+- 4
+- 8
+- 16
+- 32
+- 64
+- 128
+- 256
+- 384
+print_iter_log: true
+enable_attention_dp: true
+EOF
+
+for port in {8002..8003}; do \
+trtllm-serve \
+    Qwen3-30B-A3B/ \
+    --host localhost \
+    --port ${port} \
+    --backend pytorch \
+    --max_batch_size 161 \
+    --max_num_tokens 1160 \
+    --tp_size 1 \
+    --ep_size 1 \
+    --pp_size 1 \
+    --kv_cache_free_gpu_memory_fraction 0.8 \
+    --extra_llm_api_options ./gen-extra-llm-api-config.yml \
+    &> output_gen_${port} & \
+done
+```
+
+Finally, you can launch the disaggregated server, which will accept requests from the client
+and orchestrate them between the context and generation servers, with:
+
+```bash
+cat >./disagg-config.yml <<EOF
+hostname: localhost
+port: 8000
+backend: pytorch
+context_servers:
+  num_instances: 1
+  urls:
+    - "localhost:8001"
+generation_servers:
+  num_instances: 1
+  urls:
+    - "localhost:8002"
+EOF
+
+trtllm-serve disaggregated -c disagg-config.yml
+```
+
+To query the server, you can start with a `curl` command:
+```bash
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "Qwen3-30B-A3B/",
+        "prompt": "Please describe what is Qwen.",
+        "max_tokens": 12,
+        "temperature": 0
+    }'
+```
+
+Note that the optimal disaggregated serving configuration (e.g., tp/pp/ep mappings, number of ctx/gen instances, etc.) will depend
+on the request parameters, the number of concurrent requests, and the GPU type. It is recommended to experiment to identify the optimal
+settings for your specific use case.
+
+#### Dynamo
+
+NVIDIA Dynamo is a high-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
+Dynamo supports TensorRT-LLM as one of its inference engines. For details on how to use TensorRT-LLM with Dynamo, please refer to [LLM Deployment Examples using TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md).

 ## Notes and Troubleshooting

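Because the context and generation servers in the example above are started in the background, it can help to wait until they are actually listening before launching the disaggregated server or sending requests. A minimal sketch of one way to do this (not part of the diff above; it assumes `nc` (netcat) is installed and uses the ports from the example):

```bash
# Wait until the context server (8001) and both generation servers (8002, 8003)
# accept TCP connections before starting the disaggregated server or querying it.
for port in 8001 8002 8003; do
    until nc -z localhost "${port}"; do
        echo "Waiting for server on port ${port}..."
        sleep 5
    done
done
echo "All servers are up."
```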