
Commit 6d4d179

[TRTLLM-5518] doc: Adding disaggregated serving section to models doc (#4877)
Signed-off-by: Patrice Castonguay <[email protected]>
1 parent e2bd01f commit 6d4d179

2 files changed: +229 -4 lines changed

examples/models/core/deepseek_v3/README.md

Lines changed: 117 additions & 4 deletions
@@ -27,8 +27,10 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
 - [ISL-128k-OSL-1024](#isl-128k-osl-1024)
 - [Evaluation](#evaluation)
 - [Serving](#serving)
-- [Use trtllm-serve](#use-trtllm-serve)
-- [Use tensorrtllm_backend for triton inference server (Experimental)](#use-tensorrtllm_backend-for-triton-inference-server-experimental)
+- [trtllm-serve](#trtllm-serve)
+- [Disaggregated Serving](#disaggregated-serving)
+- [Dynamo](#dynamo)
+- [tensorrtllm_backend for triton inference server (Experimental)](#tensorrtllm_backend-for-triton-inference-server-experimental)
 - [Advanced Usages](#advanced-usages)
 - [Multi-node](#multi-node)
 - [mpirun](#mpirun)
@@ -227,7 +229,7 @@ trtllm-eval --model <YOUR_MODEL_DIR> \
 ```

 ## Serving
-### Use trtllm-serve
+### trtllm-serve

 To serve the model using `trtllm-serve`:

@@ -278,8 +280,119 @@ curl http://localhost:8000/v1/completions \

 For DeepSeek-R1, use the model name `deepseek-ai/DeepSeek-R1`.

+### Disaggregated Serving

-### Use tensorrtllm_backend for triton inference server (Experimental)
+To serve the model in disaggregated mode, you should launch context and generation servers using `trtllm-serve`.
+
+For example, you can launch a single context server on port 8001 with:
+
+```bash
+export TRTLLM_USE_UCX_KVCACHE=1
+
+cat >./ctx-extra-llm-api-config.yml <<EOF
+print_iter_log: true
+enable_attention_dp: true
+EOF
+
+trtllm-serve \
+    deepseek-ai/DeepSeek-V3 \
+    --host localhost \
+    --port 8001 \
+    --backend pytorch \
+    --max_batch_size 161 \
+    --max_num_tokens 1160 \
+    --tp_size 8 \
+    --ep_size 8 \
+    --pp_size 1 \
+    --kv_cache_free_gpu_memory_fraction 0.95 \
+    --extra_llm_api_options ./ctx-extra-llm-api-config.yml &> output_ctx &
+```
+
+You can then launch two generation servers on ports 8002 and 8003 with:
+
+```bash
+export TRTLLM_USE_UCX_KVCACHE=1
+
+cat >./gen-extra-llm-api-config.yml <<EOF
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes:
+- 1
+- 2
+- 4
+- 8
+- 16
+- 32
+- 64
+- 128
+- 256
+- 384
+print_iter_log: true
+enable_attention_dp: true
+EOF
+
+for port in {8002..8003}; do \
+trtllm-serve \
+    deepseek-ai/DeepSeek-V3 \
+    --host localhost \
+    --port ${port} \
+    --backend pytorch \
+    --max_batch_size 161 \
+    --max_num_tokens 1160 \
+    --tp_size 8 \
+    --ep_size 8 \
+    --pp_size 1 \
+    --kv_cache_free_gpu_memory_fraction 0.95 \
+    --extra_llm_api_options ./gen-extra-llm-api-config.yml \
+    &> output_gen_${port} & \
+done
+```
+
+Finally, you can launch the disaggregated server, which will accept requests from the client
+and orchestrate them between the context and generation servers, with:
+
+```bash
+cat >./disagg-config.yml <<EOF
+hostname: localhost
+port: 8000
+backend: pytorch
+context_servers:
+  num_instances: 1
+  urls:
+    - "localhost:8001"
+generation_servers:
+  num_instances: 1
+  urls:
+    - "localhost:8002"
+EOF
+
+trtllm-serve disaggregated -c disagg-config.yml
+```
+
+To query the server, you can start with a `curl` command:
+```bash
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "deepseek-ai/DeepSeek-V3",
+        "prompt": "Where is New York?",
+        "max_tokens": 16,
+        "temperature": 0
+    }'
+```
+
+For DeepSeek-R1, use the model name `deepseek-ai/DeepSeek-R1`.
+
+Note that the optimal disaggregated serving configuration (e.g., tp/pp/ep mappings, number of ctx/gen instances, etc.) will depend
+on the request parameters, the number of concurrent requests, and the GPU type. It is recommended to experiment to identify the optimal
+settings for your specific use case.
+
+### Dynamo
+
+NVIDIA Dynamo is a high-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
+Dynamo supports TensorRT-LLM as one of its inference engines. For details on how to use TensorRT-LLM with Dynamo, please refer to [LLM Deployment Examples using TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md).
+
+### tensorrtllm_backend for triton inference server (Experimental)
 To serve the model using [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend.git), make sure the version is v0.19+ in which the pytorch path is added as an experimental feature.

 The model configuration file is located at https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/llmapi/tensorrt_llm/1/model.yaml

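One thing worth noting when adapting the example above: it launches two generation servers (ports 8002 and 8003), but the sample `disagg-config.yml` registers only the one on port 8002. As a minimal sketch (not taken from the diff above; it assumes the same hostnames and ports and simply extends the `generation_servers` section shown in the example), a config that registers both generation servers could look like this:

```bash
# Hypothetical variant of the disagg-config.yml from the example above:
# registers both generation servers launched on ports 8002 and 8003.
cat >./disagg-config.yml <<EOF
hostname: localhost
port: 8000
backend: pytorch
context_servers:
  num_instances: 1
  urls:
    - "localhost:8001"
generation_servers:
  num_instances: 2
  urls:
    - "localhost:8002"
    - "localhost:8003"
EOF
```

Running `trtllm-serve disaggregated -c disagg-config.yml` with this file should then let the disaggregated server use both generation instances.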
examples/models/core/qwen/README.md

Lines changed: 112 additions & 0 deletions
@@ -22,6 +22,9 @@ This document shows how to build and run a [Qwen](https://huggingface.co/Qwen) m
 - [Run a single inference](#run-a-single-inference)
 - [Evaluation](#evaluation)
 - [Serving](#serving)
+- [trtllm-serve](#trtllm-serve)
+- [Disaggregated Serving](#disaggregated-serving)
+- [Dynamo](#dynamo)
 - [Notes and Troubleshooting](#notes-and-troubleshooting)
 - [Credits](#credits)

@@ -648,6 +651,7 @@ trtllm-eval --model=Qwen3-30B-A3B/ --tokenizer=Qwen3-30B-A3B/ --backend=pytorch
 ```

 ### Serving
+#### trtllm-serve

 To serve the model using `trtllm-serve`:

@@ -695,7 +699,115 @@ curl http://localhost:8000/v1/completions \
 "temperature": 0
 }'
 ```
+#### Disaggregated Serving

+To serve the model in disaggregated mode, you should launch context and generation servers using `trtllm-serve`.
+
+For example, you can launch a single context server on port 8001 with:
+
+```bash
+export TRTLLM_USE_UCX_KVCACHE=1
+
+cat >./ctx-extra-llm-api-config.yml <<EOF
+print_iter_log: true
+enable_attention_dp: true
+EOF
+
+trtllm-serve \
+    Qwen3-30B-A3B/ \
+    --host localhost \
+    --port 8001 \
+    --backend pytorch \
+    --max_batch_size 161 \
+    --max_num_tokens 1160 \
+    --tp_size 1 \
+    --ep_size 1 \
+    --pp_size 1 \
+    --kv_cache_free_gpu_memory_fraction 0.8 \
+    --extra_llm_api_options ./ctx-extra-llm-api-config.yml &> output_ctx &
+```
+
+You can then launch two generation servers on ports 8002 and 8003 with:
+
+```bash
+export TRTLLM_USE_UCX_KVCACHE=1
+
+cat >./gen-extra-llm-api-config.yml <<EOF
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes:
+- 1
+- 2
+- 4
+- 8
+- 16
+- 32
+- 64
+- 128
+- 256
+- 384
+print_iter_log: true
+enable_attention_dp: true
+EOF
+
+for port in {8002..8003}; do \
+trtllm-serve \
+    Qwen3-30B-A3B/ \
+    --host localhost \
+    --port ${port} \
+    --backend pytorch \
+    --max_batch_size 161 \
+    --max_num_tokens 1160 \
+    --tp_size 1 \
+    --ep_size 1 \
+    --pp_size 1 \
+    --kv_cache_free_gpu_memory_fraction 0.8 \
+    --extra_llm_api_options ./gen-extra-llm-api-config.yml \
+    &> output_gen_${port} & \
+done
+```
+
+Finally, you can launch the disaggregated server, which will accept requests from the client
+and orchestrate them between the context and generation servers, with:
+
+```bash
+cat >./disagg-config.yml <<EOF
+hostname: localhost
+port: 8000
+backend: pytorch
+context_servers:
+  num_instances: 1
+  urls:
+    - "localhost:8001"
+generation_servers:
+  num_instances: 1
+  urls:
+    - "localhost:8002"
+EOF
+
+trtllm-serve disaggregated -c disagg-config.yml
+```
+
+To query the server, you can start with a `curl` command:
+```bash
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "Qwen3-30B-A3B/",
+        "prompt": "Please describe what is Qwen.",
+        "max_tokens": 12,
+        "temperature": 0
+    }'
+```
+
+Note that the optimal disaggregated serving configuration (e.g., tp/pp/ep mappings, number of ctx/gen instances, etc.) will depend
+on the request parameters, the number of concurrent requests, and the GPU type. It is recommended to experiment to identify the optimal
+settings for your specific use case.
+
+#### Dynamo
+
+NVIDIA Dynamo is a high-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
+Dynamo supports TensorRT-LLM as one of its inference engines. For details on how to use TensorRT-LLM with Dynamo, please refer to [LLM Deployment Examples using TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md).

 ## Notes and Troubleshooting

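Because the context and generation servers in the example above are started in the background, it can help to wait until they are actually listening before launching the disaggregated server or sending requests. A minimal sketch of one way to do this (not part of the diff above; it assumes `nc` (netcat) is installed and uses the ports from the example):

```bash
# Wait until the context server (8001) and both generation servers (8002, 8003)
# accept TCP connections before starting the disaggregated server or querying it.
for port in 8001 8002 8003; do
    until nc -z localhost "${port}"; do
        echo "Waiting for server on port ${port}..."
        sleep 5
    done
done
echo "All servers are up."
```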