
Commit a2e964d

[None][doc] Minor doc update to disagg-serving (NVIDIA#8768)
Signed-off-by: Sharan Chetlur <[email protected]>
1 parent 834a780 commit a2e964d

File tree: 1 file changed, +20 −25 lines changed

docs/source/features/disagg-serving.md

Lines changed: 20 additions & 25 deletions
@@ -1,18 +1,13 @@
-# Disaggregated Serving (Beta)
-
-```{note}
-Note:
-This feature is currently in beta, and the related APIs are subjected to change in future versions.
-```
+# Disaggregated Serving
 
 - [Motivation](#Motivation)
 - [KV Cache Exchange](#KV-Cache-Exchange)
   - [Multi-backend Support](#Multi-backend-Support)
   - [Overlap Optimization](#Overlap-Optimization)
   - [Cache Layout Transformation](#Cache-Layout-Transformation)
 - [Usage](#Usage)
-  - [trtllm-serve](#trtllm-serve)
   - [Dynamo](#Dynamo)
+  - [trtllm-serve](#trtllm-serve)
 - [Environment Variables](#Environment-Variables)
 - [Troubleshooting and FAQ](#Troubleshooting-and-FAQ)
 
@@ -84,9 +79,26 @@ The optimizations required for KV cache transmission vary depending on whether i
 
 ## Usage
 
+### Dynamo
+
+The first approach involves the use of [Dynamo](https://github.com/ai-dynamo/dynamo), a data center-scale inference server developed specifically for LLM workloads. Dynamo introduces several advanced features not present in the other methods, including decoupled pre- and post-processing workers, which are particularly beneficial under high concurrency conditions. The disaggregated LLM inference workflow with Dynamo is illustrated in Figure 7.
+
+<div align="center">
+<figure>
+    <img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture4.png" width="800" height="auto">
+</figure>
+</div>
+<p align="center"><sub><em>Figure 7. Dynamo integration with disaggregated service</em></sub></p>
+
+In the Dynamo workflow, requests are initially processed by pre- and post-processing workers, which then query a smart router to determine the optimal decode worker to route the requests to. Depending on the availability of KV cache blocks, the decoder worker may bypass the prefill stage or forward the request to the prefill worker. Once the prefill worker is done processing the prompt, the KV cache blocks can be sent from the prefill worker to the decoder worker, using the metadata referred to as ctx_params in the figure above.
+
+Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
+
+For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html).
+
 ### trtllm-serve
 
-The first approach to do disaggregated LLM inference with TensorRT LLM involves launching a separate OpenAI-compatible server per context and generation instance using `trtllm-serve`. An additional server, referred to as the "disaggregated" server, is also launched with `trtllm-serve` and acts as an orchestrator which receives client requests and dispatches them to the appropriate context and generation servers via OpenAI REST API. Figure 6 below illustrates the disaggregated serving workflow when using this approach. When a context instance is done generating the KV blocks associated with the prompt, it returns a response to the disaggregated server. This response includes the prompt tokens, the first generated token and metadata associated with the context request and context instance. This metadata is referred to as context parameters (`ctx_params` in Figure 6). These parameters are then used by the generation instances to establish communication with the context instance and retrieve the KV cache blocks associated with the request.
+The second approach to evaluate disaggregated LLM inference with TensorRT LLM involves launching a separate OpenAI-compatible server per context and generation instance using `trtllm-serve`. An additional server, referred to as the "disaggregated" server, is also launched with `trtllm-serve` and acts as an orchestrator which receives client requests and dispatches them to the appropriate context and generation servers via OpenAI REST API. Figure 6 below illustrates the disaggregated serving workflow when using this approach. When a context instance is done generating the KV blocks associated with the prompt, it returns a response to the disaggregated server. This response includes the prompt tokens, the first generated token and metadata associated with the context request and context instance. This metadata is referred to as context parameters (`ctx_params` in Figure 6). These parameters are then used by the generation instances to establish communication with the context instance and retrieve the KV cache blocks associated with the request.
 
 <div align="center">
 <figure>
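
For readers following the `trtllm-serve` paragraph added in the hunk above, the described flow (one OpenAI-compatible server per context/generation instance plus a "disaggregated" orchestrator) is typically driven by a small YAML config and a few launch commands. The sketch below is illustrative only and not part of this commit: the model name, hostnames, ports, and the exact `disagg_config.yaml` fields are assumptions based on the TensorRT-LLM documentation.

```bash
# Minimal sketch of the trtllm-serve approach (values are placeholders).
# One context server and one generation server, each an OpenAI-compatible
# trtllm-serve instance listening on its own port.
trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 &
trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 &

# Assumed layout of the orchestrator config: it lists which URLs serve as
# context servers and which serve as generation servers.
cat > disagg_config.yaml <<'EOF'
hostname: localhost
port: 8000
context_servers:
  num_instances: 1
  urls:
    - "localhost:8001"
generation_servers:
  num_instances: 1
  urls:
    - "localhost:8002"
EOF

# The "disaggregated" server acts as the orchestrator: it receives client
# requests on port 8000 and dispatches them to the context and generation
# servers over the OpenAI REST API.
trtllm-serve disaggregated -c disagg_config.yaml
```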
@@ -171,23 +183,6 @@ curl http://localhost:8000/v1/completions \
 
 Please refer to [Disaggregated Inference Benchmark Scripts](../../../examples/disaggregated/slurm).
 
-### Dynamo
-
-The second approach involves the use of [Dynamo](https://github.com/ai-dynamo/dynamo), a data center-scale inference server developed specifically for LLM workloads. Dynamo introduces several advanced features not present in the other methods, including decoupled pre- and post-processing workers, which are particularly beneficial under high concurrency conditions. The disaggregated LLM inference workflow with Dynamo is illustrated in Figure 7.
-
-<div align="center">
-<figure>
-    <img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture4.png" width="800" height="auto">
-</figure>
-</div>
-<p align="center"><sub><em>Figure 7. Dynamo integration with disaggregated service</em></sub></p>
-
-In the Dynamo workflow, requests are initially processed by pre- and post-processing workers, which then query a smart router to determine the optimal decode worker to route the requests to. Depending on the availability of KV cache blocks, the decoder worker may bypass the prefill stage or forward the request to the prefill worker. Once the prefill worker is done processing the prompt, the KV cache blocks can be sent from the prefill worker to the decoder worker, using the metadata referred to as ctx_params in the figure above.
-
-Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
-
-For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html).
-
 ## Environment Variables
 
 TRT-LLM uses some environment variables to control the behavior of disaggregated service.
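
The hunk header above shows the documentation's client example targeting `http://localhost:8000/v1/completions`, i.e. the disaggregated orchestrator rather than an individual context or generation server. A complete request might look like the hedged sketch below; the port and route come from that curl snippet, while the model name, prompt, and sampling parameters are placeholders.

```bash
# Hypothetical completion request to the disaggregated server. The orchestrator
# exposes a standard OpenAI-compatible /v1/completions endpoint and routes the
# request first to a context server, then to a generation server.
curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "NVIDIA is a great company because",
        "max_tokens": 16,
        "temperature": 0
    }'
```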
