|
1 | | -# Disaggregated Serving (Beta) |
2 | | - |
3 | | -```{note} |
4 | | -Note: |
5 | | -This feature is currently in beta, and the related APIs are subjected to change in future versions. |
6 | | -``` |
| 1 | +# Disaggregated Serving |
7 | 2 |
|
8 | 3 | - [Motivation](#Motivation) |
9 | 4 | - [KV Cache Exchange](#KV-Cache-Exchange) |
10 | 5 | - [Multi-backend Support](#Multi-backend-Support) |
11 | 6 | - [Overlap Optimization](#Overlap-Optimization) |
12 | 7 | - [Cache Layout Transformation](#Cache-Layout-Transformation) |
13 | 8 | - [Usage](#Usage) |
14 | | - - [trtllm-serve](#trtllm-serve) |
15 | 9 | - [Dynamo](#Dynamo) |
| 10 | + - [trtllm-serve](#trtllm-serve) |
16 | 11 | - [Environment Variables](#Environment-Variables) |
17 | 12 | - [Troubleshooting and FAQ](#Troubleshooting-and-FAQ) |
18 | 13 |
|
@@ -84,9 +79,26 @@ The optimizations required for KV cache transmission vary depending on whether i |
84 | 79 |
|
85 | 80 | ## Usage |
86 | 81 |
|
| 82 | +### Dynamo |
| 83 | + |
| 84 | +The first approach involves the use of [Dynamo](https://github.com/ai-dynamo/dynamo), a data center-scale inference server developed specifically for LLM workloads. Dynamo introduces several advanced features not present in the other methods, including decoupled pre- and post-processing workers, which are particularly beneficial under high concurrency conditions. The disaggregated LLM inference workflow with Dynamo is illustrated in Figure 7. |
| 85 | + |
| 86 | +<div align="center"> |
| 87 | +<figure> |
| 88 | + <img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture4.png" width="800" height="auto"> |
| 89 | +</figure> |
| 90 | +</div> |
| 91 | +<p align="center"><sub><em>Figure 7. Dynamo integration with disaggregated service</em></sub></p> |
| 92 | + |
| 93 | +In the Dynamo workflow, requests are initially processed by pre- and post-processing workers, which then query a smart router to determine the optimal decode worker to route the requests to. Depending on the availability of KV cache blocks, the decode worker may bypass the prefill stage or forward the request to the prefill worker. Once the prefill worker has finished processing the prompt, the KV cache blocks are sent from the prefill worker to the decode worker, using the metadata referred to as `ctx_params` in the figure above.
| 94 | + |
| 95 | +Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments. |
| 96 | + |
| 97 | +For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html). |
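
As an illustration of the client-facing side, the sketch below sends a completion request to Dynamo's OpenAI-compatible frontend; the prefill/decode disaggregation is handled behind this endpoint. The host, port, endpoint path, and model name here are assumptions and depend on how the frontend is deployed.

```bash
# Minimal sketch: host/port and model name are placeholders for an already-running Dynamo frontend.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "prompt": "Disaggregated serving splits prefill and decode because",
    "max_tokens": 32
  }'
```

From the client's perspective this looks identical to querying a single aggregated server; the routing between prefill and decode workers happens inside Dynamo.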
| 98 | + |
87 | 99 | ### trtllm-serve |
88 | 100 |
|
89 | | -The first approach to do disaggregated LLM inference with TensorRT LLM involves launching a separate OpenAI-compatible server per context and generation instance using `trtllm-serve`. An additional server, referred to as the "disaggregated" server, is also launched with `trtllm-serve` and acts as an orchestrator which receives client requests and dispatches them to the appropriate context and generation servers via OpenAI REST API. Figure 6 below illustrates the disaggregated serving workflow when using this approach. When a context instance is done generating the KV blocks associated with the prompt, it returns a response to the disaggregated server. This response includes the prompt tokens, the first generated token and metadata associated with the context request and context instance. This metadata is referred to as context parameters (`ctx_params` in Figure 6). These parameters are then used by the generation instances to establish communication with the context instance and retrieve the KV cache blocks associated with the request. |
| 101 | +The second approach to disaggregated LLM inference with TensorRT LLM involves launching a separate OpenAI-compatible server per context and generation instance using `trtllm-serve`. An additional server, referred to as the "disaggregated" server, is also launched with `trtllm-serve` and acts as an orchestrator: it receives client requests and dispatches them to the appropriate context and generation servers via the OpenAI REST API. Figure 6 below illustrates the disaggregated serving workflow when using this approach. When a context instance has finished generating the KV cache blocks associated with the prompt, it returns a response to the disaggregated server. This response includes the prompt tokens, the first generated token, and metadata associated with the context request and context instance. This metadata is referred to as context parameters (`ctx_params` in Figure 6). These parameters are then used by the generation instances to establish communication with the context instance and retrieve the KV cache blocks associated with the request.
90 | 102 |
|
91 | 103 | <div align="center"> |
92 | 104 | <figure> |
@@ -171,23 +183,6 @@ curl http://localhost:8000/v1/completions \ |
171 | 183 |
|
172 | 184 | Please refer to [Disaggregated Inference Benchmark Scripts](../../../examples/disaggregated/slurm). |
173 | 185 |
|
174 | | -### Dynamo |
175 | | - |
176 | | -The second approach involves the use of [Dynamo](https://github.com/ai-dynamo/dynamo), a data center-scale inference server developed specifically for LLM workloads. Dynamo introduces several advanced features not present in the other methods, including decoupled pre- and post-processing workers, which are particularly beneficial under high concurrency conditions. The disaggregated LLM inference workflow with Dynamo is illustrated in Figure 7. |
177 | | - |
178 | | -<div align="center"> |
179 | | -<figure> |
180 | | - <img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog5_Picture4.png" width="800" height="auto"> |
181 | | -</figure> |
182 | | -</div> |
183 | | -<p align="center"><sub><em>Figure 7. Dynamo integration with disaggregated service</em></sub></p> |
184 | | - |
185 | | -In the Dynamo workflow, requests are initially processed by pre- and post-processing workers, which then query a smart router to determine the optimal decode worker to route the requests to. Depending on the availability of KV cache blocks, the decoder worker may bypass the prefill stage or forward the request to the prefill worker. Once the prefill worker is done processing the prompt, the KV cache blocks can be sent from the prefill worker to the decoder worker, using the metadata referred to as ctx_params in the figure above. |
186 | | - |
187 | | -Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments. |
188 | | - |
189 | | -For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html). |
190 | | - |
191 | 186 | ## Environment Variables |
192 | 187 |
|
193 | 188 | TRT-LLM uses some environment variables to control the behavior of the disaggregated service.
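
As a minimal sketch of how these variables are typically applied, the snippet below exports a KV cache transfer toggle before launching a worker. The variable name `TRTLLM_USE_UCX_KVCACHE`, the model name, and the launch flags are assumptions for illustration and may differ across TensorRT-LLM versions; consult the variables documented in this section for your release.

```bash
# Illustrative only: the variable name and flags below are assumptions and may differ across versions.
# Set the variable in the environment of each context/generation worker before launch so that
# all instances use the same KV cache transfer backend.
export TRTLLM_USE_UCX_KVCACHE=1
trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001
```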
|