Finally, you can launch the disaggregated server, which accepts requests from the client and handles the orchestration between the context and generation servers:

```bash
cat >./disagg-config.yaml <<EOF
hostname: localhost
port: 8000
backend: pytorch
context_servers:
  num_instances: 1
  urls:
    - "localhost:8001"
generation_servers:
  num_instances: 1
  urls:
    - "localhost:8002"
EOF

trtllm-serve disaggregated -c disagg-config.yaml
```
To query the server, you can start with a `curl` command:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-V3",
        "prompt": "Where is New York?",
        "max_tokens": 16,
        "temperature": 0
    }'
```

For DeepSeek-R1, use the model name `deepseek-ai/DeepSeek-R1`.
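For example, assuming the context and generation servers were launched with the DeepSeek-R1 checkpoint, the same request only needs the model name changed:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "prompt": "Where is New York?",
        "max_tokens": 16,
        "temperature": 0
    }'
```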
Note that the optimal disaggregated serving configuration (e.g., TP/PP/EP mappings and the number of context/generation instances) depends on the request parameters, the number of concurrent requests, and the GPU type. It is recommended to experiment to identify the optimal settings for your specific use case.
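As one illustration of these knobs, the sketch below scales out the generation side by registering a second generation instance with the orchestrator. The extra endpoint `localhost:8003` is a placeholder and assumes you have already launched another generation server there:

```bash
# Hypothetical sketch: one context server, two generation servers.
cat >./disagg-config.yaml <<EOF
hostname: localhost
port: 8000
backend: pytorch
context_servers:
  num_instances: 1
  urls:
    - "localhost:8001"
generation_servers:
  num_instances: 2
  urls:
    - "localhost:8002"
    - "localhost:8003"
EOF

trtllm-serve disaggregated -c disagg-config.yaml
```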
### Dynamo
NVIDIA Dynamo is a high-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
Dynamo supports TensorRT-LLM as one of its inference engines. For details on how to use TensorRT-LLM with Dynamo, please refer to [LLM Deployment Examples using TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md).
### tensorrtllm_backend for Triton Inference Server (Experimental)

To serve the model using [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend.git), make sure the version is v0.19 or later, in which the PyTorch path is added as an experimental feature.

The model configuration file is located at https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/llmapi/tensorrt_llm/1/model.yaml

Finally, you can launch the disaggregated server, which accepts requests from the client and handles the orchestration between the context and generation servers:

```bash
cat >./disagg-config.yaml <<EOF
hostname: localhost
port: 8000
backend: pytorch
context_servers:
  num_instances: 1
  urls:
    - "localhost:8001"
generation_servers:
  num_instances: 1
  urls:
    - "localhost:8002"
EOF

trtllm-serve disaggregated -c disagg-config.yaml
```
To query the server, you can start with a `curl` command:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen3-30B-A3B/",
        "prompt": "Please describe what is Qwen.",
        "max_tokens": 12,
        "temperature": 0
    }'
```
802
+
803
+
Note that the optimal disaggregated serving configuration (i.e. tp/pp/ep mappings, number of ctx/gen instances, etc.) will depend
804
+
on the request parameters, the number of concurrent requests and the GPU type. It is recommended to experiment to identify optimal
805
+
settings for your specific use case.
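To get a rough feel for behavior under concurrency, you can fire a handful of requests in parallel from the shell as a quick smoke test; for real measurements, use a dedicated benchmarking tool:

```bash
# Send four identical requests concurrently and wait for all of them (smoke test only, not a benchmark)
for i in $(seq 1 4); do
  curl -s http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
          "model": "Qwen3-30B-A3B/",
          "prompt": "Please describe what is Qwen.",
          "max_tokens": 12,
          "temperature": 0
      }' &
done
wait
```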
806
+
807
+
### Dynamo
808
+
809
+
NVIDIA Dynamo is a high-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
Dynamo supports TensorRT-LLM as one of its inference engines. For details on how to use TensorRT-LLM with Dynamo, please refer to [LLM Deployment Examples using TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md).