[Feature]: AutoDeploy: investigate latency of first request in trtllm-serve #9276

@lucaslie

🚀 The feature, motivation and pitch

Reported by @2ez4bz: the first request in trtllm-serve or dynamo is much slower than subsequent requests. From logging in dynamo, @2ez4bz appears to have traced the slowdown to the initial calls to flashinfer.

@lucaslie and @2ez4bz discussed whether flashinfer prefill may be what triggers it, since AD (AutoDeploy) doesn't do any warmup for prefill.
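
To make the gap between the first and later requests easy to reproduce, a small timing harness along these lines could help. This is only a sketch under assumptions: it presumes the server exposes an OpenAI-compatible completions route, and the helper names (`make_http_sender`, `time_requests`, `first_request_overhead`), the payload fields, and the default prompt are illustrative, not taken from this issue or from the TensorRT-LLM code base.

```python
import json
import statistics
import time
import urllib.request


def make_http_sender(url, model, prompt="Hello", max_tokens=8):
    """Return a zero-argument callable that sends one completion request.

    Assumption: `url` points at an OpenAI-compatible completions route and
    `model` is whatever model name the running server accepts.
    """
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")

    def send():
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            resp.read()  # drain the body so the full response time is measured

    return send


def time_requests(send, n=5):
    """Call send() n times sequentially; return per-call wall-clock seconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        send()
        latencies.append(time.perf_counter() - start)
    return latencies


def first_request_overhead(latencies):
    """Seconds by which the first call exceeds the median of the later calls.

    Needs at least two measurements.
    """
    return latencies[0] - statistics.median(latencies[1:])
```

Against a freshly started server, one would build a sender with something like `make_http_sender("http://localhost:8000/v1/completions", "<model-name>")` (host, port, and model name are placeholders), pass it to `time_requests`, and compare `first_request_overhead` before and after adding a prefill warmup. Sending requests sequentially keeps batching effects out of the measurement.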

Alternatives

No response

Additional context

No response

Metadata

Assignees

Labels

  • AutoDeploy: <NV> AutoDeploy Backend
  • feature request: New feature or request. This includes new model, dtype, functionality support

Type

Projects

Status

Backlog

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests