[Question] Unexpectedly low prefill (TTFT) latency ratio #3021

@Kepontry

Hello,

I'm testing the Qwen2.5-1.5B model with openvino.genai and observing PerfMetrics that seem counter-intuitive.

The Time to First Token (TTFT) accounts for a very small fraction of the total Generate call duration, even with a large prompt.

Observed Metrics

Case 1 (Short Prompt)

Input: 32 tokens

Output: ~130 tokens

Time to First Token: 110,097 us

Total Generate Duration: 20,228,544 us

Ratio (TTFT/Total): ~0.54%

Case 2 (Long Prompt)

Input: ~1024 tokens

Output: ~1024 tokens

Time to First Token: 850,237 us

Total Generate Duration: 121,392,448 us

Ratio (TTFT/Total): ~0.70%
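As a quick sanity check on the two ratios, the arithmetic can be reproduced with a minimal Python sketch. The helper name `ttft_share` is only illustrative and is not part of openvino.genai; the inputs are the microsecond values reported in the two cases.

```python
def ttft_share(ttft_us: int, total_us: int) -> float:
    """Return TTFT as a percentage of the total generate duration."""
    return 100.0 * ttft_us / total_us

# Values in microseconds, copied from Case 1 and Case 2.
case1 = ttft_share(110_097, 20_228_544)
case2 = ttft_share(850_237, 121_392_448)

print(f"Case 1: {case1:.2f}%")  # 0.54%
print(f"Case 2: {case2:.2f}%")  # 0.70%
```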

Question

Is this behavior expected? With a 1K-token prompt, prefill accounts for less than 1% of the total generation time, which seems unusually low. It suggests that either the decode stage is disproportionately slow or Time to First Token isn't capturing the full prefill cost.
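One way to put numbers on the "decode is slow" hypothesis is to derive per-token costs from the reported figures. The sketch below is a rough estimate under two assumptions that the metrics themselves do not confirm: that Time to First Token covers prefill plus the first output token, and that the approximate output counts (~130 and ~1024) are exact. The function name `implied_rates` is hypothetical.

```python
# Figures copied from the two cases above; output counts are approximate.
CASES = {
    "Case 1": {"input_tokens": 32, "output_tokens": 130,
               "ttft_us": 110_097, "total_us": 20_228_544},
    "Case 2": {"input_tokens": 1024, "output_tokens": 1024,
               "ttft_us": 850_237, "total_us": 121_392_448},
}

def implied_rates(input_tokens, output_tokens, ttft_us, total_us):
    """Split the total duration into prefill and decode costs per token.

    Assumes TTFT covers prefill plus the first output token, so the
    remaining (output_tokens - 1) tokens share the rest of the time.
    """
    decode_us = total_us - ttft_us
    decode_ms_per_token = decode_us / (output_tokens - 1) / 1000.0
    prefill_ms_per_token = ttft_us / input_tokens / 1000.0
    return prefill_ms_per_token, decode_ms_per_token

results = {name: implied_rates(**c) for name, c in CASES.items()}
for name, (prefill_ms, decode_ms) in results.items():
    print(f"{name}: prefill {prefill_ms:.2f} ms/token, "
          f"decode {decode_ms:.1f} ms/token")
```

Under those assumptions this works out to roughly 156 ms and 118 ms per decoded token, against about 3.4 ms and 0.8 ms per prompt token during prefill.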

Could you please confirm if these metrics are reasonable or if I might be misinterpreting the data?

Environment Details

Hardware: Intel Core Ultra 7 258V

OS: Ubuntu 24.04

OpenVINO Version: 2025.2.0

Model Precision: INT8

Thank you.
