Description
Hello,
I'm testing the Qwen2.5-1.5B model with openvino.genai and observing PerfMetrics that seem counter-intuitive.
The Time to First Token (TTFT) accounts for a very small fraction of the total Generate call duration, even with a large prompt.
Observed Metrics
Case 1 (Short Prompt)
Input: 32 tokens
Output: ~130 tokens
Time to First Token: 110,097 us
Total Generate Duration: 20,228,544 us
Ratio (TTFT/Total): ~0.54%
Case 2 (Long Prompt)
Input: ~1024 tokens
Output: ~1024 tokens
Time to First Token: 850,237 us
Total Generate Duration: 121,392,448 us
Ratio (TTFT/Total): ~0.70%
Question
Is this behavior expected? Prefill taking less than 1% of the total generation time for a 1K-token prompt seems unusually low. It suggests either that the decode stage is disproportionately slow or that Time to First Token isn't capturing the full prefill cost.
Could you please confirm if these metrics are reasonable or if I might be misinterpreting the data?
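For reference, here is a minimal sketch of how the ratios above and the implied per-token decode cost follow from the raw numbers. It is plain arithmetic and does not call openvino.genai. The `Run` class and `derived` helper are hypothetical names used only for illustration, and the token counts reported above as approximate (~130, ~1024) are treated as exact, so the decode figures are estimates.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One measured generate call (hypothetical container, not an OpenVINO type)."""
    input_tokens: int
    output_tokens: int   # assumption: the approximate counts are treated as exact
    ttft_us: int         # Time to First Token, microseconds
    total_us: int        # Total Generate Duration, microseconds


def derived(run: Run) -> dict:
    """Derive TTFT share, prefill throughput, and per-token decode cost."""
    # Everything after the first token is attributed to decode.
    decode_us = run.total_us - run.ttft_us
    # The first token is produced by prefill, so decode yields the remaining ones.
    decode_tokens = run.output_tokens - 1
    return {
        "ttft_share_pct": 100.0 * run.ttft_us / run.total_us,
        "prefill_tok_per_s": run.input_tokens / (run.ttft_us / 1e6),
        "decode_ms_per_token": decode_us / decode_tokens / 1e3,
        "decode_tok_per_s": decode_tokens / (decode_us / 1e6),
    }


case1 = Run(input_tokens=32, output_tokens=130, ttft_us=110_097, total_us=20_228_544)
case2 = Run(input_tokens=1024, output_tokens=1024, ttft_us=850_237, total_us=121_392_448)

for name, run in (("Case 1", case1), ("Case 2", case2)):
    metrics = derived(run)
    print(name, {key: round(value, 2) for key, value in metrics.items()})
```

Under these assumptions, decode works out to roughly 156 ms per token in Case 1 and roughly 118 ms per token in Case 2, which is where nearly all of the generate time goes.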
Environment Details
Hardware: Intel Core Ultra 7 258V
OS: Ubuntu 24.04
OpenVINO Version: 2025.2.0
Model Precision: INT8
Thank you.