From 4fb3771d7feea6ab6de9eb0539868d32ddf5d98d Mon Sep 17 00:00:00 2001 From: Anthony Casagrande Date: Wed, 1 Oct 2025 21:23:50 -0700 Subject: [PATCH] docs: add comprehensive metrics docs Signed-off-by: Anthony Casagrande --- README.md | 110 ++++- docs/metrics_reference.md | 841 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 949 insertions(+), 2 deletions(-) create mode 100644 docs/metrics_reference.md diff --git a/README.md b/README.md index b710dbd67..226ef6f04 100644 --- a/README.md +++ b/README.md @@ -12,7 +12,7 @@ SPDX-License-Identifier: Apache-2.0 [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/aiperf) -**[Architecture](docs/architecture.md)**| **[Design Proposals](https://github.com/ai-dynamo/enhancements)** | **[Migrating from Genai-Perf](docs/migrating.md)** | **[CLI Options](docs/cli_options.md)** +**[Architecture](docs/architecture.md)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** | **[Migrating from Genai-Perf](docs/migrating.md)** | **[CLI Options](docs/cli_options.md)** | **[Metrics Reference](docs/metrics_reference.md)** AIPerf is a comprehensive benchmarking tool that measures the performance of generative AI models served by your preferred inference solution. @@ -96,7 +96,6 @@ aiperf profile --benchmark-duration 300.0 --benchmark-grace-period 30.0 [other o
-
+
+## Metrics Reference
+
+AIPerf provides a comprehensive set of metrics organized into functional categories. For detailed descriptions, requirements, and nuances of each metric, see the **[Complete Metrics Reference](docs/metrics_reference.md)**.
+
+### Streaming Metrics
+
+Metrics specific to streaming requests that measure real-time token generation characteristics. Requires the `--streaming` flag.
+
+| Metric | Tag | Formula | Unit |
+|--------|-----|---------|------|
+| [**Time to First Token (TTFT)**](docs/metrics_reference.md#time-to-first-token-ttft) | `ttft` | `responses[0].perf_ns - request.start_perf_ns` | `ms` |
+| [**Time to Second Token (TTST)**](docs/metrics_reference.md#time-to-second-token-ttst) | `ttst` | `responses[1].perf_ns - responses[0].perf_ns` | `ms` |
+| [**Inter Token Latency (ITL)**](docs/metrics_reference.md#inter-token-latency-itl) | `inter_token_latency` | `(request_latency - ttft) / (output_sequence_length - 1)` | `ms` |
+| [**Inter Chunk Latency (ICL)**](docs/metrics_reference.md#inter-chunk-latency-icl) | `inter_chunk_latency` | `[responses[i].perf_ns - responses[i-1].perf_ns for i in range(1, len(responses))]` | `ms` |
+| [**Output Token Throughput Per User**](docs/metrics_reference.md#output-token-throughput-per-user) | `output_token_throughput_per_user` | `1.0 / inter_token_latency_seconds` | `tokens/sec/user` |
+| [**Prefill Throughput**](docs/metrics_reference.md#prefill-throughput) | `prefill_throughput` | `input_sequence_length / ttft_seconds` | `tokens/sec` |
+
+### Token Based Metrics
+
+Metrics for token-producing endpoints that track token counts and throughput. Requires text-generating endpoints (chat, completion, etc.).
+
+| Metric | Tag | Formula | Unit |
+|--------|-----|---------|------|
+| [**Output Token Count**](docs/metrics_reference.md#output-token-count) | `output_token_count` | `len(tokenizer.encode(content, add_special_tokens=False))` | `tokens` |
+| [**Output Sequence Length (OSL)**](docs/metrics_reference.md#output-sequence-length-osl) | `output_sequence_length` | `(output_token_count or 0) + (reasoning_token_count or 0)` | `tokens` |
+| [**Input Sequence Length (ISL)**](docs/metrics_reference.md#input-sequence-length-isl) | `input_sequence_length` | `len(tokenizer.encode(prompt, add_special_tokens=False))` | `tokens` |
+| [**Total Output Tokens**](docs/metrics_reference.md#total-output-tokens) | `total_output_tokens` | `sum(r.output_token_count for r in records if r.valid)` | `tokens` |
+| [**Total Output Sequence Length**](docs/metrics_reference.md#total-output-sequence-length) | `total_osl` | `sum(r.output_sequence_length for r in records if r.valid)` | `tokens` |
+| [**Total Input Sequence Length**](docs/metrics_reference.md#total-input-sequence-length) | `total_isl` | `sum(r.input_sequence_length for r in records if r.valid)` | `tokens` |
+| [**Output Token Throughput**](docs/metrics_reference.md#output-token-throughput) | `output_token_throughput` | `total_osl / benchmark_duration_seconds` | `tokens/sec` |
+
+### Reasoning Metrics
+
+Metrics specific to models that support reasoning/thinking tokens. Requires models that return a separate `reasoning_content` field.
+
+| Metric | Tag | Formula | Unit |
+|--------|-----|---------|------|
+| [**Reasoning Token Count**](docs/metrics_reference.md#reasoning-token-count) | `reasoning_token_count` | `len(tokenizer.encode(reasoning_content, add_special_tokens=False))` | `tokens` |
+| [**Total Reasoning Tokens**](docs/metrics_reference.md#total-reasoning-tokens) | `total_reasoning_tokens` | `sum(r.reasoning_token_count for r in records if r.valid)` | `tokens` |
+
+### Usage Field Metrics
+
+Metrics tracking API-reported token counts from the `usage` field in responses. Useful for comparing client-side vs server-side token counts.
+
+| Metric | Tag | Formula | Unit |
+|--------|-----|---------|------|
+| [**Usage Prompt Tokens**](docs/metrics_reference.md#usage-prompt-tokens) | `usage_prompt_tokens` | `response.usage.prompt_tokens` | `tokens` |
+| [**Usage Completion Tokens**](docs/metrics_reference.md#usage-completion-tokens) | `usage_completion_tokens` | `response.usage.completion_tokens` | `tokens` |
+| [**Usage Total Tokens**](docs/metrics_reference.md#usage-total-tokens) | `usage_total_tokens` | `response.usage.total_tokens` | `tokens` |
+| [**Usage Reasoning Tokens**](docs/metrics_reference.md#usage-reasoning-tokens) | `usage_reasoning_tokens` | `response.usage.completion_tokens_details.reasoning_tokens` | `tokens` |
+| [**Total Usage Prompt Tokens**](docs/metrics_reference.md#total-usage-prompt-tokens) | `total_usage_prompt_tokens` | `sum(r.usage_prompt_tokens for r in records if r.valid)` | `tokens` |
+| [**Total Usage Completion Tokens**](docs/metrics_reference.md#total-usage-completion-tokens) | `total_usage_completion_tokens` | `sum(r.usage_completion_tokens for r in records if r.valid)` | `tokens` |
+| [**Total Usage Total Tokens**](docs/metrics_reference.md#total-usage-total-tokens) | `total_usage_total_tokens` | `sum(r.usage_total_tokens for r in records if r.valid)` | `tokens` |
+
+### Usage Discrepancy Metrics
+
+Metrics measuring differences between API-reported and client-computed token counts.
+
+| Metric | Tag | Formula | Unit |
+|--------|-----|---------|------|
+| [**Usage Prompt Tokens Diff %**](docs/metrics_reference.md#usage-prompt-tokens-diff-) | `usage_prompt_tokens_diff_pct` | `abs((usage_prompt_tokens - input_sequence_length) / input_sequence_length) * 100` | `%` |
+| [**Usage Completion Tokens Diff %**](docs/metrics_reference.md#usage-completion-tokens-diff-) | `usage_completion_tokens_diff_pct` | `abs((usage_completion_tokens - output_sequence_length) / output_sequence_length) * 100` | `%` |
+| [**Usage Reasoning Tokens Diff %**](docs/metrics_reference.md#usage-reasoning-tokens-diff-) | `usage_reasoning_tokens_diff_pct` | `abs((usage_reasoning_tokens - reasoning_token_count) / reasoning_token_count) * 100` | `%` |
+| [**Usage Discrepancy Count**](docs/metrics_reference.md#usage-discrepancy-count) | `usage_discrepancy_count` | `sum(1 for r in records if r.any_diff > threshold)` | `requests` |
+
+### Goodput Metrics
+
+Metrics measuring throughput of requests meeting user-defined Service Level Objectives (SLOs).
+
+| Metric | Tag | Formula | Unit |
+|--------|-----|---------|------|
+| [**Good Request Count**](docs/metrics_reference.md#good-request-count) | `good_request_count` | `sum(1 for r in records if r.all_slos_met)` | `requests` |
+| [**Goodput**](docs/metrics_reference.md#goodput) | `goodput` | `good_request_count / benchmark_duration_seconds` | `requests/sec` |
+
+### Error Metrics
+
+Metrics computed for failed/error requests.
+
+| Metric | Tag | Formula | Unit |
+|--------|-----|---------|------|
+| [**Error Input Sequence Length**](docs/metrics_reference.md#error-input-sequence-length) | `error_isl` | `input_sequence_length` (for error requests) | `tokens` |
+| [**Total Error Input Sequence Length**](docs/metrics_reference.md#total-error-input-sequence-length) | `total_error_isl` | `sum(r.input_sequence_length for r in records if not r.valid)` | `tokens` |
+| [**Error Request Count**](docs/metrics_reference.md#error-request-count) | `error_request_count` | `sum(1 for r in records if not r.valid)` | `requests` |
+
+### General Metrics
+
+Metrics available for all benchmark runs with no special requirements.
+
+| Metric | Tag | Formula | Unit |
+|--------|-----|---------|------|
+| [**Request Latency**](docs/metrics_reference.md#request-latency) | `request_latency` | `responses[-1].perf_ns - request.start_perf_ns` | `ms` |
+| [**Request Throughput**](docs/metrics_reference.md#request-throughput) | `request_throughput` | `request_count / benchmark_duration_seconds` | `requests/sec` |
+| [**Request Count**](docs/metrics_reference.md#request-count) | `request_count` | `sum(1 for r in records if r.valid)` | `requests` |
+| [**Minimum Request Timestamp**](docs/metrics_reference.md#minimum-request-timestamp) | `min_request_timestamp` | `min(r.timestamp_ns for r in records)` | `datetime` |
+| [**Maximum Response Timestamp**](docs/metrics_reference.md#maximum-response-timestamp) | `max_response_timestamp` | `max(r.timestamp_ns + r.request_latency for r in records)` | `datetime` |
+| [**Benchmark Duration**](docs/metrics_reference.md#benchmark-duration) | `benchmark_duration` | `max_response_timestamp - min_request_timestamp` | `sec` |
+
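+The formulas above use a shorthand record notation (`r.timestamp_ns`, `r.request_latency`, `r.output_sequence_length`, `r.valid`) rather than a public AIPerf API. As a rough sketch under that assumption, the aggregate and derived formulas compose like this:
+
+```python
+# Illustrative only: how the aggregate/derived formulas in the tables fit together.
+# `Record` mirrors the shorthand used in the formula column; it is not an AIPerf class.
+from dataclasses import dataclass
+
+
+@dataclass
+class Record:
+    timestamp_ns: int            # wall-clock send time (ns)
+    request_latency_ns: int      # responses[-1].perf_ns - request.start_perf_ns
+    output_sequence_length: int  # output + reasoning tokens
+    valid: bool                  # True if the request succeeded
+
+
+def summarize(records: list[Record]) -> dict[str, float]:
+    valid = [r for r in records if r.valid]
+    min_request_timestamp = min(r.timestamp_ns for r in records)
+    max_response_timestamp = max(r.timestamp_ns + r.request_latency_ns for r in records)
+    duration_s = (max_response_timestamp - min_request_timestamp) / 1e9
+    total_osl = sum(r.output_sequence_length for r in valid)
+    return {
+        "request_count": len(valid),                        # requests
+        "benchmark_duration": duration_s,                   # sec
+        "request_throughput": len(valid) / duration_s,      # requests/sec
+        "output_token_throughput": total_osl / duration_s,  # tokens/sec
+    }
+```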
+ + ## Known Issues - Output sequence length constraints (`--output-tokens-mean`) cannot be guaranteed unless you pass `ignore_eos` and/or `min_tokens` via `--extra-inputs` to an inference server that supports them. diff --git a/docs/metrics_reference.md b/docs/metrics_reference.md new file mode 100644 index 000000000..2ee21ff42 --- /dev/null +++ b/docs/metrics_reference.md @@ -0,0 +1,841 @@ + +# AIPerf Metrics Reference + +This document provides a comprehensive reference of all metrics available in AIPerf for benchmarking LLM inference performance. Metrics are organized by computation type to help you understand when and how each metric is calculated. + +## Table of Contents + +- [Quick Reference](#quick-reference) +- [Understanding Metric Types](#understanding-metric-types) + - [Record Metrics](#record-metrics) + - [Aggregate Metrics](#aggregate-metrics) + - [Derived Metrics](#derived-metrics) +- [Detailed Metric Descriptions](#detailed-metric-descriptions) + - [Streaming Metrics](#streaming-metrics) + - [Time to First Token (TTFT)](#time-to-first-token-ttft) + - [Time to Second Token (TTST)](#time-to-second-token-ttst) + - [Inter Token Latency (ITL)](#inter-token-latency-itl) + - [Inter Chunk Latency (ICL)](#inter-chunk-latency-icl) + - [Output Token Throughput Per User](#output-token-throughput-per-user) + - [Prefill Throughput](#prefill-throughput) + - [Token Based Metrics](#token-based-metrics) + - [Output Token Count](#output-token-count) + - [Output Sequence Length (OSL)](#output-sequence-length-osl) + - [Input Sequence Length (ISL)](#input-sequence-length-isl) + - [Total Output Tokens](#total-output-tokens) + - [Total Output Sequence Length](#total-output-sequence-length) + - [Total Input Sequence Length](#total-input-sequence-length) + - [Output Token Throughput](#output-token-throughput) + - [Reasoning Metrics](#reasoning-metrics) + - [Reasoning Token Count](#reasoning-token-count) + - [Total Reasoning Tokens](#total-reasoning-tokens) + - [Usage Field Metrics](#usage-field-metrics) + - [Usage Prompt Tokens](#usage-prompt-tokens) + - [Usage Completion Tokens](#usage-completion-tokens) + - [Usage Total Tokens](#usage-total-tokens) + - [Usage Reasoning Tokens](#usage-reasoning-tokens) + - [Total Usage Prompt Tokens](#total-usage-prompt-tokens) + - [Total Usage Completion Tokens](#total-usage-completion-tokens) + - [Total Usage Total Tokens](#total-usage-total-tokens) + - [Usage Discrepancy Metrics](#usage-discrepancy-metrics) + - [Usage Prompt Tokens Diff %](#usage-prompt-tokens-diff-) + - [Usage Completion Tokens Diff %](#usage-completion-tokens-diff-) + - [Usage Reasoning Tokens Diff %](#usage-reasoning-tokens-diff-) + - [Usage Discrepancy Count](#usage-discrepancy-count) + - [Goodput Metrics](#goodput-metrics) + - [Good Request Count](#good-request-count) + - [Goodput](#goodput) + - [Error Metrics](#error-metrics) + - [Error Input Sequence Length](#error-input-sequence-length) + - [Total Error Input Sequence Length](#total-error-input-sequence-length) + - [General Metrics](#general-metrics) + - [Request Latency](#request-latency) + - [Request Throughput](#request-throughput) + - [Request Count](#request-count) + - [Error Request Count](#error-request-count) + - [Minimum Request Timestamp](#minimum-request-timestamp) + - [Maximum Response Timestamp](#maximum-response-timestamp) + - [Benchmark Duration](#benchmark-duration) +- [Metric Flags Reference](#metric-flags-reference) + +--- + +## Quick Reference + +For a quick reference of all metrics with their tags, formulas, and units, 
see the **[Metrics Reference section in the README](../README.md#metrics-reference)**. + +The sections below provide detailed descriptions, requirements, and notes for each metric. + +--- + +## Understanding Metric Types + +AIPerf computes metrics in three distinct phases during benchmark execution: **Record Metrics**, **Aggregate Metrics**, and **Derived Metrics**. + +## Record Metrics + +Record Metrics are computed **individually** for **each request** and its **response(s)** during the benchmark run. A single request may have one response (non-streaming) or multiple responses (streaming). These metrics capture **per-request characteristics** such as latency, token counts, and streaming behavior. Record metrics produce **statistical distributions** (min, max, mean, median, p90, p99, etc.) that reveal performance variability across requests. + +### Example Metrics +`request_latency`, `ttft`, `inter_token_latency`, `output_token_count`, `input_sequence_length` + +### Dependencies +Record Metrics can depend on raw request/response data and other Record Metrics from the same request. + +### Example Scenario +`request_latency` measures the time for each individual request from start to final response. If you send 100 requests, you get 100 latency values that form a distribution showing how latency varies across requests. + +## Aggregate Metrics + +Aggregate Metrics are computed by **tracking** or **accumulating** values across **all requests** in **real-time** during the benchmark. These include counters, min/max timestamps, and other global statistics. Aggregate metrics produce a **single value** representing the entire benchmark run. + +### Example Metrics +`request_count`, `error_request_count`, `min_request_timestamp`, `max_response_timestamp` + +### Dependencies +Aggregate Metrics can depend on raw request/response data, Record Metrics and other Aggregate Metrics. + +### Example Scenario +`request_count` increments by 1 for each successful request. At the end of a benchmark with 100 successful requests, this metric equals 100 (a single value, not a distribution). + +## Derived Metrics + +Derived Metrics are computed by applying **mathematical formulas** to other metric results, but are **not** computed per-record like Record Metrics. Instead, these metrics depend on one or more **prerequisite metrics** being available first and are calculated either **after the benchmark completes** for final results or in **real-time** across **all current data** for live metrics display. Derived metrics can produce either single values or distributions depending on their dependencies. + +### Example Metrics +`request_throughput`, `output_token_throughput`, `benchmark_duration` + +### Dependencies +Derived Metrics can depend on Record Metrics, Aggregate Metrics, and other Derived Metrics, but do not have +any knowledge of the individual request/response data. + +### Example Scenario +`request_throughput` is computed from `request_count / benchmark_duration_seconds`. This requires both `request_count` and `benchmark_duration` to be available first, then applies a formula to produce a single throughput value (e.g., 10.5 requests/sec). + +--- + +# Detailed Metric Descriptions + +## Streaming Metrics + +> [!NOTE] +> All metrics in this section require the `--streaming` flag with a token-producing endpoint and at least one non-empty response chunk. 
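+
+As a worked example, the per-request streaming metrics in this section can all be derived from one request's chunk arrival times. The timestamps below are made up, and `start_perf_ns` / `chunk_perf_ns` simply follow the notation of the formulas that follow rather than an actual AIPerf object:
+
+```python
+# Made-up perf-counter values for a single streaming request (nanoseconds).
+start_perf_ns = 0
+chunk_perf_ns = [120e6, 155e6, 190e6, 230e6]  # arrival times of 4 non-empty chunks
+output_sequence_length = 8                    # suppose the chunks decode to 8 tokens
+
+request_latency_ns = chunk_perf_ns[-1] - start_perf_ns
+ttft_ns = chunk_perf_ns[0] - start_perf_ns                          # 120 ms
+ttst_ns = chunk_perf_ns[1] - chunk_perf_ns[0]                       # 35 ms
+icl_ns = [b - a for a, b in zip(chunk_perf_ns, chunk_perf_ns[1:])]  # [35, 35, 40] ms
+itl_ns = (request_latency_ns - ttft_ns) / (output_sequence_length - 1)
+
+print(ttft_ns / 1e6, ttst_ns / 1e6, round(itl_ns / 1e6, 2))  # 120.0 35.0 15.71
+```
+
+Output Token Throughput Per User for this request would then be `1.0 / (itl_ns / 1e9)`, roughly 64 tokens/sec/user.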
+ +### Time to First Token (TTFT) + +**Type:** [Record Metric](#record-metrics) + +Measures how long it takes to receive the first token (or chunk of tokens) after sending a request. This is critical for user-perceived responsiveness in streaming scenarios, as it represents how quickly the model begins generating output. + +**Formula:** +```python +# nanoseconds +ttft_ns = request.responses[0].perf_ns - request.start_perf_ns + +# Convert to milliseconds for display +ttft_ms = ttft_ns / 1e6 + +# Convert to seconds for throughput calculations +ttft_seconds = ttft_ns / 1e9 +``` + +**Notes:** +- Includes network latency, queuing time, prompt processing, and generation of the first token (or chunk of tokens). +- Raw timestamps are in nanoseconds; converted to milliseconds for display and seconds for rate calculations. +- Response chunks refer to individual messages with non-empty content received during streaming. + +--- + +### Time to Second Token (TTST) + +**Type:** [Record Metric](#record-metrics) + +Measures the time gap between the first and second chunk of tokens. This metric helps identify generation startup overhead separate from steady-state streaming throughput. + +**Formula:** +```python +# nanoseconds +ttst_ns = request.responses[1].perf_ns - request.responses[0].perf_ns + +# Convert to milliseconds for display +ttst_ms = ttst_ns / 1e6 +``` + +**Notes:** +- Requires at least 2 non-empty response chunks to compute the time between first and second tokens. +- Raw timestamps are in nanoseconds; converted to milliseconds for display. + +--- + +### Inter Token Latency (ITL) + +**Type:** [Record Metric](#record-metrics) + +Measures the average time between consecutive tokens during generation, excluding the initial TTFT overhead. This represents the steady-state token generation rate. + +**Formula:** +```python +# Calculate in nanoseconds, then convert to seconds +inter_token_latency_ns = (request_latency_ns - ttft_ns) / (output_sequence_length - 1) + +# Convert to seconds for throughput calculations +inter_token_latency_seconds = inter_token_latency_ns / 1e9 + +# Convert to milliseconds for display +inter_token_latency_ms = inter_token_latency_ns / 1e6 +``` + +**Notes:** +- Requires at least 2 non-empty response chunks and valid `ttft`, `request_latency`, and `output_sequence_length` metrics. +- Result is in seconds when used for throughput calculations (Output Token Throughput Per User). + +--- + +### Inter Chunk Latency (ICL) + +**Type:** [Record Metric](#record-metrics) + +Captures the time gaps between all consecutive response chunks in a streaming response, providing a distribution of chunk arrival times rather than a single average. Note that this is different from the ITL metric, which measures the time between consecutive tokens regardless of chunk size. + +**Formula:** +```python +inter_chunk_latency = [request.responses[i].perf_ns - request.responses[i-1].perf_ns for i in range(1, len(request.responses))] +``` + +**Notes:** +- Requires at least 2 response chunks. +- Unlike ITL (which produces a single average), ICL provides the full distribution of inter-chunk times. +- Useful for detecting variability, jitter, or issues in streaming delivery. +- Analyzing ICL distributions can reveal batching behavior, scheduling issues, or network variability. 
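+
+For instance, a simple way to quantify that variability is to compare the tail of the ICL distribution to its median. The latency values below are invented, and nothing here depends on AIPerf's export format:
+
+```python
+# Invented inter-chunk latencies (ms); a couple of chunks arrive roughly 2x late.
+import statistics
+
+icl_ms = [34.8, 35.1, 35.0, 72.4, 34.9, 35.2, 35.0, 71.9, 35.1, 35.0]
+
+p50 = statistics.median(icl_ms)
+p99 = statistics.quantiles(icl_ms, n=100)[98]
+print(f"median={p50:.1f} ms  p99={p99:.1f} ms  p99/median={p99 / p50:.2f}")
+# A p99 far above the median (about 2x here) points to periodic stalls such as
+# batching or scheduler pauses rather than uniformly slow decoding.
+```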
+ +--- + +### Output Token Throughput Per User + +**Type:** [Record Metric](#record-metrics) + +> [!IMPORTANT] +> This metric is computed per-request, and it excludes the TTFT from the equation, so it is **not** directly comparable to the [Output Token Throughput](#output-token-throughput) metric. + +The token generation rate experienced by an individual user/request, measured as the inverse of inter-token latency. This represents single-request streaming performance. + +**Formula:** +```python +output_token_throughput_per_user = 1.0 / inter_token_latency_seconds +``` + +**Notes:** +- Computes the inverse of ITL to show tokens per second from an individual user's perspective. +- Differs from Output Token Throughput (aggregate across all concurrent requests) by focusing on single-request experience. +- Useful for understanding the user experience independent of concurrency effects. + +--- + +### Prefill Throughput + +**Type:** [Record Metric](#record-metrics) + +Measures the rate at which input tokens are processed during the prefill phase, calculated as input tokens per second based on TTFT. + +**Formula:** +```python +prefill_throughput = input_sequence_length / ttft_seconds +``` + +**Notes:** +- Higher values indicate faster prompt processing. +- Useful for understanding input processing capacity and bottlenecks. +- Depends on Input Sequence Length and TTFT metrics. + +--- + +## Token Based Metrics + +> [!NOTE] +> All metrics in this section require token-producing endpoints that return text content (chat, completion, etc.). These metrics are not available for embeddings or other non-generative endpoints. + +### Output Token Count + +**Type:** [Record Metric](#record-metrics) + +The number of output tokens generated for a single request, _excluding reasoning tokens_. This represents the output tokens returned to the user across all responses for the request. + +**Formula:** +```python +output_token_count = len(tokenizer.encode(content, add_special_tokens=False)) +``` + +**Notes:** +- Tokenization uses `add_special_tokens=False` to count only content tokens, excluding special tokens added by the tokenizer. +- For streaming requests with multiple responses, the responses are joined together and then tokens are counted. +- For models that expose reasoning in a separate `reasoning_content` field, this metric counts only non-reasoning output tokens. +- If reasoning appears inside the regular `content` (e.g., `` blocks), those tokens will be counted unless explicitly filtered. + +--- + +### Output Sequence Length (OSL) + +**Type:** [Record Metric](#record-metrics) + +The total number of completion tokens (output + reasoning) generated for a single request across all its responses. This represents the complete token generation workload for the request. + +**Formula:** +```python +output_sequence_length = (output_token_count or 0) + (reasoning_token_count or 0) +``` + +**Notes:** +- For models that do not support/separate reasoning tokens, OSL equals the output token count. + +--- + +### Input Sequence Length (ISL) + +**Type:** [Record Metric](#record-metrics) + +The number of input/prompt tokens for a single request. This represents the size of the input sent to the model. + +**Formula:** +```python +input_sequence_length = len(tokenizer.encode(prompt, add_special_tokens=False)) +``` + +**Notes:** +- Tokenization uses `add_special_tokens=False` to count only content tokens, excluding special tokens added by the tokenizer. 
+- Useful for understanding the relationship between input size and latency/throughput. + +--- + +### Total Output Tokens + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The sum of all output tokens (excluding reasoning tokens) generated across all requests. This represents the total output token workload. + +**Formula:** +```python +total_output_tokens = sum(r.output_token_count for r in records if r.valid) +``` + +**Notes:** +- Aggregates output tokens across all successful requests. +- Useful for capacity planning and cost estimation. + +--- + +### Total Output Sequence Length + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The sum of all completion tokens (output + reasoning) generated across all requests. This represents the complete token generation workload. + +**Formula:** +```python +total_osl = sum(r.output_sequence_length for r in records if r.valid) +``` + +**Notes:** +- Aggregates the complete token generation workload including both output and reasoning tokens. +- For models without reasoning tokens, this equals Total Output Tokens. + +--- + +### Total Input Sequence Length + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The sum of all input/prompt tokens processed across all requests. This represents the total input workload sent to the model. + +**Formula:** +```python +total_isl = sum(r.input_sequence_length for r in records if r.valid) +``` + +**Notes:** +- Useful for understanding the input workload, capacity planning, and analyzing the relationship between input size and system performance. + +--- + +### Output Token Throughput + +**Type:** [Derived Metric](#derived-metrics) + +> [!IMPORTANT] +> This metric is computed as a single value across all requests and includes TTFT in the equation, so it is **not** directly comparable to the [Output Token Throughput Per User](#output-token-throughput-per-user) metric. + +The aggregate token generation rate across all concurrent requests, measured as total tokens per second. This represents the system's overall token generation capacity. + +**Formula:** +```python +output_token_throughput = benchmark_token_count / benchmark_duration_seconds +``` + +**Notes:** +- Measures aggregate throughput across all concurrent requests; represents the overall system token generation rate. +- Higher values indicate better system utilization and capacity. +- Uses the hidden `benchmark_token_count` metric (sum of all output sequence lengths) as the numerator. + +--- + +## Reasoning Metrics + +> [!NOTE] +> All metrics in this section require models and backends that expose reasoning content in a separate `reasoning_content` field, distinct from the regular `content` field. + +### Reasoning Token Count + +**Type:** [Record Metric](#record-metrics) + +The number of reasoning tokens generated for a single request. These are tokens used for "thinking" or chain-of-thought reasoning before generating the final output. + +**Formula:** +```python +reasoning_token_count = len(tokenizer.encode(reasoning_content, add_special_tokens=False)) +``` + +**Notes:** +- Tokenization uses `add_special_tokens=False` to count only content tokens, excluding special tokens added by the tokenizer. +- Does **not** differentiate `` tags or extract reasoning from within the regular `content` field. + +--- + +### Total Reasoning Tokens + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The sum of all reasoning tokens generated across all requests. This represents the total reasoning/thinking workload. 
+ +**Formula:** +```python +total_reasoning_tokens = sum(r.reasoning_token_count for r in records if r.valid) +``` + +**Notes:** +- Useful for understanding the reasoning overhead and cost for reasoning-enabled models. + +--- + +## Usage Field Metrics + +> [!NOTE] +> All metrics in this section track API-reported token counts from the `usage` field in API responses. These are **not displayed in console output** but are available in exports. These metrics are useful for comparing client-side token counts with server-reported counts to detect discrepancies. + +### Usage Prompt Tokens + +**Type:** [Record Metric](#record-metrics) + +The number of input/prompt tokens as reported by the API's `usage.prompt_tokens` field for a single request. + +**Formula:** +```python +usage_prompt_tokens = response.usage.prompt_tokens # from last non-None response +``` + +**Notes:** +- Taken from the API response `usage` object, not computed by AIPerf. +- May differ from client-side Input Sequence Length due to different tokenizers or special tokens. +- For streaming responses, uses the last non-None value reported. + +--- + +### Usage Completion Tokens + +**Type:** [Record Metric](#record-metrics) + +The number of completion tokens as reported by the API's `usage.completion_tokens` field for a single request. + +**Formula:** +```python +usage_completion_tokens = response.usage.completion_tokens # from last non-None response +``` + +**Notes:** +- Taken from the API response `usage` object, not computed by AIPerf. +- May differ from client-side Output Sequence Length due to different tokenizers or counting methods. +- For streaming responses, uses the last non-None value reported. + +--- + +### Usage Total Tokens + +**Type:** [Record Metric](#record-metrics) + +The total number of tokens (prompt + completion) as reported by the API's `usage.total_tokens` field for a single request. + +**Formula:** +```python +usage_total_tokens = response.usage.total_tokens # from last non-None response +``` + +**Notes:** +- Taken from the API response `usage` object, not computed by AIPerf. +- Should generally equal `usage_prompt_tokens + usage_completion_tokens`. +- For streaming responses, uses the last non-None value reported. + +--- + +### Usage Reasoning Tokens + +**Type:** [Record Metric](#record-metrics) + +The number of reasoning tokens as reported by the API's `usage.completion_tokens_details.reasoning_tokens` field for a single request. Only available for reasoning-enabled models. + +**Formula:** +```python +usage_reasoning_tokens = response.usage.completion_tokens_details.reasoning_tokens +``` + +**Notes:** +- Taken from the API response for reasoning-enabled models. +- May differ from client-side Reasoning Token Count due to different tokenizers. +- For streaming responses, uses the last non-None value reported. + +--- + +### Total Usage Prompt Tokens + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The sum of all API-reported prompt tokens across all requests. + +**Formula:** +```python +total_usage_prompt_tokens = sum(r.usage_prompt_tokens for r in records if r.valid) +``` + +**Notes:** +- Aggregates server-reported input tokens across all requests. + +--- + +### Total Usage Completion Tokens + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The sum of all API-reported completion tokens across all requests. 
+ +**Formula:** +```python +total_usage_completion_tokens = sum(r.usage_completion_tokens for r in records if r.valid) +``` + +**Notes:** +- Aggregates server-reported completion tokens across all requests. + +--- + +### Total Usage Total Tokens + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The sum of all API-reported total tokens across all requests. + +**Formula:** +```python +total_usage_total_tokens = sum(r.usage_total_tokens for r in records if r.valid) +``` + +**Notes:** +- Aggregates server-reported total tokens across all requests. + +--- + +## Usage Discrepancy Metrics + +> [!NOTE] +> These metrics measure the percentage difference between API-reported token counts (`usage` fields) and client-computed token counts. They are **not displayed in console output** but help identify tokenizer mismatches or counting discrepancies. + +### Usage Prompt Tokens Diff % + +**Type:** [Record Metric](#record-metrics) + +The percentage difference between API-reported prompt tokens and client-computed Input Sequence Length. + +**Formula:** +```python +usage_prompt_tokens_diff_pct = abs((usage_prompt_tokens - input_sequence_length) / input_sequence_length) * 100 +``` + +**Notes:** +- Values close to 0% indicate good agreement between client and server token counts. +- Large differences may indicate tokenizer mismatches or special token handling differences. + +--- + +### Usage Completion Tokens Diff % + +**Type:** [Record Metric](#record-metrics) + +The percentage difference between API-reported completion tokens and client-computed Output Sequence Length. + +**Formula:** +```python +usage_completion_tokens_diff_pct = abs((usage_completion_tokens - output_sequence_length) / output_sequence_length) * 100 +``` + +**Notes:** +- Values close to 0% indicate good agreement between client and server token counts. +- Large differences may indicate tokenizer mismatches or different counting methods. + +--- + +### Usage Reasoning Tokens Diff % + +**Type:** [Record Metric](#record-metrics) + +The percentage difference between API-reported reasoning tokens and client-computed Reasoning Token Count. + +**Formula:** +```python +usage_reasoning_tokens_diff_pct = abs((usage_reasoning_tokens - reasoning_token_count) / reasoning_token_count) * 100 +``` + +**Notes:** +- Only available for reasoning-enabled models. +- Values close to 0% indicate good agreement between client and server reasoning token counts. + +--- + +### Usage Discrepancy Count + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The number of requests where token count differences exceed a threshold (default 10%). + +**Formula:** +```python +usage_discrepancy_count = sum(1 for r in records if r.any_diff > threshold) +``` + +**Notes:** +- Default threshold is 10% difference. +- Counts requests where prompt, completion, or reasoning token differences are significant. +- Useful for monitoring overall token count agreement quality. + +--- + +## Goodput Metrics + +> [!NOTE] +> Goodput metrics measure the throughput of requests that meet user-defined Service Level Objectives (SLOs). See the [Goodput tutorial](../docs/tutorials/goodput.md) for configuration details. + +### Good Request Count + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The number of requests that meet all user-defined SLO thresholds during the benchmark. + +**Formula:** +```python +good_request_count = sum(1 for r in records if r.all_slos_met) +``` + +**Notes:** +- Requires SLO thresholds to be configured (e.g., `--goodput`). 
+- Only counts requests where ALL SLO constraints are satisfied. +- Used to calculate Goodput metric. + +--- + +### Goodput + +**Type:** [Derived Metric](#derived-metrics) + +The rate of SLO-compliant requests per second. This represents the effective throughput of requests meeting quality requirements. + +**Formula:** +```python +goodput = good_request_count / benchmark_duration_seconds +``` + +**Notes:** +- Requires SLO thresholds to be configured. +- Always less than or equal to Request Throughput. +- Useful for capacity planning and comparing systems based on quality-adjusted throughput. + +--- + +## Error Metrics + +> [!NOTE] +> These metrics are computed only for failed/error requests and are **not displayed in console output**. + +### Error Input Sequence Length + +**Type:** [Record Metric](#record-metrics) + +The number of input tokens for requests that resulted in errors. This helps analyze whether input size correlates with errors. + +**Formula:** +```python +error_isl = input_sequence_length # for error requests only +``` + +**Notes:** +- Only computed for requests that failed. +- Useful for identifying if certain input sizes trigger errors. + +--- + +### Total Error Input Sequence Length + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The sum of all input tokens from requests that resulted in errors. + +**Formula:** +```python +total_error_isl = sum(r.input_sequence_length for r in records if not r.valid) +``` + +**Notes:** +- Aggregates input tokens across all failed requests. + +--- + +## General Metrics + +> [!NOTE] +> Metrics in this section are available for all benchmark runs with no special requirements. + +### Request Latency + +**Type:** [Record Metric](#record-metrics) + +Measures the total end-to-end time from sending a request until receiving the final response. For streaming requests with multiple responses, this measures until the last response is received. This is the complete time experienced by the client for a single request. + +**Formula:** +```python +request_latency_ns = request.responses[-1].perf_ns - request.start_perf_ns +``` + +**Notes:** +- Includes all components: network time, queuing, prompt processing, token generation, and response transmission. +- For streaming requests, measures from request start to the final chunk received. + +--- + +### Request Throughput + +**Type:** [Derived Metric](#derived-metrics) + +The overall rate of completed requests per second across the entire benchmark. This represents the system's ability to process requests under the given concurrency and load. + +**Formula:** +```python +request_throughput = request_count / benchmark_duration_seconds +``` + +**Notes:** +- Captures the aggregate request processing rate; higher values indicate better system throughput. +- Affected by concurrency level, request complexity, output sequence length, and system capacity. + +--- + +### Request Count + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The total number of **successfully completed** requests in the benchmark. This includes all requests that received valid responses, regardless of streaming mode. + +**Formula:** +```python +request_count = sum(1 for r in records if r.valid) +``` + +--- + +### Error Request Count + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The total number of failed/error requests encountered during the benchmark. This includes network errors, HTTP errors, timeout errors, and other failures. 
+ +**Formula:** +```python +error_request_count = sum(1 for r in records if not r.valid) +``` + +**Notes:** +- Error rate can be computed as `error_request_count / (request_count + error_request_count)`. + +--- + +### Minimum Request Timestamp + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The wall-clock timestamp of the first request sent in the benchmark. This is used to calculate the benchmark duration and represents the start of the benchmark run. + +**Formula:** +```python +min_request_timestamp = min(r.timestamp_ns for r in records) +``` + +--- + +### Maximum Response Timestamp + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The wall-clock timestamp of the last response received in the benchmark. This is used to calculate the benchmark duration and represents the end of the benchmark run. + +**Formula:** +```python +max_response_timestamp = max(r.timestamp_ns + r.request_latency for r in records) +``` + +--- + +### Benchmark Duration + +**Type:** [Derived Metric](#derived-metrics) + +The total elapsed time from the first request sent to the last response received. This represents the complete wall-clock duration of the benchmark run. + +**Formula:** +```python +benchmark_duration = max_response_timestamp - min_request_timestamp +``` + +**Notes:** +- Uses wall-clock timestamps representing real calendar time. +- Used as the denominator for throughput calculations; represents the effective measurement window. + +--- + +# Metric Flags Reference + +Metric flags are used to control when and how metrics are computed, displayed, and grouped. Flags can be combined using bitwise operations to create composite behaviors. + +## Individual Flags + +| Flag | Description | Impact | +|------|-------------|--------| +| `NONE` | No flags set | Metric has default behavior with no special restrictions | +| `STREAMING_ONLY` | Only computed for streaming responses | Requires Server-Sent Events (SSE) with multiple response chunks; skipped for non-streaming requests | +| `ERROR_ONLY` | Only computed for error requests | Tracks error-specific information; computed only for invalid/failed requests | +| `PRODUCES_TOKENS_ONLY` | Only computed for token-producing endpoints | Requires endpoints that return text/token content; skipped for embeddings and non-generative endpoints | +| `NO_CONSOLE` | Not displayed in console output | Metric computed but excluded from terminal display; available in JSON/CSV/JSONL exports and used by other metrics | +| `LARGER_IS_BETTER` | Higher values indicate better performance | Used for throughput and count metrics to indicate optimization direction | +| `INTERNAL` | Internal AIPerf metric | Used for AIPerf system diagnostics; not displayed in console or exported without developer mode | +| `SUPPORTS_AUDIO_ONLY` | Only computed for audio endpoints | Requires audio-capable endpoints; skipped for other endpoint types | +| `SUPPORTS_IMAGE_ONLY` | Only computed for image endpoints | Requires image-capable endpoints; skipped for other endpoint types | +| `SUPPORTS_REASONING` | Requires reasoning token support | Only available for models and endpoints that expose reasoning content in separate fields | +| `EXPERIMENTAL` | Experimental/unstable metric | May change or be removed in future releases; not displayed in console or exported without developer mode | +| `GOODPUT` | Only computed when goodput is enabled | Requires SLO thresholds to be configured (e.g., `--goodput-constraints`); skipped otherwise | +| `NO_INDIVIDUAL_RECORDS` | Not exported for individual records | 
Aggregate metrics not relevant to individual records (e.g., request count, min/max timestamps); excluded from per-record exports | +| `TOKENIZES_INPUT_ONLY` | Only computed when endpoint tokenizes input | Requires endpoints that process and tokenize input text; skipped for non-text endpoints | + +## Composite Flags + +These flags are combinations of multiple individual flags for convenience: + +| Flag | Composition | Description | +|------|-------------|-------------| +| `STREAMING_TOKENS_ONLY` | `STREAMING_ONLY` + `PRODUCES_TOKENS_ONLY` | Requires both streaming support and token-producing endpoints | + +---
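+
+As an illustration of how composite flags are built from individual ones by bitwise combination, here is a minimal sketch using Python's `enum.Flag`. The class name `MetricFlags` and the concrete values are assumptions for illustration; AIPerf's internal flag implementation may differ:
+
+```python
+# Minimal sketch only; AIPerf's actual flag implementation may differ.
+from enum import Flag
+
+
+class MetricFlags(Flag):
+    NONE = 0
+    STREAMING_ONLY = 1
+    PRODUCES_TOKENS_ONLY = 2
+    NO_CONSOLE = 4
+    # Composite flag: both restrictions apply at once.
+    STREAMING_TOKENS_ONLY = STREAMING_ONLY | PRODUCES_TOKENS_ONLY
+
+
+flags = MetricFlags.STREAMING_TOKENS_ONLY
+assert flags & MetricFlags.STREAMING_ONLY          # streaming restriction is set
+assert flags & MetricFlags.PRODUCES_TOKENS_ONLY    # token restriction is set
+assert not (flags & MetricFlags.NO_CONSOLE)        # unrelated flags stay unset
+```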