From 4fb3771d7feea6ab6de9eb0539868d32ddf5d98d Mon Sep 17 00:00:00 2001 From: Anthony Casagrande Date: Wed, 1 Oct 2025 21:23:50 -0700 Subject: [PATCH] docs: add comprehensive metrics docs Signed-off-by: Anthony Casagrande --- README.md | 110 ++++- docs/metrics_reference.md | 841 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 949 insertions(+), 2 deletions(-) create mode 100644 docs/metrics_reference.md diff --git a/README.md b/README.md index b710dbd67..226ef6f04 100644 --- a/README.md +++ b/README.md @@ -12,7 +12,7 @@ SPDX-License-Identifier: Apache-2.0 [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/aiperf) -**[Architecture](docs/architecture.md)**| **[Design Proposals](https://github.com/ai-dynamo/enhancements)** | **[Migrating from Genai-Perf](docs/migrating.md)** | **[CLI Options](docs/cli_options.md)** +**[Architecture](docs/architecture.md)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** | **[Migrating from Genai-Perf](docs/migrating.md)** | **[CLI Options](docs/cli_options.md)** | **[Metrics Reference](docs/metrics_reference.md)** AIPerf is a comprehensive benchmarking tool that measures the performance of generative AI models served by your preferred inference solution. @@ -96,7 +96,6 @@ aiperf profile --benchmark-duration 300.0 --benchmark-grace-period 30.0 [other o
-
+
+## Metrics Reference
+
+AIPerf provides a comprehensive set of metrics organized into functional categories. For detailed descriptions, requirements, and nuances of each metric, see the **[Complete Metrics Reference](docs/metrics_reference.md)**.
+
+### Streaming Metrics
+
+Metrics specific to streaming requests that measure real-time token generation characteristics. Requires the `--streaming` flag.
+
+| Metric | Tag | Formula | Unit |
+|--------|-----|---------|------|
+| [**Time to First Token (TTFT)**](docs/metrics_reference.md#time-to-first-token-ttft) | `ttft` | `responses[0].perf_ns - request.start_perf_ns` | `ms` |
+| [**Time to Second Token (TTST)**](docs/metrics_reference.md#time-to-second-token-ttst) | `ttst` | `responses[1].perf_ns - responses[0].perf_ns` | `ms` |
+| [**Inter Token Latency (ITL)**](docs/metrics_reference.md#inter-token-latency-itl) | `inter_token_latency` | `(request_latency - ttft) / (output_sequence_length - 1)` | `ms` |
+| [**Inter Chunk Latency (ICL)**](docs/metrics_reference.md#inter-chunk-latency-icl) | `inter_chunk_latency` | `[responses[i].perf_ns - responses[i-1].perf_ns for i in range(1, len(responses))]` | `ms` |
+| [**Output Token Throughput Per User**](docs/metrics_reference.md#output-token-throughput-per-user) | `output_token_throughput_per_user` | `1.0 / inter_token_latency_seconds` | `tokens/sec/user` |
+| [**Prefill Throughput**](docs/metrics_reference.md#prefill-throughput) | `prefill_throughput` | `input_sequence_length / ttft_seconds` | `tokens/sec` |
+
+### Token Based Metrics
+
+Metrics for token-producing endpoints that track token counts and throughput. Requires text-generating endpoints (chat, completion, etc.).
+
+| Metric | Tag | Formula | Unit |
+|--------|-----|---------|------|
+| [**Output Token Count**](docs/metrics_reference.md#output-token-count) | `output_token_count` | `len(tokenizer.encode(content, add_special_tokens=False))` | `tokens` |
+| [**Output Sequence Length (OSL)**](docs/metrics_reference.md#output-sequence-length-osl) | `output_sequence_length` | `(output_token_count or 0) + (reasoning_token_count or 0)` | `tokens` |
+| [**Input Sequence Length (ISL)**](docs/metrics_reference.md#input-sequence-length-isl) | `input_sequence_length` | `len(tokenizer.encode(prompt, add_special_tokens=False))` | `tokens` |
+| [**Total Output Tokens**](docs/metrics_reference.md#total-output-tokens) | `total_output_tokens` | `sum(r.output_token_count for r in records if r.valid)` | `tokens` |
+| [**Total Output Sequence Length**](docs/metrics_reference.md#total-output-sequence-length) | `total_osl` | `sum(r.output_sequence_length for r in records if r.valid)` | `tokens` |
+| [**Total Input Sequence Length**](docs/metrics_reference.md#total-input-sequence-length) | `total_isl` | `sum(r.input_sequence_length for r in records if r.valid)` | `tokens` |
+| [**Output Token Throughput**](docs/metrics_reference.md#output-token-throughput) | `output_token_throughput` | `total_osl / benchmark_duration_seconds` | `tokens/sec` |
+
+### Reasoning Metrics
+
+Metrics specific to models that support reasoning/thinking tokens. Requires models that return a separate `reasoning_content` field.
+
+| Metric | Tag | Formula | Unit |
+|--------|-----|---------|------|
+| [**Reasoning Token Count**](docs/metrics_reference.md#reasoning-token-count) | `reasoning_token_count` | `len(tokenizer.encode(reasoning_content, add_special_tokens=False))` | `tokens` |
+| [**Total Reasoning Tokens**](docs/metrics_reference.md#total-reasoning-tokens) | `total_reasoning_tokens` | `sum(r.reasoning_token_count for r in records if r.valid)` | `tokens` |
+
+### Usage Field Metrics
+
+Metrics tracking API-reported token counts from the `usage` field in responses. Useful for comparing client-side vs server-side token counts.
+
+| Metric | Tag | Formula | Unit |
+|--------|-----|---------|------|
+| [**Usage Prompt Tokens**](docs/metrics_reference.md#usage-prompt-tokens) | `usage_prompt_tokens` | `response.usage.prompt_tokens` | `tokens` |
+| [**Usage Completion Tokens**](docs/metrics_reference.md#usage-completion-tokens) | `usage_completion_tokens` | `response.usage.completion_tokens` | `tokens` |
+| [**Usage Total Tokens**](docs/metrics_reference.md#usage-total-tokens) | `usage_total_tokens` | `response.usage.total_tokens` | `tokens` |
+| [**Usage Reasoning Tokens**](docs/metrics_reference.md#usage-reasoning-tokens) | `usage_reasoning_tokens` | `response.usage.completion_tokens_details.reasoning_tokens` | `tokens` |
+| [**Total Usage Prompt Tokens**](docs/metrics_reference.md#total-usage-prompt-tokens) | `total_usage_prompt_tokens` | `sum(r.usage_prompt_tokens for r in records if r.valid)` | `tokens` |
+| [**Total Usage Completion Tokens**](docs/metrics_reference.md#total-usage-completion-tokens) | `total_usage_completion_tokens` | `sum(r.usage_completion_tokens for r in records if r.valid)` | `tokens` |
+| [**Total Usage Total Tokens**](docs/metrics_reference.md#total-usage-total-tokens) | `total_usage_total_tokens` | `sum(r.usage_total_tokens for r in records if r.valid)` | `tokens` |
+
+### Usage Discrepancy Metrics
+
+Metrics measuring differences between API-reported and client-computed token counts.
+
+| Metric | Tag | Formula | Unit |
+|--------|-----|---------|------|
+| [**Usage Prompt Tokens Diff %**](docs/metrics_reference.md#usage-prompt-tokens-diff-) | `usage_prompt_tokens_diff_pct` | `abs((usage_prompt_tokens - input_sequence_length) / input_sequence_length) * 100` | `%` |
+| [**Usage Completion Tokens Diff %**](docs/metrics_reference.md#usage-completion-tokens-diff-) | `usage_completion_tokens_diff_pct` | `abs((usage_completion_tokens - output_sequence_length) / output_sequence_length) * 100` | `%` |
+| [**Usage Reasoning Tokens Diff %**](docs/metrics_reference.md#usage-reasoning-tokens-diff-) | `usage_reasoning_tokens_diff_pct` | `abs((usage_reasoning_tokens - reasoning_token_count) / reasoning_token_count) * 100` | `%` |
+| [**Usage Discrepancy Count**](docs/metrics_reference.md#usage-discrepancy-count) | `usage_discrepancy_count` | `sum(1 for r in records if r.any_diff > threshold)` | `requests` |
+
+### Goodput Metrics
+
+Metrics measuring throughput of requests meeting user-defined Service Level Objectives (SLOs).
+
+| Metric | Tag | Formula | Unit |
+|--------|-----|---------|------|
+| [**Good Request Count**](docs/metrics_reference.md#good-request-count) | `good_request_count` | `sum(1 for r in records if r.all_slos_met)` | `requests` |
+| [**Goodput**](docs/metrics_reference.md#goodput) | `goodput` | `good_request_count / benchmark_duration_seconds` | `requests/sec` |
+
+### Error Metrics
+
+Metrics computed for failed/error requests.
+
+| Metric | Tag | Formula | Unit |
+|--------|-----|---------|------|
+| [**Error Input Sequence Length**](docs/metrics_reference.md#error-input-sequence-length) | `error_isl` | `input_sequence_length` (for error requests) | `tokens` |
+| [**Total Error Input Sequence Length**](docs/metrics_reference.md#total-error-input-sequence-length) | `total_error_isl` | `sum(r.input_sequence_length for r in records if not r.valid)` | `tokens` |
+| [**Error Request Count**](docs/metrics_reference.md#error-request-count) | `error_request_count` | `sum(1 for r in records if not r.valid)` | `requests` |
+
+### General Metrics
+
+Metrics available for all benchmark runs with no special requirements.
+
+| Metric | Tag | Formula | Unit |
+|--------|-----|---------|------|
+| [**Request Latency**](docs/metrics_reference.md#request-latency) | `request_latency` | `responses[-1].perf_ns - request.start_perf_ns` | `ms` |
+| [**Request Throughput**](docs/metrics_reference.md#request-throughput) | `request_throughput` | `request_count / benchmark_duration_seconds` | `requests/sec` |
+| [**Request Count**](docs/metrics_reference.md#request-count) | `request_count` | `sum(1 for r in records if r.valid)` | `requests` |
+| [**Minimum Request Timestamp**](docs/metrics_reference.md#minimum-request-timestamp) | `min_request_timestamp` | `min(r.timestamp_ns for r in records)` | `datetime` |
+| [**Maximum Response Timestamp**](docs/metrics_reference.md#maximum-response-timestamp) | `max_response_timestamp` | `max(r.timestamp_ns + r.request_latency for r in records)` | `datetime` |
+| [**Benchmark Duration**](docs/metrics_reference.md#benchmark-duration) | `benchmark_duration` | `max_response_timestamp - min_request_timestamp` | `sec` |
+
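+The formulas above use a shorthand record notation (`r.timestamp_ns`, `r.request_latency`, `r.output_sequence_length`, `r.valid`) rather than a public AIPerf API. As a rough sketch under that assumption, the aggregate and derived formulas compose like this:
+
+```python
+# Illustrative only: how the aggregate/derived formulas in the tables fit together.
+# `Record` mirrors the shorthand used in the formula column; it is not an AIPerf class.
+from dataclasses import dataclass
+
+
+@dataclass
+class Record:
+    timestamp_ns: int            # wall-clock send time (ns)
+    request_latency_ns: int      # responses[-1].perf_ns - request.start_perf_ns
+    output_sequence_length: int  # output + reasoning tokens
+    valid: bool                  # True if the request succeeded
+
+
+def summarize(records: list[Record]) -> dict[str, float]:
+    valid = [r for r in records if r.valid]
+    min_request_timestamp = min(r.timestamp_ns for r in records)
+    max_response_timestamp = max(r.timestamp_ns + r.request_latency_ns for r in records)
+    duration_s = (max_response_timestamp - min_request_timestamp) / 1e9
+    total_osl = sum(r.output_sequence_length for r in valid)
+    return {
+        "request_count": len(valid),                        # requests
+        "benchmark_duration": duration_s,                   # sec
+        "request_throughput": len(valid) / duration_s,      # requests/sec
+        "output_token_throughput": total_osl / duration_s,  # tokens/sec
+    }
+```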
+ + ## Known Issues - Output sequence length constraints (`--output-tokens-mean`) cannot be guaranteed unless you pass `ignore_eos` and/or `min_tokens` via `--extra-inputs` to an inference server that supports them. diff --git a/docs/metrics_reference.md b/docs/metrics_reference.md new file mode 100644 index 000000000..2ee21ff42 --- /dev/null +++ b/docs/metrics_reference.md @@ -0,0 +1,841 @@ + +# AIPerf Metrics Reference + +This document provides a comprehensive reference of all metrics available in AIPerf for benchmarking LLM inference performance. Metrics are organized by computation type to help you understand when and how each metric is calculated. + +## Table of Contents + +- [Quick Reference](#quick-reference) +- [Understanding Metric Types](#understanding-metric-types) + - [Record Metrics](#record-metrics) + - [Aggregate Metrics](#aggregate-metrics) + - [Derived Metrics](#derived-metrics) +- [Detailed Metric Descriptions](#detailed-metric-descriptions) + - [Streaming Metrics](#streaming-metrics) + - [Time to First Token (TTFT)](#time-to-first-token-ttft) + - [Time to Second Token (TTST)](#time-to-second-token-ttst) + - [Inter Token Latency (ITL)](#inter-token-latency-itl) + - [Inter Chunk Latency (ICL)](#inter-chunk-latency-icl) + - [Output Token Throughput Per User](#output-token-throughput-per-user) + - [Prefill Throughput](#prefill-throughput) + - [Token Based Metrics](#token-based-metrics) + - [Output Token Count](#output-token-count) + - [Output Sequence Length (OSL)](#output-sequence-length-osl) + - [Input Sequence Length (ISL)](#input-sequence-length-isl) + - [Total Output Tokens](#total-output-tokens) + - [Total Output Sequence Length](#total-output-sequence-length) + - [Total Input Sequence Length](#total-input-sequence-length) + - [Output Token Throughput](#output-token-throughput) + - [Reasoning Metrics](#reasoning-metrics) + - [Reasoning Token Count](#reasoning-token-count) + - [Total Reasoning Tokens](#total-reasoning-tokens) + - [Usage Field Metrics](#usage-field-metrics) + - [Usage Prompt Tokens](#usage-prompt-tokens) + - [Usage Completion Tokens](#usage-completion-tokens) + - [Usage Total Tokens](#usage-total-tokens) + - [Usage Reasoning Tokens](#usage-reasoning-tokens) + - [Total Usage Prompt Tokens](#total-usage-prompt-tokens) + - [Total Usage Completion Tokens](#total-usage-completion-tokens) + - [Total Usage Total Tokens](#total-usage-total-tokens) + - [Usage Discrepancy Metrics](#usage-discrepancy-metrics) + - [Usage Prompt Tokens Diff %](#usage-prompt-tokens-diff-) + - [Usage Completion Tokens Diff %](#usage-completion-tokens-diff-) + - [Usage Reasoning Tokens Diff %](#usage-reasoning-tokens-diff-) + - [Usage Discrepancy Count](#usage-discrepancy-count) + - [Goodput Metrics](#goodput-metrics) + - [Good Request Count](#good-request-count) + - [Goodput](#goodput) + - [Error Metrics](#error-metrics) + - [Error Input Sequence Length](#error-input-sequence-length) + - [Total Error Input Sequence Length](#total-error-input-sequence-length) + - [General Metrics](#general-metrics) + - [Request Latency](#request-latency) + - [Request Throughput](#request-throughput) + - [Request Count](#request-count) + - [Error Request Count](#error-request-count) + - [Minimum Request Timestamp](#minimum-request-timestamp) + - [Maximum Response Timestamp](#maximum-response-timestamp) + - [Benchmark Duration](#benchmark-duration) +- [Metric Flags Reference](#metric-flags-reference) + +--- + +## Quick Reference + +For a quick reference of all metrics with their tags, formulas, and units, 
see the **[Metrics Reference section in the README](../README.md#metrics-reference)**. + +The sections below provide detailed descriptions, requirements, and notes for each metric. + +--- + +## Understanding Metric Types + +AIPerf computes metrics in three distinct phases during benchmark execution: **Record Metrics**, **Aggregate Metrics**, and **Derived Metrics**. + +## Record Metrics + +Record Metrics are computed **individually** for **each request** and its **response(s)** during the benchmark run. A single request may have one response (non-streaming) or multiple responses (streaming). These metrics capture **per-request characteristics** such as latency, token counts, and streaming behavior. Record metrics produce **statistical distributions** (min, max, mean, median, p90, p99, etc.) that reveal performance variability across requests. + +### Example Metrics +`request_latency`, `ttft`, `inter_token_latency`, `output_token_count`, `input_sequence_length` + +### Dependencies +Record Metrics can depend on raw request/response data and other Record Metrics from the same request. + +### Example Scenario +`request_latency` measures the time for each individual request from start to final response. If you send 100 requests, you get 100 latency values that form a distribution showing how latency varies across requests. + +## Aggregate Metrics + +Aggregate Metrics are computed by **tracking** or **accumulating** values across **all requests** in **real-time** during the benchmark. These include counters, min/max timestamps, and other global statistics. Aggregate metrics produce a **single value** representing the entire benchmark run. + +### Example Metrics +`request_count`, `error_request_count`, `min_request_timestamp`, `max_response_timestamp` + +### Dependencies +Aggregate Metrics can depend on raw request/response data, Record Metrics and other Aggregate Metrics. + +### Example Scenario +`request_count` increments by 1 for each successful request. At the end of a benchmark with 100 successful requests, this metric equals 100 (a single value, not a distribution). + +## Derived Metrics + +Derived Metrics are computed by applying **mathematical formulas** to other metric results, but are **not** computed per-record like Record Metrics. Instead, these metrics depend on one or more **prerequisite metrics** being available first and are calculated either **after the benchmark completes** for final results or in **real-time** across **all current data** for live metrics display. Derived metrics can produce either single values or distributions depending on their dependencies. + +### Example Metrics +`request_throughput`, `output_token_throughput`, `benchmark_duration` + +### Dependencies +Derived Metrics can depend on Record Metrics, Aggregate Metrics, and other Derived Metrics, but do not have +any knowledge of the individual request/response data. + +### Example Scenario +`request_throughput` is computed from `request_count / benchmark_duration_seconds`. This requires both `request_count` and `benchmark_duration` to be available first, then applies a formula to produce a single throughput value (e.g., 10.5 requests/sec). + +--- + +# Detailed Metric Descriptions + +## Streaming Metrics + +> [!NOTE] +> All metrics in this section require the `--streaming` flag with a token-producing endpoint and at least one non-empty response chunk. 
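+
+As a worked example, the per-request streaming metrics in this section can all be derived from one request's chunk arrival times. The timestamps below are made up, and `start_perf_ns` / `chunk_perf_ns` simply follow the notation of the formulas that follow rather than an actual AIPerf object:
+
+```python
+# Made-up perf-counter values for a single streaming request (nanoseconds).
+start_perf_ns = 0
+chunk_perf_ns = [120e6, 155e6, 190e6, 230e6]  # arrival times of 4 non-empty chunks
+output_sequence_length = 8                    # suppose the chunks decode to 8 tokens
+
+request_latency_ns = chunk_perf_ns[-1] - start_perf_ns
+ttft_ns = chunk_perf_ns[0] - start_perf_ns                          # 120 ms
+ttst_ns = chunk_perf_ns[1] - chunk_perf_ns[0]                       # 35 ms
+icl_ns = [b - a for a, b in zip(chunk_perf_ns, chunk_perf_ns[1:])]  # [35, 35, 40] ms
+itl_ns = (request_latency_ns - ttft_ns) / (output_sequence_length - 1)
+
+print(ttft_ns / 1e6, ttst_ns / 1e6, round(itl_ns / 1e6, 2))  # 120.0 35.0 15.71
+```
+
+Output Token Throughput Per User for this request would then be `1.0 / (itl_ns / 1e9)`, roughly 64 tokens/sec/user.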
+ +### Time to First Token (TTFT) + +**Type:** [Record Metric](#record-metrics) + +Measures how long it takes to receive the first token (or chunk of tokens) after sending a request. This is critical for user-perceived responsiveness in streaming scenarios, as it represents how quickly the model begins generating output. + +**Formula:** +```python +# nanoseconds +ttft_ns = request.responses[0].perf_ns - request.start_perf_ns + +# Convert to milliseconds for display +ttft_ms = ttft_ns / 1e6 + +# Convert to seconds for throughput calculations +ttft_seconds = ttft_ns / 1e9 +``` + +**Notes:** +- Includes network latency, queuing time, prompt processing, and generation of the first token (or chunk of tokens). +- Raw timestamps are in nanoseconds; converted to milliseconds for display and seconds for rate calculations. +- Response chunks refer to individual messages with non-empty content received during streaming. + +--- + +### Time to Second Token (TTST) + +**Type:** [Record Metric](#record-metrics) + +Measures the time gap between the first and second chunk of tokens. This metric helps identify generation startup overhead separate from steady-state streaming throughput. + +**Formula:** +```python +# nanoseconds +ttst_ns = request.responses[1].perf_ns - request.responses[0].perf_ns + +# Convert to milliseconds for display +ttst_ms = ttst_ns / 1e6 +``` + +**Notes:** +- Requires at least 2 non-empty response chunks to compute the time between first and second tokens. +- Raw timestamps are in nanoseconds; converted to milliseconds for display. + +--- + +### Inter Token Latency (ITL) + +**Type:** [Record Metric](#record-metrics) + +Measures the average time between consecutive tokens during generation, excluding the initial TTFT overhead. This represents the steady-state token generation rate. + +**Formula:** +```python +# Calculate in nanoseconds, then convert to seconds +inter_token_latency_ns = (request_latency_ns - ttft_ns) / (output_sequence_length - 1) + +# Convert to seconds for throughput calculations +inter_token_latency_seconds = inter_token_latency_ns / 1e9 + +# Convert to milliseconds for display +inter_token_latency_ms = inter_token_latency_ns / 1e6 +``` + +**Notes:** +- Requires at least 2 non-empty response chunks and valid `ttft`, `request_latency`, and `output_sequence_length` metrics. +- Result is in seconds when used for throughput calculations (Output Token Throughput Per User). + +--- + +### Inter Chunk Latency (ICL) + +**Type:** [Record Metric](#record-metrics) + +Captures the time gaps between all consecutive response chunks in a streaming response, providing a distribution of chunk arrival times rather than a single average. Note that this is different from the ITL metric, which measures the time between consecutive tokens regardless of chunk size. + +**Formula:** +```python +inter_chunk_latency = [request.responses[i].perf_ns - request.responses[i-1].perf_ns for i in range(1, len(request.responses))] +``` + +**Notes:** +- Requires at least 2 response chunks. +- Unlike ITL (which produces a single average), ICL provides the full distribution of inter-chunk times. +- Useful for detecting variability, jitter, or issues in streaming delivery. +- Analyzing ICL distributions can reveal batching behavior, scheduling issues, or network variability. 
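+
+For instance, a simple way to quantify that variability is to compare the tail of the ICL distribution to its median. The latency values below are invented, and nothing here depends on AIPerf's export format:
+
+```python
+# Invented inter-chunk latencies (ms); a couple of chunks arrive roughly 2x late.
+import statistics
+
+icl_ms = [34.8, 35.1, 35.0, 72.4, 34.9, 35.2, 35.0, 71.9, 35.1, 35.0]
+
+p50 = statistics.median(icl_ms)
+p99 = statistics.quantiles(icl_ms, n=100)[98]
+print(f"median={p50:.1f} ms  p99={p99:.1f} ms  p99/median={p99 / p50:.2f}")
+# A p99 far above the median (about 2x here) points to periodic stalls such as
+# batching or scheduler pauses rather than uniformly slow decoding.
+```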
+ +--- + +### Output Token Throughput Per User + +**Type:** [Record Metric](#record-metrics) + +> [!IMPORTANT] +> This metric is computed per-request, and it excludes the TTFT from the equation, so it is **not** directly comparable to the [Output Token Throughput](#output-token-throughput) metric. + +The token generation rate experienced by an individual user/request, measured as the inverse of inter-token latency. This represents single-request streaming performance. + +**Formula:** +```python +output_token_throughput_per_user = 1.0 / inter_token_latency_seconds +``` + +**Notes:** +- Computes the inverse of ITL to show tokens per second from an individual user's perspective. +- Differs from Output Token Throughput (aggregate across all concurrent requests) by focusing on single-request experience. +- Useful for understanding the user experience independent of concurrency effects. + +--- + +### Prefill Throughput + +**Type:** [Record Metric](#record-metrics) + +Measures the rate at which input tokens are processed during the prefill phase, calculated as input tokens per second based on TTFT. + +**Formula:** +```python +prefill_throughput = input_sequence_length / ttft_seconds +``` + +**Notes:** +- Higher values indicate faster prompt processing. +- Useful for understanding input processing capacity and bottlenecks. +- Depends on Input Sequence Length and TTFT metrics. + +--- + +## Token Based Metrics + +> [!NOTE] +> All metrics in this section require token-producing endpoints that return text content (chat, completion, etc.). These metrics are not available for embeddings or other non-generative endpoints. + +### Output Token Count + +**Type:** [Record Metric](#record-metrics) + +The number of output tokens generated for a single request, _excluding reasoning tokens_. This represents the output tokens returned to the user across all responses for the request. + +**Formula:** +```python +output_token_count = len(tokenizer.encode(content, add_special_tokens=False)) +``` + +**Notes:** +- Tokenization uses `add_special_tokens=False` to count only content tokens, excluding special tokens added by the tokenizer. +- For streaming requests with multiple responses, the responses are joined together and then tokens are counted. +- For models that expose reasoning in a separate `reasoning_content` field, this metric counts only non-reasoning output tokens. +- If reasoning appears inside the regular `content` (e.g., `` blocks), those tokens will be counted unless explicitly filtered. + +--- + +### Output Sequence Length (OSL) + +**Type:** [Record Metric](#record-metrics) + +The total number of completion tokens (output + reasoning) generated for a single request across all its responses. This represents the complete token generation workload for the request. + +**Formula:** +```python +output_sequence_length = (output_token_count or 0) + (reasoning_token_count or 0) +``` + +**Notes:** +- For models that do not support/separate reasoning tokens, OSL equals the output token count. + +--- + +### Input Sequence Length (ISL) + +**Type:** [Record Metric](#record-metrics) + +The number of input/prompt tokens for a single request. This represents the size of the input sent to the model. + +**Formula:** +```python +input_sequence_length = len(tokenizer.encode(prompt, add_special_tokens=False)) +``` + +**Notes:** +- Tokenization uses `add_special_tokens=False` to count only content tokens, excluding special tokens added by the tokenizer. 
+- Useful for understanding the relationship between input size and latency/throughput. + +--- + +### Total Output Tokens + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The sum of all output tokens (excluding reasoning tokens) generated across all requests. This represents the total output token workload. + +**Formula:** +```python +total_output_tokens = sum(r.output_token_count for r in records if r.valid) +``` + +**Notes:** +- Aggregates output tokens across all successful requests. +- Useful for capacity planning and cost estimation. + +--- + +### Total Output Sequence Length + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The sum of all completion tokens (output + reasoning) generated across all requests. This represents the complete token generation workload. + +**Formula:** +```python +total_osl = sum(r.output_sequence_length for r in records if r.valid) +``` + +**Notes:** +- Aggregates the complete token generation workload including both output and reasoning tokens. +- For models without reasoning tokens, this equals Total Output Tokens. + +--- + +### Total Input Sequence Length + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The sum of all input/prompt tokens processed across all requests. This represents the total input workload sent to the model. + +**Formula:** +```python +total_isl = sum(r.input_sequence_length for r in records if r.valid) +``` + +**Notes:** +- Useful for understanding the input workload, capacity planning, and analyzing the relationship between input size and system performance. + +--- + +### Output Token Throughput + +**Type:** [Derived Metric](#derived-metrics) + +> [!IMPORTANT] +> This metric is computed as a single value across all requests and includes TTFT in the equation, so it is **not** directly comparable to the [Output Token Throughput Per User](#output-token-throughput-per-user) metric. + +The aggregate token generation rate across all concurrent requests, measured as total tokens per second. This represents the system's overall token generation capacity. + +**Formula:** +```python +output_token_throughput = benchmark_token_count / benchmark_duration_seconds +``` + +**Notes:** +- Measures aggregate throughput across all concurrent requests; represents the overall system token generation rate. +- Higher values indicate better system utilization and capacity. +- Uses the hidden `benchmark_token_count` metric (sum of all output sequence lengths) as the numerator. + +--- + +## Reasoning Metrics + +> [!NOTE] +> All metrics in this section require models and backends that expose reasoning content in a separate `reasoning_content` field, distinct from the regular `content` field. + +### Reasoning Token Count + +**Type:** [Record Metric](#record-metrics) + +The number of reasoning tokens generated for a single request. These are tokens used for "thinking" or chain-of-thought reasoning before generating the final output. + +**Formula:** +```python +reasoning_token_count = len(tokenizer.encode(reasoning_content, add_special_tokens=False)) +``` + +**Notes:** +- Tokenization uses `add_special_tokens=False` to count only content tokens, excluding special tokens added by the tokenizer. +- Does **not** differentiate `` tags or extract reasoning from within the regular `content` field. + +--- + +### Total Reasoning Tokens + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The sum of all reasoning tokens generated across all requests. This represents the total reasoning/thinking workload. 
+ +**Formula:** +```python +total_reasoning_tokens = sum(r.reasoning_token_count for r in records if r.valid) +``` + +**Notes:** +- Useful for understanding the reasoning overhead and cost for reasoning-enabled models. + +--- + +## Usage Field Metrics + +> [!NOTE] +> All metrics in this section track API-reported token counts from the `usage` field in API responses. These are **not displayed in console output** but are available in exports. These metrics are useful for comparing client-side token counts with server-reported counts to detect discrepancies. + +### Usage Prompt Tokens + +**Type:** [Record Metric](#record-metrics) + +The number of input/prompt tokens as reported by the API's `usage.prompt_tokens` field for a single request. + +**Formula:** +```python +usage_prompt_tokens = response.usage.prompt_tokens # from last non-None response +``` + +**Notes:** +- Taken from the API response `usage` object, not computed by AIPerf. +- May differ from client-side Input Sequence Length due to different tokenizers or special tokens. +- For streaming responses, uses the last non-None value reported. + +--- + +### Usage Completion Tokens + +**Type:** [Record Metric](#record-metrics) + +The number of completion tokens as reported by the API's `usage.completion_tokens` field for a single request. + +**Formula:** +```python +usage_completion_tokens = response.usage.completion_tokens # from last non-None response +``` + +**Notes:** +- Taken from the API response `usage` object, not computed by AIPerf. +- May differ from client-side Output Sequence Length due to different tokenizers or counting methods. +- For streaming responses, uses the last non-None value reported. + +--- + +### Usage Total Tokens + +**Type:** [Record Metric](#record-metrics) + +The total number of tokens (prompt + completion) as reported by the API's `usage.total_tokens` field for a single request. + +**Formula:** +```python +usage_total_tokens = response.usage.total_tokens # from last non-None response +``` + +**Notes:** +- Taken from the API response `usage` object, not computed by AIPerf. +- Should generally equal `usage_prompt_tokens + usage_completion_tokens`. +- For streaming responses, uses the last non-None value reported. + +--- + +### Usage Reasoning Tokens + +**Type:** [Record Metric](#record-metrics) + +The number of reasoning tokens as reported by the API's `usage.completion_tokens_details.reasoning_tokens` field for a single request. Only available for reasoning-enabled models. + +**Formula:** +```python +usage_reasoning_tokens = response.usage.completion_tokens_details.reasoning_tokens +``` + +**Notes:** +- Taken from the API response for reasoning-enabled models. +- May differ from client-side Reasoning Token Count due to different tokenizers. +- For streaming responses, uses the last non-None value reported. + +--- + +### Total Usage Prompt Tokens + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The sum of all API-reported prompt tokens across all requests. + +**Formula:** +```python +total_usage_prompt_tokens = sum(r.usage_prompt_tokens for r in records if r.valid) +``` + +**Notes:** +- Aggregates server-reported input tokens across all requests. + +--- + +### Total Usage Completion Tokens + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The sum of all API-reported completion tokens across all requests. 
+ +**Formula:** +```python +total_usage_completion_tokens = sum(r.usage_completion_tokens for r in records if r.valid) +``` + +**Notes:** +- Aggregates server-reported completion tokens across all requests. + +--- + +### Total Usage Total Tokens + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The sum of all API-reported total tokens across all requests. + +**Formula:** +```python +total_usage_total_tokens = sum(r.usage_total_tokens for r in records if r.valid) +``` + +**Notes:** +- Aggregates server-reported total tokens across all requests. + +--- + +## Usage Discrepancy Metrics + +> [!NOTE] +> These metrics measure the percentage difference between API-reported token counts (`usage` fields) and client-computed token counts. They are **not displayed in console output** but help identify tokenizer mismatches or counting discrepancies. + +### Usage Prompt Tokens Diff % + +**Type:** [Record Metric](#record-metrics) + +The percentage difference between API-reported prompt tokens and client-computed Input Sequence Length. + +**Formula:** +```python +usage_prompt_tokens_diff_pct = abs((usage_prompt_tokens - input_sequence_length) / input_sequence_length) * 100 +``` + +**Notes:** +- Values close to 0% indicate good agreement between client and server token counts. +- Large differences may indicate tokenizer mismatches or special token handling differences. + +--- + +### Usage Completion Tokens Diff % + +**Type:** [Record Metric](#record-metrics) + +The percentage difference between API-reported completion tokens and client-computed Output Sequence Length. + +**Formula:** +```python +usage_completion_tokens_diff_pct = abs((usage_completion_tokens - output_sequence_length) / output_sequence_length) * 100 +``` + +**Notes:** +- Values close to 0% indicate good agreement between client and server token counts. +- Large differences may indicate tokenizer mismatches or different counting methods. + +--- + +### Usage Reasoning Tokens Diff % + +**Type:** [Record Metric](#record-metrics) + +The percentage difference between API-reported reasoning tokens and client-computed Reasoning Token Count. + +**Formula:** +```python +usage_reasoning_tokens_diff_pct = abs((usage_reasoning_tokens - reasoning_token_count) / reasoning_token_count) * 100 +``` + +**Notes:** +- Only available for reasoning-enabled models. +- Values close to 0% indicate good agreement between client and server reasoning token counts. + +--- + +### Usage Discrepancy Count + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The number of requests where token count differences exceed a threshold (default 10%). + +**Formula:** +```python +usage_discrepancy_count = sum(1 for r in records if r.any_diff > threshold) +``` + +**Notes:** +- Default threshold is 10% difference. +- Counts requests where prompt, completion, or reasoning token differences are significant. +- Useful for monitoring overall token count agreement quality. + +--- + +## Goodput Metrics + +> [!NOTE] +> Goodput metrics measure the throughput of requests that meet user-defined Service Level Objectives (SLOs). See the [Goodput tutorial](../docs/tutorials/goodput.md) for configuration details. + +### Good Request Count + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The number of requests that meet all user-defined SLO thresholds during the benchmark. + +**Formula:** +```python +good_request_count = sum(1 for r in records if r.all_slos_met) +``` + +**Notes:** +- Requires SLO thresholds to be configured (e.g., `--goodput`). 
+- Only counts requests where ALL SLO constraints are satisfied. +- Used to calculate Goodput metric. + +--- + +### Goodput + +**Type:** [Derived Metric](#derived-metrics) + +The rate of SLO-compliant requests per second. This represents the effective throughput of requests meeting quality requirements. + +**Formula:** +```python +goodput = good_request_count / benchmark_duration_seconds +``` + +**Notes:** +- Requires SLO thresholds to be configured. +- Always less than or equal to Request Throughput. +- Useful for capacity planning and comparing systems based on quality-adjusted throughput. + +--- + +## Error Metrics + +> [!NOTE] +> These metrics are computed only for failed/error requests and are **not displayed in console output**. + +### Error Input Sequence Length + +**Type:** [Record Metric](#record-metrics) + +The number of input tokens for requests that resulted in errors. This helps analyze whether input size correlates with errors. + +**Formula:** +```python +error_isl = input_sequence_length # for error requests only +``` + +**Notes:** +- Only computed for requests that failed. +- Useful for identifying if certain input sizes trigger errors. + +--- + +### Total Error Input Sequence Length + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The sum of all input tokens from requests that resulted in errors. + +**Formula:** +```python +total_error_isl = sum(r.input_sequence_length for r in records if not r.valid) +``` + +**Notes:** +- Aggregates input tokens across all failed requests. + +--- + +## General Metrics + +> [!NOTE] +> Metrics in this section are available for all benchmark runs with no special requirements. + +### Request Latency + +**Type:** [Record Metric](#record-metrics) + +Measures the total end-to-end time from sending a request until receiving the final response. For streaming requests with multiple responses, this measures until the last response is received. This is the complete time experienced by the client for a single request. + +**Formula:** +```python +request_latency_ns = request.responses[-1].perf_ns - request.start_perf_ns +``` + +**Notes:** +- Includes all components: network time, queuing, prompt processing, token generation, and response transmission. +- For streaming requests, measures from request start to the final chunk received. + +--- + +### Request Throughput + +**Type:** [Derived Metric](#derived-metrics) + +The overall rate of completed requests per second across the entire benchmark. This represents the system's ability to process requests under the given concurrency and load. + +**Formula:** +```python +request_throughput = request_count / benchmark_duration_seconds +``` + +**Notes:** +- Captures the aggregate request processing rate; higher values indicate better system throughput. +- Affected by concurrency level, request complexity, output sequence length, and system capacity. + +--- + +### Request Count + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The total number of **successfully completed** requests in the benchmark. This includes all requests that received valid responses, regardless of streaming mode. + +**Formula:** +```python +request_count = sum(1 for r in records if r.valid) +``` + +--- + +### Error Request Count + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The total number of failed/error requests encountered during the benchmark. This includes network errors, HTTP errors, timeout errors, and other failures. 
+ +**Formula:** +```python +error_request_count = sum(1 for r in records if not r.valid) +``` + +**Notes:** +- Error rate can be computed as `error_request_count / (request_count + error_request_count)`. + +--- + +### Minimum Request Timestamp + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The wall-clock timestamp of the first request sent in the benchmark. This is used to calculate the benchmark duration and represents the start of the benchmark run. + +**Formula:** +```python +min_request_timestamp = min(r.timestamp_ns for r in records) +``` + +--- + +### Maximum Response Timestamp + +**Type:** [Aggregate Metric](#aggregate-metrics) + +The wall-clock timestamp of the last response received in the benchmark. This is used to calculate the benchmark duration and represents the end of the benchmark run. + +**Formula:** +```python +max_response_timestamp = max(r.timestamp_ns + r.request_latency for r in records) +``` + +--- + +### Benchmark Duration + +**Type:** [Derived Metric](#derived-metrics) + +The total elapsed time from the first request sent to the last response received. This represents the complete wall-clock duration of the benchmark run. + +**Formula:** +```python +benchmark_duration = max_response_timestamp - min_request_timestamp +``` + +**Notes:** +- Uses wall-clock timestamps representing real calendar time. +- Used as the denominator for throughput calculations; represents the effective measurement window. + +--- + +# Metric Flags Reference + +Metric flags are used to control when and how metrics are computed, displayed, and grouped. Flags can be combined using bitwise operations to create composite behaviors. + +## Individual Flags + +| Flag | Description | Impact | +|------|-------------|--------| +| `NONE` | No flags set | Metric has default behavior with no special restrictions | +| `STREAMING_ONLY` | Only computed for streaming responses | Requires Server-Sent Events (SSE) with multiple response chunks; skipped for non-streaming requests | +| `ERROR_ONLY` | Only computed for error requests | Tracks error-specific information; computed only for invalid/failed requests | +| `PRODUCES_TOKENS_ONLY` | Only computed for token-producing endpoints | Requires endpoints that return text/token content; skipped for embeddings and non-generative endpoints | +| `NO_CONSOLE` | Not displayed in console output | Metric computed but excluded from terminal display; available in JSON/CSV/JSONL exports and used by other metrics | +| `LARGER_IS_BETTER` | Higher values indicate better performance | Used for throughput and count metrics to indicate optimization direction | +| `INTERNAL` | Internal AIPerf metric | Used for AIPerf system diagnostics; not displayed in console or exported without developer mode | +| `SUPPORTS_AUDIO_ONLY` | Only computed for audio endpoints | Requires audio-capable endpoints; skipped for other endpoint types | +| `SUPPORTS_IMAGE_ONLY` | Only computed for image endpoints | Requires image-capable endpoints; skipped for other endpoint types | +| `SUPPORTS_REASONING` | Requires reasoning token support | Only available for models and endpoints that expose reasoning content in separate fields | +| `EXPERIMENTAL` | Experimental/unstable metric | May change or be removed in future releases; not displayed in console or exported without developer mode | +| `GOODPUT` | Only computed when goodput is enabled | Requires SLO thresholds to be configured (e.g., `--goodput-constraints`); skipped otherwise | +| `NO_INDIVIDUAL_RECORDS` | Not exported for individual records | 
Aggregate metrics not relevant to individual records (e.g., request count, min/max timestamps); excluded from per-record exports | +| `TOKENIZES_INPUT_ONLY` | Only computed when endpoint tokenizes input | Requires endpoints that process and tokenize input text; skipped for non-text endpoints | + +## Composite Flags + +These flags are combinations of multiple individual flags for convenience: + +| Flag | Composition | Description | +|------|-------------|-------------| +| `STREAMING_TOKENS_ONLY` | `STREAMING_ONLY` + `PRODUCES_TOKENS_ONLY` | Requires both streaming support and token-producing endpoints | + +---
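+
+As an illustration of how composite flags are built from individual ones by bitwise combination, here is a minimal sketch using Python's `enum.Flag`. The class name `MetricFlags` and the concrete values are assumptions for illustration; AIPerf's internal flag implementation may differ:
+
+```python
+# Minimal sketch only; AIPerf's actual flag implementation may differ.
+from enum import Flag
+
+
+class MetricFlags(Flag):
+    NONE = 0
+    STREAMING_ONLY = 1
+    PRODUCES_TOKENS_ONLY = 2
+    NO_CONSOLE = 4
+    # Composite flag: both restrictions apply at once.
+    STREAMING_TOKENS_ONLY = STREAMING_ONLY | PRODUCES_TOKENS_ONLY
+
+
+flags = MetricFlags.STREAMING_TOKENS_ONLY
+assert flags & MetricFlags.STREAMING_ONLY          # streaming restriction is set
+assert flags & MetricFlags.PRODUCES_TOKENS_ONLY    # token restriction is set
+assert not (flags & MetricFlags.NO_CONSOLE)        # unrelated flags stay unset
+```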