Conversation

@ajcasagrande
Contributor

@ajcasagrande ajcasagrande commented Oct 2, 2025

Add formulas for each metric on the main readme
add separate metric docs to explain everything in detail.

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive Metrics Reference page covering Streaming, Token-Based, Reasoning, and General metrics, with formulas, notes, examples, and metric flag definitions.
    • Introduced classifications (Record, Aggregate, Derived), a Quick Reference, and detailed metric descriptions.
    • Linked the Metrics Reference from the README navigation for easier access.
    • Fixed minor README formatting; removed an extra blank line and noted a duplicated Metrics Reference block inserted in README.

@coderabbitai

coderabbitai bot commented Oct 2, 2025

Walkthrough

Adds a new detailed Metrics Reference document at docs/metrics_reference.md and links it into README.md; README navigation was updated, the Metrics Reference block was inserted in two places, and a small whitespace change was made around INSTALLATION.

Changes

Cohort / File(s) Summary
README updates
README.md
Inserted a new "Metrics Reference" navigation link; added the Metrics Reference block in two separate locations (duplicate insertion); removed an extra blank line near the INSTALLATION section.
New metrics doc
docs/metrics_reference.md
Added a comprehensive AIPerf metrics reference (metric categories, quick reference, detailed metric descriptions, formulas, dependencies, streaming considerations, flags and examples).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Potential review focus:

  • Verify and remove the duplicated Metrics Reference block in README.md if unintended.
  • Review formulas, flag semantics, and example accuracy in docs/metrics_reference.md.
  • Check navigation/link correctness and formatting in README.md.

Poem

I thump my paws—new metrics bloom,
Tables, flags, and formulas zoom.
Two echoes now in README's light,
A burrow of docs, tidy and bright.
I hop away, whiskers twitching—done, goodnight. 🐇📈

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'docs: add comprehensive metrics documentation' directly and clearly summarizes the main change in the PR, which adds a new metrics reference document and updates README with metrics information.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov bot commented Oct 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (4)
docs/metrics_reference.md (3)

173-176: Ensure ITL is in seconds for this inverse relationship.

The formula assumes inter_token_latency_seconds; make sure the prior section defines ITL in seconds (not ns/ms) to keep this correct.
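To make the unit dependency concrete, here is a minimal sketch; the variable names and the 12.5 ms value are illustrative assumptions, not values from the codebase:

```python
# Illustrative only: the inverse relationship requires ITL in seconds.
NANOS_PER_SECOND = 1e9

itl_ns = 12_500_000  # hypothetical inter-token latency: 12.5 ms in nanoseconds
inter_token_latency_seconds = itl_ns / NANOS_PER_SECOND

# tokens/sec/user; using the raw nanosecond value would be off by a factor of 1e9
output_token_throughput_per_user = 1.0 / inter_token_latency_seconds
print(output_token_throughput_per_user)  # 80.0
```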


238-246: Filter to valid records in aggregate sums.

Exclude failed/invalid records from totals to match the description.

Apply this diff:

-total_output_tokens = sum(output_token_count for record in records)
+total_output_tokens = sum(r.output_token_count for r in records if r.valid)
-total_osl = sum(output_sequence_length for record in records)
+total_osl = sum(r.output_sequence_length for r in records if r.valid)
-total_isl = sum(input_sequence_length for record in records)
+total_isl = sum(r.input_sequence_length for r in records if r.valid)

Also applies to: 254-260, 268-275
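As a self-contained sketch of the suggested filtering; the `Record` shape and field names here are assumptions for illustration, not the project's actual data model:

```python
from dataclasses import dataclass

@dataclass
class Record:
    valid: bool                # False for failed/invalid requests
    output_token_count: int

records = [Record(True, 100), Record(False, 37), Record(True, 50)]

# Only valid records contribute to the total, matching the metric description.
total_output_tokens = sum(r.output_token_count for r in records if r.valid)
print(total_output_tokens)  # 150
```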


338-341: Use consistent variable naming with request.start_perf_ns.

Align with earlier formulas to avoid ambiguity.

Apply this diff:

-request_latency = responses[-1].perf_ns - start_perf_ns
+request_latency_ns = responses[-1].perf_ns - request.start_perf_ns
README.md (1)

280-283: Grammar nit: “single values” → “single value”.

Minor text cleanup for clarity.

Apply this diff:

-> [!IMPORTANT]
-> This metric is computed as a single values across all requests, and it includes the TTFT in the equation, so it is **not** directly comparable to the [Output Token Throughput Per User](#output-token-throughput-per-user) metric.
+> [!IMPORTANT]
+> This metric is computed as a single value across all requests and includes TTFT in the equation, so it is **not** directly comparable to the [Output Token Throughput Per User](#output-token-throughput-per-user) metric.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d056b9e and c25567e.

📒 Files selected for processing (2)
  • README.md (2 hunks)
  • docs/metrics_reference.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/metrics_reference.md

54-54: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4

(MD001, heading-increment)


67-67: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4

(MD001, heading-increment)


80-80: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4

(MD001, heading-increment)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build (ubuntu-latest, 3.10)
🔇 Additional comments (1)
README.md (1)

15-15: Navigation update LGTM.

Good addition of Metrics Reference link.

Contributor

@debermudez debermudez left a comment

Approved, but I would like another set of eyes on it before we publish.
So only commenting for now, since we didn't set it up for multiple approvals.

- [Record Metrics](#record-metrics)
- [Aggregate Metrics](#aggregate-metrics)
- [Derived Metrics](#derived-metrics)
- [Quick Reference](#quick-reference)
Contributor

I think this should be the first section in this list.

Contributor Author

@ajcasagrande ajcasagrande Oct 2, 2025

can you help clarify? you want me to move the quick reference content up in the document? Or you want me to do something different with the ToC?

Contributor

Definitely. I think move this up, above the section on understanding metric types.

README.md Outdated

| Metric | Tag | Formula | Unit |
|--------|-----|---------|------|
| [**Output Token Count**](docs/metrics_reference.md#output-token-count) | `output_token_count` | `len(tokenizer.encode(content))` | `tokens` |


I assume we have `add_special_tokens=False`

Contributor Author

yes. could be good to add a note

README.md Outdated
|--------|-----|---------|------|
| [**Output Token Count**](docs/metrics_reference.md#output-token-count) | `output_token_count` | `len(tokenizer.encode(content))` | `tokens` |
| [**Output Sequence Length (OSL)**](docs/metrics_reference.md#output-sequence-length-osl) | `output_sequence_length` | `(output_token_count or 0) + (reasoning_token_count or 0)` | `tokens` |
| [**Input Sequence Length (ISL)**](docs/metrics_reference.md#input-sequence-length-isl) | `input_sequence_length` | `len(tokenizer.encode(prompt))` | `tokens` |


Same as output_token_count

Contributor Author

@ajcasagrande ajcasagrande Oct 2, 2025

@IzzyPutterman can you explain what is the same as output_token_count? Are you referring to the ISL? Is it the wording on prompt?


I think @IzzyPutterman is intending that his feedback here is the same as his feedback in #321 (comment)


| Metric | Tag | Formula | Unit |
|--------|-----|---------|------|
| [**Time to First Token (TTFT)**](docs/metrics_reference.md#time-to-first-token-ttft) | `ttft` | `responses[0].perf_ns - request.start_perf_ns` | `ms` |


Perhaps a mention that responses are "chunks with non-empty content"
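For example, a sketch of what "chunks with non-empty content" could mean in practice; the chunk shape here is an assumption for illustration:

```python
# Illustrative only: role-only or empty-delta chunks are excluded before
# indexing responses[0] for TTFT.
chunks = [
    {"perf_ns": 1_000, "content": ""},     # empty delta: not a response
    {"perf_ns": 2_000, "content": "Hel"},  # first chunk with real content
    {"perf_ns": 3_000, "content": "lo"},
]
request_start_perf_ns = 500

responses = [c for c in chunks if c["content"]]
ttft_ns = responses[0]["perf_ns"] - request_start_perf_ns
print(ttft_ns)  # 1500
```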


**Formula:**
```python
ttft = responses[0].perf_ns - request.start_perf_ns
```

Small nitpick: I get what this is telling me, but I was wondering why `request` wasn't indexed. It might be clearer if there were a pointer to the class or structure where this is used? My initial expectation was that the i-th response would map to the i-th request.

This isn't a gating comment, just something that was a little confusing on first impression.

Contributor

I called out something like that here: #321 (comment)
so I think this would be helpful, especially for someone looking to contribute.

Contributor Author

@FrankD412 yeah, it's hard to provide values that are both easy to understand and true to life when the real formula is longer than a single line.

Technically, everywhere you see `responses[x]` it is really `request.responses[x]`, but that was kinda wordy. One option is to drop the `request` from the `start_perf`, or to add `request` back in the first part.

@debermudez I agree that the links would be great.
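A tiny illustrative structure can make that relationship explicit; these class definitions are assumptions sketched for this discussion, not the repo's actual types:

```python
from dataclasses import dataclass, field

@dataclass
class Response:
    perf_ns: int

@dataclass
class Request:
    start_perf_ns: int
    responses: list = field(default_factory=list)

# Each request owns its responses, so the formulas' responses[x]
# is shorthand for request.responses[x].
request = Request(start_perf_ns=100, responses=[Response(250), Response(400)])

ttft_ns = request.responses[0].perf_ns - request.start_perf_ns              # 150
request_latency_ns = request.responses[-1].perf_ns - request.start_perf_ns  # 300
```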

Contributor Author

Adding to my first point: most `sum(...)`-type metrics do not actually use `sum` at all; that is to make it easier for the user to understand. Instead, they are computed in two stages, as mentioned in the sections above:

Example Scenario

request_count increments by 1 for each successful request. At the end of a benchmark with 100 successful requests, this metric equals 100 (a single value, not a distribution).

```python
class MinRequestTimestampMetric(BaseAggregateMetric[int]):
    """
    Post-processor for calculating the minimum request timestamp metric from records.

    Formula:
        Minimum Request Timestamp = Min(Request Timestamps)
    """

    tag = "min_request_timestamp"
    header = "Minimum Request Timestamp"
    short_header = "Min Req"
    short_header_hide_unit = True
    unit = MetricTimeUnit.NANOSECONDS
    display_unit = MetricDateTimeUnit.DATE_TIME
    flags = MetricFlags.HIDDEN
    required_metrics = None

    def __init__(self) -> None:
        # Default to a large value, so that any request timestamp will be smaller.
        super().__init__(default_value=sys.maxsize)

    def _parse_record(
        self,
        record: ParsedResponseRecord,
        record_metrics: MetricRecordDict,
    ) -> int:
        """Return the request timestamp."""
        # NOTE: Use the request timestamp_ns, not start_perf_ns, because we want
        # wall-clock timestamps rather than relative perf-counter values.
        return record.timestamp_ns

    def _aggregate_value(self, value: int) -> None:
        """Aggregate the metric value. For this metric, we just take the min of the values from the different processes."""
        if value < self._value:
            self._value = value
```


If it's more complicated to try to map to the implementation, a thought in my mind is to make the first part of this document the formal definition of the metrics. Like, define a number of pseudo-variables -- then once the "theory" is laid out, you can have a section or a link to another guide that explains the metric implementation?

Sometimes the implementation gets in the way of clear expression.

> [!NOTE]
> Metrics in this section are available for all benchmark runs with no special requirements.
### Request Latency


Isn't this the same as the benchmark duration below? Is this intended to be a per-request metric?

Contributor Author

Same issue as my response at #321 (comment), I guess. `start_perf_ns` is per-request.

Contributor Author

Also, initially I had listed the metric type for each one, but I was trying my best not to make the doc too long. I think it may be good to have it, especially since I directly explain what the differences are up above.

The other thing is that I originally grouped the metrics by type (record, aggregate, derived), but felt it flowed better to group by use case, especially to help people understand why they are or are not seeing certain metrics. (It also removed my need to explain `--streaming` under each streaming metric.)


Hmm -- that's fair. Might it be worth explaining what a per-request statistic is in a central place and then labeling specific metrics as per-request (then linking to the per-request definition)?


**Notes:**
- Error rate can be computed as `error_request_count / (request_count + error_request_count)`.


Non-gating comment, but might it be worth defining total requests as the sum of valid + invalid requests?
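Sketched out with hypothetical counts, the suggested definition would read:

```python
# Illustrative numbers only; total requests = valid + invalid.
request_count = 95        # successful (valid) requests
error_request_count = 5   # failed (invalid) requests

total_request_count = request_count + error_request_count
error_rate = error_request_count / total_request_count
print(error_rate)  # 0.05
```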

@ajcasagrande ajcasagrande self-assigned this Oct 28, 2025
Signed-off-by: Anthony Casagrande <[email protected]>
@ajcasagrande
Contributor Author

@coderabbitai recheck

@coderabbitai

coderabbitai bot commented Nov 6, 2025

@ajcasagrande: I'll review the changes in this PR now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
README.md (2)

259-270: General Metrics table has formula inconsistencies and unit-conversion gaps.

Multiple issues need addressing:

  1. Request Latency (line 265): Formula shows responses[-1].perf_ns - request.start_perf_ns (incomplete reference; should be request.responses[-1]), with ms unit but ns-based formula without conversion.

  2. Timestamp columns (lines 268–269): Formulas use nanoseconds (timestamp_ns) but units list datetime. This mismatch is confusing; clarify whether these are raw ns values or converted datetime strings.

  3. Benchmark Duration (line 270): Formula subtracts two timestamps without showing the nanosecond-to-seconds conversion required to produce the sec unit.

Apply consistent fixes:

-| [**Request Latency**](docs/metrics_reference.md#request-latency) | `request_latency` | `responses[-1].perf_ns - request.start_perf_ns` | `ms` |
+| [**Request Latency**](docs/metrics_reference.md#request-latency) | `request_latency_ms` | `(request.responses[-1].perf_ns - request.start_perf_ns) / 1e6` | `ms` |
-| [**Benchmark Duration**](docs/metrics_reference.md#benchmark-duration) | `benchmark_duration` | `max_response_timestamp - min_request_timestamp` | `sec` |
+| [**Benchmark Duration**](docs/metrics_reference.md#benchmark-duration) | `benchmark_duration_seconds` | `(max_response_timestamp_ns - min_request_timestamp_ns) / 1e9` | `sec` |

179-190: Fix formula-unit mismatch: nanosecond operations shown with millisecond units.

README formulas show perf_ns values (nanoseconds) but specify ms units without conversion. This conflicts with the detailed metrics_reference.md which shows explicit nanosecond→millisecond conversions. Either show conversions in the formula (e.g., / 1e6 for ms) or update units to ns to match the raw formula.

Example discrepancy:

-| [**Time to First Token (TTFT)**](docs/metrics_reference.md#time-to-first-token-ttft) | `ttft` | `responses[0].perf_ns - request.start_perf_ns` | `ms` |
+| [**Time to First Token (TTFT)**](docs/metrics_reference.md#time-to-first-token-ttft) | `ttft` | `(responses[0].perf_ns - request.start_perf_ns) / 1e6` | `ms` |

Apply similar fixes for TTST (line 186) and ICL (line 188).
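The conversions being asked for can be sketched with two small helpers; the function and constant names are illustrative assumptions, not identifiers from the codebase:

```python
NANOS_PER_MILLI = 1e6
NANOS_PER_SECOND = 1e9

def ns_to_ms(value_ns: int) -> float:
    """Convert a nanosecond duration to milliseconds."""
    return value_ns / NANOS_PER_MILLI

def ns_to_s(value_ns: int) -> float:
    """Convert a nanosecond duration to seconds."""
    return value_ns / NANOS_PER_SECOND

# e.g. a hypothetical TTFT of 250_000_000 ns:
print(ns_to_ms(250_000_000))  # 250.0 (ms)
print(ns_to_s(250_000_000))   # 0.25 (s)
```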

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c25567e and 4fb3771.

📒 Files selected for processing (2)
  • README.md (2 hunks)
  • docs/metrics_reference.md (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-10-02T19:31:57.859Z
Learnt from: ajcasagrande
Repo: ai-dynamo/aiperf PR: 321
File: docs/metrics_reference.md:54-63
Timestamp: 2025-10-02T19:31:57.859Z
Learning: In the aiperf repository's docs/metrics_reference.md file, the maintainer prefers using h4 headings (####) for subsections under h2 headings instead of h3 (###) for better visual sizing and readability, even though this violates markdownlint rule MD001.

Applied to files:

  • docs/metrics_reference.md
  • README.md
📚 Learning: 2025-10-18T16:31:22.126Z
Learnt from: ajcasagrande
Repo: ai-dynamo/aiperf PR: 380
File: docs/tutorials/request-rate-concurrency.md:27-40
Timestamp: 2025-10-18T16:31:22.126Z
Learning: In the aiperf repository documentation, blank lines between different alert blockquotes (e.g., [!IMPORTANT], [!TIP], [!NOTE]) are intentional for visual separation, even if they trigger MD028 linting warnings. The separate blockquotes improve readability by clearly distinguishing between different types of information.

Applied to files:

  • docs/metrics_reference.md
📚 Learning: 2025-10-15T03:24:10.758Z
Learnt from: ajcasagrande
Repo: ai-dynamo/aiperf PR: 359
File: aiperf/metrics/types/time_to_first_output_metric.py:0-0
Timestamp: 2025-10-15T03:24:10.758Z
Learning: In TimeToFirstOutputMetric and similar metrics, invalid timestamp scenarios (where response timestamps precede request start) are automatically caught by the base class validation through the record.valid property, which checks that start_perf_ns < end_perf_ns. This validation happens in _require_valid_record before _parse_record is called, so explicit timestamp validation in _parse_record may be redundant.

Applied to files:

  • docs/metrics_reference.md
  • README.md
🪛 LanguageTool
README.md

[grammar] ~192-~192: Use a hyphen to join words.
Context: ...tft_seconds|tokens/sec` | ### Token Based Metrics Metrics for token-produci...

(QB_NEW_EN_HYPHEN)

🔇 Additional comments (7)
docs/metrics_reference.md (6)

1-72: Well-organized and comprehensive structure.

The file layout with clear table of contents, proper linking, and introductory guidance is excellent for documentation navigation.


74-117: Clear explanation of metric computation phases.

The section effectively distinguishes Record, Aggregate, and Derived metrics with concrete examples. Heading hierarchy (h2 → h3) is correct per MD001.


122-254: Streaming metrics section is comprehensive and well-documented.

Formulas correctly show nanosecond-to-millisecond and nanosecond-to-second conversions. Dependencies, requirements, and important distinctions (e.g., per-user vs aggregate throughput) are clearly explained with helpful alerts.


255-381: Token-based metrics are clearly documented with proper formula syntax.

Generator expressions correctly use for r in records, and notes thoroughly address reasoning token handling for both separate reasoning_content fields and embedded <think> blocks. This resolves the prior confusion about which tokens are excluded.


699-809: General metrics are properly documented with correct formula syntax and wall-clock handling.

Generator expressions and timestamp calculations follow the correct approach. The wall-clock timestamp + duration method (line 788) correctly implements the approach clarified in previous feedback.

One clarity suggestion: The request_latency formula (line 712) uses nanoseconds but lacks a unit-conversion comment, and benchmark_duration (line 801) doesn't show the nanosecond-to-seconds conversion. While the raw formulas are correct, adding explicit conversion comments (e.g., / 1e9 for nanosecond-to-seconds) would improve clarity for readers implementing these metrics, especially when comparing to the README.md quick-reference tables.


810-840: Metric flags reference is comprehensive and well-organized.

Clear descriptions of individual and composite flags with impact information provide good reference material for understanding metric computation and display behavior.

README.md (1)

15-15: Navigation and section header are properly integrated.

The Metrics Reference link fits naturally into the navigation bar, and the intro correctly directs readers to the detailed documentation.

Also applies to: 175-177

| [**Output Token Throughput Per User**](docs/metrics_reference.md#output-token-throughput-per-user) | `output_token_throughput_per_user` | `1.0 / inter_token_latency_seconds` | `tokens/sec/user` |
| [**Prefill Throughput**](docs/metrics_reference.md#prefill-throughput) | `prefill_throughput` | `input_sequence_length / ttft_seconds` | `tokens/sec` |

### Token Based Metrics

⚠️ Potential issue | 🟡 Minor

Use hyphen for compound adjective.

Line 192 should use "Token-Based Metrics" (hyphenated) for proper grammar.

-### Token Based Metrics
+### Token-Based Metrics
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
### Token Based Metrics
### Token-Based Metrics
🧰 Tools
🪛 LanguageTool

[grammar] ~192-~192: Use a hyphen to join words.
Context: ...tft_seconds|tokens/sec` | ### Token Based Metrics Metrics for token-produci...

(QB_NEW_EN_HYPHEN)

🤖 Prompt for AI Agents
In README.md around line 192, the heading "Token Based Metrics" uses an
unhyphenated compound adjective; change it to "Token-Based Metrics" by inserting
a hyphen between Token and Based so the heading follows standard grammar for
compound modifiers.
