
Conversation

@ajcasagrande (Contributor) commented Nov 21, 2025

See tutorial: https://github.com/ai-dynamo/aiperf/blob/24f17513495eacede9975f15f5afc5cf9293f675/docs/tutorials/server-metrics.md

Summary by CodeRabbit

New Features

  • Added a server metrics collection feature that automatically gathers Prometheus-compatible metrics from LLM inference server endpoints, with support for custom endpoint configuration, automatic discovery, and deduplication of collected data.

Documentation

  • Added comprehensive server metrics tutorial documenting automatic discovery, configuration via CLI options and environment variables, output file formats, metrics schemas, and usage examples.


@ajcasagrande ajcasagrande requested a review from ilana-n November 21, 2025 20:38

github-actions bot commented Nov 21, 2025

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@e89cb2c8d04c03a2de44f67ac1ffd369b6058723

Recommended: install in a virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@e89cb2c8d04c03a2de44f67ac1ffd369b6058723

Last updated for commit: e89cb2c

@github-actions github-actions bot added the feat label Nov 21, 2025

coderabbitai bot commented Nov 21, 2025

Walkthrough

This PR introduces comprehensive server metrics collection to AIPerf. It adds Prometheus-compatible endpoint discovery, automatic metrics scraping with configurable intervals, async HTTP collection infrastructure, deduplication and buffering mechanisms, new message types for inter-service communication, a dedicated ServerMetricsManager service, export processors, and extensive documentation and tests.

Changes

Cohort: Configuration & Environment
Files: src/aiperf/common/config/config_defaults.py, src/aiperf/common/config/groups.py, src/aiperf/common/config/output_config.py, src/aiperf/common/config/user_config.py, src/aiperf/common/environment.py
Summary: Adds server metrics configuration: new OutputDefaults file paths, a Groups.SERVER_METRICS group, a user_config field server_metrics with URL normalization and parsing, output_config properties for export files, and environment variables for collection intervals, batch sizes, ports, and timeouts.

Cohort: Core Models & Data Structures
Files: src/aiperf/common/models/server_metrics_models.py, src/aiperf/common/models/__init__.py
Summary: Introduces comprehensive Pydantic models for metrics: HistogramData, SummaryData, SlimMetricSample, MetricSample, MetricFamily, MetricSchema, ServerMetricsRecord with to_slim() and extract_metadata() methods, and related metadata structures.

Cohort: Message Types & Enums
Files: src/aiperf/common/messages/server_metrics_messages.py, src/aiperf/common/messages/__init__.py, src/aiperf/common/enums/message_enums.py, src/aiperf/common/enums/post_processor_enums.py, src/aiperf/common/enums/prometheus_enums.py, src/aiperf/common/enums/service_enums.py
Summary: Adds ServerMetricsRecordsMessage and ServerMetricsStatusMessage for inter-service communication, MessageType enums for server metrics, ResultsProcessorType.SERVER_METRICS_JSONL_WRITER, a PrometheusMetricType enum with COUNTER/GAUGE/HISTOGRAM/SUMMARY/UNKNOWN, and ServiceType.SERVER_METRICS_MANAGER.

Cohort: Metrics Collection Infrastructure
Files: src/aiperf/common/mixins/base_metrics_collector_mixin.py, src/aiperf/common/mixins/buffered_jsonl_writer_mixin.py, src/aiperf/common/mixins/__init__.py, src/aiperf/common/duplicate_tracker.py, src/aiperf/common/metric_utils.py
Summary: Adds BaseMetricsCollectorMixin for async HTTP metrics scraping with reachability checks and collection intervals; BufferedJSONLWriterMixinWithDeduplication for batching and deduplication; AsyncKeyedDuplicateTracker for per-key duplicate suppression; and utility functions for endpoint URL normalization (sketched after this table) and hostname-aware Prometheus endpoint discovery.

Cohort: Server Metrics Services
Files: src/aiperf/server_metrics/server_metrics_data_collector.py, src/aiperf/server_metrics/server_metrics_manager.py, src/aiperf/server_metrics/__init__.py
Summary: Implements ServerMetricsDataCollector for parsing Prometheus metrics and ServerMetricsManager for coordinating collectors, endpoint discovery, status messaging, and lifecycle management.

Cohort: Results Processing & Integration
Files: src/aiperf/post_processors/server_metrics_export_results_processor.py, src/aiperf/post_processors/__init__.py, src/aiperf/records/records_manager.py, src/aiperf/records/__init__.py, src/aiperf/controller/system_controller.py, src/aiperf/common/protocols.py
Summary: Adds ServerMetricsExportResultsProcessor for JSONL export with metadata extraction, ServerMetricsResultsProcessorProtocol, records-manager integration for routing metrics messages, system-controller handlers for status updates, and tracking state types.

Cohort: Refactored Components
Files: src/aiperf/gpu_telemetry/telemetry_data_collector.py, src/aiperf/post_processors/telemetry_export_results_processor.py
Summary: Refactors TelemetryDataCollector to use BaseMetricsCollectorMixin instead of direct lifecycle management; updates TelemetryExportResultsProcessor to use BufferedJSONLWriterMixinWithDeduplication.

Cohort: Documentation
Files: README.md, docs/cli_options.md, docs/tutorials/server-metrics.md
Summary: Adds a feature entry and tutorials documenting automatic Prometheus endpoint discovery, collection behavior, configuration via environment variables, output formats (JSONL and metadata JSON), and metric deduplication.

Cohort: Tests
Files: tests/unit/common/test_duplicate_tracker.py, tests/unit/common/test_metric_utils.py, tests/unit/server_metrics/*, tests/unit/post_processors/test_server_metrics_export_results_processor.py, tests/unit/gpu_telemetry/test_telemetry_data_collector.py, tests/unit/post_processors/test_telemetry_export_results_processor.py
Summary: Adds comprehensive test coverage for deduplication, URL utilities, server metrics collection, parsing, manager orchestration, and export processor behavior, plus updated telemetry tests reflecting the refactor.
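
The URL normalization mentioned in the Metrics Collection Infrastructure cohort can be pictured with a small sketch. This is a hedged approximation based on the documented CLI behavior (bare host:port inputs become full /metrics URLs); the real normalize_metrics_endpoint_url in src/aiperf/common/metric_utils.py may handle more edge cases.

```python
from urllib.parse import urlparse

# Hedged sketch of the documented normalization: bare "host:port" inputs
# become full Prometheus scrape URLs; explicit paths are preserved.
def normalize_metrics_endpoint_url(raw: str) -> str:
    url = raw if "://" in raw else f"http://{raw}"
    parsed = urlparse(url)
    path = parsed.path if parsed.path and parsed.path != "/" else "/metrics"
    return f"{parsed.scheme}://{parsed.netloc}{path}"


assert normalize_metrics_endpoint_url("node1:8081") == "http://node1:8081/metrics"
assert normalize_metrics_endpoint_url("http://node2:9090/metrics") == "http://node2:9090/metrics"
```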

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • BaseMetricsCollectorMixin (src/aiperf/common/mixins/base_metrics_collector_mixin.py): Core HTTP collection infrastructure with async lifecycle, reachability checks, and background tasks; verify thread-safety and proper cleanup.
  • ServerMetricsDataCollector parsing logic (src/aiperf/server_metrics/server_metrics_data_collector.py): Complex Prometheus metric parsing across counter/gauge/histogram/summary types with deduplication and aggregation; validate correctness of process*_family methods.
  • ServerMetricsManager endpoint discovery and lifecycle (src/aiperf/server_metrics/server_metrics_manager.py): Service coordination, message publishing, and error handling across multiple collectors; check message flow and fallback behavior.
  • AsyncKeyedDuplicateTracker (src/aiperf/common/duplicate_tracker.py): Per-key locking and deduplication semantics; ensure thread-safety with concurrent records (a sketch of the per-key locking pattern follows this list).
  • TelemetryDataCollector refactor (src/aiperf/gpu_telemetry/telemetry_data_collector.py): Verify that the migration to BaseMetricsCollectorMixin preserves existing behavior and callbacks work correctly.
  • ServerMetricsExportResultsProcessor metadata handling (src/aiperf/post_processors/server_metrics_export_results_processor.py): Metadata update logic, atomicity of file writes, and lock safety during concurrent record processing.
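
For reference, the per-key locking pattern flagged for AsyncKeyedDuplicateTracker typically looks like this minimal sketch; the class and method names here are illustrative, not the tracker's actual API.

```python
import asyncio
from collections import defaultdict
from collections.abc import Awaitable, Callable


class KeyedLocks:
    """Illustrative per-key locking: work on the same key is serialized,
    while work on different keys proceeds concurrently."""

    def __init__(self) -> None:
        self._locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def run(self, key: str, fn: Callable[[], Awaitable[None]]) -> None:
        async with self._locks[key]:
            await fn()
```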

Poem

🐰 Prometheus whispers from endpoints far and wide,
Our metrics collector hops with boundless pride,
Deduplicating echoes, buffering with care,
Server stats dance through the network air,
From LLMs to systems, telemetry takes flight!

Pre-merge checks

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The title 'feat: Collect server-side prometheus metrics' accurately summarizes the main feature being added; it clearly communicates that this PR introduces server-side Prometheus metrics collection.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 92.38%, above the required 80.00% threshold.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/aiperf/gpu_telemetry/telemetry_data_collector.py (1)

23-41: Docstring mostly updated, but “logged and re-raised” no longer matches actual behavior

The class docstring correctly mentions BaseMetricsCollectorMixin, but in the _collect_and_process_metrics docstring (Lines 72-73 in this file) you still say “Any exception from fetch or parse is logged and re-raised.” With the mixin, exceptions are now caught in _collect_metrics_task and either:

  • delivered to error_callback, or
  • logged via self.error, but not re-raised (except CancelledError).

Consider updating that “Raises:” section to describe the new behavior (e.g., “Exceptions propagate to the mixin’s background task, which logs or forwards them via error_callback; callers of this method should not rely on it raising.”).
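
For context, the catch-and-forward behavior described above boils down to something like this stdlib-only sketch; the names are illustrative, and the actual mixin lives in src/aiperf/common/mixins/base_metrics_collector_mixin.py.

```python
import asyncio
import logging
from collections.abc import Awaitable, Callable

logger = logging.getLogger(__name__)


async def collect_loop(
    collect: Callable[[], Awaitable[None]],
    interval: float,
    error_callback: Callable[[Exception], Awaitable[None]] | None = None,
) -> None:
    # One failed scrape must not kill the loop: catch, forward or log,
    # and keep collecting. Only CancelledError propagates to stop the task.
    while True:
        try:
            await collect()
        except asyncio.CancelledError:
            raise
        except Exception as exc:
            if error_callback is not None:
                await error_callback(exc)
            else:
                logger.error("Metrics collection failed: %s", exc, exc_info=True)
        await asyncio.sleep(interval)
```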

🧹 Nitpick comments (20)
docs/cli_options.md (1)

417-417: Format URLs as proper markdown links.

The bare URLs should be formatted as markdown links for better readability and to follow markdown best practices.

Consider applying this diff:

-Server metrics collection (ENABLED BY DEFAULT). Automatically collects from inference endpoint base_url + `/metrics`. Optionally specify additional custom Prometheus-compatible endpoint URLs (e.g., http://node1:8081/metrics, http://node2:9090/metrics). Use AIPERF_SERVER_METRICS_ENABLED=false to disable. Example: `--server-metrics node1:8081 node2:9090/metrics` for additional endpoints.
+Server metrics collection (ENABLED BY DEFAULT). Automatically collects from inference endpoint base_url + `/metrics`. Optionally specify additional custom Prometheus-compatible endpoint URLs (e.g., `http://node1:8081/metrics`, `http://node2:9090/metrics`). Use `AIPERF_SERVER_METRICS_ENABLED=false` to disable. Example: `--server-metrics node1:8081 node2:9090/metrics` for additional endpoints.
tests/unit/common/test_duplicate_tracker.py (1)

260-279: Make test_concurrent_writes_to_same_key actually concurrent

Despite the name/docstring, this test currently performs all writes sequentially via write_sequence, so it doesn’t exercise per-key locking under real concurrency.

Consider rewriting to launch multiple writers with asyncio.gather:

     @pytest.mark.asyncio
     @pytest.mark.parametrize("num_tasks,writes_per_task", [(3, 10), (5, 5), (10, 3)])  # fmt: skip
     async def test_concurrent_writes_to_same_key(
         self,
         tracker: AsyncKeyedDuplicateTracker[SampleRecord],
         num_tasks: int,
         writes_per_task: int,
     ):
         """Test that concurrent writes to the same key are handled safely."""
-        results = await write_sequence(
-            tracker, "key1", [1] * (num_tasks * writes_per_task)
-        )
-
-        # At least first record should be written
-        total_written = sum(len(r) for r in results)
+        async def write_to_key(count: int) -> int:
+            total = 0
+            for _ in range(count):
+                record = SampleRecord(key="key1", value=1)
+                result = await tracker.deduplicate_record(record)
+                total += len(result)
+            return total
+
+        # Run multiple writers concurrently to the same key
+        results = await asyncio.gather(
+            *[write_to_key(writes_per_task) for _ in range(num_tasks)]
+        )
+
+        # At least first record should be written
+        total_written = sum(results)

The remaining assertions (total_written >= 1 and < num_tasks * writes_per_task) can stay as-is.

tests/unit/server_metrics/test_server_metrics_models.py (1)

17-191: Solid coverage for server metrics model serialization

The tests here exercise the critical paths for MetricSample.to_slim, ServerMetricsRecord.to_slim, MetricSchema, and the histogram/summary data shapes, including endpoint URL preservation and optional sum/count handling. This gives good confidence in the JSONL-facing representations; any further assertions (e.g., extra is None checks for unused fields) would be purely optional.

tests/unit/server_metrics/test_server_metrics_data_collector.py (2)

20-43: Prefer public properties over private attributes in tests.

The tests directly access private attributes (_endpoint_url, _collection_interval, _reachability_timeout, _session). Based on the mixin interface (src/aiperf/common/mixins/base_metrics_collector_mixin.py lines 74-81), public properties are available. Using public properties makes tests more resilient to internal implementation changes.

Apply this diff to use public properties:

-        assert collector._endpoint_url == "http://localhost:8081/metrics"
-        assert collector._collection_interval == 0.5
-        assert collector._reachability_timeout == 10.0
+        assert collector.endpoint_url == "http://localhost:8081/metrics"
+        assert collector.collection_interval == 0.5
+        # Note: reachability_timeout doesn't have a public property in the mixin
+        assert collector._reachability_timeout == 10.0
         assert collector.id == "test_collector"
         assert collector._session is None
         assert not collector.was_initialized
-        assert collector._endpoint_url == "http://localhost:8081/metrics"
-        assert collector._collection_interval == 0.33
+        assert collector.endpoint_url == "http://localhost:8081/metrics"
+        assert collector.collection_interval == 0.33
         assert collector.id == "server_metrics_collector"

245-264: Consider testing deduplication for other metric types.

The deduplication test only covers counters with last-wins behavior. While this validates the core logic, consider adding tests for gauge and histogram/summary deduplication to ensure consistent behavior across all metric types, especially since histograms and summaries have more complex structures (buckets, quantiles).

src/aiperf/controller/system_controller.py (1)

42-48: Server-metrics status integration is consistent with telemetry flow

Starting ServerMetricsManager as an optional service and handling SERVER_METRICS_STATUS with mirrored endpoint tracking/logging keeps behavior parallel to telemetry and looks correct. If you plan to surface server-metrics endpoint reachability in exports/summary later, you already have the necessary state here to wire in.

Also applies to: 132-138, 186-191, 387-405

docs/tutorials/server-metrics.md (1)

146-155: Minor doc polish and markdownlint fix

Content and examples line up well with the implementation. A couple of small nits you might want to address:

  • Around Line 153: “then new record written (marks start of new period)” – you could shorten to “then the new record (marks start of new period)” to satisfy the style checker.
  • Around Line 273: “at one point in time” can be simplified to “at one time” or just “at one point”.
  • Lines 220–226: there’s a blank > line inside the blockquote, which triggers markdownlint MD028; removing the empty quoted line will keep the callout but satisfy the linter.

Also applies to: 219-226, 271-275

src/aiperf/post_processors/server_metrics_export_results_processor.py (1)

44-67: Consider avoiding unnecessary metadata extraction and clarifying update semantics

The per-endpoint metadata tracking and atomic file writes look solid, and the dedupe key/value choice (endpoint URL + metrics dict) matches the documented behavior.

Two minor suggestions:

  • record.extract_metadata() is computed for every record, even when _should_update_metadata returns False. Since _should_update_metadata only inspects record.metrics, you can defer extract_metadata() until should_write is True to avoid extra Pydantic object construction:
-        url = record.endpoint_url
-        metadata = record.extract_metadata()
+        url = record.endpoint_url
@@
-        if should_write:
+        if should_write:
+            metadata = record.extract_metadata()
             async with self._metadata_file_lock:
  • The comment above _metadata_file_model notes that “metadata fields can occasionally change,” but _should_update_metadata only reacts to new metric names, not changes to existing schemas or info metrics. That’s probably fine in practice; consider tightening the comment or, if you do care about changing descriptions/types, extending _should_update_metadata to compare existing vs new MetricSchema/InfoMetricData entries.

Also applies to: 71-104, 109-134

src/aiperf/server_metrics/server_metrics_manager.py (2)

102-161: Configure phase and reachability handling are robust, with minor lint nits

The PROFILE_CONFIGURE flow (building collectors, checking reachability, emitting a status, and bailing early when none are reachable) is clean and matches the documented behavior. Two small nits:

  • message (Line 104) is unused but required by @on_command. If Ruff’s ARG002 is noisy, consider renaming to _message or adding a # noqa: ARG002 on the line.
  • The broad except Exception as e around reachability (Lines 148-150) is defensively reasonable here, but if you want to satisfy BLE001 you could either narrow to aiohttp.ClientError/OSError or document the intentional broad catch with a # noqa: BLE001 comment.

Functionally this segment is fine as-is.


170-205: Record + status publishing behavior is correct; broad exception catches are intentional but noisy

The callbacks and status sender behave as expected:

  • _on_server_metrics_records only emits messages when records is non-empty and on failure sends a second message carrying ErrorDetails, which aligns with ServerMetricsRecordsMessage.valid/has_data.
  • _on_server_metrics_error correctly wraps collector errors into a dedicated records message with records=[].
  • _send_server_metrics_status centralizes status reporting for both disabled and enabled cases.

The repeated except Exception as e blocks (Lines 191-193, 252-253, 280-295, 318-320, 351-352) are reasonable for shutdown/telemetry paths, but they’re exactly what Ruff’s BLE001 is flagging. If you want to quiet that without changing behavior, you can:

  • Add # noqa: BLE001 on those except lines, or
  • Narrow a couple of them where you know the likely failure class (e.g., messaging client errors).

No functional issues here; this is purely about linter noise vs. defensive coding.

Also applies to: 236-255, 270-295, 296-320, 321-352

src/aiperf/common/models/server_metrics_models.py (1)

75-123: Slim conversion and metadata extraction match the intended server-metrics format

MetricSample.to_slim, ServerMetricsRecord.to_slim, and extract_metadata correctly:

  • Represent histograms/summaries via nested dicts plus sum/count.
  • Exclude _info metrics from slim records while capturing them in info_metrics with full label data.
  • Build metric_schemas with PrometheusMetricType and descriptions as used by the export processor tests.

The implementation is straightforward and lines up with the JSONL + metadata expectations in the post-processor tests. One tiny nit: the docstring in ServerMetricsRecord.to_slim mentions “array-based format,” but the actual structure is “list of slim samples with dict-based histogram/summary fields,” which you may want to clarify later.

Also applies to: 160-257, 259-291
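
To make the slim format concrete, here is a hedged illustration of one slim JSONL record with dict-based histogram fields; the field names are a sketch inferred from this review, not the authoritative ServerMetricsSlimRecord schema.

```python
# Illustrative shape only -- verify against ServerMetricsSlimRecord.
slim_record = {
    "timestamp": 1732220000.0,
    "endpoint_url": "http://localhost:8081/metrics",
    "metrics": {
        "requests_total": 1042.0,  # counter/gauge: plain value
        "request_duration_seconds": {  # histogram: dict-based fields
            "buckets": {"0.1": 4, "0.5": 9, "+Inf": 10},
            "sum": 2.7,
            "count": 10,
        },
    },
}
```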

tests/unit/post_processors/test_server_metrics_export_results_processor.py (1)

380-405: Remove unused monkeypatch fixture parameter

In test_unique_label_values_respects_cardinality_limit the monkeypatch argument is never used, which triggers Ruff’s ARG002. Since the test doesn’t rely on it, you can safely drop the parameter.

-    async def test_unique_label_values_respects_cardinality_limit(
-        self,
-        user_config_server_metrics_export: UserConfig,
-        service_config: ServiceConfig,
-        monkeypatch,
-    ):
+    async def test_unique_label_values_respects_cardinality_limit(
+        self,
+        user_config_server_metrics_export: UserConfig,
+        service_config: ServiceConfig,
+    ):
src/aiperf/server_metrics/server_metrics_data_collector.py (1)

26-66: Collector initialization correctly hooks into the base mixin; consider is None checks for intervals

The constructor wires endpoint_url, collection_interval, reachability_timeout, and callbacks into BaseMetricsCollectorMixin as expected, using environment defaults when explicit values aren’t provided. Functionally this is fine.

If you ever need to support a zero-second interval or timeout explicitly, you might prefer if collection_interval is None / if reachability_timeout is None instead of or, to avoid treating 0.0 as “use default.” For current usage (only None vs. positive defaults) this is just an ergonomics tweak.

-        super().__init__(
-            endpoint_url=endpoint_url,
-            collection_interval=collection_interval
-            or Environment.SERVER_METRICS.COLLECTION_INTERVAL,
-            reachability_timeout=reachability_timeout
-            or Environment.SERVER_METRICS.REACHABILITY_TIMEOUT,
+        super().__init__(
+            endpoint_url=endpoint_url,
+            collection_interval=(
+                collection_interval
+                if collection_interval is not None
+                else Environment.SERVER_METRICS.COLLECTION_INTERVAL
+            ),
+            reachability_timeout=(
+                reachability_timeout
+                if reachability_timeout is not None
+                else Environment.SERVER_METRICS.REACHABILITY_TIMEOUT
+            ),
src/aiperf/records/records_manager.py (2)

69-103: Shared ErrorTrackingState is a reasonable abstraction; tiny typing nit only

Introducing ErrorTrackingState and reusing it via TelemetryTrackingState, ServerMetricsTrackingState, and MetricTrackingState nicely unifies error counting across subsystems.

Minor nit: the field is annotated as dict[ErrorDetails, int] but initialized with defaultdict(int). That’s perfectly fine at runtime but might show up in strict type-checking; if it does, you can either:

  • Change the annotation to collections.abc.MutableMapping[ErrorDetails, int], or
  • Wrap the defaultdict in dict(...) when returning summaries (as you already iterate via .items()).

Functionally, this segment is good.
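
A minimal sketch of the suggested annotation change, using a hashable stand-in for the project's ErrorDetails model:

```python
from collections import defaultdict
from collections.abc import MutableMapping
from dataclasses import dataclass


@dataclass(frozen=True)  # hashable stand-in for aiperf's ErrorDetails
class ErrorDetails:
    message: str


# MutableMapping accepts a defaultdict under strict type checking,
# whereas dict[ErrorDetails, int] may be flagged.
error_counts: MutableMapping[ErrorDetails, int] = defaultdict(int)
error_counts[ErrorDetails("timeout")] += 1
```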


271-290: Server-metrics record handling and processor error tracking look correct; consider exposing a summary later

The new flow for server metrics:

  • _on_server_metrics_records consumes ServerMetricsRecordsMessage, forwarding each ServerMetricsRecord to _send_server_metrics_to_results_processors when message.valid and counting collection errors otherwise.
  • _send_server_metrics_to_results_processors fans out to all registered server-metrics processors with asyncio.gather(..., return_exceptions=True), logs any processor exceptions, and increments _server_metrics_state.error_counts using ErrorDetails.from_exception.

This mirrors the telemetry path and gives good resilience against individual processor failures.

Right now, the accumulated server-metrics error counts are never surfaced (unlike telemetry via get_telemetry_error_summary). If you want symmetry and better observability for server-metrics failures, a future follow-up could add a get_server_metrics_error_summary() method or fold these into an appropriate results/metadata export. Not a blocker for this PR.

Also applies to: 398-420

src/aiperf/common/mixins/base_metrics_collector_mixin.py (5)

21-28: Consider using concrete type aliases instead of TypeVar-bound callbacks

TRecordCallback / TErrorCallback are TypeVars that aren’t part of the class’ Generic[...] parameters, which most type checkers treat as effectively non-generic. A simple alias like RecordCallback = Callable[[list[TRecord], str], Awaitable[None]] and ErrorCallback = Callable[[ErrorDetails, str], Awaitable[None]] would be clearer and avoid confusing template parameters.
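
A sketch of the suggested aliases; ErrorDetails is a stand-in here, whereas in the PR it would come from aiperf's models.

```python
from collections.abc import Awaitable, Callable
from typing import TypeVar

TRecord = TypeVar("TRecord")


class ErrorDetails:  # stand-in for the project's model
    ...


# Generic aliases read clearly at call sites and type-check as intended:
# RecordCallback[MyRecord] is a callable taking (records, collector_id).
RecordCallback = Callable[[list[TRecord], str], Awaitable[None]]
ErrorCallback = Callable[[ErrorDetails, str], Awaitable[None]]
```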


84-96: Session timeout choice: connect-only timeout may allow indefinitely hanging scrapes

Using ClientTimeout(total=None, connect=self._reachability_timeout) intentionally leaves no overall timeout on GET calls. If a metrics endpoint stalls after connection (e.g., slow or wedged server), _fetch_metrics_text can hang until cancellation. If that’s not desired, consider also setting a sock_read or total timeout (possibly configurable) to bound worst-case collection latency.
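
For illustration, bounding read stalls as well as connection setup might look like the following; the values are illustrative, and sock_read is an assumed addition rather than current PR behavior.

```python
import aiohttp

# total=None keeps long-running sessions usable across many scrapes,
# while sock_read caps how long any single read may stall.
timeout = aiohttp.ClientTimeout(
    total=None,
    connect=10.0,     # reachability_timeout in the collector config
    sock_read=30.0,   # hypothetical stall bound suggested above
)


async def make_session() -> aiohttp.ClientSession:
    return aiohttp.ClientSession(timeout=timeout)
```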


108-151: Reachability semantics: only HTTP 200 treated as “reachable”

is_url_reachable currently returns True only for status 200 from either HEAD or GET. That will treat common cases like 3xx redirects, 401/403 auth failures, or 204 responses as unreachable, even though the endpoint exists. If the goal is “endpoint is up” rather than “scrape is currently succeeding”, consider broadening to 2xx or even < 500 depending on how you intend to gate startup/health decisions.
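
A hedged sketch of the broader check suggested here; this reflects the review's suggestion, not the PR's current status == 200 behavior.

```python
import asyncio

import aiohttp


async def is_endpoint_up(url: str, timeout_s: float = 10.0) -> bool:
    # Anything below 500 (redirects, auth errors, 204) still proves the
    # endpoint exists; only connection failures and 5xx count as down.
    timeout = aiohttp.ClientTimeout(total=timeout_s)
    try:
        async with aiohttp.ClientSession(timeout=timeout) as session:
            async with session.head(url, allow_redirects=True) as resp:
                return resp.status < 500
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return False
```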


153-178: Background task error handling: broad Exception catch is intentional but could log more detail

Catching Exception around _collect_and_process_metrics() and the error_callback ensures that one bad scrape or callback won’t kill the background task, which is appropriate here. To improve debuggability, you might:

  • Log with exc_info=True in the else branch (self.error(f"...", exc_info=True)), and
  • Optionally include exc_info=True when logging callback_error as well.

That keeps robustness while giving full tracebacks when things go wrong.
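
Concretely, with the stdlib logging module the suggestion amounts to passing exc_info; whether aiperf's self.error accepts exc_info is an assumption to verify against its logger protocol.

```python
import logging

logger = logging.getLogger(__name__)


def log_collection_failure(endpoint_url: str, exc: Exception) -> None:
    # exc_info may be True or the exception itself; either attaches
    # the full traceback to the log record.
    logger.error("Metrics collection failed for %s", endpoint_url, exc_info=exc)
```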


225-235: Callback failure handling is robust; consider whether to surface structured error info

Silently catching all Exception from record_callback and just logging is appropriate to keep the collector running. If downstreams ever need to react to callback failures (e.g., to emit an ErrorDetails via error_callback or metrics), this would be the place to add that wiring. As-is, the implementation is safe and adequate.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 89b6a45 and 24f1751.

📒 Files selected for processing (44)
  • README.md (1 hunks)
  • docs/cli_options.md (1 hunks)
  • docs/tutorials/server-metrics.md (1 hunks)
  • src/aiperf/common/config/config_defaults.py (1 hunks)
  • src/aiperf/common/config/groups.py (1 hunks)
  • src/aiperf/common/config/output_config.py (4 hunks)
  • src/aiperf/common/config/user_config.py (2 hunks)
  • src/aiperf/common/duplicate_tracker.py (1 hunks)
  • src/aiperf/common/enums/__init__.py (2 hunks)
  • src/aiperf/common/enums/message_enums.py (1 hunks)
  • src/aiperf/common/enums/post_processor_enums.py (1 hunks)
  • src/aiperf/common/enums/prometheus_enums.py (1 hunks)
  • src/aiperf/common/enums/service_enums.py (1 hunks)
  • src/aiperf/common/environment.py (4 hunks)
  • src/aiperf/common/messages/__init__.py (2 hunks)
  • src/aiperf/common/messages/server_metrics_messages.py (1 hunks)
  • src/aiperf/common/metric_utils.py (1 hunks)
  • src/aiperf/common/mixins/__init__.py (3 hunks)
  • src/aiperf/common/mixins/base_metrics_collector_mixin.py (1 hunks)
  • src/aiperf/common/mixins/buffered_jsonl_writer_mixin.py (4 hunks)
  • src/aiperf/common/models/__init__.py (4 hunks)
  • src/aiperf/common/models/server_metrics_models.py (1 hunks)
  • src/aiperf/common/protocols.py (2 hunks)
  • src/aiperf/controller/system_controller.py (4 hunks)
  • src/aiperf/gpu_telemetry/telemetry_data_collector.py (4 hunks)
  • src/aiperf/post_processors/__init__.py (2 hunks)
  • src/aiperf/post_processors/server_metrics_export_results_processor.py (1 hunks)
  • src/aiperf/post_processors/telemetry_export_results_processor.py (3 hunks)
  • src/aiperf/records/__init__.py (1 hunks)
  • src/aiperf/records/records_manager.py (6 hunks)
  • src/aiperf/server_metrics/__init__.py (1 hunks)
  • src/aiperf/server_metrics/server_metrics_data_collector.py (1 hunks)
  • src/aiperf/server_metrics/server_metrics_manager.py (1 hunks)
  • tests/unit/common/test_duplicate_tracker.py (1 hunks)
  • tests/unit/common/test_metric_utils.py (1 hunks)
  • tests/unit/gpu_telemetry/test_telemetry_data_collector.py (8 hunks)
  • tests/unit/post_processors/test_server_metrics_export_results_processor.py (1 hunks)
  • tests/unit/post_processors/test_telemetry_export_results_processor.py (6 hunks)
  • tests/unit/server_metrics/__init__.py (1 hunks)
  • tests/unit/server_metrics/conftest.py (1 hunks)
  • tests/unit/server_metrics/test_malformed_metrics_validation.py (1 hunks)
  • tests/unit/server_metrics/test_server_metrics_data_collector.py (1 hunks)
  • tests/unit/server_metrics/test_server_metrics_manager.py (1 hunks)
  • tests/unit/server_metrics/test_server_metrics_models.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-15T03:24:10.758Z
Learnt from: ajcasagrande
Repo: ai-dynamo/aiperf PR: 359
File: aiperf/metrics/types/time_to_first_output_metric.py:0-0
Timestamp: 2025-10-15T03:24:10.758Z
Learning: In TimeToFirstOutputMetric and similar metrics, invalid timestamp scenarios (where response timestamps precede request start) are automatically caught by the base class validation through the record.valid property, which checks that start_perf_ns < end_perf_ns. This validation happens in _require_valid_record before _parse_record is called, so explicit timestamp validation in _parse_record may be redundant.

Applied to files:

  • tests/unit/server_metrics/test_malformed_metrics_validation.py
📚 Learning: 2025-10-17T16:52:54.523Z
Learnt from: ilana-n
Repo: ai-dynamo/aiperf PR: 372
File: aiperf/gpu_telemetry/telemetry_manager.py:204-206
Timestamp: 2025-10-17T16:52:54.523Z
Learning: In `aiperf/gpu_telemetry/telemetry_data_collector.py`, the `TelemetryDataCollector.initialize()` and `start()` methods perform only synchronous, non-blocking operations. `initialize()` creates an `aiohttp.ClientSession` object (instant construction) and `start()` schedules background tasks without awaiting them (instant scheduling via asyncio). There is no risk of indefinite hangs, so timeouts are not needed when calling these methods.

Applied to files:

  • tests/unit/gpu_telemetry/test_telemetry_data_collector.py
  • src/aiperf/gpu_telemetry/telemetry_data_collector.py
  • src/aiperf/common/mixins/base_metrics_collector_mixin.py
🧬 Code graph analysis (33)
tests/unit/common/test_duplicate_tracker.py (1)
src/aiperf/common/duplicate_tracker.py (3)
  • AsyncKeyedDuplicateTracker (15-136)
  • deduplicate_record (67-111)
  • flush_remaining_duplicates (113-136)
src/aiperf/common/messages/server_metrics_messages.py (4)
src/aiperf/common/enums/message_enums.py (1)
  • MessageType (7-53)
src/aiperf/common/messages/service_messages.py (1)
  • BaseServiceMessage (13-20)
src/aiperf/common/models/error_models.py (1)
  • ErrorDetails (12-76)
src/aiperf/common/models/server_metrics_models.py (1)
  • ServerMetricsRecord (214-291)
tests/unit/post_processors/test_telemetry_export_results_processor.py (1)
src/aiperf/common/models/telemetry_models.py (1)
  • TelemetryMetrics (13-45)
tests/unit/common/test_metric_utils.py (1)
src/aiperf/common/metric_utils.py (2)
  • build_hostname_aware_prometheus_endpoints (41-78)
  • normalize_metrics_endpoint_url (6-38)
src/aiperf/post_processors/server_metrics_export_results_processor.py (7)
src/aiperf/common/config/user_config.py (1)
  • UserConfig (34-448)
src/aiperf/common/enums/post_processor_enums.py (1)
  • ResultsProcessorType (22-50)
src/aiperf/common/factories.py (1)
  • ResultsProcessorFactory (572-592)
src/aiperf/common/mixins/buffered_jsonl_writer_mixin.py (3)
  • BufferedJSONLWriterMixinWithDeduplication (193-228)
  • buffered_write (75-110)
  • buffered_write (216-220)
src/aiperf/common/models/server_metrics_models.py (7)
  • ServerMetricsMetadata (182-198)
  • ServerMetricsMetadataFile (201-211)
  • ServerMetricsRecord (214-291)
  • ServerMetricsSlimRecord (160-179)
  • extract_metadata (259-291)
  • to_slim (92-122)
  • to_slim (234-257)
src/aiperf/common/protocols.py (9)
  • ServerMetricsResultsProcessorProtocol (547-574)
  • info (70-70)
  • process_server_metrics_record (558-566)
  • metadata (394-394)
  • metadata (620-620)
  • error (74-74)
  • summarize (543-543)
  • summarize (568-574)
  • summarize (594-601)
src/aiperf/post_processors/telemetry_export_results_processor.py (1)
  • summarize (68-70)
tests/unit/server_metrics/test_server_metrics_data_collector.py (6)
src/aiperf/common/enums/prometheus_enums.py (1)
  • PrometheusMetricType (10-47)
src/aiperf/common/models/error_models.py (1)
  • ErrorDetails (12-76)
src/aiperf/common/models/server_metrics_models.py (1)
  • ServerMetricsRecord (214-291)
src/aiperf/server_metrics/server_metrics_data_collector.py (2)
  • ServerMetricsDataCollector (26-270)
  • _parse_metrics_to_records (86-156)
src/aiperf/common/mixins/base_metrics_collector_mixin.py (5)
  • endpoint_url (75-77)
  • collection_interval (80-82)
  • is_url_reachable (108-127)
  • _send_records_via_callback (225-235)
  • _collect_metrics_task (154-177)
tests/unit/server_metrics/conftest.py (1)
  • sample_prometheus_metrics (17-35)
src/aiperf/common/mixins/buffered_jsonl_writer_mixin.py (3)
src/aiperf/common/duplicate_tracker.py (3)
  • AsyncKeyedDuplicateTracker (15-136)
  • deduplicate_record (67-111)
  • flush_remaining_duplicates (113-136)
src/aiperf/common/protocols.py (2)
  • exception (75-75)
  • debug (69-69)
src/aiperf/common/hooks.py (1)
  • on_stop (271-288)
src/aiperf/post_processors/telemetry_export_results_processor.py (2)
src/aiperf/common/mixins/buffered_jsonl_writer_mixin.py (1)
  • BufferedJSONLWriterMixinWithDeduplication (193-228)
src/aiperf/common/models/telemetry_models.py (1)
  • TelemetryRecord (48-86)
tests/unit/server_metrics/test_malformed_metrics_validation.py (1)
src/aiperf/server_metrics/server_metrics_data_collector.py (2)
  • ServerMetricsDataCollector (26-270)
  • _parse_metrics_to_records (86-156)
tests/unit/server_metrics/test_server_metrics_models.py (3)
src/aiperf/common/enums/prometheus_enums.py (1)
  • PrometheusMetricType (10-47)
src/aiperf/common/models/server_metrics_models.py (9)
  • HistogramData (10-19)
  • MetricFamily (125-132)
  • MetricSample (75-122)
  • MetricSchema (135-143)
  • ServerMetricsRecord (214-291)
  • SlimMetricSample (34-72)
  • SummaryData (22-31)
  • to_slim (92-122)
  • to_slim (234-257)
tests/unit/server_metrics/conftest.py (3)
  • sample_counter_metric (39-50)
  • sample_histogram_metric (69-84)
  • sample_server_metrics_record (107-122)
src/aiperf/server_metrics/server_metrics_manager.py (8)
src/aiperf/common/base_component_service.py (1)
  • BaseComponentService (32-133)
src/aiperf/common/config/user_config.py (2)
  • UserConfig (34-448)
  • server_metrics_urls (333-335)
src/aiperf/common/messages/server_metrics_messages.py (2)
  • ServerMetricsRecordsMessage (11-40)
  • ServerMetricsStatusMessage (43-62)
src/aiperf/common/metric_utils.py (2)
  • build_hostname_aware_prometheus_endpoints (41-78)
  • normalize_metrics_endpoint_url (6-38)
src/aiperf/common/models/error_models.py (2)
  • ErrorDetails (12-76)
  • from_exception (66-76)
src/aiperf/common/protocols.py (10)
  • PushClientProtocol (156-157)
  • ServiceProtocol (512-524)
  • debug (69-69)
  • error (74-74)
  • stop (122-122)
  • initialize (119-119)
  • start (120-120)
  • warning (72-72)
  • push (157-157)
  • publish (143-143)
src/aiperf/server_metrics/server_metrics_data_collector.py (1)
  • ServerMetricsDataCollector (26-270)
src/aiperf/common/mixins/base_metrics_collector_mixin.py (3)
  • endpoint_url (75-77)
  • collection_interval (80-82)
  • is_url_reachable (108-127)
src/aiperf/common/messages/__init__.py (1)
src/aiperf/common/messages/server_metrics_messages.py (2)
  • ServerMetricsRecordsMessage (11-40)
  • ServerMetricsStatusMessage (43-62)
src/aiperf/controller/system_controller.py (5)
src/aiperf/common/messages/server_metrics_messages.py (1)
  • ServerMetricsStatusMessage (43-62)
src/aiperf/common/protocols.py (1)
  • run_service (484-486)
src/aiperf/common/enums/service_enums.py (1)
  • ServiceType (32-48)
src/aiperf/common/hooks.py (1)
  • on_message (313-334)
src/aiperf/common/enums/message_enums.py (1)
  • MessageType (7-53)
tests/unit/gpu_telemetry/test_telemetry_data_collector.py (1)
src/aiperf/common/mixins/base_metrics_collector_mixin.py (4)
  • endpoint_url (75-77)
  • collection_interval (80-82)
  • _fetch_metrics_text (190-223)
  • _collect_metrics_task (154-177)
tests/unit/server_metrics/test_server_metrics_manager.py (7)
src/aiperf/common/config/user_config.py (1)
  • UserConfig (34-448)
src/aiperf/common/messages/server_metrics_messages.py (1)
  • ServerMetricsRecordsMessage (11-40)
src/aiperf/common/models/error_models.py (2)
  • ErrorDetails (12-76)
  • from_exception (66-76)
src/aiperf/common/models/server_metrics_models.py (1)
  • ServerMetricsRecord (214-291)
src/aiperf/server_metrics/server_metrics_manager.py (5)
  • _profile_configure_command (103-168)
  • _on_start_profiling (171-203)
  • _on_server_metrics_records (255-294)
  • _on_server_metrics_error (296-319)
  • _stop_all_collectors (236-253)
src/aiperf/common/mixins/base_metrics_collector_mixin.py (2)
  • is_url_reachable (108-127)
  • endpoint_url (75-77)
src/aiperf/common/protocols.py (4)
  • initialize (119-119)
  • start (120-120)
  • push (157-157)
  • stop (122-122)
tests/unit/server_metrics/conftest.py (2)
src/aiperf/common/enums/prometheus_enums.py (1)
  • PrometheusMetricType (10-47)
src/aiperf/common/models/server_metrics_models.py (5)
  • HistogramData (10-19)
  • MetricFamily (125-132)
  • MetricSample (75-122)
  • ServerMetricsRecord (214-291)
  • SummaryData (22-31)
src/aiperf/post_processors/__init__.py (1)
src/aiperf/post_processors/server_metrics_export_results_processor.py (1)
  • ServerMetricsExportResultsProcessor (27-166)
src/aiperf/common/models/__init__.py (1)
src/aiperf/common/models/server_metrics_models.py (11)
  • HistogramData (10-19)
  • InfoMetricData (146-157)
  • MetricFamily (125-132)
  • MetricSample (75-122)
  • MetricSchema (135-143)
  • ServerMetricsMetadata (182-198)
  • ServerMetricsMetadataFile (201-211)
  • ServerMetricsRecord (214-291)
  • ServerMetricsSlimRecord (160-179)
  • SlimMetricSample (34-72)
  • SummaryData (22-31)
src/aiperf/common/config/user_config.py (4)
src/aiperf/common/config/config_validators.py (1)
  • parse_str_or_list (18-44)
src/aiperf/common/config/cli_parameter.py (1)
  • CLIParameter (7-16)
src/aiperf/common/config/groups.py (1)
  • Groups (6-31)
src/aiperf/common/metric_utils.py (1)
  • normalize_metrics_endpoint_url (6-38)
src/aiperf/server_metrics/__init__.py (2)
src/aiperf/server_metrics/server_metrics_data_collector.py (1)
  • ServerMetricsDataCollector (26-270)
src/aiperf/server_metrics/server_metrics_manager.py (1)
  • ServerMetricsManager (45-352)
src/aiperf/common/enums/__init__.py (1)
src/aiperf/common/enums/prometheus_enums.py (1)
  • PrometheusMetricType (10-47)
src/aiperf/server_metrics/server_metrics_data_collector.py (6)
tests/unit/utils/time_traveler.py (3)
  • time (48-49)
  • perf_counter_ns (54-55)
  • time_ns (45-46)
src/aiperf/common/enums/prometheus_enums.py (1)
  • PrometheusMetricType (10-47)
src/aiperf/common/mixins/base_metrics_collector_mixin.py (6)
  • BaseMetricsCollectorMixin (31-235)
  • endpoint_url (75-77)
  • collection_interval (80-82)
  • _collect_and_process_metrics (180-188)
  • _fetch_metrics_text (190-223)
  • _send_records_via_callback (225-235)
src/aiperf/common/models/error_models.py (1)
  • ErrorDetails (12-76)
src/aiperf/common/models/server_metrics_models.py (5)
  • HistogramData (10-19)
  • MetricFamily (125-132)
  • MetricSample (75-122)
  • ServerMetricsRecord (214-291)
  • SummaryData (22-31)
src/aiperf/common/protocols.py (2)
  • warning (72-72)
  • debug (69-69)
tests/unit/post_processors/test_server_metrics_export_results_processor.py (1)
src/aiperf/common/config/output_config.py (3)
  • OutputConfig (17-174)
  • server_metrics_export_jsonl_file (169-170)
  • server_metrics_metadata_json_file (173-174)
src/aiperf/common/models/server_metrics_models.py (2)
src/aiperf/common/enums/prometheus_enums.py (1)
  • PrometheusMetricType (10-47)
src/aiperf/common/models/base_models.py (1)
  • AIPerfBaseModel (8-19)
src/aiperf/common/environment.py (1)
src/aiperf/common/config/config_validators.py (1)
  • parse_str_or_csv_list (47-72)
src/aiperf/gpu_telemetry/telemetry_data_collector.py (3)
src/aiperf/common/mixins/base_metrics_collector_mixin.py (6)
  • BaseMetricsCollectorMixin (31-235)
  • collection_interval (80-82)
  • endpoint_url (75-77)
  • _collect_and_process_metrics (180-188)
  • _fetch_metrics_text (190-223)
  • _send_records_via_callback (225-235)
src/aiperf/common/models/telemetry_models.py (2)
  • TelemetryMetrics (13-45)
  • TelemetryRecord (48-86)
src/aiperf/server_metrics/server_metrics_data_collector.py (2)
  • _collect_and_process_metrics (67-84)
  • _parse_metrics_to_records (86-156)
src/aiperf/common/mixins/__init__.py (2)
src/aiperf/common/mixins/base_metrics_collector_mixin.py (1)
  • BaseMetricsCollectorMixin (31-235)
src/aiperf/common/mixins/buffered_jsonl_writer_mixin.py (2)
  • BufferedJSONLWriterMixin (21-190)
  • BufferedJSONLWriterMixinWithDeduplication (193-228)
src/aiperf/common/protocols.py (2)
src/aiperf/common/models/server_metrics_models.py (1)
  • ServerMetricsRecord (214-291)
src/aiperf/post_processors/server_metrics_export_results_processor.py (2)
  • process_server_metrics_record (71-107)
  • summarize (160-166)
src/aiperf/common/mixins/base_metrics_collector_mixin.py (7)
src/aiperf/common/hooks.py (3)
  • background_task (165-203)
  • on_init (231-248)
  • on_stop (271-288)
src/aiperf/common/mixins/aiperf_lifecycle_mixin.py (1)
  • AIPerfLifecycleMixin (37-344)
src/aiperf/common/models/error_models.py (2)
  • ErrorDetails (12-76)
  • from_exception (66-76)
tests/unit/gpu_telemetry/test_telemetry_integration.py (2)
  • record_callback (96-98)
  • error_callback (100-102)
src/aiperf/gpu_telemetry/telemetry_data_collector.py (1)
  • _collect_and_process_metrics (62-77)
src/aiperf/server_metrics/server_metrics_data_collector.py (1)
  • _collect_and_process_metrics (67-84)
src/aiperf/common/protocols.py (1)
  • error (74-74)
src/aiperf/common/duplicate_tracker.py (1)
src/aiperf/common/protocols.py (1)
  • trace (68-68)
src/aiperf/records/records_manager.py (8)
src/aiperf/common/messages/telemetry_messages.py (2)
  • TelemetryRecordsMessage (17-42)
  • valid (39-42)
src/aiperf/common/messages/server_metrics_messages.py (2)
  • ServerMetricsRecordsMessage (11-40)
  • valid (33-35)
src/aiperf/common/models/error_models.py (3)
  • ErrorDetails (12-76)
  • ErrorDetailsCount (103-110)
  • from_exception (66-76)
src/aiperf/common/models/server_metrics_models.py (1)
  • ServerMetricsRecord (214-291)
src/aiperf/common/protocols.py (5)
  • ResultsProcessorProtocol (537-543)
  • ServerMetricsResultsProcessorProtocol (547-574)
  • error (74-74)
  • process_server_metrics_record (558-566)
  • exception (75-75)
src/aiperf/common/hooks.py (1)
  • on_pull_message (367-387)
src/aiperf/common/enums/message_enums.py (1)
  • MessageType (7-53)
src/aiperf/post_processors/server_metrics_export_results_processor.py (1)
  • process_server_metrics_record (71-107)
src/aiperf/records/__init__.py (1)
src/aiperf/records/records_manager.py (3)
  • ErrorTrackingState (70-80)
  • RecordsManager (107-800)
  • TelemetryTrackingState (84-97)
src/aiperf/common/config/output_config.py (1)
src/aiperf/common/config/config_defaults.py (1)
  • OutputDefaults (135-155)
🪛 LanguageTool
docs/tutorials/server-metrics.md

[style] ~151-~151: ‘new record’ might be wordy. Consider a shorter alternative.
Context: ...ate written (marks end of period), then new record written (marks start of new period) **...

(EN_WORDINESS_PREMIUM_NEW_RECORD)


[style] ~273-~273: This phrase is redundant. Consider writing “point” or “time”.
Context: ...ng ALL metrics from one endpoint at one point in time. Example from Dynamo frontend: ``...

(MOMENT_IN_TIME)

🪛 markdownlint-cli2 (0.18.1)
docs/cli_options.md

415-415: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4

(MD001, heading-increment)


417-417: Bare URL used

(MD034, no-bare-urls)


417-417: Bare URL used

(MD034, no-bare-urls)

docs/tutorials/server-metrics.md

219-219: Blank line inside blockquote

(MD028, no-blanks-blockquote)

🪛 Ruff (0.14.5)
src/aiperf/common/metric_utils.py

33-33: Avoid specifying long messages outside the exception class

(TRY003)

tests/unit/common/test_duplicate_tracker.py

122-122: This suppression comment is invalid because it cannot be on its own line

Remove this comment

(RUF028)

tests/unit/common/test_metric_utils.py

16-16: This suppression comment is invalid because it cannot be in an expression, pattern, argument list, or other non-statement

Remove this comment

(RUF028)

src/aiperf/server_metrics/server_metrics_manager.py

104-104: Unused method argument: message

(ARG002)


148-148: Do not catch blind exception: Exception

(BLE001)


171-171: Unused method argument: message

(ARG002)


191-191: Do not catch blind exception: Exception

(BLE001)


207-207: Unused method argument: message

(ARG002)


252-252: Do not catch blind exception: Exception

(BLE001)


280-280: Do not catch blind exception: Exception

(BLE001)


291-291: Do not catch blind exception: Exception

(BLE001)


318-318: Do not catch blind exception: Exception

(BLE001)


351-351: Do not catch blind exception: Exception

(BLE001)

src/aiperf/common/config/user_config.py

308-308: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

tests/unit/post_processors/test_server_metrics_export_results_processor.py

384-384: Unused method argument: monkeypatch

(ARG002)

src/aiperf/common/mixins/base_metrics_collector_mixin.py

167-167: Do not catch blind exception: Exception

(BLE001)


174-174: Do not catch blind exception: Exception

(BLE001)


210-210: Avoid specifying long messages outside the exception class

(TRY003)


234-234: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: build (ubuntu-latest, 3.13)
  • GitHub Check: build (macos-latest, 3.12)
  • GitHub Check: build (ubuntu-latest, 3.12)
  • GitHub Check: build (ubuntu-latest, 3.10)
  • GitHub Check: build (macos-latest, 3.13)
  • GitHub Check: build (macos-latest, 3.10)
  • GitHub Check: integration-tests (macos-latest, 3.12)
  • GitHub Check: build (macos-latest, 3.11)
  • GitHub Check: integration-tests (macos-latest, 3.13)
  • GitHub Check: integration-tests (macos-latest, 3.11)
  • GitHub Check: integration-tests (ubuntu-latest, 3.10)
  • GitHub Check: integration-tests (ubuntu-latest, 3.11)
  • GitHub Check: integration-tests (macos-latest, 3.10)
  • GitHub Check: build (ubuntu-latest, 3.11)
  • GitHub Check: integration-tests (ubuntu-latest, 3.12)
  • GitHub Check: integration-tests (ubuntu-latest, 3.13)

@ilana-n ilana-n (Contributor) left a comment

awesome!!! I tried it out on my end and it works great :D
