feat: Collect server-side prometheus metrics #488
base: main
Conversation
Try out this PR

Quick install:

```bash
pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@e89cb2c8d04c03a2de44f67ac1ffd369b6058723
```

Recommended with virtual environment (using uv):

```bash
uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@e89cb2c8d04c03a2de44f67ac1ffd369b6058723
```

Last updated for commit:
Walkthrough

This PR introduces comprehensive server-metrics collection to AIPerf. It adds Prometheus-compatible endpoint discovery, automatic metrics scraping with configurable intervals, async HTTP collection infrastructure, deduplication and buffering mechanisms, new message types for inter-service communication, a dedicated ServerMetricsManager service, export processors, and extensive documentation and tests.
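At its core, the collection flow described above is a periodic async scrape of a Prometheus exposition endpoint. A minimal sketch of that pattern, with simplified stand-in names rather than the PR's actual classes:

```python
import asyncio

import aiohttp


async def scrape_loop(endpoint_url: str, interval_s: float) -> None:
    # Hypothetical sketch of the scrape cycle; the real implementation lives
    # in BaseMetricsCollectorMixin / ServerMetricsDataCollector.
    async with aiohttp.ClientSession() as session:
        while True:
            async with session.get(endpoint_url) as resp:
                text = await resp.text()  # Prometheus exposition format
            # ...parse into records, deduplicate, forward via callback...
            await asyncio.sleep(interval_s)
```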
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/aiperf/gpu_telemetry/telemetry_data_collector.py (1)
23-41: Docstring mostly updated, but "logged and re-raised" no longer matches actual behavior

The class docstring correctly mentions `BaseMetricsCollectorMixin`, but the `_collect_and_process_metrics` docstring (Lines 72-73 in this file) still says "Any exception from fetch or parse is logged and re-raised." With the mixin, exceptions are now caught in `_collect_metrics_task` and either:

- delivered to `error_callback`, or
- logged via `self.error`, but not re-raised (except `CancelledError`).

Consider updating that "Raises:" section to describe the new behavior (e.g., "Exceptions propagate to the mixin's background task, which logs or forwards them via error_callback; callers of this method should not rely on it raising.").
🧹 Nitpick comments (20)
docs/cli_options.md (1)
417-417: Format URLs as proper markdown links.

The bare URLs should be formatted as markdown links for better readability and to follow markdown best practices.
Consider applying this diff:
```diff
-Server metrics collection (ENABLED BY DEFAULT). Automatically collects from inference endpoint base_url + `/metrics`. Optionally specify additional custom Prometheus-compatible endpoint URLs (e.g., http://node1:8081/metrics, http://node2:9090/metrics). Use AIPERF_SERVER_METRICS_ENABLED=false to disable. Example: `--server-metrics node1:8081 node2:9090/metrics` for additional endpoints.
+Server metrics collection (ENABLED BY DEFAULT). Automatically collects from inference endpoint base_url + `/metrics`. Optionally specify additional custom Prometheus-compatible endpoint URLs (e.g., `http://node1:8081/metrics`, `http://node2:9090/metrics`). Use `AIPERF_SERVER_METRICS_ENABLED=false` to disable. Example: `--server-metrics node1:8081 node2:9090/metrics` for additional endpoints.
```

tests/unit/common/test_duplicate_tracker.py (1)
260-279: Make `test_concurrent_writes_to_same_key` actually concurrent

Despite the name/docstring, this test currently performs all writes sequentially via `write_sequence`, so it doesn't exercise per-key locking under real concurrency.

Consider rewriting to launch multiple writers with `asyncio.gather`:

```diff
 @pytest.mark.asyncio
 @pytest.mark.parametrize("num_tasks,writes_per_task", [(3, 10), (5, 5), (10, 3)])  # fmt: skip
 async def test_concurrent_writes_to_same_key(
     self,
     tracker: AsyncKeyedDuplicateTracker[SampleRecord],
     num_tasks: int,
     writes_per_task: int,
 ):
     """Test that concurrent writes to the same key are handled safely."""
-    results = await write_sequence(
-        tracker, "key1", [1] * (num_tasks * writes_per_task)
-    )
-
-    # At least first record should be written
-    total_written = sum(len(r) for r in results)
+    async def write_to_key(count: int) -> int:
+        total = 0
+        for _ in range(count):
+            record = SampleRecord(key="key1", value=1)
+            result = await tracker.deduplicate_record(record)
+            total += len(result)
+        return total
+
+    # Run multiple writers concurrently to the same key
+    results = await asyncio.gather(
+        *[write_to_key(writes_per_task) for _ in range(num_tasks)]
+    )
+
+    # At least first record should be written
+    total_written = sum(results)
```

The remaining assertions (`total_written >= 1` and `< num_tasks * writes_per_task`) can stay as-is.

tests/unit/server_metrics/test_server_metrics_models.py (1)
17-191: Solid coverage for server metrics model serialization

The tests here exercise the critical paths for `MetricSample.to_slim`, `ServerMetricsRecord.to_slim`, `MetricSchema`, and the histogram/summary data shapes, including endpoint URL preservation and optional `sum`/`count` handling. This gives good confidence in the JSONL-facing representations; any further assertions (e.g., extra `is None` checks for unused fields) would be purely optional.

tests/unit/server_metrics/test_server_metrics_data_collector.py (2)
20-43: Prefer public properties over private attributes in tests.

The tests directly access private attributes (`_endpoint_url`, `_collection_interval`, `_reachability_timeout`, `_session`). Based on the mixin interface (src/aiperf/common/mixins/base_metrics_collector_mixin.py lines 74-81), public properties are available. Using public properties makes tests more resilient to internal implementation changes.

Apply this diff to use public properties:

```diff
-    assert collector._endpoint_url == "http://localhost:8081/metrics"
-    assert collector._collection_interval == 0.5
-    assert collector._reachability_timeout == 10.0
+    assert collector.endpoint_url == "http://localhost:8081/metrics"
+    assert collector.collection_interval == 0.5
+    # Note: reachability_timeout doesn't have a public property in the mixin
+    assert collector._reachability_timeout == 10.0
     assert collector.id == "test_collector"
     assert collector._session is None
     assert not collector.was_initialized
```

```diff
-    assert collector._endpoint_url == "http://localhost:8081/metrics"
-    assert collector._collection_interval == 0.33
+    assert collector.endpoint_url == "http://localhost:8081/metrics"
+    assert collector.collection_interval == 0.33
     assert collector.id == "server_metrics_collector"
```
245-264: Consider testing deduplication for other metric types.

The deduplication test only covers counters with last-wins behavior. While this validates the core logic, consider adding tests for gauge and histogram/summary deduplication to ensure consistent behavior across all metric types, especially since histograms and summaries have more complex structures (buckets, quantiles).
src/aiperf/controller/system_controller.py (1)
42-48: Server-metrics status integration is consistent with telemetry flow

Starting `ServerMetricsManager` as an optional service and handling `SERVER_METRICS_STATUS` with mirrored endpoint tracking/logging keeps behavior parallel to telemetry and looks correct. If you plan to surface server-metrics endpoint reachability in exports/summary later, you already have the necessary state here to wire in.

Also applies to: 132-138, 186-191, 387-405
docs/tutorials/server-metrics.md (1)
146-155: Minor doc polish and markdownlint fix

Content and examples line up well with the implementation. A couple of small nits you might want to address:
- Around Line 153: “then new record written (marks start of new period)” – you could shorten to “then the new record (marks start of new period)” to satisfy the style checker.
- Around Line 273: “at one point in time” can be simplified to “at one time” or just “at one point”.
- Lines 220–226: there's a blank `>` line inside the blockquote, which triggers markdownlint MD028; removing the empty quoted line will keep the callout but satisfy the linter.

Also applies to: 219-226, 271-275
src/aiperf/post_processors/server_metrics_export_results_processor.py (1)
44-67: Consider avoiding unnecessary metadata extraction and clarifying update semantics

The per-endpoint metadata tracking and atomic file writes look solid, and the dedupe key/value choice (endpoint URL + metrics dict) matches the documented behavior.
Two minor suggestions:
- `record.extract_metadata()` is computed for every record, even when `_should_update_metadata` returns `False`. Since `_should_update_metadata` only inspects `record.metrics`, you can defer `extract_metadata()` until `should_write` is `True` to avoid extra Pydantic object construction:

```diff
-        url = record.endpoint_url
-        metadata = record.extract_metadata()
+        url = record.endpoint_url
@@
-        if should_write:
+        if should_write:
+            metadata = record.extract_metadata()
             async with self._metadata_file_lock:
```
- The comment above `_metadata_file_model` notes that "metadata fields can occasionally change," but `_should_update_metadata` only reacts to new metric names, not changes to existing schemas or info metrics. That's probably fine in practice; consider tightening the comment or, if you do care about changing descriptions/types, extending `_should_update_metadata` to compare existing vs new `MetricSchema`/`InfoMetricData` entries.

Also applies to: 71-104, 109-134
src/aiperf/server_metrics/server_metrics_manager.py (2)
102-161: Configure phase and reachability handling are robust, with minor lint nits

The PROFILE_CONFIGURE flow (building collectors, checking reachability, emitting a status, and bailing early when none are reachable) is clean and matches the documented behavior. Two small nits:

- `message` (Line 104) is unused but required by `@on_command`. If Ruff's ARG002 is noisy, consider renaming to `_message` or adding a `# noqa: ARG002` on the line.
- The broad `except Exception as e` around reachability (Lines 148-150) is defensively reasonable here, but if you want to satisfy BLE001 you could either narrow to `aiohttp.ClientError`/`OSError` or document the intentional broad catch with a `# noqa: BLE001` comment.

Functionally this segment is fine as-is.
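For example, either of the suggested fixes for the unused argument might look like this (a sketch; the handler signature is assumed from the PR):

```python
# Option 1: rename the required-but-unused parameter.
async def _profile_configure_command(self, _message) -> None:
    ...


# Option 2: keep the name, suppress the rule on that line.
async def _profile_configure_command(self, message) -> None:  # noqa: ARG002
    ...
```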
170-205: Record + status publishing behavior is correct; broad exception catches are intentional but noisy

The callbacks and status sender behave as expected:

- `_on_server_metrics_records` only emits messages when `records` is non-empty and on failure sends a second message carrying `ErrorDetails`, which aligns with `ServerMetricsRecordsMessage.valid`/`has_data`.
- `_on_server_metrics_error` correctly wraps collector errors into a dedicated records message with `records=[]`.
- `_send_server_metrics_status` centralizes status reporting for both disabled and enabled cases.

The repeated `except Exception as e` blocks (Lines 191-193, 252-253, 280-295, 318-320, 351-352) are reasonable for shutdown/telemetry paths, but they're exactly what Ruff's BLE001 is flagging. If you want to quiet that without changing behavior, you can:

- Add `# noqa: BLE001` on those `except` lines, or
- Narrow a couple of them where you know the likely failure class (e.g., messaging client errors).
No functional issues here; this is purely about linter noise vs. defensive coding.
Also applies to: 236-255, 270-295, 296-320, 321-352
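As an illustration of the narrowing option, a shutdown-path catch might become (a sketch; `aiohttp.ClientError`/`OSError` are assumed to be the likely failure classes, per the comment above):

```python
import logging

import aiohttp

logger = logging.getLogger(__name__)


async def stop_collector_quietly(collector) -> None:
    # Narrowed shutdown-path catch instead of a blanket `except Exception`
    # (Ruff BLE001); the chosen classes are an assumption, not the PR's code.
    try:
        await collector.stop()
    except (aiohttp.ClientError, OSError) as e:
        logger.warning("failed to stop collector cleanly: %s", e)
```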
src/aiperf/common/models/server_metrics_models.py (1)
75-123: Slim conversion and metadata extraction match the intended server-metrics format
`MetricSample.to_slim`, `ServerMetricsRecord.to_slim`, and `extract_metadata` correctly:

- Represent histograms/summaries via nested dicts plus `sum`/`count`.
- Exclude `_info` metrics from slim records while capturing them in `info_metrics` with full label data.
- Build `metric_schemas` with `PrometheusMetricType` and descriptions as used by the export processor tests.

The implementation is straightforward and lines up with the JSONL + metadata expectations in the post-processor tests. Only tiny nit: the docstring in `ServerMetricsRecord.to_slim` mentions "array-based format," but the actual structure is "list of slim samples with dict-based histogram/summary fields," which you may want to clarify later.

Also applies to: 160-257, 259-291
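To make the "nested dicts plus `sum`/`count`" shape concrete, a slim histogram sample might serialize along these lines (an illustrative guess only; the authoritative field names come from `SlimMetricSample`, not this sketch):

```python
# Hypothetical JSONL payload for one slim histogram sample; exact keys
# are defined by SlimMetricSample/ServerMetricsSlimRecord, not here.
slim_sample = {
    "name": "request_latency_seconds",
    "histogram": {"0.1": 5.0, "1.0": 9.0, "+Inf": 10.0},  # bucket bound -> cumulative count
    "sum": 3.2,   # total of observed values
    "count": 10,  # total observations
}
```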
tests/unit/post_processors/test_server_metrics_export_results_processor.py (1)
380-405: Remove unused `monkeypatch` fixture parameter

In `test_unique_label_values_respects_cardinality_limit` the `monkeypatch` argument is never used, which triggers Ruff's ARG002. Since the test doesn't rely on it, you can safely drop the parameter.

```diff
-    async def test_unique_label_values_respects_cardinality_limit(
-        self,
-        user_config_server_metrics_export: UserConfig,
-        service_config: ServiceConfig,
-        monkeypatch,
-    ):
+    async def test_unique_label_values_respects_cardinality_limit(
+        self,
+        user_config_server_metrics_export: UserConfig,
+        service_config: ServiceConfig,
+    ):
```

src/aiperf/server_metrics/server_metrics_data_collector.py (1)
26-66: Collector initialization correctly hooks into the base mixin; consider `is None` checks for intervals

The constructor wires `endpoint_url`, `collection_interval`, `reachability_timeout`, and callbacks into `BaseMetricsCollectorMixin` as expected, using environment defaults when explicit values aren't provided. Functionally this is fine.

If you ever need to support a zero-second interval or timeout explicitly, you might prefer `if collection_interval is None`/`if reachability_timeout is None` instead of `or`, to avoid treating `0.0` as "use default." For current usage (only `None` vs. positive defaults) this is just an ergonomics tweak.

```diff
-        super().__init__(
-            endpoint_url=endpoint_url,
-            collection_interval=collection_interval
-            or Environment.SERVER_METRICS.COLLECTION_INTERVAL,
-            reachability_timeout=reachability_timeout
-            or Environment.SERVER_METRICS.REACHABILITY_TIMEOUT,
+        super().__init__(
+            endpoint_url=endpoint_url,
+            collection_interval=(
+                collection_interval
+                if collection_interval is not None
+                else Environment.SERVER_METRICS.COLLECTION_INTERVAL
+            ),
+            reachability_timeout=(
+                reachability_timeout
+                if reachability_timeout is not None
+                else Environment.SERVER_METRICS.REACHABILITY_TIMEOUT
+            ),
```
69-103: Shared ErrorTrackingState is a reasonable abstraction; tiny typing nit onlyIntroducing
ErrorTrackingStateand reusing it viaTelemetryTrackingState,ServerMetricsTrackingState, andMetricTrackingStatenicely unifies error counting across subsystems.Minor nit: the field is annotated as
dict[ErrorDetails, int]but initialized withdefaultdict(int). That’s perfectly fine at runtime but might show up in strict type-checking; if it does, you can either:
- Change the annotation to
collections.abc.MutableMapping[ErrorDetails, int], or- Wrap the
defaultdictindict(...)when returning summaries (as you already iterate via.items()).Functionally, this segment is good.
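A sketch of the first option (types simplified for self-containment; the real field is keyed on `ErrorDetails`, not `str`):

```python
from collections import defaultdict
from collections.abc import MutableMapping
from dataclasses import dataclass, field


@dataclass
class ErrorTrackingStateSketch:
    # Hypothetical stand-in for the real state class: annotating with
    # MutableMapping keeps strict checkers happy while the runtime
    # container stays a defaultdict(int).
    error_counts: MutableMapping[str, int] = field(
        default_factory=lambda: defaultdict(int)
    )
```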
271-290: Server-metrics record handling and processor error tracking look correct; consider exposing a summary later

The new flow for server metrics:

- `_on_server_metrics_records` consumes `ServerMetricsRecordsMessage`, forwarding each `ServerMetricsRecord` to `_send_server_metrics_to_results_processors` when `message.valid` and counting collection errors otherwise.
- `_send_server_metrics_to_results_processors` fans out to all registered server-metrics processors with `asyncio.gather(..., return_exceptions=True)`, logs any processor exceptions, and increments `_server_metrics_state.error_counts` using `ErrorDetails.from_exception`.

This mirrors the telemetry path and gives good resilience against individual processor failures.

Right now, the accumulated server-metrics error counts are never surfaced (unlike telemetry via `get_telemetry_error_summary`). If you want symmetry and better observability for server-metrics failures, a future follow-up could add a `get_server_metrics_error_summary()` method or fold these into an appropriate results/metadata export. Not a blocker for this PR.

Also applies to: 398-420
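The fan-out-and-count pattern described above, reduced to a sketch (hypothetical names; the actual handler is `_send_server_metrics_to_results_processors`):

```python
import asyncio
from collections import Counter


# Sketch of the resilient fan-out described above: every processor sees the
# record, and individual failures are counted rather than aborting the batch.
async def fan_out_record(record, processors, error_counts: Counter) -> None:
    results = await asyncio.gather(
        *(p.process_server_metrics_record(record) for p in processors),
        return_exceptions=True,
    )
    for result in results:
        if isinstance(result, Exception):
            error_counts[type(result).__name__] += 1
```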
src/aiperf/common/mixins/base_metrics_collector_mixin.py (5)
21-28: Consider using concrete type aliases instead of TypeVar-bound callbacks

`TRecordCallback`/`TErrorCallback` are `TypeVar`s that aren't part of the class' `Generic[...]` parameters, which most type checkers treat as effectively non-generic. A simple alias like `RecordCallback = Callable[[list[TRecord], str], Awaitable[None]]` and `ErrorCallback = Callable[[ErrorDetails, str], Awaitable[None]]` would be clearer and avoid confusing template parameters.
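Concretely, the suggested aliases might read as follows (a sketch, with a stub `ErrorDetails` standing in for the real model):

```python
from collections.abc import Awaitable, Callable
from typing import TypeVar

TRecord = TypeVar("TRecord")


class ErrorDetails:  # stub standing in for aiperf's real error model
    ...


# Plain aliases: clearer than TypeVars that aren't bound into Generic[...].
RecordCallback = Callable[[list[TRecord], str], Awaitable[None]]
ErrorCallback = Callable[[ErrorDetails, str], Awaitable[None]]
```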
84-96: Session timeout choice: connect-only timeout may allow indefinitely hanging scrapes

Using `ClientTimeout(total=None, connect=self._reachability_timeout)` intentionally leaves no overall timeout on `GET` calls. If a metrics endpoint stalls after connection (e.g., slow or wedged server), `_fetch_metrics_text` can hang until cancellation. If that's not desired, consider also setting a `sock_read` or `total` timeout (possibly configurable) to bound worst-case collection latency.
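For example, a bounded variant might look like this (a sketch; the timeout values are illustrative, not the PR's configuration):

```python
import aiohttp

# Illustrative values only: bound per-read stalls as well as connection
# setup, so a wedged endpoint can't hang a scrape forever.
timeout = aiohttp.ClientTimeout(
    total=None,      # still no overall cap on the request
    connect=5.0,     # connection-establishment bound
    sock_read=30.0,  # max seconds between reads of the response body
)


async def make_session() -> aiohttp.ClientSession:
    return aiohttp.ClientSession(timeout=timeout)
```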
108-151: Reachability semantics: only HTTP 200 treated as “reachable”
`is_url_reachable` currently returns `True` only for status `200` from either `HEAD` or `GET`. That will treat common cases like `3xx` redirects, `401`/`403` auth failures, or `204` responses as unreachable, even though the endpoint exists. If the goal is "endpoint is up" rather than "scrape is currently succeeding", consider broadening to `2xx` or even `< 500` depending on how you intend to gate startup/health decisions.
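A broadened check could look like this sketch of the suggestion (not the current code):

```python
import aiohttp


async def is_endpoint_up(session: aiohttp.ClientSession, url: str) -> bool:
    # Treat any 2xx as "endpoint is up" rather than requiring exactly 200;
    # widen further (e.g., < 500) if auth-gated endpoints should also count.
    try:
        async with session.get(url) as resp:
            return 200 <= resp.status < 300
    except aiohttp.ClientError:
        return False
```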
153-178: Background task error handling: broad `Exception` catch is intentional but could log more detail

Catching `Exception` around `_collect_and_process_metrics()` and the `error_callback` ensures that one bad scrape or callback won't kill the background task, which is appropriate here. To improve debuggability, you might:

- Log with `exc_info=True` in the `else` branch (`self.error(f"...", exc_info=True)`), and
- Optionally include `exc_info=True` when logging `callback_error` as well.

That keeps robustness while giving full tracebacks when things go wrong.
225-235: Callback failure handling is robust; consider whether to surface structured error info

Silently catching all `Exception` from `record_callback` and just logging is appropriate to keep the collector running. If downstreams ever need to react to callback failures (e.g., to emit an `ErrorDetails` via `error_callback` or metrics), this would be the place to add that wiring. As-is, the implementation is safe and adequate.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (44)
- `README.md` (1 hunks)
- `docs/cli_options.md` (1 hunks)
- `docs/tutorials/server-metrics.md` (1 hunks)
- `src/aiperf/common/config/config_defaults.py` (1 hunks)
- `src/aiperf/common/config/groups.py` (1 hunks)
- `src/aiperf/common/config/output_config.py` (4 hunks)
- `src/aiperf/common/config/user_config.py` (2 hunks)
- `src/aiperf/common/duplicate_tracker.py` (1 hunks)
- `src/aiperf/common/enums/__init__.py` (2 hunks)
- `src/aiperf/common/enums/message_enums.py` (1 hunks)
- `src/aiperf/common/enums/post_processor_enums.py` (1 hunks)
- `src/aiperf/common/enums/prometheus_enums.py` (1 hunks)
- `src/aiperf/common/enums/service_enums.py` (1 hunks)
- `src/aiperf/common/environment.py` (4 hunks)
- `src/aiperf/common/messages/__init__.py` (2 hunks)
- `src/aiperf/common/messages/server_metrics_messages.py` (1 hunks)
- `src/aiperf/common/metric_utils.py` (1 hunks)
- `src/aiperf/common/mixins/__init__.py` (3 hunks)
- `src/aiperf/common/mixins/base_metrics_collector_mixin.py` (1 hunks)
- `src/aiperf/common/mixins/buffered_jsonl_writer_mixin.py` (4 hunks)
- `src/aiperf/common/models/__init__.py` (4 hunks)
- `src/aiperf/common/models/server_metrics_models.py` (1 hunks)
- `src/aiperf/common/protocols.py` (2 hunks)
- `src/aiperf/controller/system_controller.py` (4 hunks)
- `src/aiperf/gpu_telemetry/telemetry_data_collector.py` (4 hunks)
- `src/aiperf/post_processors/__init__.py` (2 hunks)
- `src/aiperf/post_processors/server_metrics_export_results_processor.py` (1 hunks)
- `src/aiperf/post_processors/telemetry_export_results_processor.py` (3 hunks)
- `src/aiperf/records/__init__.py` (1 hunks)
- `src/aiperf/records/records_manager.py` (6 hunks)
- `src/aiperf/server_metrics/__init__.py` (1 hunks)
- `src/aiperf/server_metrics/server_metrics_data_collector.py` (1 hunks)
- `src/aiperf/server_metrics/server_metrics_manager.py` (1 hunks)
- `tests/unit/common/test_duplicate_tracker.py` (1 hunks)
- `tests/unit/common/test_metric_utils.py` (1 hunks)
- `tests/unit/gpu_telemetry/test_telemetry_data_collector.py` (8 hunks)
- `tests/unit/post_processors/test_server_metrics_export_results_processor.py` (1 hunks)
- `tests/unit/post_processors/test_telemetry_export_results_processor.py` (6 hunks)
- `tests/unit/server_metrics/__init__.py` (1 hunks)
- `tests/unit/server_metrics/conftest.py` (1 hunks)
- `tests/unit/server_metrics/test_malformed_metrics_validation.py` (1 hunks)
- `tests/unit/server_metrics/test_server_metrics_data_collector.py` (1 hunks)
- `tests/unit/server_metrics/test_server_metrics_manager.py` (1 hunks)
- `tests/unit/server_metrics/test_server_metrics_models.py` (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-15T03:24:10.758Z
Learnt from: ajcasagrande
Repo: ai-dynamo/aiperf PR: 359
File: aiperf/metrics/types/time_to_first_output_metric.py:0-0
Timestamp: 2025-10-15T03:24:10.758Z
Learning: In TimeToFirstOutputMetric and similar metrics, invalid timestamp scenarios (where response timestamps precede request start) are automatically caught by the base class validation through the record.valid property, which checks that start_perf_ns < end_perf_ns. This validation happens in _require_valid_record before _parse_record is called, so explicit timestamp validation in _parse_record may be redundant.
Applied to files:
tests/unit/server_metrics/test_malformed_metrics_validation.py
📚 Learning: 2025-10-17T16:52:54.523Z
Learnt from: ilana-n
Repo: ai-dynamo/aiperf PR: 372
File: aiperf/gpu_telemetry/telemetry_manager.py:204-206
Timestamp: 2025-10-17T16:52:54.523Z
Learning: In `aiperf/gpu_telemetry/telemetry_data_collector.py`, the `TelemetryDataCollector.initialize()` and `start()` methods perform only synchronous, non-blocking operations. `initialize()` creates an `aiohttp.ClientSession` object (instant construction) and `start()` schedules background tasks without awaiting them (instant scheduling via asyncio). There is no risk of indefinite hangs, so timeouts are not needed when calling these methods.
Applied to files:
- `tests/unit/gpu_telemetry/test_telemetry_data_collector.py`
- `src/aiperf/gpu_telemetry/telemetry_data_collector.py`
- `src/aiperf/common/mixins/base_metrics_collector_mixin.py`
🧬 Code graph analysis (33)

tests/unit/common/test_duplicate_tracker.py (1)
- src/aiperf/common/duplicate_tracker.py (3): `AsyncKeyedDuplicateTracker` (15-136), `deduplicate_record` (67-111), `flush_remaining_duplicates` (113-136)

src/aiperf/common/messages/server_metrics_messages.py (4)
- src/aiperf/common/enums/message_enums.py (1): `MessageType` (7-53)
- src/aiperf/common/messages/service_messages.py (1): `BaseServiceMessage` (13-20)
- src/aiperf/common/models/error_models.py (1): `ErrorDetails` (12-76)
- src/aiperf/common/models/server_metrics_models.py (1): `ServerMetricsRecord` (214-291)

tests/unit/post_processors/test_telemetry_export_results_processor.py (1)
- src/aiperf/common/models/telemetry_models.py (1): `TelemetryMetrics` (13-45)

tests/unit/common/test_metric_utils.py (1)
- src/aiperf/common/metric_utils.py (2): `build_hostname_aware_prometheus_endpoints` (41-78), `normalize_metrics_endpoint_url` (6-38)

src/aiperf/post_processors/server_metrics_export_results_processor.py (7)
- src/aiperf/common/config/user_config.py (1): `UserConfig` (34-448)
- src/aiperf/common/enums/post_processor_enums.py (1): `ResultsProcessorType` (22-50)
- src/aiperf/common/factories.py (1): `ResultsProcessorFactory` (572-592)
- src/aiperf/common/mixins/buffered_jsonl_writer_mixin.py (3): `BufferedJSONLWriterMixinWithDeduplication` (193-228), `buffered_write` (75-110), `buffered_write` (216-220)
- src/aiperf/common/models/server_metrics_models.py (7): `ServerMetricsMetadata` (182-198), `ServerMetricsMetadataFile` (201-211), `ServerMetricsRecord` (214-291), `ServerMetricsSlimRecord` (160-179), `extract_metadata` (259-291), `to_slim` (92-122), `to_slim` (234-257)
- src/aiperf/common/protocols.py (9): `ServerMetricsResultsProcessorProtocol` (547-574), `info` (70-70), `process_server_metrics_record` (558-566), `metadata` (394-394), `metadata` (620-620), `error` (74-74), `summarize` (543-543), `summarize` (568-574), `summarize` (594-601)
- src/aiperf/post_processors/telemetry_export_results_processor.py (1): `summarize` (68-70)

tests/unit/server_metrics/test_server_metrics_data_collector.py (6)
- src/aiperf/common/enums/prometheus_enums.py (1): `PrometheusMetricType` (10-47)
- src/aiperf/common/models/error_models.py (1): `ErrorDetails` (12-76)
- src/aiperf/common/models/server_metrics_models.py (1): `ServerMetricsRecord` (214-291)
- src/aiperf/server_metrics/server_metrics_data_collector.py (2): `ServerMetricsDataCollector` (26-270), `_parse_metrics_to_records` (86-156)
- src/aiperf/common/mixins/base_metrics_collector_mixin.py (5): `endpoint_url` (75-77), `collection_interval` (80-82), `is_url_reachable` (108-127), `_send_records_via_callback` (225-235), `_collect_metrics_task` (154-177)
- tests/unit/server_metrics/conftest.py (1): `sample_prometheus_metrics` (17-35)

src/aiperf/common/mixins/buffered_jsonl_writer_mixin.py (3)
- src/aiperf/common/duplicate_tracker.py (3): `AsyncKeyedDuplicateTracker` (15-136), `deduplicate_record` (67-111), `flush_remaining_duplicates` (113-136)
- src/aiperf/common/protocols.py (2): `exception` (75-75), `debug` (69-69)
- src/aiperf/common/hooks.py (1): `on_stop` (271-288)

src/aiperf/post_processors/telemetry_export_results_processor.py (2)
- src/aiperf/common/mixins/buffered_jsonl_writer_mixin.py (1): `BufferedJSONLWriterMixinWithDeduplication` (193-228)
- src/aiperf/common/models/telemetry_models.py (1): `TelemetryRecord` (48-86)

tests/unit/server_metrics/test_malformed_metrics_validation.py (1)
- src/aiperf/server_metrics/server_metrics_data_collector.py (2): `ServerMetricsDataCollector` (26-270), `_parse_metrics_to_records` (86-156)

tests/unit/server_metrics/test_server_metrics_models.py (3)
- src/aiperf/common/enums/prometheus_enums.py (1): `PrometheusMetricType` (10-47)
- src/aiperf/common/models/server_metrics_models.py (9): `HistogramData` (10-19), `MetricFamily` (125-132), `MetricSample` (75-122), `MetricSchema` (135-143), `ServerMetricsRecord` (214-291), `SlimMetricSample` (34-72), `SummaryData` (22-31), `to_slim` (92-122), `to_slim` (234-257)
- tests/unit/server_metrics/conftest.py (3): `sample_counter_metric` (39-50), `sample_histogram_metric` (69-84), `sample_server_metrics_record` (107-122)

src/aiperf/server_metrics/server_metrics_manager.py (8)
- src/aiperf/common/base_component_service.py (1): `BaseComponentService` (32-133)
- src/aiperf/common/config/user_config.py (2): `UserConfig` (34-448), `server_metrics_urls` (333-335)
- src/aiperf/common/messages/server_metrics_messages.py (2): `ServerMetricsRecordsMessage` (11-40), `ServerMetricsStatusMessage` (43-62)
- src/aiperf/common/metric_utils.py (2): `build_hostname_aware_prometheus_endpoints` (41-78), `normalize_metrics_endpoint_url` (6-38)
- src/aiperf/common/models/error_models.py (2): `ErrorDetails` (12-76), `from_exception` (66-76)
- src/aiperf/common/protocols.py (10): `PushClientProtocol` (156-157), `ServiceProtocol` (512-524), `debug` (69-69), `error` (74-74), `stop` (122-122), `initialize` (119-119), `start` (120-120), `warning` (72-72), `push` (157-157), `publish` (143-143)
- src/aiperf/server_metrics/server_metrics_data_collector.py (1): `ServerMetricsDataCollector` (26-270)
- src/aiperf/common/mixins/base_metrics_collector_mixin.py (3): `endpoint_url` (75-77), `collection_interval` (80-82), `is_url_reachable` (108-127)

src/aiperf/common/messages/__init__.py (1)
- src/aiperf/common/messages/server_metrics_messages.py (2): `ServerMetricsRecordsMessage` (11-40), `ServerMetricsStatusMessage` (43-62)

src/aiperf/controller/system_controller.py (5)
- src/aiperf/common/messages/server_metrics_messages.py (1): `ServerMetricsStatusMessage` (43-62)
- src/aiperf/common/protocols.py (1): `run_service` (484-486)
- src/aiperf/common/enums/service_enums.py (1): `ServiceType` (32-48)
- src/aiperf/common/hooks.py (1): `on_message` (313-334)
- src/aiperf/common/enums/message_enums.py (1): `MessageType` (7-53)

tests/unit/gpu_telemetry/test_telemetry_data_collector.py (1)
- src/aiperf/common/mixins/base_metrics_collector_mixin.py (4): `endpoint_url` (75-77), `collection_interval` (80-82), `_fetch_metrics_text` (190-223), `_collect_metrics_task` (154-177)

tests/unit/server_metrics/test_server_metrics_manager.py (7)
- src/aiperf/common/config/user_config.py (1): `UserConfig` (34-448)
- src/aiperf/common/messages/server_metrics_messages.py (1): `ServerMetricsRecordsMessage` (11-40)
- src/aiperf/common/models/error_models.py (2): `ErrorDetails` (12-76), `from_exception` (66-76)
- src/aiperf/common/models/server_metrics_models.py (1): `ServerMetricsRecord` (214-291)
- src/aiperf/server_metrics/server_metrics_manager.py (5): `_profile_configure_command` (103-168), `_on_start_profiling` (171-203), `_on_server_metrics_records` (255-294), `_on_server_metrics_error` (296-319), `_stop_all_collectors` (236-253)
- src/aiperf/common/mixins/base_metrics_collector_mixin.py (2): `is_url_reachable` (108-127), `endpoint_url` (75-77)
- src/aiperf/common/protocols.py (4): `initialize` (119-119), `start` (120-120), `push` (157-157), `stop` (122-122)

tests/unit/server_metrics/conftest.py (2)
- src/aiperf/common/enums/prometheus_enums.py (1): `PrometheusMetricType` (10-47)
- src/aiperf/common/models/server_metrics_models.py (5): `HistogramData` (10-19), `MetricFamily` (125-132), `MetricSample` (75-122), `ServerMetricsRecord` (214-291), `SummaryData` (22-31)

src/aiperf/post_processors/__init__.py (1)
- src/aiperf/post_processors/server_metrics_export_results_processor.py (1): `ServerMetricsExportResultsProcessor` (27-166)

src/aiperf/common/models/__init__.py (1)
- src/aiperf/common/models/server_metrics_models.py (11): `HistogramData` (10-19), `InfoMetricData` (146-157), `MetricFamily` (125-132), `MetricSample` (75-122), `MetricSchema` (135-143), `ServerMetricsMetadata` (182-198), `ServerMetricsMetadataFile` (201-211), `ServerMetricsRecord` (214-291), `ServerMetricsSlimRecord` (160-179), `SlimMetricSample` (34-72), `SummaryData` (22-31)

src/aiperf/common/config/user_config.py (4)
- src/aiperf/common/config/config_validators.py (1): `parse_str_or_list` (18-44)
- src/aiperf/common/config/cli_parameter.py (1): `CLIParameter` (7-16)
- src/aiperf/common/config/groups.py (1): `Groups` (6-31)
- src/aiperf/common/metric_utils.py (1): `normalize_metrics_endpoint_url` (6-38)

src/aiperf/server_metrics/__init__.py (2)
- src/aiperf/server_metrics/server_metrics_data_collector.py (1): `ServerMetricsDataCollector` (26-270)
- src/aiperf/server_metrics/server_metrics_manager.py (1): `ServerMetricsManager` (45-352)

src/aiperf/common/enums/__init__.py (1)
- src/aiperf/common/enums/prometheus_enums.py (1): `PrometheusMetricType` (10-47)

src/aiperf/server_metrics/server_metrics_data_collector.py (6)
- tests/unit/utils/time_traveler.py (3): `time` (48-49), `perf_counter_ns` (54-55), `time_ns` (45-46)
- src/aiperf/common/enums/prometheus_enums.py (1): `PrometheusMetricType` (10-47)
- src/aiperf/common/mixins/base_metrics_collector_mixin.py (6): `BaseMetricsCollectorMixin` (31-235), `endpoint_url` (75-77), `collection_interval` (80-82), `_collect_and_process_metrics` (180-188), `_fetch_metrics_text` (190-223), `_send_records_via_callback` (225-235)
- src/aiperf/common/models/error_models.py (1): `ErrorDetails` (12-76)
- src/aiperf/common/models/server_metrics_models.py (5): `HistogramData` (10-19), `MetricFamily` (125-132), `MetricSample` (75-122), `ServerMetricsRecord` (214-291), `SummaryData` (22-31)
- src/aiperf/common/protocols.py (2): `warning` (72-72), `debug` (69-69)

tests/unit/post_processors/test_server_metrics_export_results_processor.py (1)
- src/aiperf/common/config/output_config.py (3): `OutputConfig` (17-174), `server_metrics_export_jsonl_file` (169-170), `server_metrics_metadata_json_file` (173-174)

src/aiperf/common/models/server_metrics_models.py (2)
- src/aiperf/common/enums/prometheus_enums.py (1): `PrometheusMetricType` (10-47)
- src/aiperf/common/models/base_models.py (1): `AIPerfBaseModel` (8-19)

src/aiperf/common/environment.py (1)
- src/aiperf/common/config/config_validators.py (1): `parse_str_or_csv_list` (47-72)

src/aiperf/gpu_telemetry/telemetry_data_collector.py (3)
- src/aiperf/common/mixins/base_metrics_collector_mixin.py (6): `BaseMetricsCollectorMixin` (31-235), `collection_interval` (80-82), `endpoint_url` (75-77), `_collect_and_process_metrics` (180-188), `_fetch_metrics_text` (190-223), `_send_records_via_callback` (225-235)
- src/aiperf/common/models/telemetry_models.py (2): `TelemetryMetrics` (13-45), `TelemetryRecord` (48-86)
- src/aiperf/server_metrics/server_metrics_data_collector.py (2): `_collect_and_process_metrics` (67-84), `_parse_metrics_to_records` (86-156)

src/aiperf/common/mixins/__init__.py (2)
- src/aiperf/common/mixins/base_metrics_collector_mixin.py (1): `BaseMetricsCollectorMixin` (31-235)
- src/aiperf/common/mixins/buffered_jsonl_writer_mixin.py (2): `BufferedJSONLWriterMixin` (21-190), `BufferedJSONLWriterMixinWithDeduplication` (193-228)

src/aiperf/common/protocols.py (2)
- src/aiperf/common/models/server_metrics_models.py (1): `ServerMetricsRecord` (214-291)
- src/aiperf/post_processors/server_metrics_export_results_processor.py (2): `process_server_metrics_record` (71-107), `summarize` (160-166)

src/aiperf/common/mixins/base_metrics_collector_mixin.py (7)
- src/aiperf/common/hooks.py (3): `background_task` (165-203), `on_init` (231-248), `on_stop` (271-288)
- src/aiperf/common/mixins/aiperf_lifecycle_mixin.py (1): `AIPerfLifecycleMixin` (37-344)
- src/aiperf/common/models/error_models.py (2): `ErrorDetails` (12-76), `from_exception` (66-76)
- tests/unit/gpu_telemetry/test_telemetry_integration.py (2): `record_callback` (96-98), `error_callback` (100-102)
- src/aiperf/gpu_telemetry/telemetry_data_collector.py (1): `_collect_and_process_metrics` (62-77)
- src/aiperf/server_metrics/server_metrics_data_collector.py (1): `_collect_and_process_metrics` (67-84)
- src/aiperf/common/protocols.py (1): `error` (74-74)

src/aiperf/common/duplicate_tracker.py (1)
- src/aiperf/common/protocols.py (1): `trace` (68-68)

src/aiperf/records/records_manager.py (8)
- src/aiperf/common/messages/telemetry_messages.py (2): `TelemetryRecordsMessage` (17-42), `valid` (39-42)
- src/aiperf/common/messages/server_metrics_messages.py (2): `ServerMetricsRecordsMessage` (11-40), `valid` (33-35)
- src/aiperf/common/models/error_models.py (3): `ErrorDetails` (12-76), `ErrorDetailsCount` (103-110), `from_exception` (66-76)
- src/aiperf/common/models/server_metrics_models.py (1): `ServerMetricsRecord` (214-291)
- src/aiperf/common/protocols.py (5): `ResultsProcessorProtocol` (537-543), `ServerMetricsResultsProcessorProtocol` (547-574), `error` (74-74), `process_server_metrics_record` (558-566), `exception` (75-75)
- src/aiperf/common/hooks.py (1): `on_pull_message` (367-387)
- src/aiperf/common/enums/message_enums.py (1): `MessageType` (7-53)
- src/aiperf/post_processors/server_metrics_export_results_processor.py (1): `process_server_metrics_record` (71-107)

src/aiperf/records/__init__.py (1)
- src/aiperf/records/records_manager.py (3): `ErrorTrackingState` (70-80), `RecordsManager` (107-800), `TelemetryTrackingState` (84-97)

src/aiperf/common/config/output_config.py (1)
- src/aiperf/common/config/config_defaults.py (1): `OutputDefaults` (135-155)
🪛 LanguageTool
docs/tutorials/server-metrics.md
[style] ~151-~151: ‘new record’ might be wordy. Consider a shorter alternative.
Context: ...ate written (marks end of period), then new record written (marks start of new period) **...
(EN_WORDINESS_PREMIUM_NEW_RECORD)
[style] ~273-~273: This phrase is redundant. Consider writing “point” or “time”.
Context: ...ng ALL metrics from one endpoint at one point in time. Example from Dynamo frontend: ``...
(MOMENT_IN_TIME)
🪛 markdownlint-cli2 (0.18.1)
docs/cli_options.md
415-415: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4
(MD001, heading-increment)
417-417: Bare URL used
(MD034, no-bare-urls)
417-417: Bare URL used
(MD034, no-bare-urls)
docs/tutorials/server-metrics.md
219-219: Blank line inside blockquote
(MD028, no-blanks-blockquote)
🪛 Ruff (0.14.5)
src/aiperf/common/metric_utils.py
33-33: Avoid specifying long messages outside the exception class
(TRY003)
tests/unit/common/test_duplicate_tracker.py
122-122: This suppression comment is invalid because it cannot be on its own line
Remove this comment
(RUF028)
tests/unit/common/test_metric_utils.py
16-16: This suppression comment is invalid because it cannot be in an expression, pattern, argument list, or other non-statement
Remove this comment
(RUF028)
src/aiperf/server_metrics/server_metrics_manager.py
104-104: Unused method argument: message
(ARG002)
148-148: Do not catch blind exception: Exception
(BLE001)
171-171: Unused method argument: message
(ARG002)
191-191: Do not catch blind exception: Exception
(BLE001)
207-207: Unused method argument: message
(ARG002)
252-252: Do not catch blind exception: Exception
(BLE001)
280-280: Do not catch blind exception: Exception
(BLE001)
291-291: Do not catch blind exception: Exception
(BLE001)
318-318: Do not catch blind exception: Exception
(BLE001)
351-351: Do not catch blind exception: Exception
(BLE001)
src/aiperf/common/config/user_config.py
308-308: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
tests/unit/post_processors/test_server_metrics_export_results_processor.py
384-384: Unused method argument: monkeypatch
(ARG002)
src/aiperf/common/mixins/base_metrics_collector_mixin.py
167-167: Do not catch blind exception: Exception
(BLE001)
174-174: Do not catch blind exception: Exception
(BLE001)
210-210: Avoid specifying long messages outside the exception class
(TRY003)
234-234: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: build (ubuntu-latest, 3.13)
- GitHub Check: build (macos-latest, 3.12)
- GitHub Check: build (ubuntu-latest, 3.12)
- GitHub Check: build (ubuntu-latest, 3.10)
- GitHub Check: build (macos-latest, 3.13)
- GitHub Check: build (macos-latest, 3.10)
- GitHub Check: integration-tests (macos-latest, 3.12)
- GitHub Check: build (macos-latest, 3.11)
- GitHub Check: integration-tests (macos-latest, 3.13)
- GitHub Check: integration-tests (macos-latest, 3.11)
- GitHub Check: integration-tests (ubuntu-latest, 3.10)
- GitHub Check: integration-tests (ubuntu-latest, 3.11)
- GitHub Check: integration-tests (macos-latest, 3.10)
- GitHub Check: build (ubuntu-latest, 3.11)
- GitHub Check: integration-tests (ubuntu-latest, 3.12)
- GitHub Check: integration-tests (ubuntu-latest, 3.13)
ilana-n left a comment
awesome!!! I tried it out on my end and works great :D
See tutorial: https://github.com/ai-dynamo/aiperf/blob/24f17513495eacede9975f15f5afc5cf9293f675/docs/tutorials/server-metrics.md
Summary by CodeRabbit
New Features
Documentation