[CONTP-1437] Migrate CSI driver check to be a core check #48596
Conversation
Go Package Import Differences (baseline: 3f2c64a)
Force-pushed from 07823bf to 4e4a241.
Files inventory check summary: file checks results against ancestor 3f2c64a2, for datadog-agent_7.79.0~devel.git.272.376ea45.pipeline.105250456-1_amd64.deb. Detected file changes:
Static quality checks: ❌ Error. Please find below the results from the static quality gates.
Gate failure full details
Static quality gates prevent the PR from merging!
Successful checks (info): 10 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)
Regression Detector Results (metrics dashboard)
Baseline: 3f2c64a
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -1.17 | [-4.10, +1.75] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +2.35 | [+2.20, +2.50] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.40 | [+0.16, +0.65] | 1 | Logs bounds checks dashboard |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | +0.29 | [+0.23, +0.36] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.25 | [+0.11, +0.40] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.18 | [+0.15, +0.22] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_logs | memory utilization | +0.12 | [+0.05, +0.19] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | +0.01 | [-0.07, +0.09] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.01 | [-0.10, +0.11] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.00 | [-0.20, +0.21] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.00 | [-0.19, +0.19] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.00 | [-0.39, +0.39] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | -0.02 | [-0.18, +0.14] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.02 | [-0.46, +0.41] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.08 | [-0.61, +0.45] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.08 | [-0.15, -0.01] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.11 | [-0.28, +0.06] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | -0.26 | [-0.43, -0.08] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.26 | [-0.31, -0.21] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.30 | [-0.53, -0.08] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | -0.33 | [-1.88, +1.23] | 1 | Logs bounds checks dashboard |
| ➖ | file_tree | memory utilization | -0.54 | [-0.60, -0.49] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -1.03 | [-1.14, -0.93] | 1 | Logs |
| ➖ | docker_containers_cpu | % cpu utilization | -1.17 | [-4.10, +1.75] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 691 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 272.23MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 697 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.23GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.22GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 174.64MiB ≤ 175MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 490.68MiB ≤ 550MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 202.98MiB ≤ 220MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 355.69 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 410.08MiB ≤ 475MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
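Each row in the bounds-check table is a simple comparison of an observed value against a fixed limit, evaluated once per replicate. A minimal illustrative sketch of that comparison (not the detector's actual implementation):

```go
package main

import "fmt"

// passes mirrors a single bounds-check row from the table above:
// an observed value compared against a fixed limit, e.g.
// "174.64MiB <= 175MiB" or "691 >= 26".
func passes(observed, limit float64, op string) bool {
	switch op {
	case "<=":
		return observed <= limit
	case ">=":
		return observed >= limit
	case "==":
		return observed == limit
	default:
		return false
	}
}

func main() {
	fmt.Println(passes(174.64, 175, "<=")) // quality_gate_idle memory_usage
	fmt.Println(passes(691, 26, ">="))     // docker_containers_cpu simple_check_run
}
```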
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we flag a change in performance as a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
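The three criteria combine into a single decision per experiment. A sketch of that rule (illustrative only, not the Regression Detector's code):

```go
package main

import (
	"fmt"
	"math"
)

// isRegression encodes the criteria above: the effect size meets
// the 5.00% tolerance, the 90% confidence interval for Δ mean %
// excludes zero, and the experiment is not marked erratic.
func isRegression(deltaMeanPct, ciLow, ciHigh float64, erratic bool) bool {
	const tolerance = 5.0
	bigEnough := math.Abs(deltaMeanPct) >= tolerance
	ciExcludesZero := ciLow > 0 || ciHigh < 0
	return bigEnough && ciExcludesZero && !erratic
}

func main() {
	// tcp_syslog_to_blackhole above: +2.35 [+2.20, +2.50]. The CI
	// excludes zero, but |2.35| < 5.00, so it is not flagged.
	fmt.Println(isRegression(2.35, 2.20, 2.50, false))
}
```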
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
Force-pushed from 4e4a241 to f7574d5.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f091ab8e6a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Codex Review: Didn't find any major issues. More of your lovely PRs please.
What does this PR do?
Migrates the `datadog_csi_driver` check from a Python OpenMetrics integration (in `integrations-core`) to a native Go core check, and adds a COAT (Cross-Org Agent Telemetry) profile so that CSI driver metrics are collected as internal agent telemetry.

Changes:
- New Go core check (`pkg/collector/corechecks/containers/csi_driver/`): scrapes the CSI driver's Prometheus endpoint (`/metrics`) and submits `datadog.csi_driver.node_publish_volume_attempts.count` and `datadog.csi_driver.node_unpublish_volume_attempts.count` as `MonotonicCount` metrics. Handles counter names both with and without the `_total` suffix that Prometheus client libraries may append.
- Internal telemetry: the same counts are registered as `comp/core/telemetry` counters, enabling Datadog to observe CSI driver adoption and health across all agent deployments without requiring customers to configure anything.
- COAT profile (`comp/core/agenttelemetry/impl/config.go`): a `csi-driver` profile is added to `defaultProfiles`, collecting `node_publish_volume_attempts` (aggregated by `status`, `type`) and `node_unpublish_volume_attempts` (aggregated by `status`). Zero-valued metrics are excluded. Collection starts 60s after agent boot and repeats every 15 minutes.
- Registration: the check is registered in `pkg/commonchecks/corechecks.go`, and the config directory is added to `AGENT_CORECHECKS` in `tasks/agent.py` so `conf.yaml.default` ships with the agent.
- Default AD config (`cmd/agent/dist/conf.d/datadog_csi_driver.d/conf.yaml.default`): discovers the CSI driver container via the `csi-driver` AD identifier and targets `http://%%host%%:5000/metrics`.

Motivation

The Python `datadog_csi_driver` integration in `integrations-core` collects metrics over OpenMetrics but cannot participate in COAT, because COAT only collects from the agent's internal telemetry registry (`comp/core/telemetry`). Python checks run in the rtloader and cannot register Go-side telemetry counters.

Migrating to a Go core check enables:
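For reference, the default AD template described in the Changes list might look roughly like this. The structure is assumed from standard agent `conf.yaml.default` files; only the `csi-driver` identifier, the `openmetrics_endpoint` key, and the endpoint URL come from this PR's description:

```yaml
# Hypothetical sketch of conf.d/datadog_csi_driver.d/conf.yaml.default
ad_identifiers:
  - csi-driver
init_config:
instances:
  - openmetrics_endpoint: http://%%host%%:5000/metrics
```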
Backwards compatibility

This migration is backwards compatible and can ship in the same release that drops the Python integration:
- Check name (`datadog_csi_driver`): the Go core check loader takes priority over the Python wheel loader. When both are present, the Go check wins automatically.
- Metric names: `datadog.csi_driver.node_publish_volume_attempts.count` and `datadog.csi_driver.node_unpublish_volume_attempts.count` are identical to the Python integration's output.
- Autodiscovery: the `csi-driver` AD identifier is unchanged, so existing Kubernetes annotations and Helm chart configurations continue to work without modification.
- Service check: `datadog.csi_driver.openmetrics.health` reports OK/Critical.
- Config: the `openmetrics_endpoint` instance config key is preserved.

Describe how you validated your changes
Local Kind cluster testing:
- Deployed a sample workload mounting a CSI volume (`DSDSocketDirectory` type) to trigger `NodePublishVolume`/`NodeUnpublishVolume` calls.
- Ran `agent configcheck`, verifying the configuration source and check loader.
- Confirmed the agent submits `datadog.csi_driver.node_publish_volume_attempts.count` with the expected tags (`status`, `type`, `path`).

COAT validation:
- Verified the `csi-driver` profile payload contains the expected volume-attempt counters.

Additional Notes
- The check handles `_total` counter suffix normalization defensively. Current CSI driver versions expose counters without `_total`, but the OpenMetrics spec mandates it, and future Prometheus client library upgrades may add it.
- Zero-valued metrics are excluded (`zero_metric: true`), so the `csi-driver` profile only produces payloads after at least one volume operation has occurred.
- Unit tests (`check_test.go`) cover configuration parsing, successful metric submission, `_total` suffix handling, and error scenarios (endpoint down, empty response).
- A follow-up in `integrations-core` should deprecate or remove the Python `datadog_csi_driver` integration.
_totalcounter suffix normalization defensively. Current CSI driver versions expose counters without_total, but the OpenMetrics spec mandates it and future Prometheus client library upgrades may add it.zero_metric: true), so thecsi-driverprofile only produces payloads after at least one volume operation has occurred.check_test.go) cover configuration parsing, successful metric submission,_totalsuffix handling, and error scenarios (endpoint down, empty response).integrations-coreshould deprecate or remove the Pythondatadog_csi_driverintegration.