Commit 62fffb5

chore: metrics (#218)

1 parent c369130 commit 62fffb5

1 file changed: docs/METRICS.md (+218 -0 lines)
# NVSentinel Metrics

This document outlines all Prometheus metrics exposed by NVSentinel components.

## Table of Contents

- [Fault Quarantine Module](#fault-quarantine-module)
- [Labeler Module](#labeler-module)
- [Janitor](#janitor)
- [Platform Connectors](#platform-connectors)
- [Health Monitors](#health-monitors)
  - [GPU Health Monitor](#gpu-health-monitor)
  - [Syslog Health Monitor](#syslog-health-monitor)
- [Metrics Configuration](#metrics-configuration)
- [Metric Types Reference](#metric-types-reference)
- [Common Label Values](#common-label-values)

---

## Fault Quarantine Module

### Event Processing Metrics

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `fault_quarantine_events_received_total` | Counter | - | Total number of events received from the watcher |
| `fault_quarantine_events_successfully_processed_total` | Counter | - | Total number of events successfully processed |
| `fault_quarantine_events_skipped_total` | Counter | - | Total number of events received for nodes that are already cordoned |
| `fault_quarantine_processing_errors_total` | Counter | `error_type` | Total number of errors encountered during event processing |
| `fault_quarantine_event_backlog_count` | Gauge | - | Number of health events that fault quarantine has yet to process |
| `fault_quarantine_event_handling_duration_seconds` | Histogram | - | Histogram of event handling durations |
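
As an illustration of how these counters and gauges might be consumed, here is a minimal Prometheus alerting-rule sketch; the alert names, thresholds, and evaluation windows are placeholders, not part of NVSentinel.

```yaml
# Hypothetical alerting rules built on the event-processing metrics above.
# Thresholds and windows are illustrative; tune them for your cluster.
groups:
  - name: nvsentinel-fault-quarantine-events
    rules:
      - alert: FaultQuarantineProcessingErrors
        # Any processing errors in the last 10 minutes, broken out by error_type.
        expr: increase(fault_quarantine_processing_errors_total[10m]) > 0
        labels:
          severity: warning
      - alert: FaultQuarantineBacklogGrowing
        # Backlog gauge stays above 100 pending events for 15 minutes.
        expr: fault_quarantine_event_backlog_count > 100
        for: 15m
        labels:
          severity: warning
```
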

### Node Quarantine Metrics

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `fault_quarantine_nodes_quarantined_total` | Counter | `node` | Total number of nodes quarantined |
| `fault_quarantine_nodes_unquarantined_total` | Counter | `node` | Total number of nodes unquarantined |
| `fault_quarantine_current_quarantined_nodes` | Gauge | `node` | Current number of quarantined nodes |
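
Since the quarantine gauge is labeled per node, a cluster-wide total has to be aggregated at query time. A sketch, assuming the per-node gauge reports 1 while a node is quarantined (verify against the actual exporter):

```yaml
# Hypothetical recording rule aggregating the per-node gauge into one series.
groups:
  - name: nvsentinel-quarantine-aggregates
    rules:
      - record: nvsentinel:quarantined_nodes:count
        expr: sum(fault_quarantine_current_quarantined_nodes)
```
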

### Taint and Cordon Metrics

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `fault_quarantine_taints_applied_total` | Counter | `taint_key`, `taint_effect` | Total number of taints applied to nodes |
| `fault_quarantine_taints_removed_total` | Counter | `taint_key`, `taint_effect` | Total number of taints removed from nodes |
| `fault_quarantine_cordons_applied_total` | Counter | - | Total number of cordons applied to nodes |
| `fault_quarantine_cordons_removed_total` | Counter | - | Total number of cordons removed from nodes |

### Ruleset Evaluation Metrics

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `fault_quarantine_ruleset_evaluations_total` | Counter | `ruleset` | Total number of ruleset evaluations |
| `fault_quarantine_ruleset_passed_total` | Counter | `ruleset` | Total number of ruleset evaluations that passed |
| `fault_quarantine_ruleset_failed_total` | Counter | `ruleset` | Total number of ruleset evaluations that failed |

### Circuit Breaker Metrics

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `fault_quarantine_breaker_state` | Gauge | `state` | State of the fault quarantine circuit breaker |
| `fault_quarantine_breaker_utilization` | Gauge | - | Utilization of the fault quarantine circuit breaker |
| `fault_quarantine_get_total_nodes_duration_seconds` | Histogram | `result` | Duration of `getTotalNodesWithRetry` calls in seconds |
| `fault_quarantine_get_total_nodes_errors_total` | Counter | `error_type` | Total number of errors from `getTotalNodesWithRetry` |
| `fault_quarantine_get_total_nodes_retry_attempts` | Histogram | - | Number of retry attempts needed for `getTotalNodesWithRetry` (buckets: 0, 1, 2, 3, 5, 10) |
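
A hedged example of watching the node-count retrieval path that feeds the circuit breaker; the rule names and thresholds are illustrative and the queries only rely on the metric names in the table above:

```yaml
# Hypothetical rules around the getTotalNodesWithRetry path.
groups:
  - name: nvsentinel-fault-quarantine-breaker
    rules:
      - record: nvsentinel:get_total_nodes:error_rate_5m
        expr: sum(rate(fault_quarantine_get_total_nodes_errors_total[5m])) by (error_type)
      - alert: FaultQuarantineNodeCountErrors
        # Sustained failures retrieving the node count over 10 minutes.
        expr: sum(rate(fault_quarantine_get_total_nodes_errors_total[10m])) > 0
        for: 10m
        labels:
          severity: warning
```
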
---

## Labeler Module

### Event Processing Metrics

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `labeler_events_processed_total` | Counter | `status` | Total number of pod events processed. Status values: `success`, `failed` |
| `labeler_node_update_failures_total` | Counter | - | Total number of node update failures during reconciliation |
| `labeler_event_handling_duration_seconds` | Histogram | - | Histogram of event handling durations |
| `labeler_node_update_duration_seconds` | Histogram | - | Histogram of node update operation durations |
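
Because `labeler_events_processed_total` carries a `status` label, a failure ratio can be derived at query time. A sketch (the recording-rule name is made up):

```yaml
# Hypothetical recording rule: fraction of labeler events that failed over 5 minutes.
groups:
  - name: nvsentinel-labeler
    rules:
      - record: nvsentinel:labeler_events:failure_ratio_5m
        expr: |
          sum(rate(labeler_events_processed_total{status="failed"}[5m]))
            /
          sum(rate(labeler_events_processed_total[5m]))
```
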
---

## Janitor

### Action Metrics

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `janitor_actions_count` | Counter | `action_type`, `status`, `node` | Total number of janitor actions by type and status. Action types: `reboot`, `terminate`. Status values: `started`, `succeeded`, `failed` |
| `janitor_action_mttr_seconds` | Histogram | `action_type` | Time taken to complete janitor actions (Mean Time To Repair). Uses exponential buckets (start 10, factor 2, count 10) for log-scale MTTR measurement |
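
Since MTTR is a histogram, percentiles come from `histogram_quantile` over the bucket series. A minimal sketch, assuming the standard `_bucket` suffix emitted by Prometheus client libraries:

```yaml
# Hypothetical recording rule: p95 repair time per action type over the last hour.
groups:
  - name: nvsentinel-janitor
    rules:
      - record: nvsentinel:janitor_action_mttr_seconds:p95_1h
        expr: |
          histogram_quantile(
            0.95,
            sum(rate(janitor_action_mttr_seconds_bucket[1h])) by (action_type, le)
          )
```
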
---

## Platform Connectors

### Server Metrics

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `platform_connector_health_events_received_total` | Counter | - | The total number of health events that the platform connector has received |

### Workqueue Metrics

These metrics track the performance of the internal ring-buffer workqueue:

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `platform_connector_workqueue_depth_<name>` | Gauge | `workqueue` | Current depth of the platform connector workqueue |
| `platform_connector_workqueue_adds_total_<name>` | Counter | `workqueue` | Total number of adds handled by the platform connector workqueue |
| `platform_connector_workqueue_latency_seconds_<name>` | Histogram | `workqueue` | How long an item stays in the workqueue before being requested. Uses linear buckets (start 0, width 10, count 500) |
| `platform_connector_workqueue_work_duration_seconds_<name>` | Histogram | `workqueue` | How long processing an item from the workqueue takes. Uses linear buckets (start 0, width 10, count 500) |
| `platform_connector_workqueue_retries_total_<name>` | Counter | `workqueue` | Total number of retries handled by the platform connector workqueue |
| `platform_connector_workqueue_longest_running_processor_seconds_<name>` | Gauge | `workqueue` | How many seconds the longest-running processor for the workqueue has been running |
| `platform_connector_workqueue_unfinished_work_seconds_<name>` | Gauge | `workqueue` | Total time in seconds of work in progress in the workqueue |

**Note:** `<name>` in the metric names is replaced with the actual workqueue name at runtime.
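
Because `<name>` is baked into the metric name, queries have to target a concrete workqueue. A sketch assuming a hypothetical workqueue named `events` (substitute the real name, and confirm the `_bucket` series exists for the histogram):

```yaml
# Hypothetical query against a workqueue named "events"; the name is a placeholder.
groups:
  - name: nvsentinel-platform-connector-workqueue
    rules:
      - record: nvsentinel:workqueue_events:p99_latency_5m
        expr: |
          histogram_quantile(
            0.99,
            sum(rate(platform_connector_workqueue_latency_seconds_events_bucket[5m])) by (le)
          )
```
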
---

## Health Monitors

### GPU Health Monitor

These metrics track GPU health events detected via DCGM (Data Center GPU Manager):

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `dcgm_health_events_publish_time_to_grpc_channel` | Histogram | `operation_name` | Time spent publishing DCGM health events on the gRPC channel |
| `health_events_insertion_to_uds_succeed` | Counter | - | Total number of health events successfully inserted into the UDS |
| `health_events_insertion_to_uds_error` | Gauge | - | Number of errors encountered while inserting health events into the UDS |
| `dcgm_health_active_non_fatal_health_events` | Gauge | `event_type`, `gpu_id` | Number of active non-fatal health events at any given time |
| `dcgm_health_active_fatal_health_events` | Gauge | `event_type`, `gpu_id` | Number of active fatal health events at any given time |
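
A hedged example of paging on fatal GPU events; the severity routing and wait time are illustrative and the expression only relies on the gauges above:

```yaml
# Hypothetical alert: any GPU currently reporting a fatal DCGM health event.
groups:
  - name: nvsentinel-gpu-health
    rules:
      - alert: GpuFatalHealthEventActive
        expr: dcgm_health_active_fatal_health_events > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Fatal DCGM health event on GPU {{ $labels.gpu_id }} ({{ $labels.event_type }})"
```
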
---

### Syslog Health Monitor

The syslog health monitor tracks GPU-related errors detected in system logs.

#### XID Error Metrics

XID (GPU error ID) errors are error codes reported by the NVIDIA GPU driver:

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `syslog_health_monitor_xid_errors` | Counter | `node`, `err_code` | Total number of XID errors found |
| `syslog_health_monitor_xid_processing_errors` | Counter | `error_type`, `node` | Total number of errors encountered during XID processing |
| `syslog_health_monitor_xid_processing_latency_seconds` | Histogram | - | Histogram of XID processing latency |
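
An example of surfacing new XID errors per node; the window and severity are placeholders:

```yaml
# Hypothetical alert: one or more new XID errors logged on a node in the last 15 minutes.
groups:
  - name: nvsentinel-syslog-xid
    rules:
      - alert: GpuXidErrorsDetected
        expr: sum(increase(syslog_health_monitor_xid_errors[15m])) by (node, err_code) > 0
        labels:
          severity: warning
        annotations:
          summary: "XID {{ $labels.err_code }} reported on node {{ $labels.node }}"
```
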

#### SXID Error Metrics

SXID errors are NVSwitch-related errors:

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `syslog_health_monitor_sxid_errors` | Counter | `node`, `err_code`, `link`, `nvswitch` | Total number of SXID errors found |

#### GPU Fallen Off Bus Metrics

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `syslog_health_monitor_gpu_fallen_errors` | Counter | `node` | Total number of GPU fallen off the bus errors detected |
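
A GPU falling off the bus usually requires intervention, so even a single occurrence is worth alerting on. A sketch (window and severity are placeholders):

```yaml
# Hypothetical alert: a "GPU fallen off the bus" error detected on a node recently.
groups:
  - name: nvsentinel-syslog-gpu-bus
    rules:
      - alert: GpuFallenOffBus
        expr: increase(syslog_health_monitor_gpu_fallen_errors[30m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "GPU fallen off the bus on node {{ $labels.node }}"
```
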
---

## Metrics Configuration

### Scraping Metrics

All NVSentinel components expose Prometheus metrics on a metrics endpoint (typically `:2112/metrics`). The metrics can be scraped by Prometheus using standard scrape configurations.
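
For clusters without the Prometheus Operator, a plain scrape job is enough. A minimal sketch; the job name and target address are illustrative and only assume the default `:2112` port mentioned above:

```yaml
# Hypothetical static scrape config for a single NVSentinel component.
scrape_configs:
  - job_name: nvsentinel
    metrics_path: /metrics
    static_configs:
      - targets:
          - nvsentinel-component.nvsentinel.svc:2112
```
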

### Helm Chart Configuration

The NVSentinel Helm chart automatically creates a `PodMonitor` resource for Prometheus Operator integration:

```bash
helm install nvsentinel ./distros/kubernetes/nvsentinel \
  --namespace nvsentinel --create-namespace
```

The PodMonitor is configured to scrape all NVSentinel component pods on their metrics endpoints (`/metrics` on port `metrics`).
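
For reference, a PodMonitor equivalent to what the chart creates might look roughly like the following; the label selector and namespace are assumptions, so check the rendered chart output for the real values:

```yaml
# Hypothetical PodMonitor; the selector labels are placeholders, not the chart's actual values.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: nvsentinel
  namespace: nvsentinel
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: nvsentinel   # placeholder label
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
```
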

### Annotation-based Discovery

Components can be configured to include Prometheus scrape annotations:

```yaml
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "2112"
  prometheus.io/path: "/metrics"
```
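
These annotations only take effect if the Prometheus server is configured to honor them. A commonly used relabeling sketch for Kubernetes pod discovery (not NVSentinel-specific, shown here for illustration):

```yaml
# Typical kubernetes_sd scrape job that honors the prometheus.io/* pod annotations.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only keep pods that opt in via prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Override the metrics path from prometheus.io/path, if set.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: "(.+)"
      # Override the scrape port from prometheus.io/port, if set.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: "([^:]+)(?::\\d+)?;(\\d+)"
        replacement: "$1:$2"
        target_label: __address__
```
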
---

## Metric Types Reference

- **Counter**: A cumulative metric that only increases or resets to zero on restart
- **Gauge**: A metric that can arbitrarily go up and down
- **Histogram**: Samples observations and counts them in configurable buckets
- **Summary**: Similar to a histogram but calculates configurable quantiles over a sliding time window
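
To make the distinction concrete, the queries below show the typical way each type is consumed; the metric names are taken from the tables above, and the rule names and windows are illustrative:

```yaml
# Illustrative queries, one per metric type used by NVSentinel.
groups:
  - name: nvsentinel-metric-type-examples
    rules:
      # Counter: query a rate/increase, not the raw cumulative value.
      - record: example:events_received:rate_5m
        expr: rate(fault_quarantine_events_received_total[5m])
      # Gauge: the instantaneous value is meaningful as-is.
      - record: example:event_backlog:current
        expr: fault_quarantine_event_backlog_count
      # Histogram: derive quantiles from the _bucket series.
      - record: example:event_handling:p90_5m
        expr: histogram_quantile(0.90, sum(rate(fault_quarantine_event_handling_duration_seconds_bucket[5m])) by (le))
```
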
---

## Common Label Values

### Status Labels

- `success` / `failed` - Operation outcome
- `started` / `succeeded` / `failed` - Action lifecycle status

### Action Types

- `reboot` - Node reboot action
- `terminate` - Node termination action

### CSP Labels

- `gcp` - Google Cloud Platform
- `aws` - Amazon Web Services

### Trigger Types

- `quarantine` - Node quarantine trigger
- `healthy` - Node healthy trigger
