@@ -6,14 +6,54 @@ metadata:
     {{- include "console-plugin-nvidia-gpu.labels" . | nindent 4 }}
 data:
   dcgm-metrics.csv: |
-    DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, gpu utilization.
-    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, mem utilization.
-    DCGM_FI_DEV_ENC_UTIL, gauge, enc utilization.
-    DCGM_FI_DEV_DEC_UTIL, gauge, dec utilization.
-    DCGM_FI_DEV_POWER_USAGE, gauge, power usage.
+    # === Added by the console plugin ===
     DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX, gauge, power mgmt limit.
-    DCGM_FI_DEV_GPU_TEMP, gauge, gpu temp.
-    DCGM_FI_DEV_SM_CLOCK, gauge, sm clock.
     DCGM_FI_DEV_MAX_SM_CLOCK, gauge, max sm clock.
-    DCGM_FI_DEV_MEM_CLOCK, gauge, mem clock.
     DCGM_FI_DEV_MAX_MEM_CLOCK, gauge, max mem clock.
+
+    # === Available by default ===
+    # Clocks
+    DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
+    DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
+
+    # Temperature
+    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
+    DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
+
+    # Power
+    DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
+    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
+
+    # PCIE
+    DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
+
+    # Utilization (the sample period varies depending on the product)
+    DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
+    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
+    DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
+    DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).
+
+    # Errors and violations
+    DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
+
+    # Memory usage
+    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
+    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
+
+    # NVLink
+    DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
+
+    # VGPU License status
+    DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
+
+    # Remapped rows
+    DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
+    DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
+    DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed
+
+    # DCP metrics
+    DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active.
+    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
+    DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data.
+    DCGM_FI_PROF_PCIE_TX_BYTES, counter, The number of bytes of active pcie tx data including both header and payload.
+    DCGM_FI_PROF_PCIE_RX_BYTES, counter, The number of bytes of active pcie rx data including both header and payload.
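
Since this list is now long and grouped under comment headers, a quick sanity check of the rendered `dcgm-metrics.csv` may be useful before deploying. The sketch below is not part of this change and is not the exporter's own parser; it only assumes the three-column `field, type, help` layout and the `#` comment lines visible in the diff. The names `Counter`, `parse_counters`, and `duplicate_fields` are hypothetical, and treating `gauge` and `counter` as the only valid types is an assumption based on the values used here.

```python
from typing import NamedTuple


class Counter(NamedTuple):
    field: str      # DCGM field name, e.g. DCGM_FI_DEV_GPU_TEMP
    prom_type: str  # metric type as written in the file
    help_text: str  # free-form description


# Assumption: only the two types that appear in the diff are accepted.
ALLOWED_TYPES = {"gauge", "counter"}


def parse_counters(text: str) -> list[Counter]:
    """Parse 'field, type, help' lines, skipping blank lines and '#' comments."""
    counters = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first two commas only, so commas in the help text survive.
        parts = [part.strip() for part in line.split(",", 2)]
        if len(parts) != 3:
            raise ValueError(f"expected 3 fields, got {len(parts)}: {raw!r}")
        field, prom_type, help_text = parts
        if prom_type not in ALLOWED_TYPES:
            raise ValueError(f"unexpected type {prom_type!r} in: {raw!r}")
        counters.append(Counter(field, prom_type, help_text))
    return counters


def duplicate_fields(counters: list[Counter]) -> list[str]:
    """Return field names that appear more than once, in first-seen order."""
    seen, dupes = set(), []
    for counter in counters:
        if counter.field in seen and counter.field not in dupes:
            dupes.append(counter.field)
        seen.add(counter.field)
    return dupes
```

Run against the rendered ConfigMap value, `duplicate_fields(parse_counters(text))` should come back empty. Stripping each column also absorbs the stray space in `DCGM_FI_DEV_DEC_UTIL , gauge`.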