The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04.4 LTS
- Kernel Version: 5.4.0-147-generic
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd://1.7.0-rc.1
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8s, v1.26.2
- GPU Operator Version: gpu-operator-v23.9.0
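For completeness, the details above can be gathered with commands along these lines (a rough sketch; the gpu-operator namespace is an assumption):
cat /etc/os-release | grep PRETTY_NAME   # OS/Version
uname -r                                 # Kernel Version
kubectl get nodes -o wide                # K8s version and container runtime per node
helm list -n gpu-operator                # GPU Operator chart version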
2. Issue or feature description
Briefly explain the issue in terms of expected behavior and current behavior.
Relabelings are supported in the values.yaml file in the official repository:
dcgmExporter:
  enabled: true
  repository: nvcr.io/nvidia/k8s
  image: dcgm-exporter
  version: 3.2.6-3.1.9-ubuntu20.04
  imagePullPolicy: IfNotPresent
  env:
    - name: DCGM_EXPORTER_LISTEN
      value: ":9400"
    - name: DCGM_EXPORTER_KUBERNETES
      value: "true"
    - name: DCGM_EXPORTER_COLLECTORS
      value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
  resources: {}
  serviceMonitor:
    enabled: false
    interval: 15s
    honorLabels: false
    additionalLabels: {}
    relabelings: []
    # - source_labels:
    #     - __meta_kubernetes_pod_node_name
    #   regex: (.*)
    #   target_label: instance
    #   replacement: $1
    #   action: replace

I installed the latest version of the NVIDIA GPU Operator with Helm and customized my values.yaml file as follows:
cdi:
  enabled: true
  default: true
driver:
  enabled: false
  rdma:
    enabled: true
    useHostMofed: true
toolkit:
  enabled: false
validator:
  plugin:
    env:
      - name: WITH_WORKLOAD
        value: "false"
dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true
    relabelings:
      - action: replace
        sourceLabels:
          - __meta_kubernetes_pod_node_name
        targetLabel: instance

My Helm releases:
$ helm ls --all-namespaces
NAME          NAMESPACE     REVISION  UPDATED                               STATUS    CHART                 APP VERSION
gpu-operator  gpu-operator  10        2023-11-06 16:58:33.967677 +0800 CST  deployed  gpu-operator-v23.9.0  v23.9.0
But the relabelings configuration still does not take effect!
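For reference, one way to confirm what the operator actually rendered is to compare the values Helm applied with the generated ServiceMonitor; this is only a sketch, and the ServiceMonitor name nvidia-dcgm-exporter is an assumption based on the default naming:
helm get values gpu-operator -n gpu-operator                              # values actually applied to the release
kubectl get servicemonitors -n gpu-operator                               # list ServiceMonitors created by the operator
kubectl get servicemonitor nvidia-dcgm-exporter -n gpu-operator -o yaml   # check the relabelings field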
Others:
3. Steps to reproduce the issue
Detailed steps to reproduce the issue.
None.
4. Information to attach (optional if deemed irrelevant)
- kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
- kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
- If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- If a pod/ds is in an error state or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- containerd logs: journalctl -u containerd > containerd.log
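For this particular report, the most relevant pieces are likely the dcgm-exporter pods and their logs; a tailored sketch, assuming the default gpu-operator namespace and the app=nvidia-dcgm-exporter pod label:
kubectl get pods -n gpu-operator
kubectl get ds -n gpu-operator
kubectl logs -n gpu-operator -l app=nvidia-dcgm-exporter --all-containers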
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]