Description
We would like to manage DaemonSets such as gpu-feature-discovery and dcgm-exporter on GKE through the gpu-operator, where the driver is installed in a non-standard location: /home/kubernetes/bin/nvidia/
We do not want to use the device-plugin and container-toolkit via the gpu-operator because:
- GKE NAP (node auto-provisioning) does not allow labelling nodes to disable GKE's default device plugin
- Just enabling container-toolkit via the gpu-operator can cause race conditions where GKE's device plugin registers devices before the gpu-operator's toolkit gets a chance to run
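For context, the setup described above could be expressed as gpu-operator Helm values roughly like the sketch below. It assumes the chart's usual top-level keys (`driver`, `toolkit`, `devicePlugin`, `gfd`, `dcgmExporter`), which should be checked against the chart version in use. Disabling `driver` is also an assumption, based on the driver already being installed by GKE.

```yaml
# Sketch only: key names assume the gpu-operator Helm chart's usual layout.
driver:
  enabled: false        # assumption: GKE already installs the driver under /home/kubernetes/bin/nvidia
toolkit:
  enabled: false        # avoid the race with GKE's device plugin
devicePlugin:
  enabled: false        # GKE's default device plugin stays in charge
gfd:
  enabled: true         # gpu-feature-discovery managed by the operator
dcgmExporter:
  enabled: true         # dcgm-exporter managed by the operator
```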
Sadly, this means we have to play by the rules of GKE's runc, which calls GKE's own nvidia-container-cli variant. That variant does not inject devices, nvidia-smi, or libnvml into gpu-feature-discovery and dcgm-exporter, because these pods do not request any GPU devices through resources.limits.
The ability to add volumes and volume mounts would let us bypass this limitation by manually exposing the devices, binaries, and libraries:
volumes:
  - name: dev
    hostPath:
      path: /dev
      type: ''
  - name: nvidia-install-dir-host
    hostPath:
      path: /home/kubernetes/bin/nvidia
      type: ''
  - name: nvidia-config
    hostPath:
      path: /etc/nvidia
      type: ''
containers:
  - name: nvidia-dcgm-exporter
    volumeMounts:
      - name: dev
        mountPath: /dev
      - name: nvidia-install-dir-host
        mountPath: /usr/local/nvidia
      - name: nvidia-config
        mountPath: /etc/nvidia
    env:
      - name: DCGM_EXPORTER_LISTEN
        value: ':9400'
      - name: DCGM_EXPORTER_KUBERNETES
        value: 'true'
      - name: DCGM_EXPORTER_COLLECTORS
        value: /etc/dcgm-exporter/dcp-metrics-included.csv
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: spec.nodeName
      - name: DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE
        value: uid
      - name: NVIDIA_INSTALL_DIR_HOST
        value: /home/kubernetes/bin/nvidia
      - name: NVIDIA_INSTALL_DIR_CONTAINER
        value: /usr/local/nvidia
      - name: LD_LIBRARY_PATH
        value: >-
          /usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/cuda/lib64