Main issue:
I am not able to use the GPU inside minikube due to permission issues.
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
>> uname -a
Linux xxx 6.5.0-25-generic #25~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Feb 20 16:09:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
- Kernel Version: 6.5.0-25-generic
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Docker
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): minikube v1.32.0
>> kubectl config view
apiVersion: v1
clusters:
- cluster:
certificate-authority-data: DATA+OMITTED
server: https://kubernetes.docker.internal:6443
name: docker-desktop
- cluster:
certificate-authority: /home/leo/.minikube/ca.crt
extensions:
- extension:
last-update: Mon, 11 Mar 2024 13:40:51 CET
provider: minikube.sigs.k8s.io
version: v1.32.0
name: cluster_info
server: https://192.168.49.2:8443
name: minikube
contexts:
- context:
cluster: docker-desktop
user: docker-desktop
name: docker-desktop
- context:
cluster: minikube
extensions:
- extension:
last-update: Mon, 11 Mar 2024 13:40:51 CET
provider: minikube.sigs.k8s.io
version: v1.32.0
name: context_info
namespace: default
user: minikube
name: minikube
current-context: minikube
kind: Config
preferences: {}
users:
- name: docker-desktop
user:
client-certificate-data: DATA+OMITTED
client-key-data: DATA+OMITTED
- name: minikube
user:
client-certificate: /home/leo/.minikube/profiles/minikube/client.crt
client-key: /home/leo/.minikube/profiles/minikube/client.key
- GPU Operator Version: v23.9.2 (from the helm.sh/chart label further below)
Host driver, per nvidia-smi:
>> nvidia-smi
Mon Mar 11 13:55:34 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3070 ... Off | 00000000:01:00.0 Off | N/A |
| N/A 51C P8 13W / 80W | 1353MiB / 8192MiB | 13% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3888 G /usr/lib/xorg/Xorg 416MiB |
| 0 N/A N/A 4220 G /usr/bin/gnome-shell 113MiB |
| 0 N/A N/A 7233 G ...irefox/3941/usr/lib/firefox/firefox 476MiB |
| 0 N/A N/A 8787 G ...irefox/3941/usr/lib/firefox/firefox 151MiB |
| 0 N/A N/A 9794 G ...irefox/3941/usr/lib/firefox/firefox 41MiB |
| 0 N/A N/A 31467 G ...sion,SpareRendererForSitePerProcess 71MiB |
| 0 N/A N/A 116653 G ...,WinRetrieveSuggestionsOnlyOnDemand 31MiB |
| 0 N/A N/A 132611 C+G warp-terminal 20MiB |
+---------------------------------------------------------------------------------------+
2. Issue or feature description
Using minikube, Kubernetes, Helm and the gpu-operator, I get the following error for the nvidia-operator-validator pod:
Error: failed to start container "toolkit-validation": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: nvml error: insufficient permissions: unknown
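The same NVML permission failure can usually be reproduced directly on the minikube node, which helps confirm the problem sits in the NVIDIA prestart hook rather than in the operator itself. A minimal sketch, assuming the toolkit daemonset has already registered the nvidia runtime with Docker inside the node (the ubuntu:22.04 image is only an example):
# run on the host
minikube ssh
# inside the minikube node
sudo docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all ubuntu:22.04 nvidia-smi
# if the hook cannot initialize NVML, this fails with the same
# "nvidia-container-cli: initialization error: nvml error: insufficient permissions"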
3. Steps to reproduce the issue
I think something is broken in my permission/user setup and I am running out of ideas on how to resolve it; the rough setup is sketched below.
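A minimal sketch of the assumed setup, since the original commands were not captured (the --gpus flag and the driver.enabled=false value are assumptions, based on minikube v1.32.0 and the pre-installed 535.129.03 host driver):
# start minikube with the docker driver and GPU passthrough (minikube >= 1.32)
minikube start --driver=docker --container-runtime=docker --gpus=all
# install the GPU Operator, skipping the driver container because the
# driver is already installed on the host
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator --set driver.enabled=false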
4. Information to attach (optional if deemed irrelevant)
- kubernetes pods status:
kubectl get pods -n OPERATOR_NAMESPACE
kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-qmktd 0/1 Init:0/1 0 9m28s
gpu-operator-574c687b59-pcjwr 1/1 Running 0 10m
gpu-operator-node-feature-discovery-gc-7cc7ccfff8-9vvk8 1/1 Running 0 10m
gpu-operator-node-feature-discovery-master-d8597d549-qqkpv 1/1 Running 0 10m
gpu-operator-node-feature-discovery-worker-xcwnx 1/1 Running 0 10m
nvidia-container-toolkit-daemonset-r8ktc 1/1 Running 0 9m28s
nvidia-dcgm-exporter-mhxx4 0/1 Init:0/1 0 9m28s
nvidia-device-plugin-daemonset-v79cd 0/1 Init:0/1 0 9m28s
nvidia-operator-validator-ptj47 0/1 Init:CrashLoopBackOff 6 (3m28s ago) 9m28s
- kubernetes daemonset status:
kubectl get ds -n OPERATOR_NAMESPACE
kubectl get ds -n gpu-operator
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 1 1 0 1 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 9m57s
gpu-operator-node-feature-discovery-worker 1 1 1 1 1 <none> 10m
nvidia-container-toolkit-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.container-toolkit=true 9m57s
nvidia-dcgm-exporter 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm-exporter=true 9m57s
nvidia-device-plugin-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.device-plugin=true 9m57s
nvidia-driver-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.driver=true 9m57s
nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 9m57s
nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 9m57s
- If a pod/ds is in an error state or pending state
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl describe pod -n gpu-operator nvidia-operator-validator-ptj47
Name: nvidia-operator-validator-ptj47
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-operator-validator
Node: minikube/192.168.49.2
Start Time: Mon, 11 Mar 2024 13:46:54 +0100
Labels: app=nvidia-operator-validator
app.kubernetes.io/managed-by=gpu-operator
app.kubernetes.io/part-of=gpu-operator
controller-revision-hash=74c7484fb6
helm.sh/chart=gpu-operator-v23.9.2
pod-template-generation=1
Annotations: <none>
Status: Pending
IP: 10.244.0.16
IPs:
IP: 10.244.0.16
Controlled By: DaemonSet/nvidia-operator-validator
Init Containers:
driver-validation:
Container ID: docker://871f1cc1d632838d5e168db3bfe66f10cba3c84c070366cd36c654955e891f6f
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
Image ID: docker-pullable://nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:9aefef081c3ab1123556374d2b15d0429f3990af2fbaccc3c9827801e1042703
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 11 Mar 2024 13:47:03 +0100
Finished: Mon, 11 Mar 2024 13:47:03 +0100
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: driver
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-path (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
toolkit-validation:
Container ID: docker://71762f7b569cd2ceba213aa845fe6c2598cec3889dfdf0902f9ef68f273cf622
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
Image ID: docker-pullable://nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:9aefef081c3ab1123556374d2b15d0429f3990af2fbaccc3c9827801e1042703
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: ContainerCannotRun
Message: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
Exit Code: 128
Started: Mon, 11 Mar 2024 13:52:54 +0100
Finished: Mon, 11 Mar 2024 13:52:54 +0100
Ready: False
Restart Count: 6
Environment:
NVIDIA_VISIBLE_DEVICES: all
WITH_WAIT: false
COMPONENT: toolkit
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
cuda-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
WITH_WAIT: false
COMPONENT: cuda
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
plugin-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
COMPONENT: plugin
WITH_WAIT: false
WITH_WORKLOAD: false
MIG_STRATEGY: single
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
Containers:
nvidia-operator-validator:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
echo all validations are successful; sleep infinity
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-path:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType:
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
host-dev-char:
Type: HostPath (bare host directory volume)
Path: /dev/char
HostPathType:
kube-api-access-cjskm:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.operator-validator=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 10m default-scheduler Successfully assigned gpu-operator/nvidia-operator-validator-ptj47 to minikube
Normal Pulling 10m kubelet Pulling image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2"
Normal Pulled 10m kubelet Successfully pulled image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2" in 2.116s (7.99s including waiting)
Normal Created 10m kubelet Created container driver-validation
Normal Started 10m kubelet Started container driver-validation
Warning Failed 10m (x3 over 10m) kubelet Error: failed to start container "toolkit-validation": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: insufficient permissions: unknown
Normal Pulled 8m59s (x5 over 10m) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2" already present on machine
Normal Created 8m58s (x5 over 10m) kubelet Created container toolkit-validation
Warning Failed 8m58s (x2 over 9m49s) kubelet Error: failed to start container "toolkit-validation": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
Warning BackOff 32s (x46 over 10m) kubelet Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-ptj47_gpu-operator(7c2a5005-4339-4674-82c7-244051860212)
- If a pod/ds is in an error state or pending state
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
kubectl logs -n gpu-operator nvidia-operator-validator-ptj47
Defaulted container "nvidia-operator-validator" out of: nvidia-operator-validator, driver-validation (init), toolkit-validation (init), cuda-validation (init), plugin-validation (init)
Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-ptj47" is waiting to start: PodInitializing
- Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
Not able to: the nvidia-driver-daemonset has 0 desired pods (the driver is pre-installed on the host), so there is no nvidia-driver-ctr container to exec into. See the alternative check below.
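As an alternative, nvidia-smi can be run through Docker on the host to confirm that the driver and container toolkit work outside minikube; a sketch (the CUDA image tag is only an example):
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi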
- containerd logs
journalctl -u containerd > containerd.log
It is huge and does not seem to contain anything relevant; I can post it later if needed. A filtering sketch follows.
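One way to narrow the log down; note that the minikube node itself runs Docker (the container IDs above are docker://), so the runtime log inside the node may be more relevant than the host containerd log:
# host-side containerd, NVIDIA-related lines only
journalctl -u containerd | grep -i nvidia > containerd-nvidia.log
# Docker daemon log inside the minikube node
minikube ssh -- sudo journalctl -u docker --no-pager | grep -i nvidia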
Extra: device node permissions on the host:
ls -l /dev/nvidia*
crw-rw---- 1 root vglusers 195, 0 Mar 11 11:24 /dev/nvidia0
crw-rw---- 1 root vglusers 195, 255 Mar 11 11:24 /dev/nvidiactl
crw-rw---- 1 root vglusers 195, 254 Mar 11 11:24 /dev/nvidia-modeset
crw-rw-rw- 1 root root 508, 0 Mar 11 11:24 /dev/nvidia-uvm
crw-rw-rw- 1 root root 508, 1 Mar 11 11:24 /dev/nvidia-uvm-tools
/dev/nvidia-caps:
total 0
cr-------- 1 root root 511, 1 Mar 11 11:30 nvidia-cap1
cr--r--r-- 1 root root 511, 2 Mar 11 11:30 nvidia-cap2
getent group vglusers
vglusers:x:1002:leo,root
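The non-default root:vglusers ownership on /dev/nvidia0, /dev/nvidiactl and /dev/nvidia-modeset (typically applied by VirtualGL's vglserver_config) looks like a plausible contributor: only root or members of vglusers can open those devices and initialize NVML. A hedged way to test this on the host (diagnostic only, not a permanent fix; vglserver_config or udev may re-apply the restriction):
# check whether a process outside the vglusers group can initialize NVML
# ('nobody'/'nogroup' are just example identities)
sudo -u nobody -g nogroup nvidia-smi
# if that fails with "Insufficient Permissions", relax the device mode as a test
sudo chmod 0666 /dev/nvidia0 /dev/nvidiactl /dev/nvidia-modeset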
minikube ssh
docker@minikube:~$ ls -l /dev/nvidia*
crw-rw---- 1 root 1002 195, 254 Mar 11 12:40 /dev/nvidia-modeset
crw-rw-rw- 1 root root 508, 0 Mar 11 10:24 /dev/nvidia-uvm
crw-rw-rw- 1 root root 508, 1 Mar 11 10:24 /dev/nvidia-uvm-tools
crw-rw---- 1 root 1002 195, 0 Mar 11 10:24 /dev/nvidia0
crw-rw---- 1 root 1002 195, 255 Mar 11 10:24 /dev/nvidiactl
/dev/nvidia-caps:
total 0
cr-------- 1 root root 511, 1 Mar 11 12:40 nvidia-cap1
cr--r--r-- 1 root root 511, 2 Mar 11 12:40 nvidia-cap2
docker@minikube:~$
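Inside the minikube node, group 1002 does not resolve to any named group, so only root (or a process with supplementary group 1002) can open /dev/nvidia0 and /dev/nvidiactl there. A hedged diagnostic is to relax the mode inside the node and let the validator retry; this does not survive a node restart and only checks whether the device mode is the blocker:
minikube ssh -- sudo chmod 0666 /dev/nvidia0 /dev/nvidiactl /dev/nvidia-modeset
# force the validator pod to be recreated by its DaemonSet
kubectl delete pod -n gpu-operator nvidia-operator-validator-ptj47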