Permissions issues: initialization error: nvml error: insufficient permissions #679

@leobenkel

Description


Main issue:
Not able to use the GPU inside minikube due to permission issues.

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
>> uname -a
Linux xxx 6.5.0-25-generic #25~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Feb 20 16:09:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
  • Kernel Version: 6.5.0-25-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Docker
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): minikube v1.32.0 (a docker-desktop context also exists, see below)
>> kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
    server: https://kubernetes.docker.internal:6443
  name: docker-desktop
- cluster:
    certificate-authority: /home/leo/.minikube/ca.crt
    extensions:
    - extension:
        last-update: Mon, 11 Mar 2024 13:40:51 CET
        provider: minikube.sigs.k8s.io
        version: v1.32.0
      name: cluster_info
    server: https://192.168.49.2:8443
  name: minikube
contexts:
- context:
    cluster: docker-desktop
    user: docker-desktop
  name: docker-desktop
- context:
    cluster: minikube
    extensions:
    - extension:
        last-update: Mon, 11 Mar 2024 13:40:51 CET
        provider: minikube.sigs.k8s.io
        version: v1.32.0
      name: context_info
    namespace: default
    user: minikube
  name: minikube
current-context: minikube
kind: Config
preferences: {}
users:
- name: docker-desktop
  user:
    client-certificate-data: DATA+OMITTED
    client-key-data: DATA+OMITTED
- name: minikube
  user:
    client-certificate: /home/leo/.minikube/profiles/minikube/client.crt
    client-key: /home/leo/.minikube/profiles/minikube/client.key
  • GPU Operator Version: v23.9.2 (helm chart gpu-operator-v23.9.2); host driver / nvidia-smi output:
nvidia-smi
Mon Mar 11 13:55:34 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070 ...    Off | 00000000:01:00.0 Off |                  N/A |
| N/A   51C    P8              13W /  80W |   1353MiB /  8192MiB |     13%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3888      G   /usr/lib/xorg/Xorg                          416MiB |
|    0   N/A  N/A      4220      G   /usr/bin/gnome-shell                        113MiB |
|    0   N/A  N/A      7233      G   ...irefox/3941/usr/lib/firefox/firefox      476MiB |
|    0   N/A  N/A      8787      G   ...irefox/3941/usr/lib/firefox/firefox      151MiB |
|    0   N/A  N/A      9794      G   ...irefox/3941/usr/lib/firefox/firefox       41MiB |
|    0   N/A  N/A     31467      G   ...sion,SpareRendererForSitePerProcess       71MiB |
|    0   N/A  N/A    116653      G   ...,WinRetrieveSuggestionsOnlyOnDemand       31MiB |
|    0   N/A  N/A    132611    C+G   warp-terminal                                20MiB |
+---------------------------------------------------------------------------------------+

2. Issue or feature description

Using minikube, k8s, Helm, and the gpu-operator, I am getting:

Error: failed to start container "toolkit-validation": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: nvml error: insufficient permissions: unknown

for the nvidia-operator-validator pod.
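
As a sanity check (a suggested diagnostic, not something included in this report), the toolkit can be exercised outside the operator by running a plain GPU container against the Docker daemon inside the minikube node:

minikube ssh
# Standard nvidia-container-toolkit smoke test: if NVML can be initialized,
# this prints the same table as nvidia-smi on the host; otherwise it should
# fail with the same "insufficient permissions" error as toolkit-validation.
docker run --rm --gpus all ubuntu nvidia-smi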

3. Steps to reproduce the issue

I think I have something broken in my permission/user setup and I am running out of ideas on how to resolve it.
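
For context, a minimal sketch of the setup path implied above (the exact flags and chart options are assumptions, not copied from this report):

# minikube with the docker driver and GPU passthrough (supported since v1.32)
minikube start --driver docker --container-runtime docker --gpus all

# gpu-operator from the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace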

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
kubectl get pods -n gpu-operator
NAME                                                         READY   STATUS                  RESTARTS        AGE
gpu-feature-discovery-qmktd                                  0/1     Init:0/1                0               9m28s
gpu-operator-574c687b59-pcjwr                                1/1     Running                 0               10m
gpu-operator-node-feature-discovery-gc-7cc7ccfff8-9vvk8      1/1     Running                 0               10m
gpu-operator-node-feature-discovery-master-d8597d549-qqkpv   1/1     Running                 0               10m
gpu-operator-node-feature-discovery-worker-xcwnx             1/1     Running                 0               10m
nvidia-container-toolkit-daemonset-r8ktc                     1/1     Running                 0               9m28s
nvidia-dcgm-exporter-mhxx4                                   0/1     Init:0/1                0               9m28s
nvidia-device-plugin-daemonset-v79cd                         0/1     Init:0/1                0               9m28s
nvidia-operator-validator-ptj47                              0/1     Init:CrashLoopBackOff   6 (3m28s ago)   9m28s
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
kubectl get ds -n gpu-operator
NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                        1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   9m57s
gpu-operator-node-feature-discovery-worker   1         1         1       1            1           <none>                                             10m
nvidia-container-toolkit-daemonset           1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true       9m57s
nvidia-dcgm-exporter                         1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true           9m57s
nvidia-device-plugin-daemonset               1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true           9m57s
nvidia-driver-daemonset                      0         0         0       0            0           nvidia.com/gpu.deploy.driver=true                  9m57s
nvidia-mig-manager                           0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             9m57s
nvidia-operator-validator                    1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true      9m57s
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl describe pod -n gpu-operator nvidia-operator-validator-ptj47
Name:                 nvidia-operator-validator-ptj47
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-operator-validator
Node:                 minikube/192.168.49.2
Start Time:           Mon, 11 Mar 2024 13:46:54 +0100
Labels:               app=nvidia-operator-validator
                      app.kubernetes.io/managed-by=gpu-operator
                      app.kubernetes.io/part-of=gpu-operator
                      controller-revision-hash=74c7484fb6
                      helm.sh/chart=gpu-operator-v23.9.2
                      pod-template-generation=1
Annotations:          <none>
Status:               Pending
IP:                   10.244.0.16
IPs:
  IP:           10.244.0.16
Controlled By:  DaemonSet/nvidia-operator-validator
Init Containers:
  driver-validation:
    Container ID:  docker://871f1cc1d632838d5e168db3bfe66f10cba3c84c070366cd36c654955e891f6f
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
    Image ID:      docker-pullable://nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:9aefef081c3ab1123556374d2b15d0429f3990af2fbaccc3c9827801e1042703
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 11 Mar 2024 13:47:03 +0100
      Finished:     Mon, 11 Mar 2024 13:47:03 +0100
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:  true
      COMPONENT:  driver
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
  toolkit-validation:
    Container ID:  docker://71762f7b569cd2ceba213aa845fe6c2598cec3889dfdf0902f9ef68f273cf622
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
    Image ID:      docker-pullable://nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:9aefef081c3ab1123556374d2b15d0429f3990af2fbaccc3c9827801e1042703
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
      Exit Code:    128
      Started:      Mon, 11 Mar 2024 13:52:54 +0100
      Finished:     Mon, 11 Mar 2024 13:52:54 +0100
    Ready:          False
    Restart Count:  6
    Environment:
      NVIDIA_VISIBLE_DEVICES:  all
      WITH_WAIT:               false
      COMPONENT:               toolkit
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
  cuda-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      WITH_WAIT:                    false
      COMPONENT:                    cuda
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           gpu-operator (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
  plugin-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      COMPONENT:                    plugin
      WITH_WAIT:                    false
      WITH_WORKLOAD:                false
      MIG_STRATEGY:                 single
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           gpu-operator (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
Containers:
  nvidia-operator-validator:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      echo all validations are successful; sleep infinity
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjskm (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:
  kube-api-access-cjskm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.operator-validator=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  10m                default-scheduler  Successfully assigned gpu-operator/nvidia-operator-validator-ptj47 to minikube
  Normal   Pulling    10m                kubelet            Pulling image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2"
  Normal   Pulled     10m                kubelet            Successfully pulled image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2" in 2.116s (7.99s including waiting)
  Normal   Created    10m                kubelet            Created container driver-validation
  Normal   Started    10m                kubelet            Started container driver-validation
  Warning  Failed     10m (x3 over 10m)  kubelet            Error: failed to start container "toolkit-validation": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: insufficient permissions: unknown
  Normal   Pulled   8m59s (x5 over 10m)    kubelet  Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2" already present on machine
  Normal   Created  8m58s (x5 over 10m)    kubelet  Created container toolkit-validation
  Warning  Failed   8m58s (x2 over 9m49s)  kubelet  Error: failed to start container "toolkit-validation": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
  Warning  BackOff  32s (x46 over 10m)     kubelet  Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-ptj47_gpu-operator(7c2a5005-4339-4674-82c7-244051860212)
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
kubectl logs -n gpu-operator nvidia-operator-validator-ptj47
Defaulted container "nvidia-operator-validator" out of: nvidia-operator-validator, driver-validation (init), toolkit-validation (init), cuda-validation (init), plugin-validation (init)
Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-ptj47" is waiting to start: PodInitializing
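
The main container is the default here and never started; to get output from the failing init container specifically, the -c flag (or --all-containers=true, as the template suggests) would be needed, e.g.:

kubectl logs -n gpu-operator nvidia-operator-validator-ptj47 -c toolkit-validation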
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi

Not able to run this: there is no driver pod to exec into (the nvidia-driver-daemonset above has 0 desired pods, presumably because the driver is preinstalled on the host).

  • containerd logs journalctl -u containerd > containerd.log

It is huge and doesn't seem to contain anything relevant. I can post it later if needed.
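
Since the node runtime here is Docker (the container IDs above are docker://), the more relevant logs are probably the docker unit inside the minikube node rather than containerd on the host; a suggested filter (not output I have collected):

minikube ssh -- sudo journalctl -u docker --no-pager | grep -i nvidia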

Extra: device node permissions and group membership, on the host and inside the minikube node:

ls -l /dev/nvidia*
crw-rw---- 1 root vglusers 195,   0 Mar 11 11:24 /dev/nvidia0
crw-rw---- 1 root vglusers 195, 255 Mar 11 11:24 /dev/nvidiactl
crw-rw---- 1 root vglusers 195, 254 Mar 11 11:24 /dev/nvidia-modeset
crw-rw-rw- 1 root root     508,   0 Mar 11 11:24 /dev/nvidia-uvm
crw-rw-rw- 1 root root     508,   1 Mar 11 11:24 /dev/nvidia-uvm-tools

/dev/nvidia-caps:
total 0
cr-------- 1 root root 511, 1 Mar 11 11:30 nvidia-cap1
cr--r--r-- 1 root root 511, 2 Mar 11 11:30 nvidia-cap2
getent group vglusers
vglusers:x:1002:leo,root
minikube ssh
docker@minikube:~$ ls -l /dev/nvidia*
crw-rw---- 1 root 1002 195, 254 Mar 11 12:40 /dev/nvidia-modeset
crw-rw-rw- 1 root root 508,   0 Mar 11 10:24 /dev/nvidia-uvm
crw-rw-rw- 1 root root 508,   1 Mar 11 10:24 /dev/nvidia-uvm-tools
crw-rw---- 1 root 1002 195,   0 Mar 11 10:24 /dev/nvidia0
crw-rw---- 1 root 1002 195, 255 Mar 11 10:24 /dev/nvidiactl

/dev/nvidia-caps:
total 0
cr-------- 1 root root 511, 1 Mar 11 12:40 nvidia-cap1
cr--r--r-- 1 root root 511, 2 Mar 11 12:40 nvidia-cap2
docker@minikube:~$
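
One thing that stands out: /dev/nvidia0 and /dev/nvidiactl are 0660 root:vglusers on the host (group 1002 inside the minikube node), whereas a default driver install leaves them 0666. NVML has to open these device nodes, which would be consistent with the "insufficient permissions" error. A way to check where the restricted mode comes from (sketch; attributing it to VirtualGL/modprobe options is my assumption):

# The 0660 mode and vglusers group are usually set through NVIDIA module
# parameters, e.g. by VirtualGL's vglserver_config:
grep -ri nvreg_devicefile /etc/modprobe.d/ /lib/modprobe.d/
# assumed example of what such a hit looks like:
#   options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=1002 NVreg_DeviceFileMode=0660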
