
no runtime for "nvidia" is configured #730

@yanis-incepto

Description


1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04
  • Kernel Version: Kubernetes 1.24.14
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): Kops 1.24.1
  • GPU Operator Version: 23.9.2

2. Issue or feature description

kubectl describe pod nvidia-device-plugin-daemonset-w72xb -n gpu-operator
.... 
Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               2m11s                 default-scheduler  Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-w72xb to i-071a4e5a302e4025b
  Warning  FailedCreatePodSandBox  12s (x10 over 2m11s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

It looks like containerd doesn't know about the "nvidia" runtime, even though the RuntimeClass exists in the cluster:

kubectl get runtimeclasses.node.k8s.io                                               
NAME     HANDLER   AGE
nvidia   nvidia    7d1h 
kubectl describe runtimeclasses.node.k8s.io nvidia                                   
Name:         nvidia
Namespace:    
Labels:       app.kubernetes.io/component=gpu-operator
Annotations:  <none>
API Version:  node.k8s.io/v1
Handler:      nvidia
Kind:         RuntimeClass
Metadata:
  Creation Timestamp:  2024-05-27T08:53:18Z
  Owner References:
    API Version:           nvidia.com/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  ClusterPolicy
    Name:                  cluster-policy
    UID:                   2c237c3d-07eb-4856-8316-046489793e3d
  Resource Version:        265073642
  UID:                     26fd5054-7344-4e6d-9029-a610ae0df560
Events:                    <none>
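
The RuntimeClass object only tells kubelet to ask the container runtime for a handler named "nvidia"; containerd must separately have that handler registered in its own config, which the nvidia-container-toolkit daemonset writes. A minimal check on the GPU node, assuming the default containerd config path (a Kops-built node may use a different one):

# On the GPU node: does containerd's config actually declare an "nvidia" runtime?
grep -A 5 'runtimes.nvidia' /etc/containerd/config.toml

# Roughly what the toolkit writes once it has configured the node:
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#     runtime_type = "io.containerd.runc.v2"
#     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#       BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

# containerd only picks the change up after the toolkit restarts it
sudo systemctl status containerd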

3. Steps to reproduce the issue

I installed the chart with helmfile.
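
For reference, a roughly equivalent plain Helm install of the same chart version (the actual helmfile values are not shown in this report, so this is only an assumption about defaults):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v23.9.2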

4. Information to attach (optional if deemed irrelevant)

kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE

 kubectl get pods -n gpu-operator                                                  
NAME                                                         READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-spbbk                                  0/1     Init:0/1   0          41s
gpu-operator-d97f85598-j7qt4                                 1/1     Running    0          7d1h
gpu-operator-node-feature-discovery-gc-84c477b7-67tk8        1/1     Running    0          6d20h
gpu-operator-node-feature-discovery-master-cb8bb7d48-x4hqj   1/1     Running    0          6d20h
gpu-operator-node-feature-discovery-worker-jfdsh             1/1     Running    0          85s
nvidia-container-toolkit-daemonset-vb6qn                     0/1     Init:0/1   0          41s
nvidia-dcgm-exporter-9xmbm                                   0/1     Init:0/1   0          41s
nvidia-device-plugin-daemonset-w72xb                         0/1     Init:0/1   0          41s
nvidia-driver-daemonset-v4n96                                0/1     Running    0          73s
nvidia-operator-validator-vbq6v                              0/1     Init:0/4   0          41s

kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE

 kubectl get ds -n gpu-operator                                                       
NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                          AGE
gpu-feature-discovery                        1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true                       7d
gpu-operator-node-feature-discovery-worker   1         1         1       1            1           instance-type=gpu                                                      6d20h
nvidia-container-toolkit-daemonset           1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true                           7d
nvidia-dcgm-exporter                         1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true                               7d
nvidia-device-plugin-daemonset               1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true                               7d
nvidia-device-plugin-mps-control-daemon      0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true   7d
nvidia-driver-daemonset                      1         1         0       1            0           nvidia.com/gpu.deploy.driver=true                                      7d
nvidia-mig-manager                           0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                 7d
nvidia-operator-validator                    1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true                          7d

If a pod/ds is in an error or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME

kubectl describe pod nvidia-device-plugin-daemonset-w72xb -n gpu-operator
.... 
Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               2m11s                 default-scheduler  Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-w72xb to i-071a4e5a302e4025b
  Warning  FailedCreatePodSandBox  12s (x10 over 2m11s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi

kubectl exec nvidia-driver-daemonset-v4n96 -n gpu-operator -c nvidia-driver-ctr -- nvidia-smi
Mon Jun  3 10:01:38 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:1E.0 Off |                    0 |
| N/A   30C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
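
The driver container itself looks healthy, so the remaining question is whether the nvidia-container-toolkit daemonset (still Init:0/1 above) ever got far enough to register the runtime and restart containerd. A quick way to see where it is stuck, using the pod name from this report (the main container name below is the gpu-operator default and may differ):

# Show why the toolkit pod is still in its init phase; it configures containerd
# only after its driver-validation init container completes
kubectl describe pod nvidia-container-toolkit-daemonset-vb6qn -n gpu-operator

# Once the main container starts, its logs show the containerd config it wrote
kubectl logs nvidia-container-toolkit-daemonset-vb6qn -n gpu-operator -c nvidia-container-toolkit-ctr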
