1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04
- Kernel Version: not provided
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): Kops 1.24.1 (Kubernetes 1.24.14)
- GPU Operator Version: 23.9.2
2. Issue or feature description
kubectl describe pod nvidia-device-plugin-daemonset-w72xb -n gpu-operator
....
Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               2m11s                 default-scheduler  Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-w72xb to i-071a4e5a302e4025b
  Warning  FailedCreatePodSandBox  12s (x10 over 2m11s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
According to the error, containerd has no runtime handler named "nvidia" configured, yet the corresponding RuntimeClass does exist in the cluster (a containerd-side check is sketched after the outputs below):
kubectl get runtimeclasses.node.k8s.io
NAME HANDLER AGE
nvidia nvidia 7d1h
kubectl describe runtimeclasses.node.k8s.io nvidia
Name:         nvidia
Namespace:
Labels:       app.kubernetes.io/component=gpu-operator
Annotations:  <none>
API Version:  node.k8s.io/v1
Handler:      nvidia
Kind:         RuntimeClass
Metadata:
  Creation Timestamp:  2024-05-27T08:53:18Z
  Owner References:
    API Version:           nvidia.com/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  ClusterPolicy
    Name:                  cluster-policy
    UID:                   2c237c3d-07eb-4856-8316-046489793e3d
  Resource Version:        265073642
  UID:                     26fd5054-7344-4e6d-9029-a610ae0df560
Events:                    <none>
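A RuntimeClass only points at a handler name; containerd on the GPU node must also have that handler registered, which the nvidia-container-toolkit daemonset normally does once it finishes initializing. A rough way to verify this on the node, assuming containerd's default config path (the path and BinaryName below are typical defaults, not values taken from this cluster):

grep -A4 'runtimes.nvidia' /etc/containerd/config.toml
# Roughly expected once the toolkit has patched the config:
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#   runtime_type = "io.containerd.runc.v2"
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#     BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

If that section is missing, containerd cannot create sandboxes for the nvidia RuntimeClass even though the RuntimeClass object exists in the API.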
3. Steps to reproduce the issue
I installed the chart with helmfile
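For context, a plain-Helm equivalent of that install would look roughly like the following (release name, namespace, and chart version are inferred from the pod names and operator version above, not copied from the actual helmfile):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v23.9.2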
4. Information to attach (optional if deemed irrelevant)
kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
kubectl get pods -n gpu-operator
NAME                                                          READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-spbbk                                   0/1     Init:0/1   0          41s
gpu-operator-d97f85598-j7qt4                                  1/1     Running    0          7d1h
gpu-operator-node-feature-discovery-gc-84c477b7-67tk8         1/1     Running    0          6d20h
gpu-operator-node-feature-discovery-master-cb8bb7d48-x4hqj    1/1     Running    0          6d20h
gpu-operator-node-feature-discovery-worker-jfdsh              1/1     Running    0          85s
nvidia-container-toolkit-daemonset-vb6qn                      0/1     Init:0/1   0          41s
nvidia-dcgm-exporter-9xmbm                                    0/1     Init:0/1   0          41s
nvidia-device-plugin-daemonset-w72xb                          0/1     Init:0/1   0          41s
nvidia-driver-daemonset-v4n96                                 0/1     Running    0          73s
nvidia-operator-validator-vbq6v                               0/1     Init:0/4   0          41s
kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
kubectl get ds -n gpu-operator
NAME                                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                           AGE
gpu-feature-discovery                         1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true                        7d
gpu-operator-node-feature-discovery-worker    1         1         1       1            1           instance-type=gpu                                                       6d20h
nvidia-container-toolkit-daemonset            1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true                            7d
nvidia-dcgm-exporter                          1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true                                7d
nvidia-device-plugin-daemonset                1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true                                7d
nvidia-device-plugin-mps-control-daemon       0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true    7d
nvidia-driver-daemonset                       1         1         0       1            0           nvidia.com/gpu.deploy.driver=true                                       7d
nvidia-mig-manager                            0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                  7d
nvidia-operator-validator                     1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true                           7d
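The nvidia-container-toolkit daemonset is the component that registers the "nvidia" runtime with containerd, and it is sitting at 0 READY above, so its pod is worth inspecting as well (pod name taken from the pod list above; if it is still in its init phase, pass the init container name shown by kubectl describe via -c):

kubectl describe pod nvidia-container-toolkit-daemonset-vb6qn -n gpu-operator
kubectl logs nvidia-container-toolkit-daemonset-vb6qn -n gpu-operator --all-containers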
If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl describe pod nvidia-device-plugin-daemonset-w72xb -n gpu-operator
....
Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               2m11s                 default-scheduler  Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-w72xb to i-071a4e5a302e4025b
  Warning  FailedCreatePodSandBox  12s (x10 over 2m11s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
kubectl exec nvidia-driver-daemonset-v4n96 -n gpu-operator -c nvidia-driver-ctr -- nvidia-smi
Mon Jun 3 10:01:38 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 30C P8 10W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+