
nvidia-operator-validator always in Init:CrashLoopBackOff, but the rest of the components are installed and working correctly #1190

@spiner-z

Description

HOST INFORMATION

  1. OS and Architecture: Ubuntu 22.04, x86_64
  2. Kubernetes Distribution: Vanilla Kubernetes
  3. Kubernetes Version: v1.31.2
  4. Host Node GPUs: NVIDIA V100, A100
  5. GPU Operator Installation Method: Helm

Steps to reproduce the issue

$ kubectl get pods -n nvidia-gpu-operator

gpu-feature-discovery-8g4pc                           2/2     Running                     
gpu-feature-discovery-j8797                           2/2     Running                     
gpu-feature-discovery-st644                           2/2     Running                     
nvdp-node-feature-discovery-worker-96gzj              1/1     Running                     
nvdp-node-feature-discovery-worker-xxl65              1/1     Running                     
nvdp-node-feature-discovery-worker-zt882              1/1     Running                     
nvidia-container-toolkit-daemonset-5vlk2              1/1     Running                     
nvidia-container-toolkit-daemonset-6chcr              1/1     Running                     
nvidia-container-toolkit-daemonset-rgdxz              1/1     Running                     
nvidia-cuda-validator-6hbzq                           0/1     Completed                   
nvidia-cuda-validator-b6thh                           0/1     Completed                   
nvidia-cuda-validator-wls5c                           0/1     Completed                   
nvidia-dcgm-exporter-589kn                            1/1     Running                     
nvidia-dcgm-exporter-hr66q                            1/1     Running                     
nvidia-dcgm-exporter-phrrd                            1/1     Running                     
nvidia-device-plugin-daemonset-88mbq                  2/2     Running                     
nvidia-device-plugin-daemonset-fm5dn                  2/2     Running                     
nvidia-device-plugin-daemonset-nz2st                  2/2     Running                     
nvidia-operator-validator-s8tfk                       0/1     Init:CrashLoopBackOff       
nvidia-operator-validator-vp6nk                       0/1     Init:CrashLoopBackOff       
nvidia-operator-validator-xdvt4                       0/1     Init:CrashLoopBackOff       

All components except nvidia-operator-validator are running and functioning properly.
DCGM works as expected, and I can assign GPUs to pods without any issues (for example, with a test pod like the sketch below).
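
To illustrate, this is a minimal sketch of the kind of test pod I mean; the pod name and image tag are placeholders, not my exact manifest:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder; any CUDA base image works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1                        # request one GPU from the device plugin
EOF
$ kubectl logs gpu-smoke-test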

However, nvidia-operator-validator is stuck in Init:CrashLoopBackOff.

$ kubectl logs nvidia-operator-validator-27lns -n nvidia-gpu-operator -c plugin-validation

time="2024-12-31T06:33:02Z" level=info msg="version: 65c864c1, commit: 65c864c"
time="2024-12-31T06:33:02Z" level=info msg="GPU resources are not yet discovered by the node, retry: 1"
time="2024-12-31T06:33:07Z" level=info msg="GPU resources are not yet discovered by the node, retry: 2"
time="2024-12-31T06:33:12Z" level=info msg="GPU resources are not yet discovered by the node, retry: 3"
time="2024-12-31T06:33:17Z" level=info msg="GPU resources are not yet discovered by the node, retry: 4"
time="2024-12-31T06:33:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 5"
time="2024-12-31T06:33:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 6"
time="2024-12-31T06:33:32Z" level=info msg="GPU resources are not yet discovered by the node, retry: 7"
time="2024-12-31T06:33:37Z" level=info msg="GPU resources are not yet discovered by the node, retry: 8"
time="2024-12-31T06:33:42Z" level=info msg="GPU resources are not yet discovered by the node, retry: 9"
time="2024-12-31T06:33:47Z" level=info msg="GPU resources are not yet discovered by the node, retry: 10"
time="2024-12-31T06:33:52Z" level=info msg="GPU resources are not yet discovered by the node, retry: 11"
time="2024-12-31T06:33:57Z" level=info msg="GPU resources are not yet discovered by the node, retry: 12"
time="2024-12-31T06:34:02Z" level=info msg="GPU resources are not yet discovered by the node, retry: 13"
time="2024-12-31T06:34:07Z" level=info msg="GPU resources are not yet discovered by the node, retry: 14"
time="2024-12-31T06:34:12Z" level=info msg="GPU resources are not yet discovered by the node, retry: 15"
time="2024-12-31T06:34:17Z" level=info msg="GPU resources are not yet discovered by the node, retry: 16"
time="2024-12-31T06:34:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 17"
time="2024-12-31T06:34:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 18"
time="2024-12-31T06:34:32Z" level=info msg="GPU resources are not yet discovered by the node, retry: 19"
time="2024-12-31T06:34:37Z" level=info msg="GPU resources are not yet discovered by the node, retry: 20"
time="2024-12-31T06:34:42Z" level=info msg="GPU resources are not yet discovered by the node, retry: 21"
time="2024-12-31T06:34:47Z" level=info msg="GPU resources are not yet discovered by the node, retry: 22"
time="2024-12-31T06:34:52Z" level=info msg="GPU resources are not yet discovered by the node, retry: 23"
time="2024-12-31T06:34:57Z" level=info msg="GPU resources are not yet discovered by the node, retry: 24"
time="2024-12-31T06:35:02Z" level=info msg="GPU resources are not yet discovered by the node, retry: 25"
time="2024-12-31T06:35:07Z" level=info msg="GPU resources are not yet discovered by the node, retry: 26"
time="2024-12-31T06:35:12Z" level=info msg="GPU resources are not yet discovered by the node, retry: 27"
time="2024-12-31T06:35:17Z" level=info msg="GPU resources are not yet discovered by the node, retry: 28"
time="2024-12-31T06:35:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 29"
time="2024-12-31T06:35:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 30"
time="2024-12-31T06:35:32Z" level=info msg="Error: error validating plugin installation: GPU resources are not discovered by the node"
