
nvidia-operator-validator always in Init:CrashLoopBackOff, but the rest of the components are installed and working correctly #1190

@spiner-z

Description

HOST INFORMATION

  1. OS and Architecture: Ubuntu 22.04, x86_64
  2. Kubernetes Distribution: Vanilla Kubernetes
  3. Kubernetes Version: v1.31.2
  4. Host Node GPUs: NVIDIA V100, A100
  5. GPU Operator Installation Method: Helm

Steps to reproduce the issue

$ kubectl get pods -n nvidia-gpu-operator

gpu-feature-discovery-8g4pc                           2/2     Running                     
gpu-feature-discovery-j8797                           2/2     Running                     
gpu-feature-discovery-st644                           2/2     Running                     
nvdp-node-feature-discovery-worker-96gzj              1/1     Running                     
nvdp-node-feature-discovery-worker-xxl65              1/1     Running                     
nvdp-node-feature-discovery-worker-zt882              1/1     Running                     
nvidia-container-toolkit-daemonset-5vlk2              1/1     Running                     
nvidia-container-toolkit-daemonset-6chcr              1/1     Running                     
nvidia-container-toolkit-daemonset-rgdxz              1/1     Running                     
nvidia-cuda-validator-6hbzq                           0/1     Completed                   
nvidia-cuda-validator-b6thh                           0/1     Completed                   
nvidia-cuda-validator-wls5c                           0/1     Completed                   
nvidia-dcgm-exporter-589kn                            1/1     Running                     
nvidia-dcgm-exporter-hr66q                            1/1     Running                     
nvidia-dcgm-exporter-phrrd                            1/1     Running                     
nvidia-device-plugin-daemonset-88mbq                  2/2     Running                     
nvidia-device-plugin-daemonset-fm5dn                  2/2     Running                     
nvidia-device-plugin-daemonset-nz2st                  2/2     Running                     
nvidia-operator-validator-s8tfk                       0/1     Init:CrashLoopBackOff       
nvidia-operator-validator-vp6nk                       0/1     Init:CrashLoopBackOff       
nvidia-operator-validator-xdvt4                       0/1     Init:CrashLoopBackOff       

All components except nvidia-operator-validator are running and functioning properly.
DCGM works as expected, and I can assign GPUs to pods without any issues (for example, with a test pod like the sketch below).
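
To illustrate, this is a minimal sketch of the kind of test pod I mean; the pod name and image tag are placeholders, not my exact manifest:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder; any CUDA base image works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1                        # request one GPU from the device plugin
EOF
$ kubectl logs gpu-smoke-test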

However, nvidia-operator-validator is stuck in Init:CrashLoopBackOff.

$ kubectl logs nvidia-operator-validator-27lns -n nvidia-gpu-operator -c plugin-validation

time="2024-12-31T06:33:02Z" level=info msg="version: 65c864c1, commit: 65c864c"
time="2024-12-31T06:33:02Z" level=info msg="GPU resources are not yet discovered by the node, retry: 1"
time="2024-12-31T06:33:07Z" level=info msg="GPU resources are not yet discovered by the node, retry: 2"
time="2024-12-31T06:33:12Z" level=info msg="GPU resources are not yet discovered by the node, retry: 3"
time="2024-12-31T06:33:17Z" level=info msg="GPU resources are not yet discovered by the node, retry: 4"
time="2024-12-31T06:33:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 5"
time="2024-12-31T06:33:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 6"
time="2024-12-31T06:33:32Z" level=info msg="GPU resources are not yet discovered by the node, retry: 7"
time="2024-12-31T06:33:37Z" level=info msg="GPU resources are not yet discovered by the node, retry: 8"
time="2024-12-31T06:33:42Z" level=info msg="GPU resources are not yet discovered by the node, retry: 9"
time="2024-12-31T06:33:47Z" level=info msg="GPU resources are not yet discovered by the node, retry: 10"
time="2024-12-31T06:33:52Z" level=info msg="GPU resources are not yet discovered by the node, retry: 11"
time="2024-12-31T06:33:57Z" level=info msg="GPU resources are not yet discovered by the node, retry: 12"
time="2024-12-31T06:34:02Z" level=info msg="GPU resources are not yet discovered by the node, retry: 13"
time="2024-12-31T06:34:07Z" level=info msg="GPU resources are not yet discovered by the node, retry: 14"
time="2024-12-31T06:34:12Z" level=info msg="GPU resources are not yet discovered by the node, retry: 15"
time="2024-12-31T06:34:17Z" level=info msg="GPU resources are not yet discovered by the node, retry: 16"
time="2024-12-31T06:34:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 17"
time="2024-12-31T06:34:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 18"
time="2024-12-31T06:34:32Z" level=info msg="GPU resources are not yet discovered by the node, retry: 19"
time="2024-12-31T06:34:37Z" level=info msg="GPU resources are not yet discovered by the node, retry: 20"
time="2024-12-31T06:34:42Z" level=info msg="GPU resources are not yet discovered by the node, retry: 21"
time="2024-12-31T06:34:47Z" level=info msg="GPU resources are not yet discovered by the node, retry: 22"
time="2024-12-31T06:34:52Z" level=info msg="GPU resources are not yet discovered by the node, retry: 23"
time="2024-12-31T06:34:57Z" level=info msg="GPU resources are not yet discovered by the node, retry: 24"
time="2024-12-31T06:35:02Z" level=info msg="GPU resources are not yet discovered by the node, retry: 25"
time="2024-12-31T06:35:07Z" level=info msg="GPU resources are not yet discovered by the node, retry: 26"
time="2024-12-31T06:35:12Z" level=info msg="GPU resources are not yet discovered by the node, retry: 27"
time="2024-12-31T06:35:17Z" level=info msg="GPU resources are not yet discovered by the node, retry: 28"
time="2024-12-31T06:35:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 29"
time="2024-12-31T06:35:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 30"
time="2024-12-31T06:35:32Z" level=info msg="Error: error validating plugin installation: GPU resources are not discovered by the node"
