Description
HOST INFORMATION
- OS and Architecture: Ubuntu 22.04, x86_64
- Kubernetes Distribution: Vanilla Kubernetes
- Kubernetes Version: v1.31.2
- Host Node GPUs: NVIDIA V100, A100
- GPU Operator Installation Method: Helm
Steps to reproduce the issue
$ kubectl get pods -n nvidia-gpu-operator
gpu-feature-discovery-8g4pc 2/2 Running
gpu-feature-discovery-j8797 2/2 Running
gpu-feature-discovery-st644 2/2 Running
nvdp-node-feature-discovery-worker-96gzj 1/1 Running
nvdp-node-feature-discovery-worker-xxl65 1/1 Running
nvdp-node-feature-discovery-worker-zt882 1/1 Running
nvidia-container-toolkit-daemonset-5vlk2 1/1 Running
nvidia-container-toolkit-daemonset-6chcr 1/1 Running
nvidia-container-toolkit-daemonset-rgdxz 1/1 Running
nvidia-cuda-validator-6hbzq 0/1 Completed
nvidia-cuda-validator-b6thh 0/1 Completed
nvidia-cuda-validator-wls5c 0/1 Completed
nvidia-dcgm-exporter-589kn 1/1 Running
nvidia-dcgm-exporter-hr66q 1/1 Running
nvidia-dcgm-exporter-phrrd 1/1 Running
nvidia-device-plugin-daemonset-88mbq 2/2 Running
nvidia-device-plugin-daemonset-fm5dn 2/2 Running
nvidia-device-plugin-daemonset-nz2st 2/2 Running
nvidia-operator-validator-s8tfk 0/1 Init:CrashLoopBackOff
nvidia-operator-validator-vp6nk 0/1 Init:CrashLoopBackOff
nvidia-operator-validator-xdvt4 0/1 Init:CrashLoopBackOff

All components except nvidia-operator-validator are healthy and functioning properly.
I can use DCGM normally and assign GPUs to pods without any issues.
However, nvidia-operator-validator is stuck in Init:CrashLoopBackOff.
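As a quick sanity check that the device plugin is in fact advertising GPUs to the scheduler (consistent with GPU pods scheduling fine), the nodes' allocatable resources can be inspected; <node-name> is a placeholder:
$ kubectl describe node <node-name> | grep -i 'nvidia.com/gpu'
$ kubectl get node <node-name> -o json | jq '.status.allocatable["nvidia.com/gpu"]'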
$ kubectl logs nvidia-operator-validator-27lns -n nvidia-gpu-operator -c plugin-validation
time="2024-12-31T06:33:02Z" level=info msg="version: 65c864c1, commit: 65c864c"
time="2024-12-31T06:33:02Z" level=info msg="GPU resources are not yet discovered by the node, retry: 1"
time="2024-12-31T06:33:07Z" level=info msg="GPU resources are not yet discovered by the node, retry: 2"
time="2024-12-31T06:33:12Z" level=info msg="GPU resources are not yet discovered by the node, retry: 3"
time="2024-12-31T06:33:17Z" level=info msg="GPU resources are not yet discovered by the node, retry: 4"
time="2024-12-31T06:33:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 5"
time="2024-12-31T06:33:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 6"
time="2024-12-31T06:33:32Z" level=info msg="GPU resources are not yet discovered by the node, retry: 7"
time="2024-12-31T06:33:37Z" level=info msg="GPU resources are not yet discovered by the node, retry: 8"
time="2024-12-31T06:33:42Z" level=info msg="GPU resources are not yet discovered by the node, retry: 9"
time="2024-12-31T06:33:47Z" level=info msg="GPU resources are not yet discovered by the node, retry: 10"
time="2024-12-31T06:33:52Z" level=info msg="GPU resources are not yet discovered by the node, retry: 11"
time="2024-12-31T06:33:57Z" level=info msg="GPU resources are not yet discovered by the node, retry: 12"
time="2024-12-31T06:34:02Z" level=info msg="GPU resources are not yet discovered by the node, retry: 13"
time="2024-12-31T06:34:07Z" level=info msg="GPU resources are not yet discovered by the node, retry: 14"
time="2024-12-31T06:34:12Z" level=info msg="GPU resources are not yet discovered by the node, retry: 15"
time="2024-12-31T06:34:17Z" level=info msg="GPU resources are not yet discovered by the node, retry: 16"
time="2024-12-31T06:34:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 17"
time="2024-12-31T06:34:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 18"
time="2024-12-31T06:34:32Z" level=info msg="GPU resources are not yet discovered by the node, retry: 19"
time="2024-12-31T06:34:37Z" level=info msg="GPU resources are not yet discovered by the node, retry: 20"
time="2024-12-31T06:34:42Z" level=info msg="GPU resources are not yet discovered by the node, retry: 21"
time="2024-12-31T06:34:47Z" level=info msg="GPU resources are not yet discovered by the node, retry: 22"
time="2024-12-31T06:34:52Z" level=info msg="GPU resources are not yet discovered by the node, retry: 23"
time="2024-12-31T06:34:57Z" level=info msg="GPU resources are not yet discovered by the node, retry: 24"
time="2024-12-31T06:35:02Z" level=info msg="GPU resources are not yet discovered by the node, retry: 25"
time="2024-12-31T06:35:07Z" level=info msg="GPU resources are not yet discovered by the node, retry: 26"
time="2024-12-31T06:35:12Z" level=info msg="GPU resources are not yet discovered by the node, retry: 27"
time="2024-12-31T06:35:17Z" level=info msg="GPU resources are not yet discovered by the node, retry: 28"
time="2024-12-31T06:35:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 29"
time="2024-12-31T06:35:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 30"
time="2024-12-31T06:35:32Z" level=info msg="Error: error validating plugin installation: GPU resources are not discovered by the node"Metadata