Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
Describe the bug
I installed following the documentation, but the kata-manager and related containers never started. Checking the gpu-operator logs, I see the following errors:
{"level":"error","ts":1762424337.8351557,"msg":"Reconciler error","controller":"clusterpolicy-controller","object":{"name":"cluster-policy"},"namespace":"","name":"cluster-policy","reconcileID":"2281e293-6efc-49c8-ac29-2711707ecb58","error":"Operation cannot be fulfilled on clusterpolicies.nvidia.com "cluster-policy": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1762424337.935488,"logger":"controllers.ClusterPolicy","msg":"WARNING: failed to get GPU workload config for node; using default","NodeName":"k8s-10-1-3-198","SandboxEnabled":true,"Error":"invalid GPU workload config: kata","defaultGPUWorkloadConfig":"container"}
The pods running at that point were:
root@k8s-10-1-3-198:~# kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-l9cdd 1/1 Running 4 (21h ago) 21h
gpu-operator-1762423735-node-feature-discovery-gc-67489989g9vfl 1/1 Running 0 21h
gpu-operator-1762423735-node-feature-discovery-master-5cbfhmfx4 1/1 Running 0 21h
gpu-operator-1762423735-node-feature-discovery-worker-fhtdk 1/1 Running 0 21h
gpu-operator-58c88f459d-9dk8z 1/1 Running 0 21h
nvidia-container-toolkit-daemonset-jnmqr 1/1 Running 0 21h
nvidia-cuda-validator-4kn5w 0/1 Completed 0 21h
nvidia-dcgm-exporter-h8hlg 1/1 Running 2 (21h ago) 21h
nvidia-device-plugin-daemonset-zb68n 1/1 Running 3 (21h ago) 21h
nvidia-operator-validator-nmn96 1/1 Running 0 21h
root@k8s-10-1-3-198:~# nvidia-smi
Fri Nov 7 07:59:50 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:01:00.0 Off | N/A |
| 0% 35C P8 18W / 370W | 1MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
To Reproduce
Detailed steps to reproduce the issue.
Expected behavior
A clear and concise description of what you expected to happen.
Environment (please provide the following information):
- GPU Operator Version: v25.10.-
- OS: Ubuntu 22.04.1
- Kernel Version: 6.8.0-86-generic
- Container Runtime Version: v1.7.27
- Kubernetes Distro and Version: k8s v1.33.1
Information to attach (optional if deemed irrelevant)
- kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
- kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
- If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- If a pod/ds is in an error state or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- containerd logs: journalctl -u containerd > containerd.log
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]