
kata-manager pod does not start #1871

@ys928

Description


Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

Describe the bug

I installed following the documentation, but the kata-manager and related containers never started. Looking at the gpu-operator logs, I see these errors:

{"level":"error","ts":1762424337.8351557,"msg":"Reconciler error","controller":"clusterpolicy-controller","object":{"name":"cluster-policy"},"namespace":"","name":"cluster-policy","reconcileID":"2281e293-6efc-49c8-ac29-2711707ecb58","error":"Operation cannot be fulfilled on clusterpolicies.nvidia.com "cluster-policy": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1762424337.935488,"logger":"controllers.ClusterPolicy","msg":"WARNING: failed to get GPU workload config for node; using default","NodeName":"k8s-10-1-3-198","SandboxEnabled":true,"Error":"invalid GPU workload config: kata","defaultGPUWorkloadConfig":"container"}

The pods running at this point are as follows:

root@k8s-10-1-3-198:~# kubectl get pods -n gpu-operator
NAME                                                               READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-l9cdd                                        1/1     Running     4 (21h ago)   21h
gpu-operator-1762423735-node-feature-discovery-gc-67489989g9vfl   1/1     Running     0             21h
gpu-operator-1762423735-node-feature-discovery-master-5cbfhmfx4   1/1     Running     0             21h
gpu-operator-1762423735-node-feature-discovery-worker-fhtdk        1/1     Running     0             21h
gpu-operator-58c88f459d-9dk8z                                      1/1     Running     0             21h
nvidia-container-toolkit-daemonset-jnmqr                           1/1     Running     0             21h
nvidia-cuda-validator-4kn5w                                        0/1     Completed   0             21h
nvidia-dcgm-exporter-h8hlg                                         1/1     Running     2 (21h ago)   21h
nvidia-device-plugin-daemonset-zb68n                               1/1     Running     3 (21h ago)   21h
nvidia-operator-validator-nmn96                                    1/1     Running     0             21h

root@k8s-10-1-3-198:~# nvidia-smi
Fri Nov 7 07:59:50 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:01:00.0 Off | N/A |
| 0% 35C P8 18W / 370W | 1MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

To Reproduce
Detailed steps to reproduce the issue.

Expected behavior
A clear and concise description of what you expected to happen.

Environment (please provide the following information):

  • GPU Operator Version: v25.10.-
  • OS: Ubuntu 22.04.1
  • Kernel Version: 6.8.0-86-generic
  • Container Runtime Version: v1.7.27
  • Kubernetes Distro and Version: k8s v1.33.1

Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs: journalctl -u containerd > containerd.log (a consolidated sketch of these commands follows below)
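For this report the operator namespace is gpu-operator, so the commands above could be batched roughly as follows (a sketch only; the operator pod name is taken from the listing earlier in this issue):

NS=gpu-operator
kubectl get pods -n "$NS"
kubectl get ds -n "$NS"
# no kata-manager pod exists yet, so first check whether its daemonset was created at all
kubectl get ds -n "$NS" | grep -i kata || echo "no kata-manager daemonset found"
# operator logs (pod name from the listing above)
kubectl logs -n "$NS" gpu-operator-58c88f459d-9dk8z --tail=200
journalctl -u containerd > containerd.log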

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]
