1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04
- Kernel Version: not provided
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): Kops 1.24.1 (Kubernetes 1.24.14)
- GPU Operator Version: 23.9.2
2. Issue or feature description
kubectl describe pod nvidia-device-plugin-daemonset-w72xb -n gpu-operator
....
Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               2m11s                 default-scheduler  Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-w72xb to i-071a4e5a302e4025b
  Warning  FailedCreatePodSandBox  12s (x10 over 2m11s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
According to the error, containerd has no runtime handler named "nvidia" configured, yet the corresponding RuntimeClass does exist in the cluster (a containerd-side check is sketched after the outputs below):
kubectl get runtimeclasses.node.k8s.io
NAME HANDLER AGE
nvidia nvidia 7d1h
kubectl describe runtimeclasses.node.k8s.io nvidia
Name:         nvidia
Namespace:
Labels:       app.kubernetes.io/component=gpu-operator
Annotations:  <none>
API Version:  node.k8s.io/v1
Handler:      nvidia
Kind:         RuntimeClass
Metadata:
  Creation Timestamp:  2024-05-27T08:53:18Z
  Owner References:
    API Version:           nvidia.com/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  ClusterPolicy
    Name:                  cluster-policy
    UID:                   2c237c3d-07eb-4856-8316-046489793e3d
  Resource Version:        265073642
  UID:                     26fd5054-7344-4e6d-9029-a610ae0df560
Events:                    <none>
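A RuntimeClass only points at a handler name; containerd on the GPU node must also have that handler registered, which the nvidia-container-toolkit daemonset normally does once it finishes initializing. A rough way to verify this on the node, assuming containerd's default config path (the path and BinaryName below are typical defaults, not values taken from this cluster):

grep -A4 'runtimes.nvidia' /etc/containerd/config.toml
# Roughly expected once the toolkit has patched the config:
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#   runtime_type = "io.containerd.runc.v2"
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#     BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

If that section is missing, containerd cannot create sandboxes for the nvidia RuntimeClass even though the RuntimeClass object exists in the API.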
3. Steps to reproduce the issue
I installed the chart with helmfile
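For context, a plain-Helm equivalent of that install would look roughly like the following (release name, namespace, and chart version are inferred from the pod names and operator version above, not copied from the actual helmfile):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v23.9.2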
4. Information to attach (optional if deemed irrelevant)
kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
kubectl get pods -n gpu-operator
NAME                                                          READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-spbbk                                   0/1     Init:0/1   0          41s
gpu-operator-d97f85598-j7qt4                                  1/1     Running    0          7d1h
gpu-operator-node-feature-discovery-gc-84c477b7-67tk8         1/1     Running    0          6d20h
gpu-operator-node-feature-discovery-master-cb8bb7d48-x4hqj    1/1     Running    0          6d20h
gpu-operator-node-feature-discovery-worker-jfdsh              1/1     Running    0          85s
nvidia-container-toolkit-daemonset-vb6qn                      0/1     Init:0/1   0          41s
nvidia-dcgm-exporter-9xmbm                                    0/1     Init:0/1   0          41s
nvidia-device-plugin-daemonset-w72xb                          0/1     Init:0/1   0          41s
nvidia-driver-daemonset-v4n96                                 0/1     Running    0          73s
nvidia-operator-validator-vbq6v                               0/1     Init:0/4   0          41s
kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
kubectl get ds -n gpu-operator
NAME                                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                           AGE
gpu-feature-discovery                         1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true                        7d
gpu-operator-node-feature-discovery-worker    1         1         1       1            1           instance-type=gpu                                                       6d20h
nvidia-container-toolkit-daemonset            1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true                            7d
nvidia-dcgm-exporter                          1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true                                7d
nvidia-device-plugin-daemonset                1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true                                7d
nvidia-device-plugin-mps-control-daemon       0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true    7d
nvidia-driver-daemonset                       1         1         0       1            0           nvidia.com/gpu.deploy.driver=true                                       7d
nvidia-mig-manager                            0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                  7d
nvidia-operator-validator                     1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true                           7d
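The nvidia-container-toolkit daemonset is the component that registers the "nvidia" runtime with containerd, and it is sitting at 0 READY above, so its pod is worth inspecting as well (pod name taken from the pod list above; if it is still in its init phase, pass the init container name shown by kubectl describe via -c):

kubectl describe pod nvidia-container-toolkit-daemonset-vb6qn -n gpu-operator
kubectl logs nvidia-container-toolkit-daemonset-vb6qn -n gpu-operator --all-containers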
If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl describe pod nvidia-device-plugin-daemonset-w72xb -n gpu-operator
....
Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               2m11s                 default-scheduler  Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-w72xb to i-071a4e5a302e4025b
  Warning  FailedCreatePodSandBox  12s (x10 over 2m11s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
kubectl exec nvidia-driver-daemonset-v4n96 -n gpu-operator -c nvidia-driver-ctr -- nvidia-smi
Mon Jun 3 10:01:38 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 30C P8 10W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+