Hi guys,
I can't understand what is wrong in my case. Everything seems to be OK, but it just doesn't work.
My cluster is installed on the latest Rancher k3s, and GPU Operator v25.3.0 was added from Rancher Apps.
Here are the logs.
All pods are up and running:
:~# kubectl get pods
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-vklrj                                   1/1     Running     0          22m
gpu-operator-56977fc4b6-96t6s                                 1/1     Running     0          23m
gpu-operator-node-feature-discovery-gc-78d798587d-7dldq       1/1     Running     0          23m
gpu-operator-node-feature-discovery-master-7b7b57c9f9-5mmmz   1/1     Running     0          23m
gpu-operator-node-feature-discovery-worker-7c2n9              1/1     Running     0          23m
nvidia-container-toolkit-daemonset-gggxj                      1/1     Running     0          22m
nvidia-cuda-validator-ps8fl                                   0/1     Completed   0          21m
nvidia-dcgm-exporter-c7g4l                                    1/1     Running     0          22m
nvidia-device-plugin-daemonset-7wrzw                          1/1     Running     0          22m
nvidia-mig-manager-qcgqd                                      1/1     Running     0          22m
nvidia-operator-validator-srhzw                               1/1     Running     0          22m
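The nvidia-cuda-validator pod shows Completed; to be thorough, its logs (and the operator validator's) can be dumped as well. A minimal check, using the pod names from the listing above:
# pod names copied from the kubectl get pods output above;
# add -n <namespace> if the operator runs in its own namespace
kubectl logs nvidia-cuda-validator-ps8fl --all-containers
kubectl logs nvidia-operator-validator-srhzw --all-containers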
kubectl describe node also shows the GPUs:
kubectl describe node XXX
Capacity:
  cpu:                128
  ephemeral-storage:  2079140828Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1031635584Ki
  nvidia.com/gpu:     8
  pods:               110
Allocatable:
  cpu:                128
  ephemeral-storage:  2022588195893
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1031635584Ki
  nvidia.com/gpu:     8
  pods:               110
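For what it's worth, the allocatable GPU count can also be read straight from the node object (XXX is the placeholder node name used in the describe command above):
# XXX is a placeholder for the real node name
kubectl get node XXX -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'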
I am able to see the GPU in Docker:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
but the CRI (containerd) path just does not work.
Here is the pod spec from the documentation example:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/gpu: 1
The same failure occurs with a simple test pod (critest) that runs nvidia-smi in an ubuntu image:
:~# kubectl describe pod critest
Name:             critest
Namespace:        default
Priority:         0
Service Account:  default
Node:             localhost.localdomain/149.137.199.173
Start Time:       Sat, 12 Apr 2025 12:16:08 +0000
Labels:           <none>
Annotations:      <none>
Status:           Failed
IP:               10.42.0.185
IPs:
  IP:  10.42.0.185
Containers:
  nvidia-gpu:
    Container ID:   containerd://ade19e45be28c623f8b05923c93dc075d71f87340478a0b6f2501843c75ecd3c
    Image:          ubuntu
    Image ID:       docker.io/library/ubuntu@sha256:1e622c5f073b4f6bfad6632f2616c7f59ef256e96fe78bf6a595d1dc4376ac02
    Port:           <none>
    Host Port:      <none>
    Command:
      nvidia-smi
    State:          Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "nvidia-smi": executable file not found in $PATH
      Exit Code:    128
      Started:      Thu, 01 Jan 1970 00:00:00 +0000
      Finished:     Sat, 12 Apr 2025 12:16:09 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bjxqh (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  kube-api-access-bjxqh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  24m   default-scheduler  Successfully assigned default/critest to localhost.localdomain
  Normal   Pulling    24m   kubelet            Pulling image "ubuntu"
  Normal   Pulled     24m   kubelet            Successfully pulled image "ubuntu" in 353ms (353ms including waiting). Image size: 29727061 bytes.
  Normal   Created    24m   kubelet            Created container: nvidia-gpu
  Warning  Failed     24m   kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "nvidia-smi": executable file not found in $PATH
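Since nvidia-smi is visible through Docker but not through containerd, my guess is that containerd is not invoking the nvidia runtime for this pod (or the runtime hook is not injecting the driver files). A couple of checks, sketched under the assumption of a default k3s layout (the containerd config path may differ on other installs):
# Assumption: default k3s containerd config path; adjust for a non-default data dir.
# 1) Was an "nvidia" runtime added to k3s's containerd config by the toolkit daemonset?
sudo grep -A3 nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
# 2) Did the operator create a matching RuntimeClass?
kubectl get runtimeclass
And here is a variant of the test pod that pins the runtime class explicitly; the runtimeClassName value below is an assumption and should match whatever kubectl get runtimeclass reports:
apiVersion: v1
kind: Pod
metadata:
  name: critest
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia   # assumption: name reported by "kubectl get runtimeclass"
  containers:
  - name: nvidia-gpu
    image: ubuntu
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1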