Description
I am deploying the NVIDIA GPU Operator with RKE2 Kubernetes in a two-node cluster, where one node has an NVIDIA GPU. The version of the nvidia-container-toolkit is v1.17.8. I am enabling CDI with cdi.enabled: true. I believe it installs correctly, since all pods are running.
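For reference, this is roughly how I install the operator (a minimal sketch; the repo alias, release name, and namespace are just what I use locally):

$> helm upgrade --install gpu-operator nvidia/gpu-operator \
     -n gpu-operator --create-namespace \
     --set cdi.enabled=true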
Then, I create a pod with runtimeClassName: nvidia-cdi that requests one nvidia.com/gpu. It fails to start with the following event:
Warning Failed 10s (x2 over 10s) kubelet Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices management.nvidia.com/gpu=GPU-e78e68c0-4e6a-737e-f9cc-4223fe50aead
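For completeness, the pod I'm creating is essentially this (a minimal sketch; the pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  runtimeClassName: nvidia-cdi
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/gpu: 1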
If I go to the node, that GPU does show up in the output of nvidia-smi:
$> nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-e78e68c0-4e6a-737e-f9cc-4223fe50aead)
However, when I check /var/run/cdi/management.nvidia.com-gpu.yaml, that device is not there; the only device listed is one with name: all. If I manually run sudo /usr/local/nvidia/toolkit/nvidia-ctk cdi generate, I see three devices:
- name: "0"
- name: GPU-e78e68c0-4e6a-737e-f9cc-4223fe50aead
- name: all
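Concretely, that comparison is just grepping for device names in the deployed spec and in the manually generated one (nvidia-ctk cdi generate prints the spec to stdout when no output file is given):

$> grep 'name:' /var/run/cdi/management.nvidia.com-gpu.yaml
$> sudo /usr/local/nvidia/toolkit/nvidia-ctk cdi generate | grep 'name:'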
I believe that is the correct YAML, so I suspect the nvidia-container-toolkit is not generating the CDI spec correctly.
If I check its logs, however, I see a line stating that it is generating the CDI spec, and there is no error:
$> kubectl logs nvidia-container-toolkit-daemonset-6jrl2 -n gpu-operator | grep CDI -n2
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
110-time="2025-08-11T13:36:58Z" level=info msg="Skipping: /host/dev/nvidia-uvm already exists"
111-time="2025-08-11T13:36:58Z" level=info msg="Skipping: /host/dev/nvidia-uvm-tools already exists"
112:time="2025-08-11T13:36:58Z" level=info msg="Generating CDI spec for management containers"
113-time="2025-08-11T13:36:58Z" level=info msg="Using /host/usr/lib64/libnvidia-ml.so.570.172.08"
114-time="2025-08-11T13:36:58Z" level=info msg="Selecting /host/dev/nvidia-modeset as /dev/nvidia-modeset"
I'm a bit at a loss. Can anyone help me debug this further? Do I perhaps need an extra flag in the operator, or is something missing in my containerd config.toml? Thanks for your time!
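In case it helps, these are the additional checks I can run on the GPU node and share output from (the containerd config path below is the RKE2 default on my system, so it may differ elsewhere):

$> kubectl get runtimeclass
$> ls -l /var/run/cdi/
$> sudo grep -A3 'nvidia-cdi' /var/lib/rancher/rke2/agent/etc/containerd/config.toml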