
Wrong CDI spec creates error "Unresolvable CDI device" #1239

@manuelbuil

Description


I am deploying the NVIDIA GPU Operator with RKE2 Kubernetes in a two-node cluster, with one node having an NVIDIA GPU. The version of the nvidia-container-toolkit is v1.17.8. I am enabling CDI with cdi.enabled: true. I believe the operator installs correctly, since all of its pods are running.

Then I create a pod with runtimeClassName: nvidia-cdi that requests one nvidia.com/gpu. It fails to start with:

  Warning  Failed     10s (x2 over 10s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices management.nvidia.com/gpu=GPU-e78e68c0-4e6a-737e-f9cc-4223fe50aead
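For reference, the pod I am creating looks roughly like this (the image and names here are just illustrative, not my exact manifest):

  apiVersion: v1
  kind: Pod
  metadata:
    name: gpu-test
  spec:
    runtimeClassName: nvidia-cdi
    restartPolicy: Never
    containers:
      - name: cuda
        # example image; my real workload differs
        image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
        command: ["nvidia-smi", "-L"]
        resources:
          limits:
            nvidia.com/gpu: 1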

If I go to the node, I can see that GPU in the output of nvidia-smi:

$> nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-e78e68c0-4e6a-737e-f9cc-4223fe50aead)

However, when I check /var/run/cdi/management.nvidia.com-gpu.yaml, that device is not there; the only device in the file is one with name: all. If I manually run sudo /usr/local/nvidia/toolkit/nvidia-ctk cdi generate, I see three devices:

  • name: "0"
  • name: GPU-e78e68c0-4e6a-737e-f9cc-4223fe50aead
  • name: all

I believe that is the correct YAML. Therefore, I suspect the nvidia-container-toolkit is not correctly generating the CDI spec.

If I check the logs of the nvidia-container-toolkit daemonset, I actually see a line stating that it is generating the CDI spec, and there is no error:

$> kubectl logs nvidia-container-toolkit-daemonset-6jrl2 -n gpu-operator | grep CDI -n2
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
110-time="2025-08-11T13:36:58Z" level=info msg="Skipping: /host/dev/nvidia-uvm already exists"
111-time="2025-08-11T13:36:58Z" level=info msg="Skipping: /host/dev/nvidia-uvm-tools already exists"
112:time="2025-08-11T13:36:58Z" level=info msg="Generating CDI spec for management containers"
113-time="2025-08-11T13:36:58Z" level=info msg="Using /host/usr/lib64/libnvidia-ml.so.570.172.08"
114-time="2025-08-11T13:36:58Z" level=info msg="Selecting /host/dev/nvidia-modeset as /dev/nvidia-modeset"

I'm a bit at a loss. Can anyone help me debug this further? Do I perhaps need an extra flag in the operator, or is something missing in my containerd config.toml file? Thanks for your time!
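For context, my understanding is that CDI support in containerd's CRI plugin (containerd 1.7+) is controlled by settings roughly like the ones below; I have not confirmed what RKE2 generates in its config.toml, nor whether these even apply when the injection is done by the nvidia runtime rather than by containerd itself:

  # containerd 1.7+ CRI plugin settings that I believe relate to CDI (unverified for RKE2)
  [plugins."io.containerd.grpc.v1.cri"]
    enable_cdi = true
    cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]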
