Description
I am deploying the NVIDIA GPU Operator with RKE2 Kubernetes in a two-node cluster, where one node has an NVIDIA GPU. The version of the nvidia-container-toolkit is v1.17.8. I am enabling CDI with cdi.enabled: true. I believe it installs correctly, since all pods are running.
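For reference, this is roughly how I install the operator (a minimal sketch; the repo alias, release name, and namespace are just what I use locally):

$> helm upgrade --install gpu-operator nvidia/gpu-operator \
     -n gpu-operator --create-namespace \
     --set cdi.enabled=true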
Then, I create a pod with runtimeClassName: nvidia-cdi that requests one nvidia.com/gpu. It fails to start with the following event:
Warning Failed 10s (x2 over 10s) kubelet Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices management.nvidia.com/gpu=GPU-e78e68c0-4e6a-737e-f9cc-4223fe50aead
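For completeness, the pod I'm creating is essentially this (a minimal sketch; the pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  runtimeClassName: nvidia-cdi
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/gpu: 1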
If I go to the node, that GPU does show up in the output of nvidia-smi:
$> nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-e78e68c0-4e6a-737e-f9cc-4223fe50aead)
However, when I check /var/run/cdi/management.nvidia.com-gpu.yaml, that device is not there; the only device listed is one with name: all. If I manually run sudo /usr/local/nvidia/toolkit/nvidia-ctk cdi generate, I see three devices:
- name: "0"
- name: GPU-e78e68c0-4e6a-737e-f9cc-4223fe50aead
- name: all
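Concretely, that comparison is just grepping for device names in the deployed spec and in the manually generated one (nvidia-ctk cdi generate prints the spec to stdout when no output file is given):

$> grep 'name:' /var/run/cdi/management.nvidia.com-gpu.yaml
$> sudo /usr/local/nvidia/toolkit/nvidia-ctk cdi generate | grep 'name:'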
I believe that is the correct YAML, so I suspect the nvidia-container-toolkit is not generating the CDI spec correctly.
If I check its logs, however, I see a line stating that it is generating the CDI spec, and there is no error:
$> kubectl logs nvidia-container-toolkit-daemonset-6jrl2 -n gpu-operator | grep CDI -n2
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
110-time="2025-08-11T13:36:58Z" level=info msg="Skipping: /host/dev/nvidia-uvm already exists"
111-time="2025-08-11T13:36:58Z" level=info msg="Skipping: /host/dev/nvidia-uvm-tools already exists"
112:time="2025-08-11T13:36:58Z" level=info msg="Generating CDI spec for management containers"
113-time="2025-08-11T13:36:58Z" level=info msg="Using /host/usr/lib64/libnvidia-ml.so.570.172.08"
114-time="2025-08-11T13:36:58Z" level=info msg="Selecting /host/dev/nvidia-modeset as /dev/nvidia-modeset"
I'm a bit at a loss. Can anyone help me debug this further? Do I perhaps need an extra flag in the operator, or is something missing in my containerd config.toml? Thanks for your time!
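In case it helps, these are the additional checks I can run on the GPU node and share output from (the containerd config path below is the RKE2 default on my system, so it may differ elsewhere):

$> kubectl get runtimeclass
$> ls -l /var/run/cdi/
$> sudo grep -A3 'nvidia-cdi' /var/lib/rancher/rke2/agent/etc/containerd/config.toml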