Describe the bug
After upgrading from 25.3.2 to 25.10.0, I get the following error:
CrashLoopBackOff (back-off 40s restarting failed container=hello-kubernetes pod=hello-kubernetes-69575f56b-9dzz4_test-ns(5e20c659-44d1-4c22-9d8f-560e3411fc58)) | Last state: Terminated with 128: StartError (failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices management.nvidia.com/gpu=GPU-feefc289-ca5f-9917-2bf1-9477651da944), started: Thu, Jan 1 1970 1:00:00 am, finished: Sun, Nov 9 2025 12:34:53 am
This is probably because CDI is not the default mode; if I set runtimeClassName: nvidia-cdi explicitly, the pod fails as shown above.
If I omit runtimeClassName, the pod starts and I can see the NVIDIA devices in /dev, but the userspace files (nvidia-smi and the driver libraries) are not injected.
Do I need to change something in the operator configuration?
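
For reference, this is my understanding of the relevant ClusterPolicy settings, as a minimal sketch based on the CDI section of the GPU Operator docs (the exact fields are my assumption, not a confirmed fix):

```yaml
# Sketch of the CDI-related ClusterPolicy settings (my reading of the
# GPU Operator docs, not a verified configuration).
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  cdi:
    enabled: true    # deploys CDI support and the nvidia-cdi runtime class
    default: false   # if true, CDI becomes the default mode for all GPU pods
```

If these fields are no longer the right way to enable CDI in 25.10.0, that may be the root of my problem.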
For the toolkit I use:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
  value: "true"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
  value: "false"
and for the devicePlugin:
env:
  - name: PASS_DEVICE_SPECS
    value: "true"
  - name: FAIL_ON_INIT_ERROR
    value: "true"
  - name: DEVICE_LIST_STRATEGY
    value: volume-mounts
  - name: DEVICE_ID_STRATEGY
    value: uuid
  - name: NVIDIA_VISIBLE_DEVICES
    value: all
  - name: NVIDIA_DRIVER_CAPABILITIES
    value: all
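
And a minimal sketch of the pod I'm deploying (name, namespace, and image are placeholders; the runtime class and the GPU resource request are the parts that matter):

```yaml
# Reproduction sketch: pod name, namespace, and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: hello-kubernetes
  namespace: test-ns
spec:
  # Removing this line lets the pod start, but without the injected
  # userspace files (nvidia-smi, driver libraries).
  runtimeClassName: nvidia-cdi
  containers:
    - name: hello-kubernetes
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # placeholder image
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1
```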
Environment (please provide the following information):
- GPU Operator Version: v25.10.0
- OS: Ubuntu 24.04
- Kernel Version: 6.8.0-generic
- Container Runtime Version: containerd 2.1.4
- Kubernetes Distro and Version: RKE 1.33.5