Upgrade to 25.10.0 - Pod with GPU will not start #1876

@xhejtman

Describe the bug

After upgrading from 25.3.2 to 25.10.0, a pod that requests a GPU fails to start with this error:
CrashLoopBackOff (back-off 40s restarting failed container=hello-kubernetes pod=hello-kubernetes-69575f56b-9dzz4_test-ns(5e20c659-44d1-4c22-9d8f-560e3411fc58)) | Last state: Terminated with 128: StartError (failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices management.nvidia.com/gpu=GPU-feefc289-ca5f-9917-2bf1-9477651da944), started: Thu, Jan 1 1970 1:00:00 am, finished: Sun, Nov 9 2025 12:34:53 am
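
For readers triaging this, the device in the error is a fully qualified CDI name of the form `vendor/class=name`. "Unresolvable" means the runtime found no CDI spec that defines that name under that kind. The sketch below only pulls such names out of an error string so they can be compared with the specs present on the node. The helper names (`parse_cdi_device`, `unresolvable_devices`) are hypothetical, and the pattern is a simplification that does not enforce the full character rules of the CDI specification.

```python
import re
from typing import List, NamedTuple


class CDIDevice(NamedTuple):
    vendor: str  # e.g. "management.nvidia.com"
    cls: str     # e.g. "gpu"
    name: str    # e.g. a GPU UUID


# Simplified "vendor/class=name" pattern (assumption: good enough for
# reading error messages, not a validator for the CDI specification).
_QUALIFIED = re.compile(r"^(?P<vendor>[^/=\s]+)/(?P<cls>[^/=\s]+)=(?P<name>\S+)$")


def parse_cdi_device(qualified: str) -> CDIDevice:
    """Split one fully qualified CDI device name into its three parts."""
    m = _QUALIFIED.match(qualified.strip())
    if m is None:
        raise ValueError(f"not a qualified CDI device name: {qualified!r}")
    return CDIDevice(m["vendor"], m["cls"], m["name"])


def unresolvable_devices(message: str) -> List[CDIDevice]:
    """Extract the device names that follow 'unresolvable CDI devices'."""
    _, found, tail = message.partition("unresolvable CDI devices")
    if not found:
        return []
    devices = []
    # Assumption: names are separated by commas or whitespace; stop at the
    # first token that no longer looks like a qualified device name.
    for token in re.split(r"[,\s]+", tail.strip()):
        token = token.rstrip("),.;")
        if not _QUALIFIED.match(token):
            break
        devices.append(parse_cdi_device(token))
    return devices
```

Applied to the message above, this yields one device with vendor `management.nvidia.com`, class `gpu`, and a GPU UUID as its name. The kind and name can then be checked against whatever CDI specs exist on the node.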

This is probably because CDI is not the default: if I set runtimeClassName: nvidia-cdi, the pod fails as shown above.

If I omit runtimeClassName, the pod starts and I can see the NVIDIA devices in /dev, but no userspace files such as nvidia-smi or the driver libraries are injected.
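
The "devices present, userspace missing" state described here can be captured with a small check run inside the container. This is a minimal sketch, assuming the image ships a Python interpreter. The function name `gpu_injection_report` is hypothetical, and `ctypes.util.find_library` relies on tools such as `ldconfig`, so a `None` result in a very minimal image is not conclusive on its own.

```python
import ctypes.util
import glob
import shutil
from typing import Dict


def gpu_injection_report() -> Dict[str, object]:
    """Report which NVIDIA pieces are visible from inside the container."""
    return {
        # Device nodes such as /dev/nvidia0 or /dev/nvidiactl
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        # Path of the nvidia-smi binary on PATH, or None if absent
        "nvidia_smi": shutil.which("nvidia-smi"),
        # Driver libraries the linker can locate, or None if absent
        "libcuda": ctypes.util.find_library("cuda"),
        "libnvidia_ml": ctypes.util.find_library("nvidia-ml"),
    }


if __name__ == "__main__":
    for key, value in gpu_injection_report().items():
        print(f"{key}: {value}")
```

In the situation described above, `device_nodes` would be non-empty while `nvidia_smi` and the library entries would be `None`.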

Do I need to change something in the operator configuration?

For the toolkit, I use:

 - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
   value: "true"
 - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
   value: "false"

and for the devicePlugin:

env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY
      value: volume-mounts
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all

Environment (please provide the following information):

  • GPU Operator Version: v25.10.0
  • OS: Ubuntu 24.04
  • Kernel Version: 6.8.0-generic
  • Container Runtime Version: containerd 2.1.4
  • Kubernetes Distro and Version: RKE 1.33.5
