nvidia-cuda-validator v25.10.0 fails to allocate vector #1900

@davidshen84

I installed nvidia-gpu-operator v25.10 on my K3s cluster. Most of the GPU Operator pods start successfully, but the CUDA validator fails with the following message:

    cuda-validation Failed to allocate device vector A (error code no CUDA-capable device is detected)!
    cuda-validation [Vector addition of 50000 elements]
    stream closed EOF for gpu-operator/nvidia-cuda-validator-r6nsb (cuda-validation)

I downgraded to v25.3.2 and everything worked.

My host system is Gentoo. I installed the NVIDIA driver and nvidia-container-toolkit directly on the host using the system package manager, which is why both are disabled in the operator values below.

I customised the operator with the following values:

    driver:
      enabled: false
    toolkit:
      enabled: false

    devicePlugin:
      config:
        name: device-plugin-config
        create: true
        default: "time-slicing"
        data:
          time-slicing: |-
            version: v1
            flags:
              migStrategy: none
            sharing:
              timeSlicing:
                renameByDefault: false
                failRequestsGreaterThanOne: true
                resources:
                  - name: nvidia.com/gpu
                    replicas: 4
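
To check whether the failure is specific to the validator pod, a minimal test pod along the following lines could be run against the same cluster. This is only a sketch, not something taken from the setup above: the pod name, the sample image tag, and `runtimeClassName: nvidia` are assumptions and may need adjusting. It requests a single `nvidia.com/gpu`, which matches the time-slicing values above (`renameByDefault: false`, `failRequestsGreaterThanOne: true`).

```yaml
# Hypothetical test pod; all names below are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-test        # assumed name, not from the original report
  namespace: default
spec:
  restartPolicy: Never
  # Assumption: a RuntimeClass named "nvidia" exists on the cluster.
  # Remove this line if the NVIDIA runtime is the containerd default.
  runtimeClassName: nvidia
  containers:
    - name: cuda-vectoradd
      # Assumption: NVIDIA's public vector-add sample image; substitute any
      # CUDA sample image that is known to work in your environment.
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1        # one time-sliced replica; more than one would be rejected
```

If a pod like this completes on v25.10 while the validator still fails, that would point at how the validator pod is launched rather than at the host driver or toolkit installation.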
