gpu-operator on RKE2 using precompiled driver #1371

@govindkailas

I am trying to deploy the gpu-operator helm chart on an on-premise GPU cluster based on RKE2.

Worker nodes: Ubuntu 22.04.5 LTS
Kernel version: 5.15.0-135-generic

The worker nodes don't have any drivers installed at the moment. We are planning to use the nvidia-driver-daemonset to handle driver installation and configuration.

k get no 
NAME                     STATUS   ROLES                       AGE    VERSION
gkr-master-cx9lw-4ftwd   Ready    control-plane,etcd,master   6d5h   v1.30.6+rke2r1
gkr-master-cx9lw-4z6gv   Ready    control-plane,etcd,master   6d5h   v1.30.6+rke2r1
gkr-master-cx9lw-c4bnt   Ready    control-plane,etcd,master   6d5h   v1.30.6+rke2r1
gkr-worker-dd8sl-8l7zf   Ready    worker                      6d5h   v1.30.6+rke2r1
gkr-worker-dd8sl-fmswn   Ready    worker                      6d5h   v1.30.6+rke2r1
gkr-worker-dd8sl-qbn4g   Ready    worker                      6d5h   v1.30.6+rke2r1
gkr-worker-dd8sl-v6hm6   Ready    worker                      6d5h   v1.30.6+rke2r1

Helm chart deployment:

helm upgrade --install  -n gpu-operator gpu-operator  nvidia/gpu-operator  -f gpu-operator-values.yaml 
Release "gpu-operator" does not exist. Installing it now.
NAME: gpu-operator
LAST DEPLOYED: Thu Mar 27 17:42:52 2025
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

Helm values:

helm get values  gpu-operator
USER-SUPPLIED VALUES:
driver:
  usePrecompiled: true
  version: 570
toolkit:
  env:
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock

Pods status:

k get po 
NAME                                                           READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-bggx2                                    0/1     Init:0/1   0          13m
gpu-feature-discovery-dtzbl                                    0/1     Init:0/1   0          13m
gpu-feature-discovery-f6rll                                    0/1     Init:0/1   0          13m
gpu-feature-discovery-ss57h                                    0/1     Init:0/1   0          13m
gpu-operator-868d98fc79-g6svh                                  1/1     Running    0          14m
gpu-operator-node-feature-discovery-gc-74d9855689-lg2z6        1/1     Running    0          14m
gpu-operator-node-feature-discovery-master-5cb7f479cb-hzg4p    1/1     Running    0          14m
gpu-operator-node-feature-discovery-worker-27228               1/1     Running    0          14m
gpu-operator-node-feature-discovery-worker-87vhw               1/1     Running    0          14m
gpu-operator-node-feature-discovery-worker-k468n               1/1     Running    0          14m
gpu-operator-node-feature-discovery-worker-zn5hj               1/1     Running    0          14m
nvidia-container-toolkit-daemonset-b5ncn                       0/1     Init:0/1   0          13m
nvidia-container-toolkit-daemonset-f9x7k                       0/1     Init:0/1   0          13m
nvidia-container-toolkit-daemonset-k82q8                       0/1     Init:0/1   0          13m
nvidia-container-toolkit-daemonset-rd6bw                       0/1     Init:0/1   0          13m
nvidia-dcgm-exporter-5wwfb                                     0/1     Init:0/1   0          13m
nvidia-dcgm-exporter-7lvbz                                     0/1     Init:0/1   0          13m
nvidia-dcgm-exporter-7vdph                                     0/1     Init:0/1   0          13m
nvidia-dcgm-exporter-wbkln                                     0/1     Init:0/1   0          13m
nvidia-device-plugin-daemonset-2tpg7                           0/1     Init:0/1   0          13m
nvidia-device-plugin-daemonset-4j5l9                           0/1     Init:0/1   0          13m
nvidia-device-plugin-daemonset-655vs                           0/1     Init:0/1   0          13m
nvidia-device-plugin-daemonset-lwsvr                           0/1     Init:0/1   0          13m
nvidia-driver-daemonset-5.15.0-135-generic-ubuntu22.04-4lc44   0/1     Running    0          13m
nvidia-driver-daemonset-5.15.0-135-generic-ubuntu22.04-mn4zt   0/1     Running    0          13m
nvidia-driver-daemonset-5.15.0-135-generic-ubuntu22.04-mzx4l   0/1     Running    0          13m
nvidia-driver-daemonset-5.15.0-135-generic-ubuntu22.04-s2fr6   0/1     Running    0          13m
nvidia-operator-validator-97dzg                                0/1     Init:0/4   0          13m
nvidia-operator-validator-kc74p                                0/1     Init:0/4   0          13m
nvidia-operator-validator-mbhqn                                0/1     Init:0/4   0          13m
nvidia-operator-validator-w9v8c                                0/1     Init:0/4   0          13m

NVIDIA driver daemonset logs:

k logs -f nvidia-driver-daemonset-5.15.0-135-generic-ubuntu22.04-4lc44

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver branch 570 for Linux kernel version 5.15.0-135-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  libnvidia-cfg1-570-server libnvidia-compute-570-server
  nvidia-compute-utils-570-server nvidia-firmware-570-server-570.86.15
  nvidia-kernel-common-570-server nvidia-kernel-source-570-server
Suggested packages:
  nvidia-driver-570-server
The following NEW packages will be installed:
  libnvidia-cfg1-570-server libnvidia-compute-570-server
  libnvidia-decode-570-server libnvidia-encode-570-server
  libnvidia-extra-570-server libnvidia-fbc1-570-server
  nvidia-compute-utils-570-server nvidia-firmware-570-server-570.86.15
  nvidia-headless-no-dkms-570-server nvidia-kernel-common-570-server
  nvidia-kernel-source-570-server nvidia-utils-570-server
0 upgraded, 12 newly installed, 0 to remove and 5 not upgraded.
Need to get 0 B/191 MB of archives.
After this operation, 492 MB of additional disk space will be used.
Get:1 file:/usr/local/repos ./ libnvidia-cfg1-570-server 570.86.15-0ubuntu0.22.04.4 [159 kB]
Get:2 file:/usr/local/repos ./ libnvidia-compute-570-server 570.86.15-0ubuntu0.22.04.4 [48.8 MB]
Get:3 file:/usr/local/repos ./ libnvidia-decode-570-server 570.86.15-0ubuntu0.22.04.4 [2839 kB]
Get:4 file:/usr/local/repos ./ libnvidia-encode-570-server 570.86.15-0ubuntu0.22.04.4 [113 kB]
Get:5 file:/usr/local/repos ./ libnvidia-extra-570-server 570.86.15-0ubuntu0.22.04.4 [78.1 kB]
Get:6 file:/usr/local/repos ./ libnvidia-fbc1-570-server 570.86.15-0ubuntu0.22.04.4 [110 kB]
Get:7 file:/usr/local/repos ./ nvidia-compute-utils-570-server 570.86.15-0ubuntu0.22.04.4 [127 kB]
Get:8 file:/usr/local/repos ./ nvidia-firmware-570-server-570.86.15 570.86.15-0ubuntu0.22.04.4 [65.6 MB]
Get:9 file:/usr/local/repos ./ nvidia-kernel-common-570-server 570.86.15-0ubuntu0.22.04.4 [129 kB]
Get:10 file:/usr/local/repos ./ nvidia-kernel-source-570-server 570.86.15-0ubuntu0.22.04.4 [72.6 MB]
Get:11 file:/usr/local/repos ./ nvidia-headless-no-dkms-570-server 570.86.15-0ubuntu0.22.04.4 [10.5 kB]
Get:12 file:/usr/local/repos ./ nvidia-utils-570-server 570.86.15-0ubuntu0.22.04.4 [558 kB]
Selecting previously unselected package libnvidia-cfg1-570-server:amd64.
(Reading database ... 12053 files and directories currently installed.)
Preparing to unpack .../00-libnvidia-cfg1-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking libnvidia-cfg1-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package libnvidia-compute-570-server:amd64.
Preparing to unpack .../01-libnvidia-compute-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking libnvidia-compute-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package libnvidia-decode-570-server:amd64.
Preparing to unpack .../02-libnvidia-decode-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking libnvidia-decode-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package libnvidia-encode-570-server:amd64.
Preparing to unpack .../03-libnvidia-encode-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking libnvidia-encode-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package libnvidia-extra-570-server:amd64.
Preparing to unpack .../04-libnvidia-extra-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking libnvidia-extra-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package libnvidia-fbc1-570-server:amd64.
Preparing to unpack .../05-libnvidia-fbc1-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking libnvidia-fbc1-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package nvidia-compute-utils-570-server.
Preparing to unpack .../06-nvidia-compute-utils-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking nvidia-compute-utils-570-server (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package nvidia-firmware-570-server-570.86.15.
Preparing to unpack .../07-nvidia-firmware-570-server-570.86.15_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking nvidia-firmware-570-server-570.86.15 (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package nvidia-kernel-common-570-server.
Preparing to unpack .../08-nvidia-kernel-common-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking nvidia-kernel-common-570-server (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package nvidia-kernel-source-570-server.
Preparing to unpack .../09-nvidia-kernel-source-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking nvidia-kernel-source-570-server (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package nvidia-headless-no-dkms-570-server.
Preparing to unpack .../10-nvidia-headless-no-dkms-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking nvidia-headless-no-dkms-570-server (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package nvidia-utils-570-server.
Preparing to unpack .../11-nvidia-utils-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking nvidia-utils-570-server (570.86.15-0ubuntu0.22.04.4) ...
Setting up libnvidia-fbc1-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Setting up nvidia-kernel-source-570-server (570.86.15-0ubuntu0.22.04.4) ...
Setting up libnvidia-cfg1-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Setting up libnvidia-compute-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Setting up libnvidia-extra-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Setting up nvidia-compute-utils-570-server (570.86.15-0ubuntu0.22.04.4) ...
Warning: The home dir /nonexistent you specified can't be accessed: No such file or directory
Adding system user `nvidia-persistenced' (UID 100) ...
Adding new group `nvidia-persistenced' (GID 101) ...
Adding new user `nvidia-persistenced' (UID 100) with group `nvidia-persistenced' ...
Not creating home directory `/nonexistent'.
Setting up nvidia-firmware-570-server-570.86.15 (570.86.15-0ubuntu0.22.04.4) ...
Setting up libnvidia-decode-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Setting up nvidia-utils-570-server (570.86.15-0ubuntu0.22.04.4) ...
Setting up nvidia-kernel-common-570-server (570.86.15-0ubuntu0.22.04.4) ...
Setting up nvidia-headless-no-dkms-570-server (570.86.15-0ubuntu0.22.04.4) ...
Setting up libnvidia-encode-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Processing triggers for libc-bin (2.35-0ubuntu3.8) ...
Installing Closed NVIDIA driver kernel modules...
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  linux-base linux-image-5.15.0-135-generic linux-modules-5.15.0-135-generic
Suggested packages:
  fdutils linux-doc | linux-source-5.15.0 linux-tools
  linux-headers-5.15.0-135-generic linux-modules-extra-5.15.0-135-generic
Recommended packages:
  grub-pc | grub-efi-amd64 | grub-efi-ia32 | grub | lilo initramfs-tools
  | linux-initramfs-tool
The following NEW packages will be installed:
  linux-base linux-image-5.15.0-135-generic linux-modules-5.15.0-135-generic
  linux-modules-nvidia-570-server-5.15.0-135-generic
  linux-objects-nvidia-570-server-5.15.0-135-generic
  linux-signatures-nvidia-5.15.0-135-generic
0 upgraded, 6 newly installed, 0 to remove and 5 not upgraded.
Need to get 0 B/121 MB of archives.
After this operation, 307 MB of additional disk space will be used.
Get:1 file:/usr/local/repos ./ linux-base 4.5ubuntu9 [17.8 kB]
Get:2 file:/usr/local/repos ./ linux-modules-5.15.0-135-generic 5.15.0-135.146 [22.7 MB]
Get:3 file:/usr/local/repos ./ linux-image-5.15.0-135-generic 5.15.0-135.146 [11.6 MB]
Get:4 file:/usr/local/repos ./ linux-signatures-nvidia-5.15.0-135-generic 5.15.0-135.146+1 [33.0 kB]
Get:5 file:/usr/local/repos ./ linux-objects-nvidia-570-server-5.15.0-135-generic 5.15.0-135.146+1 [86.5 MB]
Get:6 file:/usr/local/repos ./ linux-modules-nvidia-570-server-5.15.0-135-generic 5.15.0-135.146+1 [16.9 kB]
Preconfiguring packages ...
Selecting previously unselected package linux-base.
(Reading database ... 12650 files and directories currently installed.)
Preparing to unpack .../0-linux-base_4.5ubuntu9_all.deb ...
Unpacking linux-base (4.5ubuntu9) ...
Selecting previously unselected package linux-modules-5.15.0-135-generic.
Preparing to unpack .../1-linux-modules-5.15.0-135-generic_5.15.0-135.146_amd64.deb ...
Unpacking linux-modules-5.15.0-135-generic (5.15.0-135.146) ...
Selecting previously unselected package linux-image-5.15.0-135-generic.
Preparing to unpack .../2-linux-image-5.15.0-135-generic_5.15.0-135.146_amd64.deb ...
Unpacking linux-image-5.15.0-135-generic (5.15.0-135.146) ...
Selecting previously unselected package linux-signatures-nvidia-5.15.0-135-generic.
Preparing to unpack .../3-linux-signatures-nvidia-5.15.0-135-generic_5.15.0-135.146+1_amd64.deb ...
Unpacking linux-signatures-nvidia-5.15.0-135-generic (5.15.0-135.146+1) ...
Selecting previously unselected package linux-objects-nvidia-570-server-5.15.0-135-generic.
Preparing to unpack .../4-linux-objects-nvidia-570-server-5.15.0-135-generic_5.15.0-135.146+1_amd64.deb ...
Unpacking linux-objects-nvidia-570-server-5.15.0-135-generic (5.15.0-135.146+1) ...
Selecting previously unselected package linux-modules-nvidia-570-server-5.15.0-135-generic.
Preparing to unpack .../5-linux-modules-nvidia-570-server-5.15.0-135-generic_5.15.0-135.146+1_amd64.deb ...
Unpacking linux-modules-nvidia-570-server-5.15.0-135-generic (5.15.0-135.146+1) ...
Setting up linux-base (4.5ubuntu9) ...
Setting up linux-objects-nvidia-570-server-5.15.0-135-generic (5.15.0-135.146+1) ...
Setting up linux-image-5.15.0-135-generic (5.15.0-135.146) ...
I: /boot/vmlinuz.old is now a symlink to vmlinuz-5.15.0-135-generic
I: /boot/initrd.img.old is now a symlink to initrd.img-5.15.0-135-generic
I: /boot/vmlinuz is now a symlink to vmlinuz-5.15.0-135-generic
I: /boot/initrd.img is now a symlink to initrd.img-5.15.0-135-generic
Setting up linux-modules-5.15.0-135-generic (5.15.0-135.146) ...
Setting up linux-signatures-nvidia-5.15.0-135-generic (5.15.0-135.146+1) ...
Setting up linux-modules-nvidia-570-server-5.15.0-135-generic (5.15.0-135.146+1) ...
linux-image-nvidia-5.15.0-135-generic: constructing .ko files
nvidia-drm.ko: OK
nvidia-modeset.ko: OK
nvidia-peermem.ko: OK
nvidia-uvm.ko: OK
nvidia.ko: OK
Processing triggers for linux-image-5.15.0-135-generic (5.15.0-135.146) ...
Parsing kernel module parameters...
Configuring the following firmware search path in '/sys/module/firmware_class/parameters/path': /run/nvidia/driver/lib/firmware
WARNING: A search path is already configured in /sys/module/firmware_class/parameters/path
         Retaining the current configuration
Loading ipmi and i2c_core kernel modules...
Loading NVIDIA driver kernel modules...
+ modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': No such device
+ modprobe nvidia-uvm
modprobe: ERROR: could not insert 'nvidia_uvm': No such device
+ modprobe nvidia-modeset
modprobe: ERROR: could not insert 'nvidia_modeset': No such device
+ set +o xtrace -o nounset
Starting NVIDIA persistence daemon...
nvidia-persistenced failed to initialize. Check syslog for more details.
Mounting NVIDIA driver rootfs...
Done, now waiting for signal
^C

Why is modprobe not finding the device? The GPU is visible on the worker nodes:

 lspci | grep -i nvidia
05:00.0 VGA compatible controller: NVIDIA Corporation Device 26b9 (rev a1)
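
A minimal sketch of a follow-up check, assuming shell access to a worker node with pciutils installed. One possibility worth ruling out (an assumption; nothing in the logs above confirms it) is that another kernel driver, for example nouveau or vfio-pci, has already claimed the GPU. The helper name `nvidia_bound_driver` is hypothetical; it only parses `lspci -nnk` output and reports which kernel driver, if any, is bound to each NVIDIA device.

```shell
#!/bin/sh
# Print the PCI address and bound kernel driver for each NVIDIA device,
# given `lspci -nnk` output on stdin. Prints "<none>" when no driver is bound.
nvidia_bound_driver() {
  awk '
    /^[0-9a-f]/ {                      # a new device record starts at column 0
      if (addr != "") print addr, (drv == "" ? "<none>" : drv)
      addr = ""; drv = ""
      if (tolower($0) ~ /nvidia/) addr = $1
    }
    /Kernel driver in use:/ && addr != "" { drv = $NF }
    END { if (addr != "") print addr, (drv == "" ? "<none>" : drv) }
  '
}

# Usage on a worker node:
#   lspci -nnk | nvidia_bound_driver
# Kernel messages from the failed modprobe may also help:
#   dmesg | grep -i -E 'nvrm|nouveau|vfio'
```

If the helper reports a driver other than `nvidia`, that driver would need to be unbound or blacklisted before the daemonset can load its modules. If it reports `<none>`, the kernel messages are the next place to look.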
