Closed
Description
I am trying to deploy the gpu-operator Helm chart on an on-premises GPU cluster based on RKE2.
Worker nodes: Ubuntu 22.04.5 LTS
Kernel version: 5.15.0-135-generic
The worker nodes don't have any drivers installed yet. We plan to use the nvidia-driver-daemonset to handle driver installation and configuration.
k get no
NAME STATUS ROLES AGE VERSION
gkr-master-cx9lw-4ftwd Ready control-plane,etcd,master 6d5h v1.30.6+rke2r1
gkr-master-cx9lw-4z6gv Ready control-plane,etcd,master 6d5h v1.30.6+rke2r1
gkr-master-cx9lw-c4bnt Ready control-plane,etcd,master 6d5h v1.30.6+rke2r1
gkr-worker-dd8sl-8l7zf Ready worker 6d5h v1.30.6+rke2r1
gkr-worker-dd8sl-fmswn Ready worker 6d5h v1.30.6+rke2r1
gkr-worker-dd8sl-qbn4g Ready worker 6d5h v1.30.6+rke2r1
gkr-worker-dd8sl-v6hm6 Ready worker 6d5h v1.30.6+rke2r1
Helm chart deployment:
helm upgrade --install -n gpu-operator gpu-operator nvidia/gpu-operator -f gpu-operator-values.yaml
Release "gpu-operator" does not exist. Installing it now.
NAME: gpu-operator
LAST DEPLOYED: Thu Mar 27 17:42:52 2025
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
Helm values:
helm get values gpu-operator
USER-SUPPLIED VALUES:
driver:
  usePrecompiled: true
  version: 570
toolkit:
  env:
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
Pod status:
k get po
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-bggx2 0/1 Init:0/1 0 13m
gpu-feature-discovery-dtzbl 0/1 Init:0/1 0 13m
gpu-feature-discovery-f6rll 0/1 Init:0/1 0 13m
gpu-feature-discovery-ss57h 0/1 Init:0/1 0 13m
gpu-operator-868d98fc79-g6svh 1/1 Running 0 14m
gpu-operator-node-feature-discovery-gc-74d9855689-lg2z6 1/1 Running 0 14m
gpu-operator-node-feature-discovery-master-5cb7f479cb-hzg4p 1/1 Running 0 14m
gpu-operator-node-feature-discovery-worker-27228 1/1 Running 0 14m
gpu-operator-node-feature-discovery-worker-87vhw 1/1 Running 0 14m
gpu-operator-node-feature-discovery-worker-k468n 1/1 Running 0 14m
gpu-operator-node-feature-discovery-worker-zn5hj 1/1 Running 0 14m
nvidia-container-toolkit-daemonset-b5ncn 0/1 Init:0/1 0 13m
nvidia-container-toolkit-daemonset-f9x7k 0/1 Init:0/1 0 13m
nvidia-container-toolkit-daemonset-k82q8 0/1 Init:0/1 0 13m
nvidia-container-toolkit-daemonset-rd6bw 0/1 Init:0/1 0 13m
nvidia-dcgm-exporter-5wwfb 0/1 Init:0/1 0 13m
nvidia-dcgm-exporter-7lvbz 0/1 Init:0/1 0 13m
nvidia-dcgm-exporter-7vdph 0/1 Init:0/1 0 13m
nvidia-dcgm-exporter-wbkln 0/1 Init:0/1 0 13m
nvidia-device-plugin-daemonset-2tpg7 0/1 Init:0/1 0 13m
nvidia-device-plugin-daemonset-4j5l9 0/1 Init:0/1 0 13m
nvidia-device-plugin-daemonset-655vs 0/1 Init:0/1 0 13m
nvidia-device-plugin-daemonset-lwsvr 0/1 Init:0/1 0 13m
nvidia-driver-daemonset-5.15.0-135-generic-ubuntu22.04-4lc44 0/1 Running 0 13m
nvidia-driver-daemonset-5.15.0-135-generic-ubuntu22.04-mn4zt 0/1 Running 0 13m
nvidia-driver-daemonset-5.15.0-135-generic-ubuntu22.04-mzx4l 0/1 Running 0 13m
nvidia-driver-daemonset-5.15.0-135-generic-ubuntu22.04-s2fr6 0/1 Running 0 13m
nvidia-operator-validator-97dzg 0/1 Init:0/4 0 13m
nvidia-operator-validator-kc74p 0/1 Init:0/4 0 13m
nvidia-operator-validator-mbhqn 0/1 Init:0/4 0 13m
nvidia-operator-validator-w9v8c 0/1 Init:0/4 0 13m
NVIDIA driver DaemonSet logs:
k logs -f nvidia-driver-daemonset-5.15.0-135-generic-ubuntu22.04-4lc44
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver branch 570 for Linux kernel version 5.15.0-135-generic
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
libnvidia-cfg1-570-server libnvidia-compute-570-server
nvidia-compute-utils-570-server nvidia-firmware-570-server-570.86.15
nvidia-kernel-common-570-server nvidia-kernel-source-570-server
Suggested packages:
nvidia-driver-570-server
The following NEW packages will be installed:
libnvidia-cfg1-570-server libnvidia-compute-570-server
libnvidia-decode-570-server libnvidia-encode-570-server
libnvidia-extra-570-server libnvidia-fbc1-570-server
nvidia-compute-utils-570-server nvidia-firmware-570-server-570.86.15
nvidia-headless-no-dkms-570-server nvidia-kernel-common-570-server
nvidia-kernel-source-570-server nvidia-utils-570-server
0 upgraded, 12 newly installed, 0 to remove and 5 not upgraded.
Need to get 0 B/191 MB of archives.
After this operation, 492 MB of additional disk space will be used.
Get:1 file:/usr/local/repos ./ libnvidia-cfg1-570-server 570.86.15-0ubuntu0.22.04.4 [159 kB]
Get:2 file:/usr/local/repos ./ libnvidia-compute-570-server 570.86.15-0ubuntu0.22.04.4 [48.8 MB]
Get:3 file:/usr/local/repos ./ libnvidia-decode-570-server 570.86.15-0ubuntu0.22.04.4 [2839 kB]
Get:4 file:/usr/local/repos ./ libnvidia-encode-570-server 570.86.15-0ubuntu0.22.04.4 [113 kB]
Get:5 file:/usr/local/repos ./ libnvidia-extra-570-server 570.86.15-0ubuntu0.22.04.4 [78.1 kB]
Get:6 file:/usr/local/repos ./ libnvidia-fbc1-570-server 570.86.15-0ubuntu0.22.04.4 [110 kB]
Get:7 file:/usr/local/repos ./ nvidia-compute-utils-570-server 570.86.15-0ubuntu0.22.04.4 [127 kB]
Get:8 file:/usr/local/repos ./ nvidia-firmware-570-server-570.86.15 570.86.15-0ubuntu0.22.04.4 [65.6 MB]
Get:9 file:/usr/local/repos ./ nvidia-kernel-common-570-server 570.86.15-0ubuntu0.22.04.4 [129 kB]
Get:10 file:/usr/local/repos ./ nvidia-kernel-source-570-server 570.86.15-0ubuntu0.22.04.4 [72.6 MB]
Get:11 file:/usr/local/repos ./ nvidia-headless-no-dkms-570-server 570.86.15-0ubuntu0.22.04.4 [10.5 kB]
Get:12 file:/usr/local/repos ./ nvidia-utils-570-server 570.86.15-0ubuntu0.22.04.4 [558 kB]
Selecting previously unselected package libnvidia-cfg1-570-server:amd64.
(Reading database ... 12053 files and directories currently installed.)
Preparing to unpack .../00-libnvidia-cfg1-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking libnvidia-cfg1-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package libnvidia-compute-570-server:amd64.
Preparing to unpack .../01-libnvidia-compute-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking libnvidia-compute-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package libnvidia-decode-570-server:amd64.
Preparing to unpack .../02-libnvidia-decode-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking libnvidia-decode-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package libnvidia-encode-570-server:amd64.
Preparing to unpack .../03-libnvidia-encode-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking libnvidia-encode-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package libnvidia-extra-570-server:amd64.
Preparing to unpack .../04-libnvidia-extra-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking libnvidia-extra-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package libnvidia-fbc1-570-server:amd64.
Preparing to unpack .../05-libnvidia-fbc1-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking libnvidia-fbc1-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package nvidia-compute-utils-570-server.
Preparing to unpack .../06-nvidia-compute-utils-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking nvidia-compute-utils-570-server (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package nvidia-firmware-570-server-570.86.15.
Preparing to unpack .../07-nvidia-firmware-570-server-570.86.15_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking nvidia-firmware-570-server-570.86.15 (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package nvidia-kernel-common-570-server.
Preparing to unpack .../08-nvidia-kernel-common-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking nvidia-kernel-common-570-server (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package nvidia-kernel-source-570-server.
Preparing to unpack .../09-nvidia-kernel-source-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking nvidia-kernel-source-570-server (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package nvidia-headless-no-dkms-570-server.
Preparing to unpack .../10-nvidia-headless-no-dkms-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking nvidia-headless-no-dkms-570-server (570.86.15-0ubuntu0.22.04.4) ...
Selecting previously unselected package nvidia-utils-570-server.
Preparing to unpack .../11-nvidia-utils-570-server_570.86.15-0ubuntu0.22.04.4_amd64.deb ...
Unpacking nvidia-utils-570-server (570.86.15-0ubuntu0.22.04.4) ...
Setting up libnvidia-fbc1-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Setting up nvidia-kernel-source-570-server (570.86.15-0ubuntu0.22.04.4) ...
Setting up libnvidia-cfg1-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Setting up libnvidia-compute-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Setting up libnvidia-extra-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Setting up nvidia-compute-utils-570-server (570.86.15-0ubuntu0.22.04.4) ...
Warning: The home dir /nonexistent you specified can't be accessed: No such file or directory
Adding system user `nvidia-persistenced' (UID 100) ...
Adding new group `nvidia-persistenced' (GID 101) ...
Adding new user `nvidia-persistenced' (UID 100) with group `nvidia-persistenced' ...
Not creating home directory `/nonexistent'.
Setting up nvidia-firmware-570-server-570.86.15 (570.86.15-0ubuntu0.22.04.4) ...
Setting up libnvidia-decode-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Setting up nvidia-utils-570-server (570.86.15-0ubuntu0.22.04.4) ...
Setting up nvidia-kernel-common-570-server (570.86.15-0ubuntu0.22.04.4) ...
Setting up nvidia-headless-no-dkms-570-server (570.86.15-0ubuntu0.22.04.4) ...
Setting up libnvidia-encode-570-server:amd64 (570.86.15-0ubuntu0.22.04.4) ...
Processing triggers for libc-bin (2.35-0ubuntu3.8) ...
Installing Closed NVIDIA driver kernel modules...
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
linux-base linux-image-5.15.0-135-generic linux-modules-5.15.0-135-generic
Suggested packages:
fdutils linux-doc | linux-source-5.15.0 linux-tools
linux-headers-5.15.0-135-generic linux-modules-extra-5.15.0-135-generic
Recommended packages:
grub-pc | grub-efi-amd64 | grub-efi-ia32 | grub | lilo initramfs-tools
| linux-initramfs-tool
The following NEW packages will be installed:
linux-base linux-image-5.15.0-135-generic linux-modules-5.15.0-135-generic
linux-modules-nvidia-570-server-5.15.0-135-generic
linux-objects-nvidia-570-server-5.15.0-135-generic
linux-signatures-nvidia-5.15.0-135-generic
0 upgraded, 6 newly installed, 0 to remove and 5 not upgraded.
Need to get 0 B/121 MB of archives.
After this operation, 307 MB of additional disk space will be used.
Get:1 file:/usr/local/repos ./ linux-base 4.5ubuntu9 [17.8 kB]
Get:2 file:/usr/local/repos ./ linux-modules-5.15.0-135-generic 5.15.0-135.146 [22.7 MB]
Get:3 file:/usr/local/repos ./ linux-image-5.15.0-135-generic 5.15.0-135.146 [11.6 MB]
Get:4 file:/usr/local/repos ./ linux-signatures-nvidia-5.15.0-135-generic 5.15.0-135.146+1 [33.0 kB]
Get:5 file:/usr/local/repos ./ linux-objects-nvidia-570-server-5.15.0-135-generic 5.15.0-135.146+1 [86.5 MB]
Get:6 file:/usr/local/repos ./ linux-modules-nvidia-570-server-5.15.0-135-generic 5.15.0-135.146+1 [16.9 kB]
Preconfiguring packages ...
Selecting previously unselected package linux-base.
(Reading database ... 12650 files and directories currently installed.)
Preparing to unpack .../0-linux-base_4.5ubuntu9_all.deb ...
Unpacking linux-base (4.5ubuntu9) ...
Selecting previously unselected package linux-modules-5.15.0-135-generic.
Preparing to unpack .../1-linux-modules-5.15.0-135-generic_5.15.0-135.146_amd64.deb ...
Unpacking linux-modules-5.15.0-135-generic (5.15.0-135.146) ...
Selecting previously unselected package linux-image-5.15.0-135-generic.
Preparing to unpack .../2-linux-image-5.15.0-135-generic_5.15.0-135.146_amd64.deb ...
Unpacking linux-image-5.15.0-135-generic (5.15.0-135.146) ...
Selecting previously unselected package linux-signatures-nvidia-5.15.0-135-generic.
Preparing to unpack .../3-linux-signatures-nvidia-5.15.0-135-generic_5.15.0-135.146+1_amd64.deb ...
Unpacking linux-signatures-nvidia-5.15.0-135-generic (5.15.0-135.146+1) ...
Selecting previously unselected package linux-objects-nvidia-570-server-5.15.0-135-generic.
Preparing to unpack .../4-linux-objects-nvidia-570-server-5.15.0-135-generic_5.15.0-135.146+1_amd64.deb ...
Unpacking linux-objects-nvidia-570-server-5.15.0-135-generic (5.15.0-135.146+1) ...
Selecting previously unselected package linux-modules-nvidia-570-server-5.15.0-135-generic.
Preparing to unpack .../5-linux-modules-nvidia-570-server-5.15.0-135-generic_5.15.0-135.146+1_amd64.deb ...
Unpacking linux-modules-nvidia-570-server-5.15.0-135-generic (5.15.0-135.146+1) ...
Setting up linux-base (4.5ubuntu9) ...
Setting up linux-objects-nvidia-570-server-5.15.0-135-generic (5.15.0-135.146+1) ...
Setting up linux-image-5.15.0-135-generic (5.15.0-135.146) ...
I: /boot/vmlinuz.old is now a symlink to vmlinuz-5.15.0-135-generic
I: /boot/initrd.img.old is now a symlink to initrd.img-5.15.0-135-generic
I: /boot/vmlinuz is now a symlink to vmlinuz-5.15.0-135-generic
I: /boot/initrd.img is now a symlink to initrd.img-5.15.0-135-generic
Setting up linux-modules-5.15.0-135-generic (5.15.0-135.146) ...
Setting up linux-signatures-nvidia-5.15.0-135-generic (5.15.0-135.146+1) ...
Setting up linux-modules-nvidia-570-server-5.15.0-135-generic (5.15.0-135.146+1) ...
linux-image-nvidia-5.15.0-135-generic: constructing .ko files
nvidia-drm.ko: OK
nvidia-modeset.ko: OK
nvidia-peermem.ko: OK
nvidia-uvm.ko: OK
nvidia.ko: OK
Processing triggers for linux-image-5.15.0-135-generic (5.15.0-135.146) ...
Parsing kernel module parameters...
Configuring the following firmware search path in '/sys/module/firmware_class/parameters/path': /run/nvidia/driver/lib/firmware
WARNING: A search path is already configured in /sys/module/firmware_class/parameters/path
Retaining the current configuration
Loading ipmi and i2c_core kernel modules...
Loading NVIDIA driver kernel modules...
+ modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': No such device
+ modprobe nvidia-uvm
modprobe: ERROR: could not insert 'nvidia_uvm': No such device
+ modprobe nvidia-modeset
modprobe: ERROR: could not insert 'nvidia_modeset': No such device
+ set +o xtrace -o nounset
Starting NVIDIA persistence daemon...
nvidia-persistenced failed to initialize. Check syslog for more details.
Mounting NVIDIA driver rootfs...
Done, now waiting for signal
^C
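To pull just the module-load failures out of a long log like the one above, a small filter helps. This is only a sketch: `failed_modules` is a hypothetical helper name, not part of the GPU Operator, and it assumes the error lines keep the exact `modprobe: ERROR: could not insert ...` wording shown in this log.

```shell
# Hypothetical helper (not part of the GPU Operator): read a driver
# container log on stdin and print one "module: reason" line for each
# module that modprobe failed to insert.
failed_modules() {
  # Matches lines such as:
  #   modprobe: ERROR: could not insert 'nvidia': No such device
  sed -n "s/^modprobe: ERROR: could not insert '\([^']*\)': \(.*\)/\1: \2/p"
}
```

It could be used as `kubectl logs -n gpu-operator <driver-pod> | failed_modules`, where `<driver-pod>` is one of the nvidia-driver-daemonset pods listed earlier.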
Why is the driver not finding the device?
This is what I see on the worker nodes:
lspci | grep -i nvidia
05:00.0 VGA compatible controller: NVIDIA Corporation Device 26b9 (rev a1)
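modprobe prints "No such device" when a module's init routine returns ENODEV, so the message alone does not say why the driver rejected the GPU. A possible next step is to check which kernel driver, if any, is bound to the card and what the kernel log reports. The sketch below only extracts the PCI addresses to inspect; `nvidia_pci_addrs` is a hypothetical helper name, and it assumes the plain `lspci` output format shown above.

```shell
# Hypothetical helper: read `lspci` output on stdin and print the PCI
# address (first field) of every line that mentions NVIDIA.
nvidia_pci_addrs() {
  awk 'tolower($0) ~ /nvidia/ { print $1 }'
}
```

On a worker node, `lspci | nvidia_pci_addrs` should print `05:00.0` for the device above. From there, `lspci -nnk -s 05:00.0` shows the numeric vendor and device IDs along with a "Kernel driver in use" line if a driver is bound, and `dmesg | grep -i nvrm` shows any messages the NVIDIA kernel module logged while it tried to load.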