Description
We have been using the GPU Operator on our EKS clusters with the EKS Optimized AMI, and this setup has been working fine. We then updated our node groups to use the Ubuntu-based AMI for EKS from Canonical. After this change, we started seeing the errors below in the nvidia-driver-ctr container inside the nvidia-driver-daemonset pod.
Errors from nvidia-driver-ctr container inside the nvidia-driver-daemonset pod:
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-550.127.08
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.127.08..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 550.127.08 for Linux kernel version 5.15.0-1075-aws
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.15.0-1075-aws
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
warning: the compiler differs from the one used to build the kernel
The kernel was built by: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
You are using: cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-nano-timer.o] Error 1
make[2]: *** Waiting for unfinished jobs....
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-pci.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-dmabuf.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-i2c.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-procfs.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-pat.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-cray.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-mmap.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-acpi.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-dma.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-p2p.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/os-mlock.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/os-usermap.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vm.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-modeset-interface.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-usermap.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/os-registry.o] Error 1
In file included from ./include/linux/thread_info.h:23,
from ./arch/x86/include/asm/preempt.h:7,
from ./include/linux/preempt.h:78,
from ./include/linux/spinlock.h:55,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-lock.h:29,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:32,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h: In function 'NV_GET_USER_PAGES':
./arch/x86/include/asm/current.h:18:17: warning: passing argument 1 of 'get_user_pages' makes integer from pointer without a cast [-Wint-conversion]
18 | #define current get_current()
| ^~~~~~~~~~~~~
| |
| struct task_struct *
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:31: note: in expansion of macro 'current'
109 | return get_user_pages(current, current->mm, start, nr_pages, write,
| ^~~~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
./include/linux/mm.h:1853:35: note: expected 'long unsigned int' but argument is of type 'struct task_struct *'
1853 | long get_user_pages(unsigned long start, unsigned long nr_pages,
| ~~~~~~~~~~~~~~^~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:47: warning: passing argument 2 of 'get_user_pages' makes integer from pointer without a cast [-Wint-conversion]
109 | return get_user_pages(current, current->mm, start, nr_pages, write,
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
./include/linux/mm.h:1853:56: note: expected 'long unsigned int' but argument is of type 'struct mm_struct *'
1853 | long get_user_pages(unsigned long start, unsigned long nr_pages,
| ~~~~~~~~~~~~~~^~~~~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:60: warning: passing argument 4 of 'get_user_pages' makes pointer from integer without a cast [-Wint-conversion]
109 | return get_user_pages(current, current->mm, start, nr_pages, write,
| ^~~~~~~~
| |
| long unsigned int
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
./include/linux/mm.h:1854:46: note: expected 'struct page **' but argument is of type 'long unsigned int'
1854 | unsigned int gup_flags, struct page **pages,
| ~~~~~~~~~~~~~~^~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:70: warning: passing argument 5 of 'get_user_pages' makes pointer from integer without a cast [-Wint-conversion]
109 | return get_user_pages(current, current->mm, start, nr_pages, write,
| ^~~~~
| |
| int
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
./include/linux/mm.h:1855:32: note: expected 'struct vm_area_struct **' but argument is of type 'int'
1855 | struct vm_area_struct **vmas);
| ~~~~~~~~~~~~~~~~~~~~~~~~^~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:16: error: too many arguments to function 'get_user_pages'
109 | return get_user_pages(current, current->mm, start, nr_pages, write,
| ^~~~~~~~~~~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
./include/linux/mm.h:1853:6: note: declared here
1853 | long get_user_pages(unsigned long start, unsigned long nr_pages,
| ^~~~~~~~~~~~~~
In file included from ./include/linux/thread_info.h:23,
from ./arch/x86/include/asm/preempt.h:7,
from ./include/linux/preempt.h:78,
from ./include/linux/spinlock.h:55,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-lock.h:29,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:32,
from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h: In function 'NV_GET_USER_PAGES':
./arch/x86/include/asm/current.h:18:17: warning: passing argument 1 of 'get_user_pages' makes integer from pointer without a cast [-Wint-conversion]
18 | #define current get_current()
| ^~~~~~~~~~~~~
| |
| struct task_struct *
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:31: note: in expansion of macro 'current'
109 | return get_user_pages(current, current->mm, start, nr_pages, write,
| ^~~~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
./include/linux/mm.h:1853:35: note: expected 'long unsigned int' but argument is of type 'struct task_struct *'
1853 | long get_user_pages(unsigned long start, unsigned long nr_pages,
| ~~~~~~~~~~~~~~^~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:47: warning: passing argument 2 of 'get_user_pages' makes integer from pointer without a cast [-Wint-conversion]
109 | return get_user_pages(current, current->mm, start, nr_pages, write,
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
./include/linux/mm.h:1853:56: note: expected 'long unsigned int' but argument is of type 'struct mm_struct *'
1853 | long get_user_pages(unsigned long start, unsigned long nr_pages,
| ~~~~~~~~~~~~~~^~~~~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:60: warning: passing argument 4 of 'get_user_pages' makes pointer from integer without a cast [-Wint-conversion]
109 | return get_user_pages(current, current->mm, start, nr_pages, write,
| ^~~~~~~~
| |
| long unsigned int
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
./include/linux/mm.h:1854:46: note: expected 'struct page **' but argument is of type 'long unsigned int'
1854 | unsigned int gup_flags, struct page **pages,
| ~~~~~~~~~~~~~~^~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:70: warning: passing argument 5 of 'get_user_pages' makes pointer from integer without a cast [-Wint-conversion]
109 | return get_user_pages(current, current->mm, start, nr_pages, write,
| ^~~~~
| |
| int
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
./include/linux/mm.h:1855:32: note: expected 'struct vm_area_struct **' but argument is of type 'int'
1855 | struct vm_area_struct **vmas);
| ~~~~~~~~~~~~~~~~~~~~~~~~^~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:16: error: too many arguments to function 'get_user_pages'
109 | return get_user_pages(current, current->mm, start, nr_pages, write,
| ^~~~~~~~~~~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
./include/linux/mm.h:1853:6: note: declared here
1853 | long get_user_pages(unsigned long start, unsigned long nr_pages,
| ^~~~~~~~~~~~~~
In file included from ./include/linux/thread_info.h:23,
from ./arch/x86/include/asm/preempt.h:7,
from ./include/linux/preempt.h:78,
from ./include/linux/spinlock.h:55,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-lock.h:29,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:32,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h: In function 'NV_GET_USER_PAGES':
./arch/x86/include/asm/current.h:18:17: warning: passing argument 1 of 'get_user_pages' makes integer from pointer without a cast [-Wint-conversion]
18 | #define current get_current()
| ^~~~~~~~~~~~~
| |
| struct task_struct *
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:31: note: in expansion of macro 'current'
109 | return get_user_pages(current, current->mm, start, nr_pages, write,
| ^~~~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
./include/linux/mm.h:1853:35: note: expected 'long unsigned int' but argument is of type 'struct task_struct *'
1853 | long get_user_pages(unsigned long start, unsigned long nr_pages,
| ~~~~~~~~~~~~~~^~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:47: warning: passing argument 2 of 'get_user_pages' makes integer from pointer without a cast [-Wint-conversion]
109 | return get_user_pages(current, current->mm, start, nr_pages, write,
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
./include/linux/mm.h:1853:56: note: expected 'long unsigned int' but argument is of type 'struct mm_struct *'
1853 | long get_user_pages(unsigned long start, unsigned long nr_pages,
| ~~~~~~~~~~~~~~^~~~~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:60: warning: passing argument 4 of 'get_user_pages' makes pointer from integer without a cast [-Wint-conversion]
109 | return get_user_pages(current, current->mm, start, nr_pages, write,
| ^~~~~~~~
| |
| long unsigned int
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
./include/linux/mm.h:1854:46: note: expected 'struct page **' but argument is of type 'long unsigned int'
1854 | unsigned int gup_flags, struct page **pages,
| ~~~~~~~~~~~~~~^~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:70: warning: passing argument 5 of 'get_user_pages' makes pointer from integer without a cast [-Wint-conversion]
109 | return get_user_pages(current, current->mm, start, nr_pages, write,
| ^~~~~
| |
| int
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
./include/linux/mm.h:1855:32: note: expected 'struct vm_area_struct **' but argument is of type 'int'
1855 | struct vm_area_struct **vmas);
| ~~~~~~~~~~~~~~~~~~~~~~~~^~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:16: error: too many arguments to function 'get_user_pages'
109 | return get_user_pages(current, current->mm, start, nr_pages, write,
| ^~~~~~~~~~~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
./include/linux/mm.h:1853:6: note: declared here
1853 | long get_user_pages(unsigned long start, unsigned long nr_pages,
| ^~~~~~~~~~~~~~
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.o] Error 1
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.o] Error 1
make[1]: *** [Makefile:1910: /usr/src/nvidia-550.127.08/kernel] Error 2
make: *** [Makefile:89: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
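The log above mixes two different kinds of failure: cc1 processes terminated by a signal, and an actual compile error about get_user_pages. A minimal sketch for tallying them separately from a saved copy of the log follows. The function name summarize and the inline sample are illustrative only; the patterns are copied from the messages in this report. A killed cc1 is often a sign of memory pressure during the build, but the log alone does not confirm that.

```python
import re
from collections import Counter

# Patterns copied from the messages in the driver container log above.
KILLED = re.compile(r"fatal error: Killed signal terminated program (\S+)")
COMPILE_ERROR = re.compile(r"^(?P<file>\S+?):\d+:\d+: error: (?P<msg>.+)$")


def summarize(log_text):
    """Count compiler processes killed by a signal and distinct compile errors."""
    killed = Counter()
    errors = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        match = KILLED.search(line)
        if match:
            killed[match.group(1)] += 1
            continue
        match = COMPILE_ERROR.match(line)
        if match:
            errors[match.group("msg")] += 1
    return killed, errors


# Small excerpt of the log, used here only to exercise the function.
sample = """\
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:16: error: too many arguments to function 'get_user_pages'
cc: fatal error: Killed signal terminated program cc1
"""

killed, errors = summarize(sample)
print(dict(killed))
print(dict(errors))
```

Running this against the full log (for example, the output of kubectl logs saved to a file) gives a quick view of how many compile jobs were killed versus how many failed on the get_user_pages signature.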
Setup details:
Kubernetes version: v1.28
GPU Operator version: v24.9.1
GPU Driver version: 550.127.08
Ubuntu AMI Name: ubuntu-eks/k8s_1.28/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20250123
Kernel Version: 5.15.0-1075-aws
Helm values used for installing GPU Operator:
dcgmExporter:
  config:
    name: nvidia-metrics-config
  env:
    - name: DCGM_EXPORTER_COLLECTORS
      value: /etc/dcgm-exporter/dcgm-metrics.csv
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
  resources:
    requests:
      cpu: 100m
      memory: 384Mi
    limits:
      cpu: 200m
      memory: 460Mi
devicePlugin:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
driver:
  # Prevent the GPU operator from appending the OS version to the image tag by
  # specifying it as a digest, based on information in this comment:
  # https://github.com/NVIDIA/gpu-operator/issues/542#issuecomment-1612215289
  # This digest value can be determined with crane as follows:
  # crane digest nvcr.io/nvidia/driver:550.127.08-ubuntu20.04
  # This will need to be changed when we upgrade to a newer version of the operator and host OS
  version: sha256:042659b349b3d6d915e551302bce80d2b799918fec31f747ae6a7c7ee4a9fc97
  upgradePolicy:
    autoUpgrade: false
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
gfd:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
operator:
  resources:
    requests:
      cpu: 400m
      memory: 800Mi
    limits:
      cpu: 500m
      memory: 1000Mi
toolkit:
  env:
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"
psp:
  enabled: false
node-feature-discovery:
  # Override name to avoid hitting 63 character limits
  fullnameOverride: "nfd"
  master:
    resources:
      requests:
        cpu: 400m
        memory: 800Mi
      limits:
        cpu: 500m
        memory: 1000Mi
  worker:
    nodeSelector:
      node.abakus.volvocars.ai/has-gpu: "true"
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
validator:
  plugin:
    env:
      - name: WITH_WORKLOAD
        value: "false"
Current status of pods:
❯ kubectl get pods -n abakus-gpu-support
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-lcsjj 0/1 Init:0/2 0 53m
nfd-worker-nnzdk 1/1 Running 0 54m
nvidia-container-toolkit-daemonset-tv9d4 0/1 Init:0/1 0 53m
nvidia-dcgm-exporter-hgqqf 0/1 Init:0/1 0 53m
nvidia-device-plugin-daemonset-4sd7c 0/1 Init:0/1 0 53m
nvidia-driver-daemonset-tfhwm 0/1 Running 7 (3m51s ago) 54m
nvidia-operator-validator-klzrj 0/1 Init:0/4 0 53m
Events from GPU Operator validator pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 28s (x303 over 65m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
Events from GPU Driver daemonset pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 2m41s (x310 over 64m) kubelet Startup probe failed: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
We suspect that this issue is caused by the image we are using: nvcr.io/nvidia/driver:550.127.08-ubuntu20.04. We couldn't find a precompiled image for the 5.15.0-1075-aws kernel version here, which might have resolved this issue for us. Could someone please confirm what we need to change for Ubuntu images to work in our setup?
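For reference, this is a sketch of how we composed the precompiled image tag we searched for. It assumes the tag layout described in NVIDIA's precompiled driver container documentation, `<driver-branch>-<kernel-version>-<os-tag>`; the helper names are ours and purely illustrative, and the resulting tag is not confirmed to exist in the registry.

```python
def driver_branch(driver_version: str) -> str:
    """Return the major branch of a driver version, e.g. '550' for '550.127.08'."""
    return driver_version.split(".", 1)[0]


def precompiled_driver_tag(branch: str, kernel_version: str, os_tag: str) -> str:
    """Compose a tag in the assumed <driver-branch>-<kernel-version>-<os-tag> layout."""
    return f"{branch}-{kernel_version}-{os_tag}"


# Values taken from the setup details in this report.
tag = precompiled_driver_tag(
    driver_branch("550.127.08"),
    "5.15.0-1075-aws",
    "ubuntu20.04",
)
image = f"nvcr.io/nvidia/driver:{tag}"
print(image)
```

The kernel version is the value reported by uname -r on the node, which matches the "Proceeding with Linux kernel version" line in the driver container log.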