GPU Operator breaks with Ubuntu Image on EKS 1.29 #1251

@mukgarg

We have been using the GPU Operator on our EKS clusters with the EKS-optimized AMI, and this setup has worked fine so far. We then updated our node groups to use the Ubuntu-based EKS AMI from Canonical. After this change, we started seeing the errors below in the nvidia-driver-ctr container inside the nvidia-driver-daemonset pod.

Errors from nvidia-driver-ctr container inside the nvidia-driver-daemonset pod:

DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-550.127.08
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.127.08..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.


WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.


========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 550.127.08 for Linux kernel version 5.15.0-1075-aws

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.15.0-1075-aws
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
  You are using:           cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-nano-timer.o] Error 1
make[2]: *** Waiting for unfinished jobs....
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-pci.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-dmabuf.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-i2c.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-procfs.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-pat.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-cray.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-mmap.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-acpi.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-dma.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-p2p.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/os-mlock.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/os-usermap.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vm.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-modeset-interface.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-usermap.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/os-registry.o] Error 1
In file included from ./include/linux/thread_info.h:23,
                 from ./arch/x86/include/asm/preempt.h:7,
                 from ./include/linux/preempt.h:78,
                 from ./include/linux/spinlock.h:55,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-lock.h:29,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:32,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h: In function 'NV_GET_USER_PAGES':
./arch/x86/include/asm/current.h:18:17: warning: passing argument 1 of 'get_user_pages' makes integer from pointer without a cast [-Wint-conversion]
   18 | #define current get_current()
      |                 ^~~~~~~~~~~~~
      |                 |
      |                 struct task_struct *
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:31: note: in expansion of macro 'current'
  109 |         return get_user_pages(current, current->mm, start, nr_pages, write,
      |                               ^~~~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
./include/linux/mm.h:1853:35: note: expected 'long unsigned int' but argument is of type 'struct task_struct *'
 1853 | long get_user_pages(unsigned long start, unsigned long nr_pages,
      |                     ~~~~~~~~~~~~~~^~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:47: warning: passing argument 2 of 'get_user_pages' makes integer from pointer without a cast [-Wint-conversion]
  109 |         return get_user_pages(current, current->mm, start, nr_pages, write,
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
./include/linux/mm.h:1853:56: note: expected 'long unsigned int' but argument is of type 'struct mm_struct *'
 1853 | long get_user_pages(unsigned long start, unsigned long nr_pages,
      |                                          ~~~~~~~~~~~~~~^~~~~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:60: warning: passing argument 4 of 'get_user_pages' makes pointer from integer without a cast [-Wint-conversion]
  109 |         return get_user_pages(current, current->mm, start, nr_pages, write,
      |                                                            ^~~~~~~~
      |                                                            |
      |                                                            long unsigned int
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
./include/linux/mm.h:1854:46: note: expected 'struct page **' but argument is of type 'long unsigned int'
 1854 |        unsigned int gup_flags, struct page **pages,
      |                                ~~~~~~~~~~~~~~^~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:70: warning: passing argument 5 of 'get_user_pages' makes pointer from integer without a cast [-Wint-conversion]
  109 |         return get_user_pages(current, current->mm, start, nr_pages, write,
      |                                                                      ^~~~~
      |                                                                      |
      |                                                                      int
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
./include/linux/mm.h:1855:32: note: expected 'struct vm_area_struct **' but argument is of type 'int'
 1855 |        struct vm_area_struct **vmas);
      |        ~~~~~~~~~~~~~~~~~~~~~~~~^~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:16: error: too many arguments to function 'get_user_pages'
  109 |         return get_user_pages(current, current->mm, start, nr_pages, write,
      |                ^~~~~~~~~~~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.c:27:
./include/linux/mm.h:1853:6: note: declared here
 1853 | long get_user_pages(unsigned long start, unsigned long nr_pages,
      |      ^~~~~~~~~~~~~~
In file included from ./include/linux/thread_info.h:23,
                 from ./arch/x86/include/asm/preempt.h:7,
                 from ./include/linux/preempt.h:78,
                 from ./include/linux/spinlock.h:55,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-lock.h:29,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:32,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h: In function 'NV_GET_USER_PAGES':
./arch/x86/include/asm/current.h:18:17: warning: passing argument 1 of 'get_user_pages' makes integer from pointer without a cast [-Wint-conversion]
   18 | #define current get_current()
      |                 ^~~~~~~~~~~~~
      |                 |
      |                 struct task_struct *
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:31: note: in expansion of macro 'current'
  109 |         return get_user_pages(current, current->mm, start, nr_pages, write,
      |                               ^~~~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
./include/linux/mm.h:1853:35: note: expected 'long unsigned int' but argument is of type 'struct task_struct *'
 1853 | long get_user_pages(unsigned long start, unsigned long nr_pages,
      |                     ~~~~~~~~~~~~~~^~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:47: warning: passing argument 2 of 'get_user_pages' makes integer from pointer without a cast [-Wint-conversion]
  109 |         return get_user_pages(current, current->mm, start, nr_pages, write,
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
./include/linux/mm.h:1853:56: note: expected 'long unsigned int' but argument is of type 'struct mm_struct *'
 1853 | long get_user_pages(unsigned long start, unsigned long nr_pages,
      |                                          ~~~~~~~~~~~~~~^~~~~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:60: warning: passing argument 4 of 'get_user_pages' makes pointer from integer without a cast [-Wint-conversion]
  109 |         return get_user_pages(current, current->mm, start, nr_pages, write,
      |                                                            ^~~~~~~~
      |                                                            |
      |                                                            long unsigned int
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
./include/linux/mm.h:1854:46: note: expected 'struct page **' but argument is of type 'long unsigned int'
 1854 |        unsigned int gup_flags, struct page **pages,
      |                                ~~~~~~~~~~~~~~^~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:70: warning: passing argument 5 of 'get_user_pages' makes pointer from integer without a cast [-Wint-conversion]
  109 |         return get_user_pages(current, current->mm, start, nr_pages, write,
      |                                                                      ^~~~~
      |                                                                      |
      |                                                                      int
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
./include/linux/mm.h:1855:32: note: expected 'struct vm_area_struct **' but argument is of type 'int'
 1855 |        struct vm_area_struct **vmas);
      |        ~~~~~~~~~~~~~~~~~~~~~~~~^~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:16: error: too many arguments to function 'get_user_pages'
  109 |         return get_user_pages(current, current->mm, start, nr_pages, write,
      |                ^~~~~~~~~~~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.c:27:
./include/linux/mm.h:1853:6: note: declared here
 1853 | long get_user_pages(unsigned long start, unsigned long nr_pages,
      |      ^~~~~~~~~~~~~~
In file included from ./include/linux/thread_info.h:23,
                 from ./arch/x86/include/asm/preempt.h:7,
                 from ./include/linux/preempt.h:78,
                 from ./include/linux/spinlock.h:55,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-lock.h:29,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:32,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h: In function 'NV_GET_USER_PAGES':
./arch/x86/include/asm/current.h:18:17: warning: passing argument 1 of 'get_user_pages' makes integer from pointer without a cast [-Wint-conversion]
   18 | #define current get_current()
      |                 ^~~~~~~~~~~~~
      |                 |
      |                 struct task_struct *
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:31: note: in expansion of macro 'current'
  109 |         return get_user_pages(current, current->mm, start, nr_pages, write,
      |                               ^~~~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
./include/linux/mm.h:1853:35: note: expected 'long unsigned int' but argument is of type 'struct task_struct *'
 1853 | long get_user_pages(unsigned long start, unsigned long nr_pages,
      |                     ~~~~~~~~~~~~~~^~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:47: warning: passing argument 2 of 'get_user_pages' makes integer from pointer without a cast [-Wint-conversion]
  109 |         return get_user_pages(current, current->mm, start, nr_pages, write,
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
./include/linux/mm.h:1853:56: note: expected 'long unsigned int' but argument is of type 'struct mm_struct *'
 1853 | long get_user_pages(unsigned long start, unsigned long nr_pages,
      |                                          ~~~~~~~~~~~~~~^~~~~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:60: warning: passing argument 4 of 'get_user_pages' makes pointer from integer without a cast [-Wint-conversion]
  109 |         return get_user_pages(current, current->mm, start, nr_pages, write,
      |                                                            ^~~~~~~~
      |                                                            |
      |                                                            long unsigned int
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
./include/linux/mm.h:1854:46: note: expected 'struct page **' but argument is of type 'long unsigned int'
 1854 |        unsigned int gup_flags, struct page **pages,
      |                                ~~~~~~~~~~~~~~^~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:70: warning: passing argument 5 of 'get_user_pages' makes pointer from integer without a cast [-Wint-conversion]
  109 |         return get_user_pages(current, current->mm, start, nr_pages, write,
      |                                                                      ^~~~~
      |                                                                      |
      |                                                                      int
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
./include/linux/mm.h:1855:32: note: expected 'struct vm_area_struct **' but argument is of type 'int'
 1855 |        struct vm_area_struct **vmas);
      |        ~~~~~~~~~~~~~~~~~~~~~~~~^~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:34,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
/usr/src/nvidia-550.127.08/kernel/common/inc/nv-mm.h:109:16: error: too many arguments to function 'get_user_pages'
  109 |         return get_user_pages(current, current->mm, start, nr_pages, write,
      |                ^~~~~~~~~~~~~~
In file included from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-pgprot.h:30,
                 from /usr/src/nvidia-550.127.08/kernel/common/inc/nv-linux.h:33,
                 from /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.c:27:
./include/linux/mm.h:1853:6: note: declared here
 1853 | long get_user_pages(unsigned long start, unsigned long nr_pages,
      |      ^~~~~~~~~~~~~~
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/os-interface.o] Error 1
cc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/nv-vtophys.o] Error 1
make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-550.127.08/kernel/nvidia/os-pci.o] Error 1
make[1]: *** [Makefile:1910: /usr/src/nvidia-550.127.08/kernel] Error 2
make: *** [Makefile:89: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
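
The log above mixes two different failure signatures: compiler processes that were terminated by a signal ("Killed signal terminated program cc1") and a genuine compile error in nv-mm.h ("too many arguments to function 'get_user_pages'"). For triage, a minimal sketch that separates the two is shown here. The function name `summarize_build_log` and the regular expressions are illustrative assumptions derived only from the lines in this report, not part of the GPU Operator or the driver container.

```python
import re
from collections import Counter

# Patterns derived from the log lines in this report (assumed, not official).
KILLED = re.compile(r"fatal error: Killed signal terminated program (\S+)")
COMPILE_ERROR = re.compile(r"^(\S+?):(\d+):(\d+): error: (.+)$")
FAILED_OBJECT = re.compile(r"make\[\d+\]: \*\*\* \[[^\]]*?/([\w-]+\.o)\] Error \d+")


def summarize_build_log(text):
    """Count the distinct failure kinds found in a driver build log."""
    summary = {"killed": 0, "errors": Counter(), "failed_objects": []}
    for raw in text.splitlines():
        line = raw.strip()
        if KILLED.search(line):
            # The compiler process was terminated by a signal.
            summary["killed"] += 1
            continue
        match = COMPILE_ERROR.match(line)
        if match:
            # A real compiler diagnostic, keyed by its message text.
            summary["errors"][match.group(4)] += 1
            continue
        match = FAILED_OBJECT.search(line)
        if match:
            # Object file that make reported as failed.
            summary["failed_objects"].append(match.group(1))
    return summary
```

Running it over the saved container log (for example, the output of `kubectl logs` for the nvidia-driver-ctr container) gives a count of killed compiler processes, the distinct compiler error messages, and the list of object files that failed to build.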

Setup details:

Kubernetes version: v1.28
GPU Operator version: v24.9.1
GPU Driver version: 550.127.08
Ubuntu AMI Name: ubuntu-eks/k8s_1.28/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20250123
Kernel Version: 5.15.0-1075-aws

Helm values used for installing GPU Operator:

dcgmExporter:
  config:
    name: nvidia-metrics-config
  env:
  - name: DCGM_EXPORTER_COLLECTORS
    value: /etc/dcgm-exporter/dcgm-metrics.csv
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
  resources:
    requests:
      cpu: 100m
      memory: 384Mi
    limits:
      cpu: 200m
      memory: 460Mi
devicePlugin:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
driver:
  # Prevent the GPU operator from appending the OS version to the image tag by
  # specifying it as a digest, based on information in this comment:
  # https://github.com/NVIDIA/gpu-operator/issues/542#issuecomment-1612215289
  # This digest value can be determined with crane as follows:
  # crane digest nvcr.io/nvidia/driver:550.127.08-ubuntu20.04
  # This will need to be changed when we upgrade to a newer version of the operator and host OS
  version: sha256:042659b349b3d6d915e551302bce80d2b799918fec31f747ae6a7c7ee4a9fc97
  upgradePolicy:
    autoUpgrade: false
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
gfd:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
operator:
  resources:
    requests:
      cpu: 400m
      memory: 800Mi
    limits:
      cpu: 500m
      memory: 1000Mi
toolkit:
  env:
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"
psp:
  enabled: false
node-feature-discovery:
  # Override name to avoid hitting 63 character limits
  fullnameOverride: "nfd"
  master:
    resources:
      requests:
        cpu: 400m
        memory: 800Mi
      limits:
        cpu: 500m
        memory: 1000Mi
  worker:
    nodeSelector:
      node.abakus.volvocars.ai/has-gpu: "true"
    tolerations:
    - key: nvidia.com/gpu
      operator: Exists
validator:
  plugin:
    env:
    - name: WITH_WORKLOAD
      value: "false"
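
To make the intent of the `driver.version` comment in these values concrete, here is a minimal sketch of the two ways the image reference can end up being formed: a plain version gets the OS suffix appended to the tag, while a digest pins the image directly. The helper `driver_image_ref` is hypothetical and only mirrors the behavior described in that comment; it is not the operator's actual code.

```python
def driver_image_ref(repository, image, version, os_tag):
    """Build a driver image reference as described in the values comment.

    Hypothetical helper for illustration only.
    """
    base = f"{repository}/{image}"
    if version.startswith("sha256:"):
        # A digest identifies one exact image, so no OS suffix is appended.
        return f"{base}@{version}"
    # A plain version gets the OS tag appended, e.g. 550.127.08-ubuntu20.04.
    return f"{base}:{version}-{os_tag}"


if __name__ == "__main__":
    print(driver_image_ref("nvcr.io/nvidia", "driver", "550.127.08", "ubuntu20.04"))
```

With a digest in `driver.version`, the image stays fixed even if the node OS changes, which is why the comment notes that the value has to be updated by hand on operator or host OS upgrades.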

Current status of pods:

❯ kubectl get pods -n abakus-gpu-support
NAME                                       READY   STATUS     RESTARTS        AGE
gpu-feature-discovery-lcsjj                0/1     Init:0/2   0               53m
nfd-worker-nnzdk                           1/1     Running    0               54m
nvidia-container-toolkit-daemonset-tv9d4   0/1     Init:0/1   0               53m
nvidia-dcgm-exporter-hgqqf                 0/1     Init:0/1   0               53m
nvidia-device-plugin-daemonset-4sd7c       0/1     Init:0/1   0               53m
nvidia-driver-daemonset-tfhwm              0/1     Running    7 (3m51s ago)   54m
nvidia-operator-validator-klzrj            0/1     Init:0/4   0               53m
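
For reference, a minimal sketch that extracts the pods which are not fully ready from a `kubectl get pods` table like the one above. The function name `not_ready_pods` is an illustrative assumption; it reads only the NAME, READY and STATUS columns, because the RESTARTS column can contain spaces (as in "7 (3m51s ago)").

```python
def not_ready_pods(table):
    """Return (name, status) for every pod whose READY count is below its total."""
    rows = table.strip().splitlines()[1:]  # skip the header row
    result = []
    for row in rows:
        if not row.strip():
            continue
        # Only the first three columns are split out; the rest may contain spaces.
        name, ready, status = row.split(None, 3)[:3]
        have, want = (int(part) for part in ready.split("/"))
        if have < want:
            result.append((name, status))
    return result
```

Applied to the table above, this would list every pod except nfd-worker-nnzdk, which is the only one reporting 1/1.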

Events from GPU Operator validator pod:

Events:
  Type     Reason                  Age                  From     Message
  ----     ------                  ----                 ----     -------
  Warning  FailedCreatePodSandBox  28s (x303 over 65m)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

Events from GPU Driver daemonset pod:

Events:
  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Warning  Unhealthy  2m41s (x310 over 64m)  kubelet  Startup probe failed: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

We suspect that the issue is caused by the image we are using: nvcr.io/nvidia/driver:550.127.08-ubuntu20.04. We could not find a precompiled image for the 5.15.0-1075-aws kernel here, which might have resolved the issue for us. Could someone please confirm what we need to change for Ubuntu images to work in our setup?
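
When looking for a precompiled image, it helps to know which tag to search for. The sketch below builds a candidate tag from the driver version, kernel release and OS tag in this report. It assumes the precompiled tag layout is `<driver-branch>-<kernel-release>-<os-tag>`; that layout is an assumption here and should be checked against the NVIDIA documentation, and the helper `precompiled_tag` is hypothetical. It says nothing about whether such a tag is actually published.

```python
def precompiled_tag(driver_version, kernel_release, os_tag):
    """Build a candidate precompiled driver tag (assumed layout, see lead-in)."""
    # The driver branch is taken to be the major version, e.g. "550" from "550.127.08".
    branch = driver_version.split(".")[0]
    return f"{branch}-{kernel_release}-{os_tag}"


if __name__ == "__main__":
    # Values from the setup details in this report.
    print(precompiled_tag("550.127.08", "5.15.0-1075-aws", "ubuntu20.04"))
```

The resulting string can then be checked against the registry, for example with the same `crane` tool mentioned in the Helm values comment.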
