Description
I'm running the v24.9.0 release of the NVIDIA GPU Operator and attempted to upgrade my nodes from Talos 1.8.2 to 1.9.0-alpha.2. However, the operator is now unable to find and validate the drivers. I previously had to make some custom modifications to the operator validator logic to make it search under /glibc/lib, relative to the driverInstallDir, but these no longer help either.
These are the driverInstallDir values I have tried, with no success:
- /run/nvidia/driver (the default one from Nvidia)
- /usr/local
- /usr/local/glibc
- /usr/local/glibc/usr
From browsing the Talos filesystem, as far as I can tell, nvidia-smi and the other executables are located in /usr/local/bin, while all the libraries are now located under /usr/lib/glibc/lib64 and symlinked to a few other places as well.
As the NVIDIA components do not search glibc by default, I cannot see what value of driverInstallDir would currently allow these components to find both the libraries and the required binaries. (For an example of the discovery logic, see the GPU Operator validator: https://github.com/NVIDIA/gpu-operator/blob/79b1240221f22bbbc60c6c4b659aace48f0b3f42/validator/find.go#L35. The discovery of the binaries is a few lines below.)
From the description of c7eb377, it seemed like it should "just work" now with the gpu operator. Any pointers as to what I might be doing wrong?