gpu-operator on talos 1.9 alpha #527

@Hexoplon

I'm running the v24.9.0 release of the Nvidia GPU Operator, and attempted to upgrade my nodes from Talos 1.8.2 to 1.9.0-alpha.2. However, the operator is now unable to find and validate the drivers. I previously had to make some custom modifications to the operator validator logic so that it searches under /glibc/lib, relative to the driverInstallDir, but these no longer help either.

These are the driverInstallDir values I have tried, with no success:

  • /run/nvidia/driver (the default one from Nvidia)
  • /usr/local
  • /usr/local/glibc
  • /usr/local/glibc/usr

From browsing the Talos filesystem, as far as I can tell, nvidia-smi and other executables are located in /usr/local/bin, while all the libraries are now located under /usr/lib/glibc/lib64 and symlinked to a few other places as well.
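To make the layout concrete, here is a small Python sketch that computes the deepest directory containing both locations mentioned above. The two paths come from the description; the variable names are only illustrative.

```python
import posixpath

# Locations observed on the Talos 1.9.0-alpha.2 node (taken from the description).
BIN_DIR = "/usr/local/bin"
LIB_DIR = "/usr/lib/glibc/lib64"

# Deepest directory that is an ancestor of both locations.
common_root = posixpath.commonpath([BIN_DIR, LIB_DIR])

# Sub-paths a component rooted at common_root would have to search.
bin_rel = posixpath.relpath(BIN_DIR, common_root)
lib_rel = posixpath.relpath(LIB_DIR, common_root)

print(common_root, bin_rel, lib_rel)
```

In other words, the only shared root is /usr, and a component rooted there would need to look in local/bin for the binaries and in lib/glibc/lib64 for the libraries.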

As the Nvidia components do not search glibc by default, I cannot see which value of driverInstallDir would currently allow these components to find both the libraries and the required binaries. (For an example of the discovery logic in the gpu operator validator, see https://github.com/NVIDIA/gpu-operator/blob/79b1240221f22bbbc60c6c4b659aace48f0b3f42/validator/find.go#L35; the discovery of the binaries is a few lines below.)
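The problem can be reproduced offline with a rough Python sketch. It is not the validator's code: `LIB_SUBDIRS`, `BIN_SUBDIRS`, and the library name `libnvidia-ml.so.1` are assumptions chosen for illustration, and the real search lists in validator/find.go may differ. The fake tree only mirrors the two locations described above and ignores the symlinks.

```python
import os
import tempfile

# Hypothetical search lists, for illustration only; the real lists live in
# validator/find.go and may differ.
LIB_SUBDIRS = ["usr/lib64", "usr/lib/x86_64-linux-gnu", "lib64"]
BIN_SUBDIRS = ["usr/bin", "usr/sbin", "bin", "sbin"]

# The driverInstallDir values tried in this issue.
CANDIDATES = [
    "/run/nvidia/driver",
    "/usr/local",
    "/usr/local/glibc",
    "/usr/local/glibc/usr",
]


def find_under(root, subdirs, name):
    """Return the first existing path root/<subdir>/<name>, or None."""
    for sub in subdirs:
        candidate = os.path.join(root, sub, name)
        if os.path.exists(candidate):
            return candidate
    return None


def probe(root, lib_name="libnvidia-ml.so.1", bin_name="nvidia-smi"):
    """Report whether a library and a binary are discoverable under root."""
    return {
        "lib": find_under(root, LIB_SUBDIRS, lib_name) is not None,
        "bin": find_under(root, BIN_SUBDIRS, bin_name) is not None,
    }


def make_file(path):
    """Create an empty file, including any missing parent directories."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()


def simulate():
    """Probe every candidate against a fake tree mirroring the Talos layout."""
    with tempfile.TemporaryDirectory() as fake:
        make_file(os.path.join(fake, "usr/local/bin/nvidia-smi"))
        make_file(os.path.join(fake, "usr/lib/glibc/lib64/libnvidia-ml.so.1"))
        return {c: probe(os.path.join(fake, c.lstrip("/"))) for c in CANDIDATES}


results = simulate()
for candidate, found in results.items():
    print(candidate, found)
```

Under these assumed search lists, no candidate yields both the library and the binary, which matches what I am seeing. On a real node, `probe()` could be pointed at the actual driverInstallDir as seen from inside the container.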

From the description of c7eb377, it seemed like it should "just work" now with the gpu operator. Any pointers as to what I might be doing wrong?
