hydazz commented Oct 19, 2025

This is a starter PR that adds support for the different NVIDIA paths used by Talos OS.

I tested the GPU component with these changes in my environment, and it works.

Feedback is needed: I'm unsure how to add the usr/local/glibc path to CDI cleanly, and I don't believe getTalosLibrarySearchPaths will cut it globally...
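To make the path question concrete, here is a minimal shell sketch of a library search that tries common locations first and then the Talos location mentioned later in this thread (/usr/local/glibc/usr/lib). It is not the PR's implementation: getTalosLibrarySearchPaths is a Go function whose body is not shown here, and both the helper name find_nvml_dir and the list of "standard" directories are assumptions made for illustration.

```shell
# Hypothetical helper: print the first directory under ROOT that contains
# libnvidia-ml.so.1. The directory list is an assumption; only the last
# entry (the Talos location) comes from this thread.
find_nvml_dir() {
  root="${1%/}"   # strip one trailing slash so "/" becomes ""
  for dir in /usr/lib64 /usr/lib/x86_64-linux-gnu /usr/lib/aarch64-linux-gnu \
             /usr/lib /usr/local/glibc/usr/lib; do
    if [ -e "$root$dir/libnvidia-ml.so.1" ]; then
      printf '%s\n' "$root$dir"
      return 0
    fi
  done
  return 1
}

# Example: search the current root filesystem.
find_nvml_dir / || echo "libnvidia-ml.so.1 not found in the listed directories"
```

Appending the Talos directory to one shared list, rather than special-casing it per component, is one way the same path could reach both the library lookup and the CDI spec generation, though whether that fits the driver's structure is an open question in this PR.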

Signed-off-by: hydazz <[email protected]>

copy-pr-bot bot commented Oct 19, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@asymingt

I can independently confirm that this works on my Talos cluster. After installing with helm I finally see resource slices being made available on a machine with five RTX A4000 GPUs. Thank you, @hydazz!

$ kubectl get resourceslices
NAME                                 NODE            DRIVER           POOL            AGE
talos-pxs-ia1-gpu.nvidia.com-gsvft   talos-pxs-ia1   gpu.nvidia.com   talos-pxs-ia1   11m

To reproduce this, you need a container image built with these changes. I have pushed one to asymingt/k8s-dra-driver-gpu. Change this line to asymingt/k8s-dra-driver-gpu:v25.8.0-dev, then install the chart:

helm upgrade -i nvidia-dra-driver-gpu ./k8s-dra-driver-gpu/deployments/helm/nvidia-dra-driver-gpu \
   --create-namespace --namespace drivers \
   --set gpuResourcesEnabledOverride=true \
   --set resources.gpus.enabled=true \
   --set resources.computeDomains.enabled=false \
   --wait

Optionally, to rebuild the container image yourself, install docker, qemu-binfmt, and buildx, check out this code, and run:

export IMAGE_NAME=<your_docker_hub>/k8s-dra-driver-gpu
export VERSION=v25.8.0-dev
export PUSH_ON_BUILD=true
export BUILD_MULTI_ARCH_IMAGES=true

make -f deployments/container/Makefile build

Author

hydazz commented Nov 15, 2025

I believe this is set to be fixed on the Talos side: they would install the NVIDIA components where this driver expects them, rather than the driver adapting to Talos.


asymingt commented Nov 16, 2025

While we wait for Talos to update its driver install location, I've been trying to get MPS working on Talos using this PR branch and the following helm values.

gpuResourcesEnabledOverride: true
resources:
  gpus:
    enabled: true
  computeDomains:
    enabled: false
featureGates:
  MPSSupport: true
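As a sketch of how these values could be applied, the snippet below writes them to a file and passes it to helm with -f, reusing the release name, chart path, and namespace from the earlier install command in this thread. The file name mps-values.yaml and the APPLY switch are assumptions added for illustration; by default the snippet only writes the file.

```shell
# Write the values shown above to a file (the file name is arbitrary).
VALUES_FILE="${VALUES_FILE:-mps-values.yaml}"
cat > "$VALUES_FILE" <<'EOF'
gpuResourcesEnabledOverride: true
resources:
  gpus:
    enabled: true
  computeDomains:
    enabled: false
featureGates:
  MPSSupport: true
EOF

# Only touch the cluster when explicitly asked to (APPLY=1).
if [ "${APPLY:-0}" = "1" ]; then
  helm upgrade -i nvidia-dra-driver-gpu \
    ./k8s-dra-driver-gpu/deployments/helm/nvidia-dra-driver-gpu \
    --create-namespace --namespace drivers \
    -f "$VALUES_FILE" \
    --wait
else
  echo "wrote $VALUES_FILE; set APPLY=1 to run helm upgrade"
fi
```

Keeping the values in a file rather than a chain of --set flags makes it easier to diff the MPS configuration against the earlier GPU-only install.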

Looks like the mps-control-daemon keeps restarting with the following error:

$ kubectl logs mps-control-daemon-49e0f7b0-e884-4ab0-ac35-b50bca50f681-e4dlqgl -n drivers
chroot: can't execute 'sh': No such file or directory

It's probably related to this issue: #469
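That message is what chroot prints when it cannot find the requested program inside the new root, and Talos does not ship a shell on the host. The sketch below shows one way to check whether a given root directory contains an executable sh. It assumes the daemon chroots into the host's driver root, which this thread does not confirm; the helper name has_shell and the list of directories searched are illustrative assumptions.

```shell
# Hypothetical helper: succeed if ROOT contains an executable `sh` in one of
# the directories a typical PATH would search after chroot.
has_shell() {
  root="${1%/}"   # strip one trailing slash so "/" becomes ""
  for dir in /bin /usr/bin /sbin /usr/sbin; do
    # Note: -x follows symlinks relative to the current root, not ROOT,
    # so an absolute symlink may give a misleading answer.
    if [ -x "$root$dir/sh" ]; then
      return 0
    fi
  done
  return 1
}

# Example: an empty directory has no shell, so `chroot DIR sh` would fail there.
empty_root=$(mktemp -d)
if has_shell "$empty_root"; then
  echo "sh found under $empty_root"
else
  echo "no sh under $empty_root"
fi
rmdir "$empty_root"
```

If the check fails for the host root, a fix would need to avoid relying on a shell inside the chroot, for example by running a binary that does exist there.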

I've opened a PR to fix it on your branch: hydazz#1

klueska added the "feature" label (issue/PR that proposes a new feature or functionality) on Nov 24, 2025
klueska added this to the "unscheduled" milestone on Nov 24, 2025
Collaborator

klueska commented Nov 24, 2025

@hydazz given your comment about Talos adjusting themselves to accommodate the existing search paths, how would you propose moving forward with this PR?

Author

hydazz commented Nov 25, 2025

> @hydazz given your comment about Talos adjusting themselves to accommodate the existing search paths, how would you propose moving forward with this PR?

@klueska I don't have definitive knowledge; I inferred that conclusion from this Discord message:
https://discord.com/channels/673534664354430999/942576972943491113/1434096797562703983

> If we could move this to a GitHub discussion under the extensions repo, we could collaborate more. (I believe /usr/local/glibc/usr/lib was the wrong choice of path in the first place, thinking about merged /usr/, though it was kind of done to fix issues with musl libs co-existing.) But let's have a discussion on a good path moving forward, trying to use the operator and DRA plugins as much as possible without platform-specific hacks.

(I could not find the discussion it refers to.)

siderolabs/extensions#836
siderolabs/extensions#476
#605

I don't know whether there are talks between NVIDIA and Talos, or what exists beyond the links above. It could be fixed here (pretty easily?) with something better than getTalosLibrarySearchPaths, or on the extension side, though that's probably a larger change.

Perhaps @frezbo would have more insight?

Labels

feature issue/PR that proposes a new feature or functionality

Projects

Status: Backlog
