
Conversation

@jfroy
Contributor

@jfroy jfroy commented Nov 18, 2025

The `nvinfo` instance passed to `validateFlags` needs to have the driver root path; otherwise, the `HasNvml` call will fail (`libnvidia-ml.so.1` won't be found). This will in turn cause `validateFlags` to error out if `AnyCDIEnabled` is true.

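For context, a rough sketch of the shape of the fix. The import path, the `WithRoot` option name, and the exact signatures are best-effort assumptions for illustration, not copied from the repository; only `nvinfo.New`, `HasNvml`, and the CDI check come from the description above.

```go
package sketch

import (
	"fmt"

	// Assumed import path for the go-nvlib info package.
	nvinfo "github.com/NVIDIA/go-nvlib/pkg/nvlib/info"
)

// newInfoLib constructs the nvinfo instance with the driver root so that
// library discovery (and therefore HasNvml) looks under the mounted driver
// tree instead of only the container's default library paths.
func newInfoLib(driverRoot string) nvinfo.Interface {
	return nvinfo.New(
		nvinfo.WithRoot(driverRoot), // e.g. /driver-root for a container-mounted driver
	)
}

// checkCDI mirrors the validation described above: with any CDI mode enabled,
// a missing NVML library becomes a hard error.
func checkCDI(infolib nvinfo.Interface, anyCDIEnabled bool) error {
	if !anyCDIEnabled {
		return nil
	}
	hasNvml, reason := infolib.HasNvml()
	if !hasNvml {
		return fmt.Errorf("CDI is enabled but NVML is not available: %v", reason)
	}
	return nil
}
```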
@copy-pr-bot

copy-pr-bot bot commented Nov 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ArangoGutierrez
Collaborator

/ok to test 7ecfe53

@elezar
Member

elezar commented Nov 20, 2025

Thanks @jfroy -- the fix is definitely valid, but I'm trying to figure out why this doesn't break on a larger scale at present. Are you using the GPU Operator to deploy the plugin, or are you applying customizations?

(Note that if the device plugin container is started with the nvidia runtime, then libnvidia-ml.so will be available at the "standard" path. This change should ONLY be required if the driver libraries are mounted in via some other mechanism.)
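To illustrate the distinction, a simplified path probe (this is not how go-nvlib actually discovers libraries, and the library directory is just an example):

```go
package sketch

import (
	"os"
	"path/filepath"
)

// nvmlVisible is a simplified illustration of the two setups above: the nvidia
// runtime injects libnvidia-ml.so.1 onto a standard library path, while a
// container-mounted driver only exposes it under the driver root.
func nvmlVisible(driverRoot string) bool {
	candidates := []string{
		"/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1",                           // standard path (nvidia runtime case)
		filepath.Join(driverRoot, "usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1"), // driver mounted elsewhere, e.g. /driver-root
	}
	for _, c := range candidates {
		if _, err := os.Stat(c); err == nil {
			return true
		}
	}
	return false
}
```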

@elezar elezar added this to the v0.18.1 milestone Nov 20, 2025
@elezar
Member

elezar commented Nov 20, 2025

/cherry-pick release-0.18

@elezar elezar merged commit e9fa999 into NVIDIA:main Nov 20, 2025
11 checks passed
@github-actions

❌ Failed to create backport PR for release-0.18

Error: Command failed: git cherry-pick -x 7ecfe53
error: commit 7ecfe53 is a merge but no -m option was given.
fatal: cherry-pick failed

Please backport manually.

@elezar
Member

elezar commented Nov 20, 2025

Created a manual backport as #1528

karthikvetrivel pushed a commit that referenced this pull request Nov 20, 2025
Pass driver root to nvinfo.New in device plugin main

Signed-off-by: Karthik Vetrivel <[email protected]>
@jfroy
Contributor Author

jfroy commented Nov 20, 2025

> Thanks @jfroy -- the fix is definitely valid, but I'm trying to figure out why this doesn't break on a larger scale at present. Are you using the GPU Operator to deploy the plugin, or are you applying customizations?
>
> (Note that if the device plugin container is started with the nvidia runtime, then libnvidia-ml.so will be available at the "standard" path. This change should ONLY be required if the driver libraries are mounted in via some other mechanism.)

I am using GPU Operator 25.10. You can see my fluxcd helmrelease here.

Operator driver management is disabled, but the driver is not host-installed either. Instead, I have a Talos system service that mounts the driver at /run/nvidia/driver, which is the default driver install path. So it's basically a driver container, just not managed by the operator.

The device plugin daemonset is using the nvidia runtime and the injected entrypoint shell script, which sources driver-ready. That file looks like this:

/ # cat /host/run/nvidia/validations/driver-ready
IS_HOST_DRIVER=false
NVIDIA_DRIVER_ROOT=/run/nvidia/driver
DRIVER_ROOT_CTR_PATH=/driver-root
NVIDIA_DEV_ROOT=/
DEV_ROOT_CTR_PATH=/host
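
For illustration only (this is not the operator's actual wiring): a sketch, reusing the variable names from the file above, of how an entrypoint that has sourced driver-ready could resolve the in-container driver root before handing it to `nvinfo.New`.

```go
package sketch

import "os"

// resolveDriverRoot reuses the variable name from driver-ready above: prefer
// the path at which the driver tree is mounted inside this container, and
// fall back to "/" for a host-installed driver.
func resolveDriverRoot() string {
	if root := os.Getenv("DRIVER_ROOT_CTR_PATH"); root != "" {
		return root // /driver-root in the configuration above
	}
	return "/"
}
```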

--

All that being said, I think I observed a failure with the device plugin because I didn't have NVIDIA/nvidia-container-toolkit#1444 figured out at the time. I investigated the device plugin first and came up with this patch. I believe (but have not tested) that with 1444 this patch would not be necessary. But still, it's probably safe and good to pick up.
