
Conversation

@jfroy
Contributor

@jfroy jfroy commented Nov 18, 2025

The `nvinfo` instance passed to `validateFlags` needs to have the driver root path; otherwise, the `HasNvml` call will fail (`libnvidia-ml.so.1` won't be found). This will in turn cause `validateFlags` to error out if `AnyCDIEnabled` is true.

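For context, a rough sketch of the shape of the fix. The import path, the `WithRoot` option name, and the exact signatures are best-effort assumptions for illustration, not copied from the repository; only `nvinfo.New`, `HasNvml`, and the CDI check come from the description above.

```go
package sketch

import (
	"fmt"

	// Assumed import path for the go-nvlib info package.
	nvinfo "github.com/NVIDIA/go-nvlib/pkg/nvlib/info"
)

// newInfoLib constructs the nvinfo instance with the driver root so that
// library discovery (and therefore HasNvml) looks under the mounted driver
// tree instead of only the container's default library paths.
func newInfoLib(driverRoot string) nvinfo.Interface {
	return nvinfo.New(
		nvinfo.WithRoot(driverRoot), // e.g. /driver-root for a container-mounted driver
	)
}

// checkCDI mirrors the validation described above: with any CDI mode enabled,
// a missing NVML library becomes a hard error.
func checkCDI(infolib nvinfo.Interface, anyCDIEnabled bool) error {
	if !anyCDIEnabled {
		return nil
	}
	hasNvml, reason := infolib.HasNvml()
	if !hasNvml {
		return fmt.Errorf("CDI is enabled but NVML is not available: %v", reason)
	}
	return nil
}
```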
@copy-pr-bot

copy-pr-bot bot commented Nov 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ArangoGutierrez
Collaborator

/ok to test 7ecfe53

@elezar
Member

elezar commented Nov 20, 2025

Thanks @jfroy -- the fix is definitely valid, but I'm trying to figure out why this doesn't break on a larger scale at present. Are you using the GPU Operator to deploy the plugin, or are you applying customizations?

(Note that if the device plugin container is started with the nvidia runtime, then libnvidia-ml.so will be available at the "standard" path. This change should ONLY be required if the driver libraries are mounted in via some other mechanism.)
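To illustrate the distinction, a simplified path probe (this is not how go-nvlib actually discovers libraries, and the library directory is just an example):

```go
package sketch

import (
	"os"
	"path/filepath"
)

// nvmlVisible is a simplified illustration of the two setups above: the nvidia
// runtime injects libnvidia-ml.so.1 onto a standard library path, while a
// container-mounted driver only exposes it under the driver root.
func nvmlVisible(driverRoot string) bool {
	candidates := []string{
		"/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1",                           // standard path (nvidia runtime case)
		filepath.Join(driverRoot, "usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1"), // driver mounted elsewhere, e.g. /driver-root
	}
	for _, c := range candidates {
		if _, err := os.Stat(c); err == nil {
			return true
		}
	}
	return false
}
```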

@elezar elezar added this to the v0.18.1 milestone Nov 20, 2025
@elezar
Member

elezar commented Nov 20, 2025

/cherry-pick release-0.18

@elezar elezar merged commit e9fa999 into NVIDIA:main Nov 20, 2025
11 checks passed
@github-actions

❌ Failed to create backport PR for release-0.18

Error: Command failed: git cherry-pick -x 7ecfe53
error: commit 7ecfe53 is a merge but no -m option was given.
fatal: cherry-pick failed

Please backport manually.

@elezar
Member

elezar commented Nov 20, 2025

Created a manual backport as #1528

karthikvetrivel pushed a commit that referenced this pull request Nov 20, 2025
Pass driver root to nvinfo.New in device plugin main

Signed-off-by: Karthik Vetrivel <[email protected]>
@jfroy
Contributor Author

jfroy commented Nov 20, 2025

> Thanks @jfroy -- the fix is definitely valid, but I'm trying to figure out why this doesn't break on a larger scale at present. Are you using the GPU Operator to deploy the plugin, or are you applying customizations?
>
> (Note that if the device plugin container is started with the nvidia runtime, then libnvidia-ml.so will be available at the "standard" path. This change should ONLY be required if the driver libraries are mounted in via some other mechanism.)

I am using GPU Operator 25.10. You can see my fluxcd helmrelease here.

Operator driver management is disabled, but the driver is not host-installed either. Instead, I have a Talos system service that mounts the driver at /run/nvidia/driver, which is the default driver install path. So it's basically a driver container, just not managed by the operator.

The device plugin daemonset is using the nvidia runtime and the injected entrypoint shell script, which sources driver-ready. That file looks like this:

/ # cat /host/run/nvidia/validations/driver-ready
IS_HOST_DRIVER=false
NVIDIA_DRIVER_ROOT=/run/nvidia/driver
DRIVER_ROOT_CTR_PATH=/driver-root
NVIDIA_DEV_ROOT=/
DEV_ROOT_CTR_PATH=/host
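
For illustration only (this is not the operator's actual wiring): a sketch, reusing the variable names from the file above, of how an entrypoint that has sourced driver-ready could resolve the in-container driver root before handing it to `nvinfo.New`.

```go
package sketch

import "os"

// resolveDriverRoot reuses the variable name from driver-ready above: prefer
// the path at which the driver tree is mounted inside this container, and
// fall back to "/" for a host-installed driver.
func resolveDriverRoot() string {
	if root := os.Getenv("DRIVER_ROOT_CTR_PATH"); root != "" {
		return root // /driver-root in the configuration above
	}
	return "/"
}
```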

--

All that being said, I think I observed a failure with the device plugin because I didn't have NVIDIA/nvidia-container-toolkit#1444 figured out at the time. I investigated the device plugin first and came up with this patch. I believe (but have not tested) that with 1444 this patch would not be necessary. But still, it's probably safe and good to pick up.
