Pass driver root to nvinfo.New in device plugin main #1505
The `nvinfo` instance passed to `validateFlags` needs to have the driver root path, otherwise the `HasNvml` call will fail (`libnvidia-ml.so.1` won't be found). This will in turn cause `validateFlags` to error out if `AnyCDIEnabled` is true.

Signed-off-by: Jean-Francois Roy <[email protected]>
/ok to test 7ecfe53
Thanks @jfroy -- the fix is definitely valid, but I'm trying to figure out why this doesn't already break on a larger scale. Are you using the GPU Operator to deploy the plugin, or are you applying customizations? (Note that if the device plugin container is started with the
/cherry-pick release-0.18
Created a manual backport as #1528
Pass driver root to nvinfo.New in device plugin main

Signed-off-by: Karthik Vetrivel <[email protected]>
I am using GPU Operator 25.10. You can see my fluxcd helmrelease here. Operator driver management is disabled, but the driver is not host installed either. Instead, I have a Talos system service that mounts the driver at

The device plugin daemonset is using the nvidia runtime and the injected entrypoint shell script, which sources `driver-ready`. That file looks like this:

--

All that being said, I think I observed a failure with the device plugin because I didn't have NVIDIA/nvidia-container-toolkit#1444 figured out at the time. I investigated the device plugin first and came up with this patch. I believe (but have not tested) that with 1444 this patch would not be necessary. But still, probably safe and good to pick it up.