Add init container to plugin pod (GPU driver dependency check) #389
Conversation
Force-pushed 71e06de to bccdaf5
Resolved review threads (outdated) on deployments/helm/nvidia-dra-driver-gpu/templates/kubeletplugin.yaml
```
# actively wait for the GPU driver to be set up properly (this init
# container is meant to be a long-running init container that only
# exits upon success). That allows for auto-healing the DRA driver
# instalaltion right after the GPU driver setup has been fixed.
```
note to self: typo
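The quoted comment describes an init container that only exits once the driver is usable. A minimal sketch of such a wait loop, assuming a `/driver-root` layout, the helper names, and a 10-second poll interval for illustration (none of these are taken verbatim from this PR):

```shell
#!/bin/sh
# Hypothetical helpers for the "actively wait" behavior described above.

# driver_ready DIR: succeed only when the expected driver artifacts exist.
driver_ready() {
  [ -x "$1/usr/bin/nvidia-smi" ] || return 1
  /bin/ls "$1"/usr/lib*/libnvidia-ml.so.1 >/dev/null 2>&1 || return 1
  return 0
}

# wait_for_driver DIR: poll until the driver shows up, then return success.
# The loop never fails: auto-healing relies on the init container exiting
# only after the GPU driver setup has been fixed.
wait_for_driver() {
  until driver_ready "$1"; do
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) waiting for GPU driver in $1"
    sleep 10
  done
  echo "GPU driver detected in $1"
}
```

Because the loop only ever exits on success, a crash-looping or stuck init container is itself the failure signal, which is why the log lines below matter.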
Force-pushed 2fd2c25 to 71bde39
Force-pushed b9d821c to 98c5637
Current state. I tested some permutations. Scenario I) operator-provided driver, forgot to set
Force-pushed 8f21899 to 988590d
Force-pushed 5dab7db to bc43cd6
Will need to review things further. Importantly, I finally managed to make the init container detect when the operator-provided driver gets mounted at
Some output: when the DRA driver is installed before the GPU Operator, the DRA driver's plugin daemonset may at some point delete and recreate its pods (including the init container), triggered by one of the mounts that the GPU Operator performs. That is fine. The main error message and the scenario-specific hints currently read:
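For orientation, the init container is wired into the plugin pod via the kubeletplugin.yaml template. A hypothetical sketch of the relevant fragment; the container name, image, script path, and volume names below are placeholders, not the template's actual values:

```yaml
# Hypothetical fragment of kubeletplugin.yaml (names/image are placeholders).
# The hostPath mirrors the GPU driver root into the init container, so the
# wait loop observes the mounts the GPU Operator performs on the host.
initContainers:
  - name: gpu-driver-dependency-check
    image: example.com/driver-check:latest   # placeholder image
    command: ["/bin/sh", "/scripts/wait-for-driver.sh"]
    volumeMounts:
      - name: driver-root
        mountPath: /driver-root
        readOnly: true
volumes:
  - name: driver-root
    hostPath:
      path: /run/nvidia/driver
```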
```
fi

# Log top-level entries in /driver-root (this may be valuable debug info).
echo "current contents: [$(/bin/ls -1xAw0 /driver-root 2>/dev/null)]."
```
This shows what I was trying to achieve (condensed information on a single line, also showing how the contents of /driver-root evolved during init container runtime):

```
2025-06-08T14:13:52Z /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [].
2025-06-08T14:14:02Z /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
```
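The single-line listing comes from the flag combination: with GNU coreutils `ls`, `-A` includes dotfiles, `-x` lists entries across rows rather than down columns, and `-w0` removes the line-width limit, so all entries land on one line (`-x` overrides `-1`). A small standalone demonstration with a throwaway directory (the entry names are made up):

```shell
#!/bin/sh
# Demonstrate condensing a directory listing onto a single line, as in:
#   echo "current contents: [$(/bin/ls -1xAw0 /driver-root 2>/dev/null)]."
d=$(mktemp -d)
touch "$d/lib" "$d/lib64" "$d/.cache"   # sample entries, incl. a dotfile
listing=$(/bin/ls -1xAw0 "$d")
echo "current contents: [${listing}]."
```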
Dropping the container image layer caching changes from this patch; moved them to #397. Next: squashing commits a bit; will squash to one commit before landing.
Force-pushed 7ea2466 to b13b387
Current state, testing host-provided driver at / and operator-provided driver.
So, above, the DaemonSet-managed pod was deleted & recreated as a result of a mount event.
Above, we first see
Force-pushed f033684 to 9c7c0c5
Last force-push: squashed commits without changes, i.e. the test output in the previous comment corresponds to the squashed commit.
klueska
left a comment
Just some small nits, otherwise looks good.
Could definitely benefit from being migrated into Go code (and potentially contributed back to the gpu-operator validator, in terms of the error handling that's done).
Force-pushed 9c7c0c5 to 0924a0a
Force-pushed 09aeeda to 6af5954
For the record, how this evolved (describing the scenario of the operator-provided GPU driver):
klueska
left a comment
Thanks for all the back and forth on this. I feel like we learned a lot while putting this PR together. Not necessarily about the 100% right way to do this, but definitely a lot about how not to do it.
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Force-pushed c985522 to b33f9fd
Tested the latest here again manually. Host-provided driver: ✔️ Operator-provided driver: ✔️
We need to give users some troubleshooting guidance.