
Conversation

@jgehrcke jgehrcke commented Jun 4, 2025

We need to give users some troubleshooting guidance.

@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch 2 times, most recently from 71e06de to bccdaf5 on June 4, 2025 17:45
# actively wait for the GPU driver to be set up properly (this init
# container is meant to be a long-running init container that only
# exits upon success). That allows for auto-healing the DRA driver
# instalaltion right after the GPU driver setup has been fixed.
note to self: typo

@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch from 2fd2c25 to 71bde39 on June 5, 2025 13:32
@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch from b9d821c to 98c5637 on June 5, 2025 15:42
@jgehrcke

jgehrcke commented Jun 5, 2025

Current state. I tested some permutations.

Scenario I) operator-provided driver, forgot to set nvidiaDriverRoot

Init container log output:

$ kubectl logs -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-kubelet-plugin-s5sbb init-container
run: chroot /driver-root nvidia-smi
chroot: failed to run command 'nvidia-smi': No such file or directory

Followed by this message (shown outside a code block so it does not get too wide):

nvidia-smi failed (see error above). Has the NVIDIA GPU driver been set up? The GPU driver is expected to be installed under NVIDIA_DRIVER_ROOT (currently set to '/') in the host filesystem. If that path appears to be unexpected: review and adjust the 'nvidiaDriverRoot' Helm chart variable. If the value is expected: review if the GPU driver has actually been installed under NVIDIA_DRIVER_ROOT.

and this hint:

Hint: /run/nvidia/driver/usr/bin/nvidia-smi exists on the host, you may want to re-install the DRA driver Helm chart with --set nvidiaDriverRoot=/run/nvidia/driver
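
For context, a hint like this can presumably be generated by probing the Operator's well-known install location through the mounted driver root (with nvidiaDriverRoot left at '/', the host root is visible at /driver-root). A minimal sketch of such a check; the actual script's logic and paths may differ:

# Sketch (assumed logic, not the literal script): with NVIDIA_DRIVER_ROOT=/ the
# host filesystem is reachable at /driver-root, so the GPU Operator's default
# driver install path can be probed through it.
if [ -x /driver-root/run/nvidia/driver/usr/bin/nvidia-smi ]; then
  echo "Hint: /run/nvidia/driver/usr/bin/nvidia-smi exists on the host, you may want to" \
    "re-install the DRA driver Helm chart with --set nvidiaDriverRoot=/run/nvidia/driver"
fi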

Scenario II) NVIDIA_DRIVER_ROOT does not exist on host

$ kubectl get pods  -n nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-kubelet-plugin-jsx8r
NAME                                         READY   STATUS     RESTARTS   AGE
nvidia-dra-driver-gpu-kubelet-plugin-jsx8r   0/1     Init:0/1   0          74s

Clear error message in describe output:

$ kubectl describe pods  -n nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-kubelet-plugin-jsx8r | tail -n 5
Events:
  Type     Reason       Age                 From               Message
  ----     ------       ----                ----               -------
  Normal   Scheduled    104s                default-scheduler  Successfully assigned nvidia-dra-driver-gpu/nvidia-dra-driver-gpu-kubelet-plugin-jsx8r to sc-starwars-mab6-b00
  Warning  FailedMount  40s (x8 over 103s)  kubelet            MountVolume.SetUp failed for volume "driver-root" : hostPath type check failed: /does/not/exist is not a directory

Zooming in, the relevant detail:

hostPath type check failed: /does/not/exist is not a directory

During that retry loop I created the directory on the host. We then got past the MountVolume.SetUp failure, and the inner validation logic first emitted

nvidia-smi failed (see error above). [...]

and then, for this special case:

Hint: Directory /does/not/exist on the host appears to be empty
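
For reference, that MountVolume.SetUp error is the failure mode of a hostPath volume whose type check requires an existing directory; a sketch of what such a volume definition could look like (assumed, not necessarily the chart's literal template):

# Assumed sketch, not the chart's literal template: with a Directory type check,
# the kubelet refuses to set up the volume until the path exists on the host.
volumes:
  - name: driver-root
    hostPath:
      path: /does/not/exist   # the nvidiaDriverRoot configured in this scenario
      type: Directory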

Scenario III) operator-provided driver, orphaned directory

In my testing I found that it can easily happen that /run/nvidia/driver exists, but is orphaned:

root@SC-Starwars-MAB5-B00:/run/nvidia/driver# tree
.
├── etc
└── lib
    └── firmware

3 directories, 0 files

In that case, the init container logs the same "nvidia-smi failed (see error above). [...]" as above. Additionally, for this special case:

If you chose the NVIDIA GPU Operator to manage the GPU driver (NVIDIA_DRIVER_ROOT is set to /run/nvidia/driver): make sure that Operator is deployed and healthy.

Hint: Directory /run/nvidia/driver not empty, but does not seem to contain GPU driver binaries

Scenario IV) operator-provided driver, happy path:

$ kubectl logs -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-kubelet-plugin-7gddc init-container
run: chroot /driver-root nvidia-smi
Thu Jun  5 16:55:04 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.32                 Driver Version: 580.32         CUDA Version: 13.0     |
SNIP
chrooted nvidia-smi returned with code 0: success, leave

Scenario V) host-provided driver, happy path:

$ kubectl logs -n nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-kubelet-plugin-d7cml init-container
<snip>
chrooted nvidia-smi returned with code 0: success, leave
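
For context, the check loop at this stage boils down to something like the following sketch (the actual script additionally emits the main error message and the scenario-specific hints shown above):

# Minimal sketch of the long-running init container loop at this stage: keep
# probing until nvidia-smi works under the mounted driver root, then exit
# successfully so that the plugin container can start.
while true; do
  echo "run: chroot /driver-root nvidia-smi"
  if chroot /driver-root nvidia-smi; then
    echo "chrooted nvidia-smi returned with code 0: success, leave"
    exit 0
  fi
  # emit the main error message and scenario-specific hints here, then retry
  sleep 10
done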

@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch from 8f21899 to 988590d on June 5, 2025 17:43
@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch 3 times, most recently from 5dab7db to bc43cd6 on June 8, 2025 13:56
@jgehrcke

jgehrcke commented Jun 8, 2025

Will need to review things further. Importantly, I finally got the init container to detect when the operator-provided driver gets mounted at /run/nvidia/driver. That requires a combination of two strategies:

  1. Not mounting /run/nvidia/driver into the init container, because that is too deep down the hierarchy (the Operator mounts a volume at /run/nvidia).
  2. Even with a mount further up the tree (in both the init container and the plugin container), we need mountPropagation: HostToContainer on that mount to pick up mount changes further down the hierarchy, roughly as sketched below.
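
Roughly, that combination corresponds to something like the following in the plugin daemonset spec; an illustrative sketch based on the log output below (names and exact structure are assumptions, not the literal chart template):

# Illustrative sketch, not the literal chart template: mount a directory above
# /run/nvidia/driver from the host, and let host-side mounts that appear later
# below it propagate into the container. The plugin container gets the same
# mount with the same propagation setting.
initContainers:
  - name: init-container
    volumeMounts:
      - name: host-run
        mountPath: /host-run
        mountPropagation: HostToContainer   # picks up /run/nvidia/driver appearing later
volumes:
  - name: host-run
    hostPath:
      path: /run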

Some output:

$ kubectl logs -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-kubelet-plugin-qdmq7 init-container -f
create symlink: /driver-root -> /host-run/nvidia/driver
2025-06-08T13:39:48Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
<snip>

$ helm install gpu-operator nvidia/gpu-operator ...
...

$ kubectl logs -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-kubelet-plugin-qdmq7 init-container -f
...
2025-06-08T13:42:18Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-08T13:42:28Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-08T13:42:38Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-08T13:42:48Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: '/driver-root/usr/bin/nvidia-smi', libnvidia-ml.so.1: '/driver-root/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1', current contents: [bin  boot  cuda-keyring_1.1-1_all.deb  dev  drivers  etc  home  host-etc  lib  licenses  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var].
invoke: env -i LD_PRELOAD=/driver-root/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1 /driver-root/usr/bin/nvidia-smi
<snip>
nvidia-smi returned with code 0: success, leave

When the DRA driver is installed before the GPU Operator, the DRA driver's plugin daemonset may at some point delete and recreate its pods (including the init container), triggered by one of the mounts that the GPU Operator performs. That is fine.

The main error message and scenario-specific hints currently read:

Check failed. Has the NVIDIA GPU driver been set up? It is expected to be installed under NVIDIA_DRIVER_ROOT (currently set to '/run/nvidia/driver') in the host filesystem. If that path appears to be unexpected: review the DRA driver's 'nvidiaDriverRoot' Helm chart variable. Otherwise, review if the GPU driver has actually been installed under that path.

Hint: Directory /run/nvidia/driver is not empty but at least one of the binaries wasn't found.

Hint: NVIDIA_DRIVER_ROOT is set to '/run/nvidia/driver' which typically means that the NVIDIA GPU Operator manages the GPU driver. Make sure that the GPU Operator is deployed and healthy.

fi

# Log top-level entries in /driver-root (this may be valuable debug info).
echo "current contents: [$(/bin/ls -1xAw0 /driver-root 2>/dev/null)]."
This shows what I was trying to achieve (condensed information on a single line, also showing how the contents of /driver-root evolved during init container runtime):

2025-06-08T14:13:52Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [].
2025-06-08T14:14:02Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
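
For reference, a one-line status like that can be assembled along these lines (a sketch; the actual script differs in detail, but the 'current contents' part is the snippet quoted above):

# Sketch (assumed helper, not the literal script): locate the two driver
# artifacts under /driver-root and condense the result into a single log line.
ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
smi="$(find /driver-root/usr/bin -name nvidia-smi 2>/dev/null | head -n 1)"
lib="$(find /driver-root/usr/lib -name libnvidia-ml.so.1 2>/dev/null | head -n 1)"
echo "${ts}  /driver-root (${NVIDIA_DRIVER_ROOT} on host):" \
  "nvidia-smi: ${smi:-not found}, libnvidia-ml.so.1: ${lib:-not found}," \
  "current contents: [$(/bin/ls -1xAw0 /driver-root 2>/dev/null)]."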

@jgehrcke

jgehrcke commented Jun 10, 2025

Dropping the container image layer caching changes from this patch; moved them to #397

Next: squashing commits a bit, will squash to one commit before landing.

@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch 2 times, most recently from 7ea2466 to b13b387 on June 10, 2025 12:19
@jgehrcke

Current state, testing

host-provided driver at /

$ kubectl logs -n nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-kubelet-plugin-kflln init-container
2025-06-10T12:35:14Z  /driver-root (/ on host): nvidia-smi: '/driver-root/usr/bin/nvidia-smi', libnvidia-ml.so.1: '/driver-root/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1', current contents: [bin  boot  cdrom  dev  does  etc  home  lib  lost+found  media  mnt  opt  proc  root  run  sbin  snap  srv  swap.img  sys  tmp  usr  var].
invoke: env -i LD_PRELOAD=/driver-root/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1 /driver-root/usr/bin/nvidia-smi
<snip>
nvidia-smi returned with code 0: success, leave 

operator-provided driver

  1. Initial condition: no DRA driver, no GPU Operator
  2. Install the DRA driver
  3. Inspect init container logs:
$ kubectl logs -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-kubelet-plugin-q7flz init-container -f
2025-06-10T12:42:25Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [].

Check failed. Has the NVIDIA GPU driver been set up? It is expected to be installed under NVIDIA_DRIVER_ROOT (currently set to '/run/nvidia/driver') in the host filesystem. If that path appears to be unexpected: review the DRA driver's 'nvidiaDriverRoot' Helm chart variable. Otherwise, review if the GPU driver has actually been installed under that path.
Hint: Directory /run/nvidia/driver on the host is empty
Hint: NVIDIA_DRIVER_ROOT is set to '/run/nvidia/driver' which typically means that the NVIDIA GPU Operator manages the GPU driver. Make sure that the GPU Operator is deployed and healthy.

2025-06-10T12:42:35Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [].
2025-06-10T12:42:45Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [].
...
  4. Install the GPU Operator
  5. Keep inspecting init container logs:
$ kubectl logs -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-kubelet-plugin-q7flz init-container -f
...
2025-06-10T12:43:05Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [].
2025-06-10T12:43:15Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [].
2025-06-10T12:43:25.736Z: received SIGTERM
...

So, above, the daemonset-managed pod was deleted and recreated as a result of a mount event.

  6. Inspect init container logs (new pod ID):
$ kubectl logs -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-kubelet-plugin-vjb2q init-container -f
2025-06-10T12:43:27Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].

Check failed. Has the NVIDIA GPU driver been set up? It is expected to be installed under NVIDIA_DRIVER_ROOT (currently set to '/run/nvidia/driver') in the host filesystem. If that path appears to be unexpected: review the DRA driver's 'nvidiaDriverRoot' Helm chart variable. Otherwise, review if the GPU driver has actually been installed under that path.
Hint: Directory /run/nvidia/driver is not empty but at least one of the binaries wasn't found.
Hint: NVIDIA_DRIVER_ROOT is set to '/run/nvidia/driver' which typically means that the NVIDIA GPU Operator manages the GPU driver. Make sure that the GPU Operator is deployed and healthy.

2025-06-10T12:43:37Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-10T12:43:47Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-10T12:43:57Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-10T12:44:07Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-10T12:44:17Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-10T12:44:27Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: '/driver-root/usr/bin/nvidia-smi', libnvidia-ml.so.1: '/driver-root/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1', current contents: [bin  boot  cuda-keyring_1.1-1_all.deb  dev  drivers  etc  home  host-etc  lib  licenses  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var].
invoke: env -i LD_PRELOAD=/driver-root/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1 /driver-root/usr/bin/nvidia-smi
<snip-nvidia-smi-output>
nvidia-smi returned with code 0: success, leave

Above, we first see [lib] pop up (an intermediate state). A minute later, the driver volume is mounted and the check passes.

@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch from f033684 to 9c7c0c5 on June 10, 2025 12:52
@jgehrcke

jgehrcke commented Jun 10, 2025

Last force-push: squashed commits without changes, i.e. the test output in the previous comment corresponds to the squashed commit.

@klueska klueska left a comment

Just some small nits, otherwise looks good.
Could definitely benefit from being migrated into Go code (and the error handling done here could potentially be contributed back to the gpu-operator validator).

@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch from 9c7c0c5 to 0924a0a on June 10, 2025 13:38
@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch from 09aeeda to 6af5954 on June 11, 2025 10:31
@jgehrcke

jgehrcke commented Jun 11, 2025

For the record, how this evolved (describing the scenario of the operator-provided GPU driver):

  1. In the first days of building this, I converged on mounting /run into the init container. I did this after noticing that the Operator's driver installer mounts /run/nvidia, and I wanted to go "above that" because it helped with instabilities that I did not fully understand at the time.

  2. In review, we looked at what the Operator's driver validator is doing -- it mounts /run/nvidia/driver -- and asked critically: why can't we do the same in this init container? So, we tried that.

  3. After trying hard to build the init container by mounting just /run/nvidia/driver, we identified at least two rather nasty instability categories. At least one instability source related to the fact that the Operator's teardown / resource cleanup is not (and can never be) fully predictable: the host filesystem's state at /run/nvidia has to be treated as unknown when the init container comes up -- the path may or may not exist, something may or may not be mounted there, and there may or may not be a bunch of mounts below that.

  4. We decided to go back to a conceptually safer approach where we can be sure that we are not suffering from disjoint filesystem / mount hierarchies, but have one guaranteed common node in the hierarchy. However, instead of taking the overly conservative approach of mounting /run (cf. (1)), we tried working with /run/nvidia. Gladly, that seems to work well so far (no instability found yet after many rounds of testing); see the sketch below.
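
In practice, a directory at or above the driver root is mounted at a fixed path (/driver-root-parent) inside the container, with mountPropagation: HostToContainer, and /driver-root is set up as a symlink into it. A sketch consistent with the 'create symlink' log lines in the test output below (assumed logic; the actual script may differ):

# Sketch (assumed logic): point /driver-root at the configured driver root via
# the /driver-root-parent mount, matching the "create symlink" log lines below.
if [ "${NVIDIA_DRIVER_ROOT}" = "/run/nvidia/driver" ]; then
  # operator-provided driver: /driver-root-parent is /run/nvidia on the host
  ln -s /driver-root-parent/driver /driver-root
else
  # host-provided driver: /driver-root-parent is NVIDIA_DRIVER_ROOT itself
  ln -s /driver-root-parent/ /driver-root
fi
echo "create symlink: /driver-root -> $(readlink /driver-root)"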

@klueska klueska left a comment

Thanks for all the back and forth on this. I feel like we learned a lot while putting this PR together. Not necessarily about the 100% right way to do this, but definitely a lot about how not to do it.

@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch from c985522 to b33f9fd on June 11, 2025 13:35
@jgehrcke

Tested the latest here again manually.

Host-provided driver:

create symlink: /driver-root -> /driver-root-parent/
2025-06-11T13:42:56Z  /driver-root (/ on host): ...
...
nvidia-smi returned with code 0: success, leave

✔️

Operator-provided driver:

create symlink: /driver-root -> /driver-root-parent/driver
2025-06-11T13:44:56Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].

Check failed. Has the NVIDIA GPU driver been set up? It is expected to be installed under NVIDIA_DRIVER_ROOT (currently set to '/run/nvidia/driver') in the host filesystem. If that path appears to be unexpected: review the DRA driver's 'nvidiaDriverRoot' Helm chart variable. Otherwise, review if the GPU driver has actually been installed under that path.
Hint: Directory /run/nvidia/driver is not empty but at least one of the binaries wasn't found.
Hint: NVIDIA_DRIVER_ROOT is set to '/run/nvidia/driver' which typically means that the NVIDIA GPU Operator manages the GPU driver. Make sure that the GPU Operator is deployed and healthy.

2025-06-11T13:45:06Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-11T13:45:16Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-11T13:45:26Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-11T13:45:36Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-11T13:45:46Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-11T13:45:56Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: '/driver-root/usr/bin/nvidia-smi', libnvidia-ml.so.1: '/driver-root/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1', current contents: [bin  boot  cuda-keyring_1.1-1_all.deb  dev  drivers  etc  home  host-etc  lib  licenses  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var].
invoke: env -i LD_PRELOAD=/driver-root/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1 /driver-root/usr/bin/nvidia-smi
<snip>
nvidia-smi returned with code 0: success, leave

✔️

@jgehrcke jgehrcke merged commit 0a18f1a into NVIDIA:main Jun 11, 2025
13 checks passed
@klueska klueska added this to the v25.3.0 milestone Aug 13, 2025