
Conversation

@jgehrcke jgehrcke commented Jun 4, 2025

We need to give users some troubleshooting guidance.

@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch 2 times, most recently from 71e06de to bccdaf5 on June 4, 2025 17:45
# actively wait for the GPU driver to be set up properly (this init
# container is meant to be a long-running init container that only
# exits upon success). That allows for auto-healing the DRA driver
# instalaltion right after the GPU driver setup has been fixed.
note to self: typo

@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch from 2fd2c25 to 71bde39 on June 5, 2025 13:32
@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch from b9d821c to 98c5637 on June 5, 2025 15:42
@jgehrcke

jgehrcke commented Jun 5, 2025

Current state. I tested some permutations.

Scenario I) operator-provided driver, forgot to set nvidiaDriverRoot

Init container log output:

$ kubectl logs -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-kubelet-plugin-s5sbb init-container
run: chroot /driver-root nvidia-smi
chroot: failed to run command 'nvidia-smi': No such file or directory

Followed by this message (shown outside a code block so it does not get too wide):

nvidia-smi failed (see error above). Has the NVIDIA GPU driver been set up? The GPU driver is expected to be installed under NVIDIA_DRIVER_ROOT (currently set to '/') in the host filesystem. If that path appears to be unexpected: review and adjust the 'nvidiaDriverRoot' Helm chart variable. If the value is expected: review if the GPU driver has actually been installed under NVIDIA_DRIVER_ROOT.

and this hint:

Hint: /run/nvidia/driver/usr/bin/nvidia-smi exists on the host, you may want to re-install the DRA driver Helm chart with --set nvidiaDriverRoot=/run/nvidia/driver
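
For context, a hint like this can presumably be generated by probing the Operator's well-known install location through the mounted driver root (with nvidiaDriverRoot left at '/', the host root is visible at /driver-root). A minimal sketch of such a check; the actual script's logic and paths may differ:

# Sketch (assumed logic, not the literal script): with NVIDIA_DRIVER_ROOT=/ the
# host filesystem is reachable at /driver-root, so the GPU Operator's default
# driver install path can be probed through it.
if [ -x /driver-root/run/nvidia/driver/usr/bin/nvidia-smi ]; then
  echo "Hint: /run/nvidia/driver/usr/bin/nvidia-smi exists on the host, you may want to" \
    "re-install the DRA driver Helm chart with --set nvidiaDriverRoot=/run/nvidia/driver"
fi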

Scenario II) NVIDIA_DRIVER_ROOT does not exist on host

$ kubectl get pods  -n nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-kubelet-plugin-jsx8r
NAME                                         READY   STATUS     RESTARTS   AGE
nvidia-dra-driver-gpu-kubelet-plugin-jsx8r   0/1     Init:0/1   0          74s

Clear error message in describe output:

$ kubectl describe pods  -n nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-kubelet-plugin-jsx8r | tail -n 5
Events:
  Type     Reason       Age                 From               Message
  ----     ------       ----                ----               -------
  Normal   Scheduled    104s                default-scheduler  Successfully assigned nvidia-dra-driver-gpu/nvidia-dra-driver-gpu-kubelet-plugin-jsx8r to sc-starwars-mab6-b00
  Warning  FailedMount  40s (x8 over 103s)  kubelet            MountVolume.SetUp failed for volume "driver-root" : hostPath type check failed: /does/not/exist is not a directory

Zooming in, the relevant detail:

hostPath type check failed: /does/not/exist is not a directory

During that retry loop I created the directory on the host. We then got past the MountVolume.SetUp failure, and the inner validation logic first emitted

nvidia-smi failed (see error above). [...]

and then, for this special case:

Hint: Directory /does/not/exist on the host appears to be empty
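
For reference, that MountVolume.SetUp error is the failure mode of a hostPath volume whose type check requires an existing directory; a sketch of what such a volume definition could look like (assumed, not necessarily the chart's literal template):

# Assumed sketch, not the chart's literal template: with a Directory type check,
# the kubelet refuses to set up the volume until the path exists on the host.
volumes:
  - name: driver-root
    hostPath:
      path: /does/not/exist   # the nvidiaDriverRoot configured in this scenario
      type: Directory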

Scenario III) operator-provided driver, orphaned directory

In my testing I found that it can easily happen that /run/nvidia/driver exists, but is orphaned:

root@SC-Starwars-MAB5-B00:/run/nvidia/driver# tree
.
├── etc
└── lib
    └── firmware

3 directories, 0 files

In that case, the init container logs the same "nvidia-smi failed (see error above). [...]" as above. Additionally, for this special case:

If you chose the NVIDIA GPU Operator to manage the GPU driver (NVIDIA_DRIVER_ROOT is set to /run/nvidia/driver): make sure that Operator is deployed and healthy.

Hint: Directory /run/nvidia/driver not empty, but does not seem to contain GPU driver binaries

Scenario IV) operator-provided driver, happy path:

$ kubectl logs -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-kubelet-plugin-7gddc init-container
run: chroot /driver-root nvidia-smi
Thu Jun  5 16:55:04 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.32                 Driver Version: 580.32         CUDA Version: 13.0     |
SNIP
chrooted nvidia-smi returned with code 0: success, leave

Scenario V) host-provided driver, happy path:

$ kubectl logs -n nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-kubelet-plugin-d7cml init-container
<snip>
chrooted nvidia-smi returned with code 0: success, leave
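
For context, the check loop at this stage boils down to something like the following sketch (the actual script additionally emits the main error message and the scenario-specific hints shown above):

# Minimal sketch of the long-running init container loop at this stage: keep
# probing until nvidia-smi works under the mounted driver root, then exit
# successfully so that the plugin container can start.
while true; do
  echo "run: chroot /driver-root nvidia-smi"
  if chroot /driver-root nvidia-smi; then
    echo "chrooted nvidia-smi returned with code 0: success, leave"
    exit 0
  fi
  # emit the main error message and scenario-specific hints here, then retry
  sleep 10
done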

@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch from 8f21899 to 988590d on June 5, 2025 17:43
@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch 3 times, most recently from 5dab7db to bc43cd6 on June 8, 2025 13:56
@jgehrcke

jgehrcke commented Jun 8, 2025

Will need to review things further. Importantly, I finally got the init container to detect when the operator-provided driver gets mounted at /run/nvidia/driver. That requires a combination of two strategies:

  1. Not mounting /run/nvidia/driver into the init container, because that is too deep down the hierarchy (the Operator mounts a volume at /run/nvidia).
  2. Even with a mount further up the tree (in both the init container and the plugin container), we need mountPropagation: HostToContainer on that mount to pick up mount changes further down the hierarchy, roughly as sketched below.
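
Roughly, that combination corresponds to something like the following in the plugin daemonset spec; an illustrative sketch based on the log output below (names and exact structure are assumptions, not the literal chart template):

# Illustrative sketch, not the literal chart template: mount a directory above
# /run/nvidia/driver from the host, and let host-side mounts that appear later
# below it propagate into the container. The plugin container gets the same
# mount with the same propagation setting.
initContainers:
  - name: init-container
    volumeMounts:
      - name: host-run
        mountPath: /host-run
        mountPropagation: HostToContainer   # picks up /run/nvidia/driver appearing later
volumes:
  - name: host-run
    hostPath:
      path: /run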

Some output:

$ kubectl logs -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-kubelet-plugin-qdmq7 init-container -f
create symlink: /driver-root -> /host-run/nvidia/driver
2025-06-08T13:39:48Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
<snip>

$ helm install gpu-operator nvidia/gpu-operator ...
...

$ kubectl logs -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-kubelet-plugin-qdmq7 init-container -f
...
2025-06-08T13:42:18Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-08T13:42:28Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-08T13:42:38Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-08T13:42:48Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: '/driver-root/usr/bin/nvidia-smi', libnvidia-ml.so.1: '/driver-root/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1', current contents: [bin  boot  cuda-keyring_1.1-1_all.deb  dev  drivers  etc  home  host-etc  lib  licenses  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var].
invoke: env -i LD_PRELOAD=/driver-root/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1 /driver-root/usr/bin/nvidia-smi
<snip>
nvidia-smi returned with code 0: success, leave

When the DRA driver is installed before the GPU Operator, the DRA driver's plugin daemonset may at some point delete and recreate its pods (including the init container), triggered by one of the mounts that the GPU Operator performs. That is fine.

The main error message and scenario-specific hints currently read:

Check failed. Has the NVIDIA GPU driver been set up? It is expected to be installed under NVIDIA_DRIVER_ROOT (currently set to '/run/nvidia/driver') in the host filesystem. If that path appears to be unexpected: review the DRA driver's 'nvidiaDriverRoot' Helm chart variable. Otherwise, review if the GPU driver has actually been installed under that path.

Hint: Directory /run/nvidia/driver is not empty but at least one of the binaries wasn't found.

Hint: NVIDIA_DRIVER_ROOT is set to '/run/nvidia/driver' which typically means that the NVIDIA GPU Operator manages the GPU driver. Make sure that the GPU Operator is deployed and healthy.

fi

# Log top-level entries in /driver-root (this may be valuable debug info).
echo "current contents: [$(/bin/ls -1xAw0 /driver-root 2>/dev/null)]."
This shows what I was trying to achieve (condensed information on a single line, also showing how the contents of /driver-root evolved during init container runtime):

2025-06-08T14:13:52Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [].
2025-06-08T14:14:02Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
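
For reference, a one-line status like that can be assembled along these lines (a sketch; the actual script differs in detail, but the 'current contents' part is the snippet quoted above):

# Sketch (assumed helper, not the literal script): locate the two driver
# artifacts under /driver-root and condense the result into a single log line.
ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
smi="$(find /driver-root/usr/bin -name nvidia-smi 2>/dev/null | head -n 1)"
lib="$(find /driver-root/usr/lib -name libnvidia-ml.so.1 2>/dev/null | head -n 1)"
echo "${ts}  /driver-root (${NVIDIA_DRIVER_ROOT} on host):" \
  "nvidia-smi: ${smi:-not found}, libnvidia-ml.so.1: ${lib:-not found}," \
  "current contents: [$(/bin/ls -1xAw0 /driver-root 2>/dev/null)]."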

@jgehrcke

jgehrcke commented Jun 10, 2025

Dropping the container image layer caching changes from this patch; moved them to #397

Next: squashing commits a bit, will squash to one commit before landing.

@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch 2 times, most recently from 7ea2466 to b13b387 on June 10, 2025 12:19
@jgehrcke

Current state, testing

host-provided driver at /

$ kubectl logs -n nvidia-dra-driver-gpu        nvidia-dra-driver-gpu-kubelet-plugin-kflln init-container
2025-06-10T12:35:14Z  /driver-root (/ on host): nvidia-smi: '/driver-root/usr/bin/nvidia-smi', libnvidia-ml.so.1: '/driver-root/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1', current contents: [bin  boot  cdrom  dev  does  etc  home  lib  lost+found  media  mnt  opt  proc  root  run  sbin  snap  srv  swap.img  sys  tmp  usr  var].
invoke: env -i LD_PRELOAD=/driver-root/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1 /driver-root/usr/bin/nvidia-smi
<snip>
nvidia-smi returned with code 0: success, leave 

operator-provided driver

  1. Initial condition: no DRA driver, no GPU Operator
  2. Install the DRA driver
  3. Inspect init container logs:
$ kubectl logs -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-kubelet-plugin-q7flz init-container -f
2025-06-10T12:42:25Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [].

Check failed. Has the NVIDIA GPU driver been set up? It is expected to be installed under NVIDIA_DRIVER_ROOT (currently set to '/run/nvidia/driver') in the host filesystem. If that path appears to be unexpected: review the DRA driver's 'nvidiaDriverRoot' Helm chart variable. Otherwise, review if the GPU driver has actually been installed under that path.
Hint: Directory /run/nvidia/driver on the host is empty
Hint: NVIDIA_DRIVER_ROOT is set to '/run/nvidia/driver' which typically means that the NVIDIA GPU Operator manages the GPU driver. Make sure that the GPU Operator is deployed and healthy.

2025-06-10T12:42:35Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [].
2025-06-10T12:42:45Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [].
...
  4. Install the GPU Operator
  5. Keep inspecting init container logs:
$ kubectl logs -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-kubelet-plugin-q7flz init-container -f
...
2025-06-10T12:43:05Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [].
2025-06-10T12:43:15Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [].
2025-06-10T12:43:25.736Z: received SIGTERM
...

So, above, the daemonset-managed pod was deleted and recreated as a result of a mount event.

  6. Inspect init container logs (new pod ID):
$ kubectl logs -n nvidia-dra-driver-gpu   nvidia-dra-driver-gpu-kubelet-plugin-vjb2q init-container -f
2025-06-10T12:43:27Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].

Check failed. Has the NVIDIA GPU driver been set up? It is expected to be installed under NVIDIA_DRIVER_ROOT (currently set to '/run/nvidia/driver') in the host filesystem. If that path appears to be unexpected: review the DRA driver's 'nvidiaDriverRoot' Helm chart variable. Otherwise, review if the GPU driver has actually been installed under that path.
Hint: Directory /run/nvidia/driver is not empty but at least one of the binaries wasn't found.
Hint: NVIDIA_DRIVER_ROOT is set to '/run/nvidia/driver' which typically means that the NVIDIA GPU Operator manages the GPU driver. Make sure that the GPU Operator is deployed and healthy.

2025-06-10T12:43:37Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-10T12:43:47Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-10T12:43:57Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-10T12:44:07Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-10T12:44:17Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-10T12:44:27Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: '/driver-root/usr/bin/nvidia-smi', libnvidia-ml.so.1: '/driver-root/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1', current contents: [bin  boot  cuda-keyring_1.1-1_all.deb  dev  drivers  etc  home  host-etc  lib  licenses  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var].
invoke: env -i LD_PRELOAD=/driver-root/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1 /driver-root/usr/bin/nvidia-smi
<snip-nvidia-smi-output>
nvidia-smi returned with code 0: success, leave

Above, we first see [lib] pop up (an intermediate state). A minute later, the driver volume is mounted and the check passes.

@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch from f033684 to 9c7c0c5 on June 10, 2025 12:52
@jgehrcke

jgehrcke commented Jun 10, 2025

Last force-push: squashed commits without changes, i.e. the test output in the previous comment corresponds to the squashed commit.

@klueska klueska left a comment

Just some small nits, otherwise looks good.
Could definitely benefit from being migrated into Go code (and the error handling done here could potentially be contributed back to the gpu-operator validator).

@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch from 9c7c0c5 to 0924a0a on June 10, 2025 13:38
@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch from 09aeeda to 6af5954 on June 11, 2025 10:31
@jgehrcke

jgehrcke commented Jun 11, 2025

For the record, how this evolved (describing the scenario of the operator-provided GPU driver):

  1. In the first days of building this, I converged on mounting /run into the init container. I did this after noticing that the Operator's driver installer mounts /run/nvidia, and I wanted to go "above that" because it helped with instabilities that I did not fully understand at the time.

  2. In review, we looked at what the Operator's driver validator is doing -- it mounts /run/nvidia/driver -- and asked critically: why can't we do the same in this init container? So, we tried that.

  3. After trying hard to build the init container by mounting just /run/nvidia/driver, we identified at least two rather nasty instability categories. At least one instability source related to the fact that the Operator's teardown / resource cleanup is not (and can never be) fully predictable: the host filesystem's state at /run/nvidia has to be treated as unknown when the init container comes up -- the path may or may not exist, something may or may not be mounted there, and there may or may not be a bunch of mounts below that.

  4. We decided to go back to a conceptually safer approach where we can be sure that we are not suffering from disjoint filesystem / mount hierarchies, but have one guaranteed common node in the hierarchy. However, instead of taking the overly conservative approach of mounting /run (cf. (1)), we tried working with /run/nvidia. Gladly, that seems to work well so far (no instability found yet after many rounds of testing); see the sketch below.
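
In practice, a directory at or above the driver root is mounted at a fixed path (/driver-root-parent) inside the container, with mountPropagation: HostToContainer, and /driver-root is set up as a symlink into it. A sketch consistent with the 'create symlink' log lines in the test output below (assumed logic; the actual script may differ):

# Sketch (assumed logic): point /driver-root at the configured driver root via
# the /driver-root-parent mount, matching the "create symlink" log lines below.
if [ "${NVIDIA_DRIVER_ROOT}" = "/run/nvidia/driver" ]; then
  # operator-provided driver: /driver-root-parent is /run/nvidia on the host
  ln -s /driver-root-parent/driver /driver-root
else
  # host-provided driver: /driver-root-parent is NVIDIA_DRIVER_ROOT itself
  ln -s /driver-root-parent/ /driver-root
fi
echo "create symlink: /driver-root -> $(readlink /driver-root)"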

@klueska klueska left a comment

Thanks for all the back and forth on this. I feel like we learned a lot while putting this PR together. Not necessarily about the 100% right way to do this, but definitely a lot about how not to do it.

@jgehrcke jgehrcke force-pushed the jp/plugin-init-cont branch from c985522 to b33f9fd on June 11, 2025 13:35
@jgehrcke

Tested the latest here again manually.

Host-provided driver:

create symlink: /driver-root -> /driver-root-parent/
2025-06-11T13:42:56Z  /driver-root (/ on host): ...
...
nvidia-smi returned with code 0: success, leave

✔️

Operator-provided driver:

create symlink: /driver-root -> /driver-root-parent/driver
2025-06-11T13:44:56Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].

Check failed. Has the NVIDIA GPU driver been set up? It is expected to be installed under NVIDIA_DRIVER_ROOT (currently set to '/run/nvidia/driver') in the host filesystem. If that path appears to be unexpected: review the DRA driver's 'nvidiaDriverRoot' Helm chart variable. Otherwise, review if the GPU driver has actually been installed under that path.
Hint: Directory /run/nvidia/driver is not empty but at least one of the binaries wasn't found.
Hint: NVIDIA_DRIVER_ROOT is set to '/run/nvidia/driver' which typically means that the NVIDIA GPU Operator manages the GPU driver. Make sure that the GPU Operator is deployed and healthy.

2025-06-11T13:45:06Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-11T13:45:16Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-11T13:45:26Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-11T13:45:36Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-11T13:45:46Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: not found, libnvidia-ml.so.1: not found, current contents: [lib].
2025-06-11T13:45:56Z  /driver-root (/run/nvidia/driver on host): nvidia-smi: '/driver-root/usr/bin/nvidia-smi', libnvidia-ml.so.1: '/driver-root/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1', current contents: [bin  boot  cuda-keyring_1.1-1_all.deb  dev  drivers  etc  home  host-etc  lib  licenses  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var].
invoke: env -i LD_PRELOAD=/driver-root/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1 /driver-root/usr/bin/nvidia-smi
<snip>
nvidia-smi returned with code 0: success, leave

✔️

@jgehrcke jgehrcke merged commit 0a18f1a into NVIDIA:main Jun 11, 2025
13 checks passed
@klueska klueska added this to the v25.3.0 milestone Aug 13, 2025