Description
Hi,
I was following the tutorial to get DRA running, and everything worked as expected until the installation of the driver.
The kubelet plugin fails right away with:
Error: error creating driver: failed to create device library: failed to locate driver libraries: error locating "libnvidia-ml.so.1"
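In case it helps with diagnosis, this is roughly how I would check whether the driver libraries are visible inside the kind worker node at all (the docker exec target is the worker node container from my cluster below; treat it as an assumption for other setups):
# list cached shared libraries and filter for the NVML library
docker exec -it k8s-dra-driver-cluster-worker ldconfig -p | grep libnvidia-ml
# fall back to a filesystem search if ldconfig does not know about it
docker exec -it k8s-dra-driver-cluster-worker find / -name 'libnvidia-ml.so.1' 2>/dev/null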
Some extra information:
In order for the cluster to be able to mount the CDI device, I had to change the name/path from runtime.nvidia.com/gpu/all to nvidia.com/gpu/all, since that is the only CDI device I see (listing below, plus the command I would use to regenerate the spec):
➜ k8s-dra-driver git:(main) ✗ sudo nvidia-ctk cdi list
INFO[0000] Found 1 CDI devices
nvidia.com/gpu=all
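For reference, this is the standard nvidia-ctk flow for (re)generating a CDI spec, in case the naming difference comes from how the spec was produced (the output path below is just the common default location, an assumption on my side):
# write a CDI spec describing all visible GPUs
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# verify which devices the spec exposes
nvidia-ctk cdi list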
Here are some details about my setup:
GPU: GTX 1080
OS: Windows/WSL
nvidia-container-toolkit version:
NVIDIA Container Toolkit CLI version 1.17.0
commit: 5bc031544833253e3ab6a36daec376dc13a4f479
runtime config:
➜ k8s-dra-driver git:(main) ✗ nvidia-ctk runtime configure --dry-run
INFO[0000] Loading config from /etc/docker/daemon.json
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"args": [],
"path": "nvidia-container-runtime"
}
}
}
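For completeness, this daemon.json is what I would expect a configuration command along these lines to produce (my assumption about how the runtime was registered; Docker needs a restart for it to take effect):
# register the nvidia runtime with Docker and make it the default
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
# restart Docker afterwards so the new default runtime is picked up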
nvidia-smi command on host:
➜ k8s-dra-driver git:(main) ✗ nvidia-smi
Tue Nov 12 20:54:32 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.02 Driver Version: 566.03 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce GTX 1080 On | 00000000:01:00.0 On | N/A |
| 0% 58C P0 45W / 210W | 1907MiB / 8192MiB | 3% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
nvidia-smi command on node:
root@k8s-dra-driver-cluster-worker:/# nvidia-smi
Tue Nov 12 19:55:46 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.02 Driver Version: 566.03 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce GTX 1080 On | 00000000:01:00.0 On | N/A |
| 0% 61C P0 46W / 210W | 1903MiB / 8192MiB | 3% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Any help would be highly appreciated!
I already saw similar behavior in this issue, but it's not the same: #65