I have a few Grace Hopper GH200s that I am trying to cluster up using k8s.
On the host I have the 560 drivers running from the repos:
cat /etc/apt/sources.list.d/nvidia-container-toolkit.list
deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
nvidia-smi
Sat Mar 15 16:44:47 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05 Driver Version: 560.35.05 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GH200 480GB Off | 00000009:01:00.0 Off | 0 |
| N/A 27C P0 73W / 900W | 1MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
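For what it's worth, this is the sort of sanity check I run on the host before involving Kubernetes, to confirm the driver and container toolkit work together (assuming Docker with the NVIDIA runtime is on the node; the CUDA image tag is just an example):
nvidia-ctk --version
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi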
I can also upgrade to 570 using the NVIDIA .run installer, if anyone thinks that would work better.
Currently I have the gpu-operator installed as follows:
helm install --wait nvidiagpu \
-n gpu-operator --create-namespace \
--set toolkit.enabled=false \
--set driver.enabled=true \
--set nfd.enabled=true \
nvidia/gpu-operator
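After the install I check the operator pods and whether the node ends up advertising the GPU resource (node name is a placeholder):
kubectl get pods -n gpu-operator
kubectl describe node <gh200-node> | grep -i 'nvidia.com/gpu'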
After a bit, the logs from the gpu-feature-discovery container start like this:
kubectl logs -n gpu-operator gpu-feature-discovery-vnvvf
I0315 15:59:46.358279 1 main.go:163] Starting OS watcher.
I0315 15:59:46.358425 1 main.go:168] Loading configuration.
I0315 15:59:46.358648 1 main.go:180]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "single",
"failOnInitError": true,
"gdsEnabled": null,
"mofedEnabled": null,
"useNodeFeatureAPI": false,
"deviceDiscoveryStrategy": "auto",
"plugin": {
"passDeviceSpecs": null,
"deviceListStrategy": null,
"deviceIDStrategy": null,
"cdiAnnotationPrefix": null,
"nvidiaCTKPath": null,
"containerDriverRoot": "/driver-root"
},
"gfd": {
"oneshot": false,
"noTimestamp": false,
"sleepInterval": "1m0s",
"outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
"machineTypeFile": "/sys/class/dmi/id/product_name"
}
},
"resources": {
"gpus": null
},
"sharing": {
"timeSlicing": {}
},
"imex": {}
}
I0315 15:59:46.370695 1 factory.go:49] Using NVML manager
I0315 15:59:46.370708 1 main.go:210] Start running
I0315 15:59:46.397097 1 main.go:274] Creating Labels
I0315 15:59:46.397110 1 output.go:82] Writing labels to output file /etc/kubernetes/node-feature-discovery/features.d/gfd
I0315 15:59:46.397596 1 main.go:283] Sleeping for 60000000000
It's the "gpus": null that concerns me, and sure enough I don't seem to be able to run GPU workloads.
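For reference, this is roughly the smoke test I have been trying (the image tag is just an example; GH200 needs the arm64 variant):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.6.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF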
I can try with the toolkit enabled as well:
helm install --wait nvidiagpu \
-n gpu-operator --create-namespace \
--set toolkit.enabled=true \
--set driver.enabled=true \
--set nfd.enabled=true \
nvidia/gpu-operator
Where I get logs like this:
[nvidia-ctk]
path = "/usr/local/nvidia/toolkit/nvidia-ctk"
time="2025-03-15T16:56:12Z" level=info msg="Starting 'setup' for nvidia-toolkit"
time="2025-03-15T16:56:12Z" level=info msg="Using config version 2"
time="2025-03-15T16:56:12Z" level=info msg="Using CRI runtime plugin name \"io.containerd.grpc.v1.cri\""
time="2025-03-15T16:56:12Z" level=info msg="Flushing config to /runtime/config-dir/config.toml"
time="2025-03-15T16:56:12Z" level=info msg="Sending SIGHUP signal to containerd"
time="2025-03-15T16:56:12Z" level=warning msg="Error signaling containerd, attempt 1/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:17Z" level=warning msg="Error signaling containerd, attempt 2/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:22Z" level=warning msg="Error signaling containerd, attempt 3/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:27Z" level=warning msg="Error signaling containerd, attempt 4/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:32Z" level=warning msg="Error signaling containerd, attempt 5/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:37Z" level=warning msg="Max retries reached 6/6, aborting"
time="2025-03-15T16:56:37Z" level=info msg="Shutting Down"
time="2025-03-15T16:56:37Z" level=error msg="error running nvidia-toolkit: unable to setup runtime: unable to restart containerd: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
But I think that is because the container toolkit is already installed on the host? Or should this /runtime/sock-dir be set somewhere? I'm uncertain and wondering if there is any advice out there.
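If it helps, my next thought was to point the toolkit container at the host's containerd socket and config explicitly via the chart's toolkit env values, something like this (the paths assume a stock containerd install and may be wrong for my setup, which is partly what I'm asking):
helm install --wait nvidiagpu \
-n gpu-operator --create-namespace \
--set driver.enabled=true \
--set nfd.enabled=true \
--set toolkit.enabled=true \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/etc/containerd/config.toml \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/run/containerd/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia \
nvidia/gpu-operator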