I have a few Grace Hopper GH200s that I am trying to cluster up using k8s.
On the host I have the 560 drivers running from the repos:
cat /etc/apt/sources.list.d/nvidia-container-toolkit.list
deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
nvidia-smi
Sat Mar 15 16:44:47 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05 Driver Version: 560.35.05 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GH200 480GB Off | 00000009:01:00.0 Off | 0 |
| N/A 27C P0 73W / 900W | 1MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
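For what it's worth, this is the sort of sanity check I run on the host before involving Kubernetes, to confirm the driver and container toolkit work together (assuming Docker with the NVIDIA runtime is on the node; the CUDA image tag is just an example):
nvidia-ctk --version
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi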
I can also upgrade to 570 using the NVIDIA .run installer, if anyone thinks that would work better.
Currently I have the gpu-operator installed as follows:
helm install --wait nvidiagpu \
-n gpu-operator --create-namespace \
--set toolkit.enabled=false \
--set driver.enabled=true \
--set nfd.enabled=true \
nvidia/gpu-operator
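After the install I check the operator pods and whether the node ends up advertising the GPU resource (node name is a placeholder):
kubectl get pods -n gpu-operator
kubectl describe node <gh200-node> | grep -i 'nvidia.com/gpu'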
After a bit, the logs from the gpu-feature-discovery container start like this:
kubectl logs -n gpu-operator gpu-feature-discovery-vnvvf
I0315 15:59:46.358279 1 main.go:163] Starting OS watcher.
I0315 15:59:46.358425 1 main.go:168] Loading configuration.
I0315 15:59:46.358648 1 main.go:180]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "single",
"failOnInitError": true,
"gdsEnabled": null,
"mofedEnabled": null,
"useNodeFeatureAPI": false,
"deviceDiscoveryStrategy": "auto",
"plugin": {
"passDeviceSpecs": null,
"deviceListStrategy": null,
"deviceIDStrategy": null,
"cdiAnnotationPrefix": null,
"nvidiaCTKPath": null,
"containerDriverRoot": "/driver-root"
},
"gfd": {
"oneshot": false,
"noTimestamp": false,
"sleepInterval": "1m0s",
"outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
"machineTypeFile": "/sys/class/dmi/id/product_name"
}
},
"resources": {
"gpus": null
},
"sharing": {
"timeSlicing": {}
},
"imex": {}
}
I0315 15:59:46.370695 1 factory.go:49] Using NVML manager
I0315 15:59:46.370708 1 main.go:210] Start running
I0315 15:59:46.397097 1 main.go:274] Creating Labels
I0315 15:59:46.397110 1 output.go:82] Writing labels to output file /etc/kubernetes/node-feature-discovery/features.d/gfd
I0315 15:59:46.397596 1 main.go:283] Sleeping for 60000000000
It's the "gpus": null that concerns me, and sure enough I don't seem to be able to run GPU workloads.
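For reference, this is roughly the smoke test I have been trying (the image tag is just an example; GH200 needs the arm64 variant):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.6.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF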
I can try with the toolkit enabled as well:
helm install --wait nvidiagpu \
-n gpu-operator --create-namespace \
--set toolkit.enabled=true \
--set driver.enabled=true \
--set nfd.enabled=true \
nvidia/gpu-operator
Where I get logs like this:
[nvidia-ctk]
path = "/usr/local/nvidia/toolkit/nvidia-ctk"
time="2025-03-15T16:56:12Z" level=info msg="Starting 'setup' for nvidia-toolkit"
time="2025-03-15T16:56:12Z" level=info msg="Using config version 2"
time="2025-03-15T16:56:12Z" level=info msg="Using CRI runtime plugin name \"io.containerd.grpc.v1.cri\""
time="2025-03-15T16:56:12Z" level=info msg="Flushing config to /runtime/config-dir/config.toml"
time="2025-03-15T16:56:12Z" level=info msg="Sending SIGHUP signal to containerd"
time="2025-03-15T16:56:12Z" level=warning msg="Error signaling containerd, attempt 1/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:17Z" level=warning msg="Error signaling containerd, attempt 2/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:22Z" level=warning msg="Error signaling containerd, attempt 3/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:27Z" level=warning msg="Error signaling containerd, attempt 4/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:32Z" level=warning msg="Error signaling containerd, attempt 5/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:37Z" level=warning msg="Max retries reached 6/6, aborting"
time="2025-03-15T16:56:37Z" level=info msg="Shutting Down"
time="2025-03-15T16:56:37Z" level=error msg="error running nvidia-toolkit: unable to setup runtime: unable to restart containerd: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
But I think that is because the container toolkit is already installed on the host? Or should this /runtime/sock-dir be set somewhere? I'm uncertain and wondering if there is any advice out there.
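If it helps, my next thought was to point the toolkit container at the host's containerd socket and config explicitly via the chart's toolkit env values, something like this (the paths assume a stock containerd install and may be wrong for my setup, which is partly what I'm asking):
helm install --wait nvidiagpu \
-n gpu-operator --create-namespace \
--set driver.enabled=true \
--set nfd.enabled=true \
--set toolkit.enabled=true \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/etc/containerd/config.toml \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/run/containerd/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia \
nvidia/gpu-operator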