
Grace Hopper 200 (GH200) install recommendations? #1347

@joshuacox

Description

I have a few Grace Hopper 200s (GH200) that I am trying to cluster with Kubernetes.

On the host I have the 560 driver installed, and the NVIDIA container toolkit repo configured:

cat /etc/apt/sources.list.d/nvidia-container-toolkit.list 
deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
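
For completeness, the toolkit itself was installed on the host and containerd configured roughly like this (a sketch; this assumes stock containerd and the usual nvidia-ctk flow):

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd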

nvidia-smi
Sat Mar 15 16:44:47 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 480GB             Off |   00000009:01:00.0 Off |                    0 |
| N/A   27C    P0             73W /  900W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

I can also upgrade to 570 using the NVIDIA .run installer, if anyone thinks that would work better.

Currently I have the gpu-operator installed like this:

helm install --wait nvidiagpu \
-n gpu-operator --create-namespace \
--set toolkit.enabled=false \
--set driver.enabled=true \
--set nfd.enabled=true \
nvidia/gpu-operator
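
To sanity-check the install I watch the operator pods come up and then look at what the node advertises (the node name below is illustrative):

kubectl get pods -n gpu-operator
kubectl get node my-gh200-node -o jsonpath='{.status.allocatable}'

If everything were healthy I'd expect to see nvidia.com/gpu: 1 in the allocatable map.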

After a bit, the logs from the gpu-feature-discovery container look like this:

kubectl logs -n gpu-operator gpu-feature-discovery-vnvvf
I0315 15:59:46.358279       1 main.go:163] Starting OS watcher.
I0315 15:59:46.358425       1 main.go:168] Loading configuration.
I0315 15:59:46.358648       1 main.go:180] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "gdsEnabled": null,
    "mofedEnabled": null,
    "useNodeFeatureAPI": false,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": null,
      "deviceListStrategy": null,
      "deviceIDStrategy": null,
      "cdiAnnotationPrefix": null,
      "nvidiaCTKPath": null,
      "containerDriverRoot": "/driver-root"
    },
    "gfd": {
      "oneshot": false,
      "noTimestamp": false,
      "sleepInterval": "1m0s",
      "outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
      "machineTypeFile": "/sys/class/dmi/id/product_name"
    }
  },
  "resources": {
    "gpus": null
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I0315 15:59:46.370695       1 factory.go:49] Using NVML manager
I0315 15:59:46.370708       1 main.go:210] Start running
I0315 15:59:46.397097       1 main.go:274] Creating Labels
I0315 15:59:46.397110       1 output.go:82] Writing labels to output file /etc/kubernetes/node-feature-discovery/features.d/gfd
I0315 15:59:46.397596       1 main.go:283] Sleeping for 60000000000
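
For reference, I check whether the labels GFD writes actually land on the node like this (again, node name is just an example):

kubectl get node my-gh200-node --show-labels | tr ',' '\n' | grep nvidia.com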

It's the "gpus": null that concerns me, and sure enough I don't seem to be able to run GPU workloads.
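
To make "can't run GPU workloads" concrete, this is the minimal test pod I've been using; the image is NVIDIA's CUDA vectoradd sample and the tag is illustrative (it may need swapping for one with an arm64 build, since GH200 is aarch64):

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1

As long as the node never advertises nvidia.com/gpu, I'd expect this to sit in Pending.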

I can also try with the toolkit enabled (note toolkit.enabled=true this time):

helm install --wait nvidiagpu \
-n gpu-operator --create-namespace \
--set toolkit.enabled=true \
--set driver.enabled=true \
--set nfd.enabled=true \
nvidia/gpu-operator

This time the nvidia-toolkit container logs look like this:

[nvidia-ctk]
  path = "/usr/local/nvidia/toolkit/nvidia-ctk"
time="2025-03-15T16:56:12Z" level=info msg="Starting 'setup' for nvidia-toolkit"
time="2025-03-15T16:56:12Z" level=info msg="Using config version 2"
time="2025-03-15T16:56:12Z" level=info msg="Using CRI runtime plugin name \"io.containerd.grpc.v1.cri\""
time="2025-03-15T16:56:12Z" level=info msg="Flushing config to /runtime/config-dir/config.toml"
time="2025-03-15T16:56:12Z" level=info msg="Sending SIGHUP signal to containerd"
time="2025-03-15T16:56:12Z" level=warning msg="Error signaling containerd, attempt 1/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:17Z" level=warning msg="Error signaling containerd, attempt 2/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:22Z" level=warning msg="Error signaling containerd, attempt 3/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:27Z" level=warning msg="Error signaling containerd, attempt 4/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:32Z" level=warning msg="Error signaling containerd, attempt 5/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:37Z" level=warning msg="Max retries reached 6/6, aborting"
time="2025-03-15T16:56:37Z" level=info msg="Shutting Down"
time="2025-03-15T16:56:37Z" level=error msg="error running nvidia-toolkit: unable to setup runtime: unable to restart containerd: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"

I think that may be because the container toolkit is already installed on the host? Or does this /runtime/sock-dir path need to be set explicitly? I'm uncertain, and wondering if anyone has advice.
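
One thing I found in the chart: the toolkit container can apparently be pointed at a non-default containerd config and socket via toolkit.env. Something along these lines, where the paths are the k3s-style ones purely as an example:

helm upgrade --install nvidiagpu nvidia/gpu-operator \
  -n gpu-operator \
  --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
  --set 'toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml' \
  --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
  --set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock'

I haven't confirmed what the right paths would be for my setup, hence the question.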
