Following gpu-operator documentation will break RKE2 cluster after reboot #992

@aiicore

Description

The RKE2 docs only mention passing RKE2's internal CONTAINERD_SOCKET to the operator: https://docs.rke2.io/advanced?_highlight=gpu#deploy-nvidia-operator

Nvidia's docs additionally say to pass CONTAINERD_CONFIG: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#rancher-kubernetes-engine-2

Following the gpu-operator documentation, the following happens:

  • gpu-operator writes its containerd config into /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
  • rke2 picks that file up as a template and renders its dedicated containerd config: /var/lib/rancher/rke2/agent/etc/containerd/config.toml
  • the cluster does not come up after a reboot, since the config provided by gpu-operator does not work with rke2 (a recovery sketch follows below)
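
If a node is already broken this way, a minimal recovery sketch (assuming the default RKE2 server paths above; use rke2-agent instead of rke2-server on worker nodes):

# Remove the template written by gpu-operator so RKE2 regenerates its default containerd config
rm /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl

# Restart RKE2 to render a fresh config.toml
systemctl restart rke2-server

Note that as long as the toolkit daemonset still has CONTAINERD_CONFIG pointing at the .tmpl path, it will write the file again once the cluster is back up.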

The most significant errors in the logs would be:

Sep 13 14:08:23 rke2 rke2[10318]: time="2024-09-13T14:08:23Z" level=info msg="Pod for etcd not synced (pod sandbox has changed), retrying"
Sep 13 14:08:23 rke2 rke2[10318]: time="2024-09-13T14:08:23Z" level=info msg="Waiting for API server to become available"
Sep 13 14:08:25 rke2 rke2[10318]: time="2024-09-13T14:08:25Z" level=warning msg="Failed to list nodes with etcd role: runtime core not ready"
Sep 13 14:08:25 rke2 rke2[10318]: time="2024-09-13T14:08:25Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"

Following the RKE2 docs and passing only CONTAINERD_SOCKET works, because gpu-operator then writes its config (which does not work with rke2) into /etc/containerd/config.toml, which is never read since containerd is not installed at the OS level:

root@rke2:~# apt list --installed | grep containerd

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

root@rke2:~#
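
For reference, a quick sketch (not from the original report) to see where each config ends up:

# Config written by the gpu-operator toolkit -- ignored, since no OS-level containerd reads it
cat /etc/containerd/config.toml

# Config actually used by RKE2's bundled containerd
grep -A3 'runtimes."nvidia"' /var/lib/rancher/rke2/agent/etc/containerd/config.toml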

It looks like the containerd config provided by gpu-operator doesn't matter on RKE2 anyway, since RKE2 is able to detect nvidia-container-runtime and configure its own containerd config with an nvidia runtime class:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
  SystemdCgroup = true
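
With that runtime entry in place, workloads opt into it via a RuntimeClass. If a RuntimeClass named nvidia is not already present (the operator usually creates one), a minimal sketch to create it and run a hypothetical smoke-test pod:

cat <<'EOF' | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    # image tag is only an example; any CUDA base image available to the cluster works
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF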

Steps to reproduce on Ubuntu 22.04:

Following Nvidia's docs breaks the RKE2 cluster after a reboot:

helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia \
    --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
    --set-string toolkit.env[3].value=true

Following RKE2's docs works fine:

helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
    --set toolkit.env[0].name=CONTAINERD_SOCKET \
    --set toolkit.env[0].value=/run/k3s/containerd/containerd.sock
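
To confirm the difference between the two approaches, a reboot check along these lines can be run on the node (a sketch, not part of the original report):

# Reboot and wait for RKE2 to come back
reboot

# After the reboot: the service should be active and the node Ready
systemctl status rke2-server
/var/lib/rancher/rke2/bin/kubectl \
  --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes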

Could someone verify the docs?

Labels: lifecycle/stale