Description
The RKE2 docs only mention passing CONTAINERD_SOCKET for RKE2's internal containerd: https://docs.rke2.io/advanced?_highlight=gpu#deploy-nvidia-operator
Nvidia's docs additionally pass CONTAINERD_CONFIG: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#rancher-kubernetes-engine-2
Following the gpu-operator documentation, the following happens:
- gpu-operator writes its containerd config into /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
- RKE2 picks it up as a template and renders a dedicated containerd config: /var/lib/rancher/rke2/agent/etc/containerd/config.toml
- the cluster does not come back up after a reboot, since the config provided by gpu-operator does not work with RKE2 (a sketch of a template that would keep RKE2's defaults follows below)
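For reference, a template that would not break RKE2 would extend RKE2's generated defaults instead of replacing them. This is only an untested sketch, assuming an RKE2 version whose containerd template supports the {{ template "base" . }} directive:

# /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl (hypothetical sketch)
# Pull in RKE2's generated defaults, then append the nvidia runtime on top.
{{ template "base" . }}

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"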
The most significant errors in the logs would be:
Sep 13 14:08:23 rke2 rke2[10318]: time="2024-09-13T14:08:23Z" level=info msg="Pod for etcd not synced (pod sandbox has changed), retrying"
Sep 13 14:08:23 rke2 rke2[10318]: time="2024-09-13T14:08:23Z" level=info msg="Waiting for API server to become available"
Sep 13 14:08:25 rke2 rke2[10318]: time="2024-09-13T14:08:25Z" level=warning msg="Failed to list nodes with etcd role: runtime core not ready"
Sep 13 14:08:25 rke2 rke2[10318]: time="2024-09-13T14:08:25Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
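These were taken from the rke2-server systemd unit; a quick way to filter for them (on a server node; agents log to rke2-agent instead):

# Filter RKE2 server logs for the symptoms above
journalctl -u rke2-server --no-pager | grep -Ei "etcd|api server|runtime core"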
Following the RKE2 docs and passing only CONTAINERD_SOCKET works, because gpu-operator then writes its (RKE2-incompatible) config into /etc/containerd/config.toml, even though containerd is not installed at the OS level:
root@rke2:~# apt list --installed | grep containerd
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
root@rke2:~#
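A quick way to confirm where the toolkit actually dropped its config (paths as observed on this node):

# No OS-level containerd package, yet the toolkit's config ends up here:
ls -l /etc/containerd/config.toml
# RKE2 keeps its own containerd config under its data dir:
ls -l /var/lib/rancher/rke2/agent/etc/containerd/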
It looks like the containerd config provided by gpu-operator doesn't matter with RKE2, since RKE2 is able to detect nvidia-container-runtime and configure its own containerd config with an nvidia runtime class:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
SystemdCgroup = true
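For completeness, a smoke test against that runtime class might look like the following. Note that gpu-operator usually creates the nvidia RuntimeClass itself, so creating one manually is only needed if it is missing, and the CUDA image tag here is just an assumption:

# Create the RuntimeClass only if the operator has not already done so
kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

# Hypothetical test pod running nvidia-smi via the nvidia runtime
kubectl run nvidia-smi --rm -it --restart=Never \
  --image=nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 \
  --overrides='{"spec":{"runtimeClassName":"nvidia"}}' \
  -- nvidia-smi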
Steps to reproduce on Ubuntu 22.04:
Following Nvidia's docs breaks the RKE2 cluster after a reboot:
helm install gpu-operator -n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia \
--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
--set-string toolkit.env[3].value=true
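If a node is already stuck after the reboot, one recovery path (assuming nothing else customized the template) is to remove the operator-written template so RKE2 regenerates its default config:

# Remove the broken template and let RKE2 regenerate config.toml on restart
rm /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
systemctl restart rke2-server   # rke2-agent on worker nodes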
Following RKE2's docs works fine:
helm install gpu-operator -n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set toolkit.env[0].name=CONTAINERD_SOCKET \
--set toolkit.env[0].value=/run/k3s/containerd/containerd.sock
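A quick check after install that the toolkit targeted RKE2's containerd and that the nvidia runtime was added (paths as above):

# Operator pods should settle, and RKE2's rendered config should gain the nvidia runtime
kubectl get pods -n gpu-operator
grep -A 3 'runtimes."nvidia"' /var/lib/rancher/rke2/agent/etc/containerd/config.toml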
Could someone verify the docs?