Following gpu-operator documentation will break RKE2 cluster after reboot #992

@aiicore

Description

The RKE2 docs only mention passing RKE2's internal CONTAINERD_SOCKET to the operator: https://docs.rke2.io/advanced?_highlight=gpu#deploy-nvidia-operator

Nvidia's docs additionally say to pass CONTAINERD_CONFIG: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#rancher-kubernetes-engine-2

Following the gpu-operator documentation, the following happens:

  • gpu-operator writes its containerd config into /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
  • rke2 picks that file up as a template and renders its dedicated containerd config: /var/lib/rancher/rke2/agent/etc/containerd/config.toml
  • the cluster does not come up after a reboot, since the config provided by gpu-operator does not work with rke2 (a recovery sketch follows below)
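
If a node is already broken this way, a minimal recovery sketch (assuming the default RKE2 server paths above; use rke2-agent instead of rke2-server on worker nodes):

# Remove the template written by gpu-operator so RKE2 regenerates its default containerd config
rm /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl

# Restart RKE2 to render a fresh config.toml
systemctl restart rke2-server

Note that as long as the toolkit daemonset still has CONTAINERD_CONFIG pointing at the .tmpl path, it will write the file again once the cluster is back up.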

The most significant errors in the logs would be:

Sep 13 14:08:23 rke2 rke2[10318]: time="2024-09-13T14:08:23Z" level=info msg="Pod for etcd not synced (pod sandbox has changed), retrying"
Sep 13 14:08:23 rke2 rke2[10318]: time="2024-09-13T14:08:23Z" level=info msg="Waiting for API server to become available"
Sep 13 14:08:25 rke2 rke2[10318]: time="2024-09-13T14:08:25Z" level=warning msg="Failed to list nodes with etcd role: runtime core not ready"
Sep 13 14:08:25 rke2 rke2[10318]: time="2024-09-13T14:08:25Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"

Following the RKE2 docs and passing only CONTAINERD_SOCKET works, because gpu-operator then writes its config (which does not work with rke2) into /etc/containerd/config.toml, which is never read since containerd is not installed at the OS level:

root@rke2:~# apt list --installed | grep containerd

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

root@rke2:~#
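
For reference, a quick sketch (not from the original report) to see where each config ends up:

# Config written by the gpu-operator toolkit -- ignored, since no OS-level containerd reads it
cat /etc/containerd/config.toml

# Config actually used by RKE2's bundled containerd
grep -A3 'runtimes."nvidia"' /var/lib/rancher/rke2/agent/etc/containerd/config.toml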

It looks like the containerd config provided by gpu-operator doesn't matter on RKE2 anyway, since RKE2 is able to detect nvidia-container-runtime and configure its own containerd config with an nvidia runtime class:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
  SystemdCgroup = true
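
With that runtime entry in place, workloads opt into it via a RuntimeClass. If a RuntimeClass named nvidia is not already present (the operator usually creates one), a minimal sketch to create it and run a hypothetical smoke-test pod:

cat <<'EOF' | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    # image tag is only an example; any CUDA base image available to the cluster works
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF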

Steps to reproduce on Ubuntu 22.04:

Following Nvidia's docs breaks the RKE2 cluster after a reboot:

helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia \
    --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
    --set-string toolkit.env[3].value=true

Following RKE2's docs works fine:

helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
    --set toolkit.env[0].name=CONTAINERD_SOCKET \
    --set toolkit.env[0].value=/run/k3s/containerd/containerd.sock
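
To confirm the difference between the two approaches, a reboot check along these lines can be run on the node (a sketch, not part of the original report):

# Reboot and wait for RKE2 to come back
reboot

# After the reboot: the service should be active and the node Ready
systemctl status rke2-server
/var/lib/rancher/rke2/bin/kubectl \
  --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes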

Could someone verify the docs?

Labels: lifecycle/stale