Description
1. Quick Debug Information
- OS/Version: RHEL 8.8
- Kernel Version: 4.18.0-477.27.1.el8_8.x86_64
- Container Runtime Type/Version: containerd
- K8s Flavor/Version: v1.26.2
- GPU Operator Version: 23.6.1
2. Issue or feature description
In our setup, whenever a GPU node is restarted, the stateful workloads (pods consuming PVCs through a CSI driver) go into CrashLoopBackOff once the node becomes Ready. We suspect the containerd restart performed by the GPU operator (container-toolkit pod) is causing this, and we are continuing our investigation.
We have observed that whenever the nvidia-container-toolkit pod starts, it configures the host's containerd config.toml with the nvidia runtime and then restarts the containerd service on the host.
Since there are no changes to the containerd configuration on the node (for example, in a node-restart scenario), why does the GPU operator (container-toolkit pod) restart the containerd service when the nvidia runtime information is already persisted?
Could a check be added to the container-toolkit pod so that it does not restart the containerd service if the configuration is unchanged? That is, when the pod comes up it would inspect the existing configuration, and if everything is already correct and no modification is needed, skip the restart; containerd would be restarted only when actually required (a rough sketch follows).
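As an illustration of the kind of check we mean, here is a minimal Go sketch that compares the containerd config already on the host with the config the toolkit is about to write, and only signals a restart when they differ. The helper name `configUnchanged`, the embedded TOML fragment, and the byte-for-byte comparison are assumptions for illustration only, not how the toolkit container is actually implemented.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// desiredRuntimeSnippet is roughly the fragment the toolkit adds to config.toml
// for the nvidia runtime; it is shown here only so the comparison has something
// concrete to work with. The real toolkit renders the full config, not just this.
const desiredRuntimeSnippet = `
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
`

// configUnchanged compares the config currently on the host with the config the
// toolkit is about to write; a match means containerd does not need a restart.
func configUnchanged(currentPath string, desired []byte) (bool, error) {
	current, err := os.ReadFile(currentPath)
	if err != nil {
		if os.IsNotExist(err) {
			return false, nil // no config yet: write it and restart
		}
		return false, err
	}
	return bytes.Equal(current, desired), nil
}

func main() {
	desired := []byte(desiredRuntimeSnippet) // placeholder for the fully rendered config

	same, err := configUnchanged("/etc/containerd/config.toml", desired)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	if same {
		fmt.Println("containerd config already up to date; skipping restart")
		return
	}
	fmt.Println("config changed; writing config and restarting containerd")
}
```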
The same behaviour exists for the nvidia-driver pod: when it comes up it cleans up the node, removes the drivers and kernel modules, and re-installs them. I think this can also be avoided. Instead of always cleaning up, it should check the necessary files, driver version, kernel version, and driver health, and only perform the cleanup and install when actually needed. This would help bring the GPU operator up in less time.
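A similarly hedged sketch of the driver-side check: before cleaning up and reinstalling, the driver container could verify that the currently loaded nvidia module already matches the desired version. Reading /sys/module/nvidia/version is only one possible health signal, and the driverHealthy helper and hard-coded version below are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// driverHealthy reports whether an nvidia kernel module is already loaded and
// matches the version we want, so the cleanup/reinstall path could be skipped.
// The path and the notion of "healthy" are simplifying assumptions.
func driverHealthy(wantVersion string) bool {
	// /sys/module/nvidia/version only exists when the nvidia module is loaded.
	data, err := os.ReadFile("/sys/module/nvidia/version")
	if err != nil {
		return false // module not loaded (or unreadable): install is needed
	}
	loaded := strings.TrimSpace(string(data))
	return loaded == wantVersion
}

func main() {
	want := "535.104.05" // example target version; in practice this comes from the operator
	if driverHealthy(want) {
		fmt.Println("driver", want, "already loaded; skipping cleanup and reinstall")
		return
	}
	fmt.Println("driver missing or version mismatch; performing cleanup and install")
}
```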