Description
1. Quick Debug Information
- OS/Version: RHEL 8.8
- Kernel Version: 4.18.0-477.27.1.el8_8.x86_64
- Container Runtime Type/Version: containerd
- K8s Flavor/Version: v1.26.2
- GPU Operator Version: 23.6.1
2. Issue or feature description
In our setup, whenever a GPU node is restarted, the stateful workloads (pods consuming PVCs through a CSI driver) go into CrashLoopBackOff once the node becomes Ready. We suspect the containerd restart performed by the GPU operator (container-toolkit pod) is causing this, and we are continuing our investigation.
We have observed that whenever the nvidia-container-toolkit pod starts, it configures the host's containerd config.toml with the nvidia runtime and then restarts the containerd service on the host.
Since there are no changes to the containerd configuration on the node (for example, in a node-restart scenario), why does the GPU operator (container-toolkit pod) restart the containerd service when the nvidia runtime information is already persisted?
Could a check be added to the container-toolkit pod so that it does not restart the containerd service if the configuration is unchanged? That is, when the pod comes up it would inspect the existing configuration, and if everything is already correct and no modification is needed, skip the restart; containerd would be restarted only when actually required (a rough sketch follows).
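As an illustration of the kind of check we mean, here is a minimal Go sketch that compares the containerd config already on the host with the config the toolkit is about to write, and only signals a restart when they differ. The helper name `configUnchanged`, the embedded TOML fragment, and the byte-for-byte comparison are assumptions for illustration only, not how the toolkit container is actually implemented.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// desiredRuntimeSnippet is roughly the fragment the toolkit adds to config.toml
// for the nvidia runtime; it is shown here only so the comparison has something
// concrete to work with. The real toolkit renders the full config, not just this.
const desiredRuntimeSnippet = `
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
`

// configUnchanged compares the config currently on the host with the config the
// toolkit is about to write; a match means containerd does not need a restart.
func configUnchanged(currentPath string, desired []byte) (bool, error) {
	current, err := os.ReadFile(currentPath)
	if err != nil {
		if os.IsNotExist(err) {
			return false, nil // no config yet: write it and restart
		}
		return false, err
	}
	return bytes.Equal(current, desired), nil
}

func main() {
	desired := []byte(desiredRuntimeSnippet) // placeholder for the fully rendered config

	same, err := configUnchanged("/etc/containerd/config.toml", desired)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	if same {
		fmt.Println("containerd config already up to date; skipping restart")
		return
	}
	fmt.Println("config changed; writing config and restarting containerd")
}
```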
The same behaviour exists for the nvidia-driver pod: when it comes up it cleans up the node, removes the drivers and kernel modules, and re-installs them. I think this can also be avoided. Instead of always cleaning up, it should check the necessary files, driver version, kernel version, and driver health, and only perform the cleanup and install when actually needed. This would help bring the GPU operator up in less time.
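A similarly hedged sketch of the driver-side check: before cleaning up and reinstalling, the driver container could verify that the currently loaded nvidia module already matches the desired version. Reading /sys/module/nvidia/version is only one possible health signal, and the driverHealthy helper and hard-coded version below are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// driverHealthy reports whether an nvidia kernel module is already loaded and
// matches the version we want, so the cleanup/reinstall path could be skipped.
// The path and the notion of "healthy" are simplifying assumptions.
func driverHealthy(wantVersion string) bool {
	// /sys/module/nvidia/version only exists when the nvidia module is loaded.
	data, err := os.ReadFile("/sys/module/nvidia/version")
	if err != nil {
		return false // module not loaded (or unreadable): install is needed
	}
	loaded := strings.TrimSpace(string(data))
	return loaded == wantVersion
}

func main() {
	want := "535.104.05" // example target version; in practice this comes from the operator
	if driverHealthy(want) {
		fmt.Println("driver", want, "already loaded; skipping cleanup and reinstall")
		return
	}
	fmt.Println("driver missing or version mismatch; performing cleanup and install")
}
```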