
Conversation

@cdesiniotis (Contributor) commented Oct 16, 2025

This commit updates the behavior of the nvidia-ctk-installer for cri-o.
On shutdown, we no longer delete the drop-in config file as long as
none of the nvidia runtime handlers are set as the default runtime.
This change was made to work around an issue observed when uninstalling
the gpu-operator -- management containers launched with the nvidia
runtime handler would get stuck in the terminating state with the below
error message:

```
failed to find runtime handler nvidia from runtime list map[crun:... runc:...], failed to "KillPodSandbox" for ...
```

There appears to be a race condition where the nvidia-ctk-installer removes the drop-in file
and restarts cri-o. After the cri-o restart, if there are still pods / containers to terminate
that were started with the nvidia runtime, then cri-o fails to terminate them. The behavior
of cri-o, and its in-memory runtime handler cache, appears to differ from that of containerd as
we have never encountered such an issue with containerd.

This commit can be considered a stop-gap solution until a more robust solution is developed.

Signed-off-by: Christopher Desiniotis [email protected]
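
To make the new shutdown behavior concrete, here is a minimal Go sketch of the guard described above. It is illustrative only, not the nvidia-ctk-installer implementation: `SetAsDefault` mirrors the option discussed in the review below, while `revertConfig` and `restartCRIO` are hypothetical stand-ins for the real revert and restart steps.

```go
package main

import "fmt"

// Options holds the subset of installer options relevant to this sketch.
type Options struct {
	// SetAsDefault mirrors the flag that makes an nvidia handler the
	// default cri-o runtime.
	SetAsDefault bool
}

// cleanupCRIO reverts the cri-o configuration on shutdown only when an
// nvidia handler is the default runtime. Otherwise the drop-in config is
// left in place so cri-o can still resolve the "nvidia" handler while
// terminating pods that were started with it.
func (o *Options) cleanupCRIO() error {
	if !o.SetAsDefault {
		// Keep the drop-in config: removing it and restarting cri-o can
		// leave pods stuck terminating with
		// `failed to find runtime handler nvidia from runtime list ...`.
		return nil
	}
	if err := revertConfig(); err != nil {
		return fmt.Errorf("failed to revert cri-o config: %w", err)
	}
	return restartCRIO()
}

// Hypothetical stand-ins for the real revert and restart steps.
func revertConfig() error { return nil }
func restartCRIO() error  { return nil }

func main() {
	o := &Options{SetAsDefault: false}
	fmt.Println(o.cleanupCRIO()) // <nil>: config left untouched
}
```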

@cdesiniotis cdesiniotis force-pushed the do-not-cleanup-crio-drop-in branch 8 times, most recently from 7d12529 to c9c7aa7 on October 17, 2025 05:50
@cdesiniotis cdesiniotis marked this pull request as ready for review October 17, 2025 05:50
@cdesiniotis cdesiniotis changed the title [nvidia-ctk-installer] do not revert cri-o config when nvidia is not … [nvidia-ctk-installer] do not revert cri-o config on shutdown Oct 17, 2025
@elezar (Member) commented Oct 17, 2025

One question: Even if we don't delete the drop-in file, do we not still remove the binaries?

Comment on lines +183 to +185

```go
if !o.SetAsDefault {
	return nil
}
```
Member

Don't we still need to check whether we're using a drop-in file? We could be modifying the top-level config directly.

Contributor Author

I am not sure I follow why that matters. Regardless of whether we use a drop-in file or not, removing the nvidia runtime from the cri-o config can lead to the issues described in the PR description.

The reason I thought to add this conditional is that we must attempt to revert the config if nvidia is the default runtime, or else all future pods will fail to function (after the nvidia runtime binaries are removed).
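
For illustration only (this is not the project's actual code): the `SetAsDefault` guard sits in front of both configuration styles, so it short-circuits the revert whether the nvidia runtime was registered through a drop-in file or by modifying the top-level cri-o config directly. `usesDropIn`, `removeDropIn`, and `revertTopLevel` are hypothetical names.

```go
package main

// Options is a pared-down stand-in for the installer options.
type Options struct {
	SetAsDefault bool
}

// revertCRIOConfig sketches how the guard applies to both config styles.
func revertCRIOConfig(o *Options, usesDropIn bool) error {
	if !o.SetAsDefault {
		// nvidia is not the default runtime: leave the existing config
		// (drop-in or top-level) untouched so in-flight pods started with
		// the nvidia handler can still be terminated.
		return nil
	}
	if usesDropIn {
		// Remove the drop-in file and restart cri-o.
		return removeDropIn()
	}
	// Strip the nvidia entries from the top-level config and restart cri-o.
	return revertTopLevel()
}

// Hypothetical stand-ins for the real cleanup steps.
func removeDropIn() error   { return nil }
func revertTopLevel() error { return nil }

func main() {
	_ = revertCRIOConfig(&Options{SetAsDefault: false}, true) // drop-in left in place
}
```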

@cdesiniotis (Contributor Author)

> One question: Even if we don't delete the drop-in file, do we not still remove the binaries?

I believe we still do, yes.

@cdesiniotis cdesiniotis force-pushed the do-not-cleanup-crio-drop-in branch 3 times, most recently from e6d4b6e to 521c243 on October 17, 2025 20:59
@cdesiniotis cdesiniotis force-pushed the do-not-cleanup-crio-drop-in branch from 42a4ad6 to 9fab81c on October 19, 2025 15:11
@cdesiniotis cdesiniotis merged commit dda4b93 into NVIDIA:main Oct 19, 2025
13 checks passed
@elezar (Member) commented Oct 21, 2025

> One question: Even if we don't delete the drop-in file, do we not still remove the binaries?
>
> I believe we still do, yes.

Just to close the loop on this: in checking the implementation yesterday, we do unconfigure the runtime, but do NOT remove the binaries. This is why containerd still functions and why cri-o functions with this change. This should be a sufficient workaround for the time being, but the implications of being able to trigger the nvidia runtime even when it is not, strictly speaking, configured / available should be investigated.
