Conversation

@cdesiniotis (Contributor) commented Aug 1, 2025

Review commit-by-commit.

@cdesiniotis (Contributor Author)

The e2e test failures should be fixed by #1580

@coveralls commented Aug 5, 2025

Coverage Status

coverage: 19.424% (+0.3%) from 19.087% when pulling ea036db on cdi-by-default into 38f6413 on main.

cdesiniotis marked this pull request as ready for review August 5, 2025 22:33
// they get access to all GPUs.
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), CDIEnabledEnvName, "true")
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), CrioConfigModeEnvName, "config")
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), NvidiaRuntimeSetAsDefaultEnvName, "false")
Contributor

Question: Could there be a race condition where the currently running device plugin still allows pods to be scheduled with the envvar strategy, while the toolkit update has already removed the default nvidia runtime?

Contributor Author

Yes, I think this race condition is possible. When upgrading the gpu-operator, new revisions of the plugin and toolkit daemonsets will get rolled out. As we don't coordinate their rollouts, it is possible for the new toolkit pod to start on a node before the new plugin pod. If a GPU pod gets scheduled onto the node at this particular point in time, the old plugin will set the envvar instead of the CDI annotation / CRI fields, which means the pod will not have any GPUs injected. I am not sure how we can avoid this race condition...

Member

Does that race condition not already exist? It is already possible for users to configure the toolkit as is shown here. How would one ensure ordering / coordination between the components?

Contributor Author

Yes, this race condition does already exist. The only way that I know of to coordinate is to write a custom controller...

Contributor

Yes, users could manually edit these today and run into the same issue. The difference here is that we're toggling the default behavior automatically, which makes it more prone to occur during upgrades. If possible, we should roll out the device-plugin change first in the state machine, followed by the toolkit.

Contributor

This is a bit tricky since toolkit/plugin DaemonSet pod updates can roll out on any node, so the two updates may not land on a given node in a coordinated order. Another option is to continue setting the NVIDIA runtime as the default. In native CDI mode, the toolkit should effectively be a no-op for a couple of releases, giving us a smooth transition period. After that, we can remove the default runtime entirely.

Contributor Author (@cdesiniotis, Sep 12, 2025)

@tariq1890 had an idea that would enforce ordering -- what if we deleted the other operand pods (e.g. device-plugin, gfd, dcgm-exporter) in the toolkit's initContainer? This is similar to what we do in the driver's initContainer in order to ensure that all the other operands (which depend on the driver) start AFTER the driver pod is running and ready.

On each node where the new toolkit is rolled out, we have its initContainer delete the other operand pods, which forces the device-plugin to also upgrade on that node (if it hadn't already). This would address the race condition pointed out in this thread.

@shivamerla thoughts? Is the added complexity of adding this logic in the toolkit's initContainer worth it in your opinion? Also note, this does not prevent all possible race conditions. For example, it is still possible for the new device-plugin to run on a node before the new toolkit -- this could be problematic if a new GPU pod got scheduled and the underlying runtime (containerd) does not have CDI enabled, as the new plugin will not be setting the NVIDIA_VISIBLE_DEVICES envvar anymore.
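
For illustration, a rough sketch of what that initContainer could look like in the toolkit daemonset (the name, image, label selector, and namespace wiring below are placeholders, and it assumes a service account that is allowed to delete pods):

initContainers:
- name: delete-stale-operands        # hypothetical name
  image: bitnami/kubectl             # placeholder image that provides kubectl
  env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  - name: OPERATOR_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  command:
  - sh
  - -c
  - |
    # Delete the device-plugin pod on this node so it is recreated only after the
    # new toolkit pod is running; repeat with the gfd / dcgm-exporter selectors.
    kubectl delete pod \
      --namespace "${OPERATOR_NAMESPACE}" \
      --selector app=nvidia-device-plugin-daemonset \
      --field-selector spec.nodeName="${NODE_NAME}" \
      --ignore-not-found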

Member

What about this as an alternative: Since the device plugin REQUIRES the NVIDIA Container Toolkit to start up and have access to the GPUs, we could check whether it has started with the expected toolkit configuration and if not, then fail? One option would be to add an envvar to the management CDI spec that we could check for, for example.
We do in fact already inject such an envvar which we could consider using: https://github.com/NVIDIA/nvidia-container-toolkit/blob/8739e483f1800964adbb920335db7e7d32a88d02/pkg/nvcdi/driver-nvml.go#L117-L120

Contributor Author (@cdesiniotis, Sep 18, 2025)

Are you suggesting we add a new envvar that represents the current toolkit configuration (maybe a hash of the toml file)? And the device-plugin, on startup, would recompute the hash and compare it? If I am understanding your proposal correctly then that means we would also need to mount the toolkit configuration file into the device-plugin.

Member

Sorry, I mean that we can detect if the device plugin was started with the new toolkit if the device plugin container includes the NVIDIA_CTK_LIBCUDA_DIR environment variable. This will only be possible if the toolkit was installed and is running before the device plugin is started.

If we add an init container to the device plugin that blocks on the presence of this envvar (when CDI is enabled?), would that at least ensure the desired ordering? (I don't know whether this approach is perfect, but it's possible with what's already available.)
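
A minimal sketch of that idea (container name and image are placeholders; it also assumes the init container receives the same management CDI injection as the main device-plugin container, so the toolkit-generated envvar is visible to it):

initContainers:
- name: wait-for-toolkit-cdi         # hypothetical name
  image: busybox                     # placeholder image
  command:
  - sh
  - -c
  - |
    # NVIDIA_CTK_LIBCUDA_DIR is only present once the new toolkit has generated
    # the management CDI spec on this node. If it is missing, fail so that the
    # kubelet retries the init container until the new toolkit has run.
    if [ -z "${NVIDIA_CTK_LIBCUDA_DIR:-}" ]; then
      echo "toolkit CDI spec not updated on this node yet; retrying"
      exit 1
    fi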

@shivamerla (Contributor)

Great to see that we are switching to cdi as default finally!

@elezar (Member) left a comment

Thanks @cdesiniotis

When discussing switching to using native CDI where possible, we also mentioned how to handle containerd 1.6 where CDI is not supported, and we would need to leverage the NVIDIA Container Runtime. That is not called out explicitly here (or in the design doc) and we should remember to flesh out the steps a user would have to take in this case. (or do we just tell these users that they would have to switch to `cdi.enabled = false`?)

// to inject GPUs into workloads. We do not configure 'nvidia' as the default runtime. The
// 'nvidia' runtime will be set as the runtime class for our management containers so that
// they get access to all GPUs.
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), CDIEnabledEnvName, "true")
Contributor

Instead of referring to the container by a certain index directly, we should retrieve the container to modify by looping through the containers and selecting the one that matches the desired name.

Contributor Author

I agree. This PR is already getting quite large, so I might clean this up in a follow-up. I would also need to update TransformToolkit() and TransformDevicePlugin(), which retrieve the container at a particular index as well.

Contributor Author

Updated this to not refer to a certain container by index.

@elezar (Member) commented Aug 19, 2025

@cdesiniotis if I understood our discussions correctly, we're switching to NATIVE CDI here. Should we update the PR title / description and commit messages to indicate this?

cdesiniotis changed the title from "Use CDI by default as the mechanism for injecting GPUs into workload containers" to "Use native CDI by default as the mechanism for injecting GPUs into workload containers" on Aug 19, 2025
cdesiniotis force-pushed the cdi-by-default branch 2 times, most recently from 38fb05d to 460f065 on August 29, 2025 23:26
@xiongzubiao

This is really useful to support third-party OCI runtimes in CDI mode. Will it be included in the next release?

// 'nvidia' runtime will be set as the runtime class for our management containers so that
// they get access to all GPUs.
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), CDIEnabledEnvName, "true")
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), CRIOConfigModeEnvName, "config")
Contributor

What if we changed the default crio config mode from hook to config in v1.18.0 of toolkit?

Contributor Author

Yes, we should do this.

I have updated my PR so that the operator sets config as the crio config mode by default. I mentioned this here: #1578 (comment)

I have also raised NVIDIA/nvidia-container-toolkit#1314 to update the default crio config mode that is hardwired in the toolkit container.

cc @elezar

@cdesiniotis (Contributor Author)

When discussing switching to using native CDI where possible, we also mentioned how to handle containerd 1.6 where CDI is not supported, and we would need to leverage the NVIDIA Container Runtime. That is not called out explicitly here (or in the design doc) and we should remember to flesh out the steps a user would have to take in this case. (or do we just tell these users that they would have to switch to `cdi.enabled = false`?)

Users have two options, both of which I have tested on a system with containerd 1.6 running.

Option 1) set nvidia as the default runtime:

toolkit:
  env:
  - name: "NVIDIA_RUNTIME_SET_AS_DEFAULT"
    value: "true"

Option 2) disable CDI:

cdi:
  enabled: false

We just need to decide which option (or both) to call out in our documentation / release notes. The one benefit of option 1 is that users would still be using CDI (in our runtime as opposed to containerd).

Comment on lines +124 to +129
# The distroless image does not contain the ld.so.cache file. The update-ldcache
# createContainer hook will skip updating the container's ldcache if this file
# does not exist. This can lead to issues running the CUDA vectorAdd sample
# if the NVIDIA libraries are not present in the default dynamic linker search
# path(s).
COPY --from=sample-builder /etc/ld.so.cache /etc/
Member

Does setting the LD_LIBRARY_PATH not make more sense? (although we may not know where this is).

Contributor Author

Are you suggesting we set LD_LIBRARY_PATH to all possible directories where the driver libraries can be found in the container? Is the set of possible directories known up front?

I noticed on my OpenShift node that libraries got installed at /usr/lib64 while on an Ubuntu node driver libraries got installed at /usr/lib/x86_64-linux-gnu.

Member

Yes, that is the tricky part. We have discussed always mounting the libraries to the same folder in any container, but have not done this yet. If importing the ld.so.cache works, I think I'm ok with doing this for the time being.

Contributor Author

Is NVIDIA_CTK_LIBCUDA_DIR the path of libcuda.so's parent directory on the host AND in the container?

I guess we could leverage this. Since the NVIDIA_CTK_LIBCUDA_DIR envvar is not known until runtime, I assume we would have to set LD_LIBRARY_PATH=$NVIDIA_CTK_LIBCUDA_DIR in the entrypoint command of the validator container? Or is there another way to leverage this envvar?

Member

The intent is that NVIDIA_CTK_LIBCUDA_DIR is the path to the directory containing libcuda.so in the container. This is not determined at runtime, but is encoded in the CDI spec. Setting the LD_LIBRARY_PATH=$NVIDIA_CTK_LIBCUDA_DIR:$LD_LIBRARY_PATH as you specify in the entrypoint was what I was thinking.

Like I said ... if adding a dummy ldcache works for now, I'd be happy to look at this in a follow-up.
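
A sketch of what that could look like in the validator entrypoint (the sample binary path is illustrative):

#!/bin/sh
# If the CDI spec injected NVIDIA_CTK_LIBCUDA_DIR, prepend it to the dynamic
# linker search path so libcuda.so resolves even without an ld.so.cache.
if [ -n "${NVIDIA_CTK_LIBCUDA_DIR:-}" ]; then
  export LD_LIBRARY_PATH="${NVIDIA_CTK_LIBCUDA_DIR}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
fi
exec /cuda-samples/vectorAdd "$@"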


if runtime == gpuv1.CRIO.String() {
// We add the nvidia runtime to the cri-o config by default instead of installing the OCI prestart hook
setContainerEnv(mainContainer, CRIOConfigModeEnvName, "config")
Member

Do we now do this twice, or is the mainContainer different from obj.Spec.Template.Spec.Containers[0]?

Contributor Author

I have updated this so that we only set this envvar here.

Member

OK. So this means that we DON'T SUPPORT hook-based injection for cri-o -- unless a user explicitly selects it? (To be clear, given the changes in v1.18.0, I would recommend this.) Should we issue a warning if hook is selected (perhaps in the toolkit-container)?

Contributor Author (@cdesiniotis, Sep 18, 2025)

Correct, hook-based injection will not be used for cri-o unless a user explicitly selects it, by setting CRIO_CONFIG_MODE=hook. I am not sure that this means hook-based injection is no longer "supported" though.
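
For reference, explicitly selecting hook-based injection would follow the same override pattern shown earlier in this thread:

toolkit:
  env:
  - name: "CRIO_CONFIG_MODE"
    value: "hook"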

Should we issue a warning if hook is selected (perhaps in the toolkit-container)?

If we want to start discouraging the use of it (and maybe "deprecate" it?) then I wouldn't be opposed to this.

Member

I would strongly suggest deprecating it in this context. The feature drift between the legacy hook and the new runtime (or CDI) is significant. I would also argue that crio users are more likely to have more restrictive configurations such as SELinux enabled.

This aligns with how we install NVIDIA Container Toolkit when containerd is the runtime.

Signed-off-by: Christopher Desiniotis <[email protected]>
Comment on lines 1236 to 1238
if config.CDI.IsDefault() {
setContainerEnv(container, NvidiaCtrRuntimeModeEnvName, "cdi")
}
Member

Note to self: This will not trigger the automatic generation of CDI specs in the toolkit.

// update env required for CDI support
if config.CDI.IsEnabled() {
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), CDIEnabledEnvName, "true")
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), NvidiaCtrRuntimeCDIPrefixesEnvName, "nvidia.cdi.k8s.io/")
Member

Since we remove this, do we also expect the nvidia runtime to handle annotation-based injections that use cdi.k8s.io/ annotations?

Contributor Author

Hmm I am trying to understand the use case / different possibilities here.

By default, the nvidia runtime will not be invoked for pods that request a GPU -- as nvidia is not configured as the default runtime. The CDI device passed via the CRI field is what will trigger the runtime, like containerd, to inject the requested GPU(s).

If a user sets runtimeClassName=nvidia in their pod spec, then the nvidia runtime would be invoked. By default, the nvidia runtime is configured with annotation-prefixes = ["cdi.k8s.io/"] (right?) so it would detect the GPU specified in the annotation (that our device-plugin set in its AllocateResponse) and inject it using CDI. I guess containerd / cri-o would also inject the GPU(s), but I assume things should just "work" even if the CDI device injection is performed twice.
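
For reference, this second case corresponds to a user pod along these lines (pod name and image tag are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  runtimeClassName: nvidia                          # explicitly opt in to the nvidia runtime
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd # illustrative image tag
    resources:
      limits:
        nvidia.com/gpu: 1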

Both of these cases should work, correct? Am I missing anything?

Member

(I really remember typing a response here, but it seems to have gone missing).

Making the same modifications twice should be fine, but there may be some unexpected side effects, specifically if hooks are run multiple times. The creation of symlinks should be idempotent since we do check whether the symlink that we're creating is already valid. In the case of something like update-ldcache, there may be some side effects like multiple /etc/ld.so.conf.d files with the same contents.

To be clear, this is not a configuration that I can test easily given my available machines, so if you have a test system that I could have access to (or that we can pair on), that may be useful.

// configure runtime
runtime := n.runtime.String()
err = transformForRuntime(obj, config, runtime, "nvidia-container-toolkit-ctr")
err = transformForRuntime(obj, config, runtime, mainContainerName)
Member

Something for a follow-up: Does it make sense to rather have transformForRuntime accept a pointer to a container instead of a name? We then don't need to iterate over the containers there to find the right container. (Alternatively we could consider an implementation that uses generics).

Contributor Author

I agree, I can make the proposed change in a follow-up.

Comment on lines +1423 to +1434
var mainContainer *corev1.Container
mainContainerName := "nvidia-device-plugin"
for i, ctr := range obj.Spec.Template.Spec.Containers {
if ctr.Name == mainContainerName {
mainContainer = &obj.Spec.Template.Spec.Containers[i]
break
}
}
if mainContainer == nil {
return fmt.Errorf("failed to find main container %q", mainContainerName)
}
Member

Something for follow-up: It may clean things up to have a findContainerByName function that returns the container we want.
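
For illustration, one possible shape for such a helper (package name and signature are just a sketch, not the final implementation):

package controllers // illustrative package name

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// findContainerByName returns a pointer into the pod spec's container slice so
// that callers can modify the matched container in place.
func findContainerByName(spec *corev1.PodSpec, name string) (*corev1.Container, error) {
	for i := range spec.Containers {
		if spec.Containers[i].Name == name {
			return &spec.Containers[i], nil
		}
	}
	return nil, fmt.Errorf("failed to find container %q", name)
}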

return err
}
obj.Spec.Template.Spec.Containers[0].Image = image
mainContainer.Image = image
Member

As a note: This is a simple but useful cleanup. The one comment is whether we want to always call this mainContainer or something more specific to make global searches simpler?

Contributor Author

Maybe toolkitMainContainer in the TransformToolkit function and devicePluginMainContainer in the TransformDevicePlugin function?

Member

Yes, that's what I was referring to. Not critical for this change though.

Comment on lines +1471 to +1473
setContainerEnv(mainContainer, GDSEnabledEnvName, "true")
setContainerEnv(mainContainer, MOFEDEnabledEnvName, "true")
Member

Note that we're going to have to do the same for GDR copy (and also merge NVIDIA/k8s-device-plugin#1339).


cdi:
  enabled: false
  default: false
Member

Question: What impact does removing this have on downgrades?

Contributor Author

I don't think this should impact downgrades. Older releases of the operator (where the cdi.default field is not a no-op) handle the case where this field is omitted from the ClusterPolicy spec -- the default value is false in this case: https://github.com/NVIDIA/gpu-operator/blob/v25.3.3/api/nvidia/v1/clusterpolicy_types.go#L2074-L2079


cdi:
  enabled: false
  enabled: true
Member

Is it needed to handle the case where this was NOT set (i.e. null) and a user upgrades?

Contributor Author

No. This is not strictly required in this case, because when the cdi.enabled field is not set we default to true in the code.

I would argue this is needed in the case where cdi.enabled=false in the older operator version. During the upgrade, we want to toggle this to true -- if the user wants to continue to NOT use CDI, they will need to provide an override during the helm upgrade.

Member

OK, so we would need to explicitly document this override? (I am a little rusty on helm upgrade semantics).

Contributor Author

Yes, we can call this out explicitly in our release notes -- we will definitely mention that the default value of this field is updated in the new release, and we can document how to keep the prior default (false) via an override file or the helm command line.
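
For example, a values override along these lines (passed via -f, or the equivalent --set cdi.enabled=false on the helm upgrade command line) would preserve the previous behavior:

cdi:
  enabled: false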

Member

As a nit, I think we could make this change independent of this PR if we wanted to push it through.

cdesiniotis and others added 7 commits September 18, 2025 16:01
…lkit

The runtimeClassName field will always be set now, regardless of whether cri-o or containerd is the container runtime.

Signed-off-by: Christopher Desiniotis <[email protected]>
…nabled=true

This commit updates the default behavior when cdi.enabled=true. We now leverage
native CDI support in containerd / cri-o to inject GPU devices into workload
containers. This means we no longer configure 'nvidia' as the default runtime.
Our management containers will continue to use the 'nvidia' runtime to access
GPUs by explicitly setting runtimeClassName=nvidia in their pod specs.

Signed-off-by: Christopher Desiniotis <[email protected]>
Signed-off-by: Christopher Desiniotis <[email protected]>
Signed-off-by: Christopher Desiniotis <[email protected]>
Containerd 1.7.0 is the first release that supports CDI. CDI devices were exclusively
passed to containerd via annotations.
https://github.com/containerd/containerd/releases/tag/v1.7.0

In containerd 1.7.2 support was added for making use
of the new CRI fields 'Config.CDIDevices' so that CDI devices could be passed through the CRI
instead of annotations.
https://github.com/containerd/containerd/releases/tag/v1.7.2

This commit updates our holodeck configuration to a version of containerd that supports CDI
and the new CRI fields. containerd 1.7.27 is chosen since it is the latest release on the 1.7
branch that is available in the Ubuntu 22.04 repositories.

Signed-off-by: Christopher Desiniotis <[email protected]>
The distroless image does not contain the ld.so.cache file. The update-ldcache
createContainer hook will skip updating the container's ldcache if this file
does not exist. This can lead to issues running the CUDA vectorAdd sample
if the NVIDIA libraries are not present in the default dynamic linker search
path(s).

Signed-off-by: Christopher Desiniotis <[email protected]>