Use native CDI by default as the mechanism for injecting GPUs into workload containers #1578
Conversation
The e2e test failures should be fixed by #1580
controllers/object_controls.go (Outdated)
    // they get access to all GPUs.
    setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), CDIEnabledEnvName, "true")
    setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), CrioConfigModeEnvName, "config")
    setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), NvidiaRuntimeSetAsDefaultEnvName, "false")
Question: Could there be a race condition where the currently running device plugin still allows pods to be scheduled with the envvar strategy, while the toolkit update has already removed the default nvidia runtime?
Yes, I think this race condition is possible. When upgrading the gpu-operator, new revisions of the plugin and toolkit daemonsets will get rolled out. As we don't coordinate their rollouts, it is possible for the new toolkit pod to start on a node before the new plugin pod. If a GPU pod gets scheduled onto the node at this particular point in time, the old plugin will set the envvar instead of the CDI annotation / CRI fields, which means the pod will not have any GPUs injected. I am not sure how we can avoid this race condition...
Does that race condition not already exist? It is already possible for users to configure the toolkit as is shown here. How would one ensure ordering / coordination between the components?
Yes, this race condition does already exist. The only way that I know of to coordinate is to write a custom controller...
Yes, users could manually edit these today and run into the same issue. The difference here is that we're toggling the default behavior automatically, which makes the race more likely to occur during upgrades. If possible, we should roll out the device-plugin change first in the state machine, followed by the toolkit.
This is a bit tricky, since toolkit/plugin DaemonSet pod updates can roll out on nodes in any order, so the two updates may not land on the same node at the same time. Another option is to continue setting the NVIDIA runtime as the default. In native CDI mode, the toolkit should effectively be a no-op for a couple of releases, giving us a smooth transition period. After that, we can remove the default runtime entirely.
@tariq1890 had an idea that would enforce ordering -- what if we deleted the other operand pods (e.g. device-plugin, gfd, dcgm-exporter) in the toolkit's initContainer? This is similar to what we do in the driver's initContainer to ensure that all the other operands (which depend on the driver) start AFTER the driver pod is running and ready.
On each node where the new toolkit is rolled out, its initContainer would delete the other operand pods, which forces the device-plugin to also upgrade on that node (if it hadn't already). This would address the race condition pointed out in this thread.
@shivamerla thoughts? Is the added complexity of adding this logic in the toolkit's initContainer worth it in your opinion? Also note, this does not prevent all possible race conditions. For example, it is still possible for the new device-plugin to run on a node before the new toolkit -- this could be problematic if a new GPU pod got scheduled and the underlying runtime (containerd) does not have CDI enabled, as the new plugin will not be setting the NVIDIA_VISIBLE_DEVICES envvar anymore.
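To make the idea above concrete, here is a rough, hypothetical sketch (not the operator's actual implementation) of what the toolkit's initContainer could do, expressed with client-go. The namespace, the label selector identifying the operand pods, and the NODE_NAME environment variable (assumed to come from the downward API) are all illustrative assumptions:

    package main

    import (
        "context"
        "os"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    // deleteOperandPodsOnNode deletes the other operand pods scheduled on this
    // node so that they are recreated (and therefore upgraded) after the new
    // toolkit is in place on the node.
    func deleteOperandPodsOnNode(ctx context.Context, client kubernetes.Interface, namespace, nodeName string) error {
        // Assumed label values; the real operand labels may differ.
        selector := "app in (nvidia-device-plugin-daemonset, gpu-feature-discovery, nvidia-dcgm-exporter)"
        pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
            LabelSelector: selector,
            FieldSelector: "spec.nodeName=" + nodeName,
        })
        if err != nil {
            return err
        }
        for _, pod := range pods.Items {
            if err := client.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
                return err
            }
        }
        return nil
    }

    func main() {
        cfg, err := rest.InClusterConfig()
        if err != nil {
            os.Exit(1)
        }
        client := kubernetes.NewForConfigOrDie(cfg)
        if err := deleteOperandPodsOnNode(context.Background(), client, "gpu-operator", os.Getenv("NODE_NAME")); err != nil {
            os.Exit(1)
        }
    }

Note that this would also require the toolkit's service account to have list/delete permissions on pods, which adds to the complexity discussed above.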
What about this as an alternative: since the device plugin REQUIRES the NVIDIA Container Toolkit in order to start up and have access to the GPUs, we could check whether it has started with the expected toolkit configuration and, if not, fail. One option would be to add an envvar to the management CDI spec that we could check for.
We do in fact already inject such an envvar which we could consider using: https://github.com/NVIDIA/nvidia-container-toolkit/blob/8739e483f1800964adbb920335db7e7d32a88d02/pkg/nvcdi/driver-nvml.go#L117-L120
Are you suggesting we add a new envvar that represents the current toolkit configuration (maybe a hash of the toml file), and that the device-plugin would recompute the hash on startup and compare it? If I am understanding your proposal correctly, that means we would also need to mount the toolkit configuration file into the device-plugin.
Sorry, I meant that we can detect whether the device plugin was started with the new toolkit by checking whether the device plugin container includes the NVIDIA_CTK_LIBCUDA_DIR environment variable. This will only be possible if the toolkit was installed and running before the device plugin started.
If we add an init container to the device plugin that blocks on the presence of this envvar (when CDI is enabled?), would that at least ensure the desired ordering? (I don't know whether this approach is perfect, but it is possible with what's already available.)
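A minimal sketch of that gating init container, assuming the operator appends it to the device-plugin DaemonSet and assuming the management CDI injection (and therefore the NVIDIA_CTK_LIBCUDA_DIR variable) applies to init containers as well. The container name, image, and shell check are hypothetical:

    import corev1 "k8s.io/api/core/v1"

    // addToolkitGateInitContainer appends a hypothetical init container that fails
    // (and is retried by the kubelet) until the CDI-injected NVIDIA_CTK_LIBCUDA_DIR
    // variable is present, i.e. until the new toolkit has configured the node.
    func addToolkitGateInitContainer(spec *corev1.PodSpec, image string) {
        gate := corev1.Container{
            Name:  "toolkit-ready-check", // assumed name
            Image: image,                 // assumed: any image that ships a shell
            Command: []string{
                "sh", "-c",
                `if [ -z "${NVIDIA_CTK_LIBCUDA_DIR}" ]; then echo "new toolkit not detected"; exit 1; fi`,
            },
        }
        spec.InitContainers = append(spec.InitContainers, gate)
    }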
Great to see that we are switching to …
elezar left a comment
Thanks @cdesiniotis
When discussing the switch to native CDI where possible, we also mentioned how to handle containerd 1.6, where CDI is not supported and we would need to leverage the NVIDIA Container Runtime. That is not called out explicitly here (or in the design doc), and we should remember to flesh out the steps a user would have to take in this case. (Or do we just tell these users that they would have to switch to cdi.enabled = false?)
controllers/object_controls.go (Outdated)
    // to inject GPUs into workloads. We do not configure 'nvidia' as the default runtime. The
    // 'nvidia' runtime will be set as the runtime class for our management containers so that
    // they get access to all GPUs.
    setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), CDIEnabledEnvName, "true")
Instead of referring to the container by a fixed index, we should retrieve the desired container by looping through the containers and selecting the one that matches the desired name.
I agree. This PR is already getting quite large, so I might clean this up in a follow-up. I would also need to update TransformToolkit() and TransformDevicePlugin() to retrieve the container at a particular index as well.
Updated this to not refer to a certain container by index.
@cdesiniotis if I understood our discussions correctly, we're switching to NATIVE CDI here. Should we update the PR title / description and commit messages to indicate this?
This is really useful to support third-party OCI runtimes in CDI mode. Will it be included in the next release?
controllers/object_controls.go (Outdated)
    // 'nvidia' runtime will be set as the runtime class for our management containers so that
    // they get access to all GPUs.
    setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), CDIEnabledEnvName, "true")
    setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), CRIOConfigModeEnvName, "config")
What if we changed the default crio config mode from hook to config in v1.18.0 of toolkit?
Yes, we should do this.
I have updated my PR so that the operator sets config as the crio config mode by default. I mentioned this here: #1578 (comment)
I have also raised NVIDIA/nvidia-container-toolkit#1314 to update the default crio config mode that is hardwired in the toolkit container.
cc @elezar
Users have two options, both of which I have tested on a system running containerd 1.6: Option 1) set …; Option 2) disable CDI. We just need to decide which option (or both) to call out in our documentation / release notes. The one benefit of option 1 is that users would still be using CDI (in our runtime, as opposed to containerd).
    # The distroless image does not contain the ld.so.cache file. The update-ldcache
    # createContainer hook will skip updating the container's ldcache if this file
    # does not exist. This can lead to issues running the CUDA vectorAdd sample
    # if the NVIDIA libraries are not present in the default dynamic linker search
    # path(s).
    COPY --from=sample-builder /etc/ld.so.cache /etc/
Does setting the LD_LIBRARY_PATH not make more sense? (although we may not know where this is).
Are you suggesting we set LD_LIBRARY_PATH to all possible directories where the driver libraries can be found in the container? Is the set of possible directories known up front?
I noticed that on my OpenShift node the driver libraries were installed at /usr/lib64, while on an Ubuntu node they were installed at /usr/lib/x86_64-linux-gnu.
Yes, that is the tricky part. We have discussed always mounting the libraries to the same folder in every container, but have not done this yet. If importing the ld.so.cache works, I think I'm OK with doing this for the time being.
Could we use NVIDIA_CTK_LIBCUDA_DIR to set LD_LIBRARY_PATH for the validator? See https://github.com/NVIDIA/nvidia-container-toolkit/blob/8739e483f1800964adbb920335db7e7d32a88d02/pkg/nvcdi/driver-nvml.go#L117-L120
Is NVIDIA_CTK_LIBCUDA_DIR the path of libcuda.so's parent directory on the host AND in the container?
I guess we could leverage this. Since the NVIDIA_CTK_LIBCUDA_DIR envvar is not known until runtime, I assume we would have to set LD_LIBRARY_PATH=$NVIDIA_CTK_LIBCUDA_DIR in the entrypoint command of the validator container? Or is there another way to leverage this envvar?
The intent is that NVIDIA_CTK_LIBCUDA_DIR is the path to the directory containing libcuda.so in the container. This is not determined at runtime, but is encoded in the CDI spec. Setting LD_LIBRARY_PATH=$NVIDIA_CTK_LIBCUDA_DIR:$LD_LIBRARY_PATH in the entrypoint, as you suggest, is what I was thinking.
Like I said ... if adding a dummy ldcache works for now, I'd be happy to look at this in a follow-up.
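For reference, here is a sketch of the entrypoint idea as a small Go wrapper that prepends NVIDIA_CTK_LIBCUDA_DIR to LD_LIBRARY_PATH and then re-execs the sample binary, so that the dynamic linker sees the updated search path. The binary path is an assumption, and this is not the approach taken in this PR (which copies ld.so.cache into the image instead):

    package main

    import (
        "os"
        "syscall"
    )

    func main() {
        // Prepend the CDI-provided libcuda directory (if any) to LD_LIBRARY_PATH.
        if dir := os.Getenv("NVIDIA_CTK_LIBCUDA_DIR"); dir != "" {
            ldPath := dir
            if existing := os.Getenv("LD_LIBRARY_PATH"); existing != "" {
                ldPath = dir + ":" + existing
            }
            os.Setenv("LD_LIBRARY_PATH", ldPath)
        }
        // Re-exec the sample so the dynamic linker picks up the new search path;
        // setting the variable in the current process alone would be too late.
        // "/usr/bin/vectorAdd" is an assumed location for illustration only.
        if err := syscall.Exec("/usr/bin/vectorAdd", []string{"vectorAdd"}, os.Environ()); err != nil {
            os.Exit(1)
        }
    }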
    if runtime == gpuv1.CRIO.String() {
        // We add the nvidia runtime to the cri-o config by default instead of installing the OCI prestart hook
        setContainerEnv(mainContainer, CRIOConfigModeEnvName, "config")
Do we now do this twice, or is the mainContainer different from obj.Spec.Template.Spec.Containers[0]?
I have updated this so that we only set this envvar here.
OK. So this means that we DON'T SUPPORT hook-based injection for cri-o unless a user explicitly selects it? (To be clear, given the changes in v1.18.0, I would recommend this.) Should we issue a warning if hook is selected (perhaps in the toolkit-container)?
Correct, hook-based injection will not be used for cri-o unless a user explicitly selects it, by setting CRIO_CONFIG_MODE=hook. I am not sure that this means hook-based injection is no longer "supported" though.
Should we issue a warning if hook is selected (perhaps in the toolkit-container)?
If we want to start discouraging its use (and maybe "deprecate" it?), then I wouldn't be opposed to this.
I would strongly suggest deprecating it in this context. The feature drift between the legacy hook and the new runtime (or CDI) is significant. I would also argue that crio users are more likely to have more restrictive configurations such as SELinux enabled.
This aligns with how we install NVIDIA Container Toolkit when containerd is the runtime. Signed-off-by: Christopher Desiniotis <[email protected]>
controllers/object_controls.go (Outdated)
    if config.CDI.IsDefault() {
        setContainerEnv(container, NvidiaCtrRuntimeModeEnvName, "cdi")
    }
Note to self: This will not trigger the automatic generation of CDI specs in the toolkit.
    // update env required for CDI support
    if config.CDI.IsEnabled() {
        setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), CDIEnabledEnvName, "true")
        setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), NvidiaCtrRuntimeCDIPrefixesEnvName, "nvidia.cdi.k8s.io/")
Since we remove this, do we also expect the nvidia runtime to handle annotation-based injections that use cdi.k8s.io/ annotations?
Hmm I am trying to understand the use case / different possibilities here.
By default, the nvidia runtime will not be invoked for pods that request a GPU -- as nvidia is not configured as the default runtime. The CDI device passed via the CRI field is what will trigger the runtime, like containerd, to inject the requested GPU(s).
If a user sets runtimeClassName=nvidia in their pod spec, then the nvidia runtime would be invoked. By default, the nvidia runtime is configured with annotation-prefixes = ["cdi.k8s.io/"] (right?), so it would detect the GPU specified in the annotation (which our device-plugin set in its AllocateResponse) and inject it using CDI. I guess containerd / cri-o would also inject the GPU(s), but I assume things should just "work" even if the CDI device injection is performed twice.
Both of these cases should work, correct? Am I missing anything?
(I really remember typing a response here, but it seems to have gone missing).
Making the same modifications twice should be fine, but there may be some unexpected side effects, specifically if hooks are run multiple times. The creation of symlinks should be idempotent, since we do check whether the symlink that we're creating is already valid. In the case of something like update-ldcache, there may be side effects such as multiple /etc/ld.so.conf.d files with the same contents.
To be clear, this is not a configuration that I can test easily given my available machines, so if you have a test system that I could have access to (or that we can pair on), that may be useful.
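As a concrete picture of the second case discussed above, here is a minimal sketch of a pod that explicitly sets runtimeClassName=nvidia and requests a GPU, so both the nvidia runtime and containerd / cri-o would perform the CDI injection. The image name and pod/container names are assumptions:

    import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // buildTestPod returns a pod spec for the double-injection scenario described above.
    func buildTestPod() corev1.Pod {
        runtimeClass := "nvidia"
        return corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: "cuda-vectoradd"},
            Spec: corev1.PodSpec{
                RuntimeClassName: &runtimeClass,
                Containers: []corev1.Container{{
                    Name:  "cuda-vectoradd",
                    Image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd", // assumed image
                    Resources: corev1.ResourceRequirements{
                        Limits: corev1.ResourceList{
                            "nvidia.com/gpu": resource.MustParse("1"),
                        },
                    },
                }},
            },
        }
    }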
    // configure runtime
    runtime := n.runtime.String()
    err = transformForRuntime(obj, config, runtime, "nvidia-container-toolkit-ctr")
    err = transformForRuntime(obj, config, runtime, mainContainerName)
Something for a follow-up: Does it make sense to rather have transformForRuntime accept a pointer to a container instead of a name? We then don't need to iterate over the containers there to find the right container. (Alternatively we could consider an implementation that uses generics).
I agree, I can make the proposed change in a follow-up.
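A sketch of what that proposed signature change could look like, reusing the names from this PR; the surrounding types (appsv1.DaemonSet, gpuv1.ClusterPolicySpec) and the trimmed-down body are assumptions about the existing code:

    // transformForRuntime, sketched with a container pointer instead of a name,
    // so callers resolve the container once and no second lookup is needed here.
    func transformForRuntime(obj *appsv1.DaemonSet, config *gpuv1.ClusterPolicySpec, runtime string, container *corev1.Container) error {
        if runtime == gpuv1.CRIO.String() {
            // Add the nvidia runtime to the cri-o config by default instead of
            // installing the OCI prestart hook.
            setContainerEnv(container, CRIOConfigModeEnvName, "config")
        }
        return nil
    }

    // Call site, with mainContainer already resolved by name:
    //   err = transformForRuntime(obj, config, runtime, mainContainer)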
    var mainContainer *corev1.Container
    mainContainerName := "nvidia-device-plugin"
    for i, ctr := range obj.Spec.Template.Spec.Containers {
        if ctr.Name == mainContainerName {
            mainContainer = &obj.Spec.Template.Spec.Containers[i]
            break
        }
    }
    if mainContainer == nil {
        return fmt.Errorf("failed to find main container %q", mainContainerName)
    }
Something for follow-up: It may clean things up to have a findContainerByName function that returns the container we want.
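A minimal sketch of such a helper, assuming it would live alongside the other transform helpers in controllers/object_controls.go:

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
    )

    // findContainerByName returns a pointer to the container with the given name
    // in the supplied pod spec, or an error if no such container exists.
    func findContainerByName(spec *corev1.PodSpec, name string) (*corev1.Container, error) {
        for i := range spec.Containers {
            if spec.Containers[i].Name == name {
                return &spec.Containers[i], nil
            }
        }
        return nil, fmt.Errorf("failed to find container %q", name)
    }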
        return err
    }
    obj.Spec.Template.Spec.Containers[0].Image = image
    mainContainer.Image = image
As a note: This is a simple but useful cleanup. The one comment is whether we want to always call this mainContainer or something more specific to make global searches simpler?
Maybe toolkitMainContainer in the TransformToolkit function and devicePluginMainContainer in the TransformDevicePlugin function?
Yes, that's what I was referring to. Not critical for this change though.
    setContainerEnv(mainContainer, GDSEnabledEnvName, "true")
    setContainerEnv(mainContainer, MOFEDEnabledEnvName, "true")
Note that we're going to have to do the same for GDR copy (and also merge NVIDIA/k8s-device-plugin#1339).
    cdi:
      enabled: false
      default: false
Question: What impact does removing this have on downgrades?
I don't think this should impact downgrades. Older releases of the operator (where the cdi.default field is not a no-op) handle the case where this field is omitted from the ClusterPolicy spec -- the default value is false in this case: https://github.com/NVIDIA/gpu-operator/blob/v25.3.3/api/nvidia/v1/clusterpolicy_types.go#L2074-L2079
    cdi:
      enabled: false
      enabled: true
Is it needed to handle the case where this was NOT set (i.e. null) and a user upgrades?
No. This is not strictly required in this case, because when the cdi.enabled field is not set we default to true in the code.
I would argue this is needed in the case where cdi.enabled=false in the older operator version. During the upgrade, we want to toggle this to true -- if the user wants to continue to NOT use CDI, they will need to provide an override during the helm upgrade.
OK, so we would need to explicitly document this override? (I am a little rusty on helm upgrade semantics).
Yes, we can call this out explicitly in our release notes -- we will definitely mention that the default value of this field is updated in the new release, and we can document how to keep the prior default (false) via an override file or the helm command line.
As a nit, I think we could make this change independent of this PR if we wanted to push it through.
…lkit The runtimeClassName field will always be set now, regardless of whether cri-o or containerd is the container runtime. Signed-off-by: Christopher Desiniotis <[email protected]>
…nabled=true This commit updates the default behavior when cdi.enabled=true. We now leverage native CDI support in containerd / cri-o to inject GPU devices into workload containers. This means we no longer configure 'nvidia' as the default runtime. Our management containers will continue to use the 'nvidia' runtime to access GPUs by explicitly setting runtimeClassName=nvidia in their pod specs. Signed-off-by: Christopher Desiniotis <[email protected]>
Signed-off-by: Christopher Desiniotis <[email protected]>
Signed-off-by: Christopher Desiniotis <[email protected]>
Containerd 1.7.0 is the first release that supports CDI; at that point, CDI devices could only be passed to containerd via annotations. https://github.com/containerd/containerd/releases/tag/v1.7.0 In containerd 1.7.2, support was added for the new CRI field 'Config.CDIDevices' so that CDI devices could be passed through the CRI instead of via annotations. https://github.com/containerd/containerd/releases/tag/v1.7.2 This commit updates our holodeck configuration to a version of containerd that supports CDI and the new CRI fields. containerd 1.7.27 is chosen since it is the latest release on the 1.7 branch that is available in the Ubuntu 22.04 repositories. Signed-off-by: Christopher Desiniotis <[email protected]>
The distroless image does not contain the ld.so.cache file. The update-ldcache createContainer hook will skip updating the container's ldcache if this file does not exist. This can lead to issues running the CUDA vectorAdd sample if the NVIDIA libraries are not present in the default dynamic linker search path(s). Signed-off-by: Christopher Desiniotis <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
Review commit-by-commit.