-
Notifications
You must be signed in to change notification settings - Fork 435
Refactor extracting requested devices from the container image #1110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Unify device selection logic by centralizing it in a new CUDA.GetDevices method and updating all callers to use this unified API.
- Introduce
CUDA.GetDevices,WithSpec, andWithSwarmResourceto the CUDA image builder - Refactor internal modifiers and the runtime hook to call
GetDevicesand updated privileged checks - Add method-based
IsPrivilegedand tests for it; add swarm-resource support
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| internal/modifier/gated.go | Switched to GetDevices and changed log level to Debugf for no-ops |
| internal/modifier/cdi.go | Updated privileged check to use new method-based API |
| internal/config/image/cuda_image.go | Added GetDevices, IsSwarmResource, and refactored CUDA struct |
| internal/config/image/privileged.go | Converted IsPrivileged from free function to CUDA method |
| internal/config/image/builder.go | Added WithSpec and WithSwarmResource builder options |
| internal/config/image/privileged_test.go | New unit tests for the IsPrivileged method |
| cmd/.../container_config.go | Migrated hook code to call CUDA.GetDevices and IsPrivileged |
| cmd/.../container_config_test.go | Updated hook tests to use WithPrivileged and WithSwarmResource |
Comments suppressed due to low confidence (3)
internal/config/image/cuda_image.go:234
- New swarm-resource logic in
VisibleDevicesFromEnvVarisn’t covered by existing tests. Consider adding unit tests forIsSwarmResource,WithSwarmResource, andVisibleDevicesFromEnvVarwhen swarm resources are present.
if i.IsSwarmResource() {
internal/modifier/cdi.go:110
- The variable
containeris undefined in this scope. It looks like you intended to call the newimage.IsPrivileged()method or pass the correct object. Please replacecontainer.IsPrivileged()with the appropriate call.
if cfg.AcceptEnvvarUnprivileged || container.IsPrivileged() {
internal/config/image/cuda_image.go:90
- The comment refers to
IsSwarmbut the method is namedIsSwarmResource. Update the comment to match the method name and clarify its purpose.
// IsSwarm returns whether the image is a Docker Swarm image.
Pull Request Test Coverage Report for Build 15464918603Details
💛 - Coveralls |
| legacyImage := image.IsLegacy() | ||
|
|
||
| devices := hookConfig.getDevices(image, privileged) | ||
| devices := image.GetDevices(hookConfig.AcceptDeviceListAsVolumeMounts, hookConfig.AcceptEnvvarUnprivileged) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about also passing these two options to the image constructor. So that we only call: image.VisibleDevices here instead of having to pass through the config.
I already had a rough branch from some time ago which I have pushed here: https://github.com/elezar/nvidia-container-toolkit/tree/respect-volume-mounts-gated
internal/config/image/builder.go
Outdated
| mounts []specs.Mount | ||
| disableRequire bool | ||
|
|
||
| swarmResourceEnvvars []string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be swarmResourceEnvvars, but visibleDeviceEnvvars.
internal/config/image/builder.go
Outdated
| ) | ||
|
|
||
| type builder struct { | ||
| spec *specs.Spec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that we should include a reference to the spec here -- or if we do we should remove the env and mounts below since they contain the same information -- or if they don't it is not clear which should take precedence.
internal/config/image/builder.go
Outdated
| // WithPrivileged sets the privileged option for the CUDA image. | ||
| // This is to allow testing the privileged mode of the container. | ||
| // DO NOT USE THIS IN PRODUCTION CODE. FOR TESTING PURPOSES ONLY. | ||
| func WithPrivileged(privileged bool) Option { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this not for "Production"? Even in production we need to determine whether a container is privileged.
internal/config/image/cuda_image.go
Outdated
| return devices | ||
| } | ||
|
|
||
| func (i CUDA) GetDevices(acceptDeviceListAsVolumeMounts, acceptEnvvarUnprivileged bool) []string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func (i CUDA) GetDevices(acceptDeviceListAsVolumeMounts, acceptEnvvarUnprivileged bool) []string { | |
| func (i CUDA) VisibleDevices() []string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly out of scope for this change, but the same logic needs to be implemented for IMEX channels and CDI devices.
internal/config/image/cuda_image.go
Outdated
| func (i CUDA) GetDevices(acceptDeviceListAsVolumeMounts, acceptEnvvarUnprivileged bool) []string { | ||
| // If enabled, try and get the device list from volume mounts first | ||
| if acceptDeviceListAsVolumeMounts { | ||
| devices := i.VisibleDevicesFromMounts() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to make VisibleDevicesFromMounts private.
internal/config/image/cuda_image.go
Outdated
| if i.IsSwarmResource() { | ||
| return i.DevicesFromEnvvars(i.swarmResourceEnvvars...).List() | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current behaviour in container_config is sublty different. We return the devices for the first ENVVAR that is set in the list that is set. I don't think that's what DevicesFromEnvvars does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you comment on this? Looking at the latest changes this was not addressed?
internal/config/image/cuda_image.go
Outdated
| } | ||
|
|
||
| // Fallback to reading from the environment variable if privileges are correct | ||
| devices := i.VisibleDevicesFromEnvVar() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May want to make this private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As called out, we can do this in a follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR unifies the logic for retrieving visible devices by consolidating the device detection functionality into the internal/config/image package and updating related APIs. Key changes include:
- Replacing legacy device visibility functions with a unified VisibleDevices method.
- Refactoring test and builder options to support the updated device logic.
- Updating log levels and type conversions to align with the new implementation.
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| internal/modifier/gated.go | Updated device lookup and log level change for device modification. |
| internal/modifier/cdi.go | Adjusted type casting for the IsPrivileged call. |
| internal/config/image/privileged_test.go | Added tests for the new privileged logic. |
| internal/config/image/privileged.go | Redesigned IsPrivileged using a CapabilitiesGetter interface. |
| internal/config/image/cuda_image_test.go | Updated test calls to reflect the new function naming conventions. |
| internal/config/image/cuda_image.go | Refactored CUDA image creation and unified visible devices extraction. |
| internal/config/image/builder.go | Refactored builder to embed CUDA and updated its option methods. |
| cmd/nvidia-container-runtime-hook/* | Adjusted option usage and device lookup calls to integrate changes. |
Comments suppressed due to low confidence (2)
internal/modifier/gated.go:40
- Ensure that the renaming from VisibleDevicesFromEnvVar to VisibleDevices aligns with the updated implementation and that all calling contexts are updated accordingly.
devices := image.VisibleDevices()
internal/config/image/builder.go:56
- [nitpick] Verify that embedding the CUDA struct in the builder and returning it directly does not bypass any necessary initialization previously performed explicitly.
return b.CUDA, nil
| for _, tc := range tests { | ||
| t.Run(tc.description, func(t *testing.T) { | ||
| image, _ := image.New( | ||
| opts := []image.Option{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't the results of this test ONLY depend on the CUDA image implementation? If this is the case, they should be moved to the config/image package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. This specific one does depend on other code, but the one below could be moved.
| image.WithAcceptEnvvarUnprivileged(tc.acceptUnprivileged), | ||
| ) | ||
| defaultConfig, _ := config.GetDefault() | ||
| cfg := &hookConfig{defaultConfig} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cfg is unused.
| "github.com/stretchr/testify/require" | ||
| ) | ||
|
|
||
| func TestIsPrivileged(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think these tests are covered in the new file privileged_test.go. We should readd them.
internal/modifier/gated.go
Outdated
| devices := image.VisibleDevices() | ||
| if len(devices) == 0 { | ||
| logger.Debugf("No modification required; no devices requested") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a functional change and should at the very least be a separate commit from the refactoring which should only include non-functional changes.
internal/config/image/cuda_image.go
Outdated
| visibleDevicesEnvVars []string | ||
| annotationPrefixes []string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a nit: Let's move this towards the end of the struct so that we logically separate the input (env, annotations, mounts) and settings that control behaviour.
internal/config/image/cuda_image.go
Outdated
| legacyCudaVersion := i.env[EnvVarCudaVersion] | ||
| cudaRequire := i.env[EnvVarNvidiaRequireCuda] | ||
| return len(legacyCudaVersion) > 0 && len(cudaRequire) == 0 | ||
| cudaVersion := i.Getenv(EnvVarCudaVersion) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the reason for the switch to Getenv? I understand that we may be using things inconsistently here, but let's rather address that as a follow-up to keep the refactoring diff smaller?
internal/config/image/cuda_image.go
Outdated
| // IsSwarm returns whether the image is a Docker Swarm image. | ||
| func (i CUDA) IsSwarmResource() bool { | ||
| return len(i.visibleDevicesEnvVars) > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not swarm specific. I don't think this function is useful.
internal/config/image/cuda_image.go
Outdated
| disable := i.Getenv(EnvVarNvidiaDisableRequire) | ||
| if disable != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert.
| } | ||
|
|
||
| // DevicesFromEnvvars returns the devices requested by the image through environment variables | ||
| func (i CUDA) DevicesFromEnvvars(envVars ...string) VisibleDevices { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the changes to this function? If we're going to clean it up, let's make sure that this is a separate commit or even PR.
internal/config/image/cuda_image.go
Outdated
| if i.IsSwarmResource() { | ||
| return i.DevicesFromEnvvars(i.visibleDevicesEnvVars...).List() | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering how this is implemented in container_config, I believe this should be:
for _, envvar := range i.visibleDevicesEnvVars {
if _, ok := i.env[envvar]; ok {
return i.DevicesFromEnvvars(envvar).List()
}
}
(I think there was a bug in the original implementation, but I will have to confirm).
internal/config/image/cuda_image.go
Outdated
| type CUDA struct { | ||
| env map[string]string | ||
| visibleDevicesEnvVars []string | ||
| annotationPrefixes []string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not used, so remove this from the change set.
| annotationPrefixes []string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed both annotationPrefixes and annotations from this PR
internal/config/image/cuda_image.go
Outdated
| // the NVIDIA_VISIBLE_DEVICES environment variable or any variables specified | ||
| // in visibleDevicesEnvVars. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still not accurate.
internal/config/image/cuda_image.go
Outdated
| return devices | ||
| } | ||
|
|
||
| func (i CUDA) VisibleDevices() []string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a docstring with the intended behaviour.
internal/config/image/cuda_image.go
Outdated
| if i.isPrivileged || i.acceptEnvvarUnprivileged { | ||
| return devices | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The one thing that we're missing from the initial implementation is that we're not logging the fact that we're ignoring the devices requested as envvars.
Let's add a logger to the CUDA image as well and add the logging here:
i.logger.Warningf("Ignoring devices specified in NVIDIA_VISIBLE_DEVICES in unprivileged container")
d280029 to
95b458f
Compare
c563b87 to
924913f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors how CUDA image device requests are initialized and unified across internal packages, centralizing logic in internal/config/image.
- Introduce functional options (e.g., WithLogger) to
NewCUDAImageFromSpecandNewbuilder. - Refactor
IsPrivilegedto use aCapabilitiesGetterinterface and adapt call sites. - Extend and unify test coverage for visible device selection from env vars and mounts.
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/runtime/runtime_factory.go | Pass WithLogger(logger) into NewCUDAImageFromSpec |
| internal/modifier/cdi.go | Use options variant of NewCUDAImageFromSpec and updated cast |
| internal/config/image/privileged_test.go | Add table-driven tests for IsPrivileged |
| internal/config/image/privileged.go | Refactor IsPrivileged to use CapabilitiesGetter |
| internal/config/image/cuda_image_test.go | Rename constructor calls and add extensive VisibleDevices tests |
| internal/config/image/cuda_image.go | Change signature of NewCUDAImageFromSpec, add options & logic |
| internal/config/image/builder.go | Introduce Option funcs for builder and embed CUDA fields |
| cmd/nvidia-container-runtime-hook/hook_config.go | Add nil-check in getSwarmResourceEnvvars |
| cmd/nvidia-container-runtime-hook/container_config_test.go | Update test to pass new builder options |
| cmd/nvidia-container-runtime-hook/container_config.go | Refactor hook to use new image API and remove duplicated logic |
Comments suppressed due to low confidence (2)
internal/config/image/cuda_image.go:293
- [nitpick] Consider creating an issue or tracking ticket for this TODO or implementing the merge to avoid stale TODOs.
// TODO: This should be merged with getDevicesFromMounts used in the NVIDIA Container Runtime
cmd/nvidia-container-runtime-hook/container_config_test.go:479
- The error return from
image.Newis ignored; consider usingrequire.NoError(t, err)to catch instantiation failures in tests.
image, _ := image.New(
internal/config/image/builder.go
Outdated
| // WithPreferredVisibleDevicesEnvVars sets the visible devices environment variables. | ||
| // For each envvar passed: | ||
| // - Split on commas | ||
| // - Trim spaces and ignore empty strings | ||
| // - Concatenate all of this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is not accurate.
internal/config/image/cuda_image.go
Outdated
|
|
||
| env map[string]string | ||
| mounts []specs.Mount | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| specOpts := []Option{ | ||
| WithEnv(env), | ||
| WithMounts(spec.Mounts), | ||
| ) | ||
| WithPrivileged(IsPrivileged((*OCISpec)(spec))), | ||
| } | ||
|
|
||
| return New(append(opts, specOpts...)...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing about this behaviour is that it ASSUMES that the env and mounts from the spec are preferred. It would be good to add a test for this -- although that's not a blocker for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it does, but then the return appends the opts, so if the caller of NewCUDAImageFromSpec set's the WithPreferred... that will be present during the construction of the image
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm speaking specifically of env and mounts, not preferredVisibleDevicesEnvVars. The point of a test is to ensure that LATER modifications don't change behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: Add unit test for NewCUDAImageFromSpec
| } | ||
|
|
||
| // We log a warning if we are ignoring the environment variable requests. | ||
| i.logger.Warningf("Ignoring devices specified in NVIDIA_VISIBLE_DEVICES in unprivileged container") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically this could be checking other envvars. I'm ok to leave this as is though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, because if the previous check is not fulfilled, we will land here, and if this is a SWARM use case for example we would also be ignoring preferredVisibleDeviceEnvVars
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something like
if len(i.preferredVisibleDeviceEnvVars) > 0 {
i.logger.Warningf("Ignoring devices specified in %v in unprivileged container", i.preferredVisibleDeviceEnvVars)
} else {
i.logger.Warningf("Ignoring devices specified in NVIDIA_VISIBLE_DEVICES in unprivileged container")
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does not match the logic that we use to determine which envvars to use.
We need:
var ignoredEnvVars []string
for _, envVar := range i.preferredVisibleDeviceEnvVars {
if i.HasEnvvar(envVar) {
ignoredEnvVars = append(ignoredEnvVars, envVar)
}
}
if len(ignoredEnvVars) == 0 {
ignoredEnvVars = append(ignoredEnvVars, EnvVarNvidiaVisibleDevices)
}
// We log a warning if we are ignoring the environment variable requests.
i.logger.Warningf("Ignoring devices requested by environment variable(s) in unprivileged container: %v", strings.Join(ignoredEnvVars, ", "))There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(strictly speaking we should factor this into a function and use that from VisibleDevicesFromEnvVar, but I don't want to waste cycles on this). Let's do this as a follow-up. Please create an issue to track.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll create an issue once we merge the PR, so I can point to places in the code in the issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about
// visibleEnvVars returns the environment variables that are used to determine device visibility.
// It returns the preferred environment variables that are set, or NVIDIA_VISIBLE_DEVICES if none are set.
func (i CUDA) visibleEnvVars() []string {
var envVars []string
for _, envVar := range i.preferredVisibleDeviceEnvVars {
if i.HasEnvvar(envVar) {
envVars = append(envVars, envVar)
}
}
if len(envVars) == 0 {
envVars = append(envVars, EnvVarNvidiaVisibleDevices)
}
return envVars
}(I have tested it and it works for both the logging use case and the VisibleDevicesFromEnvVar use case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: add a patch with #1110 (comment)
internal/config/image/builder.go
Outdated
| } | ||
| } | ||
|
|
||
| // WithPreferredVisibleDevicesEnvVars sets the visible devices environment variables. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // WithPreferredVisibleDevicesEnvVars sets the visible devices environment variables. | |
| // WithPreferredVisibleDevicesEnvVars sets the environment variables that should take precendence over the default NVIDIA_VISIBLE_DEVICES. | |
| // If any of these are present in the image, they are used to determine which devices were requested. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
8fcd35a to
fff5f82
Compare
This change consolidates the logic for determining requested devices from the container image. The logic for this has been integrated into the image.CUDA type so that multiple implementations are not required. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]> Co-authored-by: Evan Lezar <[email protected]>
elezar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get this in.
This change consolidates the logic for determining requested devices
from the container image. The logic for this has been integrated into
the image.CUDA type so that multiple implementations are not required.
We will address #1049 in a follow-up.