Skip to content

Conversation

@ArangoGutierrez
Copy link
Collaborator

@ArangoGutierrez ArangoGutierrez commented May 28, 2025

This change consolidates the logic for determining requested devices
from the container image. The logic for this has been integrated into
the image.CUDA type so that multiple implementations are not required.

We will address #1049 in a follow-up.

@ArangoGutierrez ArangoGutierrez requested review from Copilot and elezar May 28, 2025 05:35
@ArangoGutierrez ArangoGutierrez self-assigned this May 28, 2025
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Unify device selection logic by centralizing it in a new CUDA.GetDevices method and updating all callers to use this unified API.

  • Introduce CUDA.GetDevices, WithSpec, and WithSwarmResource to the CUDA image builder
  • Refactor internal modifiers and the runtime hook to call GetDevices and updated privileged checks
  • Add method-based IsPrivileged and tests for it; add swarm-resource support

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

Show a summary per file
File Description
internal/modifier/gated.go Switched to GetDevices and changed log level to Debugf for no-ops
internal/modifier/cdi.go Updated privileged check to use new method-based API
internal/config/image/cuda_image.go Added GetDevices, IsSwarmResource, and refactored CUDA struct
internal/config/image/privileged.go Converted IsPrivileged from free function to CUDA method
internal/config/image/builder.go Added WithSpec and WithSwarmResource builder options
internal/config/image/privileged_test.go New unit tests for the IsPrivileged method
cmd/.../container_config.go Migrated hook code to call CUDA.GetDevices and IsPrivileged
cmd/.../container_config_test.go Updated hook tests to use WithPrivileged and WithSwarmResource
Comments suppressed due to low confidence (3)

internal/config/image/cuda_image.go:234

  • New swarm-resource logic in VisibleDevicesFromEnvVar isn’t covered by existing tests. Consider adding unit tests for IsSwarmResource, WithSwarmResource, and VisibleDevicesFromEnvVar when swarm resources are present.
if i.IsSwarmResource() {

internal/modifier/cdi.go:110

  • The variable container is undefined in this scope. It looks like you intended to call the new image.IsPrivileged() method or pass the correct object. Please replace container.IsPrivileged() with the appropriate call.
if cfg.AcceptEnvvarUnprivileged || container.IsPrivileged() {

internal/config/image/cuda_image.go:90

  • The comment refers to IsSwarm but the method is named IsSwarmResource. Update the comment to match the method name and clarify its purpose.
// IsSwarm returns whether the image is a Docker Swarm image.

@coveralls
Copy link

coveralls commented May 28, 2025

Pull Request Test Coverage Report for Build 15464918603

Details

  • 90 of 106 (84.91%) changed or added relevant lines in 7 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.3%) to 33.619%

Changes Missing Coverage Covered Lines Changed/Added Lines %
internal/config/image/cuda_image.go 27 29 93.1%
internal/modifier/cdi.go 0 5 0.0%
cmd/nvidia-container-runtime-hook/container_config.go 10 19 52.63%
Files with Coverage Reduction New Missed Lines %
cmd/nvidia-container-runtime-hook/container_config.go 1 58.54%
Totals Coverage Status
Change from base Build 15437228265: 0.3%
Covered Lines: 4272
Relevant Lines: 12707

💛 - Coveralls

legacyImage := image.IsLegacy()

devices := hookConfig.getDevices(image, privileged)
devices := image.GetDevices(hookConfig.AcceptDeviceListAsVolumeMounts, hookConfig.AcceptEnvvarUnprivileged)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about also passing these two options to the image constructor. So that we only call: image.VisibleDevices here instead of having to pass through the config.

I already had a rough branch from some time ago which I have pushed here: https://github.com/elezar/nvidia-container-toolkit/tree/respect-volume-mounts-gated

mounts []specs.Mount
disableRequire bool

swarmResourceEnvvars []string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be swarmResourceEnvvars, but visibleDeviceEnvvars.

)

type builder struct {
spec *specs.Spec
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we should include a reference to the spec here -- or if we do we should remove the env and mounts below since they contain the same information -- or if they don't it is not clear which should take precedence.

Comment on lines 141 to 144
// WithPrivileged sets the privileged option for the CUDA image.
// This is to allow testing the privileged mode of the container.
// DO NOT USE THIS IN PRODUCTION CODE. FOR TESTING PURPOSES ONLY.
func WithPrivileged(privileged bool) Option {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this not for "Production"? Even in production we need to determine whether a container is privileged.

return devices
}

func (i CUDA) GetDevices(acceptDeviceListAsVolumeMounts, acceptEnvvarUnprivileged bool) []string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (i CUDA) GetDevices(acceptDeviceListAsVolumeMounts, acceptEnvvarUnprivileged bool) []string {
func (i CUDA) VisibleDevices() []string {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly out of scope for this change, but the same logic needs to be implemented for IMEX channels and CDI devices.

func (i CUDA) GetDevices(acceptDeviceListAsVolumeMounts, acceptEnvvarUnprivileged bool) []string {
// If enabled, try and get the device list from volume mounts first
if acceptDeviceListAsVolumeMounts {
devices := i.VisibleDevicesFromMounts()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to make VisibleDevicesFromMounts private.

Comment on lines 231 to 272
if i.IsSwarmResource() {
return i.DevicesFromEnvvars(i.swarmResourceEnvvars...).List()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current behaviour in container_config is sublty different. We return the devices for the first ENVVAR that is set in the list that is set. I don't think that's what DevicesFromEnvvars does.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you comment on this? Looking at the latest changes this was not addressed?

}

// Fallback to reading from the environment variable if privileges are correct
devices := i.VisibleDevicesFromEnvVar()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to make this private.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As called out, we can do this in a follow-up.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR unifies the logic for retrieving visible devices by consolidating the device detection functionality into the internal/config/image package and updating related APIs. Key changes include:

  • Replacing legacy device visibility functions with a unified VisibleDevices method.
  • Refactoring test and builder options to support the updated device logic.
  • Updating log levels and type conversions to align with the new implementation.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
internal/modifier/gated.go Updated device lookup and log level change for device modification.
internal/modifier/cdi.go Adjusted type casting for the IsPrivileged call.
internal/config/image/privileged_test.go Added tests for the new privileged logic.
internal/config/image/privileged.go Redesigned IsPrivileged using a CapabilitiesGetter interface.
internal/config/image/cuda_image_test.go Updated test calls to reflect the new function naming conventions.
internal/config/image/cuda_image.go Refactored CUDA image creation and unified visible devices extraction.
internal/config/image/builder.go Refactored builder to embed CUDA and updated its option methods.
cmd/nvidia-container-runtime-hook/* Adjusted option usage and device lookup calls to integrate changes.
Comments suppressed due to low confidence (2)

internal/modifier/gated.go:40

  • Ensure that the renaming from VisibleDevicesFromEnvVar to VisibleDevices aligns with the updated implementation and that all calling contexts are updated accordingly.
devices := image.VisibleDevices()

internal/config/image/builder.go:56

  • [nitpick] Verify that embedding the CUDA struct in the builder and returning it directly does not bypass any necessary initialization previously performed explicitly.
return b.CUDA, nil

for _, tc := range tests {
t.Run(tc.description, func(t *testing.T) {
image, _ := image.New(
opts := []image.Option{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't the results of this test ONLY depend on the CUDA image implementation? If this is the case, they should be moved to the config/image package.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. This specific one does depend on other code, but the one below could be moved.

image.WithAcceptEnvvarUnprivileged(tc.acceptUnprivileged),
)
defaultConfig, _ := config.GetDefault()
cfg := &hookConfig{defaultConfig}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cfg is unused.

"github.com/stretchr/testify/require"
)

func TestIsPrivileged(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these tests are covered in the new file privileged_test.go. We should readd them.

Comment on lines 40 to 42
devices := image.VisibleDevices()
if len(devices) == 0 {
logger.Debugf("No modification required; no devices requested")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a functional change and should at the very least be a separate commit from the refactoring which should only include non-functional changes.

Comment on lines 41 to 42
visibleDevicesEnvVars []string
annotationPrefixes []string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a nit: Let's move this towards the end of the struct so that we logically separate the input (env, annotations, mounts) and settings that control behaviour.

legacyCudaVersion := i.env[EnvVarCudaVersion]
cudaRequire := i.env[EnvVarNvidiaRequireCuda]
return len(legacyCudaVersion) > 0 && len(cudaRequire) == 0
cudaVersion := i.Getenv(EnvVarCudaVersion)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the reason for the switch to Getenv? I understand that we may be using things inconsistently here, but let's rather address that as a follow-up to keep the refactoring diff smaller?

Comment on lines 97 to 99
// IsSwarm returns whether the image is a Docker Swarm image.
func (i CUDA) IsSwarmResource() bool {
return len(i.visibleDevicesEnvVars) > 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not swarm specific. I don't think this function is useful.

Comment on lines 130 to 131
disable := i.Getenv(EnvVarNvidiaDisableRequire)
if disable != "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert.

}

// DevicesFromEnvvars returns the devices requested by the image through environment variables
func (i CUDA) DevicesFromEnvvars(envVars ...string) VisibleDevices {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the changes to this function? If we're going to clean it up, let's make sure that this is a separate commit or even PR.

Comment on lines 237 to 272
if i.IsSwarmResource() {
return i.DevicesFromEnvvars(i.visibleDevicesEnvVars...).List()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering how this is implemented in container_config, I believe this should be:

	for _, envvar := range i.visibleDevicesEnvVars {
		if _, ok := i.env[envvar]; ok {
			return i.DevicesFromEnvvars(envvar).List()
		}
	}

(I think there was a bug in the original implementation, but I will have to confirm).

type CUDA struct {
env map[string]string
visibleDevicesEnvVars []string
annotationPrefixes []string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not used, so remove this from the change set.

Suggested change
annotationPrefixes []string

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed both annotationPrefixes and annotations from this PR

Comment on lines 228 to 229
// the NVIDIA_VISIBLE_DEVICES environment variable or any variables specified
// in visibleDevicesEnvVars.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still not accurate.

return devices
}

func (i CUDA) VisibleDevices() []string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a docstring with the intended behaviour.

if i.isPrivileged || i.acceptEnvvarUnprivileged {
return devices
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one thing that we're missing from the initial implementation is that we're not logging the fact that we're ignoring the devices requested as envvars.

Let's add a logger to the CUDA image as well and add the logging here:

	i.logger.Warningf("Ignoring devices specified in NVIDIA_VISIBLE_DEVICES in unprivileged container")

@elezar elezar force-pushed the i/1049 branch 2 times, most recently from d280029 to 95b458f Compare June 4, 2025 21:19
@ArangoGutierrez ArangoGutierrez force-pushed the i/1049 branch 2 times, most recently from c563b87 to 924913f Compare June 5, 2025 08:30
@ArangoGutierrez ArangoGutierrez requested review from Copilot and elezar June 5, 2025 08:30
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors how CUDA image device requests are initialized and unified across internal packages, centralizing logic in internal/config/image.

  • Introduce functional options (e.g., WithLogger) to NewCUDAImageFromSpec and New builder.
  • Refactor IsPrivileged to use a CapabilitiesGetter interface and adapt call sites.
  • Extend and unify test coverage for visible device selection from env vars and mounts.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
internal/runtime/runtime_factory.go Pass WithLogger(logger) into NewCUDAImageFromSpec
internal/modifier/cdi.go Use options variant of NewCUDAImageFromSpec and updated cast
internal/config/image/privileged_test.go Add table-driven tests for IsPrivileged
internal/config/image/privileged.go Refactor IsPrivileged to use CapabilitiesGetter
internal/config/image/cuda_image_test.go Rename constructor calls and add extensive VisibleDevices tests
internal/config/image/cuda_image.go Change signature of NewCUDAImageFromSpec, add options & logic
internal/config/image/builder.go Introduce Option funcs for builder and embed CUDA fields
cmd/nvidia-container-runtime-hook/hook_config.go Add nil-check in getSwarmResourceEnvvars
cmd/nvidia-container-runtime-hook/container_config_test.go Update test to pass new builder options
cmd/nvidia-container-runtime-hook/container_config.go Refactor hook to use new image API and remove duplicated logic
Comments suppressed due to low confidence (2)

internal/config/image/cuda_image.go:293

  • [nitpick] Consider creating an issue or tracking ticket for this TODO or implementing the merge to avoid stale TODOs.
// TODO: This should be merged with getDevicesFromMounts used in the NVIDIA Container Runtime

cmd/nvidia-container-runtime-hook/container_config_test.go:479

  • The error return from image.New is ignored; consider using require.NoError(t, err) to catch instantiation failures in tests.
image, _ := image.New(

@elezar elezar changed the title Unify GetDevices logic at internal/config/image Refactor extracting requested devices from the container image Jun 5, 2025
Comment on lines 133 to 137
// WithPreferredVisibleDevicesEnvVars sets the visible devices environment variables.
// For each envvar passed:
// - Split on commas
// - Trim spaces and ignore empty strings
// - Concatenate all of this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is not accurate.


env map[string]string
mounts []specs.Mount

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

Comment on lines +63 to +68
specOpts := []Option{
WithEnv(env),
WithMounts(spec.Mounts),
)
WithPrivileged(IsPrivileged((*OCISpec)(spec))),
}

return New(append(opts, specOpts...)...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing about this behaviour is that it ASSUMES that the env and mounts from the spec are preferred. It would be good to add a test for this -- although that's not a blocker for this PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does, but then the return appends the opts, so if the caller of NewCUDAImageFromSpec set's the WithPreferred... that will be present during the construction of the image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm speaking specifically of env and mounts, not preferredVisibleDevicesEnvVars. The point of a test is to ensure that LATER modifications don't change behaviour.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Add unit test for NewCUDAImageFromSpec

}

// We log a warning if we are ignoring the environment variable requests.
i.logger.Warningf("Ignoring devices specified in NVIDIA_VISIBLE_DEVICES in unprivileged container")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically this could be checking other envvars. I'm ok to leave this as is though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, because if the previous check is not fulfilled, we will land here, and if this is a SWARM use case for example we would also be ignoring preferredVisibleDeviceEnvVars

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something like

	if len(i.preferredVisibleDeviceEnvVars) > 0 {
		i.logger.Warningf("Ignoring devices specified in %v in unprivileged container", i.preferredVisibleDeviceEnvVars)
	} else {
		i.logger.Warningf("Ignoring devices specified in NVIDIA_VISIBLE_DEVICES in unprivileged container")
	}

Copy link
Member

@elezar elezar Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does not match the logic that we use to determine which envvars to use.

We need:

	var ignoredEnvVars []string
	for _, envVar := range i.preferredVisibleDeviceEnvVars {
		if i.HasEnvvar(envVar) {
			ignoredEnvVars = append(ignoredEnvVars, envVar)
		}
	}
	if len(ignoredEnvVars) == 0 {
		ignoredEnvVars = append(ignoredEnvVars, EnvVarNvidiaVisibleDevices)
	}

	// We log a warning if we are ignoring the environment variable requests.
	i.logger.Warningf("Ignoring devices requested by environment variable(s) in unprivileged container: %v", strings.Join(ignoredEnvVars, ", "))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(strictly speaking we should factor this into a function and use that from VisibleDevicesFromEnvVar, but I don't want to waste cycles on this). Let's do this as a follow-up. Please create an issue to track.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll create an issue once we merge the PR, so I can point to places in the code in the issue

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about

// visibleEnvVars returns the environment variables that are used to determine device visibility.
// It returns the preferred environment variables that are set, or NVIDIA_VISIBLE_DEVICES if none are set.
func (i CUDA) visibleEnvVars() []string {
	var envVars []string
	for _, envVar := range i.preferredVisibleDeviceEnvVars {
		if i.HasEnvvar(envVar) {
			envVars = append(envVars, envVar)
		}
	}
	if len(envVars) == 0 {
		envVars = append(envVars, EnvVarNvidiaVisibleDevices)
	}
	return envVars
}

(I have tested it and it works for both the logging use case and the VisibleDevicesFromEnvVar use case

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: add a patch with #1110 (comment)

}
}

// WithPreferredVisibleDevicesEnvVars sets the visible devices environment variables.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// WithPreferredVisibleDevicesEnvVars sets the visible devices environment variables.
// WithPreferredVisibleDevicesEnvVars sets the environment variables that should take precendence over the default NVIDIA_VISIBLE_DEVICES.
// If any of these are present in the image, they are used to determine which devices were requested.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@ArangoGutierrez ArangoGutierrez force-pushed the i/1049 branch 2 times, most recently from 8fcd35a to fff5f82 Compare June 5, 2025 10:19
This change consolidates the logic for determining requested devices
from the container image. The logic for this has been integrated into
the image.CUDA type so that multiple implementations are not required.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Co-authored-by: Evan Lezar <[email protected]>
Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get this in.

@elezar elezar merged commit b55255e into NVIDIA:main Jun 5, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants