Skip to content

Conversation

@ArangoGutierrez
Copy link
Collaborator

@ArangoGutierrez ArangoGutierrez commented Jun 5, 2025

The gated modifiers used to add support for GDS, Mofed, and CUDA Forward Comatibility only check the NVIDIA_VISIBLE_DEVICES envvar to determine whether GPUs are requested and modifications should be made. This means that use cases where volume mounts are used to request devices (e.g. when using the GPU Device Plugin) are not supported.

These changes make the extraction of devices for gated and graphics modifiers consistent with other methods.

Fixes: #1049

@ArangoGutierrez ArangoGutierrez requested review from Copilot and elezar June 5, 2025 11:28
@ArangoGutierrez ArangoGutierrez self-assigned this Jun 5, 2025
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR privatizes the legacy VisibleDevicesFromEnvVar method and ensures all modifiers use the public VisibleDevices API to query requested devices.

  • Swapped calls from VisibleDevicesFromEnvVar() to VisibleDevices() in each modifier
  • Renamed and privatized VisibleDevicesFromEnvVar to visibleDevicesFromEnvVar, and introduced a new visibleEnvVars helper
  • Updated the unit test to exercise the private visibleDevicesFromEnvVar method

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
internal/modifier/graphics.go Replaced VisibleDevicesFromEnvVar with VisibleDevices
internal/modifier/gated.go Replaced VisibleDevicesFromEnvVar with VisibleDevices
internal/modifier/csv.go Replaced VisibleDevicesFromEnvVar with VisibleDevices
internal/modifier/cdi.go Replaced VisibleDevicesFromEnvVar with VisibleDevices
internal/config/image/cuda_image_test.go Switched test to call visibleDevicesFromEnvVar
internal/config/image/cuda_image.go Privatized VisibleDevicesFromEnvVar, added visibleEnvVars, updated VisibleDevices
Comments suppressed due to low confidence (2)

internal/config/image/cuda_image.go:233

  • The new helper visibleEnvVars isn’t directly covered by existing tests. Consider adding unit tests to verify its behavior across different environment variable combinations.
func (i CUDA) visibleEnvVars() []string {

internal/config/image/cuda_image_test.go:432

  • This test now calls an unexported method; ensure the test file is in the same package (not *_test) so it can access visibleDevicesFromEnvVar, or consider testing via the public VisibleDevices method instead.
devices := image.visibleDevicesFromEnvVar()

@coveralls
Copy link

coveralls commented Jun 5, 2025

Pull Request Test Coverage Report for Build 15704089585

Details

  • 39 of 45 (86.67%) changed or added relevant lines in 4 files are covered.
  • 4 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.05%) to 33.678%

Changes Missing Coverage Covered Lines Changed/Added Lines %
internal/modifier/gated.go 0 1 0.0%
internal/modifier/graphics.go 6 11 54.55%
Files with Coverage Reduction New Missed Lines %
internal/runtime/logger.go 4 43.51%
Totals Coverage Status
Change from base Build 15683627970: 0.05%
Covered Lines: 4338
Relevant Lines: 12881

💛 - Coveralls

var devices []string
seen := make(map[string]bool)
for _, name := range container.VisibleDevicesFromEnvVar() {
for _, name := range container.VisibleDevices() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have mentioned this in #1110, but we already process the mount devices from above, and although this MAY have the desired effect we may want to revisit how we handle CDI device requests and how we distinguish between these are "regular" requests.

My suggestion would be to:

  1. Keep VisibleDevicesFromEnvVar public for now.
  2. Update the other usages as per this PR.
  3. Handle CDI devices in a follow up.

@elezar
Copy link
Member

elezar commented Jun 5, 2025

Please update the commit messages / PR heading and description to better indicate intent. We are making this change to FIX A BUG. That is not clear from the commit messages and as such this will be missed in our release notes.

@elezar
Copy link
Member

elezar commented Jun 5, 2025

Please also add unit tests to capture the new behaviour in the affected code. (ideally first failing with the existing implementation).

This PR is not a blocker for the v1.18.0-rc.1 release.

@ArangoGutierrez ArangoGutierrez changed the title Only export VisibleDevices func as valid method to query valid devices BUGFIX: modifier: respect GPU volume-mount device requests Jun 5, 2025
@ArangoGutierrez ArangoGutierrez requested a review from elezar June 5, 2025 13:52
@ArangoGutierrez ArangoGutierrez added the bug Issue/PR to expose/discuss/fix a bug label Jun 5, 2025
@ArangoGutierrez ArangoGutierrez added this to the v1.18.0 milestone Jun 5, 2025
@elezar elezar force-pushed the fix/1049 branch 2 times, most recently from f93792f to 635b99d Compare June 17, 2025 09:54
ArangoGutierrez and others added 4 commits June 17, 2025 11:54
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
The gated modifiers used to add support for GDS, Mofed, and CUDA Forward Comatibility
only check the NVIDIA_VISIBLE_DEVICES envvar to determine whether GPUs are requested
and modifications should be made. This means that use cases where volume mounts are
used to request devices are not supported.

This change ensures that device extraction is consistent for all use cases.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
@elezar elezar merged commit 208896d into NVIDIA:main Jun 17, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Issue/PR to expose/discuss/fix a bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gated modifiers ignore device requests by volume mounts

3 participants