Skip to content

Conversation

@elezar
Copy link
Member

@elezar elezar commented Jun 27, 2025

The original CDI spec generation API was focussed on NVML device specifically. Since then we have replaced the more specific functions (for GPU and MIG devices) in the API with more generally applicable functions based on mode and device IDs.

This organic growth of APIs also means that for the NVML case specifically we had multiple different implementations of CDI spec generation making keeping things consistent more difficult.

Thes changes remove the redundant functions in the nvcdi.Interface allowing devices to be requested by ID across all use cases. It also refactors the CDI spec generation for NVML devices to ensure that the same generation logic is used for all cases.

@elezar elezar force-pushed the refactor-cdi-api branch from 7389c4f to 4656c25 Compare June 27, 2025 12:46
@coveralls
Copy link

coveralls commented Jun 27, 2025

Pull Request Test Coverage Report for Build 16072906327

Details

  • 119 of 260 (45.77%) changed or added relevant lines in 12 files are covered.
  • 9 unchanged lines in 5 files lost coverage.
  • Overall coverage increased (+0.9%) to 34.054%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/nvcdi/lib-wsl.go 0 3 0.0%
pkg/nvcdi/gds.go 0 4 0.0%
pkg/nvcdi/lib-imex.go 41 45 91.11%
pkg/nvcdi/management.go 0 4 0.0%
pkg/nvcdi/mofed.go 0 4 0.0%
pkg/nvcdi/wrapper.go 12 17 70.59%
pkg/nvcdi/lib.go 3 9 33.33%
pkg/nvcdi/lib-csv.go 0 10 0.0%
pkg/nvcdi/full-gpu-nvml.go 11 31 35.48%
pkg/nvcdi/mig-device-nvml.go 15 36 41.67%
Files with Coverage Reduction New Missed Lines %
pkg/nvcdi/lib-wsl.go 1 0.0%
pkg/nvcdi/mig-device-nvml.go 1 25.42%
pkg/nvcdi/lib-imex.go 2 82.76%
pkg/nvcdi/wrapper.go 2 70.21%
pkg/nvcdi/lib-nvml.go 3 43.23%
Totals Coverage Status
Change from base Build 16054869452: 0.9%
Covered Lines: 4491
Relevant Lines: 13188

💛 - Coveralls

This comment was marked as outdated.

@elezar elezar force-pushed the refactor-cdi-api branch 3 times, most recently from 0a2b879 to a6f8a10 Compare July 1, 2025 13:18
@elezar elezar added this to the v1.18.0 milestone Jul 1, 2025
@elezar elezar marked this pull request as ready for review July 1, 2025 13:19
@elezar elezar requested a review from cdesiniotis July 1, 2025 13:28
@elezar elezar force-pushed the refactor-cdi-api branch 2 times, most recently from 2683a6b to 7753402 Compare July 1, 2025 13:43
@ArangoGutierrez ArangoGutierrez requested a review from Copilot July 1, 2025 16:11

This comment was marked as outdated.

ArangoGutierrez
ArangoGutierrez previously approved these changes Jul 2, 2025
Copy link
Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

This comment was marked as outdated.

@elezar elezar force-pushed the refactor-cdi-api branch 2 times, most recently from 77fe5cf to 66af75b Compare July 3, 2025 11:27
@elezar elezar requested a review from ArangoGutierrez July 3, 2025 15:37
@ArangoGutierrez ArangoGutierrez requested a review from Copilot July 3, 2025 16:13

This comment was marked as outdated.

ArangoGutierrez
ArangoGutierrez previously approved these changes Jul 3, 2025
Copy link
Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Thanks for this Evan, this is a great refactor of the API

elezar added 2 commits July 4, 2025 11:12
Signed-off-by: Evan Lezar <[email protected]>
@elezar elezar force-pushed the refactor-cdi-api branch 2 times, most recently from 0443041 to a248c76 Compare July 4, 2025 09:38
@ArangoGutierrez ArangoGutierrez dismissed their stale review July 4, 2025 11:05

updated commit

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the nvcdi API to unify CDI spec generation across all modes by introducing a factory-based generator pattern and removing redundant, mode-specific methods.

  • Introduce deviceSpecGeneratorFactory and DeviceSpecGenerator to replace multiple Interface methods.
  • Refactor NVML path to create and combine per-device generators with init/shutdown hooks.
  • Deprecate GetAllDeviceSpecs and route calls through GetDeviceSpecsByID("all").

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

File Description
pkg/nvcdi/wrapper.go Replaced direct Interface embedding with a factory for generating specs
pkg/nvcdi/lib-nvml.go Switched NVML path to return generators and unified init/shutdown logic
pkg/nvcdi/api.go Updated Interface to use new SpecGenerator types and deprecated legacy methods
Comments suppressed due to low confidence (2)

pkg/nvcdi/lib-nvml.go:80

  • [nitpick] The variable name DeviceSpecGenerators shadows the type name and begins with an uppercase letter. Rename it to a lowercase, distinct name like generators or dsgs.
	var DeviceSpecGenerators DeviceSpecGenerators

pkg/nvcdi/wrapper.go:78

  • Add unit tests for GetDeviceSpecsByID (and GetAllDeviceSpecs) on the wrapper to ensure correct behavior across all factory implementations.
func (l *wrapper) GetDeviceSpecsByID(devices ...string) ([]specs.Device, error) {

Comment on lines +39 to +40
// TODO: Rename this type
type deviceSpecGeneratorFactory interface {
Copy link

Copilot AI Jul 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Remove or resolve the TODO comment by renaming deviceSpecGeneratorFactory to a more descriptive name (e.g., SpecFactory).

Suggested change
// TODO: Rename this type
type deviceSpecGeneratorFactory interface {
// SpecFactory is responsible for creating device spec generators and retrieving common edits.
type SpecFactory interface {

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@elezar elezar merged commit 16bd33d into NVIDIA:main Jul 7, 2025
16 checks passed
@elezar elezar deleted the refactor-cdi-api branch July 7, 2025 09:21
elezar added a commit to elezar/nvidia-container-toolkit that referenced this pull request Jul 17, 2025
These changes fix a bug in CDI spec generation introduced in NVIDIA#1166
nvml is shutdown and initialized again -- we explicitly store UUIDs and
use these to query the device handles when generating the CDI
specification.

Signed-off-by: Evan Lezar <[email protected]>
elezar added a commit to elezar/nvidia-container-toolkit that referenced this pull request Jul 17, 2025
These changes fix a bug in CDI spec generation introduced in NVIDIA#1166
where device handles become invalid when nvml is shutdown and initialized
again. Here we explicitly store UUIDs and use these to query the
device handles when generating the CDI specification.

Signed-off-by: Evan Lezar <[email protected]>
elezar added a commit to elezar/nvidia-container-toolkit that referenced this pull request Jul 21, 2025
These changes fix a bug in CDI spec generation introduced in NVIDIA#1166
where device handles become invalid when nvml is shutdown and initialized
again. Here we explicitly store UUIDs and use these to query the
device handles when generating the CDI specification.

Signed-off-by: Evan Lezar <[email protected]>
elezar added a commit to elezar/nvidia-container-toolkit that referenced this pull request Jul 21, 2025
These changes fix a bug in CDI spec generation introduced in NVIDIA#1166
where device handles become invalid when nvml is shutdown and initialized
again. Here we explicitly store UUIDs and use these to query the
device handles when generating the CDI specification.

Signed-off-by: Evan Lezar <[email protected]>
elezar added a commit to elezar/nvidia-container-toolkit that referenced this pull request Jul 21, 2025
These changes fix a bug in CDI spec generation introduced in NVIDIA#1166
where device handles become invalid when nvml is shutdown and initialized
again. Here we explicitly store UUIDs and use these to query the
device handles when generating the CDI specification.

Signed-off-by: Evan Lezar <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants