Conversation

@elezar
Member

@elezar elezar commented Aug 15, 2025

This change updates CDI spec generation to generate specifications for coherent and non-coherent devices. At its core, this is achieved by adding device-level annotations to the CDI specification that indicate whether a device is coherent. The nvidia-ctk cdi generate command then splits the generated spec by this annotation and generates additional specs for nvidia.com/gpu.coherent and/or nvidia.com/gpu.noncoherent devices.

For example, when running on a device that supports coherence we see the following:

$ ./nvidia-ctk cdi generate --output $(pwd)/test/nvidia.yaml
$ ls -l test/
total 48
-rw-r--r-- 1 local-elezar local-elezar 20820 Aug 15 13:22 nvidia.coherent.yaml
-rw-r--r-- 1 local-elezar local-elezar 20811 Aug 15 13:22 nvidia.yaml

The only difference between the two specs is the device kind:

$ diff test/nvidia.yaml test/nvidia.coherent.yaml
3c3
< kind: nvidia.com/gpu
---

The following specs are generated:

$ nvidia-ctk cdi list --spec-dir=$(pwd)/test/
INFO[0000] Found 6 CDI devices
nvidia.com/gpu.coherent=0
nvidia.com/gpu.coherent=GPU-ca81aac1-36e7-2d26-8a15-0aa6ef17c627
nvidia.com/gpu.coherent=all
nvidia.com/gpu=0
nvidia.com/gpu=GPU-ca81aac1-36e7-2d26-8a15-0aa6ef17c627
nvidia.com/gpu=all
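
The splitting step described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the `Device` and `Spec` types, the `splitByAnnotation` helper, and the annotation key `example.com/gpu.coherent` are all hypothetical stand-ins, since neither the real CDI types nor the actual annotation key appear in this conversation.

```go
package main

import (
	"fmt"
	"sort"
)

// Device is a simplified, hypothetical stand-in for a CDI device entry.
type Device struct {
	Name        string
	Annotations map[string]string
}

// Spec is a simplified, hypothetical stand-in for a CDI specification.
type Spec struct {
	Kind    string
	Devices []Device
}

// coherentKey is an assumed annotation key; the key used by the PR is not shown here.
const coherentKey = "example.com/gpu.coherent"

// splitByAnnotation groups the devices of base by the value of the given
// annotation and returns one additional spec per known value, with the kind
// suffixed accordingly. Devices without the annotation, or with a value that
// has no suffix mapping, appear only in the base spec.
func splitByAnnotation(base Spec, key string, suffix map[string]string) []Spec {
	grouped := map[string][]Device{}
	for _, d := range base.Devices {
		v, ok := d.Annotations[key]
		if !ok {
			continue
		}
		if _, known := suffix[v]; !known {
			continue
		}
		grouped[v] = append(grouped[v], d)
	}

	// Sort the annotation values so the result order is deterministic.
	values := make([]string, 0, len(grouped))
	for v := range grouped {
		values = append(values, v)
	}
	sort.Strings(values)

	specs := make([]Spec, 0, len(values))
	for _, v := range values {
		specs = append(specs, Spec{Kind: base.Kind + "." + suffix[v], Devices: grouped[v]})
	}
	return specs
}

func main() {
	base := Spec{Kind: "nvidia.com/gpu", Devices: []Device{
		{Name: "0", Annotations: map[string]string{coherentKey: "true"}},
		{Name: "1", Annotations: map[string]string{coherentKey: "false"}},
	}}
	suffix := map[string]string{"true": "coherent", "false": "noncoherent"}
	for _, s := range splitByAnnotation(base, coherentKey, suffix) {
		for _, d := range s.Devices {
			fmt.Printf("%s=%s\n", s.Kind, d.Name)
		}
	}
}
```

In this sketch the base spec is left untouched, which matches the behaviour described above where nvidia.com/gpu devices are still generated for every device.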

This is blocked by:

@elezar elezar added this to the v1.18.0 milestone Aug 15, 2025
@elezar elezar self-assigned this Aug 15, 2025
Copilot AI left a comment

Pull Request Overview

This PR implements the generation of separate CDI specifications for coherent and non-coherent GPU devices. The implementation adds device-level annotations to indicate coherence status and splits the generated CDI spec accordingly.

  • Adds coherence annotations to GPU device specifications
  • Generates separate CDI specs for coherent/non-coherent devices in addition to the base spec
  • Introduces feature flag to control coherence annotation behavior

Reviewed Changes

Copilot reviewed 9 out of 24 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pkg/nvcdi/options.go | Adds WithFeatureFlags function and deprecates single WithFeatureFlag |
| pkg/nvcdi/api.go | Defines new FeatureDisableCoherentAnnotations feature flag |
| pkg/nvcdi/full-gpu-nvml.go | Implements device coherence annotation logic |
| pkg/nvcdi/lib-nvml.go | Updates function calls to pass feature flags |
| pkg/nvcdi/mig-device-nvml.go | Disables coherent annotations for MIG devices |
| cmd/nvidia-ctk/cdi/generate/generate.go | Implements spec splitting logic and multi-spec generation |
| cmd/nvidia-ctk/cdi/generate/generate_test.go | Updates tests for new multi-spec generation |
| deployments/devel/go.mod | Removes golangci-lint dependency |
| deployments/devel/tools.go | Removes golangci-lint import |

@elezar
Member Author

elezar commented Aug 18, 2025

We should add an attribute to the k8s-dra-driver for this too.

@elezar elezar force-pushed the coherent-non-coherent branch from c528d0f to 113317b Compare August 20, 2025 10:10
@jgehrcke jgehrcke left a comment

This is a 'trust approval', in case you need to move forward for the next release candidate. I am not sure if we have consensus in the team to use these types of approvals -- but I personally think that sometimes there's a time for them (very open to discussing this arguably controversial perspective).

Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

lgtm

@elezar elezar force-pushed the coherent-non-coherent branch from 113317b to 1dea726 Compare August 20, 2025 12:18
@copy-pr-bot
copy-pr-bot bot commented Aug 20, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@elezar elezar force-pushed the coherent-non-coherent branch from 1dea726 to 48286b9 Compare August 20, 2025 12:24
@ArangoGutierrez
Collaborator

/ok to test

dependabot bot and others added 4 commits August 20, 2025 15:05
Bumps [github.com/NVIDIA/go-nvml](https://github.com/NVIDIA/go-nvml) from 0.12.9-0 to 0.13.0-0.
- [Release notes](https://github.com/NVIDIA/go-nvml/releases)
- [Commits](NVIDIA/go-nvml@v0.12.9-0...v0.13.0-0)

---
updated-dependencies:
- dependency-name: github.com/NVIDIA/go-nvml
  dependency-version: 0.13.0-0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
With this change the nvidia-ctk cdi generate command generates
CDI specs based on whether a device supports coherent access to
system memory or not. In this case "regular" nvidia.com/gpu CDI
specs are generated for all devices as well as
nvidia.com/gpu.coherent and nvidia.com/gpu.noncoherent for devices
that are either coherent or non-coherent.

Adding the --feature-flag=disable-coherent-annotations command line
argument to the nvidia-ctk cdi generate command will disable this.

The "disable-coherent-annotations" feature flag can also be set in the
nvcdi API in which case the generated CDI device specification will
not include annotations indicating coherence.

Signed-off-by: Evan Lezar <[email protected]>
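
As a rough illustration of how a feature flag could gate the annotations described in this commit message, consider the sketch below. It is not the nvcdi API: the `FeatureFlag` type, the `generator` struct, the `Option` type, this `WithFeatureFlags` signature, and the annotation key are assumptions made for the example. Only the flag string `disable-coherent-annotations` comes from the commit message.

```go
package main

import (
	"fmt"
	"strconv"
)

// FeatureFlag is a hypothetical type for this sketch, not the nvcdi definition.
type FeatureFlag string

// DisableCoherentAnnotations uses the flag string from the commit message.
const DisableCoherentAnnotations FeatureFlag = "disable-coherent-annotations"

// generator is a hypothetical stand-in for a CDI spec generator.
type generator struct {
	flags map[FeatureFlag]bool
}

// Option configures a generator (functional options pattern).
type Option func(*generator)

// WithFeatureFlags enables the given flags. The name mirrors the function
// mentioned in the review summary, but the signature here is assumed.
func WithFeatureFlags(flags ...FeatureFlag) Option {
	return func(g *generator) {
		for _, f := range flags {
			g.flags[f] = true
		}
	}
}

func newGenerator(opts ...Option) *generator {
	g := &generator{flags: map[FeatureFlag]bool{}}
	for _, o := range opts {
		o(g)
	}
	return g
}

// deviceAnnotations returns the coherence annotation for a device, or nil
// when annotations are disabled. The key is an assumed placeholder.
func (g *generator) deviceAnnotations(coherent bool) map[string]string {
	if g.flags[DisableCoherentAnnotations] {
		return nil
	}
	return map[string]string{"example.com/gpu.coherent": strconv.FormatBool(coherent)}
}

func main() {
	withAnnotations := newGenerator()
	withoutAnnotations := newGenerator(WithFeatureFlags(DisableCoherentAnnotations))
	fmt.Println(len(withAnnotations.deviceAnnotations(true)))
	fmt.Println(len(withoutAnnotations.deviceAnnotations(true)))
}
```

With no annotations present, a splitting step that keys on them has nothing to group, so only the regular spec would be produced in this sketch.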
@elezar elezar force-pushed the coherent-non-coherent branch from d0c4c96 to 868963b Compare August 20, 2025 13:06
@elezar elezar merged commit a10c54e into NVIDIA:main Aug 20, 2025
16 checks passed
@elezar elezar deleted the coherent-non-coherent branch August 20, 2025 13:37
@guptaNswati

Is this supposed to be supported with the device-plugin as well?

elezar added a commit to elezar/nvidia-container-toolkit that referenced this pull request Sep 26, 2025
This change disables the functionality for splitting
generated CDI specifications based on device coherence by default.
This was added in NVIDIA#1247, but due to discussions around whether
coherence is a property that should be exposed, we are disabling
this by default.

Note that users can opt in to the feature by running
`nvidia-ctk cdi generate` with the `--feature-flag=enable-coherent-annotations`
command line flag. Alternatively, the `nvidia-ctk cdi generate` command can
be run with the `NVIDIA_CTK_CDI_GENERATE_FEATURE_FLAGS` environment variable set to
include "enable-coherent-annotations" (in a comma-separated list).

Signed-off-by: Evan Lezar <[email protected]>
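
A minimal sketch of how a comma-separated feature flag list from the environment might be interpreted is shown below. The environment variable name and the flag string are taken from the commit message above. The `parseFeatureFlags` and `hasFlag` helpers are hypothetical, and the trimming and empty-entry handling are assumptions rather than the documented behaviour of nvidia-ctk.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// envVar is the environment variable named in the commit message.
const envVar = "NVIDIA_CTK_CDI_GENERATE_FEATURE_FLAGS"

// parseFeatureFlags splits a comma-separated list, trimming surrounding
// whitespace and dropping empty entries (both are assumptions of this sketch).
func parseFeatureFlags(value string) []string {
	var flags []string
	for _, f := range strings.Split(value, ",") {
		if f = strings.TrimSpace(f); f != "" {
			flags = append(flags, f)
		}
	}
	return flags
}

// hasFlag reports whether name is present in flags.
func hasFlag(flags []string, name string) bool {
	for _, f := range flags {
		if f == name {
			return true
		}
	}
	return false
}

func main() {
	flags := parseFeatureFlags(os.Getenv(envVar))
	fmt.Println("coherent annotations enabled:", hasFlag(flags, "enable-coherent-annotations"))
}
```

Because the feature is opt-in after this change, an unset or empty variable leaves coherence-based splitting off in this sketch.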