Conversation

@klueska klueska added this to the v25.8.0 milestone Sep 25, 2025
@klueska klueska self-assigned this Sep 25, 2025
@klueska klueska added the usability issue/pr related to UX label Sep 25, 2025
@klueska
Collaborator Author

klueska commented Sep 25, 2025

/cc @varunrsekar

@klueska klueska moved this from Backlog to In Progress in Planning Board: k8s-dra-driver-gpu Sep 25, 2025
@jgehrcke
Collaborator

jgehrcke commented Sep 25, 2025

I read the linked issue. To me, without historical context, the UUID seems to be the natural identifier for a problem like this. Maybe the only question I have then: why was this not considered in the first place, what are the downsides?

Edit: maybe I misunderstood. This only removes that index from the slice, yes? (At first I thought we now show the UUID instead, but that's said nowhere and I probably misread that from the patch).

@klueska
Collaborator Author

klueska commented Sep 25, 2025

The only real downside is if a new physical node gets swapped in for a given logical node in k8s (which can happen in GKE across a node reboot). When this happens the physical GPUs will have changed their UUIDs, whereas the indexing scheme would stay consistent.

Making the change in this PR would be an issue if we hadn't previously introduced renaming the device to its canonical name instead of the "ID" passed into cdi.nvcdiDevice.GetDeviceSpecsByID(). That renaming ensures the generated CDI spec contains the canonical name (based on the minor number) rather than the UUID.

Using the minor number is OK (whereas using the index wasn't) because it is generated once at boot time and is predictably computed based on the PCIeBusID of the GPU. It would only be a problem if the rebooted node came back with a different number (or set) of GPUs than its predecessor, but (at least in the GKE case, which is the only one we are truly worried about) this never happens.
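To make the reasoning above concrete, here is a toy model, not the driver's or the kernel's actual logic. It assumes minors are handed out in ascending PCI bus ID order (the thread only says the minor is "predictably computed" from the bus ID), and every UUID, bus ID, and helper name in it is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    uuid: str        # tied to the physical GPU (hypothetical values below)
    pci_bus_id: str  # tied to the slot the GPU sits in


def assign_minors(gpus):
    """Assumed rule: number GPUs in ascending PCI bus ID order."""
    ordered = sorted(gpus, key=lambda g: g.pci_bus_id)
    return {g.pci_bus_id: minor for minor, g in enumerate(ordered)}


# The same logical node before and after a hypothetical physical swap:
# same slots, different physical GPUs (and therefore different UUIDs).
before = [Gpu("GPU-aaaa", "0000:3b:00.0"), Gpu("GPU-bbbb", "0000:86:00.0")]
after = [Gpu("GPU-dddd", "0000:86:00.0"), Gpu("GPU-cccc", "0000:3b:00.0")]

minors_before = assign_minors(before)
minors_after = assign_minors(after)
uuids_before = {g.uuid for g in before}
uuids_after = {g.uuid for g in after}

# The problematic case: the replacement node has a different set of GPUs.
after_fewer = [Gpu("GPU-eeee", "0000:86:00.0")]
minors_fewer = assign_minors(after_fewer)

print(minors_before == minors_after)          # slot-based numbering unchanged
print(uuids_before.isdisjoint(uuids_after))   # no UUID survives the swap
print(minors_before["0000:86:00.0"], minors_fewer["0000:86:00.0"])
```

Under this assumed rule the slot-to-minor mapping is identical before and after the swap even though no UUID carries over, and it shifts only when the set of GPUs changes.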

Contributor

@varunrsekar varunrsekar left a comment


LGTM! Thanks for doing this!

Collaborator

@jgehrcke jgehrcke left a comment


Thanks for the explanation and the test report.

@jgehrcke
Collaborator

> Using the minor number is OK

We probably need to review this; after #563 (comment):

> It looks like even device minors are subject to the same issue as GPU index

@klueska
Collaborator Author

klueska commented Sep 30, 2025

Based on the discussion in #563 I have also removed the minor number as an advertised attribute. We will continue to use it internally to name the GPU devices in the resource slice as well as for some other internal bookkeeping, but users will not be able to select GPUs based on it.
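As a rough sketch of what "used internally, but not advertised" could look like, the snippet below builds a device entry whose name is derived from the minor number while the minor itself is left out of the selectable attributes. The `gpu-<minor>` name format, the dictionary layout, and the attribute keys are all assumptions made for illustration; they are not taken from the driver's code.

```python
def canonical_name(minor):
    """Hypothetical naming scheme: derive the device name from the minor."""
    return f"gpu-{minor}"


def build_device_entry(minor, uuid, product_name):
    """Build an illustrative resource-slice entry.

    The minor number only feeds the device name. It is deliberately not
    published under "attributes", so nothing can select a GPU by it.
    """
    return {
        "name": canonical_name(minor),
        "attributes": {
            "uuid": uuid,                # hypothetical attribute key
            "productName": product_name,  # hypothetical attribute key
        },
    }


entry = build_device_entry(minor=3, uuid="GPU-aaaa", product_name="Example GPU")
print(entry["name"])
print(sorted(entry["attributes"]))
```

The point of the split is that the name remains a stable internal handle for bookkeeping, while the set of attributes users can match on no longer includes the index or the minor.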

@klueska klueska merged commit 852ac6b into NVIDIA:main Oct 6, 2025
14 checks passed
@klueska klueska moved this from In Progress to Closed in Planning Board: k8s-dra-driver-gpu Oct 6, 2025

Development

Successfully merging this pull request may close these issues.

GPU index is not a reliable attribute for advertisement

3 participants