-
Notifications
You must be signed in to change notification settings - Fork 98
Remove GPU index for CDI spec generation / attribute advertisement #624
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Kevin Klues <[email protected]>
|
/cc @varunrsekar |
|
I read the linked issue. To me, without historical context, the UUID seems to be the natural identifier for a problem like this. Maybe the only question I have then: why was this not considered in the first place, what are the downsides? Edit: maybe I misunderstood. This only removes that index from the slice, yes? (At first I thought we now show the UUID instead, but that's said nowhere and I probably misread that from the patch). |
|
The only real downside is if a new physical node gets swapped in for a given logical node in k8s (which can happen in GKE across a node reboot). When this happens the physical GPUs will have changed their UUIDs, whereas the indexing scheme would stay consistent. Making the change in the PR would be an issue if we hadn't (previously) introduced the renaming of the device to its canonical name instead of the "ID" passed into Using the minor number is OK (whereas using the index wasn't) because it is generated once at boot time and is predictably computed based the PCIeBusID of the GPU. It would only be a problem if the rebooted node came back with a different number (or set) of GPUs than its predecessor, but (at least in the GKE case, which is the only one we are truly worried about) this never happens. |
varunrsekar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for doing this!
jgehrcke
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation and the test report.
We probably need to review this; after #563 (comment):
|
Signed-off-by: Kevin Klues <[email protected]>
|
Based on the discussion in #563 I have also removed the minor number as an advertised attribute. We will continue to use it internally to name the GPU devices in the resource slice as well as for some other internal bookkeeping, but users will not be able to select GPUs based on it. |
Fixes: #563
Tested with: