Add DRA support for GPU pod eviction during driver upgrades #129
base: main
Conversation
shivamerla left a comment
LGTM
Don't we cordon the node before starting the upgrade? If the node is cordoned, then there won't be new allocations to that node.
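For context, cordoning amounts to marking the node unschedulable so the scheduler stops placing new pods (and new ResourceClaim allocations) there. A minimal sketch with a plain client-go clientset, not this repo's client wrapper; the helper name is illustrative:

```go
package upgrade

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// cordonNode marks the node unschedulable before the driver upgrade begins,
// preventing new GPU pods from landing on it while eviction runs.
func cordonNode(ctx context.Context, c kubernetes.Interface, nodeName string) error {
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	_, err := c.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```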
65a3f53 to 43d29cc
I think you're right here. Good point, thanks for bringing it up.
6e1a6fb to 0682513
internal/kubernetes/client.go (Outdated)
		return claim != nil, nil
	})
	if err != nil {
		client.log.Warnf("Failed to get ResourceClaim %s/%s after retries", pod.Namespace, claimName)
Should we consider returning an error here? Not getting ResourceClaims after retries seems like a legitimate reason for failing the driver upgrade.
If the claim doesn't exist, continuing is correct — no claim means no GPU allocation. We could distinguish NotFound (continue) from transient errors (fail), but that adds complexity. Given the retry + 10s timeout, the risk of missing a GPU pod due to transient errors is low.
Agreed when the claim is not found. But when there are other errors (i.e., apierrors.IsNotFound(err) == false) that show up even after retries, IMO the right thing to do is to bubble up and fail the uninstall_driver command, because it means we couldn't guarantee a clean eviction.
I think that's fair! Will look into adding this.
Updated.
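A rough sketch of the resulting behavior, assuming a plain client-go clientset and the resource.k8s.io/v1beta1 API rather than this repo's client wrapper; the function name and structure are illustrative, not the PR's actual code:

```go
package upgrade

import (
	"context"
	"fmt"
	"time"

	resourcev1beta1 "k8s.io/api/resource/v1beta1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// getClaimWithRetry polls for a pod's ResourceClaim for up to ~10s.
// NotFound is terminal (no claim, hence no GPU allocation, so nil is
// returned); any other error that persists past the timeout is returned so
// the caller can fail the upgrade instead of silently skipping the pod.
func getClaimWithRetry(ctx context.Context, c kubernetes.Interface, namespace, name string) (*resourcev1beta1.ResourceClaim, error) {
	var claim *resourcev1beta1.ResourceClaim
	var lastErr error
	pollErr := wait.PollUntilContextTimeout(ctx, time.Second, 10*time.Second, true, func(ctx context.Context) (bool, error) {
		got, err := c.ResourceV1beta1().ResourceClaims(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			if apierrors.IsNotFound(err) {
				// No claim exists: stop polling, nothing to evict for this pod.
				return true, nil
			}
			// Transient error: remember it and keep retrying until the timeout.
			lastErr = err
			return false, nil
		}
		claim = got
		return true, nil
	})
	if pollErr != nil {
		// Retries exhausted on a non-NotFound error: bubble it up so the
		// uninstall fails rather than missing a potentially GPU-backed pod.
		return nil, fmt.Errorf("getting ResourceClaim %s/%s: %w (last error: %v)", namespace, name, pollErr, lastErr)
	}
	return claim, nil
}
```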
0682513 to a355d89
Signed-off-by: Karthik Vetrivel <[email protected]>
a355d89 to 9c7ed23
Summary
Extends the driver-upgrade controller to detect and evict GPU workloads using Dynamic Resource Allocation (DRA) in addition to traditional nvidia.com/gpu resources. This ensures GPU driver upgrades work correctly as Kubernetes transitions from device plugins to the DRA model (GA in K8s 1.34+).
Testing
Tested in a kubeadm cluster (K8s 1.34) with the NVIDIA DRA driver installed:
Created test workloads:
- A DRA pod with a ResourceClaim (driver: gpu.nvidia.com)
- A traditional GPU pod (nvidia.com/gpu: 1)

Verified ResourceClaim allocation:
$ kubectl get resourceclaim -o yaml | grep -A 5 allocation
status:
  allocation:
    devices:
      results:
      - driver: gpu.nvidia.com
        device: gpu-0

Triggered driver upgrade eviction:
Verified both pods evicted successfully:
As I have it right now, I only evict DRA GPU pods if their ResourceClaim has status.allocation != nil (actively using a GPU), but I'm wondering if we should evict ANY pod with a ResourceClaim requesting nvidia.com GPUs (regardless of allocation status) to prevent race conditions where a pending claim gets allocated during the upgrade - thoughts?
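For concreteness, the two candidate predicates might look roughly like this (resource.k8s.io/v1beta1 API, illustrative function names):

```go
package upgrade

import resourcev1beta1 "k8s.io/api/resource/v1beta1"

// Current behavior: evict only when the claim is actually allocated to the
// NVIDIA driver.
func claimActivelyUsesGPU(claim *resourcev1beta1.ResourceClaim) bool {
	if claim.Status.Allocation == nil {
		return false
	}
	for _, r := range claim.Status.Allocation.Devices.Results {
		if r.Driver == "gpu.nvidia.com" {
			return true
		}
	}
	return false
}

// Broader alternative: evict as soon as the claim requests the NVIDIA device
// class, even while the allocation is still pending.
func claimRequestsGPU(claim *resourcev1beta1.ResourceClaim) bool {
	for _, req := range claim.Spec.Devices.Requests {
		if req.DeviceClassName == "gpu.nvidia.com" {
			return true
		}
	}
	return false
}
```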