Add DRA support for GPU pod eviction during driver upgrades #129
base: main
Conversation
shivamerla left a comment
LGTM
Don't we cordon the node before starting the upgrade? If the node is cordoned, then there won't be new allocations to that node.
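For context, cordoning amounts to marking the node unschedulable so the scheduler stops placing new pods (and new ResourceClaim allocations) there. A minimal sketch with a plain client-go clientset, not this repo's client wrapper; the helper name is illustrative:

```go
package upgrade

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// cordonNode marks the node unschedulable before the driver upgrade begins,
// preventing new GPU pods from landing on it while eviction runs.
func cordonNode(ctx context.Context, c kubernetes.Interface, nodeName string) error {
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	_, err := c.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```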
65a3f53 to 43d29cc
I think you're right here. Good point, thanks for bringing it up.
6e1a6fb to 0682513
internal/kubernetes/client.go (Outdated)
		return claim != nil, nil
	})
	if err != nil {
		client.log.Warnf("Failed to get ResourceClaim %s/%s after retries", pod.Namespace, claimName)
Should we consider returning an error here? Not getting ResourceClaims after retries seems like a legitimate reason for failing the driver upgrade.
If the claim doesn't exist, continuing is correct — no claim means no GPU allocation. We could distinguish NotFound (continue) from transient errors (fail), but that adds complexity. Given the retry + 10s timeout, the risk of missing a GPU pod due to transient errors is low.
Agreed when the claim is not found. But when there are other errors (i.e., apierrors.IsNotFound(err) == false) that show up even after retries, IMO the right thing to do is to bubble up and fail the uninstall_driver command, because it means we couldn't guarantee a clean eviction.
I think that's fair! Will look into adding this.
Updated.
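A rough sketch of the resulting behavior, assuming a plain client-go clientset and the resource.k8s.io/v1beta1 API rather than this repo's client wrapper; the function name and structure are illustrative, not the PR's actual code:

```go
package upgrade

import (
	"context"
	"fmt"
	"time"

	resourcev1beta1 "k8s.io/api/resource/v1beta1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// getClaimWithRetry polls for a pod's ResourceClaim for up to ~10s.
// NotFound is terminal (no claim, hence no GPU allocation, so nil is
// returned); any other error that persists past the timeout is returned so
// the caller can fail the upgrade instead of silently skipping the pod.
func getClaimWithRetry(ctx context.Context, c kubernetes.Interface, namespace, name string) (*resourcev1beta1.ResourceClaim, error) {
	var claim *resourcev1beta1.ResourceClaim
	var lastErr error
	pollErr := wait.PollUntilContextTimeout(ctx, time.Second, 10*time.Second, true, func(ctx context.Context) (bool, error) {
		got, err := c.ResourceV1beta1().ResourceClaims(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			if apierrors.IsNotFound(err) {
				// No claim exists: stop polling, nothing to evict for this pod.
				return true, nil
			}
			// Transient error: remember it and keep retrying until the timeout.
			lastErr = err
			return false, nil
		}
		claim = got
		return true, nil
	})
	if pollErr != nil {
		// Retries exhausted on a non-NotFound error: bubble it up so the
		// uninstall fails rather than missing a potentially GPU-backed pod.
		return nil, fmt.Errorf("getting ResourceClaim %s/%s: %w (last error: %v)", namespace, name, pollErr, lastErr)
	}
	return claim, nil
}
```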
0682513 to a355d89
Signed-off-by: Karthik Vetrivel <[email protected]>
a355d89 to 9c7ed23
Summary
Extends the driver-upgrade controller to detect and evict GPU workloads using Dynamic Resource Allocation (DRA) in addition to traditional nvidia.com/gpu resources. This ensures GPU driver upgrades work correctly as Kubernetes transitions from device plugins to the DRA model (GA in K8s 1.34+).
Testing
Tested in a kubeadm cluster (K8s 1.34) with the NVIDIA DRA driver installed:
Created test workloads:
- A DRA pod with a ResourceClaim (driver: gpu.nvidia.com)
- A traditional GPU pod (nvidia.com/gpu: 1)

Verified ResourceClaim allocation:
$ kubectl get resourceclaim -o yaml | grep -A 5 allocation
status:
  allocation:
    devices:
      results:
      - driver: gpu.nvidia.com
        device: gpu-0

Triggered driver upgrade eviction:
Verified both pods evicted successfully:
As I have it right now, I only evict DRA GPU pods if their ResourceClaim has status.allocation != nil (actively using a GPU), but I'm wondering if we should evict ANY pod with a ResourceClaim requesting nvidia.com GPUs (regardless of allocation status) to prevent race conditions where a pending claim gets allocated during the upgrade - thoughts?
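For concreteness, the two candidate predicates might look roughly like this (resource.k8s.io/v1beta1 API, illustrative function names):

```go
package upgrade

import resourcev1beta1 "k8s.io/api/resource/v1beta1"

// Current behavior: evict only when the claim is actually allocated to the
// NVIDIA driver.
func claimActivelyUsesGPU(claim *resourcev1beta1.ResourceClaim) bool {
	if claim.Status.Allocation == nil {
		return false
	}
	for _, r := range claim.Status.Allocation.Devices.Results {
		if r.Driver == "gpu.nvidia.com" {
			return true
		}
	}
	return false
}

// Broader alternative: evict as soon as the claim requests the NVIDIA device
// class, even while the allocation is still pending.
func claimRequestsGPU(claim *resourcev1beta1.ResourceClaim) bool {
	for _, req := range claim.Spec.Devices.Requests {
		if req.DeviceClassName == "gpu.nvidia.com" {
			return true
		}
	}
	return false
}
```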