
Conversation

@karthikvetrivel (Member) commented Nov 17, 2025:

Summary

Extends the driver-upgrade controller to detect and evict GPU workloads using Dynamic Resource Allocation (DRA) in addition to traditional nvidia.com/gpu resources. This ensures GPU driver upgrades work correctly as Kubernetes transitions from device plugins to the DRA model (GA in K8s 1.34+).
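
For context on what the detection involves: a pod counts as a DRA GPU pod when one of its pod.spec.resourceClaims resolves to a ResourceClaim whose allocation results name the gpu.nvidia.com driver. Below is a minimal sketch of that check, assuming the resource.k8s.io/v1 client from client-go 1.34+; the function names and wiring here are illustrative, not this PR's actual code.

```go
// Illustrative sketch only (not this PR's actual code): resolve each of the
// pod's resource claims to a ResourceClaim and check whether any allocation
// result was produced by the NVIDIA DRA driver.
package upgrade

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Driver name reported in ResourceClaim allocation results by the NVIDIA DRA driver.
const nvidiaDRADriver = "gpu.nvidia.com"

// podUsesNvidiaDRA reports whether any ResourceClaim referenced by the pod has
// devices allocated by the gpu.nvidia.com driver.
func podUsesNvidiaDRA(ctx context.Context, cs kubernetes.Interface, pod *corev1.Pod) (bool, error) {
	for _, pc := range pod.Spec.ResourceClaims {
		claimName := resolveClaimName(pod, pc)
		if claimName == "" {
			continue // template-generated claim not created yet
		}
		// ResourceV1() assumes the resource.k8s.io/v1 API (GA in K8s 1.34);
		// older clusters expose the same types under v1beta1/v1beta2.
		claim, err := cs.ResourceV1().ResourceClaims(pod.Namespace).Get(ctx, claimName, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		if claim.Status.Allocation == nil {
			continue // claim exists but nothing is allocated yet
		}
		for _, result := range claim.Status.Allocation.Devices.Results {
			if result.Driver == nvidiaDRADriver {
				return true, nil
			}
		}
	}
	return false, nil
}

// resolveClaimName maps a pod-level claim reference to a concrete ResourceClaim
// name, covering both direct references and claims generated from a
// ResourceClaimTemplate (recorded in pod.Status.ResourceClaimStatuses).
func resolveClaimName(pod *corev1.Pod, pc corev1.PodResourceClaim) string {
	if pc.ResourceClaimName != nil {
		return *pc.ResourceClaimName
	}
	for _, st := range pod.Status.ResourceClaimStatuses {
		if st.Name == pc.Name && st.ResourceClaimName != nil {
			return *st.ResourceClaimName
		}
	}
	return ""
}
```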

Testing

Tested in a kubeadm cluster (K8s 1.34) with NVIDIA DRA driver installed:

  1. Created test workloads:

    • DRA GPU pod with allocated ResourceClaim (driver: gpu.nvidia.com)
    • Traditional GPU pod (nvidia.com/gpu: 1)
  2. Verified ResourceClaim allocation:

    $ kubectl get resourceclaim -o yaml | grep -A 5 allocation
    status:
      allocation:
        devices:
          results:
          - driver: gpu.nvidia.com
            device: gpu-0
  3. Triggered driver upgrade eviction:

    $ k8s-driver-manager uninstall_driver --node-name=<node>
    
    GPU pod - default/dra-allocated-pod
    GPU pod - default/traditional-allocated-pod
    evicting pod default/traditional-allocated-pod
    evicting pod default/dra-allocated-pod
  4. Verified both pods evicted successfully:

    $ kubectl get pods
    No resources found

As I have it right now, I only evict DRA GPU pods whose ResourceClaim has status.allocation != nil (i.e. actively using a GPU). I'm wondering if we should evict ANY pod with a ResourceClaim requesting nvidia.com GPUs (regardless of allocation status) to prevent race conditions where a pending claim gets allocated during the upgrade - thoughts?

@shivamerla (Contributor) left a comment:

LGTM

@karthikvetrivel karthikvetrivel marked this pull request as ready for review November 18, 2025 13:16
@rahulait commented:

> I'm wondering if we should evict ANY pod with a ResourceClaim requesting nvidia.com GPUs (regardless of allocation status) to prevent race conditions where a pending claim gets allocated during the upgrade - thoughts?

Don't we cordon the node before starting the upgrade? If the node is cordoned, then there won't be new allocations to that node.
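
For reference, cordoning just sets spec.unschedulable on the node, so the scheduler cannot place new pods there and no new ResourceClaims get allocated to it during the upgrade. A minimal client-go sketch follows; the cordonNode helper is a hypothetical illustration, not part of this PR.

```go
// Minimal sketch of the cordon step (the cordonNode helper is an assumption
// for illustration, not part of this PR): marking the node unschedulable has
// the same effect as `kubectl cordon <node>`.
package upgrade

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// cordonNode sets spec.unschedulable=true so the scheduler places no new pods
// (and therefore no new DRA allocations happen) on the node during the upgrade.
func cordonNode(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	_, err := cs.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```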

@karthikvetrivel karthikvetrivel force-pushed the feature/dra-gpu-pod-eviction branch from 65a3f53 to 43d29cc on November 25, 2025 19:26
@karthikvetrivel (Member, Author) commented:

> > I'm wondering if we should evict ANY pod with a ResourceClaim requesting nvidia.com GPUs (regardless of allocation status) to prevent race conditions where a pending claim gets allocated during the upgrade - thoughts?
>
> Don't we cordon the node before starting the upgrade? If the node is cordoned, then there won't be new allocations to that node.

I think you're right here. Good point, thanks for bringing it up.

@karthikvetrivel karthikvetrivel force-pushed the feature/dra-gpu-pod-eviction branch 2 times, most recently from 6e1a6fb to 0682513 on November 25, 2025 20:32
	return claim != nil, nil
})
if err != nil {
	client.log.Warnf("Failed to get ResourceClaim %s/%s after retries", pod.Namespace, claimName)
Contributor commented:

Should we consider returning an error here? Not getting ResourceClaims after retries seems like a legitimate reason for failing the driver upgrade.

Member Author replied:

If the claim doesn't exist, continuing is correct — no claim means no GPU allocation. We could distinguish NotFound (continue) from transient errors (fail), but that adds complexity. Given the retry + 10s timeout, the risk of missing a GPU pod due to transient errors is low.

Contributor replied:

Agreed when the claim is not found. But when other errors (i.e. apierrors.IsNotFound(err) == false) show up even after retries, IMO the right thing to do is to bubble them up and fail the uninstall_driver command, because it means we couldn't guarantee a clean eviction.
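
For illustration, the suggested handling could look roughly like the sketch below; the helper name, clientset wiring, and use of the resource.k8s.io/v1 client are assumptions, not this PR's final code.

```go
// Illustrative sketch of the suggestion (helper name, clientset wiring, and the
// resource.k8s.io/v1 client are assumptions, not this PR's final code):
// NotFound means "no GPU allocation" and is not an error, while any other
// failure that survives the retries is returned so uninstall_driver can fail.
package upgrade

import (
	"context"
	"fmt"
	"time"

	resourcev1 "k8s.io/api/resource/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// getResourceClaimWithRetry returns the claim, or (nil, nil) when it does not
// exist, or a non-nil error when transient failures persist past the timeout.
func getResourceClaimWithRetry(ctx context.Context, cs kubernetes.Interface, namespace, name string) (*resourcev1.ResourceClaim, error) {
	var claim *resourcev1.ResourceClaim
	err := wait.PollUntilContextTimeout(ctx, time.Second, 10*time.Second, true,
		func(ctx context.Context) (bool, error) {
			var getErr error
			claim, getErr = cs.ResourceV1().ResourceClaims(namespace).Get(ctx, name, metav1.GetOptions{})
			if apierrors.IsNotFound(getErr) {
				return false, getErr // stop retrying: the claim genuinely does not exist
			}
			if getErr != nil {
				return false, nil // transient error: keep retrying until the timeout
			}
			return true, nil
		})
	if apierrors.IsNotFound(err) {
		return nil, nil // no claim means no GPU allocation for this pod; caller can continue
	}
	if err != nil {
		// Errors persisted past the retries: bubble up so the driver upgrade
		// fails instead of proceeding without a guaranteed clean eviction.
		return nil, fmt.Errorf("failed to get ResourceClaim %s/%s: %w", namespace, name, err)
	}
	return claim, nil
}
```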

Member Author replied:

I think that's fair! Will look into adding this.

Member Author replied:

Updated.

@karthikvetrivel karthikvetrivel force-pushed the feature/dra-gpu-pod-eviction branch from 0682513 to a355d89 on November 25, 2025 20:57
@karthikvetrivel karthikvetrivel force-pushed the feature/dra-gpu-pod-eviction branch from a355d89 to 9c7ed23 on November 26, 2025 18:20