Summary
One use case for DRA Device Binding Conditions is an attachment scenario in which a fabric-attached device, such as a GPU connected via a PCIe or CXL switch, is provisioned in response to a ResourceClaim's request and subsequently attached. BindingConditions play a key role in delaying Pod scheduling until the provisioning is complete. In this use case, a fabric device that is not yet attached to a node is represented as a ResourceSlice separate from the devices already attached to the node. Once the fabric device is actually attached to the node, it should be exposed as a node-local ResourceSlice. Thus, in an attachment scenario for fabric devices, a single device effectively appears to transition between ResourceSlices. The following sections discuss the detailed approach to achieve this behavior.
Device Attachment Scenario with the Traditional Approach
First, create a ResourceSlice for the fabric device, separate from the node-local one, as shown below. Do not specify nodeName explicitly; instead, use nodeSelector so that every node to which the device could be attached matches the criteria.
This ResourceSlice should use the same driverName and GPU model name as the node-local one. This allows users to request devices via ResourceClaim without distinguishing between node-local and fabric devices.
When a ResourceClaim is allocated to a fabric device within this ResourceSlice, the scheduler enters a wait state due to the BindingConditions. During this time, an external controller attaches the fabric device to an actual node. After the device is attached, it appears in the node-local ResourceSlice and is removed from the fabric device’s ResourceSlice.
At this point, the fabric device is still allocated to the ResourceClaim, so the allocation information needs to be updated to reflect the attached device.
In the traditional approach, this was achieved by writing BindingFailureConditions to the ResourceClaim and invoking the scheduler to re-schedule the Pod. This caused the ResourceClaim allocation to be re-executed, resulting in a node-local device being allocated. In other words, this approach achieves Pod scheduling onto the attached device through multiple scheduling cycles.
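As a rough illustration of this flow, the external controller could mark one of the bindingFailureConditions on the allocated device in the ResourceClaim status, which the scheduler then treats as a signal to re-schedule. The sketch below assumes this is reported via status.devices[].conditions; the claim name, reason, and timestamp are hypothetical, and the condition type matches the fabric ResourceSlice example that follows.

apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: gpu-claim                      # hypothetical claim name
status:
  devices:
  - driver: gpu.nvidia.com
    pool: nvidia-a100-80
    device: nvidia-a100-80-gpu0
    conditions:
    - type: FabricDeviceReschedule     # one of the device's bindingFailureConditions
      status: "True"
      reason: DeviceAttached           # hypothetical reason
      message: The fabric device has been attached to a node; re-scheduling is requested.
      lastTransitionTime: "2025-01-01T00:00:00Z"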
ResourceSlice for fabric device
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: gpu.nvidia.com-7vg86
spec:
  devices:
  - attributes:
      productName:
        string: NVIDIA A100 80GB PCIe
      type:
        string: gpu
    bindingConditions:
    - FabricDeviceReady
    bindingFailureConditions:
    - FabricDeviceReschedule
    - FabricDeviceFailed
    bindsToNode: true
    name: nvidia-a100-80-gpu0
  driver: gpu.nvidia.com
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: cohdi.io/nvidia-a100-80
        operator: In
        values:
        - "true"
  pool:
    generation: 1
    name: nvidia-a100-80
    resourceSliceCount: 1
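For reference, once the fabric device has been attached, it would be expected to show up in the vendor driver's node-local ResourceSlice roughly as sketched below. The ResourceSlice name, node name, pool name, and device name are hypothetical; only the driver and productName are assumed to stay the same as in the fabric example above.

apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: node-a-gpu.nvidia.com-xxxxx    # hypothetical
spec:
  devices:
  - attributes:
      productName:
        string: NVIDIA A100 80GB PCIe
      type:
        string: gpu
    name: gpu-0                        # hypothetical node-local device name
  driver: gpu.nvidia.com
  nodeName: node-a                     # the node the device was attached to (hypothetical)
  pool:
    generation: 1
    name: node-a-gpus                  # hypothetical node-local pool name
    resourceSliceCount: 1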
Problem
The following issues have been pointed out regarding this approach.
- This scenario relies on a fail-and-reschedule approach, which is not the typical usage of this feature.
- There should never be a situation where attachment fails during binding for reasons other than hardware failure.
- When BindingFailureConditions is marked and the Pod is rescheduled, if other Pods are being created around the same time, race conditions could cause the attached device to be taken by another Pod. This approach carries a potential handoff issue.
Resolution requiring discussion
Happy path: no setting of binding failure conditions and no forced rescheduling
This solution incorporates a handshake mechanism to ensure that the Pod is reliably allocated to the attached device. It can be divided into the following two steps.
First Step:
To prevent the attached device from being taken by another Pod, the DRADeviceTaints feature is used. The device taint must be guaranteed to be active at the point when the fabric device is attached and appears in the node-local ResourceSlice. Two methods have been proposed to achieve this (a DeviceTaintRule sketch for the first method follows the list).
- Create a DeviceTaintRule before the device is attached
  - Pros: This can be achieved with the current implementation without any additional enhancements.
  - Cons: This has a potential race in which the scheduler may schedule the Pod before the DeviceTaintRule has been applied to the device. In addition, there is no way to verify whether the DeviceTaintRule has been recognized by the scheduler.
- Enable external augmentation of ResourceSlice attributes
  This approach is based on the following comment from @johnbelamaric:
    Ensure that the first time the new device is published in the ResourceSlice, it is tainted with a ResourceClaim-specific taint. This prevents it from being "stolen". This may require coordination with the vendor driver (in the case of NVIDIA GPUs for example). This could be solved by allowing external agents to "augment" the driver attribute (and now taint) data. We have discussed this in the past in the context of node architects augmenting data about GPUs that the driver could not know, but the node architect does now. The simple mechanism I would use for this is a directory that drivers look in for JSON files that select and augment device attributes and taints. The on-node component here could write a file in there as part of the attachment process (prior to attachment).
  - Pros: Since a taint is added when the vendor's DRA driver publishes a ResourceSlice, it ensures that the taint is effectively applied.
  - Cons: Coordination with vendors and enhancements for the ResourceSlice creation process are needed.
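For the first method, a DeviceTaintRule might look roughly like the sketch below. This assumes the DeviceTaintRule API from the DRADeviceTaints feature (resource.k8s.io/v1alpha3 at the time of writing); the rule name, taint key, and node-local pool/device names are hypothetical (reusing the names from the node-local sketch above), and the value is meant to carry the UID of the ResourceClaim so that only that claim tolerates it.

apiVersion: resource.k8s.io/v1alpha3
kind: DeviceTaintRule
metadata:
  name: reserve-attached-gpu           # hypothetical
spec:
  deviceSelector:
    driver: gpu.nvidia.com
    pool: node-a-gpus                  # node-local pool the device will appear in (hypothetical)
    device: gpu-0                      # node-local device name known ahead of time (hypothetical)
  taint:
    key: fabric.example.com/reserved-for-claim   # hypothetical taint key
    value: "<ResourceClaim UID>"
    effect: NoSchedule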
Second Step:
Update the ResourceClaim to tolerate the taint and perform re-allocation. @johnbelamaric has proposed the following three options to achieve this.
Option 1 - Controller Re-allocates
The controller managing binding conditions already has write access to the resource claim status. Therefore, couldn't it just re-write the allocation to the new device? I suppose this will lead to cache inconsistency inside the scheduler, so we would have to solve that somehow.
Option 2 - Scheduler Re-allocates
Once binding conditions are met, we could re-run the allocation routine. To make sure the right device is picked, the controller creating the new device (via attachment or via some other provisioning method) could add a special DeviceTaint with the UUID of the ResourceClaim. This would prevent the device from getting stolen, and the scheduler plugin would know about this special taint and add a toleration. We could even add a special attribute or have a deterministic way to calculate the device name, so that the real allocation routine doesn't need to be run, it just needs to update the allocation reference and its internal cache.
Option 3 - Template Device
We could add a new field on the Device that identifies it as a "template device". This would tell the scheduler - go ahead and pick this device, we will provision a device identical to this one. This is pretty similar to option 2, and the processing would be similar. But this flag lets the scheduler know that it needs to do this special processing, rather than always doing it.
Unlike the approach of marking a failure, this method re-executes the allocation routine when the binding conditions are met, aiming to realize the attachment scenario in the typical way expected by BindingConditions. In other words, this is referred to as the ‘happy path.’
Options 2 and 3 assume adding an additional CEL selection criterion to the DeviceRequest in the ResourceClaim, as mentioned in @johnbelamaric's comment below.
Update the ResourceClaim to a) tolerate the specific taint; b) can select only the allocated device. This may require adding an additional CEL selection criteria to the DeviceRequest, that selects on a specific UID of the device. Another options for this second aspect would be to have the attachment controller know ahead of time the name of the driver/resource pool/device name - then it essentially just does the allocation and we move on. There could be a race at kubelet admission time, I am not sure.
However, as noted in @pohly's comment, it has been confirmed that modifying the ResourceClaim spec is not allowed. Therefore, Option 1 is likely to be the recommended approach.
Note that a ResourceClaim.Spec is immutable. You cannot modify it once the ResourceClaim has been created.
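If Option 1 is adopted, the external controller would re-write the device result in ResourceClaim.status.allocation once the fabric device has been attached. A rough sketch of the resulting status is shown below; only the relevant part of the allocation is shown, the request name and the node-local pool/device/node names are hypothetical (reusing the node-local sketch above), and the open question about scheduler cache consistency still applies.

status:
  allocation:
    devices:
      results:
      - request: gpu                   # hypothetical request name from the ResourceClaim
        driver: gpu.nvidia.com
        pool: node-a-gpus              # re-written from the fabric pool nvidia-a100-80
        device: gpu-0                  # re-written from the fabric device nvidia-a100-80-gpu0
    nodeSelector:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - node-a                     # the node the device was attached to (hypothetical)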
Since the details of the process for re-executing the allocation routine when the binding conditions are met are unclear, @johnbelamaric, could you please share any detailed ideas you might have about how this would work?
In this issue, we would like to discuss which of the two methods should be used to ensure that taints are applied to attached devices, which of Options 1–3 should be adopted to achieve the happy path, and the details of how to implement it.