Use custom rate limiter for kubelet plugin (un)prepare retries #655


Description

@jgehrcke

A follow-up from #263 (comment), #598 (comment), and #375 (comment).

Broken out of #633.

We should take control of the retrying and back-off methodology and accompanying parameters used for retrying NodePrepareResources() and NodeUnprepareResources() requests in our kubelet plugins.

Currently we (rather ignorantly :-)) use

func DefaultTypedControllerRateLimiter[T comparable]() TypedRateLimiter[T] {
	return NewTypedMaxOfRateLimiter(
		NewTypedItemExponentialFailureRateLimiter[T](5*time.Millisecond, 1000*time.Second),
		// 10 qps, 100 bucket size.  This is only for retry speed and its only the overall factor (not per item)
		&TypedBucketRateLimiter[T]{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}

from client-go/util/workqueue/default_rate_limiters.go#L50
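
For context, this is roughly how such a limiter ends up driving retries: failed work items are re-enqueued via AddRateLimited(), and the limiter decides the delay before the next attempt. A minimal sketch, assuming a hypothetical claimUID item type (the real plugin's queue item type and queue construction may differ):

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// claimUID is a hypothetical item type standing in for whatever the
// plugin actually enqueues for prepare/unprepare work.
type claimUID string

func main() {
	// A rate-limited work queue built on the default limiter quoted above.
	q := workqueue.NewTypedRateLimitingQueue[claimUID](
		workqueue.DefaultTypedControllerRateLimiter[claimUID](),
	)
	defer q.ShutDown()

	// After a failed NodePrepareResources() attempt, the item is put back
	// with AddRateLimited(); the limiter picks the back-off delay.
	q.AddRateLimited(claimUID("de7cc"))
	fmt.Println("requeues so far:", q.NumRequeues(claimUID("de7cc")))
}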

Concerns:

  • the lower retry period bound: 5 ms -- think: too fast by at least an order of magnitude
  • the upper retry period bound: 1 s -- think: also too fast -- by an order of magnitude
  • the global "bucket" rate limiter allows for 100 Hz bursts -- way too fast for our use case -- it however isn't kicking in for us, because we do not share the same queue across requests -- an indicator that this is just not a fitting retrying methodology for our needs (see the sketch right after this list for the resulting per-item delay progression)
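
To make the delay progression concrete: with the default per-item exponential failure limiter, the delay starts at 5 ms and doubles per failure. A quick standalone sketch (the item name is made up) prints the first few delays:

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The per-item exponential back-off from the default limiter quoted
	// above: base delay 5 ms, doubling per failure, capped at 1000 s.
	rl := workqueue.NewTypedItemExponentialFailureRateLimiter[string](
		5*time.Millisecond, 1000*time.Second,
	)

	// Each When() call counts as one more failure for the same item and
	// returns the delay before the next retry.
	for i := 1; i <= 8; i++ {
		fmt.Printf("retry %d after %v\n", i, rl.When("claim-de7cc"))
	}
	// Prints roughly: 5ms 10ms 20ms 40ms 80ms 160ms 320ms 640ms -- the
	// first eight retries all fit into the first ~1.3 s, roughly consistent
	// with the ~10 Hz initial retry rate observed in the log excerpt below.
}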

My point is that we should at least start creating our own retry methodology and parameters ASAP, and then tune the setup from there.

DRA resource prep/unprep work simply has very different fundamental retrying needs compared to whatever use case DefaultTypedControllerRateLimiter was designed for.
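
As a starting point, a custom limiter could look like the following sketch. The function name and the concrete numbers (1 s base delay, 30 s cap) are placeholders I am making up here for illustration, not values this issue settles on; the idea is a per-item exponential back-off that starts near the expected time-to-completion and drops the shared bucket limiter:

package main

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newPrepareUnprepareRateLimiter is a hypothetical replacement for
// DefaultTypedControllerRateLimiter; the parameters are placeholders to
// be tuned.
func newPrepareUnprepareRateLimiter[T comparable]() workqueue.TypedRateLimiter[T] {
	// Per-item exponential back-off only: start around the expected
	// time-to-completion instead of 5 ms, and omit the shared
	// 10 qps / 100 burst bucket limiter, which does not fit a queue that
	// is not shared across requests.
	return workqueue.NewTypedItemExponentialFailureRateLimiter[T](
		1*time.Second, 30*time.Second,
	)
}

func main() {
	// Wire it in wherever DefaultTypedControllerRateLimiter is used today.
	q := workqueue.NewTypedRateLimitingQueue[string](
		newPrepareUnprepareRateLimiter[string](),
	)
	defer q.ShutDown()
}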

Quoting myself from #633:

Re-parametrization of the work queue rate limiter for prepare/unprepare retries to be slightly slower -- we do not need to retry ~10 times per second initially when expected time-to-completion is O(1 s) or slower anyway. I made this change with log verbosity being a motivation, but I believe architecturally this change also makes sense and might even be important.

Example, showing current retrying behavior over time:

19.74  workqueue.go:138] Failed to reconcile [...] channel 0 already allocated by claim de7cc [...]
19.75  workqueue.go:138] Failed to reconcile [...] channel 0 already allocated by claim de7cc [...]
19.77  workqueue.go:138] Failed to reconcile [...] channel 0 already allocated by claim de7cc [...]
19.82  workqueue.go:138] Failed to reconcile [...] channel 0 already allocated by claim de7cc [...]
19.90  workqueue.go:138] Failed to reconcile [...] channel 0 already allocated by claim de7cc [...]
20.06  workqueue.go:138] Failed to reconcile [...] channel 0 already allocated by claim de7cc [...]

(The first column is a timestamp in seconds; each log line reflects a retry under the current parameterization.)

Note that the first three retries all fall within the same tenth of a second.

The average rate over the first second of retries is around 10 Hz.

Metadata

Labels: usability (issue/pr related to UX)
Status: Closed