Description
A follow-up from #263 (comment), #598 (comment), and #375 (comment).
Broken out of #633.
We should take control of the retry/backoff methodology and its parameters for NodePrepareResources() and NodeUnprepareResources() requests in our kubelet plugins.
Currently we (rather ignorantly :-)) use the default from client-go/util/workqueue/default_rate_limiters.go#L50:

```go
func DefaultTypedControllerRateLimiter[T comparable]() TypedRateLimiter[T] {
	return NewTypedMaxOfRateLimiter(
		NewTypedItemExponentialFailureRateLimiter[T](5*time.Millisecond, 1000*time.Second),
		// 10 qps, 100 bucket size. This is only for retry speed and its only the overall factor (not per item)
		&TypedBucketRateLimiter[T]{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}
```
Concerns:
- the lower retry period bound: 5 ms -- think: too fast by at least an order of magnitude
- the upper retry period bound: 1 s -- think: also too fast -- by an order of magnitude
- the global "bucket" rate limiter allows for 100 Hz bursts -- way too fast for our use case. It isn't kicking in for us, however, because we do not share the same queue across requests -- an indicator that this is just not a fitting retrying methodology.
My point is that we should at least start defining our own retry methodology and parameters ASAP, and then tune the setup from there.
DRA resource prep/unprep work has fundamentally different retrying needs than whatever use case this DefaultTypedControllerRateLimiter was designed for.
Quoting myself from #633:
> Re-parametrization of the work queue rate limiter for prepare/unprepare retries to be slightly slower -- we do not need to retry ~10 times per second initially when expected time-to-completion is O(1 s) or slower anyway. I made this change with log verbosity being a motivation, but I believe architecturally this change also makes sense and might even be important.
Example, showing current retrying behavior over time (the first column is a timestamp in seconds; each log line reflects one retry under the current parameterization):

```
19.74 workqueue.go:138] Failed to reconcile [...] channel 0 already allocated by claim de7cc [...]
19.75 workqueue.go:138] Failed to reconcile [...] channel 0 already allocated by claim de7cc [...]
19.77 workqueue.go:138] Failed to reconcile [...] channel 0 already allocated by claim de7cc [...]
19.82 workqueue.go:138] Failed to reconcile [...] channel 0 already allocated by claim de7cc [...]
19.90 workqueue.go:138] Failed to reconcile [...] channel 0 already allocated by claim de7cc [...]
20.06 workqueue.go:138] Failed to reconcile [...] channel 0 already allocated by claim de7cc [...]
```
Note that the first three retries fall within the same 1/10th of a second.
The average rate over the first second of retries is around 10 Hz.