Use custom rate limiter for kubelet plugin (un)prepare retries #655


Description

@jgehrcke

A follow-up from #263 (comment), #598 (comment), and #375 (comment).

Broken out of #633.

We should take control of the retrying and back-off methodology and accompanying parameters used for retrying NodePrepareResources() and NodeUnprepareResources() requests in our kubelet plugins.

Currently we (rather ignorantly :-)) use

func DefaultTypedControllerRateLimiter[T comparable]() TypedRateLimiter[T] {
	return NewTypedMaxOfRateLimiter(
		NewTypedItemExponentialFailureRateLimiter[T](5*time.Millisecond, 1000*time.Second),
		// 10 qps, 100 bucket size.  This is only for retry speed and its only the overall factor (not per item)
		&TypedBucketRateLimiter[T]{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}

from client-go/util/workqueue/default_rate_limiters.go#L50
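
For context, this is roughly how such a limiter ends up driving retries: failed work items are re-enqueued via AddRateLimited(), and the limiter decides the delay before the next attempt. A minimal sketch, assuming a hypothetical claimUID item type (the real plugin's queue item type and queue construction may differ):

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// claimUID is a hypothetical item type standing in for whatever the
// plugin actually enqueues for prepare/unprepare work.
type claimUID string

func main() {
	// A rate-limited work queue built on the default limiter quoted above.
	q := workqueue.NewTypedRateLimitingQueue[claimUID](
		workqueue.DefaultTypedControllerRateLimiter[claimUID](),
	)
	defer q.ShutDown()

	// After a failed NodePrepareResources() attempt, the item is put back
	// with AddRateLimited(); the limiter picks the back-off delay.
	q.AddRateLimited(claimUID("de7cc"))
	fmt.Println("requeues so far:", q.NumRequeues(claimUID("de7cc")))
}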

Concerns:

  • the lower retry period bound: 5 ms -- think: too fast by at least an order of magnitude
  • the upper retry period bound: 1 s -- think: also too fast -- by an order of magnitude
  • the global "bucket" rate limiter allows for 100 Hz bursts -- way too fast for our use case -- it however isn't kicking in for us, because we do not share the same queue across requests -- an indicator that this is just not a fitting retrying methodology for our needs (see the sketch right after this list for the resulting per-item delay progression)
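
To make the delay progression concrete: with the default per-item exponential failure limiter, the delay starts at 5 ms and doubles per failure. A quick standalone sketch (the item name is made up) prints the first few delays:

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The per-item exponential back-off from the default limiter quoted
	// above: base delay 5 ms, doubling per failure, capped at 1000 s.
	rl := workqueue.NewTypedItemExponentialFailureRateLimiter[string](
		5*time.Millisecond, 1000*time.Second,
	)

	// Each When() call counts as one more failure for the same item and
	// returns the delay before the next retry.
	for i := 1; i <= 8; i++ {
		fmt.Printf("retry %d after %v\n", i, rl.When("claim-de7cc"))
	}
	// Prints roughly: 5ms 10ms 20ms 40ms 80ms 160ms 320ms 640ms -- the
	// first eight retries all fit into the first ~1.3 s, roughly consistent
	// with the ~10 Hz initial retry rate observed in the log excerpt below.
}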

My point is that we should at least start creating our own retry methodology and parameters ASAP, and then tune the setup from there.

DRA resource prep/unprep work simply has very different fundamental retrying needs compared to whatever use case DefaultTypedControllerRateLimiter was designed for.
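
As a starting point, a custom limiter could look like the following sketch. The function name and the concrete numbers (1 s base delay, 30 s cap) are placeholders I am making up here for illustration, not values this issue settles on; the idea is a per-item exponential back-off that starts near the expected time-to-completion and drops the shared bucket limiter:

package main

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newPrepareUnprepareRateLimiter is a hypothetical replacement for
// DefaultTypedControllerRateLimiter; the parameters are placeholders to
// be tuned.
func newPrepareUnprepareRateLimiter[T comparable]() workqueue.TypedRateLimiter[T] {
	// Per-item exponential back-off only: start around the expected
	// time-to-completion instead of 5 ms, and omit the shared
	// 10 qps / 100 burst bucket limiter, which does not fit a queue that
	// is not shared across requests.
	return workqueue.NewTypedItemExponentialFailureRateLimiter[T](
		1*time.Second, 30*time.Second,
	)
}

func main() {
	// Wire it in wherever DefaultTypedControllerRateLimiter is used today.
	q := workqueue.NewTypedRateLimitingQueue[string](
		newPrepareUnprepareRateLimiter[string](),
	)
	defer q.ShutDown()
}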

Quoting myself from #633:

Re-parametrization of the work queue rate limiter for prepare/unprepare retries to be slightly slower -- we do not need to retry ~10 times per second initially when expected time-to-completion is O(1 s) or slower anyway. I made this change with log verbosity being a motivation, but I believe architecturally this change also makes sense and might even be important.

Example, showing current retrying behavior over time:

19.74  workqueue.go:138] Failed to reconcile [...] channel 0 already allocated by claim de7cc [...]
19.75  workqueue.go:138] Failed to reconcile [...] channel 0 already allocated by claim de7cc [...]
19.77  workqueue.go:138] Failed to reconcile [...] channel 0 already allocated by claim de7cc [...]
19.82  workqueue.go:138] Failed to reconcile [...] channel 0 already allocated by claim de7cc [...]
19.90  workqueue.go:138] Failed to reconcile [...] channel 0 already allocated by claim de7cc [...]
20.06  workqueue.go:138] Failed to reconcile [...] channel 0 already allocated by claim de7cc [...]

(The first column is a timestamp in seconds; each log line reflects a retry under the current parameterization.)

Note that the first three retries all fall within the same tenth of a second.

The average rate over the first second of retries is around 10 Hz.

Metadata

Labels: usability (issue/pr related to UX)
Status: Closed