Commit 73954fe

CD daemon: workqueue RateLimiter with O(10s) upper bound
Change the upper bound from 1000 s (~17 minutes) to something much smaller.

For formation of a larger ComputeDomain (N nodes), many writers want to update the same API server object. The overall work the API server has to do scales linearly with N: a certain number of updates (at least N, in the best case) is required. Hence, in the best case (perfect serialization, no conflicts), the overall ideal convergence time C scales linearly with N. In the worst case, when individual updates always conflict with each other (are performed 'at the same time' against the same reference state), convergence is never achieved.

Without centralized coordination, backing off individual retriers is a way to spread updates out over time. The distribution of those back-offs governs how the actual convergence time compares to the ideal case C. The ideal case C is in turn governed by the rate R at which the central entity can process updates.

If we naively back off exponentially without a sane upper bound, we do not spread the update load homogeneously over time; instead, we inject fewer and fewer updates into the system as time progresses. The attempted update rate then falls far below R (the possible update rate). That makes convergence unnecessarily slow.

If we do not back off enough, the opposite effect may occur: the global rate of retries accumulating at the central point (the API server) may always exceed R, thrashing resources and slowing things down compared to the theoretical update rate maximum (for perfectly serialized updates).

Hence, there is a sweet spot between both extrema, and its position strongly depends on R. Summary:

1) We do not want to back off individual retriers too far; otherwise we operate at an update rate lower than necessary and artificially slow down the convergence process.

2) We need to back off individual retriers enough to prevent thrashing from slowing us and others down. This is critical for making sure the convergence time scales linearly with N (instead of, say, O(N**2)).

This patch primarily takes care of (1). For (2), we may in the future want to further increase that upper bound after a certain amount of time (if e.g. a 5-second cap does not result in overall convergence after e.g. 30 minutes, it may be worth backing off further, to remove stress from the API server).

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
1 parent 11f22c8 commit 73954fe

File tree

2 files changed: +5 −1 lines

cmd/compute-domain-daemon/controller.go

Lines changed: 1 addition & 1 deletion

```diff
@@ -68,7 +68,7 @@ func NewController(config *ControllerConfig) (*Controller, error) {
 		return nil, fmt.Errorf("failed to create client sets: %v", err)
 	}
 
-	workQueue := workqueue.New(workqueue.DefaultControllerRateLimiter())
+	workQueue := workqueue.New(workqueue.DefaultCDDaemonRateLimiter())
 
 	mc := &ManagerConfig{
 		workQueue: workQueue,
```

pkg/workqueue/workqueue.go

Lines changed: 4 additions & 0 deletions

```diff
@@ -58,6 +58,10 @@ func DefaultPrepUnprepRateLimiter() workqueue.TypedRateLimiter[any] {
 	)
 }
 
+func DefaultCDDaemonRateLimiter() workqueue.TypedRateLimiter[any] {
+	return workqueue.NewTypedItemExponentialFailureRateLimiter[any](5*time.Millisecond, 6000*time.Millisecond)
+}
+
 func DefaultControllerRateLimiter() workqueue.TypedRateLimiter[any] {
 	return workqueue.DefaultTypedControllerRateLimiter[any]()
 }
```
