Commit 73954fe

CD daemon: workqueue RateLimiter with O(10s) upper bound
Change the upper bound from 1000 s (~17 minutes) to something much smaller.

For formation of a larger ComputeDomain (N nodes), many writers want to update the same API server object. The overall work the API server has to do scales linearly with N: a certain number of updates (at least N, in the best case) is required. Hence, in the best case (perfect serialization, no conflicts), the overall ideal convergence time C scales linearly with N. In the worst case, when individual updates always conflict with each other (are performed 'at the same time' against the same reference state), convergence is never achieved.

Without centralized coordination, backing off individual retriers is a way to spread updates out over time. The distribution of those back-offs governs how the actual convergence time compares to the ideal case C. The ideal case C is in turn governed by the rate R at which the central entity can process updates.

If we naively back off exponentially without a sane upper bound, we do not spread the update load homogeneously over time; instead, we inject fewer and fewer updates into the system as time progresses. The attempted update rate then falls far below R (the possible update rate). That makes convergence unnecessarily slow.

If we do not back off enough, the opposite effect may occur: the global rate of retries accumulating at the central point (the API server) may always exceed R, thrashing resources and slowing things down compared to the theoretical update rate maximum (for perfectly serialized updates).

Hence, there is a sweet spot between both extrema, and its position strongly depends on R. Summary:

1) We do not want to back off individual retriers too far; otherwise we operate at an update rate lower than necessary and artificially slow down the convergence process.

2) We need to back off individual retriers enough to prevent thrashing from slowing us and others down. This is critical for making sure the convergence time scales linearly with N (instead of, say, O(N**2)).

This patch primarily takes care of (1). For (2), we may in the future want to further increase that upper bound after a certain amount of time (if e.g. a 5-second cap does not result in overall convergence after e.g. 30 minutes, it may be worth backing off further, to remove stress from the API server).

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
1 parent 11f22c8 commit 73954fe

File tree

2 files changed: +5 −1 lines

cmd/compute-domain-daemon/controller.go

Lines changed: 1 addition & 1 deletion

```diff
@@ -68,7 +68,7 @@ func NewController(config *ControllerConfig) (*Controller, error) {
 		return nil, fmt.Errorf("failed to create client sets: %v", err)
 	}
 
-	workQueue := workqueue.New(workqueue.DefaultControllerRateLimiter())
+	workQueue := workqueue.New(workqueue.DefaultCDDaemonRateLimiter())
 
 	mc := &ManagerConfig{
 		workQueue: workQueue,
```

pkg/workqueue/workqueue.go

Lines changed: 4 additions & 0 deletions

```diff
@@ -58,6 +58,10 @@ func DefaultPrepUnprepRateLimiter() workqueue.TypedRateLimiter[any] {
 	)
 }
 
+func DefaultCDDaemonRateLimiter() workqueue.TypedRateLimiter[any] {
+	return workqueue.NewTypedItemExponentialFailureRateLimiter[any](5*time.Millisecond, 6000*time.Millisecond)
+}
+
 func DefaultControllerRateLimiter() workqueue.TypedRateLimiter[any] {
 	return workqueue.DefaultTypedControllerRateLimiter[any]()
 }
```
