Commit 73954fe
CD daemon: workqueue RateLimiter with O(10s) upper bound
Change the back-off upper bound from 1000 s (about
17 minutes) to something on the order of 10 s.
To form a larger ComputeDomain (N nodes), many
writers need to update the same API server
object.
The overall work that has to be done by the API
server scales linearly with N: a certain number of
updates (at least N, in the best case) is
required.
Hence, in the best case (perfect serialization, no
conflicts) the time it takes overall to get all
updates in (the overall ideal convergence time C)
scales linearly with N.
In the worst case, when individual updates always
conflict with each other (are performed 'at the
same time' against the same reference state),
convergence is never achieved.
Without centralized coordination, backing off
individual retriers is a way to spread out updates
over time. The distribution of those back-offs
governs how the actual convergence time compares
to the ideal case C.
The ideal case C is governed by the rate R at
which the central entity can process updates.
If we naively back off exponentially without a
sane upper bound then we don't homogeneously spread
the update load over time, but inject fewer and
fewer updates into the system per unit time as
time progresses. The attempted update rate then
falls far below R (the possible update rate). That
makes convergence unnecessarily slow.
If we do not back off enough, an opposite effect
may occur because the global rate of retries
accumulating at the central point (API server) may
always exceed R, and hence thrash resources and
slow things down compared to the theoretical
update rate maximum (in case of perfectly
serialized updates).
Hence, there is a sweet spot between both extremes.
The position of that sweet spot strongly
depends on R.
Summary:
1) We do not want to back off individual retriers
too far, otherwise we operate at an update rate
lower than necessary and artificially slow down
the convergence process.
2) We need to back off individual retriers enough
to prevent thrashing from slowing us and others
down. This is critical for making sure the
convergence time scales linearly with N
(instead of, say, O(N**2)).
This patch primarily takes care of (1).
For (2), in the future, we may want to further
increase that upper bound after a certain amount
of time (if e.g. a 5 second cap does not result in
overall convergence after e.g. 30 minutes, it may
be worth backing off further, to remove stress
from the API server).
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
1 parent 11f22c8 commit 73954fe
2 files changed (+5, -1):
- cmd/compute-domain-daemon
- pkg/workqueue
[Diff content not preserved. One hunk replaces line 71 (context lines 68-74); the other adds four lines at new lines 61-64 (context lines 58-60 and 65-67).]