ComputeDomains: adjust task reconciliation behavior for large CD formation #732
base: main
Conversation
Force-pushed from 9524592 to 7815f9a
```go
// whitespace to the left.
for _, ip := range slices.Sorted(maps.Keys(m.ipToDNSName)) {
    dnsname := m.ipToDNSName[ip]
    klog.Infof("%26s -> %s", dnsname, ip)
```
The current patch sorts the keys, and hence the IP addresses :)
```go
// perform a stable sort of IP addresses before writing them to the nodes
// config file.
if !maps.Equal(newIPs, previousIPs) {
    klog.Infof("IP set changed: previous: %v; new: %v", previousIPs, newIPs)
```
```diff
  if err := pm.updateNodeStatus(ctx, status); err != nil {
-     return fmt.Errorf("failed to update node status: %w", err)
+     return fmt.Errorf("pod update: failed to update note status in CD (%s): %w", status, err)
```
The wrapper (workqueue) does not enrich the error message with meaningful context information, so I added the `pod update:` prefix here -- it makes it easier to understand what a log message means. Example:
I1119 22:10:21.531887 1 workqueue.go:197] Reconcile: pod update: failed to update note status in CD (Ready): simulated error 5 (attempt 5)
```go
// UpdateComputeDomainNodeInfo updates the Nodes field in the ComputeDomain with
// info about the ComputeDomain daemon running on this node. Upon success, it
// reflects the mutation in `m.mutationCache`.
func (m *ComputeDomainManager) UpdateComputeDomainNodeInfo(ctx context.Context, cd *nvapi.ComputeDomain) (rerr error) {
```
I felt like renaming this from UpdateComputeDomainNodeInfo to EnsureNodeInfoInCD after I repeatedly found myself slightly confused about the high-level responsibility of this method.
Any incoming pod update should terminate the retry loop initiated for a previously incoming pod update; the same holds for any incoming CD update. Any pod update refers to the same pod object, and any CD update refers to the same CD object. Make that explicit by using hard-coded keys. Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
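A rough sketch of the pattern described in the commit above, using client-go's typed workqueue directly (the key names are hypothetical, and the project uses its own workqueue wrapper, so this only illustrates the general idea): enqueuing every pod update under one fixed key means the queue never holds more than one pending pod-update item, so a new update coalesces with -- and effectively supersedes -- the retry loop of an older one.

```go
// Rough illustration (hypothetical key names, plain client-go workqueue
// instead of the project's own wrapper): all pod updates share one key and
// all CD updates share another, so at most one item per kind is pending and
// a new update coalesces with the retry of an older one.
package main

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

const (
	podUpdateKey = "pod-update" // one key for all pod updates
	cdUpdateKey  = "cd-update"  // one key for all CD updates
)

func main() {
	q := workqueue.NewTypedRateLimitingQueue(
		workqueue.NewTypedItemExponentialFailureRateLimiter[string](
			250*time.Millisecond, 3*time.Second),
	)
	defer q.ShutDown()

	// Two pod updates arriving back-to-back are deduplicated into one item.
	q.Add(podUpdateKey)
	q.Add(podUpdateKey)

	// A CD update is tracked independently under its own fixed key.
	q.Add(cdUpdateKey)

	for q.Len() > 0 {
		item, _ := q.Get()
		// ... reconcile against the latest pod/CD state here ...
		q.Forget(item) // success: reset the per-key backoff
		q.Done(item)
	}
}
```

Because reconciliation always reads the current pod/CD state rather than payload carried by the queue item, processing the single coalesced key is equivalent to processing only the latest update.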
For large CDs this makes it faster to identify changes from the log output. Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
This is less error-prone: if we treat `node == nil` generally as success, we may miss persisting a pod state transition in edge cases and for edge-case state transitions after the initial NotReady -> Ready transition. Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
During reconciliation of pod and CD updates, a number of log and error messages flow through the system; this change makes it easier to understand which messages belong together and what is actually happening. Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
This reduces log volume at the default log level for large CDs. Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
This was meant to be three seconds, not 3000 seconds. This is a node-local retry and we can easily afford not backing off towards O(1 min) or further. Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Change the upper bound from 1000 s (17 minutes) to something much lower.

For formation of a larger ComputeDomain (N nodes), many writers want to update the same API server object. The overall work that has to be done by the API server scales linearly with N: a certain number of updates (at least N, in the best case) is required. Hence, in the best case (perfect serialization, no conflicts), the overall time it takes to get all updates in -- the ideal convergence time C -- scales linearly with N. In the worst case, when individual updates always conflict with each other (are performed 'at the same time' against the same reference state), convergence is never achieved.

Without centralized coordination, backing off individual retriers is a way to spread updates out over time. The distribution of those back-offs governs how long the actual convergence time is compared to the ideal case C. The ideal case C is governed by the rate R at which the central entity can process updates.

If we naively back off exponentially without a sane upper bound, we do not spread the update load homogeneously over time; instead we attempt fewer and fewer updates per unit of time as time progresses. The attempted update rate then falls far below R (the possible update rate). That makes convergence unnecessarily slow.

If we do not back off enough, the opposite effect may occur: the global rate of retries accumulating at the central point (API server) may always exceed R, thrashing resources and slowing things down compared to the theoretical update rate maximum (in the case of perfectly serialized updates). Hence, there is a sweet spot between both extrema, and its position strongly depends on R.

Summary:

1) We do not want to back off individual retriers too far, otherwise we operate at an update rate lower than necessary and artificially slow down the convergence process.

2) We need to back off individual retriers enough to prevent thrashing from slowing us and others down. This is critical for making sure the convergence time scales linearly with N (instead of, say, O(N**2)).

This patch primarily takes care of (1). For (2), in the future, we may want to further increase that upper bound after a certain amount of time (if e.g. a 5 second cap does not result in overall convergence after e.g. 30 minutes, it may be worth backing off further, to remove stress from the API server). Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
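For illustration only (the base delay, caps, and retry count below are hypothetical, merely in the ballpark of the values discussed here), this snippet compares how long a single retrier waits in total before its 20th attempt under a very high cap versus a low one:

```go
// Hypothetical illustration: cumulative wait of one retrier before its 20th
// attempt, for capped exponential backoff starting at a 5 ms base delay.
package main

import (
	"fmt"
	"time"
)

func cumulativeWait(base, maxDelay time.Duration, retries int) time.Duration {
	var total time.Duration
	d := base
	for i := 0; i < retries; i++ {
		total += d
		d *= 2
		if d > maxDelay {
			d = maxDelay
		}
	}
	return total
}

func main() {
	base := 5 * time.Millisecond
	for _, maxDelay := range []time.Duration{1000 * time.Second, 6 * time.Second} {
		fmt.Printf("cap=%v: total wait before attempt 20: %v\n",
			maxDelay, cumulativeWait(base, maxDelay, 20))
	}
}
```

With the high cap, a retrier that keeps conflicting spends most of its time sleeping and injects updates at a rate far below what the API server could handle; with the low cap, its attempts keep arriving every few seconds.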
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
The client-go rate limiters such as `ExponentialFailureRateLimiter` do not implement jitter. In a user's environment, formation of a CD across 144 nodes has shown that the absence of jitter results in significant retry attempt correlation across nodes -- even after ~10 retries -- resulting in otherwise preventable conflicts (and hence increased convergence time). That effect can be diminished by adding jitter, which should allow for faster convergence. The JitterRL implementation provided by this patch is a simple, custom one that I validated with simulated errors and careful placement of log messages. Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
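A minimal sketch of what such a jitter wrapper could look like (an assumed shape matching the `NewJitterRateLimiter(..., 0.5)` call site further down, not the PR's actual implementation): it wraps a client-go `TypedRateLimiter` and scales every delay by a random factor.

```go
// Assumed sketch of a jitter-adding rate limiter: each delay returned by the
// inner limiter is multiplied by a factor drawn uniformly from
// [1-jitter, 1+jitter], de-correlating retry timing across nodes.
package ratelimiter // hypothetical package name

import (
	"math/rand"
	"time"

	"k8s.io/client-go/util/workqueue"
)

type jitterRateLimiter struct {
	inner  workqueue.TypedRateLimiter[any]
	jitter float64 // e.g. 0.5: delays vary between 50% and 150% of the inner delay
}

// NewJitterRateLimiter wraps an existing rate limiter with multiplicative jitter.
func NewJitterRateLimiter(inner workqueue.TypedRateLimiter[any], jitter float64) workqueue.TypedRateLimiter[any] {
	return &jitterRateLimiter{inner: inner, jitter: jitter}
}

func (j *jitterRateLimiter) When(item any) time.Duration {
	d := j.inner.When(item)
	factor := 1 + j.jitter*(2*rand.Float64()-1) // uniform in [1-jitter, 1+jitter]
	return time.Duration(float64(d) * factor)
}

func (j *jitterRateLimiter) Forget(item any)          { j.inner.Forget(item) }
func (j *jitterRateLimiter) NumRequeues(item any) int { return j.inner.NumRequeues(item) }
```

With a jitter factor of 0.5, nodes that have failed the same number of times retry somewhere within 0.5x to 1.5x of the nominal exponential delay instead of in lockstep, which reduces conflict bursts at the API server.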
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Force-pushed from 7815f9a to f8fceea
```diff
  // fails and is retried, the delay grows exponentially starting from the
  // lower value up to the upper bound.
- workqueue.NewTypedItemExponentialFailureRateLimiter[any](250*time.Millisecond, 3000*time.Second),
+ workqueue.NewTypedItemExponentialFailureRateLimiter[any](250*time.Millisecond, 3000*time.Millisecond),
```
```go
}

func DefaultCDDaemonRateLimiter() workqueue.TypedRateLimiter[any] {
    return NewJitterRateLimiter(workqueue.NewTypedItemExponentialFailureRateLimiter[any](5*time.Millisecond, 6000*time.Millisecond), 0.5)
```
I thought quite a bit about these numbers, but of course these are just an attempt to pick something meaningful -- we will see over time if and how we want to change method and parameters.
/ok to test f8fceea


+misc changes.
See commit messages.
Tests: