
Conversation


@jgehrcke jgehrcke commented Nov 20, 2025

+misc changes.

See commit messages.

Tests:

test_basics.bats
 ✓ test VERSION_W_COMMIT, VERSION_GHCR_CHART, VERSION [194]
 ✓ confirm no kubelet plugin pods running [169]
 ✓ helm-install oci://ghcr.io/nvidia/k8s-dra-driver-gpu/25.12.0-dev-f8fceeae-chart [9525]
 ✓ helm list: validate output [230]
 ✓ get crd computedomains.resource.nvidia.com [164]
 ✓ wait for plugin & controller pods READY [749]
 ✓ validate CD controller container image spec [169]
test_gpu_basic.bats
 ✓ 1 pod(s), 1 full GPU [5089]
 ✓ 2 pod(s), 1 full GPU each [5065]
 ✓ 2 pod(s), 1 full GPU (shared, 1 RC) [5098]
 ✓ 1 pod(s), 2 cntrs, 1 full GPU (shared, 1 RCT) [4382]
test_cd_imex_chan_inject.bats
 ✓ IMEX channel injection (single) [14871]
 ✓ IMEX channel injection (all) [12443]
test_cd_mnnvl_workload.bats
 ✓ nickelpie (NCCL send/recv/broadcast, 2 pods, 2 nodes, small payload) [11296]
 ✓ nvbandwidth (2 nodes, 2 GPUs each) [16139]
test_cd_misc.bats
 ✓ CD daemon shutdown: confirm CD status cleanup [9212]
 ✓ reject unknown field in opaque cfg in CD chan ResourceClaim [10262]
 ✓ self-initiated unprepare of stale RCs in PrepareStarted [25815]
test_cd_logging.bats
 ✓ CD controller/plugin: startup config / detail in logs on level 0 [6462]
 ✓ CD controller: test log verbosity levels [57130]
 ✓ CD daemon: test log verbosity levels [32774]
test_cd_failover.bats
 ✓ CD failover nvb2: force-delete worker pod 0 [48916]
 ✓ CD failover nvb2: force-delete all IMEX daemons [36919]
 ✓ CD failover nvb2: regular-delete worker pod 1 [54482]
test_cd_updowngrade.bats
 ✓ downgrade: current-dev -> last-stable [25450]
 ✓ upgrade: wipe-state, install-last-stable, upgrade-to-current-dev [34085]
test_gpu_stress.bats
 ✓ Stress: shared ResourceClaim across 15 pods x 5 loops [155559]

27 tests, 0 failures in 617 seconds


copy-pr-bot bot commented Nov 20, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jgehrcke jgehrcke force-pushed the jp/large-cd-formation branch from 9524592 to 7815f9a on November 20, 2025 13:29
// whitespace to the left.
for _, ip := range slices.Sorted(maps.Keys(m.ipToDNSName)) {
	dnsname := m.ipToDNSName[ip]
	klog.Infof("%26s -> %s", dnsname, ip)
@jgehrcke (Collaborator, Author) commented:

Change motivated by seeing logs like this:

[screenshot: example log output omitted]

@jgehrcke (Collaborator, Author) commented:

The current patch sorts the keys, and hence the IP addresses :)
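For illustration, a hedged, self-contained toy version of that pattern (the map contents and the main wrapper below are made up, not driver code): slices.Sorted(maps.Keys(...)) materializes the keys in sorted (here: lexicographic) order, so the log lines come out deterministically.

package main

import (
	"fmt"
	"maps"
	"slices"
)

func main() {
	// Hypothetical stand-in for m.ipToDNSName from the hunk above.
	ipToDNSName := map[string]string{
		"10.0.0.12": "node-b.internal",
		"10.0.0.3":  "node-a.internal",
		"10.0.0.25": "node-c.internal",
	}

	// slices.Sorted consumes the iterator returned by maps.Keys (Go 1.23+)
	// and returns a sorted slice, giving a stable iteration order.
	for _, ip := range slices.Sorted(maps.Keys(ipToDNSName)) {
		fmt.Printf("%26s -> %s\n", ipToDNSName[ip], ip)
	}
}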

// perform a stable sort of IP addresses before writing them to the nodes
// config file.
if !maps.Equal(newIPs, previousIPs) {
klog.Infof("IP set changed: previous: %v; new: %v", previousIPs, newIPs)
@jgehrcke (Collaborator, Author) commented Nov 20, 2025:

The bulk of the log volume emitted by the CD daemon is dominated by this; we must not log all of this on level zero.

Example:
[screenshot: example log output omitted]
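For reference, a self-contained sketch of that gating pattern (the set type and function name are simplified stand-ins, not the driver's actual signatures): the IP set is only logged when it differs from the previously written one, so steady-state reconcile passes stay quiet on the default level.

package main

import (
	"maps"

	"k8s.io/klog/v2"
)

func writeNodesConfigIfChanged(previousIPs, newIPs map[string]bool) bool {
	if !maps.Equal(newIPs, previousIPs) {
		klog.Infof("IP set changed: previous: %v; new: %v", previousIPs, newIPs)
		return true
	}
	// Unchanged: emit nothing on the default level. A per-pass dump could
	// still be offered behind a higher verbosity, e.g. klog.V(2).Infof(...).
	return false
}

func main() {
	prev := map[string]bool{"10.0.0.3": true}
	curr := map[string]bool{"10.0.0.3": true, "10.0.0.12": true}
	_ = writeNodesConfigIfChanged(prev, curr)
}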


if err := pm.updateNodeStatus(ctx, status); err != nil {
return fmt.Errorf("failed to update node status: %w", err)
return fmt.Errorf("pod update: failed to update note status in CD (%s): %w", status, err)
@jgehrcke (Collaborator, Author) commented:

The wrapper (workqueue) does not enrich the error message with meaningful context information, so I added the `pod update:` prefix here -- it makes it easier to understand what a log message means. Example:

I1119 22:10:21.531887       1 workqueue.go:197] Reconcile: pod update: failed to update note status in CD (Ready): simulated error 5 (attempt 5)
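A small, runnable sketch of that prefixing pattern (the podManager stub and the simulated error are made up for illustration): the inner error message carries the pod update: prefix so the line the workqueue wrapper eventually logs is self-explanatory, while %w keeps the cause available to errors.Is/errors.As.

package main

import (
	"context"
	"errors"
	"fmt"
)

// podManager and updateNodeStatus are simplified stand-ins for the types in
// the hunk below.
type podManager struct{}

func (pm *podManager) updateNodeStatus(ctx context.Context, status string) error {
	return errors.New("simulated error")
}

func (pm *podManager) reconcilePodUpdate(ctx context.Context, status string) error {
	if err := pm.updateNodeStatus(ctx, status); err != nil {
		// The "pod update:" prefix supplies the context that the workqueue
		// wrapper does not add on its own.
		return fmt.Errorf("pod update: failed to update node status in CD (%s): %w", status, err)
	}
	return nil
}

func main() {
	fmt.Println((&podManager{}).reconcilePodUpdate(context.Background(), "Ready"))
	// -> pod update: failed to update node status in CD (Ready): simulated error
}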

// UpdateComputeDomainNodeInfo updates the Nodes field in the ComputeDomain with
// info about the ComputeDomain daemon running on this node. Upon success, it
// reflects the mutation in `m.mutationCache`.
func (m *ComputeDomainManager) UpdateComputeDomainNodeInfo(ctx context.Context, cd *nvapi.ComputeDomain) (rerr error) {
@jgehrcke (Collaborator, Author) commented:

I felt like renaming this from UpdateComputeDomainNodeInfo to EnsureNodeInfoInCD after I repeatedly found myself slightly confused about the high-level responsibility of this method.

Any incoming pod update should terminate the retry
loop initiated for a previously incoming pod
update. The same applies to any incoming CD
update.

Any pod update refers to the same pod object, and
any CD update refers to the same CD object. Make
that explicit by using hard-coded keys.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
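One way the hard-coded-key idea plays out with client-go's workqueue, as a hedged sketch (the key name and queue wiring are assumptions, not the driver's code): all pod updates are enqueued under one fixed key, and the queue's de-duplication ensures that a freshly arriving update coalesces with, and thereby supersedes, a retry still pending for an earlier update.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

const podUpdateKey = "pod-update" // fixed key: all pod updates collapse onto it

func main() {
	rl := workqueue.NewTypedItemExponentialFailureRateLimiter[string](250*time.Millisecond, 3*time.Second)
	q := workqueue.NewTypedRateLimitingQueue[string](rl)

	// A failed reconcile schedules a backed-off retry for the fixed key ...
	q.AddRateLimited(podUpdateKey)

	// ... and a newly arriving pod update enqueues the very same key, so at
	// most one instance of the key is queued at any time.
	q.Add(podUpdateKey)

	item, _ := q.Get()
	fmt.Println("processing:", item)
	q.Forget(item) // reset the per-key backoff after a successful reconcile
	q.Done(item)
}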
For large CDs this makes it faster to identify
changes from the log output.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
This is less error-prone: if we treat `node ==
nil` generally as success, we may miss persisting
a pod state transition in edge cases, in
particular for state transitions after the initial
NotReady -> Ready transition.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
During reconciliation of pod and CD updates, a
number of log and error messages flow through the
system; this change makes it easier to understand
which messages belong together and what is
actually happening.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
This reduces log volume on the default log level
for large CDs.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
This was meant to be three seconds, not 3000 seconds.

This is a node-local retry and we can easily
afford not backing off towards O(1 min) or
further.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Change the upper bound from 1000 s (~17 minutes)
to something much lower.

For formation of a larger ComputeDomain (N nodes),
many writers attempt to update the same API server
object.

The overall work that has to be done by the API
server scales linearly with N: a certain number of
updates (at least N, in the best case) is
required.

Hence, in the best case (perfect serialization, no
conflicts), the overall time it takes to get all
updates in (the ideal convergence time C) scales
linearly with N.

In the worst case, when individual updates always
conflict with each other (are performed 'at the
same time' against the same reference state),
convergence is never achieved.

Without centralized coordination, backing off
individual retriers is a way to spread out updates
over time. The nature of the distribution of those
back-offs governs how the actual convergence time
compares to the ideal case C.

The ideal case C is governed by the rate R at
which the central entity can process updates.

If we naively back off exponentially without a
sane upper bound, then we don't homogeneously
spread the update load over time, but inject fewer
and fewer updates into the system per unit time as
time progresses. The attempted update rate then
falls far below R (the possible update rate). That
makes convergence unnecessarily slow.

If we do not back off enough, the opposite effect
may occur: the global rate of retries arriving at
the central point (the API server) may always
exceed R, thrashing resources and slowing things
down compared to the theoretical maximum update
rate (perfectly serialized updates).

Hence, there is a sweet spot between both extrema.
The positioning of that sweet spot strongly
depends on R.

Summary:

1) We do not want to back off individual retriers
   too far, otherwise we operate at an update rate
   lower than necessary and artificially slow down
   the convergence process.

2) We need to back off individual retriers enough
   to prevent thrashing from slowing us and others
   down. This is critical for making sure the
   convergence time scales linearly with N
   (instead of, say, O(N**2)).

This patch primarily takes care of (1).

For (2), in the future, we may want to further
increase that upper bound after a certain amount
of time (if e.g. a 5 second cap does not result in
overall convergence after e.g. 30 minutes, it may
be worth backing off further, to remove stress
from the API server).

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
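To make the effect of the cap concrete, an illustrative snippet (parameters borrowed from the commit text and the hunks further down, not driver code) printing the per-retry delays of client-go's exponential failure rate limiter under a loose 1000 s cap versus a tight 5 s cap -- under the loose cap the attempted update rate keeps dropping, while the tight cap keeps retriers attempting every few seconds:

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	loose := workqueue.NewTypedItemExponentialFailureRateLimiter[string](250*time.Millisecond, 1000*time.Second)
	tight := workqueue.NewTypedItemExponentialFailureRateLimiter[string](250*time.Millisecond, 5*time.Second)

	// Each When() call counts one more failure for the item and returns the
	// next backoff delay (base * 2^failures, capped at the upper bound).
	const key = "cd-update"
	for i := 1; i <= 12; i++ {
		fmt.Printf("retry %2d: loose cap %-10v tight cap %v\n", i, loose.When(key), tight.When(key))
	}
}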
The client-go rate limiters such as
`ExponentialFailureRateLimiter` do not implement
jitter. In a user's environment, formation of a CD
across 144 nodes has shown that the absence of
jitter results in significant retry attempt
correlation across nodes -- even after ~10
retries, resulting in otherwise preventable
conflicts (and hence increased convergence time).

That effect can be diminished by adding jitter,
which should allow for fewer conflicts and hence
faster convergence.

The JitterRL provided by this patch is a simple,
custom implementation that I validated with
simulated errors and careful placement of log
messages.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
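The JitterRL implementation itself is not part of this transcript. As a point of reference, a minimal sketch of what a jitter wrapper around client-go's TypedRateLimiter could look like (the [1-f, 1+f] spread, field names, and internals are assumptions, not the PR's actual code; only the constructor name and the 0.5 factor appear in the hunks below):

package main

import (
	"fmt"
	"math/rand"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// jitterRateLimiter wraps another rate limiter and randomizes its delays so
// that retriers on different nodes decorrelate instead of firing in lockstep.
type jitterRateLimiter[T comparable] struct {
	inner  workqueue.TypedRateLimiter[T]
	factor float64 // e.g. 0.5 spreads a delay d across [0.5*d, 1.5*d]
}

func NewJitterRateLimiter[T comparable](inner workqueue.TypedRateLimiter[T], factor float64) workqueue.TypedRateLimiter[T] {
	return &jitterRateLimiter[T]{inner: inner, factor: factor}
}

func (j *jitterRateLimiter[T]) When(item T) time.Duration {
	d := j.inner.When(item)
	// Scale by a factor drawn uniformly from [1-f, 1+f].
	scale := 1 + j.factor*(2*rand.Float64()-1)
	return time.Duration(float64(d) * scale)
}

func (j *jitterRateLimiter[T]) Forget(item T) { j.inner.Forget(item) }

func (j *jitterRateLimiter[T]) NumRequeues(item T) int { return j.inner.NumRequeues(item) }

func main() {
	rl := NewJitterRateLimiter(workqueue.NewTypedItemExponentialFailureRateLimiter[string](250*time.Millisecond, 5*time.Second), 0.5)
	for i := 0; i < 5; i++ {
		fmt.Println(rl.When("cd-update"))
	}
}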
@jgehrcke jgehrcke force-pushed the jp/large-cd-formation branch from 7815f9a to f8fceea on November 20, 2025 13:44
// fails and is retried, the delay grows exponentially starting from the
// lower value up to the upper bound.
workqueue.NewTypedItemExponentialFailureRateLimiter[any](250*time.Millisecond, 3000*time.Second),
workqueue.NewTypedItemExponentialFailureRateLimiter[any](250*time.Millisecond, 3000*time.Millisecond),
}

func DefaultCDDaemonRateLimiter() workqueue.TypedRateLimiter[any] {
	return NewJitterRateLimiter(workqueue.NewTypedItemExponentialFailureRateLimiter[any](5*time.Millisecond, 6000*time.Millisecond), 0.5)
@jgehrcke (Collaborator, Author) commented:

I thought quite a bit about these numbers, but of course these are just an attempt to pick something meaningful -- we will see over time if and how we want to change method and parameters.
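For context, a hedged wiring sketch (it assumes the DefaultCDDaemonRateLimiter shown above; the queue name is made up) of how such a rate limiter is typically plugged into a typed rate-limiting workqueue:

// Wiring only; not taken from the driver.
func newCDDaemonQueue() workqueue.TypedRateLimitingInterface[any] {
	return workqueue.NewTypedRateLimitingQueueWithConfig[any](
		DefaultCDDaemonRateLimiter(),
		workqueue.TypedRateLimitingQueueConfig[any]{Name: "cd-daemon"},
	)
}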

@jgehrcke (Collaborator, Author) commented:

/ok to test f8fceea

@jgehrcke jgehrcke self-assigned this Nov 20, 2025
@jgehrcke jgehrcke moved this from Backlog to In Progress in Planning Board: k8s-dra-driver-gpu Nov 20, 2025
@jgehrcke jgehrcke added this to the v25.12.0 milestone Nov 21, 2025

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

DefaultPrepUnprepRateLimiter backs off too much

1 participant