
Conversation


@jgehrcke jgehrcke commented Nov 20, 2025

+misc changes.

See commit messages.

Tests:

test_basics.bats
 ✓ test VERSION_W_COMMIT, VERSION_GHCR_CHART, VERSION [194]
 ✓ confirm no kubelet plugin pods running [169]
 ✓ helm-install oci://ghcr.io/nvidia/k8s-dra-driver-gpu/25.12.0-dev-f8fceeae-chart [9525]
 ✓ helm list: validate output [230]
 ✓ get crd computedomains.resource.nvidia.com [164]
 ✓ wait for plugin & controller pods READY [749]
 ✓ validate CD controller container image spec [169]
test_gpu_basic.bats
 ✓ 1 pod(s), 1 full GPU [5089]
 ✓ 2 pod(s), 1 full GPU each [5065]
 ✓ 2 pod(s), 1 full GPU (shared, 1 RC) [5098]
 ✓ 1 pod(s), 2 cntrs, 1 full GPU (shared, 1 RCT) [4382]
test_cd_imex_chan_inject.bats
 ✓ IMEX channel injection (single) [14871]
 ✓ IMEX channel injection (all) [12443]
test_cd_mnnvl_workload.bats
 ✓ nickelpie (NCCL send/recv/broadcast, 2 pods, 2 nodes, small payload) [11296]
 ✓ nvbandwidth (2 nodes, 2 GPUs each) [16139]
test_cd_misc.bats
 ✓ CD daemon shutdown: confirm CD status cleanup [9212]
 ✓ reject unknown field in opaque cfg in CD chan ResourceClaim [10262]
 ✓ self-initiated unprepare of stale RCs in PrepareStarted [25815]
test_cd_logging.bats
 ✓ CD controller/plugin: startup config / detail in logs on level 0 [6462]
 ✓ CD controller: test log verbosity levels [57130]
 ✓ CD daemon: test log verbosity levels [32774]
test_cd_failover.bats
 ✓ CD failover nvb2: force-delete worker pod 0 [48916]
 ✓ CD failover nvb2: force-delete all IMEX daemons [36919]
 ✓ CD failover nvb2: regular-delete worker pod 1 [54482]
test_cd_updowngrade.bats
 ✓ downgrade: current-dev -> last-stable [25450]
 ✓ upgrade: wipe-state, install-last-stable, upgrade-to-current-dev [34085]
test_gpu_stress.bats
 ✓ Stress: shared ResourceClaim across 15 pods x 5 loops [155559]

27 tests, 0 failures in 617 seconds


copy-pr-bot bot commented Nov 20, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jgehrcke jgehrcke force-pushed the jp/large-cd-formation branch from 9524592 to 7815f9a on November 20, 2025 13:29
// whitespace to the left.
for _, ip := range slices.Sorted(maps.Keys(m.ipToDNSName)) {
	dnsname := m.ipToDNSName[ip]
	klog.Infof("%26s -> %s", dnsname, ip)
@jgehrcke (Collaborator, Author) commented:

Change motivated by seeing logs like this:

[screenshot: example log output omitted]

@jgehrcke (Collaborator, Author) commented:

The current patch sorts the keys, and hence the IP addresses :)
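For illustration, a hedged, self-contained toy version of that pattern (the map contents and the main wrapper below are made up, not driver code): slices.Sorted(maps.Keys(...)) materializes the keys in sorted (here: lexicographic) order, so the log lines come out deterministically.

package main

import (
	"fmt"
	"maps"
	"slices"
)

func main() {
	// Hypothetical stand-in for m.ipToDNSName from the hunk above.
	ipToDNSName := map[string]string{
		"10.0.0.12": "node-b.internal",
		"10.0.0.3":  "node-a.internal",
		"10.0.0.25": "node-c.internal",
	}

	// slices.Sorted consumes the iterator returned by maps.Keys (Go 1.23+)
	// and returns a sorted slice, giving a stable iteration order.
	for _, ip := range slices.Sorted(maps.Keys(ipToDNSName)) {
		fmt.Printf("%26s -> %s\n", ipToDNSName[ip], ip)
	}
}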

// perform a stable sort of IP addresses before writing them to the nodes
// config file.
if !maps.Equal(newIPs, previousIPs) {
klog.Infof("IP set changed: previous: %v; new: %v", previousIPs, newIPs)
@jgehrcke (Collaborator, Author) commented Nov 20, 2025:

The bulk of the log volume emitted by the CD daemon is dominated by this; we must not log all of this on level zero.

Example:
[screenshot: example log output omitted]
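For reference, a self-contained sketch of that gating pattern (the set type and function name are simplified stand-ins, not the driver's actual signatures): the IP set is only logged when it differs from the previously written one, so steady-state reconcile passes stay quiet on the default level.

package main

import (
	"maps"

	"k8s.io/klog/v2"
)

func writeNodesConfigIfChanged(previousIPs, newIPs map[string]bool) bool {
	if !maps.Equal(newIPs, previousIPs) {
		klog.Infof("IP set changed: previous: %v; new: %v", previousIPs, newIPs)
		return true
	}
	// Unchanged: emit nothing on the default level. A per-pass dump could
	// still be offered behind a higher verbosity, e.g. klog.V(2).Infof(...).
	return false
}

func main() {
	prev := map[string]bool{"10.0.0.3": true}
	curr := map[string]bool{"10.0.0.3": true, "10.0.0.12": true}
	_ = writeNodesConfigIfChanged(prev, curr)
}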


if err := pm.updateNodeStatus(ctx, status); err != nil {
return fmt.Errorf("failed to update node status: %w", err)
return fmt.Errorf("pod update: failed to update note status in CD (%s): %w", status, err)
@jgehrcke (Collaborator, Author) commented:

The wrapper (workqueue) does not enrich the error message with meaningful context information, so I added the `pod update:` prefix here -- it makes it easier to understand what a log message means. Example:

I1119 22:10:21.531887       1 workqueue.go:197] Reconcile: pod update: failed to update note status in CD (Ready): simulated error 5 (attempt 5)
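A small, runnable sketch of that prefixing pattern (the podManager stub and the simulated error are made up for illustration): the inner error message carries the pod update: prefix so the line the workqueue wrapper eventually logs is self-explanatory, while %w keeps the cause available to errors.Is/errors.As.

package main

import (
	"context"
	"errors"
	"fmt"
)

// podManager and updateNodeStatus are simplified stand-ins for the types in
// the hunk below.
type podManager struct{}

func (pm *podManager) updateNodeStatus(ctx context.Context, status string) error {
	return errors.New("simulated error")
}

func (pm *podManager) reconcilePodUpdate(ctx context.Context, status string) error {
	if err := pm.updateNodeStatus(ctx, status); err != nil {
		// The "pod update:" prefix supplies the context that the workqueue
		// wrapper does not add on its own.
		return fmt.Errorf("pod update: failed to update node status in CD (%s): %w", status, err)
	}
	return nil
}

func main() {
	fmt.Println((&podManager{}).reconcilePodUpdate(context.Background(), "Ready"))
	// -> pod update: failed to update node status in CD (Ready): simulated error
}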

// UpdateComputeDomainNodeInfo updates the Nodes field in the ComputeDomain with
// info about the ComputeDomain daemon running on this node. Upon success, it
// reflects the mutation in `m.mutationCache`.
func (m *ComputeDomainManager) UpdateComputeDomainNodeInfo(ctx context.Context, cd *nvapi.ComputeDomain) (rerr error) {
@jgehrcke (Collaborator, Author) commented:

I felt like renaming this from UpdateComputeDomainNodeInfo to EnsureNodeInfoInCD after I repeatedly found myself slightly confused about the high-level responsibility of this method.

Any incoming pod update should terminate the retry
loop initiated for a previously incoming pod
update. The same applies to any incoming CD
update.

Any pod update refers to the same pod object, and
any CD update refers to the same CD object. Make
that explicit by using hard-coded keys.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
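One way the hard-coded-key idea plays out with client-go's workqueue, as a hedged sketch (the key name and queue wiring are assumptions, not the driver's code): all pod updates are enqueued under one fixed key, and the queue's de-duplication ensures that a freshly arriving update coalesces with, and thereby supersedes, a retry still pending for an earlier update.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

const podUpdateKey = "pod-update" // fixed key: all pod updates collapse onto it

func main() {
	rl := workqueue.NewTypedItemExponentialFailureRateLimiter[string](250*time.Millisecond, 3*time.Second)
	q := workqueue.NewTypedRateLimitingQueue[string](rl)

	// A failed reconcile schedules a backed-off retry for the fixed key ...
	q.AddRateLimited(podUpdateKey)

	// ... and a newly arriving pod update enqueues the very same key, so at
	// most one instance of the key is queued at any time.
	q.Add(podUpdateKey)

	item, _ := q.Get()
	fmt.Println("processing:", item)
	q.Forget(item) // reset the per-key backoff after a successful reconcile
	q.Done(item)
}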
For large CDs this makes it faster to identify
changes from the log output.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
This is less error-prone: if we treat `node ==
nil` generally as success, we may miss persisting
a pod state transition in edge cases, in
particular for state transitions after the initial
NotReady -> Ready transition.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
During reconciliation of pod and CD updates, a
number of log and error messages flow through the
system; this change makes it easier to understand
which messages belong together and what is
actually happening.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
This reduces log volume on the default log level
for large CDs.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
This was meant to be three seconds, not 3000 seconds.

This is a node-local retry and we can easily
afford not backing off towards O(1 min) or
further.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Change the upper bound from 1000 s (~17 minutes)
to something much lower.

For formation of a larger ComputeDomain (N nodes),
many writers attempt to update the same API server
object.

The overall work that has to be done by the API
server scales linearly with N: a certain number of
updates (at least N, in the best case) is
required.

Hence, in the best case (perfect serialization, no
conflicts), the overall time it takes to get all
updates in (the ideal convergence time C) scales
linearly with N.

In the worst case, when individual updates always
conflict with each other (are performed 'at the
same time' against the same reference state),
convergence is never achieved.

Without centralized coordination, backing off
individual retriers is a way to spread out updates
over time. The nature of the distribution of those
back-offs governs how the actual convergence time
compares to the ideal case C.

The ideal case C is governed by the rate R at
which the central entity can process updates.

If we naively back off exponentially without a
sane upper bound, then we don't homogeneously
spread the update load over time, but inject fewer
and fewer updates into the system per unit time as
time progresses. The attempted update rate then
falls far below R (the possible update rate). That
makes convergence unnecessarily slow.

If we do not back off enough, the opposite effect
may occur: the global rate of retries arriving at
the central point (the API server) may always
exceed R, thrashing resources and slowing things
down compared to the theoretical maximum update
rate (perfectly serialized updates).

Hence, there is a sweet spot between both extrema.
The positioning of that sweet spot strongly
depends on R.

Summary:

1) We do not want to back off individual retriers
   too far, otherwise we operate at an update rate
   lower than necessary and artificially slow down
   the convergence process.

2) We need to back off individual retriers enough
   to prevent thrashing from slowing us and others
   down. This is critical for making sure the
   convergence time scales linearly with N
   (instead of, say, O(N**2)).

This patch primarily takes care of (1).

For (2), in the future, we may want to further
increase that upper bound after a certain amount
of time (if e.g. a 5 second cap does not result in
overall convergence after e.g. 30 minutes, it may
be worth backing off further, to remove stress
from the API server).

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
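To make the effect of the cap concrete, an illustrative snippet (parameters borrowed from the commit text and the hunks further down, not driver code) printing the per-retry delays of client-go's exponential failure rate limiter under a loose 1000 s cap versus a tight 5 s cap -- under the loose cap the attempted update rate keeps dropping, while the tight cap keeps retriers attempting every few seconds:

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	loose := workqueue.NewTypedItemExponentialFailureRateLimiter[string](250*time.Millisecond, 1000*time.Second)
	tight := workqueue.NewTypedItemExponentialFailureRateLimiter[string](250*time.Millisecond, 5*time.Second)

	// Each When() call counts one more failure for the item and returns the
	// next backoff delay (base * 2^failures, capped at the upper bound).
	const key = "cd-update"
	for i := 1; i <= 12; i++ {
		fmt.Printf("retry %2d: loose cap %-10v tight cap %v\n", i, loose.When(key), tight.When(key))
	}
}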
The client-go rate limiters such as
`ExponentialFailureRateLimiter` do not implement
jitter. In a user's environment, formation of a CD
across 144 nodes has shown that the absence of
jitter results in significant retry attempt
correlation across nodes -- even after ~10
retries, resulting in otherwise preventable
conflicts (and hence increased convergence time).

That effect can be diminished by adding jitter,
which should allow for fewer conflicts and hence
faster convergence.

The JitterRL provided by this patch is a simple,
custom implementation that I validated with
simulated errors and careful placement of log
messages.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
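The JitterRL implementation itself is not part of this transcript. As a point of reference, a minimal sketch of what a jitter wrapper around client-go's TypedRateLimiter could look like (the [1-f, 1+f] spread, field names, and internals are assumptions, not the PR's actual code; only the constructor name and the 0.5 factor appear in the hunks below):

package main

import (
	"fmt"
	"math/rand"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// jitterRateLimiter wraps another rate limiter and randomizes its delays so
// that retriers on different nodes decorrelate instead of firing in lockstep.
type jitterRateLimiter[T comparable] struct {
	inner  workqueue.TypedRateLimiter[T]
	factor float64 // e.g. 0.5 spreads a delay d across [0.5*d, 1.5*d]
}

func NewJitterRateLimiter[T comparable](inner workqueue.TypedRateLimiter[T], factor float64) workqueue.TypedRateLimiter[T] {
	return &jitterRateLimiter[T]{inner: inner, factor: factor}
}

func (j *jitterRateLimiter[T]) When(item T) time.Duration {
	d := j.inner.When(item)
	// Scale by a factor drawn uniformly from [1-f, 1+f].
	scale := 1 + j.factor*(2*rand.Float64()-1)
	return time.Duration(float64(d) * scale)
}

func (j *jitterRateLimiter[T]) Forget(item T) { j.inner.Forget(item) }

func (j *jitterRateLimiter[T]) NumRequeues(item T) int { return j.inner.NumRequeues(item) }

func main() {
	rl := NewJitterRateLimiter(workqueue.NewTypedItemExponentialFailureRateLimiter[string](250*time.Millisecond, 5*time.Second), 0.5)
	for i := 0; i < 5; i++ {
		fmt.Println(rl.When("cd-update"))
	}
}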
@jgehrcke jgehrcke force-pushed the jp/large-cd-formation branch from 7815f9a to f8fceea on November 20, 2025 13:44
// fails and is retried, the delay grows exponentially starting from the
// lower value up to the upper bound.
workqueue.NewTypedItemExponentialFailureRateLimiter[any](250*time.Millisecond, 3000*time.Second),
workqueue.NewTypedItemExponentialFailureRateLimiter[any](250*time.Millisecond, 3000*time.Millisecond),
}

func DefaultCDDaemonRateLimiter() workqueue.TypedRateLimiter[any] {
	return NewJitterRateLimiter(workqueue.NewTypedItemExponentialFailureRateLimiter[any](5*time.Millisecond, 6000*time.Millisecond), 0.5)
@jgehrcke (Collaborator, Author) commented:

I thought quite a bit about these numbers, but of course these are just an attempt to pick something meaningful -- we will see over time if and how we want to change method and parameters.
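For context, a hedged wiring sketch (it assumes the DefaultCDDaemonRateLimiter shown above; the queue name is made up) of how such a rate limiter is typically plugged into a typed rate-limiting workqueue:

// Wiring only; not taken from the driver.
func newCDDaemonQueue() workqueue.TypedRateLimitingInterface[any] {
	return workqueue.NewTypedRateLimitingQueueWithConfig[any](
		DefaultCDDaemonRateLimiter(),
		workqueue.TypedRateLimitingQueueConfig[any]{Name: "cd-daemon"},
	)
}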

@jgehrcke (Collaborator, Author) commented:

/ok to test f8fceea

@jgehrcke jgehrcke self-assigned this Nov 20, 2025
@jgehrcke jgehrcke moved this from Backlog to In Progress in Planning Board: k8s-dra-driver-gpu Nov 20, 2025
@jgehrcke jgehrcke added this to the v25.12.0 milestone Nov 21, 2025

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

DefaultPrepUnprepRateLimiter backs off too much

1 participant