
Conversation

@jgehrcke (Collaborator) commented Oct 6, 2025

A patch proposal that addresses #641.

When a resource on a node gets released by a pod and then -- in quick succession -- gets consumed by a new pod (through a new resource claim), we must make sure that the old resource claim actually gets Unprepare()d first, before the new resource claim gets Prepare()d.

If said Prepare() comes in early, it needs to be rejected, because the device is still allocated from the node's point of view. If Unprepare()ing the old resource claim happens last, the workload fails. See #641.

This patch implements one of many ways to make sure that the early Prepare() gets rejected. It raises an interesting question about entries in the checkpoint JSON in the PrepareStarted state.
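For illustration, a minimal sketch of such a rejection check in Go (the names Checkpoint, PreparedClaim and assertChannelNotAllocated are placeholders, not the driver's actual types): an incoming Prepare() is refused as long as the checkpoint still attributes the requested IMEX channel to a different resource claim.

```go
package main

import "fmt"

// Illustrative types only; the real checkpoint layout in the driver differs.
type PreparedClaim struct {
	Channels []int // IMEX channels recorded as allocated for this claim
}

type Checkpoint struct {
	PreparedClaims map[string]PreparedClaim // keyed by claim UID
}

// assertChannelNotAllocated rejects preparation of an IMEX channel that the
// checkpoint still attributes to a different (not yet unprepared) claim.
func assertChannelNotAllocated(cp *Checkpoint, channel int, claimUID string) error {
	for uid, claim := range cp.PreparedClaims {
		if uid == claimUID {
			continue // re-preparing the same claim is not a conflict
		}
		for _, ch := range claim.Channels {
			if ch == channel {
				return fmt.Errorf("channel %d already allocated by claim %s (according to checkpoint)", channel, uid)
			}
		}
	}
	return nil
}

func main() {
	cp := &Checkpoint{PreparedClaims: map[string]PreparedClaim{
		"old-claim-uid": {Channels: []int{0}}, // old claim not yet Unprepare()d
	}}
	// A new claim asking for channel 0 must be rejected for now.
	fmt.Println(assertChannelNotAllocated(cp, 0, "new-claim-uid"))
}
```

With a check of this shape in the Prepare() path, the early Prepare() simply fails and can be retried (for example by the plugin's work queue) once the old claim has actually been Unprepare()d.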

@klueska klueska added this to the v25.8.0 milestone Oct 6, 2025
@klueska klueska added the bug Issue/PR to expose/discuss/fix a bug label Oct 6, 2025
@klueska klueska linked an issue Oct 6, 2025 that may be closed by this pull request: CD kubelet plugin may prepare channel0 multiple times

// For now, we treat each request as a request for channel zero, even if
// AllocationModeAll.
if err := s.allocateImexChannel(0); err != nil {
@jgehrcke (Collaborator, Author) commented:

From review: rename to AssertImexChannelNotAllocated() -- and call it further below, closer to the other asserts.

for claimUID, claim := range cp.V2.PreparedClaims {
// Ignore non-completed preparations: only one instance of this program
// is running, and we only run one Prepare() at any given time. Is that
// true during upgrades though? If this is not true, then we must fail
@jgehrcke (Collaborator, Author) commented:

Is that true during upgrades though?

Yes, of course -- that's why we introduced the file-based locking (so that Prepare/Unprepare calls never interleave, even with multiple processes running).
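For illustration only, a minimal sketch of what such file-based mutual exclusion can look like via flock(2) on Linux (the lock path and the withFileLock helper are made-up names, not necessarily how the driver implements it):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// withFileLock runs fn while holding an exclusive advisory lock on lockPath,
// so that Prepare()/Unprepare() critical sections never interleave, even
// across multiple plugin processes (e.g. during an upgrade).
func withFileLock(lockPath string, fn func() error) error {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()

	// Blocks until the exclusive lock is acquired; the lock is released
	// explicitly below and, in any case, when the descriptor is closed.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)

	return fn()
}

func main() {
	// Hypothetical lock file path, for illustration only.
	err := withFileLock("/tmp/imex-prepare.lock", func() error {
		// Prepare() or Unprepare() work would run here, never interleaved.
		return nil
	})
	fmt.Println(err)
}
```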

@copy-pr-bot bot commented Oct 6, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jgehrcke (Collaborator, Author) commented Oct 6, 2025

Tested the current state of the patch in a failover test. The new check in action, from a kubelet plugin log:

[pod/nvidia-dra-driver-gpu-kubelet-plugin-lmpgt/compute-domains] 2025-10-06T19:00:00.354419370Z E1006 19:00:00.354370 1 workqueue.go:138] Failed to reconcile work item: error preparing devices for claim 8c6f8987-f8b8-4534-a163-769716e8e4f6: prepare devices failed: error applying config: allocation failed: channel 0 already allocated by claim 294bf5c1-c9b6-44ff-8171-afa14939193d (according to checkpoint)

@jgehrcke jgehrcke force-pushed the jp/channelprep-fail-if-allocated branch from fbd041e to 8e568dc on October 6, 2025 19:10
@jgehrcke jgehrcke force-pushed the jp/channelprep-fail-if-allocated branch from 8e568dc to 3bae548 on October 6, 2025 19:13
@jgehrcke (Collaborator, Author) commented Oct 6, 2025

It raises an interesting question about entries in the checkpoint JSON in the PrepareStarted state.

Kevin and I discussed this today at length. The discussion naturally converged on what we believe is a fundamentally required state reconciliation, once again based on a periodic kind of cleanup: periodically, we have to perform a (beautifully named) self-initiated Unprepare() of previously partially performed Prepare()s.

Routine, at the high level (a code sketch follows further below):

Perform periodically:

  • read checkpoint
  • iterate through RCs in PrepareStarted
  • for each: RC still known in API server? If not: initiate an Unprepare, and remove from checkpoint file

Value is two-fold:

  • Critical: unprepare any partially performed claim preparation (this might take care of e.g. removing a CD node label)
  • Likely never a real-world problem, but still a decent thing to do: stop unbounded growth of the checkpoint JSON file over time.
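A minimal sketch of that routine (the function fields listPrepareStarted, claimExists, unprepare and removeFromCheckpoint are placeholders for the driver's real checkpoint and API-server access, not its actual interfaces):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Illustrative wiring; the driver's real checkpoint and API types differ.
type reconciler struct {
	listPrepareStarted   func() []string                                     // claim UIDs stuck in PrepareStarted
	claimExists          func(ctx context.Context, uid string) (bool, error) // still known to the API server?
	unprepare            func(ctx context.Context, uid string) error
	removeFromCheckpoint func(uid string) error
}

// run performs the periodic cleanup described above: for every checkpoint
// entry left in PrepareStarted, check whether the resource claim is still
// known to the API server; if it is gone, self-initiate an Unprepare() and
// drop the entry from the checkpoint file.
func (r *reconciler) run(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, uid := range r.listPrepareStarted() {
				exists, err := r.claimExists(ctx, uid)
				if err != nil || exists {
					continue // claim still around (or lookup failed): retry next tick
				}
				if err := r.unprepare(ctx, uid); err != nil {
					continue // keep the checkpoint entry so we retry later
				}
				_ = r.removeFromCheckpoint(uid)
			}
		}
	}
}

func main() {
	// Stub wiring, just to make the sketch runnable.
	r := &reconciler{
		listPrepareStarted:   func() []string { return nil },
		claimExists:          func(_ context.Context, _ string) (bool, error) { return true, nil },
		unprepare:            func(_ context.Context, _ string) error { return nil },
		removeFromCheckpoint: func(_ string) error { return nil },
	}
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	r.run(ctx, time.Second)
	fmt.Println("reconciler stopped")
}
```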

Edit: now tracking this here: #643

Edit 2: thanks for another review, @klueska :)

@jgehrcke jgehrcke merged commit a8e6de0 into NVIDIA:main Oct 6, 2025
7 checks passed
@jgehrcke jgehrcke moved this from Backlog to Closed in Planning Board: k8s-dra-driver-gpu Oct 7, 2025
