
Conversation

@jgehrcke (Collaborator) commented Oct 6, 2025

A patch proposal that addresses #641.

When a resource on a node gets released by a pod and then -- in quick succession -- gets consumed by a new pod (through a new resource claim), we must make sure that the old resource claim actually gets Unprepare()d first, before the new resource claim gets Prepare()d.

If said Prepare() comes in early, it needs to be rejected, because the device is still allocated from the node's point of view. If Unprepare()ing the old resource claim happens last, the workload fails. See #641.

This patch implements one of many ways to make sure that the early Prepare() gets rejected. It raises an interesting question about entries in the checkpoint JSON in the PrepareStarted state.
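For illustration, a minimal sketch of such a rejection check in Go (the names Checkpoint, PreparedClaim and assertChannelNotAllocated are placeholders, not the driver's actual types): an incoming Prepare() is refused as long as the checkpoint still attributes the requested IMEX channel to a different resource claim.

```go
package main

import "fmt"

// Illustrative types only; the real checkpoint layout in the driver differs.
type PreparedClaim struct {
	Channels []int // IMEX channels recorded as allocated for this claim
}

type Checkpoint struct {
	PreparedClaims map[string]PreparedClaim // keyed by claim UID
}

// assertChannelNotAllocated rejects preparation of an IMEX channel that the
// checkpoint still attributes to a different (not yet unprepared) claim.
func assertChannelNotAllocated(cp *Checkpoint, channel int, claimUID string) error {
	for uid, claim := range cp.PreparedClaims {
		if uid == claimUID {
			continue // re-preparing the same claim is not a conflict
		}
		for _, ch := range claim.Channels {
			if ch == channel {
				return fmt.Errorf("channel %d already allocated by claim %s (according to checkpoint)", channel, uid)
			}
		}
	}
	return nil
}

func main() {
	cp := &Checkpoint{PreparedClaims: map[string]PreparedClaim{
		"old-claim-uid": {Channels: []int{0}}, // old claim not yet Unprepare()d
	}}
	// A new claim asking for channel 0 must be rejected for now.
	fmt.Println(assertChannelNotAllocated(cp, 0, "new-claim-uid"))
}
```

With a check of this shape in the Prepare() path, the early Prepare() simply fails and can be retried (for example by the plugin's work queue) once the old claim has actually been Unprepare()d.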

@klueska klueska added this to the v25.8.0 milestone Oct 6, 2025
@klueska klueska added the bug Issue/PR to expose/discuss/fix a bug label Oct 6, 2025
@klueska klueska linked an issue Oct 6, 2025 that may be closed by this pull request: CD kubelet plugin may prepare channel0 multiple times

// For now, we treat each request as a request for channel zero, even if
// AllocationModeAll.
if err := s.allocateImexChannel(0); err != nil {
@jgehrcke (Collaborator, Author) commented:

From review: rename to AssertImexChannelNotAllocated() -- and call it further below, closer to the other asserts.

for claimUID, claim := range cp.V2.PreparedClaims {
// Ignore non-completed preparations: only one instance of this program
// is running, and we only run one Prepare() at any given time. Is that
// true during upgrades though? If this is not true, then we must fail
@jgehrcke (Collaborator, Author) commented:

Is that true during upgrades though?

Yes, of course -- that's why we introduced the file-based locking (so that Prepare/Unprepare calls never interleave, even with multiple processes running).
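For illustration only, a minimal sketch of what such file-based mutual exclusion can look like via flock(2) on Linux (the lock path and the withFileLock helper are made-up names, not necessarily how the driver implements it):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// withFileLock runs fn while holding an exclusive advisory lock on lockPath,
// so that Prepare()/Unprepare() critical sections never interleave, even
// across multiple plugin processes (e.g. during an upgrade).
func withFileLock(lockPath string, fn func() error) error {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()

	// Blocks until the exclusive lock is acquired; the lock is released
	// explicitly below and, in any case, when the descriptor is closed.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)

	return fn()
}

func main() {
	// Hypothetical lock file path, for illustration only.
	err := withFileLock("/tmp/imex-prepare.lock", func() error {
		// Prepare() or Unprepare() work would run here, never interleaved.
		return nil
	})
	fmt.Println(err)
}
```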

@copy-pr-bot bot commented Oct 6, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jgehrcke (Collaborator, Author) commented Oct 6, 2025

Tested the current state of the patch in a failover test. The new check in action, from a kubelet plugin log:

[pod/nvidia-dra-driver-gpu-kubelet-plugin-lmpgt/compute-domains] 2025-10-06T19:00:00.354419370Z E1006 19:00:00.354370 1 workqueue.go:138] Failed to reconcile work item: error preparing devices for claim 8c6f8987-f8b8-4534-a163-769716e8e4f6: prepare devices failed: error applying config: allocation failed: channel 0 already allocated by claim 294bf5c1-c9b6-44ff-8171-afa14939193d (according to checkpoint)

@jgehrcke jgehrcke force-pushed the jp/channelprep-fail-if-allocated branch from fbd041e to 8e568dc on October 6, 2025 19:10
@jgehrcke jgehrcke force-pushed the jp/channelprep-fail-if-allocated branch from 8e568dc to 3bae548 on October 6, 2025 19:13
@jgehrcke (Collaborator, Author) commented Oct 6, 2025

It raises an interesting question about entries in the checkpoint JSON in the PrepareStarted state.

Kevin and I discussed this today at length. The discussion naturally converged on what we believe is a fundamentally required state reconciliation, once again based on a periodic kind of cleanup: periodically, we have to perform a (beautifully named) self-initiated Unprepare() of previously partially performed Prepare()s.

Routine, at the high level (a code sketch follows further below):

Perform periodically:

  • read checkpoint
  • iterate through RCs in PrepareStarted
  • for each: RC still known in API server? If not: initiate an Unprepare, and remove from checkpoint file

Value is two-fold:

  • Critical: unprepare any partially performed claim preparation (this might take care of e.g. removing a CD node label)
  • Likely never a real-world problem, but still a decent thing to do: stop unbounded growth of the checkpoint JSON file over time.
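A minimal sketch of that routine (the function fields listPrepareStarted, claimExists, unprepare and removeFromCheckpoint are placeholders for the driver's real checkpoint and API-server access, not its actual interfaces):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Illustrative wiring; the driver's real checkpoint and API types differ.
type reconciler struct {
	listPrepareStarted   func() []string                                     // claim UIDs stuck in PrepareStarted
	claimExists          func(ctx context.Context, uid string) (bool, error) // still known to the API server?
	unprepare            func(ctx context.Context, uid string) error
	removeFromCheckpoint func(uid string) error
}

// run performs the periodic cleanup described above: for every checkpoint
// entry left in PrepareStarted, check whether the resource claim is still
// known to the API server; if it is gone, self-initiate an Unprepare() and
// drop the entry from the checkpoint file.
func (r *reconciler) run(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, uid := range r.listPrepareStarted() {
				exists, err := r.claimExists(ctx, uid)
				if err != nil || exists {
					continue // claim still around (or lookup failed): retry next tick
				}
				if err := r.unprepare(ctx, uid); err != nil {
					continue // keep the checkpoint entry so we retry later
				}
				_ = r.removeFromCheckpoint(uid)
			}
		}
	}
}

func main() {
	// Stub wiring, just to make the sketch runnable.
	r := &reconciler{
		listPrepareStarted:   func() []string { return nil },
		claimExists:          func(_ context.Context, _ string) (bool, error) { return true, nil },
		unprepare:            func(_ context.Context, _ string) error { return nil },
		removeFromCheckpoint: func(_ string) error { return nil },
	}
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	r.run(ctx, time.Second)
	fmt.Println("reconciler stopped")
}
```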

Edit: now tracking this here: #643

Edit 2: thanks for another review, @klueska :)

@jgehrcke jgehrcke merged commit a8e6de0 into NVIDIA:main Oct 6, 2025
7 checks passed
@jgehrcke jgehrcke moved this from Backlog to Closed in Planning Board: k8s-dra-driver-gpu Oct 7, 2025
