
Conversation

@shivamerla
Contributor

  • Avoid overwriting PrepareCompleted with PrepareStarted on subsequent Prepare() calls.
  • Move claim prepared state checks into a single updateCheckpoint call for atomicity.
  • Ensure checkpoint state does not transition from completed to started state again.

…atomically detecting already-prepared claims.

Signed-off-by: Shiva Krishna, Merla <[email protected]>
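
For illustration, here is a rough Go sketch of the intended guard. The types and names below (checkpoint, updateCheckpoint, PrepareStarted, PrepareCompleted) are hypothetical, simplified stand-ins for the driver's actual checkpoint code; the point is only to show the check and the write happening atomically so a repeated Prepare() call cannot regress a claim's state.

```go
// Hedged sketch with hypothetical, simplified types; not the driver's actual API.
package main

import (
	"fmt"
	"sync"
)

// PrepareState models the two checkpointed per-claim states discussed above.
type PrepareState int

const (
	PrepareStarted PrepareState = iota
	PrepareCompleted
)

// checkpoint is a stand-in for the driver's checkpointed claim state.
type checkpoint struct {
	mu     sync.Mutex
	claims map[string]PrepareState
}

// updateCheckpoint atomically records that preparation of a claim has started,
// unless the checkpoint already marks it as completed. It returns true when the
// claim is already prepared, so the caller can skip re-preparation. The
// "already prepared?" check and the state write happen under one lock, so a
// repeated or concurrent Prepare() call cannot overwrite PrepareCompleted with
// PrepareStarted.
func (c *checkpoint) updateCheckpoint(claimUID string) (alreadyPrepared bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if state, ok := c.claims[claimUID]; ok && state == PrepareCompleted {
		// Never transition from completed back to started.
		return true
	}
	c.claims[claimUID] = PrepareStarted
	return false
}

func main() {
	cp := &checkpoint{claims: map[string]PrepareState{}}

	// First Prepare() call: claim is not yet prepared, state moves to started.
	fmt.Println(cp.updateCheckpoint("claim-1")) // false

	// Preparation finishes; the driver would checkpoint the completed state.
	cp.mu.Lock()
	cp.claims["claim-1"] = PrepareCompleted
	cp.mu.Unlock()

	// A second Prepare() call for the same claim sees the completed state and
	// does not reset it to PrepareStarted.
	fmt.Println(cp.updateCheckpoint("claim-1")) // true
}
```

The essential design point is that checking for an already-prepared claim and writing the new state are one atomic step, rather than a read followed by a separate unconditional write.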
@copy-pr-bot

copy-pr-bot bot commented Nov 19, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shivamerla
Contributor Author

Run dir: /tmp/k8s-dra-driver-gpu-tests-out-smerla/bats-tests-1763521347-XXXXX.DYUaakUsGu
+ cd /cwd
+ echo 'Running k8s cluster cleanup (invasive)...'
Running k8s cluster cleanup (invasive)...
+ set +x
--- STARTING TEST SUITE ---
+ TMPDIR=/tmp/k8s-dra-driver-gpu-tests-out-smerla/bats-tests-1763521347-XXXXX.DYUaakUsGu
+ bats --print-output-on-failure --no-tempdir-cleanup --timing --abort tests/bats/test_gpu_basic.bats tests/bats/test_gpu_stress.bats
test_gpu_basic.bats
 ✓ 1 pod(s), 1 full GPU [19569]
 ✓ 2 pod(s), 1 full GPU each [21535]
 ✓ 2 pod(s), 1 full GPU (shared, 1 RC) [17509]
 ✓ 1 pod(s), 2 cntrs, 1 full GPU (shared, 1 RCT) [17893]
test_gpu_stress.bats
 ✓ Stress: shared ResourceClaim across 15 pods x 5 loops [275201]

5 tests, 0 failures in 355 seconds

@jgehrcke
Collaborator

Thanks for thinking this through! Any brain power on that front is much appreciated.

Higher-level notes:

  • I haven't looked at the details yet, at all! Will do.
  • I will want to compare this to the commits made as part of the dynamic MIG device allocation PoC -- I found at least one issue with the current (gpu plugin) code that I addressed, and I just haven't gotten around to opening a PR yet since coming back from KubeCon.
  • In the long run (or short term?) I think we should do ourselves a favor and find a way to have more common code between the CD and GPU sides of things -- bug fixes should benefit both plugins, and of course that's easier with either 1) fully shared code or 2) duplicated, but not overly diverging code.

I hope we're OK with slightly slow movement here. 🐌

@shivamerla
Contributor Author

  • In the long run (or short term?) I think we should do ourselves a favor and find a way to have more common code between the CD and GPU sides of things -- bug fixes should benefit both plugins and of course that's easier with either 1) fully shared code or 2) duplicated, but not overly diverging code.

Totally agree on this. There is some duplication between the two plugins here.

@shivamerla shivamerla self-assigned this Nov 21, 2025
@klueska klueska added the bug Issue/PR to expose/discuss/fix a bug label Nov 24, 2025
@klueska klueska modified the milestones: unscheduled, v25.12.0 Nov 24, 2025
