
Conversation

@shivamerla
Contributor

  • Avoid overwriting PrepareCompleted with PrepareStarted on subsequent Prepare() calls.
  • Move claim prepared state checks into a single updateCheckpoint call for atomicity.
  • Ensure checkpoint state does not transition from completed to started state again.

…atomically detecting already-prepared claims.

Signed-off-by: Shiva Krishna, Merla <[email protected]>
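
For illustration, here is a rough Go sketch of the intended guard. The types and names below (checkpoint, updateCheckpoint, PrepareStarted, PrepareCompleted) are hypothetical, simplified stand-ins for the driver's actual checkpoint code; the point is only to show the check and the write happening atomically so a repeated Prepare() call cannot regress a claim's state.

```go
// Hedged sketch with hypothetical, simplified types; not the driver's actual API.
package main

import (
	"fmt"
	"sync"
)

// PrepareState models the two checkpointed per-claim states discussed above.
type PrepareState int

const (
	PrepareStarted PrepareState = iota
	PrepareCompleted
)

// checkpoint is a stand-in for the driver's checkpointed claim state.
type checkpoint struct {
	mu     sync.Mutex
	claims map[string]PrepareState
}

// updateCheckpoint atomically records that preparation of a claim has started,
// unless the checkpoint already marks it as completed. It returns true when the
// claim is already prepared, so the caller can skip re-preparation. The
// "already prepared?" check and the state write happen under one lock, so a
// repeated or concurrent Prepare() call cannot overwrite PrepareCompleted with
// PrepareStarted.
func (c *checkpoint) updateCheckpoint(claimUID string) (alreadyPrepared bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if state, ok := c.claims[claimUID]; ok && state == PrepareCompleted {
		// Never transition from completed back to started.
		return true
	}
	c.claims[claimUID] = PrepareStarted
	return false
}

func main() {
	cp := &checkpoint{claims: map[string]PrepareState{}}

	// First Prepare() call: claim is not yet prepared, state moves to started.
	fmt.Println(cp.updateCheckpoint("claim-1")) // false

	// Preparation finishes; the driver would checkpoint the completed state.
	cp.mu.Lock()
	cp.claims["claim-1"] = PrepareCompleted
	cp.mu.Unlock()

	// A second Prepare() call for the same claim sees the completed state and
	// does not reset it to PrepareStarted.
	fmt.Println(cp.updateCheckpoint("claim-1")) // true
}
```

The essential design point is that checking for an already-prepared claim and writing the new state are one atomic step, rather than a read followed by a separate unconditional write.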
@copy-pr-bot

copy-pr-bot bot commented Nov 19, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shivamerla
Contributor Author

Run dir: /tmp/k8s-dra-driver-gpu-tests-out-smerla/bats-tests-1763521347-XXXXX.DYUaakUsGu
+ cd /cwd
+ echo 'Running k8s cluster cleanup (invasive)...'
Running k8s cluster cleanup (invasive)...
+ set +x
--- STARTING TEST SUITE ---
+ TMPDIR=/tmp/k8s-dra-driver-gpu-tests-out-smerla/bats-tests-1763521347-XXXXX.DYUaakUsGu
+ bats --print-output-on-failure --no-tempdir-cleanup --timing --abort tests/bats/test_gpu_basic.bats tests/bats/test_gpu_stress.bats
test_gpu_basic.bats
 ✓ 1 pod(s), 1 full GPU [19569]
 ✓ 2 pod(s), 1 full GPU each [21535]
 ✓ 2 pod(s), 1 full GPU (shared, 1 RC) [17509]
 ✓ 1 pod(s), 2 cntrs, 1 full GPU (shared, 1 RCT) [17893]
test_gpu_stress.bats
 ✓ Stress: shared ResourceClaim across 15 pods x 5 loops [275201]

5 tests, 0 failures in 355 seconds

@jgehrcke
Collaborator

Thanks for thinking this through! Any brain power on that front is much appreciated.

Higher-level notes:

  • I haven't looked at the details yet, at all! Will do.
  • I will want to compare this to the commits made as part of the dynamic MIG device allocation PoC -- I found at least one issue with the current (gpu plugin) code that I addressed, and I just haven't gotten around to opening a PR yet since coming back from KubeCon.
  • In the long run (or short term?) I think we should do ourselves a favor and find a way to have more common code between the CD and GPU sides of things -- bug fixes should benefit both plugins, and of course that's easier with either 1) fully shared code or 2) duplicated, but not overly diverging code.

I hope we're OK with slightly slow movement here. 🐌

@shivamerla
Contributor Author

  • In the long run (or short term?) I think we should do ourselves a favor and find a way to have more common code between the CD and GPU sides of things -- bug fixes should benefit both plugins and of course that's easier with either 1) fully shared code or 2) duplicated, but not overly diverging code.

Totally agree on this. There is some duplication between the two plugins here.

@shivamerla shivamerla self-assigned this Nov 21, 2025
@klueska klueska added the bug Issue/PR to expose/discuss/fix a bug label Nov 24, 2025
@klueska klueska modified the milestones: unscheduled, v25.12.0 Nov 24, 2025
