-
Notifications
You must be signed in to change notification settings - Fork 98
CD plugin: channel prepare: fail if allocated as of checkpoint #642
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CD plugin: channel prepare: fail if allocated as of checkpoint #642
Conversation
|
|
||
| // For now, we treat each request as a request for channel zero, even if | ||
| // AllocationModeAll. | ||
| if err := s.allocateImexChannel(0); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from review: rename to AssertImexChannelNotAllocated() -- call further below, closer to other asserts.
| for claimUID, claim := range cp.V2.PreparedClaims { | ||
| // Ignore non-completed preparations: only one instance of this program | ||
| // is running, and we only run one Prepare() at any given time. Is that | ||
| // true during upgrades though? If this is not true, then we must fail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that true during upgrades though?
Yes, of course -- that's why we introduced the file-based locking (to never have interleaving Prepare/Unprepare calls, even when having multiple processes).
|
Tested current state of the patch in a failover test. The new check in action, from a kubelet plugin log:
|
fbd041e to
8e568dc
Compare
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
8e568dc to
3bae548
Compare
Kevin and I discussed this today at length. The discussion naturally converged to what we believe is a fundamentally required state reconciliation; once again based on a periodic type of cleanup: periodically, we have to perform a (beautifully named) self-initiated Routine, at the high level: Perform periodically:
Value is two-fold:
Edit: now tracking this here: #643 Edit 2: thanks for another review, @klueska :) |
A patch proposal that addresses #641.
When a resource on a node gets released by a pod and then -- in quick succession -- gets consumed by a new pod (through a new resource claim) then we must make sure that the old resource claim actually gets
Unprepare()ed first, before the new resource claim getsPrepare()d.If said
Prepare()comes in early, it needs to be rejected because the device is still allocated from the node's point of view. IfUnprepare()ing the old resource claim is done last, workload fails. See #641.This patch implements one of many ways to make sure that the early
Prepare()gets rejected. It raises an interesting question about entries in the checkpoint JSON in thePrepareStartedstate.