refactor(controller): remove redundant SandboxUpdateOps concurrency check#342
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #342 +/- ##
==========================================
+ Coverage 74.65% 75.44% +0.79%
==========================================
Files 141 144 +3
Lines 9836 10366 +530
==========================================
+ Hits 7343 7821 +478
- Misses 2183 2206 +23
- Partials 310 339 +29
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Hello @furykerry, @AiRanthem, and @zmberg I was reviewing the codebase and noticed an opportunity to clean up some tech debt in the Since the Let me know if you need any adjustments or further testing. |
|
@PRAteek-singHWY thanks for your patch, but concurrent update ops for a single sandbox should be avoided, so simply remove this concurrent check is not an option, we should comes up with some more flexible validation rules |
Hey @furykerry That makes sense, we definitely want to avoid race conditions on a single sandbox. Instead of the strict 'one-per-namespace' rule, I'll update this PR to implement a more granular check: I'll make the webhook verify that the selector of the new SandboxUpdateOps doesn't overlap with any currently active ones. This way, users can run concurrent updates across the namespace as long as they target different sets of sandboxes. I'll push the updated logic shortly. Thank you. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hey @furykerry, your point about avoiding race conditions on individual sandboxes makes total sense. I've updated the logic to block updates only if their selectors overlap, this keeps things safe for single sandboxes while allowing parallel updates across the rest of the namespace. I've also included a small fix in the Happy to iterate further based on ur suggestions. |
|
@zmberg the informer cache sync delay problem always exists in both webhook and controller , i think we need a final check in the sandbox controller. The sandbox controller can lock the sandbox in pre-upgrade phase so that no other sandboxupdateops can operate in the same sandbox |
Replace the hand-rolled SelectorsOverlap with a copy of openkruise/kruise's pkg/util/selector.go (IsSelectorOverlapping + IsSelectorLooseOverlap + helpers). Substitute std slices.Contains for the k8s.io/kubernetes/pkg/util/slice calls so we avoid that transitive dependency. The unused unsafe ValidatedLabelSelectorAsSelector fast-path is omitted; happy to add back if requested. Update the webhook caller and broaden unit tests to cover NotIn / Exists / DoesNotExist combinations exercised by the new util. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ioning to Updating The validating webhook reads from a (potentially stale) informer cache, so two SandboxUpdateOps with overlapping selectors can both be admitted when created back-to-back. Add a controller-side check before the Pending -> Updating transition: if a non-terminal sibling op in the namespace has an overlapping selector and is either already Updating or wins the creation-timestamp tiebreak, set Phase=Pending and requeue. Tiebreak: a sibling already in Updating wins outright; otherwise the older CreationTimestamp wins (lexicographic name comparison breaks exact-time ties). Note: the controller's informer cache also lags, so this still leaves a narrow residual race. The durable fix is a sandbox-level lock during the pre-upgrade phase, planned as a stacked follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lift codecov/patch coverage above the 74.65% gate by exercising the IsSelectorLooseOverlap branch of the kruise-vendored selector util and the equal-CreationTimestamp tiebreak in the SandboxUpdateOps reconciler. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| // IsSelectorLooseOverlap indicates whether selectors overlap (indicates that selector1, selector2 have same key, and there is a certain intersection) | ||
| // 1. when selector1、selector2 don't have same key, it is considered non-overlap, e.g. selector1(a=b) and selector2(c=d) | ||
| // 2. when selector1、selector2 have same key, and matchLabels & matchExps are intersection, it is considered overlap. | ||
| func IsSelectorLooseOverlap(selector1, selector2 *metav1.LabelSelector) bool { |
There was a problem hiding this comment.
Dead code: IsSelectorLooseOverlap and its helpers are unused, consider remove them
There was a problem hiding this comment.
Dead code: IsSelectorLooseOverlap and its helpers are unused, consider remove them
HI @furykerry Done. Dropped IsSelectorLooseOverlap, isMatchExpOverlap, and convertSelectorToMatchExpressions from pkg/utils/sandboxutils/selector_utils.go, and removed the TestIsSelectorLooseOverlap table from selector_utils_test.go. IsSelectorOverlapping is the only entry point we actually use (webhook + new controller double-check), so the file is now just that plus its isDisjoint / sliceOverlaps / sliceContains helpers. go test ./pkg/utils/sandboxutils/... and go vet pass.
| return ctrl.Result{}, client.IgnoreNotFound(err) | ||
| } | ||
|
|
||
| // 2. Check if another SandboxUpdateOps in the same namespace is already Updating |
There was a problem hiding this comment.
After removing the old step-2 block, the remaining comments jump from // 1. Get SandboxUpdateOps directly to // 3. Handle deletion. Pre-existing duplicates (two // 3., two // 4., two // 7.) make this worse, but the PR exacerbates the problem by removing the only "step 2" anchor. Now the function reads as step 1 → step 3, which is confusing. Consider correct the step number
There was a problem hiding this comment.
Hi @furykerry will look into this.
Thank you.
There was a problem hiding this comment.
Hi @furykerry
Renumbered the steps in Reconcile so they read 1 → 10 in order, no gaps and no cross-step duplicates:
- Get SandboxUpdateOps
- Handle deletion (was
// 3., now fills the old "step 2" slot) - Ensure finalizer
- Terminal state short-circuit
- Concurrency safeguard (the new sibling re-check — was an unnumbered block, now an explicit step)
- Convert selector to
labels.Selector(was the duplicate// 3.) - List matching sandboxes (was the duplicate
// 4.) - Classify sandbox states
- Phase state machine
- Update status
Also removed the duplicate // 7. Update status comment on the alt return path so step 10 isn't repeated. No behavior change — pure comment cleanup. Tests still green.
There was a problem hiding this comment.
Hi @furykerry ,
would really appreciate your review and guidance on this.
If there are any iterations to be made.
Thank you
Summary
This PR removes a redundant concurrency check in the
SandboxUpdateOpscontroller, eliminating technical debt and optimizing the reconciliation loop.Problem
In
pkg/controller/sandboxupdateops/sandboxupdateops_controller.go, the controller was manually listing allSandboxUpdateOpsresources in a namespace to prevent concurrent update operations. This check was accompanied by aTODOacknowledging it as a short-term solution and suggesting a validating webhook.Since a validating webhook (
pkg/webhook/sandboxupdateops/validating/sandboxupdateops_validate.go) is now fully implemented and successfully rejects concurrent creations at the admission level, the controller's manual check is completely obsolete.Solution
Removed the redundant
opsListconflict-check logic and the outdatedTODOfrom the controller'sReconcilefunction. Also removed the corresponding obsolete unit test.Impact
ListAPI call on every reconciliation loop, reducing load on the Kubernetes API Server.Checklist
sandboxupdateops_controller.go.fixes #341