Skip to content

refactor(controller): remove redundant SandboxUpdateOps concurrency check#342

Open
PRAteek-singHWY wants to merge 9 commits into
openkruise:masterfrom
PRAteek-singHWY:refactor/sandboxupdateops-concurrency
Open

refactor(controller): remove redundant SandboxUpdateOps concurrency check#342
PRAteek-singHWY wants to merge 9 commits into
openkruise:masterfrom
PRAteek-singHWY:refactor/sandboxupdateops-concurrency

Conversation

@PRAteek-singHWY
Copy link
Copy Markdown
Contributor

Summary

This PR removes a redundant concurrency check in the SandboxUpdateOps controller, eliminating technical debt and optimizing the reconciliation loop.

Problem

In pkg/controller/sandboxupdateops/sandboxupdateops_controller.go, the controller was manually listing all SandboxUpdateOps resources in a namespace to prevent concurrent update operations. This check was accompanied by a TODO acknowledging it as a short-term solution and suggesting a validating webhook.

Since a validating webhook (pkg/webhook/sandboxupdateops/validating/sandboxupdateops_validate.go) is now fully implemented and successfully rejects concurrent creations at the admission level, the controller's manual check is completely obsolete.

Solution

Removed the redundant opsList conflict-check logic and the outdated TODO from the controller's Reconcile function. Also removed the corresponding obsolete unit test.

Impact

  • Performance: Eliminates an unnecessary List API call on every reconciliation loop, reducing load on the Kubernetes API Server.
  • Tech Debt: Removes outdated comments and redundant race-prone controller logic.

Checklist

  • Removed redundant concurrency check in sandboxupdateops_controller.go.
  • Removed obsolete unit test for concurrency check.
  • Verified ValidatingWebhook already handles this validation.

fixes #341

@codecov
Copy link
Copy Markdown

codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 85.21127% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.44%. Comparing base (edaad73) to head (4de0b74).
⚠️ Report is 9 commits behind head on master.

Files with missing lines Patch % Lines
pkg/utils/sandboxutils/selector_utils.go 88.88% 7 Missing and 4 partials ⚠️
...er/sandboxupdateops/sandboxupdateops_controller.go 74.35% 7 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #342      +/-   ##
==========================================
+ Coverage   74.65%   75.44%   +0.79%     
==========================================
  Files         141      144       +3     
  Lines        9836    10366     +530     
==========================================
+ Hits         7343     7821     +478     
- Misses       2183     2206      +23     
- Partials      310      339      +29     
Flag Coverage Δ
unittests 75.44% <85.21%> (+0.79%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@PRAteek-singHWY
Copy link
Copy Markdown
Contributor Author

PRAteek-singHWY commented May 7, 2026

Hello @furykerry, @AiRanthem, and @zmberg

I was reviewing the codebase and noticed an opportunity to clean up some tech debt in the SandboxUpdateOps controller. I found a redundant List() check that was manually preventing concurrent operations, accompanied by a TODO pointing to a webhook solution.

Since the sandboxupdateops_validate.go webhook is now fully implemented and successfully handles this at the admission layer, I’ve removed the obsolete loop and its associated test. This should optimize the reconciliation cycle by saving an unnecessary API server call.

Let me know if you need any adjustments or further testing.
Happy to iterate based on ur suggestions.
Thank You.

@furykerry
Copy link
Copy Markdown
Member

@PRAteek-singHWY thanks for your patch, but concurrent update ops for a single sandbox should be avoided, so simply remove this concurrent check is not an option, we should comes up with some more flexible validation rules

@PRAteek-singHWY
Copy link
Copy Markdown
Contributor Author

@PRAteek-singHWY thanks for your patch, but concurrent update ops for a single sandbox should be avoided, so simply remove this concurrent check is not an option, we should comes up with some more flexible validation rules

Hey @furykerry
Thanks for the feedback.

That makes sense, we definitely want to avoid race conditions on a single sandbox.

Instead of the strict 'one-per-namespace' rule, I'll update this PR to implement a more granular check: I'll make the webhook verify that the selector of the new SandboxUpdateOps doesn't overlap with any currently active ones. This way, users can run concurrent updates across the namespace as long as they target different sets of sandboxes.

I'll push the updated logic shortly.

Thank you.

@kruise-bot
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign furykerry for approval by writing /assign @furykerry in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kruise-bot kruise-bot added size/L and removed size/M labels May 7, 2026
@PRAteek-singHWY
Copy link
Copy Markdown
Contributor Author

PRAteek-singHWY commented May 7, 2026

Hey @furykerry, your point about avoiding race conditions on individual sandboxes makes total sense. I've updated the logic to block updates only if their selectors overlap, this keeps things safe for single sandboxes while allowing parallel updates across the rest of the namespace.

I've also included a small fix in the Makefile to add an envtest dependency for the test target. I noticed the unit-test CI was failing on fresh runners because the setup-envtest binary wasn't being downloaded automatically. This ensures the PR (and future ones) can pass CI consistently.

Happy to iterate further based on ur suggestions.
Thank you.

Comment thread pkg/utils/sandboxutils/selector_utils.go Outdated
Comment thread pkg/controller/sandboxupdateops/sandboxupdateops_controller.go
@furykerry furykerry requested a review from zmberg May 9, 2026 01:27
@furykerry
Copy link
Copy Markdown
Member

@zmberg the informer cache sync delay problem always exists in both webhook and controller , i think we need a final check in the sandbox controller. The sandbox controller can lock the sandbox in pre-upgrade phase so that no other sandboxupdateops can operate in the same sandbox

PRAteek-singHWY and others added 2 commits May 9, 2026 14:38
Replace the hand-rolled SelectorsOverlap with a copy of openkruise/kruise's
pkg/util/selector.go (IsSelectorOverlapping + IsSelectorLooseOverlap +
helpers). Substitute std slices.Contains for the
k8s.io/kubernetes/pkg/util/slice calls so we avoid that transitive
dependency. The unused unsafe ValidatedLabelSelectorAsSelector fast-path is
omitted; happy to add back if requested.

Update the webhook caller and broaden unit tests to cover NotIn / Exists /
DoesNotExist combinations exercised by the new util.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ioning to Updating

The validating webhook reads from a (potentially stale) informer cache, so
two SandboxUpdateOps with overlapping selectors can both be admitted when
created back-to-back. Add a controller-side check before the
Pending -> Updating transition: if a non-terminal sibling op in the
namespace has an overlapping selector and is either already Updating or
wins the creation-timestamp tiebreak, set Phase=Pending and requeue.

Tiebreak: a sibling already in Updating wins outright; otherwise the older
CreationTimestamp wins (lexicographic name comparison breaks exact-time
ties).

Note: the controller's informer cache also lags, so this still leaves a
narrow residual race. The durable fix is a sandbox-level lock during the
pre-upgrade phase, planned as a stacked follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kruise-bot kruise-bot added size/XL and removed size/L labels May 9, 2026
Lift codecov/patch coverage above the 74.65% gate by exercising the
IsSelectorLooseOverlap branch of the kruise-vendored selector util and
the equal-CreationTimestamp tiebreak in the SandboxUpdateOps reconciler.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
// IsSelectorLooseOverlap indicates whether selectors overlap (indicates that selector1, selector2 have same key, and there is a certain intersection)
// 1. when selector1、selector2 don't have same key, it is considered non-overlap, e.g. selector1(a=b) and selector2(c=d)
// 2. when selector1、selector2 have same key, and matchLabels & matchExps are intersection, it is considered overlap.
func IsSelectorLooseOverlap(selector1, selector2 *metav1.LabelSelector) bool {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead code: IsSelectorLooseOverlap and its helpers are unused, consider remove them

Copy link
Copy Markdown
Contributor Author

@PRAteek-singHWY PRAteek-singHWY May 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead code: IsSelectorLooseOverlap and its helpers are unused, consider remove them

HI @furykerry Done. Dropped IsSelectorLooseOverlap, isMatchExpOverlap, and convertSelectorToMatchExpressions from pkg/utils/sandboxutils/selector_utils.go, and removed the TestIsSelectorLooseOverlap table from selector_utils_test.go. IsSelectorOverlapping is the only entry point we actually use (webhook + new controller double-check), so the file is now just that plus its isDisjoint / sliceOverlaps / sliceContains helpers. go test ./pkg/utils/sandboxutils/... and go vet pass.

return ctrl.Result{}, client.IgnoreNotFound(err)
}

// 2. Check if another SandboxUpdateOps in the same namespace is already Updating
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After removing the old step-2 block, the remaining comments jump from // 1. Get SandboxUpdateOps directly to // 3. Handle deletion. Pre-existing duplicates (two // 3., two // 4., two // 7.) make this worse, but the PR exacerbates the problem by removing the only "step 2" anchor. Now the function reads as step 1 → step 3, which is confusing. Consider correct the step number

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @furykerry will look into this.
Thank you.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @furykerry
Renumbered the steps in Reconcile so they read 1 → 10 in order, no gaps and no cross-step duplicates:

  1. Get SandboxUpdateOps
  2. Handle deletion (was // 3., now fills the old "step 2" slot)
  3. Ensure finalizer
  4. Terminal state short-circuit
  5. Concurrency safeguard (the new sibling re-check — was an unnumbered block, now an explicit step)
  6. Convert selector to labels.Selector (was the duplicate // 3.)
  7. List matching sandboxes (was the duplicate // 4.)
  8. Classify sandbox states
  9. Phase state machine
  10. Update status

Also removed the duplicate // 7. Update status comment on the alt return path so step 10 isn't repeated. No behavior change — pure comment cleanup. Tests still green.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @furykerry ,
would really appreciate your review and guidance on this.
If there are any iterations to be made.
Thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remove redundant SandboxUpdateOps concurrency check in controller

3 participants