Skip to content

fix: register scale-up expectations before creating sandboxes#393

Open
oindrilakha12-ui wants to merge 1 commit into
openkruise:masterfrom
oindrilakha12-ui:fix/expectation-race-issue-4
Open

fix: register scale-up expectations before creating sandboxes#393
oindrilakha12-ui wants to merge 1 commit into
openkruise:masterfrom
oindrilakha12-ui:fix/expectation-race-issue-4

Conversation

@oindrilakha12-ui
Copy link
Copy Markdown

What

Fixed a critical race condition in the SandboxSet controller where ExpectScale was called after r.Create.

Why

Under high load or high QPS, the Kubernetes API server can deliver the Create event to the controller's informer faster than the controller can execute the next line of code. If r.Create succeeds but the Create event is processed before ExpectScale is called:

  1. ObserveScale (in the event handler) finds no registered expectation for that sandbox and is a no-op.
  2. ExpectScale is then called, registering an expectation that will never be satisfied by an event (since the event already passed).
  3. The controller's SatisfiesExpectations check then blocks any further reconciliation for that SandboxSet until the expectation times out (default 30 seconds).

This results in severe "stalls" in warm pool replenishment.

Changes

  • Code: Moved scaleUpExpectation.ExpectScale before the r.Create call.
  • Code: Added a call to ObserveScale in the r.Create error path to ensure failed creations don't leave orphaned expectations.
  • Tests: Updated TestReconcile_BasicScale to manually clear expectations between reconcile calls. This is necessary because the fake test client doesn't trigger the watch events that would normally clear expectations in a real cluster.

Testing

  • go test ./pkg/controller/sandboxset/... ✅ (All tests pass with the expectation cleanup).
  • Manual verification of logic flow.

Fixes #392

@kruise-bot
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign furykerry for approval by writing /assign @furykerry in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kruise-bot kruise-bot added size/S and removed size/M labels May 14, 2026
Moving ExpectScale before r.Create fixes a race condition where the
creation event can be delivered to the informer before the expectation
is registered. In such cases, the ObserveScale call in the event handler
fails to find a matching expectation, leading the controller to believe
a creation is still pending and blocking reconciliation for up to 30s.

Also added ObserveScale in the error path of r.Create to ensure any
registered expectations are cleared if creation fails.

Updated unit tests to manually clear expectations between successive
Reconcile calls, as the fake client does not trigger the event handlers
that would normally satisfy expectations.

Fixes openkruise#392
@oindrilakha12-ui oindrilakha12-ui force-pushed the fix/expectation-race-issue-4 branch from f4d97de to 1cefb71 Compare May 14, 2026 14:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: ExpectScale called after Create in SandboxSet controller causes reconciliation stalls

2 participants