Persist Sandbox status on reconcile errors#366

Open
ajatshatru01 wants to merge 2 commits into
openkruise:masterfrom
ajatshatru01:persist-sandbox-status-on-error

@ajatshatru01
Contributor

Ⅰ. Describe what this PR does

This PR ensures the Sandbox controller persists computed status updates even when phase-specific reconcile logic returns an error.

Previously, most control-error paths returned immediately without calling updateSandboxStatus. Only the Upgrading phase had special-case status persistence. This meant useful status changes could be lost when reconcile failed, making the Kubernetes Sandbox.status less accurate during failed pause/resume/update/upgrade operations.

Changes

  • Call updateSandboxStatus on all control errors, not only SandboxUpgrading.
  • Preserve existing retry behavior by still returning the original reconcile error.
  • Log status persistence errors without masking the original control error.
  • Add a focused controller test that verifies status is persisted even when the control logic mutates newStatus and then returns an error.
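The error-handling pattern listed above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the controller's actual code: `SandboxStatus`, `controlFunc`, `statusWriter`, `reconcileOnce`, `failingUpgrade`, and both error values are hypothetical stand-ins, and `statusWriter` only plays the role that `updateSandboxStatus` plays in the real controller.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// SandboxStatus is a simplified, hypothetical stand-in for the real status type.
type SandboxStatus struct {
	Phase   string
	Message string
}

// controlFunc is a hypothetical phase handler; it may mutate newStatus and then fail.
type controlFunc func(newStatus *SandboxStatus) error

// statusWriter is a hypothetical abstraction over the status update call.
type statusWriter func(newStatus SandboxStatus) error

// Illustrative errors, not taken from the project.
var (
	errPreUpgrade  = errors.New("pre-upgrade hook failed")
	errStatusWrite = errors.New("status write failed")
)

// reconcileOnce runs the control logic and persists the computed status on
// both the success and the error path. On a control error, a failed status
// write is only logged, so the original error still drives the retry.
func reconcileOnce(current SandboxStatus, control controlFunc, persist statusWriter) error {
	newStatus := current // work on a copy of the observed status
	if err := control(&newStatus); err != nil {
		if perr := persist(newStatus); perr != nil {
			log.Printf("failed to persist status after control error: %v", perr)
		}
		return err // never masked by the persistence error
	}
	return persist(newStatus)
}

// failingUpgrade mutates the status first and then returns an error,
// which is the situation the PR description is concerned with.
func failingUpgrade(s *SandboxStatus) error {
	s.Phase = "Upgrading"
	s.Message = errPreUpgrade.Error()
	return errPreUpgrade
}

func main() {
	var stored SandboxStatus
	persist := func(s SandboxStatus) error { stored = s; return nil }

	err := reconcileOnce(SandboxStatus{Phase: "Running"}, failingUpgrade, persist)
	fmt.Printf("err=%v phase=%s message=%q\n", err, stored.Phase, stored.Message)
}
```

Returning the control error unchanged keeps the existing requeue behavior, while the status write becomes best-effort on the error path.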

Why

Users should be able to inspect Sandbox.status and understand what happened, even if the controller hit an error and will retry.

This improves visibility for fields such as:

  • status.phase
  • status.message
  • status.conditions

Ⅱ. Does this pull request fix one issue?

None, but it resolves the TODO in sandbox_controller.go (line 245).

Ⅲ. Describe how to verify it

go test ./pkg/controller/sandbox
go test ./pkg/webhook/sandboxset/validating

Ⅳ. Special notes for reviews

I also verified the change manually in a kind cluster.
First, I deployed the locally built controller image into kind:

make docker-build-controller CONTROLLER_IMG=agent-sandbox-controller:latest
kind load docker-image agent-sandbox-controller:latest --name openkruise
kubectl -n sandbox-system rollout restart deploy/sandbox-controller-manager
kubectl -n sandbox-system rollout status deploy/sandbox-controller-manager

Created a demo Sandbox and confirmed it reached Running:

kubectl apply -f - <<'EOF'
apiVersion: agents.kruise.io/v1alpha1
kind: Sandbox
metadata:
  name: status-demo
  namespace: default
spec:
  template:
    spec:
      containers:
      - name: app
        image: nginx:latest
EOF

kubectl get sandbox status-demo -o yaml
kubectl get pod status-demo

Triggered a failing recreate upgrade with a failing preUpgrade hook:

kubectl patch sandbox status-demo --type merge -p '{
  "spec": {
    "upgradePolicy": {
      "type": "Recreate"
    },
    "lifecycle": {
      "preUpgrade": {
        "exec": {
          "command": ["sh", "-c", "exit 1"]
        }
      }
    },
    "template": {
      "spec": {
        "containers": [
          {
            "name": "app",
            "image": "nginx:1.25"
          }
        ]
      }
    }
  }
}'

Confirmed status was persisted with the failed upgrade state:

status:
  conditions:
  - lastTransitionTime: "2026-05-11T18:40:04Z"
    message: sandbox is upgrading
    reason: Upgrading
    status: "False"
    type: Ready
  - lastTransitionTime: "2026-05-11T18:38:07Z"
    message: ""
    reason: Succeeded
    status: "True"
    type: InplaceUpdate
  - lastTransitionTime: "2026-05-11T18:40:04Z"
    message: 'hook execution error: unavailable: dial tcp 10.244.0.8:49983: connect:
      connection refused'
    reason: PreUpgradeFailed
    status: "False"
    type: Upgrading
  nodeName: openkruise-control-plane
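
For readers who want to check this state programmatically, the lookup can be sketched as below. It is a minimal illustration under stated assumptions: `Condition` is a simplified stand-in that mirrors only the fields visible in the output above, `findCondition`, `upgradeFailure`, and `sampleConditions` are hypothetical helpers, and treating an `Upgrading` condition with status `"False"` as a failure is inferred from this output rather than from the project's API documentation.

```go
package main

import "fmt"

// Condition is a simplified stand-in with only the fields shown in the status output.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// findCondition returns a pointer to the first condition of the given type, or nil.
func findCondition(conds []Condition, condType string) *Condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

// upgradeFailure reports the reason and message of the Upgrading condition
// when its status is "False" (assumed here to indicate a failed upgrade step).
func upgradeFailure(conds []Condition) (reason, message string, failed bool) {
	c := findCondition(conds, "Upgrading")
	if c == nil || c.Status != "False" {
		return "", "", false
	}
	return c.Reason, c.Message, true
}

// sampleConditions reproduces the conditions from the output above.
func sampleConditions() []Condition {
	return []Condition{
		{Type: "Ready", Status: "False", Reason: "Upgrading", Message: "sandbox is upgrading"},
		{Type: "InplaceUpdate", Status: "True", Reason: "Succeeded"},
		{
			Type:    "Upgrading",
			Status:  "False",
			Reason:  "PreUpgradeFailed",
			Message: "hook execution error: unavailable: dial tcp 10.244.0.8:49983: connect: connection refused",
		},
	}
}

func main() {
	if reason, msg, failed := upgradeFailure(sampleConditions()); failed {
		fmt.Printf("upgrade failed: %s: %s\n", reason, msg)
	}
}
```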

Copilot AI review requested due to automatic review settings May 12, 2026 15:19
@kruise-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign furykerry for approval by writing /assign @furykerry in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ajatshatru01
Contributor Author

I have recreated the PR for the Sandbox status feature. The earlier PR had merge conflicts, which I have now fully resolved.
@furykerry, please review.

@codecov

codecov Bot commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.92%. Comparing base (5d3212b) to head (c7ce439).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #366      +/-   ##
==========================================
+ Coverage   75.90%   75.92%   +0.01%     
==========================================
  Files         145      145              
  Lines       10626    10626              
==========================================
+ Hits         8066     8068       +2     
+ Misses       2212     2211       -1     
+ Partials      348      347       -1     
Flag Coverage Δ
unittests 75.92% <ø> (+0.01%) ⬆️


Contributor

Copilot AI left a comment

Pull request overview

This PR updates the Sandbox controller reconcile flow to persist computed Sandbox.status updates even when the phase-specific control logic returns an error, improving status observability during failed operations.

Changes:

  • Persist newStatus via updateSandboxStatus on any control error (not just the Upgrading phase), while still returning the original reconcile error.
  • Log status persistence failures without masking the original control error.
  • Add a controller test that verifies status mutations made by control logic are persisted even when that control logic returns an error.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pkg/controller/sandbox/sandbox_controller.go | Persists computed Sandbox status on control errors across all phases, logging persistence failures while returning the original error. |
| pkg/controller/sandbox/sandbox_controller_test.go | Adds a focused reconcile test ensuring status updates are persisted even when control logic fails. |

3 participants