fix(sandbox-manager): use postCtx for InitRuntime after resume wait #328

Open
deepak0x wants to merge 1 commit into openkruise:master from deepak0x:fix/post-resume-init-runtime-ctx
@deepak0x deepak0x commented May 2, 2026

Ⅰ. Describe what this PR does

Changes line 318 of pkg/sandbox-manager/infra/sandboxcr/sandbox.go to use postCtx instead of the original ctx when calling runtime.InitRuntime, matching the three sister callsites (InplaceRefresh at L306, resolveCSIMountConfigs at L336, ProcessCSIMounts at L341) that already use postCtx.

Adds sandbox_resume_postctx_test.go, a regression test that arranges the deadline-boundary scenario and asserts the /init request lands on a fake envd HTTP server. Without the fix the request fails on the client side with context deadline exceeded before reaching the sidecar; with the fix it lands on the sidecar and Resume returns nil cleanly.

Ⅱ. Does this pull request fix one issue?

fixes #327

Ⅲ. Describe how to verify it

go test -count=1 -v -run '^TestSandbox_Resume_InitRuntimeUsesPostCtx$' ./pkg/sandbox-manager/infra/sandboxcr/ -timeout 60s

The test passes with the fix and fails on the previous master (it started as the bug-finding repro). The full sandboxcr package tests stay green (go test ./pkg/sandbox-manager/infra/sandboxcr/).

Ⅳ. Special notes for reviews

One-token change to make line 318 consistent with the three callsites updated in f2be4665. The new test follows the existing TestSandbox_Resume_ContextExpiredAfterWait pattern in pause_resume_test.go.

Note (out of scope for this PR, mentioned in #327): runtime.InitRuntime's retry loop appears to swallow the final error and return nil even when all attempts fail. That behavior is what made the line-318 bug a silent failure rather than a loud one — happy to send a separate PR for it once this lands.

The postCtx fallback at sandbox.go:299-305 builds a fresh 30s context
when the original ctx deadline is consumed by the resume wait. Three of
the four post-wait callsites already use postCtx (InplaceRefresh,
resolveCSIMountConfigs, ProcessCSIMounts); runtime.InitRuntime was
missed and still received the expired ctx, so its /init request to
agent-runtime failed with context deadline exceeded before reaching
the sidecar. A retry-loop quirk in InitRuntime swallows the error,
making this a silent failure: Resume() returns nil while the runtime
sidecar was never re-initialized.

Add a regression test that arranges the deadline-boundary case and
asserts the /init request lands on a fake envd HTTP server.

kruise-bot requested review from AiRanthem and zmberg May 2, 2026 20:42
@kruise-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign airanthem for approval by writing /assign @airanthem in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov Bot commented May 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.69%. Comparing base (edaad73) to head (a959dcf).
⚠️ Report is 30 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #328      +/-   ##
==========================================
+ Coverage   74.65%   74.69%   +0.04%     
==========================================
  Files         141      141              
  Lines        9836     9836              
==========================================
+ Hits         7343     7347       +4     
+ Misses       2183     2181       -2     
+ Partials      310      308       -2     
Flag Coverage Δ
unittests 74.69% <100.00%> (+0.04%) ⬆️

@kruise-bot

@deepak0x: PR needs rebase.

Development

Successfully merging this pull request may close these issues.

[BUG] Post-resume runtime re-init silently fails: sandbox.go:318 passes expired ctx to InitRuntime instead of postCtx

2 participants