Skip to content

fix(proxy): implement graceful shutdown for listener errors to prevent panics#382

Open
rakshaak29 wants to merge 6 commits into
openkruise:masterfrom
rakshaak29:fix/graceful-shutdown
Open

fix(proxy): implement graceful shutdown for listener errors to prevent panics#382
rakshaak29 wants to merge 6 commits into
openkruise:masterfrom
rakshaak29:fix/graceful-shutdown

Conversation

@rakshaak29
Copy link
Copy Markdown

What type of PR is this?
kind bug

What this PR does / why we need it:
This PR addresses a critical resilience flaw where background listener goroutines in both the proxy and e2b servers use klog.Fatalf upon encountering startup errors (such as port collisions or EADDRINUSE). Because klog.Fatalf invokes os.Exit(255), it completely bypasses the controller's graceful shutdown procedures, leaving active E2B claims, sandboxes, and webhook transactions abruptly killed without draining.

Changes introduced:

  1. pkg/proxy/server.go: Modified Run(errCh chan<- error) to accept an error channel. When ListenAndServe or grpc.Serve fails, the error is propagated to the channel instead of panicking.
  2. pkg/sandbox-manager/core.go: Updated SandboxManager.Run() to establish a cancellable runCtx. It now listens on the proxy's error channel and seamlessly triggers cancel() to cleanly terminate the infrastructure and memberlist peers if the proxy listeners fail.
  3. pkg/servers/e2b/core.go: Replaced klog.Fatalf in the HTTP listener goroutine with a signal directed at the existing sc.stop interrupt channel (syscall.SIGTERM), safely leaning on the controller's established shutdown mechanics.
  4. Tests: Updated pkg/proxy/ext_proc_test.go to inject an errCh to server.Run(), aligning with the new API. Minor fix applied to pkg/servers/web/framework_test.go to handle Go 1.22+ trailing slash redirects (http.StatusTemporaryRedirect) natively to keep CI fully green.

Which issue(s) this PR fixes:

Fixes #375

Special notes for your reviewer:
Local tests and linting (make test, make lint) pass successfully after this refactor. E2E SDK integration checks have also been accounted for.

Signed-off-by: rakshaak29 <rakshaak29@gmail.com>
…plateRef claims

Signed-off-by: rakshaak29 <rakshaak29@gmail.com>
Signed-off-by: rakshaak29 <rakshaak29@gmail.com>
@kruise-bot kruise-bot requested review from AiRanthem and zmberg May 14, 2026 06:59
@kruise-bot
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign furykerry for approval by writing /assign @furykerry in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rakshaak29 and others added 3 commits May 14, 2026 17:05
Integration specs require a configured cluster; running go test ./...
without -tags=e2e executed them against the default kubeconfig and
failed every BeforeEach. Add //go:build e2e to all test/e2e sources and
pass --tags e2e from hack/run-agents-e2e-test.sh so Ginkgo still
compiles the suite.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] klog.Fatalf in proxy.Server.Run() background goroutines bypasses graceful shutdown and causes unrecoverable crashes

2 participants