Conversation

@realrajaryan
Contributor

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update

Motivation and Context

Currently, stopping and then immediately restarting a container fails with the error
“container expected to be in created state, got: shuttingDown”, after which the container is automatically deleted.
The SandboxService process waits five seconds before exiting after shutdown. During that window, a rapid restart can reconnect to the still-terminating process in the shuttingDown state, triggering the state validation error.

This fix forcefully terminates the SandboxService process with SIGKILL upon container exit, instead of waiting five seconds. The bootstrap now defensively checks for and cleans up any stale services before registering new ones, preventing reconnections to processes in the shuttingDown state.
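
A rough sketch of the two changes described above; the ServiceRegistry protocol and the containerDidExit/bootstrap signatures below are illustrative stand-ins, not the actual APIs in this codebase:

import Darwin

// Hypothetical registry shape, for illustration only.
protocol ServiceRegistry {
    func lookup(_ id: String) async -> pid_t?
    func register(_ id: String, pid: pid_t) async throws
    func deregister(_ id: String) async throws
}

// On container exit, kill the SandboxService immediately instead of letting
// it sleep for five seconds and exit on its own.
func containerDidExit(sandboxServicePID pid: pid_t) {
    kill(pid, SIGKILL)  // cannot be caught, so nothing lingers in shuttingDown
}

// bootstrap defensively clears any stale registration before registering a
// fresh service, so a rapid restart never reconnects to a process that is
// still tearing down.
func bootstrap(id: String, newPID: pid_t, registry: some ServiceRegistry) async throws {
    if let stale = await registry.lookup(id) {
        kill(stale, SIGKILL)
        try await registry.deregister(id)
    }
    try await registry.register(id, pid: newPID)
}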

Testing

  • Tested locally
  • Added/updated tests
  • Added/updated docs

@realrajaryan realrajaryan requested a review from dcantah November 4, 2025 02:32
@dcantah
Member

dcantah commented Nov 4, 2025

I think we should just get rid of the 5 second sleep+exit and deregister (and not ignore the error) in stop. The shutdown rpc is just meant as a way (our docs here are shitty, otherwise it'd be easier to tell 😄) for any implementations of the sandbox service to clean up any resources it may have created during container lifecycle (temp files, sockets, whatever). I think we can just rely on the APIServer to kill it, and not have the sandbox service itself exit on its own. I'm not sure on that entirely, but it seems fine to me to leave having it exit up to the APIServer.

So:

  1. Remove the Task sleep and exit in shutdown rpc.
  2. Always deregister after shutdown comes back successfully.
  3. Get rid of the checks you added in bootstrap
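
One way to read those three steps, sketched with hypothetical SandboxClient and ServiceRegistry protocols rather than the real types:

import Darwin

// Illustrative protocols; the real client and registry types differ.
protocol SandboxClient { func shutdown() async throws }
protocol ServiceRegistry { func deregister(_ id: String) async throws }

// stop: shutdown is only a cleanup handshake (no sleep, no exit inside the
// rpc handler), deregistration always follows and its error is surfaced, and
// the APIServer, not the sandbox service itself, terminates the process.
func stop(id: String, pid: pid_t, client: some SandboxClient, registry: some ServiceRegistry) async throws {
    try await client.shutdown()
    try await registry.deregister(id)
    kill(pid, SIGKILL)
}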

@realrajaryan
Contributor Author

I think we should just get rid of the 5 second sleep+exit and deregister (and not ignore the error) in stop. The shutdown rpc is just meant as a way (our docs here are shitty, otherwise it'd be easier to tell 😄) for any implementations of the sandbox service to clean up any resources it may have created during container lifecycle (temp files, sockets, whatever). I think we can just rely on the APIServer to kill it, and not have the sandbox service itself exit on its own. I'm not sure on that entirely, but it seems fine to me to leave having it exit up to the APIServer.

So:

  1. Remove the Task sleep and exit in shutdown rpc.
  2. Always deregister after shutdown comes back successfully.
  3. Get rid of the checks you added in bootstrap

I’m also gonna move service registration from create() to bootstrap(), because if create() registers the service and bootstrap() registers it again, the second registration fails. This way services start when containers are actually running.
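
A sketch of that registration move, with invented create()/bootstrap()/ServiceHandle shapes rather than the real signatures:

// Hypothetical types, for illustration only.
struct ServiceHandle { let id: String }
protocol ServiceRegistry { func register(_ handle: ServiceHandle) async throws }

// create() only allocates the container; it no longer touches the registry.
func create(id: String) async throws -> ServiceHandle {
    ServiceHandle(id: id)
}

// bootstrap() is the single registration point, so the service is registered
// exactly once, when the container is actually being brought up.
func bootstrap(_ handle: ServiceHandle, registry: some ServiceRegistry) async throws {
    try await registry.register(handle)
}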

@realrajaryan realrajaryan force-pushed the users/realrajaryan/kill-sandbox-service-on-exit branch from 9a67abd to ddf2c35 on November 4, 2025 22:06
@realrajaryan realrajaryan force-pushed the users/realrajaryan/kill-sandbox-service-on-exit branch from ddf2c35 to 1388dbc on November 4, 2025 22:07
@realrajaryan realrajaryan force-pushed the users/realrajaryan/kill-sandbox-service-on-exit branch from 1388dbc to c8a4cb7 on November 4, 2025 22:21
Comment on lines +464 to +465
let client = try state.getClient()
try await client.shutdown()
Member

Sorry, I still think shutdown shouldn't be fatal and we can leave what we had, as it's mostly best-effort cleanup on the sb service's end; I just meant to not ignore all of the errors like what you had in your original change.

Member

We're in a weird spot if deregistration fails though. Curious on your thoughts on how to handle that.

Contributor Author

Ahh, if deregister fails, we could catch it and attempt SIGKILL + deregister as a fallback?
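
Sketched with hypothetical names, that fallback might look like:

import Darwin

protocol ServiceRegistry { func deregister(_ id: String) async throws }

// If normal deregistration fails, force-kill the service process and retry
// the deregistration once.
func stopService(id: String, pid: pid_t, registry: some ServiceRegistry) async throws {
    do {
        try await registry.deregister(id)
    } catch {
        kill(pid, SIGKILL)                 // force the stuck process down first
        try await registry.deregister(id)  // surface the error if this fails too
    }
}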

  results[config.id] = state
- let plugin = runtimePlugins.first { $0.name == config.runtimeHandler }
- guard let plugin else {
+ guard runtimePlugins.first(where: { $0.name == config.runtimeHandler }) != nil else {
Contributor

nit: more concise here and on L155:

guard runtimePlugins.contains(where: { $0.name == config.runtimeHandler }) else {
...
}

case .created, .stopped(_), .stopping:
    await self.setState(.shuttingDown)

    Task {
Contributor

Is it that containers won't ever stop gracefully anymore?

To ensure containers and attached volumes preserve written data, don't we want something like:

  • Send SIGTERM to the primary workload and additional processes (execs)
  • Wait a certain amount of time for the processes to complete and be reaped; then vminitd terminates normally and the VM shuts down of its own accord
  • If this doesn't happen within a certain time, shut down the VM forcefully

This isn't unlike what launchd does for macOS services
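
For reference, a minimal sketch of that escalation (SIGTERM, bounded wait, then force), assuming a hypothetical waitForExit helper that resolves once the process has been reaped:

import Darwin

func gracefulStop(
    pid: pid_t,
    graceSeconds: UInt64 = 10,
    waitForExit: @escaping @Sendable (pid_t) async -> Void
) async {
    kill(pid, SIGTERM)  // ask the workload processes to shut down cleanly
    let exitedInTime = await withTaskGroup(of: Bool.self, returning: Bool.self) { group in
        group.addTask { await waitForExit(pid); return true }
        group.addTask {
            try? await Task.sleep(nanoseconds: graceSeconds * 1_000_000_000)
            return false
        }
        let first = await group.next() ?? false
        group.cancelAll()
        return first
    }
    if !exitedInTime {
        kill(pid, SIGKILL)  // grace period elapsed; shut down forcefully
    }
}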

Member

@dcantah dcantah Nov 19, 2025

Truthfully we should ensure the container (singular as that's all we support currently, but can be expanded to whatever the N is) is dead in shutdown. If the sandbox svc is still servicing any containers we should return an error. Shutdown should solely be a final little handshake to clean up some possible resources before getting killed.
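
Sketched with hypothetical names, that contract would be: refuse to shut down while any container is still being serviced, otherwise just release leftover resources and return:

struct ContainersStillRunning: Error {
    let ids: [String]
}

// shutdown as a final handshake: error out if anything is still being
// serviced, otherwise release leftover resources (temp files, sockets) and
// return; the APIServer kills the process afterwards.
func shutdown(activeContainerIDs: [String], releaseResources: () async throws -> Void) async throws {
    guard activeContainerIDs.isEmpty else {
        throw ContainersStillRunning(ids: activeContainerIDs)
    }
    try await releaseResources()
    // no sleep, no exit(0) here
}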

Contributor

If the sandbox svc is still servicing any containers we should return an error.

We should return an error for what? That the container cannot be stopped?

What should the UX be for stop, given that properly stopping a container (or pod) can be a somewhat slow operation?

How best do we handle cases where a container's (or pod's) workload processes don't shut down in a timely manner?

If container stop c1 c2 c3 c4 signals these four containers to shut down, do we want to return immediately after telling them, or wait until we know that all are in the stopped state? The former is more responsive but can produce the condition that prompted this issue.

Member

@dcantah dcantah Nov 20, 2025

We should return an error for what? That the container cannot be stopped?

No, if there are still containers running when the shutdown handler is invoked on any given sandbox svc, we should throw an error. I/we should probably spend a week documenting the behaviors of the various rpcs so folks know what behavior we expect for every call, but I was modeling shutdown much after containerd's. It's meant as a final place to do any resource cleanup (possibly none, like in our case, but it depends on the implementation) after we've already gone through the rounds and stopped any containers (or they may have exited on their own).

Successfully merging this pull request may close these issues.

[Bug]: Container auto deletes when we stop a container and start the same container within few milliseconds.