Freeze container during live checkpoint for consistency #28963
[NON-BLOCKING] Packit jobs failed. @containers/packit-build please check. Everyone else, feel free to ignore.
|
Arguably a breaking change? But, given we did the same to |
|
After discussion, let's get this in 6.0 |
Luap99
left a comment
Seems reasonable overall, with some nits for the test.
Also, please squash the commits into one; we like to have the feature/bug fix and its test in one commit.
@mheon It shouldn't be a breaking change. The only difference is that we keep a container paused until the checkpoint is fully written when cc: @adrianreber |
065ebd0 to 068d2c2
When checkpointing a container with --leave-running, libpod dumps the container's memory via the OCI runtime (CRIU) first and only captures the rootfs diff and named volumes afterwards. CRIU thaws the container as soon as the memory dump finishes, so the processes inside the container continue to run between the memory snapshot and the file-system capture. As a result, the checkpoint can be inconsistent: the CRIU images and the file system can reflect different points in time.

To fix this, we freeze the container's cgroup before invoking the OCI runtime and thaw it again only after the checkpoint image/archive has been written. The OCI runtime calls CRIU with the freezer cgroup and restores it to its previous state once the dump completes, so a container that was already frozen stays frozen across the dump and the file system is captured at the same instant as the CRIU images. This mirrors the approach used by other engines (e.g. CRI-O and containerd).

The default (stopping) checkpoint functionality is not affected by this issue because CRIU leaves the tasks dead after the dump.

This patch also adds a regression test for the consistency of live (--leave-running) checkpoints. The container runs a workload that keeps an in-memory counter in sync with a value written to a file on its root file system, maintaining the invariant that the on-disk value never gets ahead of the in-memory counter.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
068d2c2 to 44d1e68
When checkpointing a container with --leave-running, the processes inside the container continue running after CRIU captures the container's runtime state. Because the rootfs diff and named volumes are saved afterward (in exportCheckpoint()/createCheckpointImage()), this can result in an inconsistent checkpoint, where the CRIU images reflect an earlier point in time than the captured file-system state. To fix this, we freeze the container's cgroup for the duration of the checkpoint operation, similar to the approach used by other engines (e.g. CRI-O, containerd).