From 889cc17ff18953e74aac3fa2ad3a44185ea1292f Mon Sep 17 00:00:00 2001
From: Kir Kolyshkin
Date: Wed, 5 May 2021 12:55:56 -0700
Subject: [PATCH] freezer: add delay after freeze

I hate to keep adding those kludges, but lately TestFreeze (and
TestSystemdFreeze) from libcontainer/integration fails a lot. The
failure comes and goes, and is probably caused by a slow host
allocated for the test, and a slow VM on top of it.

To remediate, add a small sleep on every 25th iteration in between
asking the kernel to freeze and checking its status. In the worst case
scenario (failure to freeze), this adds about 0.4 ms (40 x 10 us) to
the duration of the call.

It is hard to measure how this affects CI as GHA plays a roulette when
allocating a node to run the test on, but it seems to help. With
additional debug info, I saw somewhat frequent "frozen after 24
retries" or "frozen after 49 retries", meaning it succeeded right
after the added sleep.

While at it, rewrite/improve the comments.

Cherry-picked: 524abc59f46373a175b97bd07c4c7eccf5594cc6
Signed-off-by: Sascha Grunert
---
 libcontainer/cgroups/fs/freezer.go | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/libcontainer/cgroups/fs/freezer.go b/libcontainer/cgroups/fs/freezer.go
index 1193ec271..20bb90f82 100644
--- a/libcontainer/cgroups/fs/freezer.go
+++ b/libcontainer/cgroups/fs/freezer.go
@@ -34,20 +34,31 @@ func (s *FreezerGroup) Set(path string, cgroup *configs.Cgroup) error {
 		// kernel commit ef9fe980c6fcc1821), if FREEZING is seen,
 		// userspace should either retry or thaw. While current
 		// kernel cgroup v1 docs no longer mention a need to retry,
-		// the kernel (tested on v5.4, Ubuntu 20.04) can't reliably
-		// freeze a cgroup while new processes keep appearing in it
+		// even a recent kernel (v5.4, Ubuntu 20.04) can't reliably
+		// freeze a cgroup v1 while new processes keep appearing in it
 		// (either via fork/clone or by writing new PIDs to
 		// cgroup.procs).
 		//
-		// The numbers below are chosen to have a decent chance to
-		// succeed even in the worst case scenario (runc pause/unpause
-		// with parallel runc exec).
+		// The numbers below are empirically chosen to have a decent
+		// chance to succeed in various scenarios ("runc pause/unpause
+		// with parallel runc exec" and "bare freeze/unfreeze on a very
+		// slow system"), tested on RHEL7 and Ubuntu 20.04 kernels.
 		//
 		// Adding any amount of sleep in between retries did not
-		// increase the chances of successful freeze.
+		// increase the chances of successful freeze in "pause/unpause
+		// with parallel exec" reproducer. OTOH, adding an occasional
+		// sleep helped for the case where the system is extremely slow
+		// (CentOS 7 VM on GHA CI).
+		//
+		// Alas, this is still a game of chances, since the real fix
+		// belong to the kernel (cgroup v2 do not have this bug).
+
 		for i := 0; i < 1000; i++ {
 			if i%50 == 49 {
-				// Briefly thawing the cgroup also helps.
+				// Occasional thaw and sleep improves
+				// the chances to succeed in freezing
+				// in case new processes keep appearing
+				// in the cgroup.
 				_ = fscommon.WriteFile(path, "freezer.state", string(configs.Thawed))
 				time.Sleep(10 * time.Millisecond)
 			}
@@ -56,6 +67,13 @@ func (s *FreezerGroup) Set(path string, cgroup *configs.Cgroup) error {
 				return err
 			}
 
+			if i%25 == 24 {
+				// Occasional short sleep before reading
+				// the state back also improves the chances to
+				// succeed in freezing in case of a very slow
+				// system.
+				time.Sleep(10 * time.Microsecond)
+			}
 			state, err := fscommon.ReadFile(path, "freezer.state")
 			if err != nil {
 				return err
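
To make the retry strategy easier to follow outside the diff context, below is
a heavily simplified, stand-alone Go sketch of how the freeze loop behaves with
this patch applied. It is not the runc implementation: freezeWithRetry, the raw
os.WriteFile/os.ReadFile calls, and the example cgroup path are illustrative
stand-ins for runc's fscommon helpers and state handling; only the 1000-attempt
loop, the thaw every 50th attempt, and the new 10 us sleep every 25th attempt
mirror the patched Set().

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// freezeWithRetry is a hypothetical, simplified sketch of the retry strategy
// in (*FreezerGroup).Set after this patch; error handling and the
// FREEZING/FROZEN state checks are reduced to the bare minimum.
func freezeWithRetry(cgroupPath string) error {
	stateFile := filepath.Join(cgroupPath, "freezer.state")
	for i := 0; i < 1000; i++ {
		if i%50 == 49 {
			// Occasional thaw plus a longer sleep lets processes that are
			// in the middle of fork/clone settle before trying again.
			_ = os.WriteFile(stateFile, []byte("THAWED"), 0o644)
			time.Sleep(10 * time.Millisecond)
		}

		// Ask the kernel to freeze the cgroup.
		if err := os.WriteFile(stateFile, []byte("FROZEN"), 0o644); err != nil {
			return err
		}

		if i%25 == 24 {
			// The delay added by this patch: give a slow host/VM a moment
			// before checking whether the freeze took effect. If freezing
			// never succeeds, this adds 40 x 10us = 0.4ms in total.
			time.Sleep(10 * time.Microsecond)
		}

		// Read the state back; retry unless the cgroup reports FROZEN.
		data, err := os.ReadFile(stateFile)
		if err != nil {
			return err
		}
		if strings.TrimSpace(string(data)) == "FROZEN" {
			return nil
		}
	}

	return fmt.Errorf("cannot freeze cgroup at %q", cgroupPath)
}

func main() {
	// Hypothetical usage against a manually created cgroup v1 freezer cgroup.
	if err := freezeWithRetry("/sys/fs/cgroup/freezer/mycontainer"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}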