From 889cc17ff18953e74aac3fa2ad3a44185ea1292f Mon Sep 17 00:00:00 2001
From: Kir Kolyshkin
Date: Wed, 5 May 2021 12:55:56 -0700
Subject: [PATCH] freezer: add delay after freeze

I hate to keep adding those kludges, but lately TestFreeze (and
TestSystemdFreeze) from libcontainer/integration fails a lot. The
failure comes and goes, and is probably caused by a slow host
allocated for the test, and a slow VM on top of it.

To remediate, add a small sleep on every 25th iteration in between
asking the kernel to freeze and checking its status. In the worst case
scenario (failure to freeze), this adds about 0.4 ms (40 x 10 us) to
the duration of the call.

It is hard to measure how this affects CI as GHA plays a roulette when
allocating a node to run the test on, but it seems to help. With
additional debug info, I saw somewhat frequent "frozen after 24
retries" or "frozen after 49 retries", meaning it succeeded right
after the added sleep.

While at it, rewrite/improve the comments.

Cherry-picked: 524abc59f46373a175b97bd07c4c7eccf5594cc6
Signed-off-by: Sascha Grunert
---
 libcontainer/cgroups/fs/freezer.go | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/libcontainer/cgroups/fs/freezer.go b/libcontainer/cgroups/fs/freezer.go
index 1193ec271..20bb90f82 100644
--- a/libcontainer/cgroups/fs/freezer.go
+++ b/libcontainer/cgroups/fs/freezer.go
@@ -34,20 +34,31 @@ func (s *FreezerGroup) Set(path string, cgroup *configs.Cgroup) error {
 		// kernel commit ef9fe980c6fcc1821), if FREEZING is seen,
 		// userspace should either retry or thaw. While current
 		// kernel cgroup v1 docs no longer mention a need to retry,
-		// the kernel (tested on v5.4, Ubuntu 20.04) can't reliably
-		// freeze a cgroup while new processes keep appearing in it
+		// even a recent kernel (v5.4, Ubuntu 20.04) can't reliably
+		// freeze a cgroup v1 while new processes keep appearing in it
 		// (either via fork/clone or by writing new PIDs to
 		// cgroup.procs).
 		//
-		// The numbers below are chosen to have a decent chance to
-		// succeed even in the worst case scenario (runc pause/unpause
-		// with parallel runc exec).
+		// The numbers below are empirically chosen to have a decent
+		// chance to succeed in various scenarios ("runc pause/unpause
+		// with parallel runc exec" and "bare freeze/unfreeze on a very
+		// slow system"), tested on RHEL7 and Ubuntu 20.04 kernels.
 		//
 		// Adding any amount of sleep in between retries did not
-		// increase the chances of successful freeze.
+		// increase the chances of successful freeze in "pause/unpause
+		// with parallel exec" reproducer. OTOH, adding an occasional
+		// sleep helped for the case where the system is extremely slow
+		// (CentOS 7 VM on GHA CI).
+		//
+		// Alas, this is still a game of chances, since the real fix
+		// belong to the kernel (cgroup v2 do not have this bug).
+
 		for i := 0; i < 1000; i++ {
 			if i%50 == 49 {
-				// Briefly thawing the cgroup also helps.
+				// Occasional thaw and sleep improves
+				// the chances to succeed in freezing
+				// in case new processes keep appearing
+				// in the cgroup.
 				_ = fscommon.WriteFile(path, "freezer.state", string(configs.Thawed))
 				time.Sleep(10 * time.Millisecond)
 			}
@@ -56,6 +67,13 @@ func (s *FreezerGroup) Set(path string, cgroup *configs.Cgroup) error {
 				return err
 			}
 
+			if i%25 == 24 {
+				// Occasional short sleep before reading
+				// the state back also improves the chances to
+				// succeed in freezing in case of a very slow
+				// system.
+				time.Sleep(10 * time.Microsecond)
+			}
 			state, err := fscommon.ReadFile(path, "freezer.state")
 			if err != nil {
 				return err
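
To make the retry strategy easier to follow outside the diff context, below is
a heavily simplified, stand-alone Go sketch of how the freeze loop behaves with
this patch applied. It is not the runc implementation: freezeWithRetry, the raw
os.WriteFile/os.ReadFile calls, and the example cgroup path are illustrative
stand-ins for runc's fscommon helpers and state handling; only the 1000-attempt
loop, the thaw every 50th attempt, and the new 10 us sleep every 25th attempt
mirror the patched Set().

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// freezeWithRetry is a hypothetical, simplified sketch of the retry strategy
// in (*FreezerGroup).Set after this patch; error handling and the
// FREEZING/FROZEN state checks are reduced to the bare minimum.
func freezeWithRetry(cgroupPath string) error {
	stateFile := filepath.Join(cgroupPath, "freezer.state")
	for i := 0; i < 1000; i++ {
		if i%50 == 49 {
			// Occasional thaw plus a longer sleep lets processes that are
			// in the middle of fork/clone settle before trying again.
			_ = os.WriteFile(stateFile, []byte("THAWED"), 0o644)
			time.Sleep(10 * time.Millisecond)
		}

		// Ask the kernel to freeze the cgroup.
		if err := os.WriteFile(stateFile, []byte("FROZEN"), 0o644); err != nil {
			return err
		}

		if i%25 == 24 {
			// The delay added by this patch: give a slow host/VM a moment
			// before checking whether the freeze took effect. If freezing
			// never succeeds, this adds 40 x 10us = 0.4ms in total.
			time.Sleep(10 * time.Microsecond)
		}

		// Read the state back; retry unless the cgroup reports FROZEN.
		data, err := os.ReadFile(stateFile)
		if err != nil {
			return err
		}
		if strings.TrimSpace(string(data)) == "FROZEN" {
			return nil
		}
	}

	return fmt.Errorf("cannot freeze cgroup at %q", cgroupPath)
}

func main() {
	// Hypothetical usage against a manually created cgroup v1 freezer cgroup.
	if err := freezeWithRetry("/sys/fs/cgroup/freezer/mycontainer"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}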