kola: Add soft-reboot support for external tests #4119

Open
wants to merge 1 commit into
base: main

jmarrero
Member

Implements soft-reboot capabilities for Kola,
it enables tests to use systemd's soft-reboot functionality.

The implementation follows the same pattern as regular reboots but invokes systemctl soft-reboot. Because a soft reboot does not change the kernel boot ID, it tracks systemd boot
timestamps rather than kernel boot IDs to detect the state change.
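
To illustrate why the boot ID alone cannot signal a soft reboot, here is a minimal shell sketch of the detection logic. It assumes the systemd manager property `UserspaceTimestampMonotonic` is a suitable "systemd boot timestamp"; the PR text does not name the exact property, and all function names here are hypothetical rather than taken from kola.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative and not taken from kola.

# Kernel boot ID: regenerated on a full reboot, unchanged by a soft reboot.
read_boot_id() {
    cat "${1:-/proc/sys/kernel/random/boot_id}"
}

# Assumption: the manager's UserspaceTimestampMonotonic property is a usable
# "systemd boot timestamp"; the PR text does not name the exact property.
read_userspace_timestamp() {
    systemctl show --property=UserspaceTimestampMonotonic --value
}

# Classify what happened between two observations.
# Args: old_boot_id new_boot_id old_timestamp new_timestamp
classify_reboot() {
    if [ "$1" != "$2" ]; then
        echo "hard-reboot"
    elif [ "$3" != "$4" ]; then
        echo "soft-reboot"
    else
        echo "none"
    fi
}
```

The comparison is kept separate from the two readers so that it can be exercised with plain strings on a machine without systemd.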


openshift-ci bot commented May 30, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

jmarrero added a commit to jmarrero/ostree that referenced this pull request May 30, 2025
@dustymabe
Member

Implements soft-reboot capabilities for Kola,
it enables tests to use systemd's soft-reboot functionality.

Is there a case where a soft-reboot would be a better test than a hard reboot?

Is the goal here to reduce test time?

@jmarrero
Member Author

Right now soft-reboots are not really supported by ostree, but we are implementing them in ostreedev/ostree#3420.
We wanted to use Kola for the tests because we already depend on it. The idea is to reduce reboot time, and we would only support it if no kernel is present in the deployment we are going to soft-reboot to. It's still in progress, but having Kola support it would make our testing cycle easier too.

@dustymabe
Member

OK, just comment here when you move this out of draft.

Member

@cgwalters cgwalters left a comment

Looks good to me overall! Main issue is code duplication with the "hard" reboot path but that's probably hard to fix and not worth doing.

echo "test beginning"
# Check that boot_id stays the same across soft-reboot
INITIAL_BOOT_ID=$(cat /proc/sys/kernel/random/boot_id)
echo "Initial boot ID: $INITIAL_BOOT_ID" | logger
Member

No need for the | logger; we always log tests to the journal.

Comment on lines 16 to 18
# Verify boot_id is the same (soft-reboot should not change it)
CURRENT_BOOT_ID=$(cat /proc/sys/kernel/random/boot_id)
echo "Current boot ID after soft-reboot: $CURRENT_BOOT_ID" | logger
Member

How is this verified though? I think we need to save the boot id to a persistent file like /var/cache/kola-boot-id or something in the first boot and then compare it here.
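
The save-then-compare approach suggested above could look roughly like the following sketch. The function names are hypothetical, and both paths are passed as parameters; `/var/cache/kola-boot-id` is only the example location floated in the review comment.

```shell
#!/bin/bash
# Hypothetical helpers; paths are parameters so the logic can be tried anywhere.

# First boot: record the current boot ID in a file that survives reboots.
save_boot_id() {
    local boot_id_src="$1" state_file="$2"
    cat "$boot_id_src" > "$state_file"
}

# After the soft reboot: succeed only if the boot ID matches the saved one.
verify_boot_id_unchanged() {
    local boot_id_src="$1" state_file="$2"
    local saved current
    saved=$(cat "$state_file") || return 1
    current=$(cat "$boot_id_src") || return 1
    if [ "$saved" != "$current" ]; then
        echo "boot ID changed: $saved -> $current" >&2
        return 1
    fi
}
```

In a test, `save_boot_id /proc/sys/kernel/random/boot_id /var/cache/kola-boot-id` would run before the soft reboot, and `verify_boot_id_unchanged` with the same arguments would run in the stage after it.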

mark2)
echo "test in mark2"
FINAL_BOOT_ID=$(cat /proc/sys/kernel/random/boot_id)
echo "Final boot ID after forced soft-reboot: $FINAL_BOOT_ID" | logger
Member

Ditto

@jmarrero
Member Author

jmarrero commented Jun 3, 2025

OK, the test now dies with a timeout, which is the same behavior I see when I run systemctl soft-reboot without any of our changes to ostree. I will try to see if I can get further with our changes plus this test.

edit: looking at the console log, the VM appears to come back. It's just cosa that loses the ability to ssh back in, even during a cosa run. I have been trying to figure out why.

@jmarrero
Member Author

jmarrero commented Jun 4, 2025

OK, I needed to modify qemu.go to get the test to come back too.

Now I need to figure out this failed service.

× coreos-ignition-firstboot-complete.service - CoreOS Mark Ignition Boot Complete
     Loaded: loaded (/usr/lib/systemd/system/coreos-ignition-firstboot-complete.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: failed (Result: exit-code) since Wed 2025-06-04 22:34:52 UTC; 18s ago
   Duration: 11.967s
 Invocation: 97458820b4e8426184180f42cd04bf13
       Docs: https://docs.fedoraproject.org/en-US/fedora-coreos/
    Process: 2729 ExecStart=/usr/libexec/coreos-ignition-firstboot-complete (code=exited, status=1/FAILURE)
   Main PID: 2729 (code=exited, status=1/FAILURE)
   Mem peak: 2.7M
        CPU: 11ms

Jun 04 22:34:52 cosa-devsh systemd[1]: Starting coreos-ignition-firstboot-complete.service - CoreOS Mark Ignition Boot Complete...
Jun 04 22:34:52 cosa-devsh coreos-ignition-firstboot-complete[2755]: rm: cannot remove '/boot/ignition.firstboot': No such file or directory
Jun 04 22:34:52 cosa-devsh systemd[1]: coreos-ignition-firstboot-complete.service: Main process exited, code=exited, status=1/FAILURE
Jun 04 22:34:52 cosa-devsh systemd[1]: coreos-ignition-firstboot-complete.service: Failed with result 'exit-code'.
Jun 04 22:34:52 cosa-devsh systemd[1]: Failed to start coreos-ignition-firstboot-complete.service - CoreOS Mark Ignition Boot Complete.

@jmarrero jmarrero force-pushed the soft-reboot branch 2 times, most recently from 4944160 to d0743f6 June 4, 2025 23:20
@jmarrero
Member Author

jmarrero commented Jun 4, 2025

https://github.com/coreos/fedora-coreos-config/blob/20feb176f19c3142b7256c1eb5bf1cb7c53b29b9/overlay.d/05core/usr/libexec/coreos-ignition-firstboot-complete#L14

I am not sure of the history of that service, but it looks like we could ignore the failed service without any real issues.

@cgwalters
Member

My offhand guess as to what's happening here is that the service isn't prepared for soft reboots. https://github.com/coreos/fedora-coreos-config/blob/20feb176f19c3142b7256c1eb5bf1cb7c53b29b9/overlay.d/05core/usr/lib/systemd/system/coreos-ignition-firstboot-complete.service#L4 is still triggered because the kernel command line did not change across the soft reboot.

I think CoreOS probably does need to fix this because soft rebooting from the initial boot is a sane thing to do and I see real world use cases for it. It is perhaps as simple as ConditionPathExists=!/boot/ignition.firstboot in the unit.
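
The mismatch described above (the kernel command line still says first boot, but the stamp file is already gone) can be illustrated with a small shell sketch. This is a hypothetical guard, not the actual coreos-ignition-firstboot-complete script or the fix that was eventually adopted; the function name and the parameterized paths are assumptions made for illustration, and it assumes the stamp path is readable where the check runs.

```shell
#!/bin/bash
# Hypothetical guard; not the actual coreos-ignition-firstboot-complete script.

# Succeeds only when the kernel command line still carries ignition.firstboot
# AND the stamp file is still present, i.e. there is something left to clean up.
# Args: path to the kernel command line, path to the stamp file.
firstboot_cleanup_needed() {
    local cmdline_file="${1:-/proc/cmdline}"
    local stamp="${2:-/boot/ignition.firstboot}"
    grep -qwF 'ignition.firstboot' "$cmdline_file" && [ -e "$stamp" ]
}
```

After a soft reboot from the initial boot, the first check still passes while the second fails, which is the situation where the unmodified service exits with status 1.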

jmarrero added a commit to jmarrero/fedora-coreos-config that referenced this pull request Jun 5, 2025
On coreos/coreos-assembler#4119 surfaced that
this service would fail on soft-reboots, it's non fatal but would make
the Kola tests fail. It looks like originally this was set to fail instead of
risking not running it:
https://github.com/coreos/fedora-coreos-config/blob/20feb176f19c3142b7256c1eb5bf1cb7c53b29b9/overlay.d/05core/usr/libexec/coreos-ignition-firstboot-complete#L14
jmarrero added a commit to jmarrero/fedora-coreos-config that referenced this pull request Jun 5, 2025
coreos/coreos-assembler#4119 surfaced that
this service would fail on soft-reboots, it's non fatal but would make
the Kola tests fail. It looks like originally this was set to fail instead of
risking not running it:
https://github.com/coreos/fedora-coreos-config/blob/20feb176f19c3142b7256c1eb5bf1cb7c53b29b9/overlay.d/05core/usr/libexec/coreos-ignition-firstboot-complete#L14
@cgwalters cgwalters mentioned this pull request Jun 5, 2025
@cgwalters
Member

Holy cow, that was painful to figure out, but the fixes in #4133 improve error message handling to the point where I now see:

2025-06-05T20:53:19Z kola: dropping to shell: kolet failed: kolet run-test-unit failed:  mkfifo: cannot create fifo '/run/kolet-reboot': File exists

That was the real bug here: with soft reboots, /run persists (this will be a big trap!), so we need to make the mkfifo idempotent, which I'll do after rebasing this PR on the prep one.
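
One way to make FIFO creation idempotent is sketched below. This is a hypothetical helper and not necessarily the change that landed in kolet; only the `/run/kolet-reboot` path comes from the error message above.

```shell
#!/bin/bash
# Hypothetical helper; the real kolet change may look different.

# Create the FIFO only if it is not already there, so a leftover from before
# a soft reboot (where /run persists) does not cause a failure.
ensure_fifo() {
    local path="$1"
    if [ -p "$path" ]; then
        return 0
    fi
    mkfifo "$path"
}
```

Calling `ensure_fifo /run/kolet-reboot` twice succeeds both times, whereas a second bare `mkfifo` fails with "File exists". If a non-FIFO file already occupies the path, `mkfifo` still fails, which seems preferable to silently deleting it.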

jmarrero added a commit to jmarrero/fedora-coreos-config that referenced this pull request Jun 6, 2025
coreos/coreos-assembler#4119 surfaced that
this service would fail on soft-reboots, it's non fatal but would make
the Kola tests fail. It looks like originally this was set to fail instead of
risking not running it:
https://github.com/coreos/fedora-coreos-config/blob/20feb176f19c3142b7256c1eb5bf1cb7c53b29b9/overlay.d/05core/usr/libexec/coreos-ignition-firstboot-complete#L14
jmarrero added a commit to jmarrero/fedora-coreos-config that referenced this pull request Jun 6, 2025
coreos/coreos-assembler#4119 surfaced that
this service would fail on soft-reboots, it's non fatal but would make
the Kola tests fail. It looks like originally this was set to fail instead of
risking not running it:
https://github.com/coreos/fedora-coreos-config/blob/20feb176f19c3142b7256c1eb5bf1cb7c53b29b9/overlay.d/05core/usr/libexec/coreos-ignition-firstboot-complete#L14

Co-authored-by: Colin Walters <[email protected]>
jmarrero added a commit to jmarrero/fedora-coreos-config that referenced this pull request Jun 6, 2025
…t-reboots

coreos/coreos-assembler#4119 surfaced that
this service would fail on soft-reboots, it's non fatal but would make
the Kola tests fail. It looks like originally this was set to fail instead of
risking not running it:
https://github.com/coreos/fedora-coreos-config/blob/20feb176f19c3142b7256c1eb5bf1cb7c53b29b9/overlay.d/05core/usr/libexec/coreos-ignition-firstboot-complete#L14

Co-authored-by: Colin Walters <[email protected]>
jmarrero added a commit to jmarrero/fedora-coreos-config that referenced this pull request Jun 6, 2025
coreos/coreos-assembler#4119 surfaced that
this services using ConditionKernelCommandLine=ignition.firstboot
would fail on soft-reboots, it's non fatal but would make the Kola
tests fail.

Co-authored-by: Colin Walters <[email protected]>
@jmarrero jmarrero marked this pull request as ready for review June 6, 2025 17:36
Implements soft-reboot capabilities for Kola,
it enables tests to use systemd's soft-reboot functionality.

The implementation follows the same pattern as regular reboots but
for `systemctl soft-reboot`, tracks systemd boot
timestamps rather than kernel boot IDs for state detection.

Co-Authored-By: Colin Walters <[email protected]>
Co-Authored-By: Claude <[email protected]>

Signed-off-by: Colin Walters <[email protected]>
Signed-off-by: Joseph Marrero Corchado <[email protected]>
@jmarrero
Member Author

jmarrero commented Jun 8, 2025

/test rhcos


openshift-ci bot commented Jun 8, 2025

@jmarrero: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/rhcos | 6796892 | link | true | /test rhcos |

jmarrero added a commit to jmarrero/fedora-coreos-config that referenced this pull request Jun 9, 2025
coreos/coreos-assembler#4119 surfaced that
this services using ConditionKernelCommandLine=ignition.firstboot
would fail on soft-reboots, it's non fatal but would make the Kola
tests fail.

Co-authored-by: Jonathan Lebon <[email protected]>
Co-authored-by: Colin Walters <[email protected]>
jlebon added a commit to coreos/fedora-coreos-config that referenced this pull request Jun 10, 2025
coreos/coreos-assembler#4119 surfaced that
this services using ConditionKernelCommandLine=ignition.firstboot
would fail on soft-reboots, it's non fatal but would make the Kola
tests fail.

Co-authored-by: Jonathan Lebon <[email protected]>
Co-authored-by: Colin Walters <[email protected]>
3 participants