increase the boot partition to 256 MB #3027
I'm slightly confused about what you're proposing. The first part sounds like existing deployments would be resized on upgrade. That's essential, given how close we are to the limit right now. On the other hand, you've only changed the disk layout here, and you also say that older systems won't be upgradeable anymore. I did do an earlier write-up of the potential solutions, and resizing the partition was one of them. Sorry for not sharing that more widely. Assuming you did mean to resize on upgrade, here is what I wrote:
Resize the EFI-SYSTEM partition by shrinking USR
The USR partitions are currently only using about 50% of their 1GB size. These partitions are read-only, so some of this could be given to EFI-SYSTEM. See the current disk layout.
Method
Pros
Cons
I was told this approach was too risky. The space issue aside, we also want to be able to update the bootloader and have a dynamic configuration for better security and more flexibility going forwards. Most of my proposal is really about that, and actually moving the kernel is a relatively small change on top. If the concern is security, then I believe what I have proposed actually makes things more secure than what we have right now. Existing BIOS deployments are arguably slightly less secure, but those were never particularly secure to begin with. If the concern is the ability to actually pull the plan off, then I already have most of it working. Forgive my bias here, but I think the end result is quite nice. |
Just to make things clear: this is the only change I am proposing. As a consequence, the older systems can be upgraded as long as their /boot partition space still allows it. The older systems do not get any sort of partition changes, the WORKFLOW DOES NOT change, and this is the only change across all Flatcar repositories (besides maybe a better error message in the update-engine code to verify that the older systems can fit the new initrd before actually trying to copy it). |
I think both approaches have some merit, and it's great to have a good pro/con discussion. Re: active instances; the vast majority of Flatcar instances out there appears to use in-place updates for upgrading. All these, even the ones newly deployed today, would be affected if we only implement boot partition resizing. Looking at the benefits of both approaches on the other hand, I believe they are complementary: It would be great to have more flexibility re: partition layout and sizes in future versions of Flatcar, and it would be good to have a more robust bootloader process like the one the kernel move implicitly includes. So I would not look at the two approaches as being contradictory, or even competitive - quite the opposite, they enable each other. |
@t-lo I find the workflow proposed in https://hackmd.io/@flatcar/H1MDJuu7lg very complex and thus prone to issues down the road, and I am worried about the security aspect of Flatcar. Adding complexity tends to add bugs or security issues. I also see some limitations to the proposal. Once the change proposed in https://hackmd.io/@flatcar/H1MDJuu7lg is done, it cannot be reverted and we have to stick to this new workflow. Well, this MR change also cannot be reverted, but the implications of this MR change are very clear. I am also thinking about the security profile being changed as proposed in https://hackmd.io/@flatcar/H1MDJuu7lg: this might mean that users need to re-profile / re-certify the new Flatcar booting process from the perspective of security compliance. I also suggest bringing these changes to the weekly meetings and maybe asking for user feedback before actually merging them. This MR change is simple and can be done today; the longer we wait, the more Flatcar instances will get deployed with the 128MB /boot partition. I see the change https://hackmd.io/@flatcar/H1MDJuu7lg as more of a last-resort change that can be done when it really needs to be done. |
I have tracked the remaining space in /boot, looking at how much is taken up by the non-kernel files when initially deployed together with two copies of the kernel from each release. The situation was worst in 3975, before I made some adjustments to buy us a little more time. If you initially deploy that on amd64 and then add two kernel images from the 4368 nightly, that only leaves you with 1.7MB free. The amount the kernel is allowed to increase is only half that. That can disappear very quickly, especially if we enable more of the things we'd actually like or stop going far out of our way to keep the size down. That leaves very little time for systems that started on 3975 to have remedial action applied. Even if you assume that only a small number of systems will have started on 3975, the others won't have much longer. I also imagine that users with very many deployments that began on a range of versions would prefer to see them behave consistently. These users really won't appreciate being told they cannot upgrade their fleet from say, 2 months time, and even a year or so from now would be an unwelcome inconvenience. I have been working under the assumption that we absolutely must not burn this bridge unless we have to. |
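The "only half that" figure above follows from /boot holding two kernel copies, so any per-kernel growth is paid twice. A minimal sketch of that arithmetic, using the 1.7MB figure quoted above (the function name is hypothetical and only for illustration):

```python
def per_kernel_growth_budget(free_mb: float, kernel_copies: int = 2) -> float:
    """How much each kernel image may grow before /boot is full.

    Assumes every copy in /boot grows by the same amount, so the
    remaining free space is shared equally between the copies.
    """
    return free_mb / kernel_copies


# 1.7 MB free after deploying 3975 and adding two 4368 nightly kernels.
budget = per_kernel_growth_budget(1.7)
print(f"Each kernel may grow by about {budget:.2f} MB")
```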
Both good points, I welcome the discussion. Introducing only the new partition format would have limited impact as we would still need to test with the smaller partition sizes, and releases failing this test won't be greenlit. There is an immediate benefit for custom-built images though, they can easily activate kernel options and increase initrd sizes w/o messing with the partitioning themselves. The scheme discussed in the "kernel move" document will allow us to transition without any user impact, and at the same time improve bootloader robustness. The security concepts introduced in that process also work towards our signing / trusted boot efforts (to which Chewi has made significant contributions) - if we doubt any of these, we should absolutely, and openly, discuss. I do second your impression that we're proposing a significant change that needs to be thoroughly discussed. We brought it up in the DevSync and announced the write-up in the latest Office Hours - https://www.youtube.com/live/IXx2ZI963_Q?si=6fCyDtv3xDoBHQqP&t=899 - and we will continue to provide opportunities to further discuss. Your proposal of getting user feedback is awesome, we should definitely take a stab at this. |
@t-lo I see a middle way, though quite a complicated one: to have this PR change introduced sooner rather than later, and introduce the work presented in https://hackmd.io/@flatcar/H1MDJuu7lg in I was also thinking that the kernel move might add some complexity to the current / future ARM64 and maybe RISC-V work: #2556 #2485 - but I need to check them case by case to see how the usage of uboot + grub + DTBs might be impacted. The nice part here is that Flatcar does not officially support these workflows. Also, there is the fallback case, which should be tested and added as a Mantle functional test: Flatcar supports fallback to the good partition if the upgraded version fails to boot. If there is a Flatcar already installed, let's say stable-2024, and it is then upgraded to stable-2026 (with the kernel move), and stable-2026 fails to boot for any reason, then stable-2024 should still be able to boot. The Flatcar initrd (early boot) is already quite complicated, so documentation would be needed for new / potential Flatcar maintainers to better understand the workflow of initrd and bootengine with the new changes. Both scenarios, a basic new instance and upgrades, should have better documentation. I will be glad to work on such documentation, as I found it very hard at first to understand what happens there, and right now the only way to understand it is to look at the boot logs / systemd logs and the bootengine repo. Thanks. |
Looking at what we are trying to achieve overall, it seems quite similar to what the industry has been doing for some time: https://wiki.gentoo.org/wiki/Unified_kernel_image. Maybe we should look at this in more depth, see if the upstream UKI implementations fit the Flatcar workflow (things like https://github.com/systemd/mkosi), and have a broader discussion and decision going forward. As noted by @chewi in the https://hackmd.io/@flatcar/H1MDJuu7lg
#2556 is actually one of my big motivators for doing this. If we were to merge that today, it would immediately push us over the limit. Postponing the migration to when it's absolutely required is just delaying the inevitable, and not even for very long. It would also make upgrades fairly unpredictable for users. Will it do the migration this time or not? I don't believe these changes will make arm64 or RISC-V support any harder. If anything, allowing GRUB upgrades and more flexibility with command line arguments could make it easier. The ability to switch to a pre-migration slot is an essential part of the work. The new GRUB configuration will know how to boot the older kernel. With the new configuration being more flexible, that's not difficult to do. For UEFI systems, the old GRUB image will be installed as a fallback to be invoked by the shim or manually. I also aim to have a menu entry in the new GRUB pointing to the old GRUB. I'm not planning to allow for an actual downgrade (as opposed to just switching the slot) but I gather we don't support that anyway. My document is fairly long because I went into some detail explaining how we'll get there and why I've taken this approach. The actual amount of change may be less than you expect. Most of it concerns the load-a script. Much of it is literally changing the kernel path or just reordering existing parts of the build process. I know there has been interest in UKIs, so I did say early on that having the kernel in /usr would rule them out because EFI firmware cannot load from btrfs. Maybe you can load them from GRUB, but then what's the point? I'm not convinced that Flatcar needs to go down the UKI route. They can simplify the security model, but at a cost of flexibility that I don't think Flatcar can afford. Not using UKIs doesn't make things inherently insecure. Given all that, I don't see my proposal as a temporary workaround. 
Sorry, I'm not sure what having USR-A/USR-B as entire disk images has to do with UKIs. |
@ader1990 would you be available for discussing this (open ended! No bias from our side) in the dev sync on Wednesday? Would be awesome to chat about this more broadly, with the occasional deep dive. |
Hello, a discussion is always good. |
d4d9739 to a997b05
increase the boot partition to 256 MB
The boot partition is currently 128 MB, and its usage on a freshly installed Flatcar will soon exceed 50%. During the upgrade process, the Flatcar upgrade workflow puts the new initrd in the /boot partition, which almost doubles the used space.
Let's describe this used-space calculation more precisely.
Example of du output:
The /boot partition contains the following data:
When an upgrade is made, the space used in the boot partition grows by the size of the new initrd, INITRD_PART_B_SIZE, currently 59 MB.
So, after an upgrade, we have: GRUB_FILES_SIZE + INITRD_PART_A_SIZE + INITRD_PART_B_SIZE ~= 5 + 59 + 59 ~= 123 MB used, which is very close to 128 MB.
Doubling the boot partition size will solve this upcoming issue.
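The calculation above can be sketched as a few lines of Python. This is only an illustration of the arithmetic: the sizes are the rounded figures quoted in this description, and the function name is hypothetical.

```python
# Sizes in MB, taken from the rounded figures quoted in this description.
BOOT_PARTITION_SIZE = 128
GRUB_FILES_SIZE = 5
INITRD_PART_A_SIZE = 59
INITRD_PART_B_SIZE = 59


def used_after_upgrade(grub: int, initrd_a: int, initrd_b: int) -> int:
    """Space used in /boot once both slots hold an initrd."""
    return grub + initrd_a + initrd_b


used = used_after_upgrade(GRUB_FILES_SIZE, INITRD_PART_A_SIZE, INITRD_PART_B_SIZE)

# Compare the current partition size with the proposed doubled size.
for size in (BOOT_PARTITION_SIZE, 2 * BOOT_PARTITION_SIZE):
    free = size - used
    print(f"{size} MB partition: {used} MB used, {free} MB free ({used / size:.0%} full)")
```

With these figures the 128 MB partition is left with about 5 MB of headroom, while a 256 MB partition would be a little under half full.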
I consider making the boot partition bigger a simpler and more elegant solution than the rather complicated approach presented here:
https://hackmd.io/@flatcar/H1MDJuu7lg
There are security implications with the kernel-move approach: for example, the verity hash workflow changes, and with it the security profile.
The downside of this approach is that, after some time, upgrades from older versions won't work anymore, but by the time this becomes an issue, the whole global fleet should have been replaced with new Flatcar deployments. As with the cgroups issue, we can stop the upgrade with an error message if /boot does not have enough space on existing setups.
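A minimal sketch of such a pre-flight check, written in Python purely for illustration. It is not the update-engine implementation; the function names, the margin parameter, and the error message are all hypothetical.

```python
import shutil


def fits(free_bytes: int, new_initrd_bytes: int, margin_bytes: int = 0) -> bool:
    """Pure check: is there room for the new initrd plus a safety margin?"""
    return free_bytes >= new_initrd_bytes + margin_bytes


def check_boot_space(boot_path: str, new_initrd_bytes: int, margin_bytes: int = 0) -> None:
    """Abort with a clear message before any file is copied to /boot.

    Hypothetical helper: boot_path would be the mounted boot partition.
    """
    free = shutil.disk_usage(boot_path).free
    if not fits(free, new_initrd_bytes, margin_bytes):
        raise SystemExit(
            f"Not enough space in {boot_path}: {free} bytes free, "
            f"{new_initrd_bytes + margin_bytes} bytes needed. Upgrade aborted."
        )


if __name__ == "__main__":
    # Example: require room for a 59 MB initrd on /boot.
    check_boot_space("/boot", 59 * 1024 * 1024)
```

Keeping the comparison in a separate pure function makes the decision easy to test without a real partition.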
How to use
[ describe what reviewers need to do in order to validate this PR ]
Testing done
[Describe the testing you have done before submitting this PR. Please include both the commands you issued as well as the output you got.]
changelog/ directory (user-facing change, bug fix, security fix, update)
/boot and /usr size, packages, list files for any missing binaries, kernel modules, config files, etc.