Description
There are currently a few checks we do before marking an RoT slot as ready to update. With ongoing work in Hubris on transient boot selection, we need to review these checks and make sure they account for any edge cases.
From the discussion in #8295 :
lzrd Jun 12, 2025
We'll also need to check the transient boot selection and the pending persistent boot selection. Also, the active slot should match the persistent boot selection.
These checks can't be done, and the conflicts won't be seen, until Hubris PR oxidecomputer/hubris#2050 is merged.
Any deviation would probably indicate a previous failed update. An ignition power-cycle is the big hammer that can be used, but we don't want bugs where we are continually power-cycling equipment. So, RoT and SP resets should be considered if transient and pending persistent are != None. Active != Persistent also needs to be considered.
karencfv Jun 12, 2025
These are the checks I currently have https://github.com/oxidecomputer/omicron/blob/main/nexus/mgs-updates/src/rot_updater.rs#L370-L386
// If transient boot is being used, the persistent preference is not going to match
// the active slot. At the moment, this mismatch can also mean one of the partitions
// had a bad signature check. We don't have a way to tell these apart yet.
// oxidecomputer/hubris#2066
//
// For now, this discrepancy will mean a bad signature check. That's ok, we can continue.
// The logic here should change when transient boot preference is implemented.
if expected_persistent_boot_preference != active {
info!(log, "expected_persistent_boot_preference does not match active slot.
This could mean a previous broken update attempt.");
};

// If pending_persistent_boot_preference or transient_boot_preference is/are some,
// then we need to wait, an update is happening.
if transient_boot_preference.is_some() || pending_persistent_boot_preference.is_some() {
return Err(PrecheckError::EphemeralRotBootPreferenceSet);
}

Do they make sense for now?
Could the discrepancies you mention mean two things?
- There might be an ongoing update
- It might be a previous failed update
If so, how can I know the difference?
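
For reference, the checks discussed in this thread can be condensed into a small standalone sketch. This is not the code in `rot_updater.rs`: the `Slot`, `RotBootState`, and `Precheck` types and the `precheck` function are hypothetical names chosen for illustration, and only the field names mirror the snippet above. It also shows why the two cases in the question look identical from these fields alone: an ongoing update and a previously failed one both surface as the same ephemeral-preference result.

```rust
/// Hypothetical slot identifier (assumption: the RoT has two slots, A and B).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Slot {
    A,
    B,
}

/// Hypothetical snapshot of the boot state reported for an RoT.
#[derive(Debug, Clone, Copy)]
pub struct RotBootState {
    pub active: Slot,
    pub persistent_boot_preference: Slot,
    pub pending_persistent_boot_preference: Option<Slot>,
    pub transient_boot_preference: Option<Slot>,
}

/// Hypothetical outcome of the pre-update checks.
#[derive(Debug, PartialEq, Eq)]
pub enum Precheck {
    /// Nothing unusual: active slot matches the persistent preference.
    Ready,
    /// Active != persistent. For now this is treated as a bad signature
    /// check or a previous broken update; the update may continue.
    ReadyWithMismatch,
    /// A transient or pending persistent preference is set. This could be
    /// an ongoing update or a previous failed one; these fields alone do
    /// not distinguish the two.
    EphemeralPreferenceSet,
}

pub fn precheck(state: &RotBootState) -> Precheck {
    // Any ephemeral preference blocks the update, as in the snippet above.
    if state.transient_boot_preference.is_some()
        || state.pending_persistent_boot_preference.is_some()
    {
        return Precheck::EphemeralPreferenceSet;
    }
    // A mismatch is only reported, not treated as an error.
    if state.persistent_boot_preference != state.active {
        return Precheck::ReadyWithMismatch;
    }
    Precheck::Ready
}
```

In the snippet above the mismatch is only logged before the ephemeral check runs, so the order of the two checks does not change the outcome. Whether `EphemeralPreferenceSet` should lead to waiting, an RoT/SP reset, or an ignition power-cycle is the open question in this issue.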