Review RoT update pre-checks #8349

@karencfv

Description

There are currently a few checks we do before marking an RoT slot as ready to update. With ongoing work in Hubris on transient boot selection, we need to review these checks and make sure they account for any edge cases.

From the discussion in #8295:

lzrd Jun 12, 2025

We'll also need to check the transient boot selection and the pending persistent boot selection. Also, the active slot should match the persistent boot selection.
These checks can't be done, and the conflicts won't be seen, until Hubris PR oxidecomputer/hubris#2050 is merged.
Any deviation would probably indicate a previous failed update. An ignition power-cycle is the big hammer that can be used, but we don't want bugs where we are continually power-cycling equipment. So, RoT and SP resets should be considered if transient and pending persistent are != None. Active != Persistent also needs to be considered.

karencfv Jun 12, 2025

These are the checks I currently have: https://github.com/oxidecomputer/omicron/blob/main/nexus/mgs-updates/src/rot_updater.rs#L370-L386

// If transient boot is being used, the persistent preference is not going to match
// the active slot. At the moment, this mismatch can also mean one of the partitions
// had a bad signature check. We don't have a way to tell these apart yet.
// oxidecomputer/hubris#2066
//
// For now, this discrepancy will mean a bad signature check. That's ok, we can continue.
// The logic here should change when transient boot preference is implemented.
if expected_persistent_boot_preference != active {
    info!(
        log,
        "expected_persistent_boot_preference does not match active slot. \
         This could mean a previous broken update attempt."
    );
}

// If pending_persistent_boot_preference or transient_boot_preference is set,
// an update is in progress and we need to wait.
if transient_boot_preference.is_some() || pending_persistent_boot_preference.is_some() {
    return Err(PrecheckError::EphemeralRotBootPreferenceSet);
}

Do they make sense for now?

Could the discrepancies you mention mean two things?

  1. There might be an ongoing update
  2. It might be a previous failed update

If so, how can I know the difference?
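
To make the current behaviour easier to discuss, here is a minimal, self-contained sketch of the two checks as a pure function. Everything in it is an assumption for illustration: `Slot`, `RotBootState`, `PrecheckOutcome`, and `precheck` are hypothetical names, not the real types in omicron or Hubris, and only the `EphemeralRotBootPreferenceSet` variant comes from the snippet above. It does not attempt to distinguish an ongoing update from a previously failed one, since that is the open question.

```rust
// Hypothetical stand-ins for illustration only; not the omicron/Hubris types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Slot {
    A,
    B,
}

#[derive(Debug, PartialEq, Eq)]
enum PrecheckError {
    // Variant name taken from the snippet in this issue.
    EphemeralRotBootPreferenceSet,
}

// Hypothetical outcome type so the non-fatal mismatch is visible to callers.
#[derive(Debug, PartialEq, Eq)]
enum PrecheckOutcome {
    // Active slot matches the persistent boot preference.
    Clean,
    // Active slot differs from the persistent boot preference.
    // Currently only logged, so the update may still proceed.
    ActiveMismatch,
}

// Assumed shape of the boot state reported by the RoT.
struct RotBootState {
    active: Slot,
    persistent_boot_preference: Slot,
    pending_persistent_boot_preference: Option<Slot>,
    transient_boot_preference: Option<Slot>,
}

fn precheck(state: &RotBootState) -> Result<PrecheckOutcome, PrecheckError> {
    // Check 1: note (but do not fail on) an active/persistent mismatch.
    let mismatch = state.persistent_boot_preference != state.active;

    // Check 2: any ephemeral preference blocks the update.
    if state.transient_boot_preference.is_some()
        || state.pending_persistent_boot_preference.is_some()
    {
        return Err(PrecheckError::EphemeralRotBootPreferenceSet);
    }

    Ok(if mismatch {
        PrecheckOutcome::ActiveMismatch
    } else {
        PrecheckOutcome::Clean
    })
}
```

Returning the mismatch as a value instead of only logging it is a design choice of this sketch; it would let a caller decide later whether a reset is warranted once the two cases can be told apart.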
