packrat ereport storage and snitch implementation #2126

hawkw · 2025-07-03T20:08:14Z

Things fail.

Not finding out about it sucks.

This branch implements the Hubris side of the ereport ingestion system,
as described RFD 545. Work on this was started by @cbiffle in #2002,
which implemented the core ring-buffer data structure used to store
ereports. Meanwhile, oxidecomputer/management-gateway-service#370,
oxidecomputer/omicron#7803, and oxidecomputer/omicron#8296 added
the MGS and Omicron components of this system.

This branch picks up where Cliff left off, and "draws the rest of the
owl" by implementing the aggregation of ereports in the packrat task
using this data structure, and adding a new snitch task, which acts as
a proxy to allow ereports stored by packrat to be read over the
management network.

Architecture

Ereports are stored by packrat because we would like as many tasks as
possible to be able to report errors by making IPC call to the task
responsible for ereport storage. This means that the task aggregating
ereports must be a high-priority task, so that as many other tasks as
possible may be its clients. Additionally, we would like to include the
system's VPD identity as metadata for ereports, and this data is already
stored by packrat. Finally, we would like to minimize the likelihood of
the task that stores ereports crashing, as this would result in data
loss, and packrat already is expected not to crash.

On the other hand, the task that actually evacuates these ereports over
the management network must run at a priority lower than that of the
net task, of which it is a client. Thus the separation of
responsibilities between packrat and the snitch. The snitch task is
fairly simple. It receives packets sent to the ereport socket,
interprets the request message, and forwards the request to packrat. Any
ereports sent back by packrat are sent in response to the request. The
snitch ends up being a pretty dumb, stateless proxy: as the response
packet is encoded by packrat; all we end up doing is taking the bytes
received from packrat and stuffing them into the socket's send queue.
The real purpose of this thing is just to serve as a trampoline between
the high priority level of packrat and a priority level lower than that
of the net task.

`snitch-core` Fixes

While testing behavior when the ereport buffer is full, I found a
potential panic in the existing snitch-core code. Previously, every
time ereports are read from the buffer while it is in the Losing state
(i.e., ereports have been discarded because the buffer was full),
snitch-core attempts to insert a new loss record at the end of the
buffer (calling recover_if_needed()). This ensures that the data loss
is reported to the reader ASAP. The problem is that this code assumed
that there would always be space for an additional loss record, and
panicked if it didn't fit. I added a test reproducing this panic in
ff93754, and fixed it in
22044d1 by changing the calculation of
whether recovery is possible.

When recover_if_needed is called while in the Losing state, we call
the free_space() method to determine whether we can recover. In the
Losing state, this method would calculate the free space by
subtracting the space required for the loss record that must be
encoded to transition out of the Losing state. However, in the case
where recover_if_required() is called with required_space: None
(which indicates that we're not trying to recover because we want to
insert a new record, but just because we want to report ongoing data
loss to the caller), we check that the free space is greater than or
equal to 0. This means that we would still try to insert a loss
record even if the free space was 0, resulting in a panic. I've fixed
this by moving the check that there's space for a loss record out of the
calculation of free_space() and into the required space, in addition
to the requested value (which is 0 in the "we are inserting the loss
record to report loss" case). This way, we only insert the loss record
if it fits, which is the correct behavior.

I've also changed the assignment of ENAs in snitch-core to start at 1,
rather than 0, since ENA 0 is reserved in the wire protocol to indicate
"no ENA". In the "committed ENA" request field this means "don't flush
any ereports", and in the "start ENA" response field, ENA 0 means "no
ereports in this packet". Thus, the ereport store must start assigning
ENAs at ENA 1 for the initial loss record.

Testing

Currently, no tasks actually produce ereports. To test that everything
works correctly, it was necessary to add a source of ereports, so I've
added a little task that just generates test ereports when asked
via hiffy. I've included some of that in [this comment][4]. This was
also used for testing the data-loss behavior discussed above.

app/grapefruit/app-dev.toml

mkeeter · 2025-07-10T18:39:59Z

app/gimletlet/app-ereportlet.toml

@@ -0,0 +1,31 @@
+# Gimletlet Ereport test application


I don't feel strongly about this, but I'd be fine having this be part of the basic gimletlet image, since it's additive.

Yeah. I was thinking we might not always want it to save on flash, but that seems like a problem that could be solved when it actually becomes a problem!

app/psc/base.toml

task/packrat-api/build.rs

mkeeter · 2025-07-10T19:16:17Z

task/packrat/src/ereport.rs

+            // We *could* include the Hubris image ID here even if our VPD
+            // identity hasn't been set, but sending an empty metadata map
+            // ensures that MGS will continue asking for metadata on subsequent
+            // requests.


Protocol question: what ensures that MGS keeps asking for metadata?

MGS always gets the new restart ID, so subsequent requests from MGS won't take this conditional (unless MGS deliberately avoids updating its restart ID if there's no metadata?)

When MGS receives a request to read ereports from a SP, if its metadata map for that SP is empty, it will send a "metadata refresh request" with restart ID 0 and a limit of 0 ereports, before it forwards the request it received:
https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L109-L150

MGS doesn't actually store the restart ID, that's stored by whoever is making requests to MGS. It only updates its metadata map if it receives a non-empty metadata map from the SP, or if the request's restart ID doesn't match the SP's:

https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L351-L360

Looking at the code again, I think the Option around the metadata map complicates things and it really should be checking if the actual map is empty...I think there might be a bug where we don't re-request metadata if the SP restarts and then sends empty meta. I'll go fix that.

task/packrat/src/ereport.rs

mkeeter · 2025-07-10T19:24:12Z

task/packrat/src/ereport.rs

+                ByteGather(r.slices.0, r.slices.1),
+            );
+            let mut c =
+                minicbor::encode::write::Cursor::new(&mut self.recv[..]);


Should we implement minicbor::write::Write on a (wrapper type around a) lease, to avoid having to copy data twice? It would be a decent amount of plumbing, so maybe not required in this PR...

that's a good idea. we should probably do that in a lib crate, because tasks that want to send ereports to packrat will probably also want to write into a lease.

however, there is another reason for the double-buffering in this instance, which is that we're currently using it to determine how long the ereport is before copying it into the lease, so that we can put it in the lease only if it fits. since what we're doing here is "write as many ereports as fit", we don't want to error out if we run out of space when encoding. we could probably handle that by hanging onto the previous position we had written to, and not advancing it unless we wrote the whole thing.

we could probably handle that by hanging onto the previous position we had written to, and not advancing it unless we wrote the whole thing.

Yeah, I was thinking along those lines. We'd need a wrapper type and a new error type which distinguishes between "lease write failed" and "ran out of space", so I'm fine punting for now.

yeah, i started on this, but it's just complicated enough i'd kinda prefer to do it in a follow-up branch

Did this in #2164, there was a bit of jank due to the minicbor API but I think it was worth it.

task/packrat/src/ereport.rs

lib/snitch-core/src/lib.rs

Presently, the `ereport::Worker` struct [stores the metadata map in an `Option`][1]. Metadata refresh requests (`restart_id=0, start_ena=0, limit=0`) are [sent to the SP if the `Option` is `None`][2]. The option [is set to `Some`][3] if we receive a packet from the SP where the metadata map is non-empty, or if the restart ID mismatches the requested one. If I recall correctly, the `Option` was intended to distinguish between "we just started up" and "we received an explicit empty metadata map". But, I don't actually think we _should_ be distinguishing between those cases. When the SP has restarted and given us an empty metadata map, this may be because we requested ereports from `packrat` _before_ VPD has been loaded (as I discussed in oxidecomputer/hubris#2126 (comment)). In that case, when the SP sends us an empty metadata map, we want to keep requesting the metadata on every subsequent request, as it might be set later. Thus, this commit just removes the `Option` and has it start out with an _empty_ map, and overwrites the existing map if the restart IDs are mismatched, *or* any time the current map is empty and the received one is non-empty. I've also added an additional test for this behavior. Fioxes #409 [1]: https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L79-L83 [2]: https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L109-L111 [3]: https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L351-L360

Presently, the `ereport::Worker` struct [stores the metadata map in an `Option`][1]. Metadata refresh requests (`restart_id=0, start_ena=0, limit=0`) are [sent to the SP if the `Option` is `None`][2]. The option [is set to `Some`][3] if we receive a packet from the SP where the metadata map is non-empty, or if the restart ID mismatches the requested one. If I recall correctly, the `Option` was intended to distinguish between "we just started up" and "we received an explicit empty metadata map". But, I don't actually think we _should_ be distinguishing between those cases. When the SP has restarted and given us an empty metadata map, this may be because we requested ereports from `packrat` _before_ VPD has been loaded (as I discussed in oxidecomputer/hubris#2126 (comment)). In that case, when the SP sends us an empty metadata map, we want to keep requesting the metadata on every subsequent request, as it might be set later. Thus, this commit just removes the `Option` and has it start out with an _empty_ map, and overwrites the existing map if the restart IDs are mismatched, *or* any time the current map is empty and the received one is non-empty. I've also added an additional test for this behavior. Fixes #409 [1]: https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L79-L83 [2]: https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L109-L111 [3]: https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L351-L360

cbiffle and others added 17 commits July 2, 2025 15:32

WIP

5a626d5

disembreak merge conflict with cosmo

da365ee

restart nonce plumbing

7c60074

ringbuf

b3d580e

actually make CBOR lists

6c40169

reticulating packrat

0b6406c

add rng_driver deps

2df36e8

implement more of packrat, feature-flag ereports

9dde194

add VPD metadata

7dca1ed

reticulating app.toml

a0d2954

draw the rest of the owl

4689314

right-size rx buf, give everyone a snitch

c4d4530

rm packrat size limits

31a951e

untangle rng_driver prio mess on gimletlet

c055af3

proper timestampiness

3334d72

PROPERLY untangle priority mess on gimletlet

db5d6d6

BEHOLD THE EREPORTULATOR

adaa20d

hawkw force-pushed the eliza/snitch-again branch from e6ba297 to adaa20d Compare July 3, 2025 20:53

hawkw added 6 commits July 3, 2025 13:53

shut clippy up

653765f

always send empty meta map

3dd9f20

dont ringbuf zerocopy types

60bb54a

be smart about not overflowing buf

5f1634b

implement limit

ee7d004

give ereportulator a Hiffy interface

809fe11

hawkw mentioned this pull request Jul 5, 2025

mechanism for jefe to deliver ereports to packrat #2127

Open

hawkw added 5 commits July 6, 2025 12:35

fix type mismatch for disabled ereport IPCs

90b401d

packrat docs

728dfbb

don't crash packrat on internal errors

3135cb8

embiggen packrat ereport stacks

9745992

reticulating stack sizes

9a9befa