Skip to content

packrat ereport storage and snitch implementation #2126

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 61 commits into
base: master
Choose a base branch
from

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Jul 3, 2025

Things fail.

Not finding out about it sucks.

This branch implements the Hubris side of the ereport ingestion system,
as described RFD 545. Work on this was started by @cbiffle in #2002,
which implemented the core ring-buffer data structure used to store
ereports. Meanwhile, oxidecomputer/management-gateway-service#370,
oxidecomputer/omicron#7803, and oxidecomputer/omicron#8296 added
the MGS and Omicron components of this system.

This branch picks up where Cliff left off, and "draws the rest of the
owl" by implementing the aggregation of ereports in the packrat task
using this data structure, and adding a new snitch task, which acts as
a proxy to allow ereports stored by packrat to be read over the
management network.

Architecture

Ereports are stored by packrat because we would like as many tasks as
possible to be able to report errors by making IPC call to the task
responsible for ereport storage. This means that the task aggregating
ereports must be a high-priority task, so that as many other tasks as
possible may be its clients. Additionally, we would like to include the
system's VPD identity as metadata for ereports, and this data is already
stored by packrat. Finally, we would like to minimize the likelihood of
the task that stores ereports crashing, as this would result in data
loss, and packrat already is expected not to crash.

On the other hand, the task that actually evacuates these ereports over
the management network must run at a priority lower than that of the
net task, of which it is a client. Thus the separation of
responsibilities between packrat and the snitch. The snitch task is
fairly simple. It receives packets sent to the ereport socket,
interprets the request message, and forwards the request to packrat. Any
ereports sent back by packrat are sent in response to the request. The
snitch ends up being a pretty dumb, stateless proxy: as the response
packet is encoded by packrat; all we end up doing is taking the bytes
received from packrat and stuffing them into the socket's send queue.
The real purpose of this thing is just to serve as a trampoline between
the high priority level of packrat and a priority level lower than that
of the net task.

snitch-core Fixes

While testing behavior when the ereport buffer is full, I found a
potential panic in the existing snitch-core code. Previously, every
time ereports are read from the buffer while it is in the Losing state
(i.e., ereports have been discarded because the buffer was full),
snitch-core attempts to insert a new loss record at the end of the
buffer (calling recover_if_needed()). This ensures that the data loss
is reported to the reader ASAP. The problem is that this code assumed
that there would always be space for an additional loss record, and
panicked if it didn't fit. I added a test reproducing this panic in
ff93754, and fixed it in
22044d1 by changing the calculation of
whether recovery is possible.

When recover_if_needed is called while in the Losing state, we call
the free_space() method to determine whether we can recover. In the
Losing state, this method would calculate the free space by
subtracting the space required for the loss record
that must be
encoded to transition out of the Losing state. However, in the case
where recover_if_required() is called with required_space: None
(which indicates that we're not trying to recover because we want to
insert a new record, but just because we want to report ongoing data
loss to the caller), we check that the free space is greater than or
equal to 0
. This means that we would still try to insert a loss
record even if the free space was 0, resulting in a panic. I've fixed
this by moving the check that there's space for a loss record out of the
calculation of free_space() and into the required space, in addition
to the requested value (which is 0 in the "we are inserting the loss
record to report loss" case). This way, we only insert the loss record
if it fits, which is the correct behavior.

I've also changed the assignment of ENAs in snitch-core to start at 1,
rather than 0, since ENA 0 is reserved in the wire protocol to indicate
"no ENA". In the "committed ENA" request field this means "don't flush
any ereports", and in the "start ENA" response field, ENA 0 means "no
ereports in this packet". Thus, the ereport store must start assigning
ENAs at ENA 1 for the initial loss record.

Testing

Currently, no tasks actually produce ereports. To test that everything
works correctly, it was necessary to add a source of ereports, so I've
added a little task that just generates test ereports when asked
via hiffy. I've included some of that in [this comment][4]. This was
also used for testing the data-loss behavior discussed above.

@hawkw hawkw force-pushed the eliza/snitch-again branch from e6ba297 to adaa20d Compare July 3, 2025 20:53
@@ -0,0 +1,31 @@
# Gimletlet Ereport test application
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly about this, but I'd be fine having this be part of the basic gimletlet image, since it's additive.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I was thinking we might not always want it to save on flash, but that seems like a problem that could be solved when it actually becomes a problem!

// We *could* include the Hubris image ID here even if our VPD
// identity hasn't been set, but sending an empty metadata map
// ensures that MGS will continue asking for metadata on subsequent
// requests.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Protocol question: what ensures that MGS keeps asking for metadata?

MGS always gets the new restart ID, so subsequent requests from MGS won't take this conditional (unless MGS deliberately avoids updating its restart ID if there's no metadata?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When MGS receives a request to read ereports from a SP, if its metadata map for that SP is empty, it will send a "metadata refresh request" with restart ID 0 and a limit of 0 ereports, before it forwards the request it received:
https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L109-L150

MGS doesn't actually store the restart ID, that's stored by whoever is making requests to MGS. It only updates its metadata map if it receives a non-empty metadata map from the SP, or if the request's restart ID doesn't match the SP's:

https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L351-L360

Looking at the code again, I think the Option around the metadata map complicates things and it really should be checking if the actual map is empty...I think there might be a bug where we don't re-request metadata if the SP restarts and then sends empty meta. I'll go fix that.

ByteGather(r.slices.0, r.slices.1),
);
let mut c =
minicbor::encode::write::Cursor::new(&mut self.recv[..]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we implement minicbor::write::Write on a (wrapper type around a) lease, to avoid having to copy data twice? It would be a decent amount of plumbing, so maybe not required in this PR...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a good idea. we should probably do that in a lib crate, because tasks that want to send ereports to packrat will probably also want to write into a lease.

however, there is another reason for the double-buffering in this instance, which is that we're currently using it to determine how long the ereport is before copying it into the lease, so that we can put it in the lease only if it fits. since what we're doing here is "write as many ereports as fit", we don't want to error out if we run out of space when encoding. we could probably handle that by hanging onto the previous position we had written to, and not advancing it unless we wrote the whole thing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could probably handle that by hanging onto the previous position we had written to, and not advancing it unless we wrote the whole thing.

Yeah, I was thinking along those lines. We'd need a wrapper type and a new error type which distinguishes between "lease write failed" and "ran out of space", so I'm fine punting for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i started on this, but it's just complicated enough i'd kinda prefer to do it in a follow-up branch

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this in #2164, there was a bit of jank due to the minicbor API but I think it was worth it.

hawkw added a commit to oxidecomputer/management-gateway-service that referenced this pull request Jul 11, 2025
Presently, the `ereport::Worker` struct [stores the metadata map in an
`Option`][1]. Metadata refresh requests (`restart_id=0, start_ena=0,
limit=0`) are [sent to the SP if the `Option` is `None`][2]. The option
[is set to `Some`][3] if we receive a packet from the SP where the
metadata map is non-empty, or if the restart ID mismatches the requested
one.

If I recall correctly, the `Option` was intended to distinguish between
"we just started up" and "we received an explicit empty metadata map".
But, I don't actually think we _should_ be distinguishing between those
cases. When the SP has restarted and given us an empty metadata map,
this may be because we requested ereports from `packrat` _before_ VPD
has been loaded (as I discussed in
oxidecomputer/hubris#2126 (comment)).
In that case, when the SP sends us an empty metadata map, we want to
keep requesting the metadata on every subsequent request, as it might be
set later.

Thus, this commit just removes the `Option` and has it start out with an
_empty_ map, and overwrites the existing map if the restart IDs are
mismatched, *or* any time the current map is empty and the received one
is non-empty. I've also added an additional test for this behavior.

Fioxes #409

[1]:
https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L79-L83
[2]:
https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L109-L111
[3]:
https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L351-L360
hawkw added a commit to oxidecomputer/management-gateway-service that referenced this pull request Jul 11, 2025
Presently, the `ereport::Worker` struct [stores the metadata map in an
`Option`][1]. Metadata refresh requests (`restart_id=0, start_ena=0,
limit=0`) are [sent to the SP if the `Option` is `None`][2]. The option
[is set to `Some`][3] if we receive a packet from the SP where the
metadata map is non-empty, or if the restart ID mismatches the requested
one.

If I recall correctly, the `Option` was intended to distinguish between
"we just started up" and "we received an explicit empty metadata map".
But, I don't actually think we _should_ be distinguishing between those
cases. When the SP has restarted and given us an empty metadata map,
this may be because we requested ereports from `packrat` _before_ VPD
has been loaded (as I discussed in
oxidecomputer/hubris#2126 (comment)).
In that case, when the SP sends us an empty metadata map, we want to
keep requesting the metadata on every subsequent request, as it might be
set later.

Thus, this commit just removes the `Option` and has it start out with an
_empty_ map, and overwrites the existing map if the restart IDs are
mismatched, *or* any time the current map is empty and the received one
is non-empty. I've also added an additional test for this behavior.

Fixes #409

[1]:
https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L79-L83
[2]:
https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L109-L111
[3]:
https://github.com/oxidecomputer/management-gateway-service/blob/77e316c812aa057b9714d0d99c4a7bdd36d45be2/gateway-sp-comms/src/ereport.rs#L351-L360
@hawkw hawkw added the ⚠️ ereport if you see something, say something! label Jul 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
⚠️ ereport if you see something, say something!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants