fix(gnovm/store): body-first AddMemPackage ordering + fail-fast IterMemPackage#5605
Open
moul wants to merge 5 commits intognolang:masterfrom
Open
fix(gnovm/store): body-first AddMemPackage ordering + fail-fast IterMemPackage#5605moul wants to merge 5 commits intognolang:masterfrom
moul wants to merge 5 commits intognolang:masterfrom
Conversation
…c in IterMemPackage
Collaborator
🛠 PR Checks SummaryAll Automated Checks passed. ✅ Manual Checks (for Reviewers):
Read More🤖 This bot helps streamline PR reviews by verifying automated checks and providing guidance for contributors and reviewers. ✅ Automated Checks (for Contributors):🟢 Maintainers must be able to edit this pull request (more info) ☑️ Contributor Actions:
☑️ Reviewer Actions:
📚 Resources:Debug
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
…slices Lint fallout from the body-first ordering change: AddMemPackage no longer calls incGetPackageIndexCounter so it's now unused; the two new test cases were flagged by prealloc — both bound the slice to ctr.
Reverses the earlier 'yield nil + let consumer skip' approach: with the body-first ordering in AddMemPackage neither a missing-index nor a missing-body should be reachable, so observing one means the substores have diverged below the gnovm layer (DB-level WAL crash). Feeding nil mpkgs to consumers would just SIGSEGV later in ParseMemPackage and lose the cause. Panic at the source with a message that names the slot/path and tells the operator to replay from a clean snapshot. Validation runs eagerly on the caller's goroutine (one O(N) load at boot) so the panic surfaces at the call site instead of inside an orphan goroutine where tests can't recover it. Memory cost is acceptable for restart-time iteration. Tests assert the two corrupt-state panics carry the expected message and slot identifier; the happy-path round-trip in TestAddMemPackage_WriteOrder IsBodyFirst still passes.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Split out from #5597 (review-only stack) for atomic review.
What changes
defaultStore.AddMemPackagepreviously wrote in the order: counter → index slot → body. A SIGKILL between any two of those left the store inconsistent — counter pointing at an index pointing at nothing, or counter ahead of an unwritten index slot.IterMemPackagethen either panicked deep in a producer goroutine or yielded a nil*std.MemPackagethat SIGSEGV'd inParseMemPackageon the consumer side.This PR:
AddMemPackageto body (iavlStore) → index slot (baseStore) → counter bump (baseStore). Each crash window is now either harmless (orphaned body, never iterated) or self-healing on retry (slot at N+1 with counter still N gets overwritten on next add).IterMemPackagefail-fast on observed inconsistency. With body-first ordering the inconsistencies handled here should be unreachable; if they ever surface, the substores have diverged below the gnovm layer (DB-level WAL crash) and the only safe recovery is replay from a clean snapshot. The panic message names the slot/path and tells the operator what to do — much better than a quiet `fmt.Fprintln` of a corruption warning, and much better than yielding nil and SIGSEGV'ing later.Replay walltime on a real gno-cluster node dropped from ~12 min to ~36 s as a side effect — we no longer redo aborted writes on retry.
Why fail-fast over yield-nil-and-skip
An earlier draft of this PR paired with a consumer-side defensive nil-skip in `PreprocessAllFilesAndSaveBlockNodes` (now closed: #5606). That approach was wrong: it converted hard corruption into silent semi-corruption, with the realm's owner seeing random VM errors at first import — strictly harder to diagnose than a clear panic at boot. For a node about to enter consensus, booting with quarantined corrupt state is also a determinism risk (this node may hash differently than the rest of the network).
The right behaviour is: refuse to boot, name the inconsistency, point the operator at the recovery path. A real atomic write across substores (pebble batch spanning both) is the proper fix and belongs in a follow-up PR — but until then, body-first ordering plus loud failure is the correct posture.
Tests
gnovm/pkg/gnolang/store_test.go:All three pass. `go test ./gnovm/pkg/gnolang/ -short` and `go test ./gno.land/pkg/sdk/vm/ -run Gas` both pass locally.
Context
One slice of the test13 hardfork-readiness stack at #5597 (rc5-master `0e423f30`), reworked here in response to review feedback to drop the consumer-side defensive skip. A dedicated ADR will follow.
cc @aeddi (original commit author preserved).
Co-authored from #5597.