fix(gnovm/store): body-first AddMemPackage ordering + fail-fast IterMemPackage #5605

Open
moul wants to merge 5 commits into gnolang:master from moul:dev/moul/gnovm-store-body-first

@moul (Member) commented Apr 27, 2026

Split out from #5597 (review-only stack) for atomic review.

What changes

defaultStore.AddMemPackage previously wrote in the order: counter → index slot → body. A SIGKILL between any two of those left the store inconsistent — counter pointing at an index pointing at nothing, or counter ahead of an unwritten index slot. IterMemPackage then either panicked deep in a producer goroutine or yielded a nil *std.MemPackage that SIGSEGV'd in ParseMemPackage on the consumer side.

This PR:

  1. Reorders writes in AddMemPackage to body (iavlStore) → index slot (baseStore) → counter bump (baseStore). Each crash window is now either harmless (orphaned body, never iterated) or self-healing on retry (slot at N+1 with counter still N gets overwritten on next add).
  2. Makes IterMemPackage fail-fast on observed inconsistency. With body-first ordering the inconsistencies handled here should be unreachable; if they ever surface, the substores have diverged below the gnovm layer (DB-level WAL crash) and the only safe recovery is replay from a clean snapshot. The panic message names the slot/path and tells the operator what to do — much better than a quiet `fmt.Fprintln` of a corruption warning, and much better than yielding nil and SIGSEGV'ing later.
  3. Validation runs eagerly on the caller's goroutine. The previous channel-based design ran validation inside a producer goroutine where panics couldn't be recovered by tests and would crash the process without surfacing through a useful trace. The eager pass costs O(N) memory at boot, which is fine for restart-time iteration.
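The crash windows described above can be illustrated with a toy model. This is a minimal sketch, not the actual `defaultStore` code: the `store` type, its map-backed fields, the `crashAfter` parameter, and the `example/a` path are all hypothetical stand-ins for iavlStore, baseStore, and a SIGKILL between writes.

```go
package main

import "fmt"

// store is a hypothetical in-memory model of the three pieces of state:
// bodies stands in for iavlStore, index and counter for baseStore.
type store struct {
	bodies  map[string]string // package path -> body
	index   map[int]string    // slot number -> package path
	counter int               // number of committed slots
}

func newStore() *store {
	return &store{bodies: map[string]string{}, index: map[int]string{}}
}

// runUntilCrash executes steps in order and stops before step number
// crashAfter, simulating a process kill after crashAfter completed writes.
func runUntilCrash(crashAfter int, steps ...func()) {
	for i, step := range steps {
		if i >= crashAfter {
			return
		}
		step()
	}
}

// addCounterFirst models the previous ordering: counter, index slot, body.
func (s *store) addCounterFirst(path, body string, crashAfter int) {
	runUntilCrash(crashAfter,
		func() { s.counter++ },
		func() { s.index[s.counter] = path },
		func() { s.bodies[path] = body },
	)
}

// addBodyFirst models the new ordering: body, index slot, counter bump.
func (s *store) addBodyFirst(path, body string, crashAfter int) {
	runUntilCrash(crashAfter,
		func() { s.bodies[path] = body },
		func() { s.index[s.counter+1] = path },
		func() { s.counter++ },
	)
}

// consistent reports whether every slot in 1..counter resolves to a body.
func (s *store) consistent() bool {
	for slot := 1; slot <= s.counter; slot++ {
		path, ok := s.index[slot]
		if !ok {
			return false
		}
		if _, ok := s.bodies[path]; !ok {
			return false
		}
	}
	return true
}

func main() {
	// crash == 3 means all three writes completed.
	for crash := 0; crash <= 3; crash++ {
		oldS, newS := newStore(), newStore()
		oldS.addCounterFirst("example/a", "body a", crash)
		newS.addBodyFirst("example/a", "body a", crash)
		fmt.Printf("crash after %d writes: counter-first consistent=%v, body-first consistent=%v\n",
			crash, oldS.consistent(), newS.consistent())
	}
}
```

In this model, a body-first add interrupted after two writes leaves slot 1 populated while the counter is still 0, so the next add simply overwrites slot 1. That is the self-healing case from item 1.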

Replay walltime on a real gno-cluster node dropped from ~12 min to ~36 s as a side effect — we no longer redo aborted writes on retry.

Why fail-fast over yield-nil-and-skip

An earlier draft of this PR was paired with a consumer-side defensive nil-skip in `PreprocessAllFilesAndSaveBlockNodes` (now closed: #5606). That approach was wrong: it converted hard corruption into silent semi-corruption, with the realm's owner seeing random VM errors at first import, which is strictly harder to diagnose than a clear panic at boot. For a node about to enter consensus, booting with quarantined corrupt state is also a determinism risk (this node may hash differently than the rest of the network).

The right behaviour is: refuse to boot, name the inconsistency, point the operator at the recovery path. A real atomic write across substores (pebble batch spanning both) is the proper fix and belongs in a follow-up PR — but until then, body-first ordering plus loud failure is the correct posture.
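A rough sketch of the fail-fast posture, under stated assumptions: `loadAll`, `tryLoad`, and `hasAll` are hypothetical helpers, the maps stand in for the two substores, and the panic wording is illustrative (only the fragments "corrupt package index", "substore divergence", "slot N", and "replay" come from the test descriptions in this PR). Validation runs in a plain loop on the caller's goroutine, so a caller or test can observe the panic with `recover`.

```go
package main

import (
	"fmt"
	"strings"
)

// loadAll eagerly validates slots 1..counter and returns the bodies in slot
// order. It panics on the first inconsistency, naming the slot and the
// recovery path, instead of yielding a nil entry to the consumer.
func loadAll(counter int, index map[int]string, bodies map[string]string) []string {
	out := make([]string, 0, counter)
	for slot := 1; slot <= counter; slot++ {
		path, ok := index[slot]
		if !ok {
			panic(fmt.Sprintf(
				"corrupt package index: slot %d is empty; replay from a clean snapshot", slot))
		}
		body, ok := bodies[path]
		if !ok {
			panic(fmt.Sprintf(
				"substore divergence: slot %d names %q but no body is stored; replay from a clean snapshot",
				slot, path))
		}
		out = append(out, body)
	}
	return out
}

// tryLoad calls loadAll and returns the recovered panic text, or "" when the
// load succeeded. Recovery works because the panic is raised on this goroutine.
func tryLoad(counter int, index map[int]string, bodies map[string]string) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	loadAll(counter, index, bodies)
	return ""
}

// hasAll reports whether msg contains every fragment in parts.
func hasAll(msg string, parts ...string) bool {
	for _, p := range parts {
		if !strings.Contains(msg, p) {
			return false
		}
	}
	return true
}

func main() {
	// Counter says one package exists, but the index slot was never written.
	fmt.Println(tryLoad(1, map[int]string{}, map[string]string{}))

	// Slot 2 names a package whose body is missing from the other substore.
	index := map[int]string{1: "example/a", 2: "example/b"}
	bodies := map[string]string{"example/a": "body a"}
	fmt.Println(tryLoad(2, index, bodies))
}
```

Had the same checks run inside a producer goroutine, the deferred `recover` in `tryLoad` would not see the panic, since `recover` only applies to the goroutine that is panicking.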

Tests

gnovm/pkg/gnolang/store_test.go:

  • `TestAddMemPackage_WriteOrderIsBodyFirst` — captures the actual store call sequence, asserts body before index before counter, then round-trips two adds via `IterMemPackage` to prove the happy path still works.
  • `TestIterMemPackage_MissingIndexPanics` — counter > 0, index slot empty → must panic with "corrupt package index", "slot 1", "replay".
  • `TestIterMemPackage_InconsistentBaseStorePanics` — index slot present, iavlStore body absent → must panic with "substore divergence", "slot 2", "replay".

All three pass. `go test ./gnovm/pkg/gnolang/ -short` and `go test ./gno.land/pkg/sdk/vm/ -run Gas` both pass locally.
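For readers unfamiliar with the call-sequence technique used by the write-order test, here is a minimal sketch of one way to capture it. Everything in it is an assumption for illustration: the `setter` interface, the `recorder` wrapper, the `addPkg` stand-in, and the key formats are hypothetical and do not mirror the real store interfaces or the actual test code.

```go
package main

import (
	"fmt"
	"strings"
)

// setter is a hypothetical minimal write interface for a substore.
type setter interface {
	Set(key, value string)
}

// memStore is a map-backed setter used as the underlying store.
type memStore map[string]string

func (m memStore) Set(key, value string) { m[key] = value }

// recorder wraps a setter and appends "label:key" to a shared log on every
// Set call, so writes to different substores land in one ordered sequence.
type recorder struct {
	label string
	inner setter
	log   *[]string
}

func (r recorder) Set(key, value string) {
	*r.log = append(*r.log, r.label+":"+key)
	r.inner.Set(key, value)
}

// addPkg is a hypothetical stand-in for the function under test. It writes
// the body first, then the index slot, then the counter.
func addPkg(iavl, base setter, slot int, path, body string) {
	iavl.Set("pkg:"+path, body)
	base.Set(fmt.Sprintf("idx:%d", slot), path)
	base.Set("ctr", fmt.Sprint(slot))
}

// recordAdd performs one add through recording wrappers and returns the log.
func recordAdd() []string {
	var log []string
	iavl := recorder{label: "iavl", inner: memStore{}, log: &log}
	base := recorder{label: "base", inner: memStore{}, log: &log}
	addPkg(iavl, base, 1, "example/a", "body a")
	return log
}

// sameOrder reports whether log has exactly one entry per wanted prefix,
// in the same order.
func sameOrder(log []string, wantPrefixes ...string) bool {
	if len(log) != len(wantPrefixes) {
		return false
	}
	for i, p := range wantPrefixes {
		if !strings.HasPrefix(log[i], p) {
			return false
		}
	}
	return true
}

func main() {
	log := recordAdd()
	fmt.Println(log)
	fmt.Println("body-first:", sameOrder(log, "iavl:pkg:", "base:idx:", "base:ctr"))
}
```

Sharing one log slice between both wrappers is what makes cross-store ordering observable; separate per-store logs would only show the order within each substore.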

Context

One slice of the test13 hardfork-readiness stack at #5597 (rc5-master `0e423f30`), reworked here in response to review feedback to drop the consumer-side defensive skip. A dedicated ADR will follow.

cc @aeddi (original commit author preserved).

Co-authored from #5597.

@Gno2D2 (Collaborator) commented Apr 27, 2026

🛠 PR Checks Summary

All Automated Checks passed. ✅

Manual Checks (for Reviewers):
  • IGNORE the bot requirements for this PR (force green CI check)

🤖 This bot helps streamline PR reviews by verifying automated checks and providing guidance for contributors and reviewers.

✅ Automated Checks (for Contributors):

🟢 Maintainers must be able to edit this pull request (more info)

☑️ Contributor Actions:
  1. Fix any issues flagged by automated checks.
  2. Follow the Contributor Checklist to ensure your PR is ready for review.
    • Add new tests, or document why they are unnecessary.
    • Provide clear examples/screenshots, if necessary.
    • Update documentation, if required.
    • Ensure no breaking changes, or include BREAKING CHANGE notes.
    • Link related issues/PRs, where applicable.
☑️ Reviewer Actions:
  1. Complete manual checks for the PR, including the guidelines and additional checks if applicable.
📚 Resources:
Debug
Automated Checks
Maintainers must be able to edit this pull request (more info)

If

🟢 Condition met
└── 🟢 And
    ├── 🟢 The base branch matches this pattern: ^master$
    └── 🟢 The pull request was created from a fork (head branch repo: moul/gno)

Then

🟢 Requirement satisfied
└── 🟢 Maintainer can modify this pull request

Manual Checks
**IGNORE** the bot requirements for this PR (force green CI check)

If

🟢 Condition met
└── 🟢 On every pull request

Can be checked by

  • Any user with comment edit permission

@codecov (Bot) commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 88.57143% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| gnovm/pkg/gnolang/store.go | 88.57% | 2 Missing and 2 partials ⚠️ |


…slices

Lint fallout from the body-first ordering change: AddMemPackage no longer
calls incGetPackageIndexCounter so it's now unused; the two new test cases
were flagged by prealloc — both bound the slice to ctr.
@moul marked this pull request as draft April 27, 2026 18:13
Reverses the earlier 'yield nil + let consumer skip' approach: with the
body-first ordering in AddMemPackage neither a missing-index nor a
missing-body should be reachable, so observing one means the substores
have diverged below the gnovm layer (DB-level WAL crash). Feeding nil
mpkgs to consumers would just SIGSEGV later in ParseMemPackage and lose
the cause. Panic at the source with a message that names the slot/path
and tells the operator to replay from a clean snapshot.

Validation runs eagerly on the caller's goroutine (one O(N) load at boot)
so the panic surfaces at the call site instead of inside an orphan
goroutine where tests can't recover it. Memory cost is acceptable for
restart-time iteration.

Tests assert the two corrupt-state panics carry the expected message and
slot identifier; the happy-path round-trip in
TestAddMemPackage_WriteOrderIsBodyFirst still passes.
@moul changed the title from "fix(gnovm/store): body-first AddMemPackage ordering + skip-don't-panic in IterMemPackage" to "fix(gnovm/store): body-first AddMemPackage ordering + fail-fast IterMemPackage" Apr 27, 2026
@moul marked this pull request as ready for review April 27, 2026 19:38
@moul requested review from omarsy and thehowl April 27, 2026 20:55
Labels

📦 🤖 gnovm Issues or PRs gnovm related
