fix(bptree): deep-dive: correctness, safety, and performance fixes#5591
clockworkgr wants to merge 15 commits into feat/jae/bp32tree from feat/alex/bp32tree-deep-dive
Conversation
…t pruning

Iterators never incremented versionReaders because every call site passed version=0 to newIterator. Close() guarded decrVersionReaders on `version > 0`, so the reader-tracking code path was dead. An ImmutableTree iterator on version V did not prevent PruneVersionsTo(V) from deleting nodes the iterator still walks — a silent correctness bug that surfaces as a torn tree read on any long iteration that overlaps a prune.

Exporter already did this correctly (export.go incrVersionReaders on Export, decr on Close). Iterators had the same machinery wired but disconnected.

Changes:
- ImmutableTree now carries an optional `ndb *nodeDB`. MutableTree.GetImmutable sets it so DB-backed snapshots know which nodeDB to register against.
- NewIteratorWithNDB prefers the ImmutableTree's own ndb (so the version it tracks is imm.version, not 0). When present, it increments versionReaders at construction; Close() decrements as before.
- ImmutableTree.Iterator also registers a reader when the snapshot is DB-backed.
- MutableTree.Iterator deliberately does NOT register: the working tree is never prunable (PruneVersionsTo rejects toVersion >= latest).

Tests:
- TestPrune_IteratorBlocksPrune exercises the ImmutableTree.Iterator path.
- TestPrune_StoreIteratorBlocksPrune exercises NewIteratorWithNDB (the store wrapper path).

Both assert that an open iterator blocks pruning of its version and that Close releases the hold. The existing TestPrune_VersionReaders (Exporter path) still passes.
saveNode iterated children with inner.getChild(i), which lazily loaded every sibling from DB even when the sibling was not modified. The inner early-return (`if node.GetNodeKey() != nil return nil`) then skipped the actual save — meaning the DB read was pure waste.

At blockchain scale this is a major performance issue: a single Set touches one root→leaf path (H nodes loaded, with B-1 siblings per level still unloaded). SaveVersion then pulled ~H*(B-1) extra nodes from disk to do nothing with them. For a 100M-leaf tree with B=32, H=6, ~100 mutations/block, that's ~18K unnecessary DB reads per block just during save; most miss the 10K LRU on cold paths.

Fix: walk inner.childNodes directly. If childNodes[i] is nil, the child is unloaded and therefore unchanged — its serialized ref (children[i]) and cached hash (childHashes[i]) are still authoritative from the prior load, and there is nothing to recurse into. Only loaded children (which are either the modified COW'd path or previously-accessed hot nodes) are passed to saveNode; for clean children, saveNode still early-returns, but we skip the DB hit entirely.

Correctness: the COW/split paths all explicitly populate childNodes[i] for nodes that need saving — the root-split newRoot (insert.go:25-37), the innerInsert split distribution (insert.go:195-210), and setChild for the modified child. So any dirty child is guaranteed to be in childNodes. 

Test: TestSaveVersion_DoesNotForceLoadSiblings uses a counting DB wrapper and asserts zero node reads during SaveVersion after a single Set on a reloaded tree. Cache size is set to 2 to force DB reads on any sibling access, making the regression observable.
… Rollback corruption

When SaveVersion detects that the target version already exists with a matching hash (typical of a deterministic replay after a crash), it returned early without clearing t.sessionValues, t.versionOrphans, or t.nextValueNonce. In a deterministic replay the replay's ValueKeys collide with the persisted vks (same version, same allocation order, same nonces), so the replay's eager SaveValue calls simply overwrote the live DB entries with the same content — correct. But because sessionValues still referenced those vks, a subsequent Rollback would DeleteValueDirect each of them, wiping live DB values that the working tree and the persisted tree both point at. The tree then returned nil for Get on those keys.

Fix: clear t.sessionValues / t.versionOrphans / t.nextValueNonce on the idempotent-save path. Do NOT delete any vks from DB — in the collision case those are the live entries, and we cannot cheaply detect divergent allocation orders without scanning every leaf of the persisted tree.

Divergent allocation (e.g., the replay's Set order diverges from the original's) is a pre-existing and unrelated corruption issue: the replay writes different values at the same DB keys, clobbering the originals. That cannot be fixed in SaveVersion — it's a consequence of eager value writes plus deterministic nonce allocation. If/when that becomes a real concern, the right fix is elsewhere (e.g., hash-suffix vks, or deferred value writes). This commit deliberately does not try to solve it.

Tests:
- TestSaveVersion_IdempotentReplayRollbackSafe: deterministic-replay case, asserts Rollback after idempotent save leaves the DB and Gets intact.
- TestSaveVersion_IdempotentEmptySessionNoOp: the trivial case (re-save without any session writes), asserts no DB changes and stable hash.
nodeDB.VersionExists swallowed DB errors from db.Has() and returned false. In SaveVersion this is destructive: if the underlying Has() transiently fails while an existing version N+1 is in the DB with a different hash, SaveVersion would enter the "fresh save" branch and write the new root pointer, overwriting the real version — silent data corruption.

Fix:
- Keep VersionExists(int64) bool for interface compatibility (store.Tree exposes it to external consumers). The boolean form now logs at Error level when the underlying Has returns an error, so the failure class is observable instead of invisible.
- Add a new error-propagating variant versionExistsE(int64) (bool, error) and use it in SaveVersion. A DB failure now aborts the save with a wrapped error rather than silently writing over existing state.

Other callers of VersionExists (PruneVersionsTo, findNextVersion, store Query, etc.) keep the boolean form. They are either robust to a false negative (the prune loop skips a version that actually exists — no corruption, just unfinished work that retries on the next prune) or informational. If more callers need error propagation later, versionExistsE is available.

Test: TestSaveVersion_PropagatesVersionExistsError wraps the DB in a failingHasDB that returns an error for every PrefixRoot key lookup, then asserts SaveVersion returns a wrapped error containing the underlying cause. Before the fix this test would pass silently — SaveVersion would proceed as if the version didn't exist.
…values
ndb.GetValue returned (nil, nil) for both "stored empty value" and "value
not found in DB" (corruption or pruning bug). A missing value then looked
identical to a stored empty value all the way up through MutableTree.Get,
Store.Get, and Iterator.Value — silent data loss masquerading as an empty
result.
The DB contract says "Get returns nil iff key doesn't exist", and compliant
backends preserve empty-vs-missing by returning []byte{} for a stored empty
slice. Non-compliant backends may collapse both to nil. Either way, the
previous GetValue couldn't disambiguate.
Fix: when db.Get returns nil, disambiguate with db.Has.
- nil result + Has=false → wrapped ErrKeyDoesNotExist (corruption signal)
- nil result + Has=true → []byte{} (normalize empty to non-nil)
- non-nil result → passthrough (common path, one DB call)
The second DB call only fires on the nil-result path, so the common
non-empty case pays no cost.
Callers that propagate errors (MutableTree.Get, ImmutableTree.Get,
Iterator.Value, createExistenceProof) now surface corruption loudly
instead of returning apparently-successful nil. Store.Get wraps
resolveValue errors in a panic, which is the correct behavior for a DB
that has lost a value the tree still references.
Tests:
- TestGetValue_MissingValueReturnsError deletes a value directly from the
underlying DB (simulating corruption) and asserts GetValue returns a
wrapped ErrKeyDoesNotExist.
- TestGetValue_EmptyValueReturnsEmptySlice asserts Set(k, []byte{}) round
trips as a non-nil empty slice — both immediately and after SaveVersion.
…rsion edge case)

pruneVersion(v, nextV) only processed orphans[nextV] values, relying on a prior pruneVersion(v-1, v) iteration to have handled orphans[v] values. For the very first pruned version in a batch there is no such prior iteration, so any values listed in orphans[first] would leak — the record was deleted at the end of pruneVersion but its values were not.

In practice this is benign: orphans[V] is only written when SaveVersion(V) had non-empty versionOrphans (displaced values from prior versions). SaveVersion of a first version has no prior state to displace, so orphans[first] is never written. The new TestOrphans_FirstVersionEmpty locks that invariant in across:
- default initialVersion with no sets
- default initialVersion with sets
- default initialVersion with set+remove+set churn
- initialVersion=100 with sets

All produce empty orphans[first]. But the invariant is implicit and fragile — any future change that writes a non-empty orphans[first] (tests, an import path, a different initial-state migration) would silently leak values.

Fix: pruneVersion now processes orphans[v] as well as orphans[nextV]. On iterations where v > first, the prior pruneVersion(v-1, v) already deleted the orphans[v] record, so LoadOrphans returns an empty slice — the call is a no-op with no batch bloat. On the first iteration (v == first), any stray values get cleaned up.

TestPrune_ConsumesOrphansOfFirstVersion seeds a synthetic orphans[1] record pointing at a planted value and asserts PruneVersionsTo(1) removes it — exactly the regression scenario the defensive code catches.
…-leak timebomb)

deleteAllNodesForVersion and deleteSubtree were only called when findNextVersion returned 0, which means "no existing version after v up to latest". But PruneVersionsTo rejects toVersion >= latest, so the loop body always runs with v <= toVersion < latest, and findNextVersion is guaranteed to find at least `latest` as a successor. The nextV == 0 branch was unreachable.

This matters because if the guard were ever relaxed (or bypassed by a future refactor) the path would silently leak values: deleteSubtree walked the tree top-down deleting NODE keys only — it never called DeleteValue on leaf valueKeys, and the path also skipped the orphan-list processing that the dual-tree-walk variant does. The last version's leaf values would all remain in the DB forever.

Fix:
- Remove deleteAllNodesForVersion and deleteSubtree. Dead code hides latent bugs.
- Replace the `if nextV == 0` branch with a loud error that documents the invariant. If this ever fires we want to know, not silently corrupt or leak.

No new tests: the old path wasn't exercised by any test (it was unreachable), and the new error path is purely defensive. All existing prune tests continue to pass.
…ribute

redistributeRight/Left have two cases each (leaf, inner). The leaf cases used copyKey when moving a key between nodes; the inner cases did not. The asymmetry was not an active bug (the shared-slice windows are transient and no code mutates key bytes in place), but it's easy for future changes to break the invariant "keys stored in different nodes are independent slices" — e.g., by introducing a byte-level key transformation.

Tighten to the strict invariant: every key stored in a different node is an independent slice. Leaves and inners now behave the same. Small allocation cost per redistribute; paid once per underflow-recovery, which is rare in typical workloads.

All existing tests pass; no behavior-visible change.
… on corruption)

The type-specific decoders (readInnerNode, readLeafNode) consumed exactly the bytes they expected and left anything extra in the reader. ReadNode discarded the reader afterwards, so a corrupt on-disk payload with extra trailing bytes decoded successfully, as if the truncation at the expected end were intentional — a silent corruption vector.

Fix: after the per-type decoder returns, assert r.Len() == 0 before returning the decoded node. Non-zero means the on-disk payload is longer than the decoder's schema requires — almost certainly corruption or a schema/version mismatch. Fail loudly.

Tests: TestReadNode_RejectsTrailingBytes (leaf) and _Inner append a few garbage bytes to a freshly-serialized node and assert ReadNode returns an error whose message references the trailing bytes. Clean round-trips still succeed.
LeafNode.Serialize wrote a 12-byte zero-filled placeholder when
valueKeys[i] was nil. On deserialization that round-tripped as a
{version:0, nonce:0} NodeKey, which silently maps to "value not found"
on every Get — turning an upstream wiring bug into a silent data-loss
bug that only surfaces at read time, far from the actual cause.
InnerNode.Serialize did not guard children[i] at all. A nil ref would
Write 0 bytes and produce a byte stream one slot short of the expected
length, shifting every subsequent read during deserialization and
potentially decoding as a completely different (but syntactically
valid) node. Even worse than the leaf case.
Fix:
- LeafNode.Serialize returns an error on nil valueKeys[i] (and also
validates length == NodeKeySize).
- InnerNode.Serialize returns an error on nil children[i] (and
validates length).
saveNode is responsible for wiring both fields before Serialize runs —
this commit codifies that invariant as an explicit check, so the next
regression surfaces as a Serialize error at the exact boundary it was
violated, not as a "value missing" three calls later.
Test plan:
- TestLeafSerialize_RejectsNilValueKey: directly constructs a leaf with
one unset valueKey and asserts Serialize errors out.
- TestInnerSerialize_RejectsNilChildRef: same for inner child refs.
- Pre-existing serialization tests (LongKeys, FullLeaf, LeafBasic,
TruncatedLeaf, LeafGoldenBytes) updated to set valueKeys properly.
These tests previously relied on the zero-fill fallback; they test
serialization framing, not valueKey semantics, so the fix is to wire
legitimate valueKeys.
SaveValue writes eagerly (direct db.Set, not batched) so Get works before SaveVersion. Rollback tracks the eager writes via in-memory sessionValues and cleans them up — but if the process crashes (panic, OOM kill, power loss) before SaveVersion or Rollback runs, sessionValues is lost with the process and the values stay in the DB forever, unreferenced by any saved tree. Each crash grows the value DB.

Fix: add nodeDB.cleanupCrashedSessionValues(latestVersion) and call it from MutableTree.Load(). Orphans are exactly the values whose vk.version is strictly greater than latestVersion — SaveVersion(N) commits the tree root pointer, which is the atomic point that "endorses" vks at version <= N. Any vk at version > N was written by a session targeting a version that never materialized.

Value keys are PrefixVal || (8-byte big-endian vk.version) || (4-byte nonce), so orphans form a contiguous suffix of the value keyspace above latestVersion. The iterator is seeded at that boundary — zero cost on clean startup, O(orphans) on recovery. A log line is emitted when the cleanup count is non-zero.

Design note: the README states values are never GC'd and dead values after pruning are "harmless noise" — this commit does NOT change that. It only cleans up values that were never referenced to begin with (crashed sessions that didn't reach SaveVersion). Values from successfully-committed but later-displaced keys continue to be cleaned only through the orphan-list path in prune.go.

Tests:
- TestLoad_CleansUpCrashedSessionValues: Set without SaveVersion, abandon tree, reload — asserts leaks are cleaned and committed values survive.
- TestLoad_CleanShutdownNoCleanup: sanity-check that legitimate values are not touched.
PruneVersionsTo accumulated every node- and value-delete across all pruned versions into a single batch before the final Commit. For a startup catch-up that prunes thousands of versions this means a batch holding millions of pending deletes — potentially blowing RSS on nodes with constrained memory, and a long flush spike at the end. The FlushThreshold option was defined but never wired up.

Use it here: after each pruneVersion call, check the current batch byte size. If it exceeds the threshold, Commit, which resets the batch. The default threshold was 100 KiB (tiny for prune workloads); raise the prune-internal fallback to 4 MiB so typical per-block prunes still commit once.

Safety:
- Pruning is idempotent. If a crash happens after some flushes but before setFirstVersion, the partially-pruned versions have already had their root references deleted from DB. discoverVersions on the next startup recomputes firstVersion correctly, and a retry only re-processes the unfinished tail.
- GetByteSize errors fall through to the final Commit — we lose the intermediate bound but not correctness.

Tests:
- TestPruneVersionsTo_FlushesBatchUnderThreshold sets a 128-byte threshold, prunes 19 versions, and asserts the underlying DB sees multiple NewBatch calls (proving mid-run flushes). The latest version is then reloaded to verify correctness.
- TestPruneVersionsTo_ZeroFlushThresholdUsesDefault verifies that at the default threshold a tiny prune does not unnecessarily flush.
The README claimed "Deduplication: identical values stored once (content-addressed)". The code does not do this — every Set allocates a fresh ValueKey (version, nonce) and writes a new DB entry, so two Set calls with identical values produce two DB entries, not one.

Replace the dedup bullet with an explicit "No content-addressed deduplication" note explaining the actual behavior and why (simplicity over ref-counting). Also add a sentence about the new crashed-session value cleanup on Load(), since that rounds out the "values are never GC'd" story the section is trying to tell.

This is a doc-only change; implementing real dedup is a larger design decision (ref counting, hash-keyed value table) that belongs in its own PR.
readBytes caps length-prefixed fields at 1 MiB (maxReadBytesLen) to prevent OOM from adversarial DB payloads. The write side had no matching cap: Set and Importer.Add accepted arbitrarily long keys. A key over 1 MiB would serialize successfully (writeBytes just writes the varint + bytes) but fail to deserialize on the next Load — readBytes would reject it. That permanently wedges the version containing the oversized key: it's in the persisted bytes, cannot be read back, and therefore cannot be modified or pruned.

Fix:
- Export a MaxKeyLen constant (= maxReadBytesLen) and ErrKeyTooLong.
- Set wraps the error with the actual vs allowed sizes.
- Importer.Add rejects both leaf keys and inner separator keys over the limit (an untrusted export stream must not poison a fresh tree). Also rejects empty leaf keys while we're in the neighborhood — the rest of the code treats an empty key as "invalid".

Tests:
- TestSet_RejectsKeyOverMax: at-boundary succeeds, over-boundary fails with a wrapped ErrKeyTooLong, tree state is unchanged on rejection.
- TestImport_RejectsKeyOverMax: importer rejects an oversized leaf key.
treeInsert took a defensive copyKey of the input unconditionally, but the key is only STORED on the new-insert paths in leafInsert. Updates reuse the existing leaf.keys[pos]; the copied slice is just dropped. For update-heavy workloads (e.g., repeatedly writing to the same keys, which is common in VM state-storage patterns), that's one wasted allocation per Set.

Fix: move the copy from treeInsert into leafInsert, after the update-vs-insert branch. The update path does no key allocation. New-insert and split-overflow paths still take a defensive copy before storing.

Correctness: the copy is still taken wherever the caller's slice ends up referenced by the tree — we haven't weakened the defensive-copy contract. TestTreeInsert_NewKeyStillCopies mutates the caller's input after Set and asserts Get returns the original value.

Tests:
- TestTreeInsert_UpdateDoesNotCopyKey uses AllocsPerRun to bound per-update allocations. Regresses loudly if the unconditional copyKey is reintroduced in treeInsert.
- TestTreeInsert_NewKeyStillCopies verifies the defensive-copy contract is intact on new inserts.
Codecov report: all modified and coverable lines are covered by tests.
Fifteen commits on feat/alex/bp32tree-deep-dive, one per finding, each with tests.

Correctness / safety
- newIterator hard-coded version=0, so incrVersionReaders never fired. PruneVersionsTo could delete nodes a live iterator was walking. Iterators now register against imm.version and release on Close.
- SaveVersion leaked session state (82eebc9) — on a deterministic replay hitting the "version already exists with same hash" path, sessionValues survived the early return. A subsequent Rollback then DeleteValueDirect'd vks that collided with live persisted entries, wiping real data. Now clears session counters on that branch.
- VersionExists swallowed DB errors (536f84a) — a transient Has failure silently returned "does not exist", letting SaveVersion overwrite an existing version. Added versionExistsE that propagates the error; SaveVersion uses it.
- GetValue couldn't distinguish missing from empty (11cf2a2) — returned (nil, nil) for both cases, masking corruption as empty values. Now disambiguates with Has; missing → wrapped ErrKeyDoesNotExist, empty → non-nil []byte{}.
- SaveValue writes eagerly and sessionValues is in-memory only, so a process crash before SaveVersion/Rollback leaked values permanently. Load() now scans the value keyspace above latestVersion (a contiguous suffix — zero cost on clean shutdown) and deletes the orphans.
- MaxKeyLen cap on Set/Import (5f091db) — the read side capped length-prefixed fields at 1 MiB but the write side was unbounded. An oversized key would serialize but fail to deserialize, permanently wedging that version. Set and Importer now enforce the cap with ErrKeyTooLong.
- deleteAllNodesForVersion removed (daaae1d) — the nextV == 0 branch never fired given the toVersion < latest guard, but if reached it deleted nodes without touching leaf values or orphan lists — a dormant value-leak timebomb. Removed; the branch now returns a loud error documenting the invariant.

Fail-fast guards (corruption detection)
- ReadNode trailing-bytes check (c43bf6a) — corrupt payloads with extra bytes previously decoded as silently truncated nodes.
- Serialize rejects nil valueKey/child-ref (5387af5) — the leaf path wrote a 12-byte zero placeholder (round-tripped as {v:0,n:0} → silent "value not found"); the inner path wrote 0 bytes (shifting every subsequent read). Both now error.

Performance / memory
- saveNode force-loaded unchanged siblings (b311fbb) — the big one. inner.getChild(i) fetched every sibling of every COW'd inner just to early-return on "already saved". At blockchain scale (100M leaves, H=6, B=32, ~100 mutations/block) that's ~18K unnecessary DB reads per block. Now walks childNodes directly; unloaded children are unchanged by construction.
- PruneVersionsTo batch memory (4dd84b8) — accumulated every delete across all pruned versions into a single batch. Wired the previously-unused FlushThreshold (default 4 MiB internal) so a startup catch-up prune of thousands of versions doesn't balloon RSS. Pruning is idempotent, so intermediate commits are safe.
- copyKey on update path (ee34a2d) — treeInsert copied the key unconditionally, but updates don't store it. Moved the copy into the new-insert/split branches of leafInsert; update-heavy workloads skip the allocation.

Defensive / hygiene
- pruneVersion only consumed orphans[nextV], assuming a prior iteration handled orphans[v]. For the first iteration there's no prior. orphans[first] is empty in practice (the new invariant test TestOrphans_FirstVersionEmpty locks that in across default/initialVersion/churn cases), but the defensive consumption is free (LoadOrphans returns empty when the record is gone) and catches any future regression.
- redistributeLeft/Right inner-case copyKey (75a04d8) — leaf branches copied on separator transfer, inner branches didn't. Not an active bug (no code mutates key bytes), but the asymmetry was easy to break. Tightened to "every key in a different node is an independent slice."

Docs
- The README claimed content-addressed deduplication; the code allocates a fresh ValueKey on every Set, so identical values produce separate DB entries. Replaced the bullet with an explicit "No content-addressed deduplication" note and added a reference to the new crash-recovery cleanup.

Also verified (no code change)
- walkAndPrune visited-set — old-tree traversal visits each node at most once by construction; newRoot is only used for hash comparison, not deletion. Batch double-delete would be idempotent anyway.

Tests
Every behavior-changing fix has a regression test. New test files in tm2/pkg/bptree/:
- save_sibling_load_test.go
- value_missing_test.go
- version_exists_error_test.go
- orphan_first_test.go
- node_framing_test.go
- crash_recovery_test.go
- prune_flush_test.go
- key_len_test.go
- insert_update_alloc_test.go

Pre-existing suites (tm2/pkg/bptree, tm2/pkg/store/bptree) remain green.

Disclosure
AI-assisted review and implementation.