What happened
A key written twice within one batch can read back its OLD value through the read buffer, even though the buffer's own contract is newest-wins. This is a transient in-memory read-your-writes violation on non-safeRange buckets (Meta, Lease, Auth, Cluster, Alarm). It is a stale read out of the in-memory buffer, not a durability/data-loss problem; the committed bbolt contents are unaffected.
The defect is in bucketBuffer.merge in server/storage/backend/tx_buffer.go. When the read buffer already holds a bucket, txWriteBuffer.writeback folds the write buffer in via merge. merge appends the source entries and then takes a no-overlap fast-path (around line 222) that returns early without calling dedupe when the source's smallest key is greater than the destination's largest key. That fast-path only checks the destination/source boundary; it is blind to duplicates internal to the source. So a key duplicated within one batch survives into the merged buffer.
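The control flow described above can be modelled outside etcd. The following is a minimal sketch, not the etcd source: `buffer`, its `add`, `merge`, `dedupe`, and `String` methods are simplified stand-ins for `bucketBuffer`, and the empty-source guard in `merge` is an addition made only so the sketch cannot index out of range.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
	"strings"
)

// kv and buffer are simplified stand-ins for the types in tx_buffer.go.
type kv struct{ key, val []byte }

type buffer struct{ entries []kv }

// add appends unconditionally; it never checks for an existing key.
func (b *buffer) add(k, v string) {
	b.entries = append(b.entries, kv{[]byte(k), []byte(v)})
}

// dedupe stable-sorts by key and keeps only the newest entry per key.
func (b *buffer) dedupe() {
	if len(b.entries) <= 1 {
		return
	}
	sort.SliceStable(b.entries, func(i, j int) bool {
		return bytes.Compare(b.entries[i].key, b.entries[j].key) < 0
	})
	w := 0
	for r := 1; r < len(b.entries); r++ {
		if !bytes.Equal(b.entries[w].key, b.entries[r].key) {
			w++
		}
		b.entries[w] = b.entries[r]
	}
	b.entries = b.entries[:w+1]
}

// merge appends src, then returns early (no dedupe) when the destination's
// largest key sorts before the source's smallest key.
func (b *buffer) merge(src *buffer) {
	oldLen := len(b.entries)
	b.entries = append(b.entries, src.entries...)
	if oldLen == 0 || len(src.entries) == 0 {
		return // guard added for this sketch only
	}
	if bytes.Compare(b.entries[oldLen-1].key, src.entries[0].key) < 0 {
		return // fast-path: duplicates inside src are never examined
	}
	b.dedupe()
}

// String renders the buffer as space-separated key=value pairs.
func (b *buffer) String() string {
	parts := make([]string, 0, len(b.entries))
	for _, e := range b.entries {
		parts = append(parts, fmt.Sprintf("%s=%s", e.key, e.val))
	}
	return strings.Join(parts, " ")
}

func main() {
	dst := &buffer{}
	dst.add("a", "1")
	src := &buffer{}
	src.add("k", "OLD")
	src.add("k", "NEW")
	dst.merge(src)
	fmt.Println(dst) // a=1 k=OLD k=NEW
}
```

In this model both `k` entries remain after `merge`, because the boundary comparison passes and `dedupe` is never reached.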
A later single-key read forces limit=1 (read_tx.go UnsafeRange, endKey == nil), and bucketBuffer.Range uses sort.Search, which returns the smallest matching index. With [(k,OLD),(k,NEW)] that is the OLD entry. Because UnsafeRange returns as soon as the buffer satisfies the limit (len(keys) == limit, here 1), the stale OLD value is returned and bbolt is never consulted.
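The lookup behaviour can be checked in isolation. This is a small sketch assuming a sorted slice that still contains the duplicate; `lookupFirst` is a hypothetical helper that models the single-key, limit=1 read and is not the etcd `Range` implementation.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

type kv struct{ key, val []byte }

// lookupFirst binary-searches for the first entry whose key is >= the
// target and returns that entry's value if the key matches exactly.
func lookupFirst(entries []kv, key []byte) ([]byte, bool) {
	idx := sort.Search(len(entries), func(i int) bool {
		return bytes.Compare(entries[i].key, key) >= 0
	})
	if idx < len(entries) && bytes.Equal(entries[idx].key, key) {
		return entries[idx].val, true
	}
	return nil, false
}

func main() {
	// Sorted by key, with the duplicate left in place (OLD before NEW).
	entries := []kv{
		{[]byte("a"), []byte("1")},
		{[]byte("k"), []byte("OLD")},
		{[]byte("k"), []byte("NEW")},
	}
	val, ok := lookupFirst(entries, []byte("k"))
	fmt.Printf("%s %v\n", val, ok) // OLD true
}
```

`sort.Search` is documented to return the smallest index for which the predicate is true, so the older of the two `k` entries is the one selected.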
Minimal deterministic repro
The repro works at the bucketBuffer level and calls merge directly with a pre-ordered source, so it does not depend on Go's sort behavior:
// server/storage/backend/tx_buffer_test.go, package backend
func TestMergeNoOverlapKeepsNewestForIntraBatchDuplicate(t *testing.T) {
	// read buffer already holds a key strictly less than the src keys,
	// so merge takes the no-overlap fast-path (around line 222).
	rb := &bucketBuffer{buf: make([]kv, 10), used: 0}
	rb.add([]byte("a"), []byte("1"))

	// write buffer holds the SAME key twice (a key overwritten within one
	// batch), pre-ordered OLD-before-NEW so the test does not depend on sort.
	wb := &bucketBuffer{buf: make([]kv, 10), used: 0}
	wb.add([]byte("k"), []byte("OLD"))
	wb.add([]byte("k"), []byte("NEW"))

	rb.merge(wb) // 'a' < 'k' -> no-overlap fast-path returns WITHOUT dedupe

	// limit=1 single-key read, as read_tx forces for endKey==nil on a
	// non-safeRange bucket.
	keys, vals := rb.Range([]byte("k"), nil, 1)
	assert.Equal(t, [][]byte{[]byte("k")}, keys)
	assert.Equal(t, [][]byte{[]byte("NEW")}, vals)
}
Expected vs actual
- Expected: Range("k", nil, 1) returns NEW. The dedupe contract is documented (the dedupe comment, "removes duplicates, using only newest update") and pinned by TestDedupe, which asserts newest-wins for a duplicate key.
- Actual: it returns OLD. The source-internal duplicate is never deduped, and sort.Search returns the first (smallest-index) match.
Why this is not by design
writeback deduplicates correctly on the other path: when the read buffer lacks the bucket it calls wb.dedupe() (newest-wins). Only the merge path skips it, behind a // assume no duplicate keys comment that put/putInternal/add do not enforce (they append unconditionally). The contract is newest-wins, and TestDedupe pins exactly that, so a stale read here is a regression, not intended behavior.
Existing tests miss this case: TestRangeAfterOverwriteMatch overwrites a key but in a first writeback with an empty read buffer, so it hits the direct-assign/dedupe path, not merge. TestRangeAfterAlternatingBucketWriteMatch reaches merge but with no intra-batch duplicate. Nothing exercises merge plus an intra-batch duplicate plus the no-overlap fast-path.
Impact
Non-safeRange buckets are affected (schema/bucket.go: Key is the only safeRangeBucket; Meta, Lease, Alarm, Cluster, Auth* are not). The clearest concern is Meta, which holds the consistent index and is read via UnsafeRange(Meta, key, nil, 0) (the exact endKey == nil single-key shape that forces limit=1).
Scoping note for triage: the end-to-end exposure is conditional. The no-overlap fast-path fires only when the duplicated key is greater than every key already in the read buffer, so a steady-state Meta key that is already present in the read buffer overlaps and takes the dedupe path correctly. The window is a same-batch double-write of a key not yet in the read buffer (and greater than its current max). The bucketBuffer.merge contract violation itself is unconditional and deterministic, as the repro shows; the Meta read-your-writes consequence is a plausible downstream effect, not something that occurs on every apply.
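The boundary condition behind this scoping note can be written as a one-line predicate. The sketch below is illustrative only: `takesFastPath` is a hypothetical helper, not an etcd function, and the keys are the placeholder keys from the repro rather than real Meta keys.

```go
package main

import (
	"bytes"
	"fmt"
)

// takesFastPath reports whether the no-overlap condition holds: the
// destination's largest key sorts strictly before the source's smallest key.
func takesFastPath(dstMax, srcMin []byte) bool {
	return bytes.Compare(dstMax, srcMin) < 0
}

func main() {
	// Read buffer max "a", batch writes "k": fast-path, dedupe is skipped.
	fmt.Println(takesFastPath([]byte("a"), []byte("k"))) // true
	// "k" is already the read buffer's max: overlap, so dedupe runs.
	fmt.Println(takesFastPath([]byte("k"), []byte("k"))) // false
	// Read buffer holds a larger key "z": overlap, so dedupe runs.
	fmt.Println(takesFastPath([]byte("z"), []byte("k"))) // false
}
```

Only the first case leaves an intra-batch duplicate in the merged buffer; the other two fall through to dedupe.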
I traced this at current main (61d518f). I prepared this report with the assistance of generative AI tooling and verified every file and line reference against the source myself.