Forward sync columns by root #7946
base: unstable
Conversation
This one should be ready for another round of review. Used claude to fix up the sync tests, lmk if you wanted a different approach. I don't have strong opinions on the test suite.
This pull request has merge conflicts. Could you please resolve them @pawanjay176? 🙏
I've gone through about half of this PR. Mostly just nits
In general there are probably opportunities to refactor/reduce code duplication but I agree with you that this PR is already big and any refactors should be made in a subsequent PR
I'll aim to finish my review tomorrow
if matches!(epe, ExecutionPayloadError::RejectedByExecutionEngine { .. }) {
    debug!(
        error = ?err,
        "Invalid execution payload rejected by EE"
    );
    Err(ChainSegmentFailed {
        message: format!(
            "Peer sent a block containing invalid execution payload. Reason: {:?}",
            err
        ),
        peer_action: Some(PeerAction::LowToleranceError),
        faulty_component: Some(FaultyComponent::Blocks), // todo(pawan): recheck this
    })
Is there a reason why `RejectedByExecutionEngine` is a low tolerance error but other `ExecutionPayloadError`s are not? Like `InvalidPayloadTimestamp` for example.
/// Note: this variant starts out in an uninitialized state because we typically make
/// the column requests by root only **after** we have fetched the corresponding blocks.
/// We can initialize this variant only after the column requests have been made.
DataColumnsFromRoot {
Maybe we could rename the old `DataColumns` variant to `DataColumnsByRange` just for clarity.
I tested this branch on devnet-3. I ran both range sync and backfill sync in the fullnode and supernode case and it worked well. Supernode backfill is slow, but it's also the same on unstable. I guess we wouldn't really see any sync improvements unless we tried syncing on a forky network. I'm happy to green check this once CI is passing.
@pawanjay176 this commit merges `data_columns_by_root_range_requests` and `data_columns_by_root_requests` with a -216 lines diff. Feel free to cherry-pick into your branch but I feel the
    }
    debug!(?requests, "Successfully retried requests");
}
Err(err) => {
You should match `Err(RpcRequestSendError::NoPeer(err))` explicitly, and if it's an `InternalError`, discard the request. Otherwise we may infinite loop.
Discarding the request is also not great, as that may stall sync silently. I thought this way, at least we know what's causing the stalling by making some noise in the logs.
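For reference, a rough sketch of the explicit match being suggested above; the variant payloads and the surrounding retry plumbing here are assumptions, only the `NoPeer` / `InternalError` variant names come from the discussion:

```rust
// Sketch only: `RpcRequestSendError`'s payloads and the retry plumbing are placeholders.
enum RpcRequestSendError {
    NoPeer(String),
    InternalError(String),
}

struct Request;

/// Returns the request back to the caller if it should be retried later,
/// and discards it on internal errors to avoid an infinite retry loop.
fn handle_send_result(
    result: Result<(), RpcRequestSendError>,
    request: Request,
) -> Option<Request> {
    match result {
        Ok(()) => None,
        // No peer available right now: keep the request around and retry later.
        Err(RpcRequestSendError::NoPeer(reason)) => {
            eprintln!("no peer to retry request: {reason}");
            Some(request)
        }
        // Internal errors are not retryable; drop the request and log loudly.
        Err(RpcRequestSendError::InternalError(reason)) => {
            eprintln!("dropping request after internal error: {reason}");
            None
        }
    }
}

fn main() {
    let kept = handle_send_result(
        Err(RpcRequestSendError::NoPeer("empty peer set".into())),
        Request,
    );
    assert!(kept.is_some());
}
```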
/// can result in undefined state, so it's treated as a hard error and the lookup is dropped.
UnexpectedRequestId {
    expected_req_id: DataColumnsByRootRequestId,
    req_id: DataColumnsByRootRequestId,
Is this change necessary? Pushed a commit that reverts this change and everything compiles fine. Feel free to cherry-pick 51ab53b
Yeah, don't remember why I did this tbh.
Ohh I remember now. In this PR, the `DataColumnsByRootRequestId` struct became larger because `DataColumnsByRootRequester` became an enum. So the linter was throwing a bunch of errors like:
error: the `Err`-variant returned from this function is very large
--> beacon_node/network/src/sync/network_context/custody.rs:455:10
|
48 | / UnexpectedRequestId {
49 | | expected_req_id: DataColumnsByRootRequestId,
50 | | req_id: DataColumnsByRootRequestId,
51 | | },
| |_____- the largest variant contains at least 224 bytes
...
455 | ) -> Result<(), Error> {
| ^^^^^^^^^^^^^^^^^
|
= help: try reducing the size of `sync::network_context::custody::Error`, for example by boxing large elements or replacing it with `Box<sync::network_context::custody::Error>`
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#result_large_err
Changed it to Id to keep the struct smaller. I felt it gave us similar info to debug if required.
We implement a custom `impl Display` for all these Id types, so the visual result is the same.
Yeah, but we would need to silence the clippy warnings in all the places which return `RpcRequestError`s in that case.
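For context on the clippy lint discussed above, here is a standalone illustration of `result_large_err` and the fix clippy itself suggests (boxing the large fields); the PR instead shrank the id, and the types below are stand-ins rather than the real Lighthouse ones:

```rust
// Stand-in for a request id type that grew large (e.g. because a field became a big enum).
#[derive(Debug)]
struct DataColumnsByRootRequestId {
    _payload: [u8; 112],
}

// Boxing the large fields keeps the error enum small (one pointer per field),
// so functions returning `Result<(), Error>` no longer trip `result_large_err`.
#[derive(Debug)]
enum Error {
    UnexpectedRequestId {
        expected_req_id: Box<DataColumnsByRootRequestId>,
        req_id: Box<DataColumnsByRootRequestId>,
    },
}

fn main() {
    println!("size_of::<Error>() = {}", std::mem::size_of::<Error>());
}
```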
…uled (sigp#8109)

sigp#8105 (to be confirmed)

I noticed a large number of failed discovery requests after deploying latest `unstable` to some of our testnet and mainnet nodes. This is because of a recent PeerDAS change to attempt to maintain sufficient peers across data column subnets - this shouldn't be enabled on networks without PeerDAS scheduled, otherwise it will keep retrying discovery on these subnets and never succeed.

Also removed some unused files.

Co-Authored-By: Jimmy Chen <[email protected]>
Co-Authored-By: Jimmy Chen <[email protected]>
…igp#8112)

- PR sigp#8045 introduced a regression in how lookup sync interacts with the da_checker. Now in unstable, block import from the HTTP API also inserts the block in the da_checker while the block is being execution verified. If lookup sync finds the block in the da_checker in `NotValidated` state, it expects a `GossipBlockProcessResult` message sometime later. That message is only sent after block import in gossip. I confirmed in our node's logs that 4/4 cases of stuck lookups are caused by this sequence of events:
  - Receive block through API, insert into da_checker in fn process_block in put_pre_execution_block
  - Create lookup and leave in AwaitingDownload(block in processing cache) state
  - Block from HTTP API finishes importing
  - Lookup is left stuck
- Closes sigp#8104
- sigp#8110 was my initial solution attempt, but we can't send the `GossipBlockProcessResult` event from the `http_api` crate without adding new channels, which seems messy. For a given node it's rare that a lookup is created at the same time that a block is being published. This PR solves sigp#8104 by allowing lookup sync to import the block twice in that case.

Co-Authored-By: dapplion <[email protected]>
This reverts commit 6ea14016f3d164456bc4c3cae0355ab532fe1a86.
Issue Addressed
N/A
The Problem
Our current strategy of syncing blocks + columns by range works roughly as follows for each batch:
- Pick a peer from the `SyncingChain` to fetch blocks from and send a `BlocksByRange` request.
- Send a `DataColumnsByRange` request at the same time.
- Couple the received blocks and columns by checking that the `block_root` and the `kzg_commitment` matches. If the coupling failed, try to re-request the failed columns (a simplified sketch of this coupling check is below).

This strategy works decently well when the chain is finalizing, as most of our peers are on the same chain. However, in times of non-finality we need to potentially sync multiple head chains.
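For illustration, a simplified sketch of the coupling check from the list above, assuming simplified field names (real sidecars carry full commitments, proofs, and signed headers):

```rust
use std::collections::HashMap;

type Root = [u8; 32];
type KzgCommitment = [u8; 48];

struct BlockInfo {
    root: Root,
    blob_kzg_commitments: Vec<KzgCommitment>,
}

struct ColumnSidecar {
    block_root: Root,
    kzg_commitments: Vec<KzgCommitment>,
    index: u64,
}

/// Return the indices of the columns that couple with a block in the batch;
/// anything that fails this check would be re-requested.
fn couple(blocks: &[BlockInfo], columns: &[ColumnSidecar]) -> Vec<u64> {
    let by_root: HashMap<Root, &BlockInfo> = blocks.iter().map(|b| (b.root, b)).collect();
    columns
        .iter()
        .filter(|column| {
            by_root
                .get(&column.block_root)
                .map_or(false, |block| block.blob_kzg_commitments == column.kzg_commitments)
        })
        .map(|column| column.index)
        .collect()
}

fn main() {
    let blocks = vec![BlockInfo { root: [1; 32], blob_kzg_commitments: vec![[9; 48]] }];
    let columns = vec![
        ColumnSidecar { block_root: [1; 32], kzg_commitments: vec![[9; 48]], index: 3 },
        ColumnSidecar { block_root: [2; 32], kzg_commitments: vec![[9; 48]], index: 4 },
    ];
    // The second column belongs to a different chain, so only index 3 couples.
    assert_eq!(couple(&blocks, &columns), vec![3]);
}
```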
This leads to issues with our current approach because the block peer and the data column peers might have a different view of the canonical chain due to multiple heads. So when we use the above approach, it is possible that the block peer returns us a batch of blocks for chain A while some or all data column peers send us the batch of data columns for a different chain B. Different data column peers might also be following different chains.
We initially tried to get around this problem by selecting column peers only from within the current `SyncingChain`. Each `SyncingChain` represents a `head_root` that we are trying to sync to, and we group peers based on the same `head_root`. That way, we know for sure that the block and column peers are on the same chain. This works in theory, but in practice, during long periods of non-finality, we tend to create multiple head chains based on the `head_root` and split the global peerset. Pre-Fulu, this isn't a big deal since all peers are supposed to have all the blob data. But splitting peers with PeerDAS is a big challenge because not all peers have the full data available. There are supernodes, but during bad network conditions, supernodes would be getting way too many requests and not even have any incoming peer slots. As we saw on the Fusaka devnets, this strategy leads to sync getting stalled and not progressing.
Proposed Changes
1. Use `DataColumnsByRoot` instead of `DataColumnsByRange` to fetch columns for forward sync

This is the main change. The new strategy would go as follows:
- Pick a peer from the `SyncingChain` to fetch blocks from and send a `BlocksByRange` request.
- Once the blocks arrive, compute the `block_roots` and trigger a `DataColumnsByRoot` request for every block in the response that has any blobs based on the `expected_kzg_commitments` field (sketched below).

This approach kinda assumes that most synced/advanced peers would have the different chains in their fork choice to be able to serve specific by-root requests. My hunch is that this is true, but we should validate this in a devnet-4-like chain split scenario.
Note that we currently use this by-root strategy only for forward sync, not for backfill. Backfill has to deal with only a single canonical chain, so by-range requests should work well there.
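A minimal sketch of the by-root step described above, assuming hypothetical simplified types (the real code works with full blocks, custody assignments, and per-peer request batching):

```rust
type Root = [u8; 32];
type ColumnIndex = u64;

/// A block as seen in a `BlocksByRange` response (simplified).
struct BlockSummary {
    root: Root,
    /// Number of entries in the block's `expected_kzg_commitments`.
    expected_kzg_commitments: usize,
}

/// One by-root request: our custody columns for a single block root.
struct ColumnsByRootRequest {
    block_root: Root,
    columns: Vec<ColumnIndex>,
}

/// Build a `DataColumnsByRoot`-style request for every block in the batch
/// that actually has blobs; blocks without commitments need no columns.
fn column_requests_for_batch(
    blocks: &[BlockSummary],
    custody_columns: &[ColumnIndex],
) -> Vec<ColumnsByRootRequest> {
    blocks
        .iter()
        .filter(|block| block.expected_kzg_commitments > 0)
        .map(|block| ColumnsByRootRequest {
            block_root: block.root,
            columns: custody_columns.to_vec(),
        })
        .collect()
}

fn main() {
    let blocks = vec![
        BlockSummary { root: [1; 32], expected_kzg_commitments: 3 },
        BlockSummary { root: [2; 32], expected_kzg_commitments: 0 },
    ];
    let requests = column_requests_for_batch(&blocks, &[0, 7, 63]);
    // Only the first block has blobs, so only one by-root request is issued.
    assert_eq!(requests.len(), 1);
    assert_eq!(requests[0].columns.len(), 3);
}
```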
2. ResponsiblePeers to attribute peer fault correctly

Adds the `ResponsiblePeers` struct, which stores the block and the column peers that we made the download requests to. For most of our peer-attributable errors, the processing error indicates whether the block peer or a column peer was at fault. We now communicate this information back to sync and downscore specific peers based on the fault type. This, imo, is an improvement over current unstable where, most of the time, we attribute fault to the peer that "completed" the request by being the last peer to respond.

Due to this ambiguity in fault attribution, we weren't downscoring pretty serious processing errors like `InvalidKzgProofs`, `InvalidExecutionPayload` etc. I think this PR attributes the errors to the right peers. Reviewers, please check that this claim is actually true. A rough sketch of the idea is below.
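A rough sketch of the idea, with hypothetical shapes; only the `ResponsiblePeers` and `FaultyComponent` names come from the PR, the fields and method are illustrative:

```rust
type PeerId = String;

/// Which component of a batch a processing error points at.
#[derive(Debug, Clone, Copy)]
enum FaultyComponent {
    Blocks,
    DataColumns,
}

/// The peers we made the download requests to, kept alongside the batch so
/// that processing failures can be attributed to the right side.
struct ResponsiblePeers {
    block_peer: PeerId,
    column_peers: Vec<PeerId>,
}

impl ResponsiblePeers {
    /// Return the peers to downscore for a given fault.
    fn peers_to_penalize(&self, fault: FaultyComponent) -> Vec<PeerId> {
        match fault {
            // e.g. an invalid execution payload implicates the block peer.
            FaultyComponent::Blocks => vec![self.block_peer.clone()],
            // e.g. invalid KZG proofs implicate the column peers.
            FaultyComponent::DataColumns => self.column_peers.clone(),
        }
    }
}

fn main() {
    let peers = ResponsiblePeers {
        block_peer: "peer-a".into(),
        column_peers: vec!["peer-b".into(), "peer-c".into()],
    };
    assert_eq!(
        peers.peers_to_penalize(FaultyComponent::Blocks),
        vec!["peer-a".to_string()]
    );
}
```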
3. Make `AwaitingDownload` an allowable in-between state

Note: This has been extracted to its own PR and merged: #7984
Prior to PeerDAS, a batch should never have been in `AwaitingDownload` state because we immediately try to move from `AwaitingDownload` to `Downloading` state by sending batches. This was always possible as long as we had peers in the `SyncingChain` in the pre-PeerDAS world.

However, this is no longer the case, as a batch can be stuck waiting in `AwaitingDownload` state if we have no peers to request the columns from. This PR makes `AwaitingDownload` an allowable in-between state. If a batch is found to be in this state, then we attempt to send the batch instead of erroring like before (see the sketch below).

Note to reviewer: We need to make sure that this doesn't lead to a bunch of batches stuck in `AwaitingDownload` state when the chain could otherwise be progressed.
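A simplified sketch of the behavioural change, using a toy state machine rather than Lighthouse's actual `BatchState`:

```rust
#[derive(Debug)]
enum BatchState {
    AwaitingDownload,
    Downloading { peer: String },
    AwaitingProcessing,
}

enum SyncError {
    /// Previously, finding a batch in AwaitingDownload was treated as a logic bug.
    UnexpectedBatchState(String),
}

/// Old behaviour (roughly): a batch found in AwaitingDownload is an error.
fn on_batch_tick_old(state: &BatchState) -> Result<(), SyncError> {
    match state {
        BatchState::AwaitingDownload => Err(SyncError::UnexpectedBatchState(
            "batch stuck in AwaitingDownload".into(),
        )),
        _ => Ok(()),
    }
}

/// New behaviour (roughly): AwaitingDownload is allowed, and we simply retry
/// sending the batch once a suitable peer is available.
fn on_batch_tick_new(state: &mut BatchState, available_peer: Option<String>) {
    if matches!(state, BatchState::AwaitingDownload) {
        if let Some(peer) = available_peer {
            // Re-attempt the request instead of erroring.
            *state = BatchState::Downloading { peer };
        }
        // Otherwise the batch stays in AwaitingDownload until peers show up.
    }
}

fn main() {
    let mut state = BatchState::AwaitingDownload;
    assert!(on_batch_tick_old(&state).is_err());
    on_batch_tick_new(&mut state, Some("peer-a".into()));
    assert!(matches!(state, BatchState::Downloading { .. }));
}
```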