fix: handle ApprovedBlock in GenesisValidator for late-joiner recovery #489
When a genesis validator joins boot's connections after the UnapprovedBlock broadcasts but before boot reaches required_signatures, it has nothing to sign. Boot then sends ApprovedBlock to all peers including the late validator, but GenesisValidator::handle had no ApprovedBlock arm, so the message hit `_ => Ok(())` and was silently dropped. The validator stayed in GenesisValidator state forever, logging "Casper engine present but Casper not initialized yet."

Add a CasperMessage::ApprovedBlock arm that transitions to Initializing. Initializing::init already proactively re-requests ApprovedBlock from bootstrap (the comment at initializing.rs:239-242 anticipated this race), and Initializing::handle accepts the response, validates it, and transitions to Running, the same path a late-joining non-genesis node already takes.

Reproduction (pre-fix): F1R3FLY_NODE_IMAGE=...:local pytest test_consensus_safety.py::test_validator_failure_recovery --keep-running hit failure on attempt 3 of a 20-attempt loop (~33% rate).

Verification (post-fix): same loop ran 20/20 passes; full integration suite improved from 78 pass / 10 fail / 6 error to 89 pass / 5 fail / 0 error, eliminating all 9 startup-timeout failures and the 6 cascading ws_shard fixture errors.

Co-Authored-By: Claude <noreply@anthropic.com>
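The state flow described above can be sketched as a toy state machine. This is a minimal illustration, not the code in this PR: the `Engine` enum, the `approved_block_requested` flag, and the simplified `CasperMessage` variants are hypothetical stand-ins, and validation of the approved block is elided.

```rust
// Hypothetical, simplified stand-ins for the real engine and message types.
#[derive(Debug, Clone, PartialEq)]
pub enum CasperMessage {
    UnapprovedBlock(String),
    ApprovedBlock(String),
    Other,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Engine {
    GenesisValidator,
    // The flag models Initializing::init re-requesting ApprovedBlock from bootstrap.
    Initializing { approved_block_requested: bool },
    Running,
}

impl Engine {
    /// Consumes the current state and returns the next one.
    pub fn handle(self, msg: CasperMessage) -> Engine {
        match (self, msg) {
            // The new arm: a late joiner that never saw an UnapprovedBlock
            // leaves GenesisValidator instead of dropping the message.
            (Engine::GenesisValidator, CasperMessage::ApprovedBlock(_)) => {
                Engine::Initializing { approved_block_requested: true }
            }
            // Initializing accepts the response to its own request
            // (validation elided in this sketch) and moves to Running.
            (
                Engine::Initializing { approved_block_requested: true },
                CasperMessage::ApprovedBlock(_),
            ) => Engine::Running,
            // Catch-all: every other message leaves the state unchanged,
            // which is what happened to ApprovedBlock before the fix.
            (state, _) => state,
        }
    }
}
```

Without the first arm, an `ApprovedBlock` delivered to `Engine::GenesisValidator` would fall through to the catch-all and the state would never change, which is the stuck-validator symptom.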
Two items worth addressing before merge: 1. Edge case:
…ression test

PR #489 review flagged that handle_approved_block_late shared seen_candidates with handle_unapproved_block, conflating two unrelated dedup concerns. If a validator failed to sign an UnapprovedBlock (no transition out of GenesisValidator) and the same hash later arrived as ApprovedBlock, the guard would block recovery.

Drop the guard. A successful transition_to_initializing replaces the engine, so subsequent ApprovedBlock messages route to Initializing::handle, not back here. Concurrent duplicates during the brief transition window are serialized by the engine_cell write, and Initializing::init's ApprovedBlockRequest is idempotent at bootstrap.

Add transitions_to_initializing_on_late_approved_block targeting the GenesisValidator::handle(ApprovedBlock) path directly: send the message with no prior UnapprovedBlock, assert that Initializing::init's ApprovedBlockRequest reaches the transport layer.

Co-Authored-By: Claude <noreply@anthropic.com>
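A small sketch of why the shared guard was a problem. Everything here is hypothetical: `GenesisValidatorSketch`, the `signing_succeeds` flag, and the `transitioned` field only model the behaviour described in the message, and the exact shape of the original guard is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical model of a GenesisValidator with a candidate dedup set.
pub struct GenesisValidatorSketch {
    seen_candidates: HashSet<String>,
    /// True once the validator has left the GenesisValidator state.
    pub transitioned: bool,
}

impl GenesisValidatorSketch {
    pub fn new() -> Self {
        Self { seen_candidates: HashSet::new(), transitioned: false }
    }

    /// Records the candidate hash first; signing may still fail afterwards,
    /// in which case the validator stays in GenesisValidator.
    pub fn handle_unapproved_block(&mut self, hash: &str, signing_succeeds: bool) {
        if !self.seen_candidates.insert(hash.to_string()) {
            return; // duplicate candidate, ignore
        }
        if signing_succeeds {
            self.transitioned = true;
        }
    }

    /// Assumed shape of the flagged variant: it reuses seen_candidates,
    /// so a hash that failed signing earlier blocks recovery here.
    pub fn handle_approved_block_guarded(&mut self, hash: &str) {
        if self.seen_candidates.contains(hash) {
            return;
        }
        self.transitioned = true;
    }

    /// Variant after the review: no guard, always transition.
    pub fn handle_approved_block_late(&mut self, _hash: &str) {
        self.transitioned = true;
    }
}
```

In this model, a failed signing attempt followed by an `ApprovedBlock` with the same hash leaves the guarded variant stuck, while the unguarded variant recovers.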
…tion
approve_block_protocol_test asserts a delta on a process-global metrics
counter ("genesis") between a baseline read and a post-action read.
That counter is incremented from add_approval — anywhere a valid
UnapprovedBlock signature is processed. The approve-block tests are
all #[serial], but two other test files exercise the same code path
without the marker:
- genesis_validator_spec::respond_on_unapproved_block_messages_with_block_approval
(sends UnapprovedBlock to a GenesisValidator → block_approver → add_approval)
- block_approver_protocol_test (calls unapproved_block_packet_handler
directly across six tests)
Without serialization, those tests can run in parallel with an
approve_block_protocol_test in flight and corrupt its delta — the TODO
entry tracked this as a ~1-in-3 flake.
Mark all genesis_validator_spec and block_approver_protocol_test
#[tokio::test]s as #[serial] so they share serial_test's mutex with
approve_block_protocol_test. Verified across three consecutive full
casper test runs (343/343 each). Drop the now-resolved TODO entry.
Co-Authored-By: Claude <noreply@anthropic.com>
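The flake mechanism and the fix can be illustrated with a std-only sketch. The names below (`GENESIS_COUNTER`, `SERIAL`, `measured_delta`, `other_test_body`) are hypothetical; `SERIAL` stands in for the lock that serial_test's `#[serial]` attribute makes tests share, and this is not the real test code.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;
use std::thread;

// Stand-in for the process-global "genesis" metrics counter.
static GENESIS_COUNTER: AtomicU64 = AtomicU64::new(0);
// Stand-in for the mutex shared by every test marked #[serial].
static SERIAL: Mutex<()> = Mutex::new(());

// Stand-in for add_approval: any valid signature bumps the counter.
fn add_approval() {
    GENESIS_COUNTER.fetch_add(1, Ordering::SeqCst);
}

/// Models an approve_block_protocol_test: baseline read, one action,
/// post-action read, all while holding the shared lock.
pub fn measured_delta() -> u64 {
    let _guard = SERIAL.lock().unwrap();
    let baseline = GENESIS_COUNTER.load(Ordering::SeqCst);
    add_approval();
    GENESIS_COUNTER.load(Ordering::SeqCst) - baseline
}

/// Models a test from another file that hits the same code path.
/// Taking the same lock is what marking it #[serial] achieves.
pub fn other_test_body() {
    let _guard = SERIAL.lock().unwrap();
    for _ in 0..1000 {
        add_approval();
    }
}

/// Runs both kinds of test bodies on parallel threads and returns
/// the deltas observed by the measuring ones.
pub fn run_concurrently() -> Vec<u64> {
    let noisy: Vec<_> = (0..4).map(|_| thread::spawn(other_test_body)).collect();
    let measured: Vec<_> = (0..4).map(|_| thread::spawn(measured_delta)).collect();
    for handle in noisy {
        handle.join().unwrap();
    }
    measured.into_iter().map(|h| h.join().unwrap()).collect()
}
```

If `other_test_body` skipped the lock, its increments could land between the baseline and post-action reads in `measured_delta` and inflate the delta, which is the corruption described above.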
Both review items addressed in Item 1 —
Brings in PR #489 (`fb59611f` "fix: handle ApprovedBlock in GenesisValidator for late-joiner recovery").

PR #489 added a second call site to `transition_to_initializing` from the new `handle_approved_block` arm in `GenesisValidator`. On rust/staging that call uses 22 args; on this branch the signature has 23 args because PR #488 added the `rejected_deploy_buffer` parameter. Auto-merge resolved cleanly at the textual level but produced a compile error: the new call at `genesis_validator.rs:219` and the new test in `genesis_validator_spec.rs:158` both omit the `rejected_deploy_buffer` arg.

Resolution adds `&self.rejected_deploy_buffer` (lib) and `fixture.rejected_deploy_buffer.clone()` (test) to bring those call sites in line with the existing UnapprovedBlock-arm call site at ~line 282 and the other test sites at ~lines 47 and 248.

CI symptom: PR #488's GitHub virtual merge with rust/staging fails to build because it produces this exact two-call-site state, but without the merge commit on the source branch the auto-merge has no way to apply the fixup. This merge commit captures the fixup so CI's virtual merge collapses to a no-op.

Casper test suite: 350/350 (was 349 pre-merge, +1 for PR #489's new genesis_validator_spec test).

Co-Authored-By: Claude <noreply@anthropic.com>
Summary
When a genesis validator joins boot's connections after the `UnapprovedBlock` broadcasts but before boot reaches `required_signatures`, it has nothing to sign. Boot then sends `ApprovedBlock` to all peers including the late validator, but `GenesisValidator::handle` had no `ApprovedBlock` arm, so the message hit `_ => Ok(())` and was silently dropped. The validator stayed in `GenesisValidator` state forever, logging `"Casper engine present but Casper not initialized yet"`.
Fix
Add a `CasperMessage::ApprovedBlock` arm to `GenesisValidator::handle` that transitions to `Initializing`. `Initializing::init` already proactively re-requests `ApprovedBlock` from bootstrap (the comment at `initializing.rs:239-242` explicitly anticipated this race), and `Initializing::handle` accepts the response, validates it, and transitions to `Running`, the same path a late-joining non-genesis node already takes.
Race window
~50ms-200ms between boot's last `UnapprovedBlock` broadcast and boot reaching quorum. Whichever genesis validator falls into this window loses the ceremony deterministically.
Reproduction (pre-fix)
```
F1R3FLY_NODE_IMAGE=f1r3flyindustries/f1r3fly-rust-node:local poetry run pytest \
integration-tests/test/tests/custom/test_consensus_safety.py::test_validator_failure_recovery \
--keep-running -v
```
In a 20-attempt loop, the first failure occurred on attempt 3 (~33% failure rate).
Verification (post-fix)
The 11 net wins came from: 7 of 9 validator-startup failures now passing (`test_validator_failure_recovery`, `test_validator_failure_halts_finalization`, `test_ftt_boundary_strict_greater_than`, `test_epoch_transition_under_heartbeat`, `test_merge_determinism_asymmetric_divergence`, `test_synchrony_constraint`, `test_trim_state`) plus all 6 `test_websocket.py` cascading errors gone. The 2 remaining custom-test failures (`test_load`, `test_shard_degradation`) no longer fail at startup; they now expose pre-existing sustained-load issues that were previously masked.
Test plan
Co-Authored-By: Claude <noreply@anthropic.com>