Skip to content

fix: handle ApprovedBlock in GenesisValidator for late-joiner recovery#489

Merged
spreston8 merged 3 commits intorust/stagingfrom
fix/genesis-validator-late-join-recovery
Apr 28, 2026
Merged

fix: handle ApprovedBlock in GenesisValidator for late-joiner recovery#489
spreston8 merged 3 commits intorust/stagingfrom
fix/genesis-validator-late-join-recovery

Conversation

@spreston8
Copy link
Copy Markdown
Collaborator

Summary

When a genesis validator joins boot's connections after the UnapprovedBlock broadcasts but before boot reaches required_signatures, it has nothing to sign. Boot then sends ApprovedBlock to all peers including the late validator, but GenesisValidator::handle had no ApprovedBlock arm, so the message hit `_ => Ok(())` and was silently dropped. The validator stayed in GenesisValidator state forever, logging `"Casper engine present but Casper not initialized yet"`.

Fix

Add a CasperMessage::ApprovedBlock arm to GenesisValidator::handle that transitions to Initializing. Initializing::init already proactively re-requests ApprovedBlock from bootstrap (the comment at initializing.rs:239-242 explicitly anticipated this race), and Initializing::handle accepts the response, validates it, and transitions to Running — the same path a late-joining non-genesis node already takes.

Race window

~50ms-200ms between boot's last UnapprovedBlock broadcast and boot reaching quorum. Whichever genesis validator falls into this window loses the ceremony deterministically.

Reproduction (pre-fix)

```
F1R3FLY_NODE_IMAGE=f1r3flyindustries/f1r3fly-rust-node:local poetry run pytest \
integration-tests/test/tests/custom/test_consensus_safety.py::test_validator_failure_recovery \
--keep-running -v
```

In a 20-attempt loop, failure hit on attempt 3 (~33% rate).

Verification (post-fix)

Metric Pre-fix Post-fix
20-attempt single-test loop failure on attempt 3 20/20 passed
Full integration suite (92 tests, filtered) 78 pass / 10 fail / 6 error 89 pass / 5 fail / 0 error

The 11 net wins came from: 7 of 9 validator-startup failures now passing (test_validator_failure_recovery, test_validator_failure_halts_finalization, test_ftt_boundary_strict_greater_than, test_epoch_transition_under_heartbeat, test_merge_determinism_asymmetric_divergence, test_synchrony_constraint, test_trim_state) plus all 6 test_websocket.py cascading errors gone. The 2 remaining custom-test failures (test_load, test_shard_degradation) no longer fail at startup; they now expose pre-existing sustained-load issues that were previously masked.

Test plan

  • Compile clean: `cargo check -p casper`
  • Single-test loop: 20/20 passes
  • Full integration suite: 89 pass / 5 fail (vs 78/10/6 pre-fix)
  • CI on `rust/staging` PR

Co-Authored-By: Claude noreply@anthropic.com

When a genesis validator joins boot's connections after the UnapprovedBlock
broadcasts but before boot reaches required_signatures, it has nothing to
sign. Boot then sends ApprovedBlock to all peers including the late
validator, but GenesisValidator::handle had no ApprovedBlock arm, so the
message hit `_ => Ok(())` and was silently dropped. The validator stayed
in GenesisValidator state forever, logging "Casper engine present but
Casper not initialized yet."

Add a CasperMessage::ApprovedBlock arm that transitions to Initializing.
Initializing::init already proactively re-requests ApprovedBlock from
bootstrap (the comment at initializing.rs:239-242 anticipated this race),
and Initializing::handle accepts the response, validates it, and
transitions to Running — the same path a late-joining non-genesis node
already takes.

Reproduction (pre-fix): F1R3FLY_NODE_IMAGE=...:local pytest
  test_consensus_safety.py::test_validator_failure_recovery --keep-running
hit failure on attempt 3 of a 20-attempt loop (~33% rate).

Verification (post-fix): same loop ran 20/20 passes; full integration
suite improved from 78 pass / 10 fail / 6 error to 89 pass / 5 fail / 0
error, eliminating all 9 startup-timeout failures and the 6 cascading
ws_shard fixture errors.

Co-Authored-By: Claude <noreply@anthropic.com>
@spreston8
Copy link
Copy Markdown
Collaborator Author

Two items worth addressing before merge:

1. Edge case: is_repeated may block recovery when both UnapprovedBlock and ApprovedBlock arrive

handle_approved_block_late calls is_repeated(&hash) against self.seen_candidates. The seen_candidates set is also populated by handle_unapproved_block via ack(hash) after V signs the candidate. Both UnapprovedBlock and ApprovedBlock carry the same block_hash (it's the same genesis candidate, just at different ceremony stages).

This creates a potential gap. Consider:

  1. Boot broadcasts UnapprovedBlock to V at T+0
  2. V's handle_unapproved_block validates and signs the candidate, calls ack(hash)seen_candidates now contains hash
  3. V sends BlockApproval back to boot
  4. Boot reaches quorum, transitions to Running, broadcasts ApprovedBlock (same hash) at T+5
  5. V's handle_approved_block_late is invoked
  6. is_repeated(&hash) returns true → early return → V stays in GenesisValidator forever

In this scenario V is stuck despite having participated correctly. The original race the PR targets (V never sees the UnapprovedBlock, only the ApprovedBlock) is a different code path where seen_candidates is empty, so the is_repeated check passes and recovery works. But for V that signed and then gets the ApprovedBlock, the dedup check is misapplied.

The semantic mismatch: seen_candidates was designed to dedup repeat UnapprovedBlock sends from the same block. Reusing it as a guard for ApprovedBlock conflates "we've seen this candidate's content" with "we've already processed an ApprovedBlock for it." For ApprovedBlock, the relevant question is "have we already transitioned to Initializing?" — which is structurally guaranteed once because the engine itself is replaced after transition_to_initializing succeeds.

Suggested fix options:

  • Drop the is_repeated check entirely from handle_approved_block_late. Since the engine transitions away from GenesisValidator on success, a second ApprovedBlock with the same hash cannot be received in this state.
  • Or track ApprovedBlock arrivals separately from seen_candidates (e.g. a separate seen_approved_blocks: Mutex<HashSet<BlockHash>> if dedup is desired for some reason).

Could you confirm whether V can actually be in a state where it has signed an UnapprovedBlock but is still in GenesisValidator when the ApprovedBlock arrives? If the signing flow already transitions out of GenesisValidator, this concern is moot. If it doesn't, the is_repeated check needs revisiting.

2. Add a targeted regression test

The existing test_validator_failure_recovery exercises the path indirectly (and the 20/20 passes are good evidence the fix works), but a unit test targeting GenesisValidator::handle(CasperMessage::ApprovedBlock) directly would lock in the regression-proof behavior. Suggested coverage:

Co-Authored-By: Claude noreply@anthropic.com

spreston8 and others added 2 commits April 28, 2026 14:18
…ression test

PR #489 review flagged that handle_approved_block_late shared
seen_candidates with handle_unapproved_block, conflating two unrelated
dedup concerns. If a validator failed to sign an UnapprovedBlock (no
transition out of GenesisValidator) and the same hash later arrived as
ApprovedBlock, the guard would block recovery.

Drop the guard. A successful transition_to_initializing replaces the
engine, so subsequent ApprovedBlock messages route to
Initializing::handle, not back here. Concurrent duplicates during the
brief transition window are serialized by the engine_cell write, and
Initializing::init's ApprovedBlockRequest is idempotent at bootstrap.

Add transitions_to_initializing_on_late_approved_block targeting the
GenesisValidator::handle(ApprovedBlock) path directly: send the message
with no prior UnapprovedBlock, assert that Initializing::init's
ApprovedBlockRequest reaches the transport layer.

Co-Authored-By: Claude <noreply@anthropic.com>
…tion

approve_block_protocol_test asserts a delta on a process-global metrics
counter ("genesis") between a baseline read and a post-action read.
That counter is incremented from add_approval — anywhere a valid
UnapprovedBlock signature is processed. The approve-block tests are
all #[serial], but two other test files exercise the same code path
without the marker:

- genesis_validator_spec::respond_on_unapproved_block_messages_with_block_approval
  (sends UnapprovedBlock to a GenesisValidator → block_approver → add_approval)
- block_approver_protocol_test (calls unapproved_block_packet_handler
  directly across six tests)

Without serialization, those tests can run in parallel with an
approve_block_protocol_test in flight and corrupt its delta — the TODO
entry tracked this as a ~1-in-3 flake.

Mark all genesis_validator_spec and block_approver_protocol_test
#[tokio::test]s as #[serial] so they share serial_test's mutex with
approve_block_protocol_test. Verified across three consecutive full
casper test runs (343/343 each). Drop the now-resolved TODO entry.

Co-Authored-By: Claude <noreply@anthropic.com>
@spreston8
Copy link
Copy Markdown
Collaborator Author

Both review items addressed in 152005ea, plus a follow-up 66a0ed42 that resolves a related flake surfaced during verification.

Item 1 — is_repeated dropped (152005ea)

You're right that seen_candidates was conflating two unrelated dedup concerns. Walking the code paths:

  • handle_unapproved_block calls is_repeated → ack(hash) → block_approver.unapproved_block_packet_handler → transition_to_initializing — on success, V is no longer in GenesisValidator.
  • The only way V is in GenesisValidator AND has the hash in seen_candidates is if unapproved_block_packet_handler returned an error before the transition (signing/validation failure). In that pathological case, a later ApprovedBlock for the same candidate would have been incorrectly suppressed.

Dropped the guard. After successful transition_to_initializing the engine is replaced, so subsequent ApprovedBlock messages route to Initializing::handle. Concurrent duplicates during the brief transition window are serialized by the engine_cell write, and Initializing::init's ApprovedBlockRequest is idempotent at bootstrap. Doc comment expanded to record the reasoning.

Item 2 — Regression test added (152005ea)

transitions_to_initializing_on_late_approved_block targets the path directly:

  1. Construct ApprovedBlock from the genesis candidate (no prior UnapprovedBlock).
  2. Send CasperMessage::ApprovedBlock to a fresh GenesisValidator.
  3. Assert that an ApprovedBlockRequest reaches the transport layer — proof that the engine transitioned to Initializing, whose init proactively requests the approved block.
  4. Send the same ApprovedBlock again to exercise the post-transition path: routes to Initializing::handle, no panic/stall.

Bonus — fixed the flake noted in services/f1r3node-rust/TODO.md (66a0ed42)

Verifying the fix surfaced approve_block_protocol_test::should_send_approved_block_message_to_peers_once_approved_block_is_created failing intermittently. Root cause is not shared LMDB (the TODO's hypothesis); it's the process-global metrics counter "genesis" incremented in add_approval. approve_block_protocol_test is #[serial], but genesis_validator_spec and block_approver_protocol_test were not — they exercise the same add_approval code path and could mutate the counter mid-assertion.

Marked all counter-incrementing tests in those two files #[serial]. Three consecutive full casper test runs (343/343 each) post-fix. Dropped the resolved TODO entry.

Branch updated, full suite green.

Co-Authored-By: Claude noreply@anthropic.com

@spreston8 spreston8 merged commit fb59611 into rust/staging Apr 28, 2026
20 checks passed
@spreston8 spreston8 deleted the fix/genesis-validator-late-join-recovery branch April 28, 2026 22:31
spreston8 added a commit that referenced this pull request Apr 29, 2026
Brings in PR #489 (`fb59611f` "fix: handle ApprovedBlock in
GenesisValidator for late-joiner recovery"). PR #489 added a second
call site to `transition_to_initializing` from the new
`handle_approved_block` arm in `GenesisValidator`. On rust/staging
that call uses 22 args; on this branch the signature has 23 args
because PR #488 added the `rejected_deploy_buffer` parameter.

Auto-merge resolved cleanly at the textual level but produced a
compile error: the new call at `genesis_validator.rs:219` and the
new test in `genesis_validator_spec.rs:158` both omit the
`rejected_deploy_buffer` arg. Resolution adds
`&self.rejected_deploy_buffer` (lib) and
`fixture.rejected_deploy_buffer.clone()` (test) to bring those call
sites in line with the existing UnapprovedBlock-arm call site at
~line 282 and the other test sites at ~lines 47 and 248.

CI symptom: PR #488's GitHub virtual merge with rust/staging fails
to build because it produces this exact two-call-site state, but
without the merge commit on the source branch the auto-merge has
no way to apply the fixup. This merge commit captures the fixup so
CI's virtual merge collapses to a no-op.

Casper test suite: 350/350 (was 349 pre-merge, +1 for PR #489's
new genesis_validator_spec test).

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant