… (#489)

* fix: handle ApprovedBlock in GenesisValidator for late-joiner recovery

When a genesis validator joins boot's connections after the UnapprovedBlock
broadcasts but before boot reaches required_signatures, it has nothing to
sign. Boot then sends ApprovedBlock to all peers including the late
validator, but GenesisValidator::handle had no ApprovedBlock arm, so the
message hit `_ => Ok(())` and was silently dropped. The validator stayed
in GenesisValidator state forever, logging "Casper engine present but
Casper not initialized yet."

Add a CasperMessage::ApprovedBlock arm that transitions to Initializing.
Initializing::init already proactively re-requests ApprovedBlock from
bootstrap (the comment at initializing.rs:239-242 anticipated this race),
and Initializing::handle accepts the response, validates it, and
transitions to Running — the same path a late-joining non-genesis node
already takes.

Reproduction (pre-fix): F1R3FLY_NODE_IMAGE=...:local pytest
test_consensus_safety.py::test_validator_failure_recovery --keep-running
hit failure on attempt 3 of a 20-attempt loop (~33% rate).

Verification (post-fix): the same loop ran 20/20 passes; the full
integration suite improved from 78 pass / 10 fail / 6 error to
89 pass / 5 fail / 0 error, eliminating all 9 startup-timeout failures
and the 6 cascading ws_shard fixture errors.

Co-Authored-By: Claude <noreply@anthropic.com>

* review: drop is_repeated guard in late-ApprovedBlock handler, add regression test

PR #489 review flagged that handle_approved_block_late shared
seen_candidates with handle_unapproved_block, conflating two unrelated
dedup concerns. If a validator failed to sign an UnapprovedBlock (no
transition out of GenesisValidator) and the same hash later arrived as
ApprovedBlock, the guard would block recovery.

Drop the guard. A successful transition_to_initializing replaces the
engine, so subsequent ApprovedBlock messages route to
Initializing::handle, not back here. Concurrent duplicates during the
brief transition window are serialized by the engine_cell write, and
Initializing::init's ApprovedBlockRequest is idempotent at bootstrap.

Add transitions_to_initializing_on_late_approved_block targeting the
GenesisValidator::handle(ApprovedBlock) path directly: send the message
with no prior UnapprovedBlock, assert that Initializing::init's
ApprovedBlockRequest reaches the transport layer.

Co-Authored-By: Claude <noreply@anthropic.com>

* test: serialize genesis-counter-incrementing tests to fix flaky assertion

approve_block_protocol_test asserts a delta on a process-global metrics
counter ("genesis") between a baseline read and a post-action read. That
counter is incremented from add_approval — anywhere a valid
UnapprovedBlock signature is processed. The approve-block tests are all
#[serial], but two other test files exercise the same code path without
the marker:

- genesis_validator_spec::respond_on_unapproved_block_messages_with_block_approval
  (sends UnapprovedBlock to a GenesisValidator → block_approver →
  add_approval)
- block_approver_protocol_test (calls unapproved_block_packet_handler
  directly across six tests)

Without serialization, those tests can run in parallel with an
approve_block_protocol_test in flight and corrupt its delta — the TODO
entry tracked this as a ~1-in-3 flake.

Mark all genesis_validator_spec and block_approver_protocol_test
#[tokio::test]s as #[serial] so they share serial_test's mutex with
approve_block_protocol_test. Verified across three consecutive full
casper test runs (343/343 each).

Drop the now-resolved TODO entry.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
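A minimal sketch of the handler change in the first commit above, with the
engine API reduced to stand-in types. `CasperMessage`'s payload shapes, the
handler signature, and the helper methods here are illustrative, not the
node's actual interfaces:

```rust
// Stand-in message type: only the variants relevant to the fix.
enum CasperMessage {
    UnapprovedBlock(Vec<u8>),
    ApprovedBlock(Vec<u8>),
    Other,
}

struct GenesisValidator;

impl GenesisValidator {
    fn handle(&self, msg: CasperMessage) -> Result<(), String> {
        match msg {
            CasperMessage::UnapprovedBlock(b) => self.handle_unapproved_block(b),
            // New arm: a late joiner that missed the UnapprovedBlock round
            // still makes progress when boot broadcasts the ApprovedBlock.
            // Initializing::init then re-requests the ApprovedBlock from
            // bootstrap, validates it, and transitions to Running.
            CasperMessage::ApprovedBlock(b) => self.transition_to_initializing(b),
            // Previously ApprovedBlock also fell through here and was
            // silently dropped, stranding the validator in this state.
            _ => Ok(()),
        }
    }

    fn handle_unapproved_block(&self, _block: Vec<u8>) -> Result<(), String> {
        Ok(())
    }

    fn transition_to_initializing(&self, _approved: Vec<u8>) -> Result<(), String> {
        Ok(())
    }
}
```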
….conf, eliminate kamon (#492)

* feat(node): self-contained binary — embed Rholang resources, defaults.conf, eliminate kamon

Bake the genesis-ceremony Rholang sources and the HOCON defaults into the
node binary at compile time so a node only needs --data-dir and ports to
run. The production CWD requirement (workspace tree at runtime), the
`DEFAULT_DIR` env var, and the kamon.conf parallel-config-file shim are
all gone.

Changes:

- Embed all 11 .rho/.rhox genesis resources via include_str! in a new
  casper/src/rust/genesis/contracts/embedded_rho.rs. Rewrite the call
  sites in standard_deploys.rs to use a small `embedded_source` helper.
  Delete CompiledRholangSource::apply / apply_with_env / load_source and
  CompiledRholangTemplate::load_template (the file-loading machinery that
  walked an 8-path search ladder relative to CWD). Production code no
  longer touches the filesystem for Rholang sources; missing-asset bugs
  become build errors.
- Embed defaults.conf via include_str! + hocon::HoconLoader::load_str.
  Drop the `default_dir: &Path` parameter from
  configuration::builder::build and the `DEFAULT_DIR` env var read in
  main.rs. Override semantics unchanged: --config-file <path> and the
  <data_dir>/rnode.conf auto-load still apply on top of the embedded
  baseline.
- Eliminate kamon.conf entirely. The InfluxDB and Zipkin reporters in
  node/src/rust/diagnostics/ are hand-rolled Rust — they only borrowed
  the JVM-Kamon HOCON schema for migration compatibility. Move the two
  fields the Rust code actually consumed (tick_interval and the InfluxDB
  endpoint) into NodeConf::Metrics under defaults.conf. Delete KamonConf,
  configuration/kamon.rs, kamon.conf, the load_kamon_config parsing path,
  and the KamonConf plumbing through main.rs / mod.rs / diagnostics.
- Delete unreferenced JVM logging artifacts: logback.xml,
  logging-template/* (never read by any Rust code).
- Test-side filesystem .rho loading is preserved via a new
  casper/tests/util/rholang/test_rho_loader.rs (load_test_rho), kept out
  of the production binary. The 20 test files that loaded .rho fixtures
  from disk are migrated.
- Dockerfile: drop the COPY steps that staged node/src/main/resources/,
  casper/src/main/resources/, and rholang/examples/ into the runtime
  image. The binary is now fully self-contained.

Verified:

- cargo build --release --tests --workspace: clean
- cargo test --release -p casper --test mod: 345 passed / 0 failed
- ./target/release/node run --standalone reaches isReady:true from /tmp
  with no DEFAULT_DIR env var and no workspace tree on disk

* docs(node): align README + Helm chart with self-contained binary

Tail-end documentation and config-template cleanup that the
self-contained-binary commit (d129086) implied but didn't touch.

- README.md: clarify that the built-in defaults.conf is embedded into the
  binary at compile time via include_str! (not a runtime file lookup), so
  operators understand that the on-disk file is the source of the embed
  rather than a runtime dependency.
- docs/node/README.md: update the config build pipeline description.
  Step 1 now describes `HoconLoader::new().load_str(EMBEDDED_DEFAULTS)`
  with a note that no `node/src/main/resources/` directory is required
  at runtime. Step 2 is reworded for accuracy.
- node/src/main/resources/defaults.conf: drop the stale `kamon-influxdb`
  reference from the metrics-section comment.
- Helm chart cleanup:
  * docker/helm/f1r3fly/configs/common/logback.xml: deleted (the
    JVM-Logback config was never read by the Rust binary; the enclosing
    `node/src/main/resources/logback.xml` was deleted in d129086, but
    the Helm chart still mounted a sibling copy).
  * docker/helm/f1r3fly/templates/statefulsets.yaml: drop the
    /var/lib/rnode/logback.xml subPath mount that pointed at the
    now-removed ConfigMap entry.
  * docker/helm/f1r3fly/templates/{deployable,observer}-rnode-configmaps.yaml:
    rewrite the Kamon-era comment in the embedded defaults.conf template;
    metrics endpoints (tick-interval, influxdb-endpoint) now live under
    the same `metrics` section in NodeConf.

No code changes; the binary was rebuilt to verify the embedded HOCON
still parses cleanly. The Justfile already used --config-file overrides
correctly (no DEFAULT_DIR env var, no CWD requirement); no recipe
changes needed.
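A sketch of the embedding pattern these commits describe. The resource
paths and constant names are illustrative (include_str! requires the
files to exist at build time, which is exactly why missing assets become
build errors); `hocon::HoconLoader` and `load_str` are the crate API the
commit names, used here under that assumption:

```rust
// Illustrative paths: the real embed lives in embedded_rho.rs and the
// configuration builder; the point is compile-time inclusion.
const EMBEDDED_DEFAULTS: &str = include_str!("../resources/defaults.conf");
const REGISTRY_RHO: &str = include_str!("../resources/Registry.rho");

// Parse the embedded HOCON baseline; --config-file and
// <data_dir>/rnode.conf overrides still apply on top of this.
fn load_embedded_defaults() -> Result<hocon::Hocon, hocon::Error> {
    Ok(hocon::HoconLoader::new()
        .load_str(EMBEDDED_DEFAULTS)?
        .hocon()?)
}
```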
* fix(merge): expand rejection to DAG descendants to prevent stale diffs
When conflict resolution rejected a deploy chain, diffs from descendant
blocks (computed against the rejected chain's post-state) were still
applied to the LCA base, producing internally inconsistent merged state.
Reproduced at code level via stale_diff_application_corrupts_merged_state.
Rejection expansion: after conflict resolution, walk DAG descendants of
rejected blocks within merge scope and reject affected branches whole.
Conservative-only — no event-log refinement, since event logs miss the
indirect dependencies that cause the bug.
Deploy de-duplication: preemptive dedup on (source_block_number desc,
source_block_hash byte-lex asc). Dormant until the rejected-deploy
recovery mechanism ships.
Foundations:
- source_block_hash and source_block_number on DeployChainIndex
- block_number threaded through BlockIndex::new and its callers
- ConflictSetMerger::merge split into resolve_conflicts +
compute_merged_state so DagMerger can interpose expansion
Also:
- Hand-rolled Hash impl on DeployChainIndex matching PartialEq (the
derived Hash covered all fields, violating the hash/eq contract)
- Removed now-dead hash_code and pre_state_hash fields
- KeyValueRejectedDeployBuffer skeleton (will be wired in a follow-up)
- Two pre-existing proof tests marked #[ignore]:
* concurrent_registry_inserts_should_not_conflict — assertion
contradicts multi-parent DAG semantics; awaits rewrite
* finalization_does_not_guarantee_canonical_state — flaky
precondition under the two-bridge merge setup
Co-Authored-By: Claude <noreply@anthropic.com>
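The hand-rolled `Hash` note above is the standard hash/eq-contract
pattern. A sketch under an assumed field split; which fields form
`DeployChainIndex`'s identity is not spelled out in the message, so the
choice below is illustrative:

```rust
use std::hash::{Hash, Hasher};

// Suppose equality is keyed on deploy signatures alone while the struct
// carries extra metadata. Deriving Hash would fold the metadata in and
// break the contract: a == b must imply hash(a) == hash(b).
struct DeployChainIndex {
    deploy_sigs: Vec<Vec<u8>>,   // identity (assumed)
    source_block_number: i64,    // metadata, not identity
    source_block_hash: Vec<u8>,  // metadata, not identity
}

impl PartialEq for DeployChainIndex {
    fn eq(&self, other: &Self) -> bool {
        self.deploy_sigs == other.deploy_sigs
    }
}
impl Eq for DeployChainIndex {}

// Hand-rolled Hash restricted to exactly the fields PartialEq compares.
impl Hash for DeployChainIndex {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.deploy_sigs.hash(state);
    }
}
```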
* fix(merge): recover rejected deploys via per-node buffer
When the merge algorithm drops a deploy from the canonical merged state,
its data is now placed in a new RejectedDeployBuffer so the block creator
can re-propose it in a subsequent block. Previously rejected deploys were
silently lost even though their effects never made it into canonical state.
Buffer: KeyValueRejectedDeployBuffer mirrors KeyValueDeployStorage in
shape and LMDB backing (new "rejected_deploy_buffer" store registered in
RNodeKeyValueStoreManager; shares deploy_storage sizing).
Merge-time populate: dag_merger::merge now returns (sig, source_block_hash)
pairs. compute_parents_post_state groups by source block, fetches each
block once, extracts the Signed<DeployData>, and inserts into the buffer.
Scope awareness: CasperSnapshot carries a new rejected_in_scope DashSet,
populated alongside deploys_in_scope during the ancestor BFS. The cache
key covers both sets under one (generation, LFB) tuple. A lightweight
rejected_deploy_sigs decoder on KeyValueBlockStore returns the sig list
without decoding the full block body.
Re-inclusion filter: prepare_user_deploys unions DeployStorage with
RejectedDeployBuffer and re-includes any valid deploy that is both in
deploys_in_scope and rejected_in_scope — its effects never landed, so
proposing it again is correct.
Finalization cleanup: record_directly_finalized purges from both pools.
Sigs in body.deploys of a finalized block are removed from both storage
and buffer; sigs in body.rejected_deploys of a finalized block are also
removed from the buffer (definitively lost, not recoverable from here).
Co-Authored-By: Claude <noreply@anthropic.com>
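A compact sketch of the re-inclusion rule above, with the deploy pools
reduced to sig sets (the real prepare_user_deploys operates on full
Signed<DeployData> values and applies further validity filters):

```rust
use std::collections::HashSet;

type Sig = Vec<u8>;

// A deploy is proposable again only when its sig is both in scope (an
// ancestor carried it) and marked rejected in scope (a descendant merge
// dropped its effects): its effects never landed, so re-proposal is safe.
fn prepare_user_deploys(
    storage_sigs: &HashSet<Sig>,
    buffer_sigs: &HashSet<Sig>,
    deploys_in_scope: &HashSet<Sig>,
    rejected_in_scope: &HashSet<Sig>,
) -> HashSet<Sig> {
    storage_sigs
        .union(buffer_sigs)
        .filter(|sig| {
            let in_scope = deploys_in_scope.contains(*sig);
            let rejected = rejected_in_scope.contains(*sig);
            // Fresh deploys (never in scope) pass; in-scope deploys pass
            // only via the rejection exemption.
            !in_scope || rejected
        })
        .cloned()
        .collect()
}
```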
* fix(dag): restore invalid-block latest-message update and bonded-validator justifications
Four divergences from the source-of-truth Scala implementation had
disabled slashing visibility in the Rust node:
- new_latest_messages gated on !invalid, so equivocation blocks never
became a sender's latest message.
- The sender-advance branch gated on !invalid for the same reason.
- Block-creator justifications used valid_latest_metas (filtered),
excluding equivocators from the justification set and causing
justification_follows to reject otherwise-valid blocks.
- max_seq_nums used the filtered set too, omitting equivocators' sequence
numbers downstream.
With these restored, invalid_latest_messages fires as intended,
prepare_slashing_deploys issues slashes for equivocators, and the
pre-existing multi_parent_casper_should_succeed_at_slashing test passes.
Flips dag_storage_should_not_replace_latest_message_with_invalid_block_from_same_sender
to dag_storage_should_advance_latest_message_to_invalid_block_from_same_sender
with inverted assertions reflecting the corrected behavior.
Co-Authored-By: Claude <noreply@anthropic.com>
* feat(merge): recover merge-rejected slashes via block-creator dedup
When the merge rejects a deploy chain that contains a slash, the slash
effect is silently lost to cost-optimal rejection — SYS_SLASH_DEPLOY_COST
is 0 so any conflicting chain with cost >0 wins, and the equivocator
remains bonded. Attackers can sustain cheap conflicts to starve slashing
indefinitely.
The fix surfaces the rejected slash metadata from the merge step and has
the block creator re-issue any slash not already covered by its own
invalid_latest_messages view. The slash then lands in the merge block's
own body.system_deploys, bypassing cost-optimal rejection on the parents.
The merge pipeline stays pure — no runtime threading, no new validation
surface. Slash re-issuance flows through the existing SlashDeploy
execution path, so determinism invariants are unchanged.
- dag_merger::merge now returns (state, rejected_user_pairs, rejected_slash_pairs),
splitting rejected pairs by is_slash_deploy_id. Close-block and heartbeat
system deploys remain intentionally dropped.
- compute_parents_post_state extracts RejectedSlash metadata by reading
each distinct source block's body.system_deploys once. All slashes within
a block share a synthetic sig, so one rejected chain represents every
slash in the source block — iterating body.system_deploys produces the
right recovery set.
- New casper/src/rust/merging/rejected_slash.rs defines RejectedSlash and
filter_recoverable, with the dedup key being
(invalid_block_hash, issuer_public_key). Unit tests cover: own-slash
covers merge-rejected duplicate (Attack 6), merge-rejected survives when
uncovered by own (Attack 1), mixed coverage with multiple equivocators
(Attack 4), issuer discrimination on same equivocator (Attack 7), and
empty-input regression guard.
- block_creator::create calls compute_parents_post_state once before
system-deploy construction to surface the rejected slashes, dedups
against own slashing_deploys, and appends non-duplicates as fresh
SlashDeploys signed under the proposer's identity. The downstream
compute_deploys_checkpoint call hits the parents-post-state cache so
the merge is not re-run.
- ParentsPostStateCacheVal extended to (StateHash, Vec<Bytes>, Vec<RejectedSlash>)
so cache hits return the full 3-tuple.
- Regression assertion in bridge_query_survives_multi_parent_merge
confirms non-slash merges surface an empty rejected_slashes list.
Co-Authored-By: Claude <noreply@anthropic.com>
* feat(api): deploy_finalization_status query by deploy sig
Adds a canonical-state finalization status API for deploys, replacing
block-hash polling. After the merge fix, a block can finalize while some
of its deploys' effects were dropped by merge rejection — polling by
block hash returns true even though canonical state disagrees. Polling
by deploy sig via this API correctly reports the effect's presence in
canonical state.
States follow the design decision:
- Finalized — sig in a finalized block's body.deploys with is_failed=false,
and not in any finalized descendant's body.rejected_deploys
- Failed — sig in a finalized block with is_failed=true (explicit
runtime failure)
- Pending — sig alive: in deploy storage, in a non-finalized block, in
the rejected-deploy buffer awaiting re-proposal, or rejected after
finalization and awaiting canonical recovery
- Expired — valid_after_block_number + deployLifespan elapsed without
canonical inclusion
Response carries `state`, `rejection_count`, and `latest_block_hash`
(optional).
Architecture: single-pass canonical-chain walk from LFB backward for
deployLifespan blocks. For each block: check body.deploys for a clean or
failed match, check body.rejected_deploys for a sig match. Track the
highest-height observation for `latest_block_hash`, count rejection
occurrences for `rejection_count`, and resolve the terminal state from
the observations. Uses the lightweight rejected_deploy_sigs decoder to
avoid full body decode on the rejection-check arm.
Defensive error handling:
- Storage errors during first-seen block fetch → propagated as API error
- Missing block body when sig is indexed → warn log + Pending_unknown
- Sig indexed but absent from body.deploys → API error (state inconsistency)
- LFB with no block_number entry → API error (invariant violation)
- Blocks missing from store during scan → warn log + continue (scan
robustness over hard failure; result may be incomplete)
Trait addition: `Casper::casper_shard_conf() -> &CasperShardConf` to give
BlockAPI access to deployLifespan. Impls added on MultiParentCasperImpl
and both NoOpsCasperEffect test stubs.
gRPC surface:
- DeployServiceCommon.proto: DeployFinalizationStatusQuery message,
DeployFinalizationStateProto enum, DeployFinalizationStatusInfo message
(with optional latestBlockHash for explicit absent/present)
- DeployServiceV1.proto: rpc deployFinalizationStatus +
DeployFinalizationStatusResponse
- node/src/rust/api/deploy_grpc_service_v1.rs: server handler delegating
to BlockAPI
HTTP surface:
- node/src/rust/api/web_api.rs: WebApi trait method +
DeployFinalizationStatusJson with Option<String> for latest_block_hash
so JSON serializes null when absent
- node/src/rust/web/web_api_routes.rs: GET
/api/deploy-finalization-status/{deploy_sig_hex}
Tests:
- casper lib tests (2): state enum construction, state distinctness
- casper integration smoke test (1): unknown_sig_returns_pending_with_empty_fields
exercises the full EngineCell → BlockAPI path
Performance: zero background cost; O(deployLifespan) block-sig reads per
query, dominated by proto decode on the lightweight rejected_deploy_sigs
decoder. Sub-millisecond for typical lifespans.
Consensus safety: read-only API, no new attack surface, no new storage,
no new trait methods beyond the shard_conf getter.
Deep end-to-end tests (Finalized, Failed, Expired, nonzero rejection
count) require real equivocation + merge-rejection fixtures and are
deferred.
Co-Authored-By: Claude <noreply@anthropic.com>
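The state space and response shape, transcribed from the commit text into
a sketch (exact proto/JSON representations are assumed):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DeployFinalizationState {
    Finalized, // clean finalized inclusion, no canonical-descendant rejection
    Failed,    // finalized with is_failed = true (explicit runtime failure)
    Pending,   // alive: storage, non-finalized block, buffer, or awaiting recovery
    Expired,   // valid_after_block_number + deploy_lifespan elapsed uncommitted
}

struct DeployFinalizationStatus {
    state: DeployFinalizationState,
    rejection_count: u32,
    latest_block_hash: Option<String>, // serializes to JSON null when absent
}
```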
* feat(casper): gate rejected-deploy buffer population on finalization status
Catching-up validators replay historical blocks to get to the current
tip. For each block with non-empty body.rejected_deploys, the buffer-
population path extracts the rejected sigs' DeployData and adds them to
the local rejected-deploy buffer for re-proposal. Without a status
check, this admits sigs that have already been re-proposed and
finalized elsewhere in the chain, or sigs past their deployLifespan.
Two failure modes:
- Double-execution of already-finalized work. A rejected sig is added to
the local buffer; on the validator's next proposal round, the buffer
read includes the deploy; the new block contains the deploy; dedup
picks the new proposal over the older finalized copy within merge
scope; the merge produces a re-execution of canonical work against a
different pre-state. Effects diverge. Consensus forks.
- Past-lifespan noise. The buffer read filter drops past-lifespan sigs
at proposal time, but the entries still accumulate and churn through
storage.
Fix: before admitting each sig to the buffer, run the deploy
finalization status resolver. Admit only if the state is Pending. Skip
Finalized / Failed / Expired — those sigs are terminally resolved in
the local canonical view and must not be re-proposed.
The gate is unconditional — not "catchup mode" flagged. A live merge
that re-emits a canonically-finalized sig would be equally unsafe; the
same gate defends against both.
Implementation:
- Extracted BlockAPI::deploy_finalization_status's algorithm into a
pure function `deploy_finalization_status::resolve(dag, block_store,
deploy_lifespan, sig)`. The async BlockAPI method now reduces to a
thin wrapper that unwraps the engine cell and delegates. This makes
the resolver callable from compute_parents_post_state without
threading an EngineCell through the merge layer.
- Added should_admit_to_rejected_buffer helper in interpreter_util.rs
that calls resolve and applies the admit rule. Conservative
skip-on-error: transient storage failures skip the sig with a warn
log; consistency errors skip with a warn log. Never admit on error —
admit-on-error would reintroduce the double-execution bug under
flaky storage.
- Wired the helper into compute_parents_post_state's buffer-populate
block as a single predicate call, replacing the direct push.
Tests:
- Pure-resolver direct call: resolve_pure_function_returns_pending_for_unknown_sig
verifies the extracted function is callable from a non-engine-cell
context.
Deferred to later test work:
- Integration test exercising the gate-skips-finalized path (needs a
fixture that produces merge rejection AND later finalization of the
same sig — overlaps with equivocation + merge-rejection work).
- Full multi-node catchup simulation.
Consensus safety: the gate is a strict reduction of what enters the
buffer. Never adds sigs that weren't there; only drops sigs with a
terminal status in the current canonical view. Deterministic per
validator's DAG view.
Performance: O(deployLifespan) block reads per admit decision. For
typical rejection rates (0-3 per merge, lifespan ~50) this is sub-ms.
Full catchup of 1000 historical blocks with average 2 rejections each
adds ~100K block reads cumulatively — seconds of wall time.
Co-Authored-By: Claude <noreply@anthropic.com>
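A sketch of the admit rule, with the resolver reduced to a pre-computed
result and stderr standing in for the warn-level log. The conservative
branch is the load-bearing part: never admit on error, since admit-on-error
would reintroduce the double-execution bug under flaky storage:

```rust
#[derive(Debug)]
enum DeployFinalizationState { Finalized, Failed, Pending, Expired }

fn should_admit_to_rejected_buffer(
    status: Result<DeployFinalizationState, String>,
    sig_hex: &str,
) -> bool {
    match status {
        Ok(DeployFinalizationState::Pending) => true,
        Ok(state) => {
            // Finalized / Failed / Expired are terminal in the local
            // canonical view; re-proposing them is unsafe or pointless.
            eprintln!("skip {sig_hex}: terminal state {state:?}");
            false
        }
        Err(e) => {
            eprintln!("skip {sig_hex}: resolver error {e}; never admit on error");
            false
        }
    }
}
```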
* fix(api): walk finalized-ancestor BFS in deploy_finalization_status
The resolver walked `main_parent_chain` from LFB backward — a linear
walk that only visits a block's first (main) parent at each step. In
a multi-parent DAG, a deploy's effects can reach canonical state via
a secondary-parent merge; the main-parent chain alone misses those
blocks, so the sig is reported Pending even after it finalized.
Fix: BFS from LFB through every parent slot (main + secondary) bounded
by deploy_lifespan depth. `visited` dedups the frontier because
multi-parent ancestries share common ancestors.
Phase G's catchup gate uses the same resolver, so it inherits the fix
automatically.
Regression test: `resolve_finds_sig_in_secondary_parent_branch`
builds a minimal DAG (genesis → A, B siblings → C with A as main,
B as secondary) and places the deploy sig only in B. The test fails
with Pending on the main-parent walk and passes with Finalized on
the BFS, locking in the semantics.
Co-Authored-By: Claude <noreply@anthropic.com>
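A self-contained sketch of the corrected walk, with DAG access abstracted
to closures (the real resolver reads parents from block metadata):

```rust
use std::collections::{HashSet, VecDeque};

type BlockHash = [u8; 32];

// Breadth-first from the LFB through every parent slot (main + secondary),
// bounded by deploy_lifespan levels. The visited set dedups the frontier
// because multi-parent ancestries share common ancestors; a
// main-parent-only walk misses effects merged in via secondary parents.
fn bfs_finalized_window(
    lfb: BlockHash,
    deploy_lifespan: u64,
    parents_of: impl Fn(&BlockHash) -> Vec<BlockHash>,
    mut visit: impl FnMut(&BlockHash),
) {
    let mut visited: HashSet<BlockHash> = HashSet::new();
    let mut frontier: VecDeque<(BlockHash, u64)> = VecDeque::new();
    visited.insert(lfb);
    frontier.push_back((lfb, 0));

    while let Some((block, depth)) = frontier.pop_front() {
        visit(&block);
        if depth >= deploy_lifespan {
            continue; // lifespan bound: older blocks cannot matter
        }
        for parent in parents_of(&block) {
            // insert returns false for already-seen ancestors
            if visited.insert(parent) {
                frontier.push_back((parent, depth + 1));
            }
        }
    }
}
```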
* fix(validate): honor rejected_in_scope exemption in repeat_deploy
The repeat-deploy check rejected any block whose body.deploys contained
a sig already present in an ancestor's body.deploys. This predates the
rejected-deploy-buffer recovery pipeline (Phase D): when a deploy is
rejected by a descendant merge within deploy_lifespan, the buffer
re-proposes it in a later block — a legitimate re-inclusion, not a
repeat. Without this exemption, every recovery-path block fails
validation with InvalidRepeatDeploy, the proposer retries the same
deploys, and the shard deadlocks on heartbeat propose attempts under
any merge-rejection workload.
Fix: filter sigs present in s.rejected_in_scope out of the check set
before the BFS. CasperSnapshot already computes rejected_in_scope by
walking body.rejected_deploys in the current proposal's parent scope;
prepare_user_deploys uses the same signal on the proposer side. The
validator now mirrors the proposer.
Regression test: repeat_deploy_validation_allows_recovered_deploy_from_\
rejected_in_scope builds the exact DAG shape the existing
"should not accept" test uses, then pre-populates rejected_in_scope
with the deploy's sig. Pre-fix returns Invalid(InvalidRepeatDeploy);
post-fix returns Valid.
Co-Authored-By: Claude <noreply@anthropic.com>
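The exemption reduced to its set operation. Note that a later commit in
this series gates this exemption on finalization status, so treat this as
the intermediate form:

```rust
use std::collections::HashSet;

type Sig = Vec<u8>;

// Drop sigs that the current parent scope marks as rejected before
// running the ancestor repeat scan, mirroring prepare_user_deploys on
// the proposer side.
fn repeat_check_set(
    block_deploy_sigs: &HashSet<Sig>,
    rejected_in_scope: &HashSet<Sig>,
) -> HashSet<Sig> {
    block_deploy_sigs
        .difference(rejected_in_scope)
        .cloned()
        .collect()
}
```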
* fix(merge): recover dedup collateral via rejected-deploy buffer
When dag_merger's deploy de-duplication discards a chain because some
deploy in it has a fresher copy elsewhere, deploys unique to the
discarded chain were silently dropped — not added to the rejected-deploy
buffer, not in rejected_in_scope, and the deployer had no signal.
Collect collateral-lost deploys (those unique to a dropped chain) into
the rejected-user list so the buffer can recover them in a subsequent
block, mirroring how conflict-rejected deploys recover.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(api): require canonical-descendant rejection to invalidate clean inclusion
deploy_finalization_status::resolve was invalidating a clean finalized
inclusion if any rejection at a strictly higher height was observed. In
multi-parent DAGs, a rejection in a sibling block at the same or higher
height does not affect a deploy's effects in a canonical block on a
different chain. Recovery cycles via the rejected-deploy buffer can also
produce rejection events in non-canonical sibling blocks (validators
racing to recover the same deploy), and the height-only check turned
those into a positive feedback loop where the deploy stayed Pending
while the buffer kept re-proposing.
Track each rejection's block hash alongside its height and require the
rejection block to be a canonical-chain descendant of the clean block
(via is_in_main_chain) before invalidating. Same-block rejections (the
clean inclusion and rejection share a block — e.g., a recovery proposal
whose merge step also dedup-rejected an older copy in scope) are
excluded explicitly.
Co-Authored-By: Claude <noreply@anthropic.com>
* test(helper): initialize tracing subscriber in TestNode::create_network
casper/tests/mod.rs defines an init_logger() guarded by Once, but it had
no callers in the test tree. Production code with tracing::debug!/info!/
warn! calls produced no output during tests, making diagnostic logs
useless when investigating failures.
Wire init_logger() into TestNode::create_network so any test that builds
a network gets a tracing subscriber wired up with EnvFilter respecting
RUST_LOG. Behavior is unchanged when RUST_LOG is unset (default ERROR
level filter).
Co-Authored-By: Claude <noreply@anthropic.com>
* docs(dag): document invalid-block LMM safety argument
Address PR #488 review request to document why removing the `invalid`
guard from latest-message-map updates (commit 61b7394) is safe for
fork choice and finalization. The behavior change matched the Scala
source-of-truth (BlockDagKeyValueStorage.scala / MultiParentCasperImpl.scala)
but the safety argument was implicit; the reviewer asked for an explicit
explanation.
Three comments added:
- block_dag_key_value_storage.rs::insert::new_latest_messages — primary
safety-argument anchor at the storage site the reviewer flagged.
Covers the four-point argument (fork choice unaffected because parent
selection filters via valid_latest_msgs; slashing requires invalid
blocks in LMM; justification_follows requires every bonded validator;
pre-fix guard had no Scala counterpart and silently disabled
slashing).
- multi_parent_casper_impl.rs::create_block_data justifications block —
strengthened existing comment to explicitly cite parent selection's
filter at line ~160 as the reason fork choice is unaffected.
- multi_parent_casper_impl.rs max_seq_nums block — strengthened
comment to explain the equivocator-reset attack the unfiltered
read prevents (filtering would let an equivocator reset their
seq-number floor).
No behavior change. Casper test suite still 347/347.
Co-Authored-By: Claude <noreply@anthropic.com>
* perf(api): batch deploy_finalization_status resolver, tighten canonical-descendant invalidation
Address PR #488 review #4 (BFS-per-deploy performance) and a related
correctness gap surfaced while writing the regression test for the
refactor.
## Batched resolver (Review #4)
The catchup-heavy hot path in `compute_parents_post_state` previously
called `deploy_finalization_status::resolve` once per rejected deploy
sig. Each call did its own BFS over the finalized window, so a merge
with N rejections did N independent walks of the same M-block scope —
O(N · M). Reviewer named this as the catchup case "where this gate
matters most" and suggested batching.
Refactor lifts the per-sig BFS state into a `ResolverState` struct and
splits the resolver into shared helpers (`run_prelude`,
`bfs_finalized_window`, `finalize_sig_state`). New `resolve_batch(sigs)`
does a single BFS pass that updates per-sig state for every sig found
in body.deploys / body.rejected_deploys. Cost drops to O(M + N).
Existing single-sig `resolve(sig)` becomes a thin wrapper over the
shared helpers. Both entry points have identical error semantics:
prelude inconsistencies (sig indexed at a block that no longer claims
the sig in body.deploys) propagate as `Err` so corruption is surfaced
honestly rather than silently masquerading as `pending_unknown`. The
batch caller in `compute_parents_post_state` wraps the call in a
"skip on Err" fallback that admits nothing for the merge step, so
behavior at the catchup gate is unchanged when the corruption case
hits — but now it is loud rather than hidden.
Call site in `interpreter_util.rs::compute_parents_post_state`
replaces per-sig `should_admit_to_rejected_buffer` with one batched
`compute_rejected_buffer_admits` precompute, then dictionary lookups
during the per-block iteration. For a 50-rejected merge with M=200,
this is 200 block fetches instead of 10,000.
## Canonical-descendant invalidation gap (surfaced by parity test)
Writing a multi-parent parity test for the refactor exposed a
pre-existing gap between the resolver's intent and its implementation:
- Intent (per the inline comment): a rejection invalidates a clean
inclusion only when the rejection is on the canonical chain.
- Implementation: `is_in_main_chain(clean_block, reject_block)` —
walks reject_block's main-parent ancestry checking for clean_block.
This is necessary but not sufficient. A non-canonical sibling B'
with main parent A still passes this check, even though B' itself
is not on LFB's main-parent chain.
Concrete reproduction (now in
`resolve_and_resolve_batch_agree_across_states`): four-block DAG
genesis → A → {B, S} → C with C as LFB. B is canonical (main parent
of C), S is a non-canonical sibling (reachable only via C's
secondary-parent slot). Sig is in A.body.deploys (clean) and
S.body.rejected_deploys (sibling rejection). Pre-fix: resolver
reports Pending because is_in_main_chain(A, S) is true. Post-fix:
resolver reports Finalized because S itself is not on C's main-parent
chain.
Severity: false-negative, not unsafe. `Pending` for sigs that are in
canonical state. Polling clients keep polling unnecessarily; the
catchup gate admits already-canonical sigs to the buffer, where
dedup handles them harmlessly. But under PR #488's recovery
workload — competing recovery proposals on bridge contracts — this
produces user-visible "stuck Pending" behavior for finalized
deploys. Worth fixing alongside the batching work.
Fix: in `finalize_sig_state`, the canonical-descendant check now
also requires `is_in_main_chain(reject_block, lfb)` to be true.
One extra `is_in_main_chain` call per sig with both clean and
reject events. Implementation now matches intent.
## Tests
New `resolve_and_resolve_batch_agree_across_states` test in
`casper/tests/api/deploy_finalization_status_test.rs` builds a
production-shape multi-parent DAG covering five resolver branches
(clean via secondary parent, failed canonical, clean+canonical
rejection, clean+sibling rejection, unknown). Single-sig `resolve`
and batched `resolve_batch` results compared for parity on every
sig. Test failed before the canonical-descendant fix
(`clean_canonical_reject_sibling` case) and passes after. Casper
suite 348/348 (was 347/347 pre-refactor, +1 for the parity test).
Co-Authored-By: Claude <noreply@anthropic.com>
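A sketch of the batching shape. `ResolverState` and the window iterator
are simplified stand-ins for the real helpers (`run_prelude`,
`bfs_finalized_window`, `finalize_sig_state` handle error propagation and
the canonical-descendant rule, omitted here):

```rust
use std::collections::HashMap;

type Sig = Vec<u8>;

// Per-sig observations accumulated during the single shared BFS pass
// (field names assumed from the commit text).
#[derive(Default)]
struct ResolverState {
    clean_block: Option<(u64, Vec<u8>)>, // (height, hash) of clean inclusion
    reject_events: Vec<(u64, Vec<u8>)>,  // (height, hash) per rejection seen
}

// One walk of the M-block finalized window updates state for all N sigs:
// O(M + N) instead of N independent walks.
fn resolve_batch(
    sigs: &[Sig],
    // each item: (height, block_hash, body_deploys, body_rejected_deploys)
    window_blocks: impl Iterator<Item = (u64, Vec<u8>, Vec<Sig>, Vec<Sig>)>,
) -> HashMap<Sig, ResolverState> {
    let mut states: HashMap<Sig, ResolverState> = sigs
        .iter()
        .map(|s| (s.clone(), ResolverState::default()))
        .collect();
    for (height, block_hash, deploys, rejected) in window_blocks {
        for sig in &deploys {
            if let Some(st) = states.get_mut(sig) {
                // keep the highest observation for latest_block_hash
                if st.clean_block.as_ref().map_or(true, |(h, _)| height > *h) {
                    st.clean_block = Some((height, block_hash.clone()));
                }
            }
        }
        for sig in &rejected {
            if let Some(st) = states.get_mut(sig) {
                st.reject_events.push((height, block_hash.clone()));
            }
        }
    }
    states
}
```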
* fix(validate): gate repeat_deploy recovery exemption on Finalized status
Address remaining concern from PR #488 review #5
(`validate.rs:347`). Commit `3ee91fb5` fixed the functional bug
(legitimate recovery being blocked by `InvalidRepeatDeploy`) but left
a defense-in-depth gap that the reviewer's original prose flagged:
"if a deploy is in `rejected_in_scope` and appears in the block but is
NOT a legitimate re-proposal (e.g., a malicious validator re-includes
a deploy that was rejected for a valid reason), the repeat-deploy
check won't catch it."
Both the block-side filter (current) and the ancestor-side filter
(reviewer's suggested wording) operate on the same global
`rejected_in_scope` set, so they give identical results in every case
— including the double-execution scenario where a sig has BOTH a
clean canonical inclusion AND `rejected_in_scope` membership (e.g.,
because a sibling-fork rejection landed on the canonical chain via a
merge). Under either filter shape, such a sig is exempted from the
repeat check and the validator allows re-execution of an
already-finalized deploy.
The catchup gate (`should_admit_to_rejected_buffer`) is the primary
defense — it calls `deploy_finalization_status::resolve` before
admitting a sig to the rejected-deploy buffer and skips terminal
states. But the validator-side check is meant to be a second line of
defense for the case where the gate is bypassed (bug, race, colluding
proposer); under the current implementation it is missing for this
exact scenario.
## Fix
Gate the recovery exemption on the sig's current finalization status.
A sig in `rejected_in_scope` is exempted from the repeat check ONLY
when its status is NOT `Finalized`:
- `Pending` / `Failed` / `Expired`: no clean canonical inclusion;
re-inclusion is the only way to land effects in canonical state.
Exempt from check (recovery legitimate).
- `Finalized`: clean canonical inclusion that survived all
canonical-descendant rejections; effects ARE already canonical.
Re-inclusion is double-execution. Do NOT exempt; let the ancestor
scan flag the repeat.
- Resolver error: conservative-fail — keep the sig in the check set
so an inconsistency surfaces as `InvalidRepeatDeploy` rather than
being silently exempted.
## Tests
Two regression tests on the same code path:
- `repeat_deploy_validation_allows_recovered_deploy_from_rejected_in_scope`:
restructured to model TRUE recovery — the deploy lives only in a
non-canonical / non-finalized ancestor (status `Pending`), so the
exemption applies and the block validates. Passes both pre- and
post-fix.
- `repeat_deploy_blocks_double_execution_when_finalized_and_in_rejected_in_scope`:
new gap test. Same `rejected_in_scope` membership, but the sig
has a clean canonical inclusion in genesis (LFB), so status is
`Finalized`. Pre-fix: filter exempts the sig and validation
returns `Valid` (gap reproduced). Post-fix: filter declines the
exemption, ancestor scan finds the canonical inclusion, and
validation returns `InvalidRepeatDeploy` as it should.
Casper suite: 349/349 (was 348/348, +1 for the new gap test).
Co-Authored-By: Claude <noreply@anthropic.com>
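The tightened exemption as a pure decision table (the enum reuses the
status sketch from earlier; the real filter runs inside validate.rs with
the resolver called per candidate sig):

```rust
enum DeployFinalizationState { Finalized, Failed, Pending, Expired }

fn exempt_from_repeat_check(
    in_rejected_scope: bool,
    status: Result<DeployFinalizationState, String>,
) -> bool {
    match (in_rejected_scope, status) {
        // not in rejected_in_scope: no exemption to consider
        (false, _) => false,
        // effects already canonical: re-inclusion is double-execution
        (true, Ok(DeployFinalizationState::Finalized)) => false,
        // Pending / Failed / Expired: recovery is legitimate
        (true, Ok(_)) => true,
        // conservative-fail: keep the sig in the check set so an
        // inconsistency surfaces as InvalidRepeatDeploy
        (true, Err(_)) => false,
    }
}
```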
* fix(block_creator): exempt rejected_in_scope sigs from self-chain dedup
When the merge engine rejects a deploy that was originally signed into
the proposer's own self-chain, prepare_user_deploys correctly admits it
from the rejected-deploy buffer (its rejected_in_scope exemption), but
collect_self_chain_deploy_sigs immediately drops it again because the
sig appears in the proposer's prior block. Mirror the same exemption in
the self-chain dedup filter.
Adds an end-to-end recovery-cycle test using same-key vault depletion
as a deterministic conflict source: two deploys whose combined
precharge exceeds the source vault's balance, triggering
fold_rejection. The test is arranged so validator 0's prior block
contains the rejected sig and validator 0 is the recovery proposer —
the only configuration where the self-chain filter is load-bearing.
Verified 10/10 pass with fix, 10/10 fail without.
Removes finalization_does_not_guarantee_canonical_state — an ignored
test whose precondition assertion (rejected_deploys empty) is
contradicted by the multi-parent DAG design and is superseded by the
new recovery-cycle coverage.
Co-Authored-By: Claude <noreply@anthropic.com>
* docs(casper): document rejected-deploy buffer and rejected_in_scope exemption
Updates the CasperSnapshot struct definition in casper/README.md (it had
drifted: missing rejected_in_scope, stale collection types, swapped
invalid_blocks key/value).
Documents the rejected-deploy buffer recovery cycle in both
casper/README.md and casper/CONSENSUS_PROTOCOL.md:
- prepare_user_deploys pulls recovered sigs from
KeyValueRejectedDeployBuffer
- the in-scope dedup filter exempts sigs in rejected_in_scope so
those recovery candidates aren't immediately dropped
- the same exemption applies to collect_self_chain_deploy_sigs
- merge-engine fallback discards land in the buffer rather than
being silently dropped
Co-Authored-By: Claude <noreply@anthropic.com>
* test(merge): cover dedup orphan recovery path
Adds a code-level regression test for the rejected-deploy buffer's
dedup-orphan path: when `dag_merger::merge` drops a chain via the
freshness rule (block_number, byte-lex hash), any deploy unique to the
dropped chain lands in `collateral_lost_pairs` and is admitted to the
buffer alongside conflict-rejected sigs. Fixture builds two siblings
with the shared deploy_x and validator-unique markers V/W (event-log
linked through a shared channel), then asserts exactly one of {sig_V,
sig_W} reaches the buffer.
Removes the ignored `concurrent_registry_inserts_should_not_conflict`
test — its `rejected.is_empty()` precondition contradicts multi-parent
DAG semantics and is superseded by the recovery-cycle and dedup-orphan
coverage in batch2.
Co-Authored-By: Claude <noreply@anthropic.com>
* test(merge): cover slash recovery via multi-parent merge and re-issuance
Two tests in slash_recovery_spec.rs:
* slash_for_equivocator_survives_multi_parent_merge — end-to-end
through TestNode. Three validators; node 0 equivocates; nodes 1 and
2 each propose a SlashDeploy-bearing block; node 1 merges both as
parents. Asserts post-merge equivocator stake at the bond floor and
the merge proposer's stake unchanged.
* e1c_re_issues_merge_rejected_slash — focused on the re-issuance
loop in block_creator::create. A synthetic RejectedSlash is written
into the parents-post-state cache so the merge proposer's
compute_parents_post_state returns it as if the merge engine had
rejected a slash chain. Different issuer_public_key keeps it past
filter_recoverable. Asserts a SlashDeploy entry for the
equivocator's invalid_block lands in the proposed body — TDD-verified
by commenting out the loop.
Co-Authored-By: Claude <noreply@anthropic.com>
* test(merge): cover multi-validator buffer convergence dedup
Two validators independently buffer the same conflict-rejected sig
and each re-propose it in a recovery block alongside a
validator-unique marker deploy. The markers consume from a channel
the recovered sig produces on, putting [deploy_x, marker] in a single
event-log chain inside each block. Distinct marker sigs keep the two
chains' deploys_with_cost sets unequal so conflict_set_merger's
HashSet collapse cannot merge them; dag_merger::merge's
freshness-based dedup is then the sole mechanism that prevents
surfacing the shared sig as a conflict-rejected duplicate.
Asserts the shared sig stays out of rejected_user_deploys and exactly
one of {marker_v0, marker_v1} is orphaned (the loser's unique deploy).
TDD-verified: commenting out the dedup retain logic causes the orphan
assertion to fail.
Co-Authored-By: Claude <noreply@anthropic.com>
* chore: drop PR/review reference from test comment
Replaces a "PR #488 review #4" reference with the substantive
explanation already in the surrounding comment. PR numbers and review
thread positions are not stable across squash/rebase or PR re-creation,
so they don't belong in committed source.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(buffer): purge expired sigs from rejected-deploy buffer
prepare_user_deploys removes block-expired and time-expired sigs from
deploy_storage but not from KeyValueRejectedDeployBuffer. The read
path filters expired sigs out of `valid_unique` so they aren't
re-proposed, but on-disk LMDB entries persist. Combined with no
admission size cap on the buffer, a sustained-load adversary that
keeps generating conflicts can grow the buffer unbounded. Extend the
expired-removal sweep to call buffer.remove(expired_list) alongside
the storage cleanup.
Adds should_remove_expired_deploys_from_rejected_deploy_buffer to
exercise the regression-relevant case (sigs in buffer but not in
storage — the realistic state after a sig has been conflict-rejected
and the original storage entry has aged out via prior sweeps). TDD
red-green confirmed: commenting out the buffer.remove call fails the
test.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(merge): dedup rejected slashes by equivocator only
filter_recoverable previously keyed dedup on (invalid_block_hash,
issuer_public_key), but the proposer re-signs every E1c slash under
their own pk. Different-issuer slashes for the same equivocator
therefore all survived dedup, and the proposer emitted multiple
SlashDeploys for the same equivocator into the merge block body —
saved by PoS idempotency, but inflating block size and wasting
execution on redundant slashes.
Two surfaces of the same bug:
1. Proposer V3 has E in invalid_latest_messages (own-detected slash)
AND the merge engine surfaces a RejectedSlash for E from a
different original issuer V2. Today both land in body; after this
fix only the own-detected slash lands.
2. Multiple validators independently propose slash chains for the
same equivocator E and all chains are merge-rejected. Today every
rejected slash survives dedup; after this fix exactly one survives
per equivocator.
Drops the issuer_public_key out of the dedup key entirely (it's
provenance, not identity for dedup purposes), keys both sides on
invalid_block_hash, and sorts survivors deterministically for
body-hash determinism across replays.
Adds two tests to rejected_slash.rs covering each surface
(same_equivocator_across_issuers_dedups_to_one,
own_detection_drops_rejected_from_other_issuer); replaces the
previous dedup_key_discriminates_by_issuer test which endorsed the
buggy behavior. TDD red-green confirmed: disabling either filter
fails the corresponding tests.
Co-Authored-By: Claude <noreply@anthropic.com>
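A sketch of the corrected dedup. The struct shape follows the commit text;
the own-detected side is reduced to a set of invalid block hashes:

```rust
use std::collections::HashSet;

#[derive(Clone)]
struct RejectedSlash {
    invalid_block_hash: Vec<u8>,
    // provenance only: the proposer re-signs recovered slashes under its
    // own key, so the issuer must not participate in the dedup key
    issuer_public_key: Vec<u8>,
}

fn filter_recoverable(
    own_slashed_block_hashes: &HashSet<Vec<u8>>, // own-detected slashes
    merge_rejected: Vec<RejectedSlash>,
) -> Vec<RejectedSlash> {
    let mut seen: HashSet<Vec<u8>> = own_slashed_block_hashes.clone();
    let mut survivors: Vec<RejectedSlash> = merge_rejected
        .into_iter()
        // exactly one survivor per equivocation's invalid block
        .filter(|rs| seen.insert(rs.invalid_block_hash.clone()))
        .collect();
    // deterministic order for body-hash determinism across replays
    survivors.sort_by(|a, b| a.invalid_block_hash.cmp(&b.invalid_block_hash));
    survivors
}
```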
* fix(slash): exclude already-slashed validators via active_validators
prepare_slashing_deploys filtered invalid_latest_messages by
bonds_map.stake > 0 only. With bond floor = 0 (test default), an
already-slashed validator has stake = 0 and is filtered out — the
existing tests rely on this. With bond floor > 0 (production),
already-slashed validators retain stake at the floor, satisfy the
> 0 check, and the proposer emits a redundant SlashDeploy in every
block until the equivocator's invalid latest message ages out of the
DAG view. Saved by PoS slash idempotency, but inflates body size and
wastes execution.
OnChainCasperState::active_validators is the canonical
"validators eligible to participate" set, queried from PoS at the
parent post-state. PoS removes slashed validators from it regardless
of bond floor. synchrony_constraint_checker.rs already uses this
pattern. Adopt the same check here.
Extracts the filter into a private filter_slashable_invalid_messages
helper and adds three inline unit tests covering each branch of the
filter, including the regression-relevant case where stake > 0 but
the validator is no longer in active_validators. TDD red-green
confirmed: disabling the active_validators check fails the
already-slashed test while leaving the other two green.
Co-Authored-By: Claude <noreply@anthropic.com>
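A sketch of the extracted filter, with the PoS state reduced to plain maps
and sets:

```rust
use std::collections::{HashMap, HashSet};

type Validator = Vec<u8>;
type BlockHash = Vec<u8>;

// A validator is slashable only if it is still in PoS's active_validators
// set at the parent post-state. A stake > 0 check alone passes
// already-slashed validators whenever the bond floor is > 0.
fn filter_slashable_invalid_messages(
    invalid_latest_messages: &HashMap<Validator, BlockHash>,
    bonds_map: &HashMap<Validator, u64>,
    active_validators: &HashSet<Validator>,
) -> Vec<(Validator, BlockHash)> {
    invalid_latest_messages
        .iter()
        .filter(|(v, _)| {
            bonds_map.get(*v).copied().unwrap_or(0) > 0
                && active_validators.contains(*v) // excludes already-slashed
        })
        .map(|(v, h)| (v.clone(), h.clone()))
        .collect()
}
```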
* chore: cargo fmt all PR-touched files
Runs rustfmt on every file modified in this PR that drifted from the
formatter. Pure formatting — no code changes. Verified: full casper
suite (356 integration + 66 lib unit) and block-storage suite (48)
all pass post-format with zero warnings on cargo check.
Co-Authored-By: Claude <noreply@anthropic.com>
* chore: drop deferred-work reference from test header comment
The previous file header had a paragraph describing "deep end-to-end
coverage (multi-equivocation, cost-starvation simulation,
merge-rejection-then-recovery) ... tracked separately." That kind of
deferred-work pointer doesn't belong in committed source — there's no
durable target for "separately" to point at, and the language ages
poorly. The remaining sentence describes what the file covers, which
is the only thing the comment needs to say.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(slash): key SlashDeploy seed on invalid_block_hash
generate_slash_deploy_random_seed previously hashed only
(SYSTEM_DEPLOY_PREFIX || validator_pk || seq_num). Every SlashDeploy
in the same block — own-detected and recovered — received an identical
rng seed. The slash contract opens `new rl, poSCh, ...` whose
unforgeable channel names are derived from this rng, so two slashes
in the same block alias these channels in the tuplespace, and the
return-channel routing the system_deploy infrastructure uses to
extract each slash's PoS response keys on the same name. The author's
earlier comment ("completely sure that collision cannot happen")
assumed one slash per block — an assumption that held when the LMM
filtered invalid blocks (Section 3 turned that off) and slash recovery
didn't exist (Section 4 added it). PR #488 makes the path hot.
Adds invalid_block_hash to the seed at the source. Replay determinism
is preserved: invalid_block_hash is part of the SlashDeploy struct and
persists in the block body, so validators re-running historical
slashes reconstruct the same rng state. Updates the three call sites
(prepare_slashing_deploys, the recovered-slash loop in
block_creator::create, and replay_system_deploy_internal in
replay_runtime).
Adds two inline regression tests:
- slash_seed_differs_per_invalid_block_hash: two distinct equivocators
in the same block-and-proposer context produce distinct rng seeds.
- slash_seed_is_deterministic_for_same_inputs: same inputs always
produce the same seed (replay determinism).
TDD red-green confirmed: removing invalid_block_hash from the seed
input fails the differ-per-hash test.
Co-Authored-By: Claude <noreply@anthropic.com>
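A sketch of the seed derivation. Only the participating inputs are taken
from the commit; the hash function and byte layout here (std's
`DefaultHasher`, an illustrative prefix value) are placeholders for
whatever the real generate_slash_deploy_random_seed uses:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const SYSTEM_DEPLOY_PREFIX: u8 = 0x01; // illustrative value

fn generate_slash_deploy_random_seed(
    validator_pk: &[u8],
    seq_num: u64,
    invalid_block_hash: &[u8],
) -> u64 {
    let mut h = DefaultHasher::new();
    SYSTEM_DEPLOY_PREFIX.hash(&mut h);
    validator_pk.hash(&mut h);
    seq_num.hash(&mut h);
    // The new input: two slashes in the same block (same pk, same
    // seq_num) now derive distinct rng seeds, so their unforgeable
    // channel names no longer alias. All inputs persist in the block
    // body, so replay reconstructs the same seed.
    invalid_block_hash.hash(&mut h);
    h.finish()
}
```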
* fix(resolver): bubble is_in_main_chain errors instead of silently fudging
The canonical-descendant invalidation rule called
dag.is_in_main_chain(...).unwrap_or(false) at two sites in
finalize_sig_state. A transient LMDB read failure would silently
collapse to "rejection is NOT a canonical descendant" → invalidation
rule does NOT fire → state stays Finalized. Wrong direction
(false-positive Finalized) and silent.
The state is consensus-relevant: the repeat_deploy validator
(validate.rs:347) reads it via the rejected_in_scope exemption gated
on state != Finalized. Validator A's is_in_main_chain succeeds while
validator B's hits a transient I/O blip → A: state=Pending → exempt;
B: state=Finalized → no exempt → InvalidRepeatDeploy. Two validators
reach different verdicts on the same block.
Changes finalize_sig_state to return ApiErr<DeployFinalizationStatus>
and propagates the is_in_main_chain Result via ?. The two callers
(resolve, resolve_batch) already returned ApiErr and propagate the
new ?. Behavior on the happy path is unchanged — all 356 integration
+ 68 lib unit tests pass post-fix.
Matches the surrounding error-handling pattern in this file (every
other I/O call already propagates via ?).
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(proposer): include merge-rejected slashes in empty-block skip
The skip predicate ran before compute_parents_post_state, so a
heartbeat-disabled proposer (allow_empty_blocks=false, the production
default) with no user deploys and no own-detected slashes would skip
without seeing rejected slashes from the parent merge. Move the merge
above the skip and add !recovered_rejected_slashes.is_empty() to the
predicate.
Co-Authored-By: Claude <noreply@anthropic.com>
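The reordered predicate as a pure function (parameter names assumed):

```rust
// Runs after compute_parents_post_state so recovered rejected slashes
// can veto an empty-block skip.
fn should_skip_empty_block(
    allow_empty_blocks: bool,
    user_deploys_empty: bool,
    own_slashes_empty: bool,
    recovered_rejected_slashes_empty: bool, // new term in the predicate
) -> bool {
    !allow_empty_blocks
        && user_deploys_empty
        && own_slashes_empty
        && recovered_rejected_slashes_empty
}
```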
* docs(casper): document slash recovery and multi-slash blocks
Existing CONSENSUS_PROTOCOL.md slashing flow predates PR #488. Adds
multi-parent merge & recovery loop, multi-slash blocks (per-equivocator
seed), empty-block skip predicate, and source-file-map entries for
merging/rejected_slash.rs and the slashing module group.
Co-Authored-By: Claude <noreply@anthropic.com>
* chore: cargo fmt sweep across casper crate
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
…exemption (#497)

prepare_user_deploys exempts deploys in `rejected_in_scope` from the
in-scope filter so genuinely rejected deploys can be re-proposed. Without
a canonical-descendant gate, the exemption also fires when the rejection
sits in a non-canonical sibling while the deploy's effects are already in
canonical state — producing a recovery block that downstream validators
correctly flag as `InvalidRepeatDeploy`. On FTT=0 shards this triggers
mutual slashing.

Mirror the validator-side `repeat_deploy` gate at the proposer: resolve
the candidate sigs in batch and decline the exemption when status is
`Finalized`. Resolver failure → decline conservatively.

Tests:
- validator-side defense regression (already passes pre-fix)
- proposer-side gate (RED pre-fix, GREEN post-fix)
Summary

Periodic promotion of rust/staging → rust/dev. Three PRs merged into
staging since the last promotion: #489, #492, #497.

Test plan

- rust/staging head (squash-merge commits already passed CI on their
  respective PRs)
- rust/dev after merge

Co-Authored-By: Claude <noreply@anthropic.com>