… (#489)

* fix: handle ApprovedBlock in GenesisValidator for late-joiner recovery

When a genesis validator joins boot's connections after the UnapprovedBlock
broadcasts but before boot reaches required_signatures, it has nothing to
sign. Boot then sends ApprovedBlock to all peers including the late
validator, but GenesisValidator::handle had no ApprovedBlock arm, so the
message hit `_ => Ok(())` and was silently dropped. The validator stayed
in GenesisValidator state forever, logging "Casper engine present but
Casper not initialized yet."

Add a CasperMessage::ApprovedBlock arm that transitions to Initializing.
Initializing::init already proactively re-requests ApprovedBlock from
bootstrap (the comment at initializing.rs:239-242 anticipated this race),
and Initializing::handle accepts the response, validates it, and
transitions to Running — the same path a late-joining non-genesis node
already takes.

Reproduction (pre-fix): F1R3FLY_NODE_IMAGE=...:local pytest
test_consensus_safety.py::test_validator_failure_recovery --keep-running
hit failure on attempt 3 of a 20-attempt loop (~33% rate).

Verification (post-fix): the same loop ran 20/20 passes; the full
integration suite improved from 78 pass / 10 fail / 6 error to
89 pass / 5 fail / 0 error, eliminating all 9 startup-timeout failures
and the 6 cascading ws_shard fixture errors.

Co-Authored-By: Claude <noreply@anthropic.com>

* review: drop is_repeated guard in late-ApprovedBlock handler, add regression test

PR #489 review flagged that handle_approved_block_late shared
seen_candidates with handle_unapproved_block, conflating two unrelated
dedup concerns. If a validator failed to sign an UnapprovedBlock (no
transition out of GenesisValidator) and the same hash later arrived as
ApprovedBlock, the guard would block recovery.

Drop the guard. A successful transition_to_initializing replaces the
engine, so subsequent ApprovedBlock messages route to
Initializing::handle, not back here. Concurrent duplicates during the
brief transition window are serialized by the engine_cell write, and
Initializing::init's ApprovedBlockRequest is idempotent at bootstrap.

Add transitions_to_initializing_on_late_approved_block targeting the
GenesisValidator::handle(ApprovedBlock) path directly: send the message
with no prior UnapprovedBlock, assert that Initializing::init's
ApprovedBlockRequest reaches the transport layer.

Co-Authored-By: Claude <noreply@anthropic.com>

* test: serialize genesis-counter-incrementing tests to fix flaky assertion

approve_block_protocol_test asserts a delta on a process-global metrics
counter ("genesis") between a baseline read and a post-action read. That
counter is incremented from add_approval — anywhere a valid
UnapprovedBlock signature is processed. The approve-block tests are all
#[serial], but two other test files exercise the same code path without
the marker:

- genesis_validator_spec::respond_on_unapproved_block_messages_with_block_approval
  (sends UnapprovedBlock to a GenesisValidator → block_approver →
  add_approval)
- block_approver_protocol_test (calls unapproved_block_packet_handler
  directly across six tests)

Without serialization, those tests can run in parallel with an
approve_block_protocol_test in flight and corrupt its delta — the TODO
entry tracked this as a ~1-in-3 flake.

Mark all genesis_validator_spec and block_approver_protocol_test
#[tokio::test]s as #[serial] so they share serial_test's mutex with
approve_block_protocol_test. Verified across three consecutive full
casper test runs (343/343 each).

Drop the now-resolved TODO entry.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
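A minimal sketch of the handler change in the first commit above, with the
engine API reduced to stand-in types. `CasperMessage`'s payload shapes, the
handler signature, and the helper methods here are illustrative, not the
node's actual interfaces:

```rust
// Stand-in message type: only the variants relevant to the fix.
enum CasperMessage {
    UnapprovedBlock(Vec<u8>),
    ApprovedBlock(Vec<u8>),
    Other,
}

struct GenesisValidator;

impl GenesisValidator {
    fn handle(&self, msg: CasperMessage) -> Result<(), String> {
        match msg {
            CasperMessage::UnapprovedBlock(b) => self.handle_unapproved_block(b),
            // New arm: a late joiner that missed the UnapprovedBlock round
            // still makes progress when boot broadcasts the ApprovedBlock.
            // Initializing::init then re-requests the ApprovedBlock from
            // bootstrap, validates it, and transitions to Running.
            CasperMessage::ApprovedBlock(b) => self.transition_to_initializing(b),
            // Previously ApprovedBlock also fell through here and was
            // silently dropped, stranding the validator in this state.
            _ => Ok(()),
        }
    }

    fn handle_unapproved_block(&self, _block: Vec<u8>) -> Result<(), String> {
        Ok(())
    }

    fn transition_to_initializing(&self, _approved: Vec<u8>) -> Result<(), String> {
        Ok(())
    }
}
```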
….conf, eliminate kamon (#492)

* feat(node): self-contained binary — embed Rholang resources, defaults.conf, eliminate kamon

Bake the genesis-ceremony Rholang sources and the HOCON defaults into the
node binary at compile time so a node only needs --data-dir and ports to
run. The production CWD requirement (workspace tree at runtime), the
`DEFAULT_DIR` env var, and the kamon.conf parallel-config-file shim are
all gone.

Changes:

- Embed all 11 .rho/.rhox genesis resources via include_str! in a new
  casper/src/rust/genesis/contracts/embedded_rho.rs. Rewrite the call
  sites in standard_deploys.rs to use a small `embedded_source` helper.
  Delete CompiledRholangSource::apply / apply_with_env / load_source and
  CompiledRholangTemplate::load_template (the file-loading machinery that
  walked an 8-path search ladder relative to CWD). Production code no
  longer touches the filesystem for Rholang sources; missing-asset bugs
  become build errors.
- Embed defaults.conf via include_str! + hocon::HoconLoader::load_str.
  Drop the `default_dir: &Path` parameter from
  configuration::builder::build and the `DEFAULT_DIR` env var read in
  main.rs. Override semantics unchanged: --config-file <path> and the
  <data_dir>/rnode.conf auto-load still apply on top of the embedded
  baseline.
- Eliminate kamon.conf entirely. The InfluxDB and Zipkin reporters in
  node/src/rust/diagnostics/ are hand-rolled Rust — they only borrowed
  the JVM-Kamon HOCON schema for migration compatibility. Move the two
  fields the Rust code actually consumed (tick_interval and the InfluxDB
  endpoint) into NodeConf::Metrics under defaults.conf. Delete KamonConf,
  configuration/kamon.rs, kamon.conf, the load_kamon_config parsing path,
  and the KamonConf plumbing through main.rs / mod.rs / diagnostics.
- Delete unreferenced JVM logging artifacts: logback.xml,
  logging-template/* (never read by any Rust code).
- Test-side filesystem .rho loading is preserved via a new
  casper/tests/util/rholang/test_rho_loader.rs (load_test_rho), kept out
  of the production binary. The 20 test files that loaded .rho fixtures
  from disk are migrated.
- Dockerfile: drop the COPY steps that staged node/src/main/resources/,
  casper/src/main/resources/, and rholang/examples/ into the runtime
  image. The binary is now fully self-contained.

Verified:

- cargo build --release --tests --workspace: clean
- cargo test --release -p casper --test mod: 345 passed / 0 failed
- ./target/release/node run --standalone reaches isReady:true from /tmp
  with no DEFAULT_DIR env var and no workspace tree on disk

* docs(node): align README + Helm chart with self-contained binary

Tail-end documentation and config-template cleanup that the
self-contained-binary commit (d129086) implied but didn't touch.

- README.md: clarify that the built-in defaults.conf is embedded into the
  binary at compile time via include_str! (not a runtime file lookup), so
  operators understand that the on-disk file is the source of the embed
  rather than a runtime dependency.
- docs/node/README.md: update the config build pipeline description.
  Step 1 now describes `HoconLoader::new().load_str(EMBEDDED_DEFAULTS)`
  with a note that no `node/src/main/resources/` directory is required
  at runtime. Step 2 is reworded for accuracy.
- node/src/main/resources/defaults.conf: drop the stale `kamon-influxdb`
  reference from the metrics-section comment.
- Helm chart cleanup:
  * docker/helm/f1r3fly/configs/common/logback.xml: deleted (the
    JVM-Logback config was never read by the Rust binary; the enclosing
    `node/src/main/resources/logback.xml` was deleted in d129086, but
    the Helm chart still mounted a sibling copy).
  * docker/helm/f1r3fly/templates/statefulsets.yaml: drop the
    /var/lib/rnode/logback.xml subPath mount that pointed at the
    now-removed ConfigMap entry.
  * docker/helm/f1r3fly/templates/{deployable,observer}-rnode-configmaps.yaml:
    rewrite the Kamon-era comment in the embedded defaults.conf template;
    metrics endpoints (tick-interval, influxdb-endpoint) now live under
    the same `metrics` section in NodeConf.

No code changes; the binary was rebuilt to verify the embedded HOCON
still parses cleanly. The Justfile already used --config-file overrides
correctly (no DEFAULT_DIR env var, no CWD requirement); no recipe
changes needed.
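A sketch of the embedding pattern these commits describe. The resource
paths and constant names are illustrative (include_str! requires the
files to exist at build time, which is exactly why missing assets become
build errors); `hocon::HoconLoader` and `load_str` are the crate API the
commit names, used here under that assumption:

```rust
// Illustrative paths: the real embed lives in embedded_rho.rs and the
// configuration builder; the point is compile-time inclusion.
const EMBEDDED_DEFAULTS: &str = include_str!("../resources/defaults.conf");
const REGISTRY_RHO: &str = include_str!("../resources/Registry.rho");

// Parse the embedded HOCON baseline; --config-file and
// <data_dir>/rnode.conf overrides still apply on top of this.
fn load_embedded_defaults() -> Result<hocon::Hocon, hocon::Error> {
    Ok(hocon::HoconLoader::new()
        .load_str(EMBEDDED_DEFAULTS)?
        .hocon()?)
}
```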
* fix(merge): expand rejection to DAG descendants to prevent stale diffs
When conflict resolution rejected a deploy chain, diffs from descendant
blocks (computed against the rejected chain's post-state) were still
applied to the LCA base, producing internally inconsistent merged state.
Reproduced at code level via stale_diff_application_corrupts_merged_state.
Rejection expansion: after conflict resolution, walk DAG descendants of
rejected blocks within merge scope and reject affected branches whole.
Conservative-only — no event-log refinement, since event logs miss the
indirect dependencies that cause the bug.
Deploy de-duplication: preemptive dedup on (source_block_number desc,
source_block_hash byte-lex asc). Dormant until the rejected-deploy
recovery mechanism ships.
Foundations:
- source_block_hash and source_block_number on DeployChainIndex
- block_number threaded through BlockIndex::new and its callers
- ConflictSetMerger::merge split into resolve_conflicts +
compute_merged_state so DagMerger can interpose expansion
Also:
- Hand-rolled Hash impl on DeployChainIndex matching PartialEq (the
derived Hash covered all fields, violating the hash/eq contract)
- Removed now-dead hash_code and pre_state_hash fields
- KeyValueRejectedDeployBuffer skeleton (will be wired in a follow-up)
- Two pre-existing proof tests marked #[ignore]:
* concurrent_registry_inserts_should_not_conflict — assertion
contradicts multi-parent DAG semantics; awaits rewrite
* finalization_does_not_guarantee_canonical_state — flaky
precondition under the two-bridge merge setup
Co-Authored-By: Claude <noreply@anthropic.com>
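The hand-rolled `Hash` note above is the standard hash/eq-contract
pattern. A sketch under an assumed field split; which fields form
`DeployChainIndex`'s identity is not spelled out in the message, so the
choice below is illustrative:

```rust
use std::hash::{Hash, Hasher};

// Suppose equality is keyed on deploy signatures alone while the struct
// carries extra metadata. Deriving Hash would fold the metadata in and
// break the contract: a == b must imply hash(a) == hash(b).
struct DeployChainIndex {
    deploy_sigs: Vec<Vec<u8>>,   // identity (assumed)
    source_block_number: i64,    // metadata, not identity
    source_block_hash: Vec<u8>,  // metadata, not identity
}

impl PartialEq for DeployChainIndex {
    fn eq(&self, other: &Self) -> bool {
        self.deploy_sigs == other.deploy_sigs
    }
}
impl Eq for DeployChainIndex {}

// Hand-rolled Hash restricted to exactly the fields PartialEq compares.
impl Hash for DeployChainIndex {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.deploy_sigs.hash(state);
    }
}
```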
* fix(merge): recover rejected deploys via per-node buffer
When the merge algorithm drops a deploy from the canonical merged state,
its data is now placed in a new RejectedDeployBuffer so the block creator
can re-propose it in a subsequent block. Previously rejected deploys were
silently lost even though their effects never made it into canonical state.
Buffer: KeyValueRejectedDeployBuffer mirrors KeyValueDeployStorage in
shape and LMDB backing (new "rejected_deploy_buffer" store registered in
RNodeKeyValueStoreManager; shares deploy_storage sizing).
Merge-time populate: dag_merger::merge now returns (sig, source_block_hash)
pairs. compute_parents_post_state groups by source block, fetches each
block once, extracts the Signed<DeployData>, and inserts into the buffer.
Scope awareness: CasperSnapshot carries a new rejected_in_scope DashSet,
populated alongside deploys_in_scope during the ancestor BFS. The cache
key covers both sets under one (generation, LFB) tuple. A lightweight
rejected_deploy_sigs decoder on KeyValueBlockStore returns the sig list
without decoding the full block body.
Re-inclusion filter: prepare_user_deploys unions DeployStorage with
RejectedDeployBuffer and re-includes any valid deploy that is both in
deploys_in_scope and rejected_in_scope — its effects never landed, so
proposing it again is correct.
Finalization cleanup: record_directly_finalized purges from both pools.
Sigs in body.deploys of a finalized block are removed from both storage
and buffer; sigs in body.rejected_deploys of a finalized block are also
removed from the buffer (definitively lost, not recoverable from here).
Co-Authored-By: Claude <noreply@anthropic.com>
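A compact sketch of the re-inclusion rule above, with the deploy pools
reduced to sig sets (the real prepare_user_deploys operates on full
Signed<DeployData> values and applies further validity filters):

```rust
use std::collections::HashSet;

type Sig = Vec<u8>;

// A deploy is proposable again only when its sig is both in scope (an
// ancestor carried it) and marked rejected in scope (a descendant merge
// dropped its effects): its effects never landed, so re-proposal is safe.
fn prepare_user_deploys(
    storage_sigs: &HashSet<Sig>,
    buffer_sigs: &HashSet<Sig>,
    deploys_in_scope: &HashSet<Sig>,
    rejected_in_scope: &HashSet<Sig>,
) -> HashSet<Sig> {
    storage_sigs
        .union(buffer_sigs)
        .filter(|sig| {
            let in_scope = deploys_in_scope.contains(*sig);
            let rejected = rejected_in_scope.contains(*sig);
            // Fresh deploys (never in scope) pass; in-scope deploys pass
            // only via the rejection exemption.
            !in_scope || rejected
        })
        .cloned()
        .collect()
}
```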
* fix(dag): restore invalid-block latest-message update and bonded-validator justifications
Four divergences from the source-of-truth Scala implementation had
disabled slashing visibility in the Rust node:
- new_latest_messages gated on !invalid, so equivocation blocks never
became a sender's latest message.
- The sender-advance branch gated on !invalid for the same reason.
- Block-creator justifications used valid_latest_metas (filtered),
excluding equivocators from the justification set and causing
justification_follows to reject otherwise-valid blocks.
- max_seq_nums used the filtered set too, omitting equivocators' sequence
numbers downstream.
With these restored, invalid_latest_messages fires as intended,
prepare_slashing_deploys issues slashes for equivocators, and the
pre-existing multi_parent_casper_should_succeed_at_slashing test passes.
Flips dag_storage_should_not_replace_latest_message_with_invalid_block_from_same_sender
to dag_storage_should_advance_latest_message_to_invalid_block_from_same_sender
with inverted assertions reflecting the corrected behavior.
Co-Authored-By: Claude <noreply@anthropic.com>
* feat(merge): recover merge-rejected slashes via block-creator dedup
When the merge rejects a deploy chain that contains a slash, the slash
effect is silently lost to cost-optimal rejection — SYS_SLASH_DEPLOY_COST
is 0 so any conflicting chain with cost >0 wins, and the equivocator
remains bonded. Attackers can sustain cheap conflicts to starve slashing
indefinitely.
The fix surfaces the rejected slash metadata from the merge step and has
the block creator re-issue any slash not already covered by its own
invalid_latest_messages view. The slash then lands in the merge block's
own body.system_deploys, bypassing cost-optimal rejection on the parents.
The merge pipeline stays pure — no runtime threading, no new validation
surface. Slash re-issuance flows through the existing SlashDeploy
execution path, so determinism invariants are unchanged.
- dag_merger::merge now returns (state, rejected_user_pairs, rejected_slash_pairs),
splitting rejected pairs by is_slash_deploy_id. Close-block and heartbeat
system deploys remain intentionally dropped.
- compute_parents_post_state extracts RejectedSlash metadata by reading
each distinct source block's body.system_deploys once. All slashes within
a block share a synthetic sig, so one rejected chain represents every
slash in the source block — iterating body.system_deploys produces the
right recovery set.
- New casper/src/rust/merging/rejected_slash.rs defines RejectedSlash and
filter_recoverable, with the dedup key being
(invalid_block_hash, issuer_public_key). Unit tests cover: own-slash
covers merge-rejected duplicate (Attack 6), merge-rejected survives when
uncovered by own (Attack 1), mixed coverage with multiple equivocators
(Attack 4), issuer discrimination on same equivocator (Attack 7), and
empty-input regression guard.
- block_creator::create calls compute_parents_post_state once before
system-deploy construction to surface the rejected slashes, dedups
against own slashing_deploys, and appends non-duplicates as fresh
SlashDeploys signed under the proposer's identity. The downstream
compute_deploys_checkpoint call hits the parents-post-state cache so
the merge is not re-run.
- ParentsPostStateCacheVal extended to (StateHash, Vec<Bytes>, Vec<RejectedSlash>)
so cache hits return the full 3-tuple.
- Regression assertion in bridge_query_survives_multi_parent_merge
confirms non-slash merges surface an empty rejected_slashes list.
Co-Authored-By: Claude <noreply@anthropic.com>
* feat(api): deploy_finalization_status query by deploy sig
Adds a canonical-state finalization status API for deploys, replacing
block-hash polling. After the merge fix, a block can finalize while some
of its deploys' effects were dropped by merge rejection — polling by
block hash returns true even though canonical state disagrees. Polling
by deploy sig via this API correctly reports the effect's presence in
canonical state.
States follow the design decision:
- Finalized — sig in a finalized block's body.deploys with is_failed=false,
and not in any finalized descendant's body.rejected_deploys
- Failed — sig in a finalized block with is_failed=true (explicit
runtime failure)
- Pending — sig alive: in deploy storage, in a non-finalized block, in
the rejected-deploy buffer awaiting re-proposal, or rejected after
finalization and awaiting canonical recovery
- Expired — valid_after_block_number + deployLifespan elapsed without
canonical inclusion
Response carries `state`, `rejection_count`, and `latest_block_hash`
(optional).
Architecture: single-pass canonical-chain walk from LFB backward for
deployLifespan blocks. For each block: check body.deploys for a clean or
failed match, check body.rejected_deploys for a sig match. Track the
highest-height observation for `latest_block_hash`, count rejection
occurrences for `rejection_count`, and resolve the terminal state from
the observations. Uses the lightweight rejected_deploy_sigs decoder to
avoid full body decode on the rejection-check arm.
Defensive error handling:
- Storage errors during first-seen block fetch → propagated as API error
- Missing block body when sig is indexed → warn log + Pending_unknown
- Sig indexed but absent from body.deploys → API error (state inconsistency)
- LFB with no block_number entry → API error (invariant violation)
- Blocks missing from store during scan → warn log + continue (scan
robustness over hard failure; result may be incomplete)
Trait addition: `Casper::casper_shard_conf() -> &CasperShardConf` to give
BlockAPI access to deployLifespan. Impls added on MultiParentCasperImpl
and both NoOpsCasperEffect test stubs.
gRPC surface:
- DeployServiceCommon.proto: DeployFinalizationStatusQuery message,
DeployFinalizationStateProto enum, DeployFinalizationStatusInfo message
(with optional latestBlockHash for explicit absent/present)
- DeployServiceV1.proto: rpc deployFinalizationStatus +
DeployFinalizationStatusResponse
- node/src/rust/api/deploy_grpc_service_v1.rs: server handler delegating
to BlockAPI
HTTP surface:
- node/src/rust/api/web_api.rs: WebApi trait method +
DeployFinalizationStatusJson with Option<String> for latest_block_hash
so JSON serializes null when absent
- node/src/rust/web/web_api_routes.rs: GET
/api/deploy-finalization-status/{deploy_sig_hex}
Tests:
- casper lib tests (2): state enum construction, state distinctness
- casper integration smoke test (1): unknown_sig_returns_pending_with_empty_fields
exercises the full EngineCell → BlockAPI path
Performance: zero background cost; O(deployLifespan) block-sig reads per
query, dominated by proto decode on the lightweight rejected_deploy_sigs
decoder. Sub-millisecond for typical lifespans.
Consensus safety: read-only API, no new attack surface, no new storage,
no new trait methods beyond the shard_conf getter.
Deep end-to-end tests (Finalized, Failed, Expired, nonzero rejection
count) require real equivocation + merge-rejection fixtures and are
deferred.
Co-Authored-By: Claude <noreply@anthropic.com>
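The state space and response shape, transcribed from the commit text into
a sketch (exact proto/JSON representations are assumed):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DeployFinalizationState {
    Finalized, // clean finalized inclusion, no canonical-descendant rejection
    Failed,    // finalized with is_failed = true (explicit runtime failure)
    Pending,   // alive: storage, non-finalized block, buffer, or awaiting recovery
    Expired,   // valid_after_block_number + deploy_lifespan elapsed uncommitted
}

struct DeployFinalizationStatus {
    state: DeployFinalizationState,
    rejection_count: u32,
    latest_block_hash: Option<String>, // serializes to JSON null when absent
}
```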
* feat(casper): gate rejected-deploy buffer population on finalization status
Catching-up validators replay historical blocks to get to the current
tip. For each block with non-empty body.rejected_deploys, the buffer-
population path extracts the rejected sigs' DeployData and adds them to
the local rejected-deploy buffer for re-proposal. Without a status
check, this admits sigs that have already been re-proposed and
finalized elsewhere in the chain, or sigs past their deployLifespan.
Two failure modes:
- Double-execution of already-finalized work. A rejected sig is added to
the local buffer; on the validator's next proposal round, the buffer
read includes the deploy; the new block contains the deploy; dedup
picks the new proposal over the older finalized copy within merge
scope; the merge produces a re-execution of canonical work against a
different pre-state. Effects diverge. Consensus forks.
- Past-lifespan noise. The buffer read filter drops past-lifespan sigs
at proposal time, but the entries still accumulate and churn through
storage.
Fix: before admitting each sig to the buffer, run the deploy
finalization status resolver. Admit only if the state is Pending. Skip
Finalized / Failed / Expired — those sigs are terminally resolved in
the local canonical view and must not be re-proposed.
The gate is unconditional — not "catchup mode" flagged. A live merge
that re-emits a canonically-finalized sig would be equally unsafe; the
same gate defends against both.
Implementation:
- Extracted BlockAPI::deploy_finalization_status's algorithm into a
pure function `deploy_finalization_status::resolve(dag, block_store,
deploy_lifespan, sig)`. The async BlockAPI method now reduces to a
thin wrapper that unwraps the engine cell and delegates. This makes
the resolver callable from compute_parents_post_state without
threading an EngineCell through the merge layer.
- Added should_admit_to_rejected_buffer helper in interpreter_util.rs
that calls resolve and applies the admit rule. Conservative
skip-on-error: transient storage failures skip the sig with a warn
log; consistency errors skip with a warn log. Never admit on error —
admit-on-error would reintroduce the double-execution bug under
flaky storage.
- Wired the helper into compute_parents_post_state's buffer-populate
block as a single predicate call, replacing the direct push.
Tests:
- Pure-resolver direct call: resolve_pure_function_returns_pending_for_unknown_sig
verifies the extracted function is callable from a non-engine-cell
context.
Deferred to later test work:
- Integration test exercising the gate-skips-finalized path (needs a
fixture that produces merge rejection AND later finalization of the
same sig — overlaps with equivocation + merge-rejection work).
- Full multi-node catchup simulation.
Consensus safety: the gate is a strict reduction of what enters the
buffer. Never adds sigs that weren't there; only drops sigs with a
terminal status in the current canonical view. Deterministic per
validator's DAG view.
Performance: O(deployLifespan) block reads per admit decision. For
typical rejection rates (0-3 per merge, lifespan ~50) this is sub-ms.
Full catchup of 1000 historical blocks with average 2 rejections each
adds ~100K block reads cumulatively — seconds of wall time.
Co-Authored-By: Claude <noreply@anthropic.com>
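A sketch of the admit rule, with the resolver reduced to a pre-computed
result and stderr standing in for the warn-level log. The conservative
branch is the load-bearing part: never admit on error, since admit-on-error
would reintroduce the double-execution bug under flaky storage:

```rust
#[derive(Debug)]
enum DeployFinalizationState { Finalized, Failed, Pending, Expired }

fn should_admit_to_rejected_buffer(
    status: Result<DeployFinalizationState, String>,
    sig_hex: &str,
) -> bool {
    match status {
        Ok(DeployFinalizationState::Pending) => true,
        Ok(state) => {
            // Finalized / Failed / Expired are terminal in the local
            // canonical view; re-proposing them is unsafe or pointless.
            eprintln!("skip {sig_hex}: terminal state {state:?}");
            false
        }
        Err(e) => {
            eprintln!("skip {sig_hex}: resolver error {e}; never admit on error");
            false
        }
    }
}
```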
* fix(api): walk finalized-ancestor BFS in deploy_finalization_status
The resolver walked `main_parent_chain` from LFB backward — a linear
walk that only visits a block's first (main) parent at each step. In
a multi-parent DAG, a deploy's effects can reach canonical state via
a secondary-parent merge; the main-parent chain alone misses those
blocks, so the sig is reported Pending even after it finalized.
Fix: BFS from LFB through every parent slot (main + secondary) bounded
by deploy_lifespan depth. `visited` dedups the frontier because
multi-parent ancestries share common ancestors.
Phase G's catchup gate uses the same resolver, so it inherits the fix
automatically.
Regression test: `resolve_finds_sig_in_secondary_parent_branch`
builds a minimal DAG (genesis → A, B siblings → C with A as main,
B as secondary) and places the deploy sig only in B. The test fails
with Pending on the main-parent walk and passes with Finalized on
the BFS, locking in the semantics.
Co-Authored-By: Claude <noreply@anthropic.com>
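A self-contained sketch of the corrected walk, with DAG access abstracted
to closures (the real resolver reads parents from block metadata):

```rust
use std::collections::{HashSet, VecDeque};

type BlockHash = [u8; 32];

// Breadth-first from the LFB through every parent slot (main + secondary),
// bounded by deploy_lifespan levels. The visited set dedups the frontier
// because multi-parent ancestries share common ancestors; a
// main-parent-only walk misses effects merged in via secondary parents.
fn bfs_finalized_window(
    lfb: BlockHash,
    deploy_lifespan: u64,
    parents_of: impl Fn(&BlockHash) -> Vec<BlockHash>,
    mut visit: impl FnMut(&BlockHash),
) {
    let mut visited: HashSet<BlockHash> = HashSet::new();
    let mut frontier: VecDeque<(BlockHash, u64)> = VecDeque::new();
    visited.insert(lfb);
    frontier.push_back((lfb, 0));

    while let Some((block, depth)) = frontier.pop_front() {
        visit(&block);
        if depth >= deploy_lifespan {
            continue; // lifespan bound: older blocks cannot matter
        }
        for parent in parents_of(&block) {
            // insert returns false for already-seen ancestors
            if visited.insert(parent) {
                frontier.push_back((parent, depth + 1));
            }
        }
    }
}
```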
* fix(validate): honor rejected_in_scope exemption in repeat_deploy
The repeat-deploy check rejected any block whose body.deploys contained
a sig already present in an ancestor's body.deploys. This predates the
rejected-deploy-buffer recovery pipeline (Phase D): when a deploy is
rejected by a descendant merge within deploy_lifespan, the buffer
re-proposes it in a later block — a legitimate re-inclusion, not a
repeat. Without this exemption, every recovery-path block fails
validation with InvalidRepeatDeploy, the proposer retries the same
deploys, and the shard deadlocks on heartbeat propose attempts under
any merge-rejection workload.
Fix: filter sigs present in s.rejected_in_scope out of the check set
before the BFS. CasperSnapshot already computes rejected_in_scope by
walking body.rejected_deploys in the current proposal's parent scope;
prepare_user_deploys uses the same signal on the proposer side. The
validator now mirrors the proposer.
Regression test: repeat_deploy_validation_allows_recovered_deploy_from_\
rejected_in_scope builds the exact DAG shape the existing
"should not accept" test uses, then pre-populates rejected_in_scope
with the deploy's sig. Pre-fix returns Invalid(InvalidRepeatDeploy);
post-fix returns Valid.
Co-Authored-By: Claude <noreply@anthropic.com>
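The exemption reduced to its set operation. Note that a later commit in
this series gates this exemption on finalization status, so treat this as
the intermediate form:

```rust
use std::collections::HashSet;

type Sig = Vec<u8>;

// Drop sigs that the current parent scope marks as rejected before
// running the ancestor repeat scan, mirroring prepare_user_deploys on
// the proposer side.
fn repeat_check_set(
    block_deploy_sigs: &HashSet<Sig>,
    rejected_in_scope: &HashSet<Sig>,
) -> HashSet<Sig> {
    block_deploy_sigs
        .difference(rejected_in_scope)
        .cloned()
        .collect()
}
```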
* fix(merge): recover dedup collateral via rejected-deploy buffer
When dag_merger's deploy de-duplication discards a chain because some
deploy in it has a fresher copy elsewhere, deploys unique to the
discarded chain were silently dropped — not added to the rejected-deploy
buffer, not in rejected_in_scope, and the deployer had no signal.
Collect collateral-lost deploys (those unique to a dropped chain) into
the rejected-user list so the buffer can recover them in a subsequent
block, mirroring how conflict-rejected deploys recover.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(api): require canonical-descendant rejection to invalidate clean inclusion
deploy_finalization_status::resolve was invalidating a clean finalized
inclusion if any rejection at a strictly higher height was observed. In
multi-parent DAGs, a rejection in a sibling block at the same or higher
height does not affect a deploy's effects in a canonical block on a
different chain. Recovery cycles via the rejected-deploy buffer can also
produce rejection events in non-canonical sibling blocks (validators
racing to recover the same deploy), and the height-only check turned
those into a positive feedback loop where the deploy stayed Pending
while the buffer kept re-proposing.
Track each rejection's block hash alongside its height and require the
rejection block to be a canonical-chain descendant of the clean block
(via is_in_main_chain) before invalidating. Same-block rejections (the
clean inclusion and rejection share a block — e.g., a recovery proposal
whose merge step also dedup-rejected an older copy in scope) are
excluded explicitly.
Co-Authored-By: Claude <noreply@anthropic.com>
* test(helper): initialize tracing subscriber in TestNode::create_network
casper/tests/mod.rs defines an init_logger() guarded by Once, but it had
no callers in the test tree. Production code with tracing::debug!/info!/
warn! calls produced no output during tests, making diagnostic logs
useless when investigating failures.
Wire init_logger() into TestNode::create_network so any test that builds
a network gets a tracing subscriber wired up with EnvFilter respecting
RUST_LOG. Behavior is unchanged when RUST_LOG is unset (default ERROR
level filter).
Co-Authored-By: Claude <noreply@anthropic.com>
* docs(dag): document invalid-block LMM safety argument
Address PR #488 review request to document why removing the `invalid`
guard from latest-message-map updates (commit 61b7394) is safe for
fork choice and finalization. The behavior change matched the Scala
source-of-truth (BlockDagKeyValueStorage.scala / MultiParentCasperImpl.scala)
but the safety argument was implicit; the reviewer asked for an explicit
explanation.
Three comments added:
- block_dag_key_value_storage.rs::insert::new_latest_messages — primary
safety-argument anchor at the storage site the reviewer flagged.
Covers the four-point argument (fork choice unaffected because parent
selection filters via valid_latest_msgs; slashing requires invalid
blocks in LMM; justification_follows requires every bonded validator;
pre-fix guard had no Scala counterpart and silently disabled
slashing).
- multi_parent_casper_impl.rs::create_block_data justifications block —
strengthened existing comment to explicitly cite parent selection's
filter at line ~160 as the reason fork choice is unaffected.
- multi_parent_casper_impl.rs max_seq_nums block — strengthened
comment to explain the equivocator-reset attack the unfiltered
read prevents (filtering would let an equivocator reset their
seq-number floor).
No behavior change. Casper test suite still 347/347.
Co-Authored-By: Claude <noreply@anthropic.com>
* perf(api): batch deploy_finalization_status resolver, tighten canonical-descendant invalidation
Address PR #488 review #4 (BFS-per-deploy performance) and a related
correctness gap surfaced while writing the regression test for the
refactor.
## Batched resolver (Review #4)
The catchup-heavy hot path in `compute_parents_post_state` previously
called `deploy_finalization_status::resolve` once per rejected deploy
sig. Each call did its own BFS over the finalized window, so a merge
with N rejections did N independent walks of the same M-block scope —
O(N · M). Reviewer named this as the catchup case "where this gate
matters most" and suggested batching.
Refactor lifts the per-sig BFS state into a `ResolverState` struct and
splits the resolver into shared helpers (`run_prelude`,
`bfs_finalized_window`, `finalize_sig_state`). New `resolve_batch(sigs)`
does a single BFS pass that updates per-sig state for every sig found
in body.deploys / body.rejected_deploys. Cost drops to O(M + N).
Existing single-sig `resolve(sig)` becomes a thin wrapper over the
shared helpers. Both entry points have identical error semantics:
prelude inconsistencies (sig indexed at a block that no longer claims
the sig in body.deploys) propagate as `Err` so corruption is surfaced
honestly rather than silently masquerading as `pending_unknown`. The
batch caller in `compute_parents_post_state` wraps the call in a
"skip on Err" fallback that admits nothing for the merge step, so
behavior at the catchup gate is unchanged when the corruption case
hits — but now it is loud rather than hidden.
Call site in `interpreter_util.rs::compute_parents_post_state`
replaces per-sig `should_admit_to_rejected_buffer` with one batched
`compute_rejected_buffer_admits` precompute, then dictionary lookups
during the per-block iteration. For a 50-rejected merge with M=200,
this is 200 block fetches instead of 10,000.
## Canonical-descendant invalidation gap (surfaced by parity test)
Writing a multi-parent parity test for the refactor exposed a
pre-existing gap between the resolver's intent and its implementation:
- Intent (per the inline comment): a rejection invalidates a clean
inclusion only when the rejection is on the canonical chain.
- Implementation: `is_in_main_chain(clean_block, reject_block)` —
walks reject_block's main-parent ancestry checking for clean_block.
This is necessary but not sufficient. A non-canonical sibling B'
with main parent A still passes this check, even though B' itself
is not on LFB's main-parent chain.
Concrete reproduction (now in
`resolve_and_resolve_batch_agree_across_states`): four-block DAG
genesis → A → {B, S} → C with C as LFB. B is canonical (main parent
of C), S is a non-canonical sibling (reachable only via C's
secondary-parent slot). Sig is in A.body.deploys (clean) and
S.body.rejected_deploys (sibling rejection). Pre-fix: resolver
reports Pending because is_in_main_chain(A, S) is true. Post-fix:
resolver reports Finalized because S itself is not on C's main-parent
chain.
Severity: false-negative, not unsafe. `Pending` for sigs that are in
canonical state. Polling clients keep polling unnecessarily; the
catchup gate admits already-canonical sigs to the buffer, where
dedup handles them harmlessly. But under PR #488's recovery
workload — competing recovery proposals on bridge contracts — this
produces user-visible "stuck Pending" behavior for finalized
deploys. Worth fixing alongside the batching work.
Fix: in `finalize_sig_state`, the canonical-descendant check now
also requires `is_in_main_chain(reject_block, lfb)` to be true.
One extra `is_in_main_chain` call per sig with both clean and
reject events. Implementation now matches intent.
## Tests
New `resolve_and_resolve_batch_agree_across_states` test in
`casper/tests/api/deploy_finalization_status_test.rs` builds a
production-shape multi-parent DAG covering five resolver branches
(clean via secondary parent, failed canonical, clean+canonical
rejection, clean+sibling rejection, unknown). Single-sig `resolve`
and batched `resolve_batch` results compared for parity on every
sig. Test failed before the canonical-descendant fix
(`clean_canonical_reject_sibling` case) and passes after. Casper
suite 348/348 (was 347/347 pre-refactor, +1 for the parity test).
Co-Authored-By: Claude <noreply@anthropic.com>
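A sketch of the batching shape. `ResolverState` and the window iterator
are simplified stand-ins for the real helpers (`run_prelude`,
`bfs_finalized_window`, `finalize_sig_state` handle error propagation and
the canonical-descendant rule, omitted here):

```rust
use std::collections::HashMap;

type Sig = Vec<u8>;

// Per-sig observations accumulated during the single shared BFS pass
// (field names assumed from the commit text).
#[derive(Default)]
struct ResolverState {
    clean_block: Option<(u64, Vec<u8>)>, // (height, hash) of clean inclusion
    reject_events: Vec<(u64, Vec<u8>)>,  // (height, hash) per rejection seen
}

// One walk of the M-block finalized window updates state for all N sigs:
// O(M + N) instead of N independent walks.
fn resolve_batch(
    sigs: &[Sig],
    // each item: (height, block_hash, body_deploys, body_rejected_deploys)
    window_blocks: impl Iterator<Item = (u64, Vec<u8>, Vec<Sig>, Vec<Sig>)>,
) -> HashMap<Sig, ResolverState> {
    let mut states: HashMap<Sig, ResolverState> = sigs
        .iter()
        .map(|s| (s.clone(), ResolverState::default()))
        .collect();
    for (height, block_hash, deploys, rejected) in window_blocks {
        for sig in &deploys {
            if let Some(st) = states.get_mut(sig) {
                // keep the highest observation for latest_block_hash
                if st.clean_block.as_ref().map_or(true, |(h, _)| height > *h) {
                    st.clean_block = Some((height, block_hash.clone()));
                }
            }
        }
        for sig in &rejected {
            if let Some(st) = states.get_mut(sig) {
                st.reject_events.push((height, block_hash.clone()));
            }
        }
    }
    states
}
```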
* fix(validate): gate repeat_deploy recovery exemption on Finalized status
Address remaining concern from PR #488 review #5
(`validate.rs:347`). Commit `3ee91fb5` fixed the functional bug
(legitimate recovery being blocked by `InvalidRepeatDeploy`) but left
a defense-in-depth gap that the reviewer's original prose flagged:
"if a deploy is in `rejected_in_scope` and appears in the block but is
NOT a legitimate re-proposal (e.g., a malicious validator re-includes
a deploy that was rejected for a valid reason), the repeat-deploy
check won't catch it."
Both the block-side filter (current) and the ancestor-side filter
(reviewer's suggested wording) operate on the same global
`rejected_in_scope` set, so they give identical results in every case
— including the double-execution scenario where a sig has BOTH a
clean canonical inclusion AND `rejected_in_scope` membership (e.g.,
because a sibling-fork rejection landed on the canonical chain via a
merge). Under either filter shape, such a sig is exempted from the
repeat check and the validator allows re-execution of an
already-finalized deploy.
The catchup gate (`should_admit_to_rejected_buffer`) is the primary
defense — it calls `deploy_finalization_status::resolve` before
admitting a sig to the rejected-deploy buffer and skips terminal
states. But the validator-side check is meant to be a second line of
defense for the case where the gate is bypassed (bug, race, colluding
proposer); under the current implementation it is missing for this
exact scenario.
## Fix
Gate the recovery exemption on the sig's current finalization status.
A sig in `rejected_in_scope` is exempted from the repeat check ONLY
when its status is NOT `Finalized`:
- `Pending` / `Failed` / `Expired`: no clean canonical inclusion;
re-inclusion is the only way to land effects in canonical state.
Exempt from check (recovery legitimate).
- `Finalized`: clean canonical inclusion that survived all
canonical-descendant rejections; effects ARE already canonical.
Re-inclusion is double-execution. Do NOT exempt; let the ancestor
scan flag the repeat.
- Resolver error: conservative-fail — keep the sig in the check set
so an inconsistency surfaces as `InvalidRepeatDeploy` rather than
being silently exempted.
## Tests
Two regression tests on the same code path:
- `repeat_deploy_validation_allows_recovered_deploy_from_rejected_in_scope`:
restructured to model TRUE recovery — the deploy lives only in a
non-canonical / non-finalized ancestor (status `Pending`), so the
exemption applies and the block validates. Passes both pre- and
post-fix.
- `repeat_deploy_blocks_double_execution_when_finalized_and_in_rejected_in_scope`:
new gap test. Same `rejected_in_scope` membership, but the sig
has a clean canonical inclusion in genesis (LFB), so status is
`Finalized`. Pre-fix: filter exempts the sig and validation
returns `Valid` (gap reproduced). Post-fix: filter declines the
exemption, ancestor scan finds the canonical inclusion, and
validation returns `InvalidRepeatDeploy` as it should.
Casper suite: 349/349 (was 348/348, +1 for the new gap test).
Co-Authored-By: Claude <noreply@anthropic.com>
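The tightened exemption as a pure decision table (the enum reuses the
status sketch from earlier; the real filter runs inside validate.rs with
the resolver called per candidate sig):

```rust
enum DeployFinalizationState { Finalized, Failed, Pending, Expired }

fn exempt_from_repeat_check(
    in_rejected_scope: bool,
    status: Result<DeployFinalizationState, String>,
) -> bool {
    match (in_rejected_scope, status) {
        // not in rejected_in_scope: no exemption to consider
        (false, _) => false,
        // effects already canonical: re-inclusion is double-execution
        (true, Ok(DeployFinalizationState::Finalized)) => false,
        // Pending / Failed / Expired: recovery is legitimate
        (true, Ok(_)) => true,
        // conservative-fail: keep the sig in the check set so an
        // inconsistency surfaces as InvalidRepeatDeploy
        (true, Err(_)) => false,
    }
}
```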
* fix(block_creator): exempt rejected_in_scope sigs from self-chain dedup
When the merge engine rejects a deploy that was originally signed into
the proposer's own self-chain, prepare_user_deploys correctly admits it
from the rejected-deploy buffer (its rejected_in_scope exemption), but
collect_self_chain_deploy_sigs immediately drops it again because the
sig appears in the proposer's prior block. Mirror the same exemption in
the self-chain dedup filter.
Adds an end-to-end recovery-cycle test using same-key vault depletion
as a deterministic conflict source: two deploys whose combined
precharge exceeds the source vault's balance, triggering
fold_rejection. The test is arranged so validator 0's prior block
contains the rejected sig and validator 0 is the recovery proposer —
the only configuration where the self-chain filter is load-bearing.
Verified 10/10 pass with fix, 10/10 fail without.
Removes finalization_does_not_guarantee_canonical_state — an ignored
test whose precondition assertion (rejected_deploys empty) is
contradicted by the multi-parent DAG design and is superseded by the
new recovery-cycle coverage.
Co-Authored-By: Claude <noreply@anthropic.com>
* docs(casper): document rejected-deploy buffer and rejected_in_scope exemption
Updates the CasperSnapshot struct definition in casper/README.md (it had
drifted: missing rejected_in_scope, stale collection types, swapped
invalid_blocks key/value).
Documents the rejected-deploy buffer recovery cycle in both
casper/README.md and casper/CONSENSUS_PROTOCOL.md:
- prepare_user_deploys pulls recovered sigs from
KeyValueRejectedDeployBuffer
- the in-scope dedup filter exempts sigs in rejected_in_scope so
those recovery candidates aren't immediately dropped
- the same exemption applies to collect_self_chain_deploy_sigs
- merge-engine fallback discards land in the buffer rather than
being silently dropped
Co-Authored-By: Claude <noreply@anthropic.com>
* test(merge): cover dedup orphan recovery path
Adds a code-level regression test for the rejected-deploy buffer's
dedup-orphan path: when `dag_merger::merge` drops a chain via the
freshness rule (block_number, byte-lex hash), any deploy unique to the
dropped chain lands in `collateral_lost_pairs` and is admitted to the
buffer alongside conflict-rejected sigs. Fixture builds two siblings
with the shared deploy_x and validator-unique markers V/W (event-log
linked through a shared channel), then asserts exactly one of {sig_V,
sig_W} reaches the buffer.
Removes the ignored `concurrent_registry_inserts_should_not_conflict`
test — its `rejected.is_empty()` precondition contradicts multi-parent
DAG semantics and is superseded by the recovery-cycle and dedup-orphan
coverage in batch2.
Co-Authored-By: Claude <noreply@anthropic.com>
* test(merge): cover slash recovery via multi-parent merge and re-issuance
Two tests in slash_recovery_spec.rs:
* slash_for_equivocator_survives_multi_parent_merge — end-to-end
through TestNode. Three validators; node 0 equivocates; nodes 1 and
2 each propose a SlashDeploy-bearing block; node 1 merges both as
parents. Asserts post-merge equivocator stake at the bond floor and
the merge proposer's stake unchanged.
* e1c_re_issues_merge_rejected_slash — focused on the re-issuance
loop in block_creator::create. A synthetic RejectedSlash is written
into the parents-post-state cache so the merge proposer's
compute_parents_post_state returns it as if the merge engine had
rejected a slash chain. Different issuer_public_key keeps it past
filter_recoverable. Asserts a SlashDeploy entry for the
equivocator's invalid_block lands in the proposed body — TDD-verified
by commenting out the loop.
Co-Authored-By: Claude <noreply@anthropic.com>
* test(merge): cover multi-validator buffer convergence dedup
Two validators independently buffer the same conflict-rejected sig
and each re-propose it in a recovery block alongside a
validator-unique marker deploy. The markers consume from a channel
the recovered sig produces on, putting [deploy_x, marker] in a single
event-log chain inside each block. Distinct marker sigs keep the two
chains' deploys_with_cost sets unequal so conflict_set_merger's
HashSet collapse cannot merge them; dag_merger::merge's
freshness-based dedup is then the sole mechanism that prevents
surfacing the shared sig as a conflict-rejected duplicate.
Asserts the shared sig stays out of rejected_user_deploys and exactly
one of {marker_v0, marker_v1} is orphaned (the loser's unique deploy).
TDD-verified: commenting out the dedup retain logic causes the orphan
assertion to fail.
Co-Authored-By: Claude <noreply@anthropic.com>
* chore: drop PR/review reference from test comment
Replaces a "PR #488 review #4" reference with the substantive
explanation already in the surrounding comment. PR numbers and review
thread positions are not stable across squash/rebase or PR re-creation,
so they don't belong in committed source.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(buffer): purge expired sigs from rejected-deploy buffer
prepare_user_deploys removes block-expired and time-expired sigs from
deploy_storage but not from KeyValueRejectedDeployBuffer. The read
path filters expired sigs out of `valid_unique` so they aren't
re-proposed, but on-disk LMDB entries persist. Combined with no
admission size cap on the buffer, a sustained-load adversary that
keeps generating conflicts can grow the buffer unbounded. Extend the
expired-removal sweep to call buffer.remove(expired_list) alongside
the storage cleanup.
Adds should_remove_expired_deploys_from_rejected_deploy_buffer to
exercise the regression-relevant case (sigs in buffer but not in
storage — the realistic state after a sig has been conflict-rejected
and the original storage entry has aged out via prior sweeps). TDD
red-green confirmed: commenting out the buffer.remove call fails the
test.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(merge): dedup rejected slashes by equivocator only
filter_recoverable previously keyed dedup on (invalid_block_hash,
issuer_public_key), but the proposer re-signs every E1c slash under
their own pk. Different-issuer slashes for the same equivocator
therefore all survived dedup, and the proposer emitted multiple
SlashDeploys for the same equivocator into the merge block body —
saved by PoS idempotency, but inflating block size and wasting
execution on redundant slashes.
Two surfaces of the same bug:
1. Proposer V3 has E in invalid_latest_messages (own-detected slash)
AND the merge engine surfaces a RejectedSlash for E from a
different original issuer V2. Today both land in body; after this
fix only the own-detected slash lands.
2. Multiple validators independently propose slash chains for the
same equivocator E and all chains are merge-rejected. Today every
rejected slash survives dedup; after this fix exactly one survives
per equivocator.
Drops the issuer_public_key out of the dedup key entirely (it's
provenance, not identity for dedup purposes), keys both sides on
invalid_block_hash, and sorts survivors deterministically for
body-hash determinism across replays.
Adds two tests to rejected_slash.rs covering each surface
(same_equivocator_across_issuers_dedups_to_one,
own_detection_drops_rejected_from_other_issuer); replaces the
previous dedup_key_discriminates_by_issuer test which endorsed the
buggy behavior. TDD red-green confirmed: disabling either filter
fails the corresponding tests.
Co-Authored-By: Claude <noreply@anthropic.com>
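A sketch of the corrected dedup. The struct shape follows the commit text;
the own-detected side is reduced to a set of invalid block hashes:

```rust
use std::collections::HashSet;

#[derive(Clone)]
struct RejectedSlash {
    invalid_block_hash: Vec<u8>,
    // provenance only: the proposer re-signs recovered slashes under its
    // own key, so the issuer must not participate in the dedup key
    issuer_public_key: Vec<u8>,
}

fn filter_recoverable(
    own_slashed_block_hashes: &HashSet<Vec<u8>>, // own-detected slashes
    merge_rejected: Vec<RejectedSlash>,
) -> Vec<RejectedSlash> {
    let mut seen: HashSet<Vec<u8>> = own_slashed_block_hashes.clone();
    let mut survivors: Vec<RejectedSlash> = merge_rejected
        .into_iter()
        // exactly one survivor per equivocation's invalid block
        .filter(|rs| seen.insert(rs.invalid_block_hash.clone()))
        .collect();
    // deterministic order for body-hash determinism across replays
    survivors.sort_by(|a, b| a.invalid_block_hash.cmp(&b.invalid_block_hash));
    survivors
}
```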
* fix(slash): exclude already-slashed validators via active_validators
prepare_slashing_deploys filtered invalid_latest_messages by
bonds_map.stake > 0 only. With bond floor = 0 (test default), an
already-slashed validator has stake = 0 and is filtered out — the
existing tests rely on this. With bond floor > 0 (production),
already-slashed validators retain stake at the floor, satisfy the
> 0 check, and the proposer emits a redundant SlashDeploy in every
block until the equivocator's invalid latest message ages out of the
DAG view. Saved by PoS slash idempotency, but inflates body size and
wastes execution.
OnChainCasperState::active_validators is the canonical
"validators eligible to participate" set, queried from PoS at the
parent post-state. PoS removes slashed validators from it regardless
of bond floor. synchrony_constraint_checker.rs already uses this
pattern. Adopt the same check here.
Extracts the filter into a private filter_slashable_invalid_messages
helper and adds three inline unit tests covering each branch of the
filter, including the regression-relevant case where stake > 0 but
the validator is no longer in active_validators. TDD red-green
confirmed: disabling the active_validators check fails the
already-slashed test while leaving the other two green.
Co-Authored-By: Claude <noreply@anthropic.com>
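A sketch of the extracted filter, with the PoS state reduced to plain maps
and sets:

```rust
use std::collections::{HashMap, HashSet};

type Validator = Vec<u8>;
type BlockHash = Vec<u8>;

// A validator is slashable only if it is still in PoS's active_validators
// set at the parent post-state. A stake > 0 check alone passes
// already-slashed validators whenever the bond floor is > 0.
fn filter_slashable_invalid_messages(
    invalid_latest_messages: &HashMap<Validator, BlockHash>,
    bonds_map: &HashMap<Validator, u64>,
    active_validators: &HashSet<Validator>,
) -> Vec<(Validator, BlockHash)> {
    invalid_latest_messages
        .iter()
        .filter(|(v, _)| {
            bonds_map.get(*v).copied().unwrap_or(0) > 0
                && active_validators.contains(*v) // excludes already-slashed
        })
        .map(|(v, h)| (v.clone(), h.clone()))
        .collect()
}
```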
* chore: cargo fmt all PR-touched files
Runs rustfmt on every file modified in this PR that drifted from the
formatter. Pure formatting — no code changes. Verified: full casper
suite (356 integration + 66 lib unit) and block-storage suite (48)
all pass post-format with zero warnings on cargo check.
Co-Authored-By: Claude <noreply@anthropic.com>
* chore: drop deferred-work reference from test header comment
The previous file header had a paragraph describing "deep end-to-end
coverage (multi-equivocation, cost-starvation simulation,
merge-rejection-then-recovery) ... tracked separately." That kind of
deferred-work pointer doesn't belong in committed source — there's no
durable target for "separately" to point at, and the language ages
poorly. The remaining sentence describes what the file covers, which
is the only thing the comment needs to say.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(slash): key SlashDeploy seed on invalid_block_hash
generate_slash_deploy_random_seed previously hashed only
(SYSTEM_DEPLOY_PREFIX || validator_pk || seq_num). Every SlashDeploy
in the same block — own-detected and recovered — received an identical
rng seed. The slash contract opens `new rl, poSCh, ...` whose
unforgeable channel names are derived from this rng, so two slashes
in the same block alias these channels in the tuplespace, and the
return-channel routing the system_deploy infrastructure uses to
extract each slash's PoS response keys on the same name. The author's
earlier comment ("completely sure that collision cannot happen")
assumed one slash per block — an assumption that held when the LMM
filtered invalid blocks (Section 3 turned that off) and slash recovery
didn't exist (Section 4 added it). PR #488 makes the path hot.
Adds invalid_block_hash to the seed at the source. Replay determinism
is preserved: invalid_block_hash is part of the SlashDeploy struct and
persists in the block body, so validators re-running historical
slashes reconstruct the same rng state. Updates the three call sites
(prepare_slashing_deploys, the recovered-slash loop in
block_creator::create, and replay_system_deploy_internal in
replay_runtime).
Adds two inline regression tests:
- slash_seed_differs_per_invalid_block_hash: two distinct equivocators
in the same block-and-proposer context produce distinct rng seeds.
- slash_seed_is_deterministic_for_same_inputs: same inputs always
produce the same seed (replay determinism).
TDD red-green confirmed: removing invalid_block_hash from the seed
input fails the differ-per-hash test.
Co-Authored-By: Claude <noreply@anthropic.com>
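A sketch of the seed derivation. Only the participating inputs are taken
from the commit; the hash function and byte layout here (std's
`DefaultHasher`, an illustrative prefix value) are placeholders for
whatever the real generate_slash_deploy_random_seed uses:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const SYSTEM_DEPLOY_PREFIX: u8 = 0x01; // illustrative value

fn generate_slash_deploy_random_seed(
    validator_pk: &[u8],
    seq_num: u64,
    invalid_block_hash: &[u8],
) -> u64 {
    let mut h = DefaultHasher::new();
    SYSTEM_DEPLOY_PREFIX.hash(&mut h);
    validator_pk.hash(&mut h);
    seq_num.hash(&mut h);
    // The new input: two slashes in the same block (same pk, same
    // seq_num) now derive distinct rng seeds, so their unforgeable
    // channel names no longer alias. All inputs persist in the block
    // body, so replay reconstructs the same seed.
    invalid_block_hash.hash(&mut h);
    h.finish()
}
```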
* fix(resolver): bubble is_in_main_chain errors instead of silently fudging
The canonical-descendant invalidation rule called
dag.is_in_main_chain(...).unwrap_or(false) at two sites in
finalize_sig_state. A transient LMDB read failure would silently
collapse to "rejection is NOT a canonical descendant" → invalidation
rule does NOT fire → state stays Finalized. Wrong direction
(false-positive Finalized) and silent.
The state is consensus-relevant: the repeat_deploy validator
(validate.rs:347) reads it via the rejected_in_scope exemption gated
on state != Finalized. Validator A's is_in_main_chain succeeds while
validator B's hits a transient I/O blip → A: state=Pending → exempt;
B: state=Finalized → no exempt → InvalidRepeatDeploy. Two validators
reach different verdicts on the same block.
Changes finalize_sig_state to return ApiErr<DeployFinalizationStatus>
and propagates the is_in_main_chain Result via ?. The two callers
(resolve, resolve_batch) already returned ApiErr and propagate the
new ?. Behavior on the happy path is unchanged — all 356 integration
+ 68 lib unit tests pass post-fix.
Matches the surrounding error-handling pattern in this file (every
other I/O call already propagates via ?).
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(proposer): include merge-rejected slashes in empty-block skip
The skip predicate ran before compute_parents_post_state, so a
heartbeat-disabled proposer (allow_empty_blocks=false, the production
default) with no user deploys and no own-detected slashes would skip
without seeing rejected slashes from the parent merge. Move the merge
above the skip and add !recovered_rejected_slashes.is_empty() to the
predicate.
Co-Authored-By: Claude <noreply@anthropic.com>
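The reordered predicate as a pure function (parameter names assumed):

```rust
// Runs after compute_parents_post_state so recovered rejected slashes
// can veto an empty-block skip.
fn should_skip_empty_block(
    allow_empty_blocks: bool,
    user_deploys_empty: bool,
    own_slashes_empty: bool,
    recovered_rejected_slashes_empty: bool, // new term in the predicate
) -> bool {
    !allow_empty_blocks
        && user_deploys_empty
        && own_slashes_empty
        && recovered_rejected_slashes_empty
}
```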
* docs(casper): document slash recovery and multi-slash blocks
Existing CONSENSUS_PROTOCOL.md slashing flow predates PR #488. Adds
multi-parent merge & recovery loop, multi-slash blocks (per-equivocator
seed), empty-block skip predicate, and source-file-map entries for
merging/rejected_slash.rs and the slashing module group.
Co-Authored-By: Claude <noreply@anthropic.com>
* chore: cargo fmt sweep across casper crate
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
…exemption (#497)

prepare_user_deploys exempts deploys in `rejected_in_scope` from the
in-scope filter so genuinely rejected deploys can be re-proposed. Without
a canonical-descendant gate, the exemption also fires when the rejection
sits in a non-canonical sibling while the deploy's effects are already in
canonical state — producing a recovery block that downstream validators
correctly flag as `InvalidRepeatDeploy`. On FTT=0 shards this triggers
mutual slashing.

Mirror the validator-side `repeat_deploy` gate at the proposer: resolve
the candidate sigs in batch and decline the exemption when status is
`Finalized`. Resolver failure → decline conservatively.

Tests:
- validator-side defense regression (already passes pre-fix)
- proposer-side gate (RED pre-fix, GREEN post-fix)
Summary

Periodic promotion of rust/staging → rust/dev. Three PRs merged into
staging since the last promotion: #489, #492, #497.

Test plan

- rust/staging head (squash-merge commits already passed CI on their
  respective PRs)
- rust/dev after merge

Co-Authored-By: Claude <noreply@anthropic.com>