Open
Conversation
- Revert ReplayRSpace rig() to Scala dual-indexing: index each COMM under all its IOEvents (consume + all produces), not just the triggering one, so COMMs are findable from either side during replay - Change reducer eval_par term ordering to receives-first: consumes store continuations and register joins before produces try to match, preventing COMM_MATCH_FAIL cascades from missing continuations - Fix Produce equality in ReplayRSpace matches() and was_repeated_enough_times() to compare by hash only, matching Scala's Produce.equals() override — Rust-only fields (is_deterministic, output_value, failed) are not part of the hash and caused false negatives in comm.produces.contains() checks - Fix hot_store to_map() to include channels with only continuations (no data), which were previously excluded from the iteration - Remove broken workarounds that masked the root cause: deferred produces, fallback produce matching, put_join/remove_join simulation in locked_consume COMM path - Add per-bind peek support in eval_receive (BTreeSet<i32> peeks instead of single bool), matching Scala's per-channel peek semantics - Add comprehensive replay diagnostics gated behind tracing targets: COMM trigger-side mismatch detection, COST_TRACE_OP with channel hashes and eval nesting depth, COMM_MATCH_FAIL detailed state dumps, hot store per-channel mutation trace, checkpoint state hashing, validator produce/consume instrumentation - Update cost_accounting_spec expected values for persistent produce contracts affected by receives-first evaluation order - Update reduce_spec tests to use structural assertions instead of exact random_state byte comparison (bytes depend on eval ordering) - Add exploratory deploy API tests
The receives-first evaluation ordering in d9124e0 changed state hashes and introduced a regression where UnknownRootError during block replay was treated as an internal error instead of an invalid block. Bug fixes: - Treat UnknownRootError as invalid block (Right(None)) instead of internal error (Left) by adding InvalidPreStateHash replay failure variant — blocks referencing non-existent state roots are invalid, not errored, and retrying is pointless - Update hardcoded expected hash in runtime_spec to match receives-first evaluation ordering Test coverage for d9124e0 gaps: - Receives-first COMM ordering: validates COMM fires correctly when send and receive coexist in a single Par - Genesis pre-state hash replay: verifies replay_compute_state succeeds using the block's own pre_state_hash (not a hardcoded constant) - spawn_runtime_at consistency: confirms data written by a deploy is readable via spawn_runtime_at at the new state but not at the old - Mixed peek multi-bind: validates per-bind peek preserves peeked channel data while consuming non-peeked channel data - locally_free clearing: verifies COMM fires despite stale locally_free bits on substituted channels inside new blocks Infrastructure: - Add macOS LMDB semaphore cleanup to prevent ENOSPC after repeated test runs (atexit handler unlinks orphaned /MDB{r,w} semaphores)
- Remove unused Consume and IOEvent imports from replay_rspace_tests.rs - Remove dead merge_rand / par_merge_rand / split_rand variables left over from weakened random_state assertions in reduce_spec.rs - Gate ALLOCATOR_TRIM_TOTAL_METRIC import behind cfg(linux) to match its usage site in block_processor.rs
RUST_MIN_STACK only affects the main thread and std::thread::Builder threads. Tokio workers use the system default (2MB on Linux), which is insufficient for receives-first evaluation's deep COMM cascades (50+ levels). This caused the bootstrap node to crash with SIGSEGV (exit code 139) during CI integration tests.
…Phase 1) Replace Arc<Mutex<Cost>> with AtomicI64 + CAS loop for lock-free cost accounting. The old implementation had a TOCTOU bug: it locked, read, deducted, unlocked, re-locked, and checked — two concurrent branches could both deduct past zero between the two lock acquisitions. The CAS loop atomically checks remaining budget and deducts in a single compare_exchange_weak, eliminating the race window. This is a prerequisite for concurrent Par evaluation (Phase 6) where multiple futures charge costs simultaneously. Also normalizes consume-triggered COMM cost accounting to produce- triggered semantics in charging_rspace.rs, ensuring gas costs are deterministic regardless of which side fires a COMM. This makes the total cost order-independent (commutative) for concurrent evaluation. Phase 1 of 6: Maximally parallel RSpace via lock removal and interior mutability. Next: remove hot store outer Mutex (Phase 2).
…hase 2) The InMemHotStore wrapped its DashMaps in Arc<Mutex<HotStoreState>>, serializing ALL hot store operations through a single global lock. This defeated DashMap's per-shard read-write locks which already provide thread-safe concurrent access. Changes: - Remove Mutex<HotStoreState> wrapper from InMemHotStore - Remove Mutex<HistoryStoreCache> wrapper (also uses DashMaps) - Access DashMaps directly via self.state.data.get() etc. - Use DashMap::clear() for clear() instead of field reassignment - Update tests to access state directly without lock().unwrap() The HotStore trait already used &self (not &mut self), so no trait changes were needed. All 294 rspace++ tests pass. Phase 2 of 6: Maximally parallel RSpace via lock removal and interior mutability. Next: ISpace trait &self + interior mutability (Phase 3).
All 12 mutable methods in the ISpace trait now take &self instead of &mut self, enabling shared concurrent access without an outer Mutex. Interior mutability changes in RSpace and ReplayRSpace: - event_log: Vec<Event> → Arc<Mutex<Vec<Event>>> - produce_counter: BTreeMap → Arc<Mutex<BTreeMap>> - history_repository: Arc<Box<...>> → Arc<Mutex<Arc<Box<...>>>> - store: Arc<Box<...>> → Arc<Mutex<Arc<Box<...>>>> Added get_store() and get_history_repository() helper methods for lock-free read access to the inner Arc clones. Updated all callers across rspace++, rholang, and casper crates. All 294 rspace++ tests and 39 casper rholang tests pass. Phase 3 of 6: Maximally parallel RSpace via lock removal and interior mutability. Next: per-channel-group locks for joins (Phase 4).
Replace the global Mutex on the RSpace store with fine-grained per-channel-group locks. Each channel group (determined by the join pattern) gets its own Mutex, keyed by a sorted hash of the channels. Changes to RSpace and ReplayRSpace: - store: Arc<Mutex<Arc<Box<HotStore>>>> → Arc<RwLock<Arc<Box<HotStore>>>> (RwLock allows concurrent reads; write only during checkpoint/reset) - history_repository: same Mutex → RwLock change - New channel_locks: DashMap<u64, Arc<Mutex<()>>> for per-group locking - locked_produce: acquires per-group lock for each join group iteration - locked_consume: acquires per-group lock for the consume's channel set - Deadlock prevention via sorted channel hash ordering For channels WITHOUT joins (the vast majority — private unforgeable names used by single send/receive pairs), the per-group lock has minimal contention. Only multi-channel join patterns serialize. All 294 rspace++ tests and 20 casper runtime tests pass. Phase 4 of 6: Maximally parallel RSpace via lock removal and interior mutability. Next: remove interpreter-level space lock (Phase 5).
Remove the Arc<tokio::sync::Mutex<dyn ISpace>> wrapper from RhoISpace and RhoReplayISpace. Since ISpace methods now use &self (Phase 3) and per-channel-group locks handle concurrency (Phase 4), the global interpreter Mutex is redundant. Changes: - RhoISpace: Arc<Mutex<Box<dyn ISpace>>> → Arc<Box<dyn ISpace>> - Remove 19 .try_lock().unwrap() call sites across reduce.rs, rho_runtime.rs, contract_call.rs, interpreter.rs, and lib.rs - Access self.space methods directly via &self - Remove corresponding drop(space_locked) calls The ISpace is now accessed without ANY global lock. Concurrent produce/consume operations on independent channels no longer serialize. All 115 rholang tests and 20 casper runtime tests pass. Phase 5 of 6: Maximally parallel RSpace via lock removal and interior mutability. Next: FuturesUnordered in eval_inner (Phase 6).
… (Phase 6) Three changes that complete the concurrent evaluation architecture: 1. Content-hash candidate ordering: Replace thread_rng() shuffle in RSpace/ReplayRSpace with deterministic sorting by source.hash (Blake2b256Hash). Cryptographic hashes are uniformly distributed (fair, no starvation) while being deterministic (same data → same order). Required for consensus — random shuffle caused different COMMs on different runs. 2. FuturesUnordered in eval_inner: Replace sequential for-loop with concurrent FuturesUnordered. Receives are listed first in the terms vector to bias early continuation registration. Per-channel-group locks (Phase 4) ensure join atomicity. 3. Two-phase dispatch: Solve body evaluation interleaving that caused -644 COST_MISMATCH between RSpace (play) and ReplayRSpace (replay). Phase 1: produce/consume run concurrently, COMM body dispatch is DEFERRED to a shared queue. Phase 2: after all Phase 1 futures complete, deferred bodies dispatch sequentially. Nested eval calls create their own Phase 1/Phase 2 cycle. Also: - Fix non-deterministic DashMap iteration in replay_rspace produce lookup (use .get() instead of .clone().into_iter().find()) - Add with_isolated_runtime_manager for LMDB state isolation Results (full util::rholang suite, 5 consecutive runs): - Sequential for-loop: 3/3 pass (no concurrency) - FuturesUnordered (no defer): 2/5 pass (body interleaving → -644) - FuturesUnordered + 2-phase: 5/5 pass (concurrent + deterministic) Phase 6 of 6: Maximally parallel RSpace via lock removal and interior mutability. All phases complete.
Comprehensive design document covering the 6-phase refactor from globally-serialized to per-channel-parallel RSpace evaluation. Includes architectural diagrams, pseudocode, Scala equivalence mapping, sequence diagrams for the body interleaving problem and two-phase dispatch solution, and determinism guarantees.
…ent process bodies for parallel evaluation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR resolves COST_MISMATCH errors observed on observer nodes during block replay by implementing per-bind peek semantics, along with several supporting fixes for state hash consistency, exploratory deploy reliability, and replay correctness. It also transforms the RSpace from a globally-serialized tuple space into a concurrent architecture with per-channel parallelism (see docs/rspace/concurrent-rspace-architecture.md).
Changes
Core: Concurrent Par evaluation (rholang/src/rust/interpreter/reduce.rs)
Core: Concurrent RSpace (rspace++, rholang)
Core: Per-bind peek semantics (p_input_normalizer.rs, reduce.rs, rspace++)
Core: Genesis pre-state hash handling (block_approver_protocol.rs, initializing.rs, interpreter_util.rs)
Core: spawn_runtime_at() (runtime_manager.rs)
Core: Exploratory deploy fix (runtime.rs)
API: Trie stats endpoint (web_api.rs, shared_handlers.rs)
API: Block propose retry (block_api.rs)
Infrastructure: macOS LMDB semaphore cleanup (resources.rs)
Diagnostics
Test coverage