event-sourcing cleanup + snapshot-based catch-up #2011
Draft
AlexCheema wants to merge 1 commit into main
Joining a long-running cluster used to replay the entire event log 1000 events at a time over NACK round-trips, which took ~15 min for 1M events. New nodes now bootstrap from a master-served snapshot of State and only replay the small tail.

- Split the Event union into durable Event (state-modifying, persisted, ordered) and TransientEvent (per-request streaming/notification: ChunkGenerated, InputChunkReceived, TracesCollected, TracesMerged, TaskAcknowledged) routed over a separate TRANSIENT_EVENTS topic. Transients no longer touch the durable log or NACK machinery.
- Eliminate direct event reactions in event-log consumers. Worker drops the InstanceDeleted/CustomModelCard reactions in favour of reconciliation loops on state.instances and state.custom_model_cards. API drops the InstanceDeleted stream-close reaction in favour of a state.tasks x state.instances reconciliation. Promote custom_model_cards to State with proper apply handlers.
- Add SnapshotChunk wire type, RequestSnapshot command, and SNAPSHOT_RESPONSES topic. Master encodes its in-memory State on demand (zstd JSON), slices it into ~512 KiB chunks, and publishes them per request. Receiver verifies SHA-256 and reassembles.
- Worker/API request a snapshot at startup, apply it, fast-forward the EventRouter buffer to last_event_applied_idx + 1, then receive only the tail via NACK. Falls back to full replay on timeout.
- Tighter responsiveness: master inactivity timeout 30s -> 5s, plan tick 10s -> 1s, worker neighbour ping 10s -> 2s. macmon emits every 1s, so 5s gives 5x heartbeat headroom.
- Bug fixes found while building this: Worker.shutdown crashed if called during _fetch_snapshot before the task group entered; Pydantic v2 default-encodes bytes as UTF-8 strings, which dies on zstd output, so SnapshotChunk uses an explicit base64 str field with from_data/data helpers.
- Throughput: drop hardcoded 1000-event RequestEventLog cap (replaced with safety valve); tighten NACK base/cap from 0.5s..10s to 0.05s..1s (snapshots carry the bulk now).

Bench (3 hosts, 100K events on master):
  with snapshot:    ~5s consistent across all joiners (8KB transfer)
  without snapshot: still incomplete after 14 min (9-69% per host)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
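For readers who want a concrete picture of the snapshot bullet in the commit message above (encode, slice into chunks, verify SHA-256, reassemble), here is a minimal stdlib-only sketch. It is not the PR's code: `Chunk`, `encode_snapshot` and `reassemble` are hypothetical names, the state is a plain dict rather than the real `State` model, and `zlib` stands in for zstd because zstd is not assumed to be available in the standard library.

```python
import base64
import hashlib
import json
import zlib
from dataclasses import dataclass

CHUNK_SIZE = 512 * 1024  # ~512 KiB, the chunk size quoted in the commit message


@dataclass(frozen=True)
class Chunk:
    """Hypothetical wire record; the real SnapshotChunk may look different."""
    session_id: str
    index: int
    total: int
    sha256_hex: str  # checksum of the whole compressed payload
    data_b64: str    # base64 text, so compressed bytes survive JSON encoding


def encode_snapshot(state: dict, session_id: str, chunk_size: int = CHUNK_SIZE) -> list[Chunk]:
    # zlib is a stand-in here; the PR compresses with zstd.
    payload = zlib.compress(json.dumps(state, sort_keys=True).encode("utf-8"))
    digest = hashlib.sha256(payload).hexdigest()
    pieces = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]
    return [
        Chunk(session_id, i, len(pieces), digest, base64.b64encode(p).decode("ascii"))
        for i, p in enumerate(pieces)
    ]


def reassemble(chunks: list[Chunk]) -> dict:
    if not chunks:
        raise ValueError("no chunks received")
    ordered = sorted(chunks, key=lambda c: c.index)
    # Every index 0..total-1 must be present exactly once.
    if [c.index for c in ordered] != list(range(ordered[0].total)):
        raise ValueError("missing or duplicate chunks")
    payload = b"".join(base64.b64decode(c.data_b64) for c in ordered)
    # Verify the checksum before trusting the payload.
    if hashlib.sha256(payload).hexdigest() != ordered[0].sha256_hex:
        raise ValueError("checksum mismatch")
    return json.loads(zlib.decompress(payload))
```

Chunks may arrive in any order; `reassemble` sorts them by index and rejects an incomplete or corrupted set rather than decoding it.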
Summary
New nodes bootstrap from a master-served snapshot of `State` and replay only the small tail.

What changed
Stage 1 — split transient events from the durable log
- `TransientEvent` union: `ChunkGenerated | InputChunkReceived | TracesCollected | TracesMerged | TaskAcknowledged`.
- `SNAPSHOT_RESPONSES` and `TRANSIENT_EVENTS` topics; new `TransientRouter` for fire-and-forget pub/sub.
- […] `RunnerSupervisor` learn to demultiplex. Runner→supervisor mp_channel widened to `Event | TransientEvent`.

Stage 2 — reconciliation in place of event reactions
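A minimal sketch of what "reconciliation in place of event reactions" can look like, under stated assumptions: instead of reacting to a single `InstanceDeleted` event, the consumer periodically compares its local bookkeeping against the current state and drops whatever no longer exists. `reconcile_backoff` and its arguments are hypothetical illustrations, not the PR's `_reconcile_instance_backoff`.

```python
def reconcile_backoff(tracked: dict[str, float], live_instance_ids: set[str]) -> list[str]:
    """Drop backoff entries whose instance is gone from state; return the dropped keys."""
    # Copy the keys first so the dict can be mutated safely while we walk them.
    stale = [key for key in list(tracked) if key not in live_instance_ids]
    for key in stale:
        del tracked[key]
    return stale


backoff = {"inst-a": 0.5, "inst-b": 2.0}
dropped = reconcile_backoff(backoff, live_instance_ids={"inst-a"})
print(dropped)  # ['inst-b']
```

Because the loop only looks at current state, it gives the same result whether the deletion was observed as a live event or was already folded into a snapshot. That is one plausible reason this style pairs well with snapshot-based catch-up.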
- `custom_model_cards: Mapping[ModelId, ModelCard]` promoted to `State`; new apply handlers.
- Worker: `InstanceDeleted`/`CustomModelCard*` reactions replaced by `_reconcile_instance_backoff` and `_reconcile_custom_cards`.
- API: `InstanceDeleted` stream-close replaced by `_reconcile_streams` driven by `state.tasks × state.instances`.
- `KeyedBackoff.tracked_keys()` added to support safe iter-during-mutation.

Stages 3–5 — snapshot transfer protocol
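The catch-up step in this part of the PR (apply the snapshot, then fast-forward the event buffer to `last_event_applied_idx + 1`, as the commit message puts it) can be pictured with a small ordered buffer. This is a sketch under assumptions, not the PR's `OrderedBuffer`: the class name, `skip_to`, `push` and `drain` are hypothetical, and `skip_to` takes the next index the caller expects.

```python
class OrderedBufferSketch:
    """Releases events strictly in index order, holding back any that arrive early."""

    def __init__(self, start: int = 0) -> None:
        self.next_idx = start
        self._pending: dict[int, str] = {}

    def skip_to(self, next_idx: int) -> None:
        """Treat everything below next_idx as already applied, e.g. by a snapshot."""
        if next_idx <= self.next_idx:
            return
        self.next_idx = next_idx
        # Discard buffered events that the snapshot already covers.
        for stale in [i for i in self._pending if i < next_idx]:
            del self._pending[stale]

    def drain(self) -> list[str]:
        """Return the contiguous run of events starting at next_idx."""
        ready = []
        while self.next_idx in self._pending:
            ready.append(self._pending.pop(self.next_idx))
            self.next_idx += 1
        return ready

    def push(self, idx: int, event: str) -> list[str]:
        """Buffer one event (ignoring already-applied indices) and drain."""
        if idx >= self.next_idx:
            self._pending[idx] = event
        return self.drain()


buf = OrderedBufferSketch()
buf.push(135, "e135")          # arrived live before the snapshot: held back
buf.skip_to(134)               # snapshot covered everything up to idx 133
print(buf.push(134, "e134"))   # ['e134', 'e135']
```

In this sketch the gap at 134 is filled by a later `push`; in the PR that tail is described as arriving via NACK.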
- `SnapshotChunk` wire type with explicit `data_b64: str` (Pydantic v2 default-encodes `bytes` as UTF-8, which dies on zstd output).
- `RequestSnapshot` command. Master encodes `self.state` on demand (zstd-compressed `model_dump_json`), slices it into ~512 KiB chunks, and publishes them on `SNAPSHOT_RESPONSES` with `requester_node_id` filtering and SHA-256 verification.
- `SnapshotReceiver` reassembles chunks, ignores stale sessions / wrong recipients, validates checksum, decodes `State`.
- Worker/API issue `RequestSnapshot` at startup, apply the result to local state, then call `EventRouter.set_buffer_start(idx + 1)` so live events drain in order. Falls back to full event-log replay on timeout.
- `OrderedBuffer.fast_forward_to(idx)` discards pending events covered by the snapshot.

Stage 6 — throughput tuning
- […] `RequestEventLog` response (1000) kept for gossipsub burst protection; the comment now reflects that snapshots make this branch a fallback.

Liveness
- Master: `_plan` inactivity timeout 30s → 5s, tick 10s → 1s. Aligned with macmon's 1s emit cadence (5× heartbeat headroom).
- Worker: `_poll_connection_updates` 10s → 2s.

Bugs fixed along the way
- `Worker.shutdown()` crashed if called during `_fetch_snapshot` before the task group entered (the pre-Stage-5 ordering put the fetch outside `async with self._tg`). Snapshot fetch now runs as a child task inside the group.
- Pydantic v2's default `bytes`→JSON encoding broke zstd payloads. Replaced with an explicit base64 round-trip.
- `RunnerSupervisor`'s synthetic `ChunkGenerated` on runner crash now flows over the transient channel.

Bench
3 hosts (`james` master, `mike`/`s13`/`s14` joiners) over LAN libp2p, 100K events seeded on master:

- With snapshot: ~5 s, consistent across all joiners (8 KB transfer)
- Without snapshot (`EXO_DISABLE_SNAPSHOT_FETCH=1`): still incomplete after 14 min (9–69% per host)

At normal cluster scale (1500 events) the join time is also ~5 s, dominated entirely by gossipsub election convergence rather than event transport.

Drop/rejoin chaos test against the new 5s liveness window: master detects departures in 5–6 s consistently.
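The 5–6 s detection figure is roughly what a simple timing model predicts. The sketch below assumes (this is not confirmed by the PR) that the master checks for stale nodes once per plan tick and flags a node once its last heartbeat is at least `timeout` seconds old. `detection_delay` is a hypothetical helper for the arithmetic only.

```python
import math


def detection_delay(last_heartbeat: float, timeout: float = 5.0, tick: float = 1.0) -> float:
    """Seconds from the last heartbeat to the first tick that sees the node as stale.

    Assumes checks run at t = 0, tick, 2*tick, ... and flag a node once
    now - last_heartbeat >= timeout.
    """
    deadline = last_heartbeat + timeout
    first_check = math.ceil(deadline / tick) * tick
    return first_check - last_heartbeat


print(detection_delay(0.25))                             # 5.75
print(detection_delay(0.25, timeout=30.0, tick=10.0))    # 39.75
```

Under this model the delay always falls between `timeout` and `timeout + tick`: 5 to 6 s with the new settings, versus 30 to 40 s with the old 30s/10s values.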
Test plan
- `uv run basedpyright`: 0 errors
- `uv run ruff check`: clean
- `nix fmt`: applied
- `uv run pytest`: 405 pass / 1 skipped
- […] `idx 133` in 34 ms

Notes
- […] `State` in memory, so the master encodes on demand.
- […] `events.bin` files get rotated on master startup before any read; archives are write-only).
- […] `get_node_id_keypair` returns `Keypair.generate()` instead of persisting). Out of scope.

🤖 Generated with Claude Code