DiLoCo: persist work-unit dispatch state across server restart (#105)#107
Merged
Conversation
The server is the authority for which dataset rows each worker has consumed (the worker keeps no dataset-progress state of its own, by design), but its work-queue state was never written to server_state.pt — so a server restart rebuilt empty queues and re-issued already-trained units within the epoch. The "intentionally not persisted (#46)" comments cited the wrong issue (#46 is unrelated, torchao) and described a decision that was never actually sanctioned: the queue was simply never persisted in the committed work-unit-dispatch feature. Persist the per-(dataset_id, shuffle_seed) issued/completed bitmaps + counters and the _dataset_lengths integrity snapshot into server_state.pt (keyed on disk by a "dataset_id|seed" string, bitmaps as bytes), and restore them in load_state. A worker re-registering its dataset reuses the restored queue, so issuance resumes at the next un-issued unit instead of unit 0. Cross-experiment safety is intrinsic: a changed dataset hashes to a new dataset_id -> fresh key -> stale queues sit inert; the 409 length-mismatch guard handles same-id/different-length; a hard reset is "restart from weights + purge". Also flush state on graceful shutdown: stop() now saves, and run() handles SIGTERM (how the scheduler stops a server job) in addition to SIGINT, so a clean stop doesn't lose units issued since the last autosave. Removes the incorrect #46 citations from server.py and the two docs; rewrites the work-unit-dispatch design-doc restart section and the diloco.md crash-recovery note to describe the persisted behavior as-is. Tests: rewrites the (now-inverted) persistence tests to assert round-trip + resume-at-next-unit + malformed-key resilience; full diloco suite green (317). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… review) The restore loop only wrapped key parsing in try/except, so a valid-key entry missing a field (total_units/issued) would KeyError and abort the whole load — bricking a restart over one bad entry — and a bitmap whose length disagreed with total_units would surface later as an IndexError during issuance. Wrap the full per-entry reconstruction and validate bitmap length against total_units; skip-and-warn on any bad entry so good queues in the same map still load. Adds a test injecting a missing-field entry and a length-inconsistent bitmap alongside a good queue, asserting the bad ones are skipped and the good one survives. Caught in the PR #107 review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #105.
Problem
The DiLoCo server is the authority for which dataset rows each worker has consumed — the worker keeps no dataset-progress state of its own, by design, because the server was supposed to track and persist it. But the server's work-queue state was never written to
server_state.pt, so a server restart rebuilt empty queues and re-issued already-trained units within the epoch.The
"intentionally not persisted (#46)"comments were doubly wrong: issue #46 is unrelated (torchao quantization), and the queue was never actually persisted in the committed work-unit-dispatch feature (#80) — it was an unfinished/untested piece with a fabricated rationale citation, not a sanctioned decision.Change
(dataset_id, shuffle_seed)issued/completedbitmaps + counters and the_dataset_lengthsintegrity snapshot inserver_state.pt(keyed on disk by a"dataset_id|seed"string; bitmaps asbytes). A worker re-registering its dataset reuses the restored queue (_handle_register_datasetalready reuses any existing queue for the key), so issuance resumes at the next un-issued unit, not unit 0.dataset_id→ fresh key → stale queues are never matched and sit inert; the 409 length-mismatch guard handles same-id/different-length; a hard reset is "restart from model weights + purgeoutput_dir". No expiry machinery needed.stop()now saves, andrun()handles SIGTERM (how the scheduler stops a server job) in addition to SIGINT — so a clean stop doesn't lose units issued since the last autosave.#46citations fromserver.pyand the two docs; rewrites the design-doc restart section and thediloco.mdcrash-recovery note to describe the persisted behavior as-is.Audit note
This came out of a full audit of what the server does/doesn't persist (requested on #105). Conclusion: the optimizer/model side (weights + outer-optimizer momentum +
_sync_round+ param ordering +known_workers) was already covered; the work-unit dispatch subsystem was the one correctness gap — everything else unpersisted is transient in-flight (re-synced by workers) or pure stats.Testing
dataset_lengths+ persisted-file keys, resume-at-next-unit after restart, and malformed-key resilience.tests/unit/ml/diloco/green (317).🤖 Generated with Claude Code