DiLoCo: persist work-unit dispatch state across server restart (#105) by jdinalt · Pull Request #107 · jdinalt/forgather

jdinalt · 2026-05-31T03:50:19Z

Closes #105.

Problem

The DiLoCo server is the authority for which dataset rows each worker has consumed — the worker keeps no dataset-progress state of its own, by design, because the server was supposed to track and persist it. But the server's work-queue state was never written to server_state.pt, so a server restart rebuilt empty queues and re-issued already-trained units within the epoch.

The "intentionally not persisted (#46)" comments were doubly wrong: issue #46 is unrelated (torchao quantization), and the queue was never actually persisted in the committed work-unit-dispatch feature (#80) — it was an unfinished/untested piece with a fabricated rationale citation, not a sanctioned decision.

Change

Persist + restore the per-(dataset_id, shuffle_seed) issued/completed bitmaps + counters and the _dataset_lengths integrity snapshot in server_state.pt (keyed on disk by a "dataset_id|seed" string; bitmaps as bytes). A worker re-registering its dataset reuses the restored queue (_handle_register_dataset already reuses any existing queue for the key), so issuance resumes at the next un-issued unit, not unit 0.
Cross-experiment safety is intrinsic (the original ghost-queue worry): a changed dataset hashes to a new dataset_id → fresh key → stale queues are never matched and sit inert; the 409 length-mismatch guard handles same-id/different-length; a hard reset is "restart from model weights + purge output_dir". No expiry machinery needed.
Flush on graceful shutdown: stop() now saves, and run() handles SIGTERM (how the scheduler stops a server job) in addition to SIGINT — so a clean stop doesn't lose units issued since the last autosave.
Removes the incorrect #46 citations from server.py and the two docs; rewrites the design-doc restart section and the diloco.md crash-recovery note to describe the persisted behavior as-is.

Audit note

This came out of a full audit of what the server does/doesn't persist (requested on #105). Conclusion: the optimizer/model side (weights + outer-optimizer momentum + _sync_round + param ordering + known_workers) was already covered; the work-unit dispatch subsystem was the one correctness gap — everything else unpersisted is transient in-flight (re-synced by workers) or pure stats.

Testing

Rewrote the (now-inverted) persistence tests: round-trip of bitmaps/counters/dataset_lengths + persisted-file keys, resume-at-next-unit after restart, and malformed-key resilience.
Full tests/unit/ml/diloco/ green (317).

🤖 Generated with Claude Code

The server is the authority for which dataset rows each worker has consumed (the worker keeps no dataset-progress state of its own, by design), but its work-queue state was never written to server_state.pt — so a server restart rebuilt empty queues and re-issued already-trained units within the epoch. The "intentionally not persisted (#46)" comments cited the wrong issue (#46 is unrelated, torchao) and described a decision that was never actually sanctioned: the queue was simply never persisted in the committed work-unit-dispatch feature. Persist the per-(dataset_id, shuffle_seed) issued/completed bitmaps + counters and the _dataset_lengths integrity snapshot into server_state.pt (keyed on disk by a "dataset_id|seed" string, bitmaps as bytes), and restore them in load_state. A worker re-registering its dataset reuses the restored queue, so issuance resumes at the next un-issued unit instead of unit 0. Cross-experiment safety is intrinsic: a changed dataset hashes to a new dataset_id -> fresh key -> stale queues sit inert; the 409 length-mismatch guard handles same-id/different-length; a hard reset is "restart from weights + purge". Also flush state on graceful shutdown: stop() now saves, and run() handles SIGTERM (how the scheduler stops a server job) in addition to SIGINT, so a clean stop doesn't lose units issued since the last autosave. Removes the incorrect #46 citations from server.py and the two docs; rewrites the work-unit-dispatch design-doc restart section and the diloco.md crash-recovery note to describe the persisted behavior as-is. Tests: rewrites the (now-inverted) persistence tests to assert round-trip + resume-at-next-unit + malformed-key resilience; full diloco suite green (317). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… review) The restore loop only wrapped key parsing in try/except, so a valid-key entry missing a field (total_units/issued) would KeyError and abort the whole load — bricking a restart over one bad entry — and a bitmap whose length disagreed with total_units would surface later as an IndexError during issuance. Wrap the full per-entry reconstruction and validate bitmap length against total_units; skip-and-warn on any bad entry so good queues in the same map still load. Adds a test injecting a missing-field entry and a length-inconsistent bitmap alongside a good queue, asserting the bad ones are skipped and the good one survives. Caught in the PR #107 review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jdinalt and others added 2 commits May 31, 2026 03:49

jdinalt mentioned this pull request May 31, 2026

DiLoCo: work-queue roster grows unbounded across epochs (now persisted) #108

Open

jdinalt merged commit 883842e into dev May 31, 2026
1 check passed

jdinalt deleted the feature/diloco-workqueue-persistence branch May 31, 2026 04:07

jdinalt mentioned this pull request May 31, 2026

DiLoCo: server forgets consumed work-units across a restart (work-queue state not persisted) #105

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DiLoCo: persist work-unit dispatch state across server restart (#105)#107

DiLoCo: persist work-unit dispatch state across server restart (#105)#107
jdinalt merged 2 commits into
devfrom
feature/diloco-workqueue-persistence

jdinalt commented May 31, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jdinalt commented May 31, 2026

Problem

Change

Audit note

Testing

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant