
DiLoCo: persist work-unit dispatch state across server restart (#105) #107

Merged
jdinalt merged 2 commits into dev from feature/diloco-workqueue-persistence
May 31, 2026
@jdinalt jdinalt commented May 31, 2026

Owner

Closes #105.

Problem

The DiLoCo server is the authority for which dataset rows each worker has consumed — the worker keeps no dataset-progress state of its own, by design, because the server was supposed to track and persist it. But the server's work-queue state was never written to server_state.pt, so a server restart rebuilt empty queues and re-issued already-trained units within the epoch.

The "intentionally not persisted (#46)" comments were doubly wrong: issue #46 is unrelated (torchao quantization), and the queue was never actually persisted in the committed work-unit-dispatch feature (#80) — it was an unfinished/untested piece with a fabricated rationale citation, not a sanctioned decision.

Change

  • Persist + restore the per-(dataset_id, shuffle_seed) issued/completed bitmaps + counters and the _dataset_lengths integrity snapshot in server_state.pt (keyed on disk by a "dataset_id|seed" string; bitmaps as bytes). A worker re-registering its dataset reuses the restored queue (_handle_register_dataset already reuses any existing queue for the key), so issuance resumes at the next un-issued unit, not unit 0.
  • Cross-experiment safety (the original ghost-queue worry) is intrinsic: a changed dataset hashes to a new dataset_id → fresh key → stale queues are never matched and sit inert; the 409 length-mismatch guard handles same-id/different-length; a hard reset is "restart from model weights + purge output_dir". No expiry machinery is needed.
  • Flush on graceful shutdown: stop() now saves, and run() handles SIGTERM (how the scheduler stops a server job) in addition to SIGINT — so a clean stop doesn't lose units issued since the last autosave.
  • Removes the incorrect #46 citations from server.py and the two docs; rewrites the design-doc restart section and the diloco.md crash-recovery note to describe the persisted behavior as-is.
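
For readers who want a concrete picture of the on-disk shape described above, here is a minimal sketch, not the actual server.py code. `WorkQueue`, `dump_queues`, and `load_queues` are hypothetical names, the one-byte-per-unit bitmap layout is an assumption, and the counters are left out for brevity. The real server writes the resulting map into server_state.pt; the sketch stops at building and reading the plain dict.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class WorkQueue:
    """Hypothetical stand-in for the server's per-(dataset_id, seed) queue."""
    total_units: int
    issued: Optional[bytearray] = None
    completed: Optional[bytearray] = None

    def __post_init__(self):
        # Assumed layout: one byte per unit, 0 = not yet, 1 = done.
        if self.issued is None:
            self.issued = bytearray(self.total_units)
        if self.completed is None:
            self.completed = bytearray(self.total_units)


def encode_key(dataset_id: str, seed: int) -> str:
    # On-disk key is a "dataset_id|seed" string.
    return f"{dataset_id}|{seed}"


def decode_key(key: str) -> Tuple[str, int]:
    # Split on the last "|" so the seed is always the final field.
    dataset_id, seed = key.rsplit("|", 1)
    return dataset_id, int(seed)


def dump_queues(queues: Dict[Tuple[str, int], WorkQueue]) -> dict:
    # Bitmaps are stored as immutable bytes.
    return {
        encode_key(dataset_id, seed): {
            "total_units": q.total_units,
            "issued": bytes(q.issued),
            "completed": bytes(q.completed),
        }
        for (dataset_id, seed), q in queues.items()
    }


def load_queues(blob: dict) -> Dict[Tuple[str, int], WorkQueue]:
    # Inverse of dump_queues; no validation here (happy path only).
    return {
        decode_key(key): WorkQueue(
            entry["total_units"],
            bytearray(entry["issued"]),
            bytearray(entry["completed"]),
        )
        for key, entry in blob.items()
    }
```

Keying by a flat string keeps the persisted map free of tuple keys, and `bytes` bitmaps round-trip unchanged through any serializer that handles plain Python objects.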

Audit note

This came out of a full audit of what the server does/doesn't persist (requested on #105). Conclusion: the optimizer/model side (weights + outer-optimizer momentum + _sync_round + param ordering + known_workers) was already covered; the work-unit dispatch subsystem was the one correctness gap — everything else unpersisted is transient in-flight (re-synced by workers) or pure stats.

Testing

  • Rewrote the (now-inverted) persistence tests: round-trip of bitmaps/counters/dataset_lengths + persisted-file keys, resume-at-next-unit after restart, and malformed-key resilience.
  • Full tests/unit/ml/diloco/ green (317).
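
The resume-at-next-unit case can be pictured with a small model of the dispatcher. This is a sketch under stated assumptions, not the server's implementation: `Dispatcher`, `register`, and `issue` are hypothetical names, the bitmap is assumed to hold one byte per unit, and the integer return codes stand in for HTTP statuses.

```python
class Dispatcher:
    """Hypothetical model of queue reuse after a restore."""

    def __init__(self, queues=None, dataset_lengths=None):
        # Both maps would come from the restored server state.
        self.queues = dict(queues or {})
        self.dataset_lengths = dict(dataset_lengths or {})

    def register(self, dataset_id, seed, length, total_units):
        # Integrity guard: same dataset_id with a different length is rejected.
        known = self.dataset_lengths.setdefault(dataset_id, length)
        if known != length:
            return 409
        # Reuse an existing (possibly restored) queue; create one only if absent.
        self.queues.setdefault((dataset_id, seed), bytearray(total_units))
        return 200

    def issue(self, dataset_id, seed):
        issued = self.queues[(dataset_id, seed)]
        unit = issued.find(0)  # index of the first un-issued unit, or -1
        if unit == -1:
            return None  # every unit in this epoch has been issued
        issued[unit] = 1
        return unit


# A restart that restored a queue with units 0 and 1 already issued:
restored = Dispatcher(
    queues={("abc", 7): bytearray(b"\x01\x01\x00\x00")},
    dataset_lengths={"abc": 100},
)
status = restored.register("abc", 7, length=100, total_units=4)
next_unit = restored.issue("abc", 7)  # continues at unit 2, not unit 0
```

A changed dataset would arrive under a different dataset_id, so it gets a fresh queue while the restored one is never looked up.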

🤖 Generated with Claude Code

jdinalt and others added 2 commits May 31, 2026 03:49
The server is the authority for which dataset rows each worker has
consumed (the worker keeps no dataset-progress state of its own, by
design), but its work-queue state was never written to server_state.pt —
so a server restart rebuilt empty queues and re-issued already-trained
units within the epoch. The "intentionally not persisted (#46)" comments
cited the wrong issue (#46 is unrelated, torchao) and described a decision
that was never actually sanctioned: the queue was simply never persisted
in the committed work-unit-dispatch feature.

Persist the per-(dataset_id, shuffle_seed) issued/completed bitmaps +
counters and the _dataset_lengths integrity snapshot into server_state.pt
(keyed on disk by a "dataset_id|seed" string, bitmaps as bytes), and
restore them in load_state. A worker re-registering its dataset reuses the
restored queue, so issuance resumes at the next un-issued unit instead of
unit 0. Cross-experiment safety is intrinsic: a changed dataset hashes to a
new dataset_id -> fresh key -> stale queues sit inert; the 409
length-mismatch guard handles same-id/different-length; a hard reset is
"restart from weights + purge".

Also flush state on graceful shutdown: stop() now saves, and run() handles
SIGTERM (how the scheduler stops a server job) in addition to SIGINT, so a
clean stop doesn't lose units issued since the last autosave.
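
A minimal sketch of this shutdown path, assuming a blocking run loop built on a `threading.Event`; the class and method names are illustrative and `save_state` is a placeholder for the real write of server_state.pt. The handler only sets a flag, and the save happens on the main thread once the wait returns.

```python
import signal
import threading


class Server:
    """Hypothetical skeleton: flush state on SIGINT or SIGTERM."""

    def __init__(self):
        self._stop_event = threading.Event()
        self.saves = 0

    def save_state(self):
        # Placeholder for persisting server state to disk.
        self.saves += 1

    def _on_signal(self, signum, frame):
        # Keep the handler tiny: just request shutdown.
        self._stop_event.set()

    def stop(self):
        # A clean stop persists anything issued since the last autosave.
        self.save_state()
        self._stop_event.set()

    def run(self):
        # signal.signal must be called from the main thread.
        for sig in (signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, self._on_signal)
        self._stop_event.wait()
        self.stop()
```

Doing the save outside the signal handler avoids running file I/O in the middle of whatever the main thread was executing when the signal arrived.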

Removes the incorrect #46 citations from server.py and the two docs;
rewrites the work-unit-dispatch design-doc restart section and the
diloco.md crash-recovery note to describe the persisted behavior as-is.

Tests: rewrites the (now-inverted) persistence tests to assert round-trip +
resume-at-next-unit + malformed-key resilience; full diloco suite green
(317).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… review)

The restore loop only wrapped key parsing in try/except, so a valid-key
entry missing a field (total_units/issued) would KeyError and abort the
whole load — bricking a restart over one bad entry — and a bitmap whose
length disagreed with total_units would surface later as an IndexError
during issuance. Wrap the full per-entry reconstruction and validate
bitmap length against total_units; skip-and-warn on any bad entry so good
queues in the same map still load.
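
The skip-and-warn pattern described here might look roughly like the following. It is a sketch rather than the actual restore loop: `restore_queues` and the logger name are hypothetical, the restored value is a plain dict, and the bitmap is assumed to hold one byte per unit so its length can be compared directly with total_units.

```python
import logging

logger = logging.getLogger("diloco.restore")  # hypothetical logger name


def restore_queues(blob):
    """Rebuild queues from a persisted map, skipping any bad entry."""
    restored = {}
    for key, entry in blob.items():
        try:
            # The whole reconstruction sits inside the try block, not only
            # key parsing, so one bad entry cannot abort the load.
            dataset_id, seed = key.rsplit("|", 1)
            total = int(entry["total_units"])
            issued = bytearray(entry["issued"])
            completed = bytearray(entry["completed"])
            if len(issued) != total or len(completed) != total:
                raise ValueError("bitmap length disagrees with total_units")
            restored[(dataset_id, int(seed))] = {
                "total_units": total,
                "issued": issued,
                "completed": completed,
            }
        except (AttributeError, KeyError, TypeError, ValueError) as exc:
            logger.warning("skipping work-queue entry %r: %s", key, exc)
    return restored
```

Validating length at load time turns what would be a late IndexError during issuance into an early, logged skip.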

Adds a test injecting a missing-field entry and a length-inconsistent
bitmap alongside a good queue, asserting the bad ones are skipped and the
good one survives. Caught in the PR #107 review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jdinalt jdinalt merged commit 883842e into dev May 31, 2026
1 check passed
@jdinalt jdinalt deleted the feature/diloco-workqueue-persistence branch May 31, 2026 04:07