Skip to content

DiLoCo: work-queue roster grows unbounded across epochs (now persisted) #108

@jdinalt

Description

@jdinalt

Summary

Surfaced by the PR #107 review (work-unit dispatch persistence). The server's _work_queues is keyed by (dataset_id, shuffle_seed), and the worker's effective shuffle_seed = base_seed + epoch — so each epoch registers a new queue, and queues are never pruned. With #107 they are now also persisted into every server_state.pt.

Over a long multi-epoch run this accretes one stale (dataset_id, seed) queue per completed epoch: it bloats _work_queues in memory, re-serializes on every save_every_n_rounds cadence (under _work_queues_lock, lengthening the hold), grows each checkpoint, and slows load on restart.

Severity

Low. Each queue is small (default K=1024 → ~128 B issued + 128 B completed + a small by_worker dict + a few dataset-identity strings, well under 1 KB), so hundreds of epochs ≈ <1 MB. Not a remote DoS — registration is control-plane authenticated. But it's unbounded in principle and undesirable for the long runs this feature targets.

Note

Stale queues are already correct (a different dataset hashes to a new dataset_id → inert; superseded seeds are simply never requested again) — this is purely a growth/hygiene concern, not correctness.

Proposed options (pick one)

  • Prune superseded queues at save time: for each dataset_id, keep only the highest-seed (current-epoch) queue plus any not-yet-exhausted lower-seed queues; drop fully-exhausted queues whose seed is below the max. A worker never goes backward an epoch, so this is safe.
  • Global cap / LRU on persisted queue count with a log() of what was dropped.
  • Explicit retention knob (e.g. keep last N epochs per dataset).

Same general hygiene concern as the _known_workers roster cap noted in the PR #106 review — could be addressed together.

Relevant code

  • src/forgather/ml/diloco/server.py_work_queues (~689), save_state snapshot (~3040), _handle_register_dataset (~2331).
  • shuffle_seed = base_seed + epoch: composable_iterable_dataset.py (_effective_buffer_seed).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions