Summary
Surfaced by the PR #107 review (work-unit dispatch persistence). The server's _work_queues is keyed by (dataset_id, shuffle_seed), and the worker's effective shuffle_seed = base_seed + epoch — so each epoch registers a new queue, and queues are never pruned. With #107 they are now also persisted into every server_state.pt.
Over a long multi-epoch run this accretes one stale (dataset_id, seed) queue per completed epoch: it bloats _work_queues in memory, re-serializes on every save_every_n_rounds cadence (under _work_queues_lock, lengthening the hold), grows each checkpoint, and slows load on restart.
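The accretion can be reproduced with a minimal sketch. Here `work_queues` and `register_dataset` are hypothetical stand-ins for the server's `_work_queues` mapping and its registration handler, not the actual implementation; only the keying scheme and the `base_seed + epoch` rule come from the description above.

```python
# Hypothetical stand-in for the server's _work_queues mapping; the real
# queue objects are replaced by empty dicts here.
work_queues: dict[tuple[str, int], dict] = {}


def register_dataset(dataset_id: str, base_seed: int, epoch: int) -> None:
    """Register a queue under the worker's effective seed for this epoch."""
    shuffle_seed = base_seed + epoch  # effective seed changes every epoch
    work_queues.setdefault((dataset_id, shuffle_seed), {})


# Five epochs over the same dataset leave five distinct keys behind,
# because nothing ever removes the keys of earlier epochs.
for epoch in range(5):
    register_dataset("ds-a", base_seed=42, epoch=epoch)

print(sorted(work_queues))
```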
Severity
Low. Each queue is small (default K=1024 → ~128 B issued + 128 B completed + a small by_worker dict + a few dataset-identity strings, well under 1 KB), so hundreds of epochs ≈ <1 MB. Not a remote DoS — registration is control-plane authenticated. But it's unbounded in principle and undesirable for the long runs this feature targets.
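As a rough check of the estimate, the arithmetic can be written out. This assumes the 128 B figures correspond to one bit per work unit, treats the "well under 1 KB" figure as a 1 KiB ceiling per queue, and uses a hypothetical 500-epoch run; none of these values are measured.

```python
K = 1024  # default work units per queue, from the estimate above

# Assumption: issued/completed are tracked with one bit per work unit.
bitmap_bytes = K // 8

# Assumption: 1 KiB is a generous ceiling for one queue, including the
# by_worker dict and the dataset-identity strings.
per_queue_ceiling_bytes = 1024

epochs = 500  # hypothetical long run
total_bytes = epochs * per_queue_ceiling_bytes

print(bitmap_bytes, total_bytes)
```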
Note
Stale queues are already correct (a different dataset hashes to a new dataset_id → inert; superseded seeds are simply never requested again) — this is purely a growth/hygiene concern, not correctness.
Proposed options (pick one)
- Prune superseded queues at save time: for each dataset_id, keep only the highest-seed (current-epoch) queue plus any not-yet-exhausted lower-seed queues; drop fully-exhausted queues whose seed is below the max. A worker never goes backward an epoch, so this is safe.
- Global cap / LRU on persisted queue count with a log() of what was dropped.
- Explicit retention knob (e.g. keep last N epochs per dataset).
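The first option (prune at save time) could look roughly like the sketch below. `WorkQueue`, its `exhausted` property, and `prune_superseded` are hypothetical names introduced for illustration; the real queue type in server.py is not shown here and may track exhaustion differently. The sketch also assumes that, within one dataset_id, a higher seed always means a later epoch, which holds only if base_seed stays fixed for that dataset.

```python
from dataclasses import dataclass


@dataclass
class WorkQueue:
    """Hypothetical stand-in for a server work queue."""

    total_units: int
    completed_units: int

    @property
    def exhausted(self) -> bool:
        # Assumption: a queue is exhausted once every unit is completed.
        return self.completed_units >= self.total_units


QueueKey = tuple[str, int]  # (dataset_id, shuffle_seed)


def prune_superseded(queues: dict[QueueKey, WorkQueue]) -> list[QueueKey]:
    """Drop exhausted queues whose seed is below the max for their dataset.

    Mutates `queues` in place and returns the dropped keys, sorted, so the
    caller can log them.
    """
    # Highest seed seen per dataset_id (taken to be the current epoch).
    max_seed: dict[str, int] = {}
    for dataset_id, seed in queues:
        max_seed[dataset_id] = max(seed, max_seed.get(dataset_id, seed))

    dropped = sorted(
        key
        for key, queue in queues.items()
        if key[1] < max_seed[key[0]] and queue.exhausted
    )
    for key in dropped:
        del queues[key]
    return dropped


queues = {
    ("ds-a", 42): WorkQueue(1024, 1024),  # old epoch, exhausted: dropped
    ("ds-a", 43): WorkQueue(1024, 900),   # old epoch, still in progress: kept
    ("ds-a", 44): WorkQueue(1024, 10),    # current epoch: kept
    ("ds-b", 7): WorkQueue(1024, 1024),   # only queue for ds-b: kept
}
dropped = prune_superseded(queues)
print(dropped)
```

In the server this would presumably run on the snapshot taken under _work_queues_lock in save_state, so the returned keys can be logged after the lock is released.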
Same general hygiene concern as the _known_workers roster cap noted in the PR #106 review — could be addressed together.
Relevant code
src/forgather/ml/diloco/server.py — _work_queues (~689), save_state snapshot (~3040), _handle_register_dataset (~2331).
shuffle_seed = base_seed + epoch: composable_iterable_dataset.py (_effective_buffer_seed).