DiLoCo: work-queue roster grows unbounded across epochs (now persisted)

## Summary
Surfaced by the PR #107 review (work-unit dispatch persistence). The server's `_work_queues` is keyed by `(dataset_id, shuffle_seed)`, and the worker's effective `shuffle_seed = base_seed + epoch` — so **each epoch registers a new queue**, and queues are never pruned. With #107 they are now also persisted into every `server_state.pt`.

Over a long multi-epoch run this accretes one stale `(dataset_id, seed)` queue per completed epoch: it bloats `_work_queues` in memory, re-serializes on every `save_every_n_rounds` cadence (under `_work_queues_lock`, lengthening the hold), grows each checkpoint, and slows load on restart.

## Severity
Low. Each queue is small (default K=1024 → ~128 B issued + 128 B completed + a small `by_worker` dict + a few dataset-identity strings, well under 1 KB), so hundreds of epochs ≈ <1 MB. Not a remote DoS — registration is control-plane authenticated. But it's unbounded in principle and undesirable for the long runs this feature targets.

## Note
Stale queues are already *correct* (a different dataset hashes to a new `dataset_id` → inert; superseded seeds are simply never requested again) — this is purely a growth/hygiene concern, not correctness.

## Proposed options (pick one)
- **Prune superseded queues at save time**: for each `dataset_id`, keep only the highest-seed (current-epoch) queue plus any not-yet-exhausted lower-seed queues; drop fully-exhausted queues whose seed is below the max. A worker never goes backward an epoch, so this is safe.
- **Global cap / LRU** on persisted queue count with a `log()` of what was dropped.
- **Explicit retention knob** (e.g. keep last N epochs per dataset).

Same general hygiene concern as the `_known_workers` roster cap noted in the PR #106 review — could be addressed together.

### Relevant code
- `src/forgather/ml/diloco/server.py` — `_work_queues` (~689), `save_state` snapshot (~3040), `_handle_register_dataset` (~2331).
- `shuffle_seed = base_seed + epoch`: `composable_iterable_dataset.py` (`_effective_buffer_seed`).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DiLoCo: work-queue roster grows unbounded across epochs (now persisted) #108

Summary

Severity

Note

Proposed options (pick one)

Relevant code

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

DiLoCo: work-queue roster grows unbounded across epochs (now persisted) #108

Description

Summary

Severity

Note

Proposed options (pick one)

Relevant code

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions