Skip to content

diloco: make worker settings server-authoritative via /info#99

Merged
jdinalt merged 2 commits into
devfrom
feature/diloco-server-authoritative-settings
May 29, 2026
Merged

diloco: make worker settings server-authoritative via /info#99
jdinalt merged 2 commits into
devfrom
feature/diloco-server-authoritative-settings

Conversation

@jdinalt
Copy link
Copy Markdown
Owner

@jdinalt jdinalt commented May 29, 2026

Summary

First of three sequential PRs collapsing DiLoCo client/server configuration duplication (see the others: unified meta construction, then the model-definition bundle).

A DiLoCo worker had to be hand-configured to match the server on sync_every, bf16_comm, dylu, and num_fragments — re-specified on the worker CLI, in the callback templates, and in the webui. These must match across the group for the sync barrier / outer step / fragment barriers to be coherent, so a divergent value is only ever a foot-gun. The server already advertises them via GET /info; this PR makes the worker consume them and removes the redundant client-side surface entirely (no override — it can be added back later if a concrete need appears).

Changes

Server (/info)

  • expected_client_settings is now complete and non-null: sync_every is advertised as dylu_base_sync_every whether or not DyLU is on, plus num_fragments_default and heartbeat_timeout.
  • Added settings_authority: "server" and a coarse model_hash (sorted param name/shape fingerprint) — the latter is scaffolding consumed by the upcoming model-bundle PR.

Worker

  • DiLoCoClient.get_info().
  • DiLoCoCallback drops the four must-match constructor args / DILOCO_* reads. on_train_begin fetches /info (leader-only, broadcast to DDP followers), applies the settings verbatim and logs them. A server too old to advertise sync_every is fatal (no silent default). heartbeat_interval stays client-local and is validated against the server's heartbeat_timeout.

Surface removal

  • forgather diloco worker drops --sync-every / --no-bf16 / --dylu / --num-fragments and stops emitting the matching DILOCO_* env vars.
  • Scheduler no longer forwards the four settings.
  • callbacks/diloco.yaml + mixins/diloco.yaml drop the four knobs.
  • SubmitModal removes the four editable inputs and shows the server's values read-only instead (stable layout).
  • docs/trainers/diloco.md updated to match. The low-level DiLoCoWorker programmatic API is unchanged.

Testing

  • Extended /info shape + model_hash determinism tests.
  • Callback adoption matrix + heartbeat-validation + too-old-server-is-fatal tests.
  • Scheduler no longer forwards the four settings.
  • Full tests/unit/ml/diloco/ + tests/unit/forgather_server/ green (974 passed). forgather ls clean. Webui builds clean.

🤖 Generated with Claude Code

jdinalt and others added 2 commits May 29, 2026 08:07
sync_every, bf16_comm, dylu and num_fragments must match across every
worker in a group for the sync barrier / outer step / fragment barriers
to be coherent. They were redundantly re-specified on the worker CLI, in
the callback templates, and in the webui — a foot-gun that let a worker
silently diverge from the group. The server already advertises them via
GET /info; now the worker actually consumes them and there is no client
override (a divergent value is never useful).

Server (/info):
- expected_client_settings is now complete and non-null: sync_every is
  advertised as dylu_base_sync_every whether or not DyLU is on, plus
  num_fragments_default and heartbeat_timeout. Added settings_authority
  and a coarse model_hash (param name/shape fingerprint, consumed by the
  upcoming model-bundle work).

Worker:
- DiLoCoClient.get_info().
- DiLoCoCallback drops the four must-match constructor args / DILOCO_*
  reads; on_train_begin fetches /info (leader-only, broadcast to DDP
  followers), applies the settings verbatim and logs them. A server that
  doesn't advertise sync_every (too old) is fatal, not silently
  defaulted. heartbeat_interval stays client-local and is validated
  against the server's heartbeat_timeout.

Surface removal:
- `forgather diloco worker` drops --sync-every / --no-bf16 / --dylu /
  --num-fragments and no longer emits the matching DILOCO_* env vars.
- scheduler no longer forwards the four settings.
- callbacks/diloco.yaml + mixins/diloco.yaml drop the four knobs.
- SubmitModal removes the four editable inputs and shows the server's
  values read-only instead (stable layout).
- docs/trainers/diloco.md updated to match.

Tests: extended /info shape + model_hash determinism; callback adoption
matrix + heartbeat validation; scheduler no longer forwards the four.
All DiLoCo + forgather_server suites green (974 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review of the previous commit (and the server modal) surfaced that three
of the four settings made server-authoritative had no server-side knob —
/info just hardcoded them. That made streaming-fragment mode and
full-precision comm unreachable, and forced sync_every through the
misnamed dylu_base_sync_every.

- DiLoCoServer gains sync_every / bf16_comm / num_fragments constructor
  args; /info advertises the real values (sync_every = dylu_base when
  DyLU is on, else the dedicated sync_every).
- `forgather diloco server` gains --sync-every / --num-fragments /
  --no-bf16. Threaded through the spawn chain
  (build_diloco_server_command -> launcher -> scheduler) and surfaced in
  the DiLoCoServerModal as a "Worker settings (group-wide)" section.
  bf16_comm stays centralized on the server (one wire precision for the
  whole group), per the design call.

- Fix DDP follower deadlock: the callback now broadcasts an error
  sentinel from the leader on /info failure so every rank raises
  together, instead of followers blocking forever in broadcast_object_list
  when the leader aborts (e.g. a too-old server with null sync_every).

- Docs/messages: diloco-architecture.md and diloco.md document the new
  server flags and stop pointing at the removed worker flags; the worker
  DyLU-under-pipeline error points at the server --dylu; fixed a
  misleading test_routes_diloco fixture.

Tests: server /info reflects configured group settings (+ DyLU-base
case); build_diloco_server_command emits the new flags. Full
tests/unit/ml/diloco + tests/unit/forgather_server green (978 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jdinalt jdinalt merged commit 24a7f15 into dev May 29, 2026
1 check passed
@jdinalt jdinalt deleted the feature/diloco-server-authoritative-settings branch May 29, 2026 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant