diloco: make worker settings server-authoritative via /info#99
Merged
Conversation
sync_every, bf16_comm, dylu and num_fragments must match across every worker in a group for the sync barrier / outer step / fragment barriers to be coherent. They were redundantly re-specified on the worker CLI, in the callback templates, and in the webui — a foot-gun that let a worker silently diverge from the group. The server already advertises them via GET /info; now the worker actually consumes them and there is no client override (a divergent value is never useful). Server (/info): - expected_client_settings is now complete and non-null: sync_every is advertised as dylu_base_sync_every whether or not DyLU is on, plus num_fragments_default and heartbeat_timeout. Added settings_authority and a coarse model_hash (param name/shape fingerprint, consumed by the upcoming model-bundle work). Worker: - DiLoCoClient.get_info(). - DiLoCoCallback drops the four must-match constructor args / DILOCO_* reads; on_train_begin fetches /info (leader-only, broadcast to DDP followers), applies the settings verbatim and logs them. A server that doesn't advertise sync_every (too old) is fatal, not silently defaulted. heartbeat_interval stays client-local and is validated against the server's heartbeat_timeout. Surface removal: - `forgather diloco worker` drops --sync-every / --no-bf16 / --dylu / --num-fragments and no longer emits the matching DILOCO_* env vars. - scheduler no longer forwards the four settings. - callbacks/diloco.yaml + mixins/diloco.yaml drop the four knobs. - SubmitModal removes the four editable inputs and shows the server's values read-only instead (stable layout). - docs/trainers/diloco.md updated to match. Tests: extended /info shape + model_hash determinism; callback adoption matrix + heartbeat validation; scheduler no longer forwards the four. All DiLoCo + forgather_server suites green (974 passed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review of the previous commit (and the server modal) surfaced that three of the four settings made server-authoritative had no server-side knob — /info just hardcoded them. That made streaming-fragment mode and full-precision comm unreachable, and forced sync_every through the misnamed dylu_base_sync_every. - DiLoCoServer gains sync_every / bf16_comm / num_fragments constructor args; /info advertises the real values (sync_every = dylu_base when DyLU is on, else the dedicated sync_every). - `forgather diloco server` gains --sync-every / --num-fragments / --no-bf16. Threaded through the spawn chain (build_diloco_server_command -> launcher -> scheduler) and surfaced in the DiLoCoServerModal as a "Worker settings (group-wide)" section. bf16_comm stays centralized on the server (one wire precision for the whole group), per the design call. - Fix DDP follower deadlock: the callback now broadcasts an error sentinel from the leader on /info failure so every rank raises together, instead of followers blocking forever in broadcast_object_list when the leader aborts (e.g. a too-old server with null sync_every). - Docs/messages: diloco-architecture.md and diloco.md document the new server flags and stop pointing at the removed worker flags; the worker DyLU-under-pipeline error points at the server --dylu; fixed a misleading test_routes_diloco fixture. Tests: server /info reflects configured group settings (+ DyLU-base case); build_diloco_server_command emits the new flags. Full tests/unit/ml/diloco + tests/unit/forgather_server green (978 passed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
First of three sequential PRs collapsing DiLoCo client/server configuration duplication (see the others: unified meta construction, then the model-definition bundle).
A DiLoCo worker had to be hand-configured to match the server on
sync_every,bf16_comm,dylu, andnum_fragments— re-specified on the worker CLI, in the callback templates, and in the webui. These must match across the group for the sync barrier / outer step / fragment barriers to be coherent, so a divergent value is only ever a foot-gun. The server already advertises them viaGET /info; this PR makes the worker consume them and removes the redundant client-side surface entirely (no override — it can be added back later if a concrete need appears).Changes
Server (
/info)expected_client_settingsis now complete and non-null:sync_everyis advertised asdylu_base_sync_everywhether or not DyLU is on, plusnum_fragments_defaultandheartbeat_timeout.settings_authority: "server"and a coarsemodel_hash(sorted param name/shape fingerprint) — the latter is scaffolding consumed by the upcoming model-bundle PR.Worker
DiLoCoClient.get_info().DiLoCoCallbackdrops the four must-match constructor args /DILOCO_*reads.on_train_beginfetches/info(leader-only, broadcast to DDP followers), applies the settings verbatim and logs them. A server too old to advertisesync_everyis fatal (no silent default).heartbeat_intervalstays client-local and is validated against the server'sheartbeat_timeout.Surface removal
forgather diloco workerdrops--sync-every/--no-bf16/--dylu/--num-fragmentsand stops emitting the matchingDILOCO_*env vars.callbacks/diloco.yaml+mixins/diloco.yamldrop the four knobs.SubmitModalremoves the four editable inputs and shows the server's values read-only instead (stable layout).docs/trainers/diloco.mdupdated to match. The low-levelDiLoCoWorkerprogrammatic API is unchanged.Testing
/infoshape +model_hashdeterminism tests.tests/unit/ml/diloco/+tests/unit/forgather_server/green (974 passed).forgather lsclean. Webui builds clean.🤖 Generated with Claude Code