DiLoCo: serve model definition from server; workers build empty + pull weights (#53) #102
Merged
…l weights (#53)

A DiLoCo worker no longer takes a model path. The server serves the model *definition* (config + custom modeling/configuration .py closure + tokenizer, never weights) from the checkpoint dir it was started from; the worker stages it into its own output dir, builds the model empty on the meta device, and fills it from the parameter sync at register. This removes the shared-filesystem requirement and the mismatched-architecture foot-gun.

Server:
- GET /model_def streams a deterministic tar of the non-weight files with an X-Forgather-Model-Hash header; control-plane only (bearer-required, never on the bulk listener). load_state persists self._loaded_checkpoint_dir and folds the bundle's content hash into the advertised model_hash so a stale staged definition (config/code change, not just shape) invalidates cleanly.
- New forgather.ml.diloco.model_def: single source of truth for the include/exclude policy, deterministic packing/hashing, and traversal-safe extraction. Walks the whole tree for .py so split two-file custom models (configuration_x.py + modeling_x.py) both ship.

Client / staging:
- DiLoCoClient.fetch_model_def validates the hash header against /info and extracts traversal-safe.
- New forgather.ml.diloco.model_stage.stage_model_def stages into <output_dir>/diloco_model_def with a .forgather_model_hash stamp (reuse on match, re-fetch on mismatch) and file_lock_build to serialize DDP ranks.

Template wiring:
- New models/from_diloco_server.yaml: a cached !singleton stages the bundle and both the tokenizer and model config consume it, so the first consumer (the tokenizer, at dataset preprocessing) triggers one fetch and the model build reuses it. construct_model_on=meta under DiLoCo. The launch wrappers stay unchanged (they already set DILOCO_SERVER/DILOCO_WORKER_ID).
- --model-id-or-path is ignored under DiLoCo (help text updated); SubmitModal drops the now-obsolete server-output_dir pre-fill.

Tests: bundle policy + hashing + traversal-safety; /model_def endpoint + auth + persisted dir + folded hash; stage cache hit/miss/invalidation; and an end-to-end stage→build-empty-on-meta against models/tiny (multi-file model, tokenizer-first ordering, single fetch).

Docs: diloco.md + diloco-architecture.md endpoint tables and a model-definition-staging section.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
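The deterministic packing and hashing mentioned in this commit can be illustrated with a minimal sketch. It is not the code in forgather.ml.diloco.model_def: the function name `pack_model_def`, the suffix-based weight filter, and the use of SHA-256 are assumptions chosen for illustration.

```python
import hashlib
import io
import tarfile
from pathlib import Path

# Hypothetical exclude policy: weight files are recognized by suffix only.
WEIGHT_SUFFIXES = {".safetensors", ".bin", ".pt", ".pth"}


def pack_model_def(root: Path) -> tuple[bytes, str]:
    """Pack the non-weight files under root into an uncompressed tar and hash it."""
    buf = io.BytesIO()
    # Sorting the paths gives a stable member order regardless of directory order.
    files = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix not in WEIGHT_SUFFIXES
    )
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for path in files:
            data = path.read_bytes()
            info = tarfile.TarInfo(name=path.relative_to(root).as_posix())
            info.size = len(data)
            # Pin every field that would otherwise vary between hosts or runs.
            info.mtime = 0
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            info.mode = 0o644
            tar.addfile(info, io.BytesIO(data))
    payload = buf.getvalue()
    return payload, hashlib.sha256(payload).hexdigest()
```

Because the hash is taken over the packed bytes, an edit to a config or .py file changes it, while touching a file's timestamp does not.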
… operator docs

Local review pass (no high-severity findings). Address low-severity items:
- Server caches the packed /model_def bundle (build once under a lock; the loaded checkpoint dir is content-stable for the server's lifetime), so concurrent worker fetches don't each re-walk the dir or hold separate in-memory tar copies.
- Template renders ns.diloco_server_addr as "" (not the literal "None") under the enable_diloco render-time diagnostic, matching the comment.
- docs/trainers/diloco.md: operator note that the server is a trusted code-distribution authority (trust_remote_code) and the checkpoint dir should hold only model-definition files (every .py is shipped).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
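The "build once under a lock" caching described in this commit follows a common pattern, sketched below. The class name `BundleCache` and the `build` callable are hypothetical and do not come from the server code.

```python
import threading
from typing import Callable, Optional


class BundleCache:
    """Build a packed bundle at most once and share it across request threads."""

    def __init__(self, build: Callable[[], bytes]) -> None:
        self._build = build
        self._lock = threading.Lock()
        self._bundle: Optional[bytes] = None

    def get(self) -> bytes:
        # Holding the lock across the build makes concurrent first requests
        # wait for a single result instead of each walking the directory.
        with self._lock:
            if self._bundle is None:
                self._bundle = self._build()
            return self._bundle
```

Every caller receives the same bytes object, so concurrent fetches share one in-memory copy. This is only safe while the packed directory does not change, which is the content-stability assumption the commit states.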
… caching

Add a "Model-definition staging" subsection under Lifecycle and Data Flow:
- where it sits (trainer construction, before DiLoCoWorker.start()) and how it differs from register/_apply_global_params (skeleton+tokenizer vs weights);
- the components table (model_def.py / GET /model_def / fetch_model_def / stage_model_def / from_diloco_server.yaml);
- a step-by-step sequence of what happens when the config loads the model, including the tokenizer-first ordering that makes the shared singleton load-bearing;
- the caching/invalidation model: worker-side stamp + fast path, atomic crash-safe swap, file-lock for DDP ranks, the server-side packed-bundle cache, and the folded model_hash semantics.

Also list model_def.py / model_stage.py in the Source Layout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
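The worker-side stamp and fast path listed above can be sketched as follows. This is an illustration under stated assumptions, not stage_model_def itself: the function name `stage` and the `fetch_into` callback are hypothetical, the DDP file lock is omitted, and the swap shown is a simple remove-then-rename rather than whatever the real implementation uses.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable

STAMP_NAME = ".forgather_model_hash"  # stamp file name as given in the PR text


def stage(target: Path, expected_hash: str,
          fetch_into: Callable[[Path], None]) -> bool:
    """Ensure target holds the definition for expected_hash.

    Returns True when a fetch happened and False on a cache hit.
    """
    stamp = target / STAMP_NAME
    if stamp.is_file() and stamp.read_text().strip() == expected_hash:
        return False  # fast path: stamp matches, reuse the staged files

    target.parent.mkdir(parents=True, exist_ok=True)
    # Build in a sibling temp dir so a crash never leaves a half-written target.
    tmp = Path(tempfile.mkdtemp(prefix=target.name + ".", dir=target.parent))
    try:
        fetch_into(tmp)
        # Write the stamp last: its presence implies a complete fetch.
        (tmp / STAMP_NAME).write_text(expected_hash)
        if target.exists():
            shutil.rmtree(target)
        # A crash between rmtree and replace leaves no target at all,
        # which the next call treats as a plain cache miss.
        os.replace(tmp, target)
    except BaseException:
        shutil.rmtree(tmp, ignore_errors=True)
        raise
    return True
```

A second call with the same hash returns without fetching; a call with a different hash discards the old directory and fetches again.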
…ch doc

Per review: keep this PR scoped to "DiLoCo workers don't save/restore checkpoints." The empty-meta construction and the "checkpoint state but never model weights" behavior need trainer changes (a configurable checkpoint-component interface + external-weight construction) that touch sensitive shared infrastructure across all five get_state_components() implementations; they are deferred to a dedicated PR 4.
- Revert the construct_model_on=meta template conditional: it downgraded to 'default' with a misleading "no checkpoint for meta" warning on every fresh run (true empty-meta needs the deferred work). DiLoCo now builds on the operator's default strategy and pulls weights from the server.
- Soften the meta over-claims in from_diloco_server.yaml and diloco.md to "builds the model empty (from config, no weights)".
- Record the full PR-4 design in diloco-architecture.md (new "Planned: configurable checkpoint state + empty-meta construction" section): motivation, the _resolve_checkpoint/_prepare_model interaction, the get_active_state_components() filter at the two consumers (5 impls untouched), construction derived from the component set, DiLoCo config defaults, pipeline-trainer impact, touch-points, and scope.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ded)

model_id_or_path was added to the base lm_training_project only to give DiLoCo workers a way to point at the model directory, overloading the mechanism finetune_v2.yaml already defined. This PR removes that need (the worker fetches the definition from the server), and the base project's non-DiLoCo path constructs from model_project (a from_project build), so the arg was non-functional here.

Remove from templatelib/examples/projects/lm_training_project.yaml:
- the ns.model_id_or_path resolution block in [globals],
- the --model-id-or-path dynamic arg,
- the debug-echo line (replaced with ns.diloco_server_addr).

finetune_v2.yaml is unaffected: it defines its own --model-id-or-path arg and sets ns.model_id_or_path itself (verified by render). Docs that mention --model-id-or-path are finetune-specific (walkthrough) or the DiLoCo "no model path" statement (diloco.md); both are still accurate, no change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
None of the DiLoCo work has shipped and the arg never existed in this base template (it was a bootstrap to test model-dir wiring before this PR), so the comments shouldn't reference it or explain a removal. Drop the NB note and the model_id_or_path mention in the worker-id-suffix comment; the remaining text just describes what the template does.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jdinalt added a commit that referenced this pull request on May 30, 2026
…ion)

Per project policy: permanent docs on dev must not narrate revision-to-revision deltas or PR history (dev accumulates WIP across many PRs and makes no dev-only compatibility guarantee). Rewrite the diloco-architecture.md "Checkpoint state selection + empty-meta construction" section as plain present-tense documentation: drop the "PR #102 was scoped to… / this follow-up made it real / as implemented" framing and the stale "manifest-only checkpoints are valid" line (it's the explicit MODEL_EXCLUDED_MARKER now).

Also record the policy in CLAUDE.md: feature branches base on origin/dev, main is the release branch, and docs describe the feature as-is (WIP design docs may track PR continuity but must carry a TODO-remove).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Final PR of the 3-PR DiLoCo sequence (after #99 server-authoritative settings, #100 unified meta checkpoint loading). Closes the "workers construct empty model and pull weights from server" goal (#53).
A DiLoCo worker no longer takes a model path. The server serves the model definition (config + the custom modeling/configuration `.py` closure + tokenizer, never weights) from the checkpoint dir it was started from. The worker stages it into its own output dir, builds the model empty on the meta device (PR #100's path), and fills it from the parameter sync at register. This removes the shared-filesystem requirement and makes a mismatched-architecture worker impossible.

How it works
- `GET /model_def`: streams a deterministic uncompressed tar of the non-weight files with an `X-Forgather-Model-Hash` header. Control-plane only (bearer-required; never on the bulk listener). `load_state` persists `self._loaded_checkpoint_dir` and folds the bundle's content hash into the advertised `model_hash` so a config/code change (not just a shape change) invalidates a worker's cache cleanly.
- `forgather.ml.diloco.model_def`: single source of truth for include/exclude, deterministic packing/hashing, and traversal-safe extraction. Walks the whole tree for `.py`, so both files of a split two-file custom model (`configuration_x.py` + `modeling_x.py`) ship and `trust_remote_code` resolves each.
- `forgather.ml.diloco.model_stage.stage_model_def`: a cached, output-dir-scoped fetch into `<output_dir>/diloco_model_def/` with a `.forgather_model_hash` stamp (reuse on match, re-fetch on mismatch) and `file_lock_build` to serialize DDP ranks. No offline fallback; fail loud.
- `models/from_diloco_server.yaml` wires the staging as a `!singleton` consumed by the tokenizer, the model config, and the model factory. The first consumer (the tokenizer, at dataset preprocessing, which resolves into a real object before the model is built) triggers the one fetch; the rest reuse the cache.
- `construct_model_on=meta` under DiLoCo. Launch wrappers unchanged.
- `--model-id-or-path` is ignored under DiLoCo; SubmitModal drops the obsolete server-output_dir pre-fill.

Design notes (from review during implementation)
- … `~/.cache`; co-located with the run, no cross-server cache-key collisions.
- … `output_dir` Jinja var, so the CLI/scheduler never need to resolve the output dir.

Tests
- `/model_def` endpoint + auth + persisted dir + folded hash.
- … `models/tiny` (multi-file model, tokenizer-first ordering, single fetch, param set matches server).

Full diloco suite (303) + meta-init suite (18) green; `forgather ls`/`pp` clean on DiLoCo and non-DiLoCo configs; webui rebuilt.

🤖 Generated with Claude Code
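The traversal-safe extraction named under "How it works" could look roughly like the sketch below. The function name `safe_extract` and its policy of accepting only regular files are assumptions for illustration; the PR's own extractor may differ.

```python
import io
import tarfile
from pathlib import Path


def safe_extract(payload: bytes, dest: Path) -> list[str]:
    """Extract regular files from a tar payload, refusing paths that escape dest."""
    dest = dest.resolve()
    dest.mkdir(parents=True, exist_ok=True)
    written: list[str] = []
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:") as tar:
        for member in tar:
            # Only plain files are expected; links, dirs, and devices are rejected.
            if not member.isfile():
                raise ValueError(f"unexpected member type: {member.name}")
            out = (dest / member.name).resolve()
            # An absolute name or a '..' component resolves outside dest.
            if not out.is_relative_to(dest):
                raise ValueError(f"unsafe path in bundle: {member.name}")
            out.parent.mkdir(parents=True, exist_ok=True)
            src = tar.extractfile(member)
            assert src is not None  # extractfile returns a reader for regular files
            out.write_bytes(src.read())
            written.append(member.name)
    return written
```

Checking each resolved path before writing means a malicious member name fails the whole extraction instead of writing outside the staging directory. `Path.is_relative_to` requires Python 3.9 or later.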