DiLoCo: serve model definition from server; workers build empty + pull weights (#53) by jdinalt · Pull Request #102 · jdinalt/forgather

jdinalt · 2026-05-30T08:31:42Z

Summary

Final PR of the 3-PR DiLoCo sequence (after #99 server-authoritative settings, #100 unified meta checkpoint loading). Closes the "workers construct empty model and pull weights from server" goal (#53).

A DiLoCo worker no longer takes a model path. The server serves the model definition (config + the custom modeling/configuration .py closure + tokenizer — never weights) from the checkpoint dir it was started from. The worker stages it into its own output dir, builds the model empty on the meta device (PR #100's path), and fills it from the parameter sync at register. This removes the shared-filesystem requirement and makes a mismatched-architecture worker impossible.

How it works

Server GET /model_def: streams a deterministic uncompressed tar of the non-weight files with an X-Forgather-Model-Hash header. Control-plane only (bearer-required; never on the bulk listener). load_state persists self._loaded_checkpoint_dir and folds the bundle's content hash into the advertised model_hash so a config/code change (not just a shape change) invalidates a worker's cache cleanly.
Bundle policy (forgather.ml.diloco.model_def): single source of truth for include/exclude, deterministic packing/hashing, and traversal-safe extraction. Walks the whole tree for .py, so a split two-file custom model (configuration_x.py + modeling_x.py) both ship and trust_remote_code resolves each.
Staging (forgather.ml.diloco.model_stage.stage_model_def): a cached, output-dir-scoped fetch into <output_dir>/diloco_model_def/ with a .forgather_model_hash stamp (reuse on match, re-fetch on mismatch) and file_lock_build to serialize DDP ranks. No offline fallback — fail loud.
Template: models/from_diloco_server.yaml wires the staging as a !singleton consumed by the tokenizer, the model config, and the model factory. The first consumer (the tokenizer, at dataset preprocessing — which resolves into a real object before the model is built) triggers the one fetch; the rest reuse the cache. construct_model_on=meta under DiLoCo. Launch wrappers unchanged. --model-id-or-path is ignored under DiLoCo; SubmitModal drops the obsolete server-output_dir pre-fill.
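The deterministic packing and content hashing described above can be sketched as follows. This is a minimal illustration, not the code in forgather.ml.diloco.model_def: the `WEIGHT_SUFFIXES` set, the function names, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import io
import tarfile
from pathlib import Path

# Hypothetical exclude policy for illustration only; the real include/exclude
# rules live in forgather.ml.diloco.model_def and may differ.
WEIGHT_SUFFIXES = {".safetensors", ".bin", ".pt", ".pth"}


def iter_bundle_files(root: Path):
    """Yield non-weight files under root in a stable, sorted order."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix not in WEIGHT_SUFFIXES:
            yield path


def pack_bundle(root: Path) -> bytes:
    """Pack the files into an uncompressed tar with normalized metadata."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        for path in iter_bundle_files(root):
            data = path.read_bytes()
            info = tarfile.TarInfo(name=path.relative_to(root).as_posix())
            info.size = len(data)
            # Fixed metadata so the bytes depend only on names and contents.
            info.mtime = 0
            info.mode = 0o644
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def bundle_hash(bundle: bytes) -> str:
    """Content hash of the packed bundle (SHA-256 is an assumption)."""
    return hashlib.sha256(bundle).hexdigest()
```

Because timestamps, ownership, and ordering are normalized, two packs of an unchanged directory produce identical bytes, while any edit to a config or .py file changes the hash. That property is what lets a folded hash invalidate a worker's cache on code changes and not only on shape changes.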

Design notes (from review during implementation)

Staging lives in the worker's output dir, not a global ~/.cache — co-located with the run, no cross-server cache-key collisions.
The launch wrappers stay dumb; staging is an in-template lazy callable that closes over the render-time output_dir Jinja var, so the CLI/scheduler never need to resolve the output dir.
A shared singleton (not a model_init-only fetch) is required because the tokenizer resolves before the model — proven by a tokenizer-first ordering test.
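The shared-singleton point can be illustrated with a small memoized callable. This is only a sketch of the idea: `LazySingleton` and `make_stager` are hypothetical names, and the real wiring is a `!singleton` node in the template, not a Python class.

```python
from typing import Callable, Optional


class LazySingleton:
    """Call the factory at most once and hand the same result to every consumer."""

    def __init__(self, factory: Callable[[], str]):
        self._factory = factory
        self._value: Optional[str] = None
        self.calls = 0  # counts real fetches, for illustration

    def __call__(self) -> str:
        if self._value is None:
            self.calls += 1
            self._value = self._factory()
        return self._value


def make_stager(output_dir: str) -> LazySingleton:
    # Closes over output_dir, mirroring how the template closes over the
    # render-time variable so the launcher never has to resolve it.
    def fetch() -> str:
        # Placeholder for the real network fetch and extraction.
        return f"{output_dir}/diloco_model_def"

    return LazySingleton(fetch)


staged = make_stager("/tmp/run0")
tokenizer_dir = staged()  # the tokenizer resolves first and triggers the fetch
model_dir = staged()      # the model config and factory reuse the cached result
```

If the fetch were attached to the model factory alone, the tokenizer would have nothing to load from at dataset preprocessing time. Sharing one lazy object makes the ordering irrelevant. The sketch is not thread-safe; it assumes single-threaded config materialization.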

Tests

Bundle policy + hashing + traversal-safety.
/model_def endpoint + auth + persisted dir + folded hash.
Stage cache hit / miss / invalidation.
End-to-end stage → build-empty-on-meta against models/tiny (multi-file model, tokenizer-first ordering, single fetch, param set matches server).

Full diloco suite (303) + meta-init suite (18) green; forgather ls/pp clean on DiLoCo and non-DiLoCo configs; webui rebuilt.

🤖 Generated with Claude Code

…l weights (#53)

A DiLoCo worker no longer takes a model path. The server serves the model *definition* (config + custom modeling/configuration .py closure + tokenizer, never weights) from the checkpoint dir it was started from; the worker stages it into its own output dir, builds the model empty on the meta device, and fills it from the parameter sync at register. This removes the shared-filesystem requirement and the mismatched-architecture foot-gun.

Server:
- GET /model_def streams a deterministic tar of the non-weight files with an X-Forgather-Model-Hash header; control-plane only (bearer-required, never on the bulk listener). load_state persists self._loaded_checkpoint_dir and folds the bundle's content hash into the advertised model_hash so a stale staged definition (config/code change, not just shape) invalidates cleanly.
- New forgather.ml.diloco.model_def: single source of truth for the include/exclude policy, deterministic packing/hashing, and traversal-safe extraction. Walks the whole tree for .py so split two-file custom models (configuration_x.py + modeling_x.py) both ship.

Client / staging:
- DiLoCoClient.fetch_model_def validates the hash header against /info and extracts traversal-safe.
- New forgather.ml.diloco.model_stage.stage_model_def stages into <output_dir>/diloco_model_def with a .forgather_model_hash stamp (reuse on match, re-fetch on mismatch) and file_lock_build to serialize DDP ranks.

Template wiring:
- New models/from_diloco_server.yaml: a cached !singleton stages the bundle and both the tokenizer and model config consume it, so the first consumer (the tokenizer, at dataset preprocessing) triggers one fetch and the model build reuses it. construct_model_on=meta under DiLoCo. The launch wrappers stay unchanged (they already set DILOCO_SERVER/DILOCO_WORKER_ID).
- --model-id-or-path is ignored under DiLoCo (help text updated); SubmitModal drops the now-obsolete server-output_dir pre-fill.

Tests: bundle policy + hashing + traversal-safety; /model_def endpoint + auth + persisted dir + folded hash; stage cache hit/miss/invalidation; and an end-to-end stage→build-empty-on-meta against models/tiny (multi-file model, tokenizer-first ordering, single fetch).

Docs: diloco.md + diloco-architecture.md endpoint tables and a model-definition-staging section.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
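Traversal-safe extraction, as mentioned for fetch_model_def, could look roughly like the sketch below. It is an assumption-laden illustration rather than the actual client code: `safe_extract` is a hypothetical name, and rejecting every non-regular member is one possible policy among several.

```python
import io
import tarfile
from pathlib import Path


def safe_extract(bundle: bytes, dest: Path) -> list[str]:
    """Extract regular files only, refusing any path that escapes dest."""
    dest = dest.resolve()
    written = []
    # "r:" accepts only uncompressed tar streams.
    with tarfile.open(fileobj=io.BytesIO(bundle), mode="r:") as tar:
        for member in tar.getmembers():
            if member.isdir():
                continue  # parent directories are created on demand below
            if not member.isfile():
                # Symlinks, hardlinks, and devices are rejected outright.
                raise ValueError(f"unsupported member type: {member.name}")
            target = (dest / member.name).resolve()
            if not target.is_relative_to(dest):
                # Catches both "../" components and absolute member names.
                raise ValueError(f"unsafe path in bundle: {member.name}")
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(tar.extractfile(member).read())
            written.append(member.name)
    return written
```

Resolving each target and checking it against the resolved destination handles absolute names as well as `..` components, since joining an absolute path onto `dest` discards `dest` entirely. The sketch needs Python 3.9 or later for `Path.is_relative_to`.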

… operator docs

Local review pass (no high-severity findings). Address low-severity items:
- Server caches the packed /model_def bundle (build once under a lock; the loaded checkpoint dir is content-stable for the server's lifetime), so concurrent worker fetches don't each re-walk the dir or hold separate in-memory tar copies.
- Template renders ns.diloco_server_addr as "" (not the literal "None") under the enable_diloco render-time diagnostic, matching the comment.
- docs/trainers/diloco.md: operator note that the server is a trusted code-distribution authority (trust_remote_code) and the checkpoint dir should hold only model-definition files (every .py is shipped).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
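The "build once under a lock" caching of the packed bundle might be sketched like this. `PackedBundleCache` is a hypothetical name, and the sketch assumes, as the commit message states, that the checkpoint dir does not change during the server's lifetime.

```python
import threading
from typing import Callable, Optional


class PackedBundleCache:
    """Build the packed bundle on first request and share it afterwards."""

    def __init__(self, build: Callable[[], bytes]):
        self._build = build
        self._lock = threading.Lock()
        self._bundle: Optional[bytes] = None
        self.builds = 0  # number of real builds, for illustration

    def get(self) -> bytes:
        # Holding the lock across the build means concurrent first requests
        # wait for one build instead of each walking the directory.
        with self._lock:
            if self._bundle is None:
                self.builds += 1
                self._bundle = self._build()
            return self._bundle
```

Every caller receives the same `bytes` object, so a burst of workers registering together shares one in-memory copy of the tar.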

… caching

Add a "Model-definition staging" subsection under Lifecycle and Data Flow:
- where it sits (trainer construction, before DiLoCoWorker.start()) and how it differs from register/_apply_global_params (skeleton+tokenizer vs weights);
- the components table (model_def.py / GET /model_def / fetch_model_def / stage_model_def / from_diloco_server.yaml);
- a step-by-step sequence of what happens when the config loads the model, including the tokenizer-first ordering that makes the shared singleton load-bearing;
- the caching/invalidation model: worker-side stamp + fast path, atomic crash-safe swap, file-lock for DDP ranks, the server-side packed-bundle cache, and the folded model_hash semantics.

Also list model_def.py / model_stage.py in the Source Layout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
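The worker-side stamp, fast path, and crash-safe swap could be approximated as below. This is a sketch under stated assumptions: `stage` and `fetch_into` are hypothetical names, the DDP file lock is omitted, and the real stage_model_def may order its steps differently.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable

STAMP = ".forgather_model_hash"


def stage(output_dir: Path, server_hash: str,
          fetch_into: Callable[[Path], None]) -> Path:
    """Return the staged dir, re-fetching only when the stamp does not match."""
    target = output_dir / "diloco_model_def"
    stamp = target / STAMP
    # Fast path: a matching stamp means the staged definition is current.
    if stamp.is_file() and stamp.read_text().strip() == server_hash:
        return target
    output_dir.mkdir(parents=True, exist_ok=True)
    # Build in a sibling temp dir so a crash never leaves a half-written
    # directory at the final path carrying a valid stamp.
    tmp = Path(tempfile.mkdtemp(prefix="diloco_model_def.", dir=output_dir))
    try:
        fetch_into(tmp)
        (tmp / STAMP).write_text(server_hash)  # the stamp is written last
        if target.exists():
            shutil.rmtree(target)
        os.rename(tmp, target)  # same filesystem, so the rename is atomic on POSIX
    except BaseException:
        shutil.rmtree(tmp, ignore_errors=True)
        raise
    return target
```

Removing the stale directory before the rename leaves a brief window in which no staged directory exists. A crash inside that window is harmless in this sketch because the next run sees no stamp and fetches again.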

…ch doc

Per review: keep this PR scoped to "DiLoCo workers don't save/restore checkpoints." The empty-meta construction and the "checkpoint state but never model weights" behavior need trainer changes (a configurable checkpoint-component interface + external-weight construction) that touch sensitive shared infrastructure across all five get_state_components() implementations; they are deferred to a dedicated PR 4.
- Revert the construct_model_on=meta template conditional: it downgraded to 'default' with a misleading "no checkpoint for meta" warning on every fresh run (true empty-meta needs the deferred work). DiLoCo now builds on the operator's default strategy and pulls weights from the server.
- Soften the meta over-claims in from_diloco_server.yaml and diloco.md to "builds the model empty (from config, no weights)".
- Record the full PR-4 design in diloco-architecture.md (new "Planned: configurable checkpoint state + empty-meta construction" section): motivation, the _resolve_checkpoint/_prepare_model interaction, the get_active_state_components() filter at the two consumers (5 impls untouched), construction derived from the component set, DiLoCo config defaults, pipeline-trainer impact, touch-points, and scope.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ded)

model_id_or_path was added to the base lm_training_project only to give DiLoCo workers a way to point at the model directory, overloading the mechanism finetune_v2.yaml already defined. This PR removes that need (the worker fetches the definition from the server), and the base project's non-DiLoCo path constructs from model_project (a from_project build), so the arg was non-functional here.

Remove from templatelib/examples/projects/lm_training_project.yaml:
- the ns.model_id_or_path resolution block in [globals],
- the --model-id-or-path dynamic arg,
- the debug-echo line (replaced with ns.diloco_server_addr).

finetune_v2.yaml is unaffected: it defines its own --model-id-or-path arg and sets ns.model_id_or_path itself (verified by render). Docs that mention --model-id-or-path are finetune-specific (walkthrough) or the DiLoCo "no model path" statement (diloco.md); both are still accurate, no change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

None of the DiLoCo work has shipped and the arg never existed in this base template (it was a bootstrap to test model-dir wiring before this PR), so the comments shouldn't reference it or explain a removal. Drop the NB note and the model_id_or_path mention in the worker-id-suffix comment; the remaining text just describes what the template does.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ion)

Per project policy: permanent docs on dev must not narrate revision-to-revision deltas or PR history (dev accumulates WIP across many PRs and makes no dev-only compatibility guarantee). Rewrite the diloco-architecture.md "Checkpoint state selection + empty-meta construction" section as plain present-tense documentation: drop the "PR #102 was scoped to… / this follow-up made it real / as implemented" framing and the stale "manifest-only checkpoints are valid" line (it's the explicit MODEL_EXCLUDED_MARKER now).

Also record the policy in CLAUDE.md: feature branches base on origin/dev, main is the release branch, and docs describe the feature as-is (WIP design docs may track PR continuity but must carry a TODO-remove).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jdinalt and others added 6 commits May 30, 2026 08:31

jdinalt mentioned this pull request May 30, 2026

DiLoCo: worker-id from dynamic-args desyncs output-dir suffix (and drops view correlation) #103

Open

jdinalt merged commit 1f9e35e into dev May 30, 2026
1 check passed

jdinalt deleted the feature/diloco-model-bundle branch May 30, 2026 20:25

jdinalt mentioned this pull request May 30, 2026

trainer: configurable checkpoint state components + empty-meta construction #104

Open

jdinalt merged 6 commits into dev from feature/diloco-model-bundle
