
DiLoCo: serve model definition from server; workers build empty + pull weights (#53) #102

Merged
jdinalt merged 6 commits into dev from feature/diloco-model-bundle on May 30, 2026

@jdinalt
Owner

@jdinalt jdinalt commented May 30, 2026

Summary

Final PR of the 3-PR DiLoCo sequence (after #99 server-authoritative settings, #100 unified meta checkpoint loading). Closes the "workers construct empty model and pull weights from server" goal (#53).

A DiLoCo worker no longer takes a model path. The server serves the model definition (config + the custom modeling/configuration .py closure + tokenizer — never weights) from the checkpoint dir it was started from. The worker stages it into its own output dir, builds the model empty on the meta device (PR #100's path), and fills it from the parameter sync at register. This removes the shared-filesystem requirement and makes a mismatched-architecture worker impossible.

How it works

  • Server GET /model_def: streams a deterministic uncompressed tar of the non-weight files with an X-Forgather-Model-Hash header. Control-plane only (bearer-required; never on the bulk listener). load_state persists self._loaded_checkpoint_dir and folds the bundle's content hash into the advertised model_hash so a config/code change (not just a shape change) invalidates a worker's cache cleanly.
  • Bundle policy (forgather.ml.diloco.model_def): single source of truth for include/exclude, deterministic packing/hashing, and traversal-safe extraction. Walks the whole tree for .py, so both files of a split two-file custom model (configuration_x.py + modeling_x.py) ship and trust_remote_code resolves each.
  • Staging (forgather.ml.diloco.model_stage.stage_model_def): a cached, output-dir-scoped fetch into <output_dir>/diloco_model_def/ with a .forgather_model_hash stamp (reuse on match, re-fetch on mismatch) and file_lock_build to serialize DDP ranks. No offline fallback — fail loud.
  • Template: models/from_diloco_server.yaml wires the staging as a !singleton consumed by the tokenizer, the model config, and the model factory. The first consumer (the tokenizer, at dataset preprocessing — which resolves into a real object before the model is built) triggers the one fetch; the rest reuse the cache. construct_model_on=meta under DiLoCo. Launch wrappers unchanged. --model-id-or-path is ignored under DiLoCo; SubmitModal drops the obsolete server-output_dir pre-fill.
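
The deterministic packing and content hashing described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the forgather.ml.diloco.model_def implementation: the suffix sets, the function names (iter_bundle_files, pack_bundle, bundle_hash), and the use of SHA-256 are all hypothetical.

```python
import hashlib
import io
import tarfile
from pathlib import Path

# Hypothetical policy: ship definition files, never weights.
INCLUDE_SUFFIXES = {".py", ".json", ".txt", ".model"}
WEIGHT_SUFFIXES = {".safetensors", ".bin", ".pt"}


def iter_bundle_files(root: Path):
    """Yield (relative_posix_path, absolute_path) pairs in sorted order."""
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Explicit exclusion first, then the allowlist.
        if path.suffix in WEIGHT_SUFFIXES or path.suffix not in INCLUDE_SUFFIXES:
            continue
        yield path.relative_to(root).as_posix(), path


def pack_bundle(root: Path) -> bytes:
    """Pack the definition files into an uncompressed tar with fixed metadata."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        for rel, path in iter_bundle_files(root):
            data = path.read_bytes()
            info = tarfile.TarInfo(name=rel)
            info.size = len(data)
            # Normalize everything that varies between machines or checkouts.
            info.mtime = 0
            info.mode = 0o644
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def bundle_hash(root: Path) -> str:
    """Hash the packed bytes, so any config or code change alters the hash."""
    return hashlib.sha256(pack_bundle(root)).hexdigest()
```

Sorting the walk and zeroing timestamps and ownership means two directories with identical file contents produce byte-identical archives, which is what lets a single hash stand in for the whole definition.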

Design notes (from review during implementation)

  • Staging lives in the worker's output dir, not a global ~/.cache — co-located with the run, no cross-server cache-key collisions.
  • The launch wrappers stay dumb; staging is an in-template lazy callable that closes over the render-time output_dir Jinja var, so the CLI/scheduler never need to resolve the output dir.
  • A shared singleton (not a model_init-only fetch) is required because the tokenizer resolves before the model — proven by a tokenizer-first ordering test.
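
To illustrate why a shared singleton is needed when the tokenizer resolves before the model, here is a minimal sketch. It does not use the project's template machinery; staged_model_dir, build_tokenizer, build_model, and fetch_log are hypothetical stand-ins, and the "fetch" is simulated by appending to a list.

```python
from functools import lru_cache

fetch_log = []  # records each simulated server fetch


@lru_cache(maxsize=None)
def staged_model_dir(output_dir: str) -> str:
    """Hypothetical stand-in for the staging singleton: one fetch per output dir."""
    fetch_log.append(output_dir)
    return f"{output_dir}/diloco_model_def"


def build_tokenizer(output_dir: str) -> str:
    # Resolved first, during dataset preprocessing, so it triggers the fetch.
    return f"tokenizer@{staged_model_dir(output_dir)}"


def build_model(output_dir: str) -> str:
    # Resolved later; reuses the directory the tokenizer already staged.
    return f"model@{staged_model_dir(output_dir)}"
```

If the fetch lived only inside the model factory, the tokenizer would have nothing to load from at preprocessing time; memoizing the staging call lets whichever consumer runs first pay for it.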

Tests

  • Bundle policy + hashing + traversal-safety.
  • /model_def endpoint + auth + persisted dir + folded hash.
  • Stage cache hit / miss / invalidation.
  • End-to-end stage → build-empty-on-meta against models/tiny (multi-file model, tokenizer-first ordering, single fetch, param set matches server).

Full diloco suite (303) + meta-init suite (18) green; forgather ls/pp clean on DiLoCo and non-DiLoCo configs; webui rebuilt.

🤖 Generated with Claude Code

jdinalt and others added 6 commits May 30, 2026 08:31
…l weights (#53)

A DiLoCo worker no longer takes a model path. The server serves the model
*definition* (config + custom modeling/configuration .py closure + tokenizer,
never weights) from the checkpoint dir it was started from; the worker stages
it into its own output dir, builds the model empty on the meta device, and
fills it from the parameter sync at register. This removes the shared-
filesystem requirement and the mismatched-architecture foot-gun.

Server:
- GET /model_def streams a deterministic tar of the non-weight files with an
  X-Forgather-Model-Hash header; control-plane only (bearer-required, never
  on the bulk listener). load_state persists self._loaded_checkpoint_dir and
  folds the bundle's content hash into the advertised model_hash so a stale
  staged definition (config/code change, not just shape) invalidates cleanly.
- New forgather.ml.diloco.model_def: single source of truth for the
  include/exclude policy, deterministic packing/hashing, and traversal-safe
  extraction. Walks the whole tree for .py so split two-file custom models
  (configuration_x.py + modeling_x.py) both ship.
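
Traversal-safe extraction, as named in the model_def item above, might look roughly like the sketch below. The function name safe_extract and its exact checks are assumptions for illustration, not the module's actual code: it validates every member before writing anything, accepts only regular files, and rejects any path that resolves outside the destination.

```python
import io
import tarfile
from pathlib import Path


def safe_extract(bundle: bytes, dest: Path) -> list[str]:
    """Extract regular files only, refusing members that would escape dest."""
    dest = dest.resolve()
    with tarfile.open(fileobj=io.BytesIO(bundle), mode="r:") as tar:
        planned = []
        # Pass 1: validate every member so a bad bundle writes nothing.
        for member in tar.getmembers():
            if not member.isfile():
                raise ValueError(f"refusing non-regular member: {member.name}")
            target = (dest / member.name).resolve()
            if not target.is_relative_to(dest):
                raise ValueError(f"refusing path outside destination: {member.name}")
            planned.append((member, target))
        # Pass 2: write the validated files.
        for member, target in planned:
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(tar.extractfile(member).read())
    return [member.name for member, _ in planned]
```

Rejecting symlinks and other special members outright is a conservative choice: a model-definition bundle should only ever contain plain files.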

Client / staging:
- DiLoCoClient.fetch_model_def validates the hash header against /info and
  extracts the bundle with traversal-safe checks.
- New forgather.ml.diloco.model_stage.stage_model_def stages into
  <output_dir>/diloco_model_def with a .forgather_model_hash stamp (reuse on
  match, re-fetch on mismatch) and file_lock_build to serialize DDP ranks.
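
The stamp-based reuse rule (reuse on match, re-fetch on mismatch) can be sketched as below. Only the stamp file name comes from this commit message; stage_if_stale and the fetch callable are hypothetical, and the cross-rank locking that file_lock_build provides is deliberately left out.

```python
from pathlib import Path
from typing import Callable

STAMP_NAME = ".forgather_model_hash"  # stamp file name from the commit message


def stage_if_stale(stage_dir: Path, server_hash: str,
                   fetch: Callable[[Path], None]) -> bool:
    """Reuse stage_dir when its stamp matches server_hash; otherwise re-fetch.

    Returns True when a fetch happened. `fetch` is a hypothetical callable
    that populates the directory it is given.
    """
    stamp = stage_dir / STAMP_NAME
    if stamp.is_file() and stamp.read_text().strip() == server_hash:
        return False  # cache hit: the staged definition is current
    stage_dir.mkdir(parents=True, exist_ok=True)
    fetch(stage_dir)
    # Write the stamp last so an interrupted fetch is never read as a hit.
    stamp.write_text(server_hash)
    return True
```

In a multi-rank launch, a call like this would need to run under a lock so that only one rank fetches while the others wait and then take the fast path.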

Template wiring:
- New models/from_diloco_server.yaml: a cached !singleton stages the bundle
  and both the tokenizer and model config consume it, so the first consumer
  (the tokenizer, at dataset preprocessing) triggers one fetch and the model
  build reuses it. construct_model_on=meta under DiLoCo. The launch wrappers
  stay unchanged (they already set DILOCO_SERVER/DILOCO_WORKER_ID).
- --model-id-or-path is ignored under DiLoCo (help text updated); SubmitModal
  drops the now-obsolete server-output_dir pre-fill.

Tests: bundle policy + hashing + traversal-safety; /model_def endpoint +
auth + persisted dir + folded hash; stage cache hit/miss/invalidation; and an
end-to-end stage→build-empty-on-meta against models/tiny (multi-file model,
tokenizer-first ordering, single fetch).

Docs: diloco.md + diloco-architecture.md endpoint tables and a
model-definition-staging section.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… operator docs

Local review pass (no high-severity findings). Address low-severity items:
- Server caches the packed /model_def bundle (build once under a lock; the
  loaded checkpoint dir is content-stable for the server's lifetime), so
  concurrent worker fetches don't each re-walk the dir or hold separate
  in-memory tar copies.
- Template renders ns.diloco_server_addr as "" (not the literal "None")
  under the enable_diloco render-time diagnostic, matching the comment.
- docs/trainers/diloco.md: operator note that the server is a trusted
  code-distribution authority (trust_remote_code) and the checkpoint dir
  should hold only model-definition files (every .py is shipped).
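
The build-once-under-a-lock cache from the first item above follows a common pattern, sketched here. BundleCache is a hypothetical name, and the build callable stands in for whatever packs the bundle; this is not the server's actual code.

```python
import threading
from typing import Callable, Optional


class BundleCache:
    """Build an expensive value once and share it across concurrent callers."""

    def __init__(self, build: Callable[[], bytes]):
        self._build = build
        self._lock = threading.Lock()
        self._value: Optional[bytes] = None

    def get(self) -> bytes:
        # Holding the lock across the build makes late arrivals wait for the
        # first caller's result instead of packing their own copy.
        with self._lock:
            if self._value is None:
                self._value = self._build()
            return self._value
```

This is only safe because the checkpoint dir is content-stable for the server's lifetime; a directory that could change would need an invalidation rule as well.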

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… caching

Add a "Model-definition staging" subsection under Lifecycle and Data Flow:
- where it sits (trainer construction, before DiLoCoWorker.start()) and how
  it differs from register/_apply_global_params (skeleton+tokenizer vs weights);
- the components table (model_def.py / GET /model_def / fetch_model_def /
  stage_model_def / from_diloco_server.yaml);
- a step-by-step sequence of what happens when the config loads the model,
  including the tokenizer-first ordering that makes the shared singleton
  load-bearing;
- the caching/invalidation model: worker-side stamp + fast path, atomic
  crash-safe swap, file-lock for DDP ranks, the server-side packed-bundle
  cache, and the folded model_hash semantics.
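
One way to get the crash-safe swap mentioned above is to populate a sibling temporary directory and move it into place only once it is complete. The sketch below assumes that approach; atomic_stage and populate are hypothetical names, and the real implementation may differ.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def atomic_stage(final_dir: Path, populate: Callable[[Path], None]) -> None:
    """Populate a sibling temp dir, then move it into place."""
    final_dir.parent.mkdir(parents=True, exist_ok=True)
    # Same parent directory, so the final rename stays on one filesystem.
    tmp_dir = Path(tempfile.mkdtemp(prefix=final_dir.name + ".tmp-",
                                    dir=final_dir.parent))
    try:
        populate(tmp_dir)
        if final_dir.exists():
            shutil.rmtree(final_dir)
        os.replace(tmp_dir, final_dir)
    except BaseException:
        # A failed or interrupted populate never touches final_dir.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
```

A partially written directory is never visible under the final name. There is a brief window between removing the old directory and the rename in which no directory exists; a stamp check would treat that as a cache miss rather than as a corrupt hit.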

Also list model_def.py / model_stage.py in the Source Layout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ch doc

Per review: keep this PR scoped to "DiLoCo workers don't save/restore
checkpoints." The empty-meta construction and the "checkpoint state but
never model weights" behavior need trainer changes (a configurable
checkpoint-component interface + external-weight construction) that touch
sensitive shared infrastructure across all five get_state_components()
implementations — deferred to a dedicated PR 4.

- Revert the construct_model_on=meta template conditional: it downgraded to
  'default' with a misleading "no checkpoint for meta" warning on every
  fresh run (true empty-meta needs the deferred work). DiLoCo now builds on
  the operator's default strategy and pulls weights from the server.
- Soften the meta over-claims in from_diloco_server.yaml and diloco.md to
  "builds the model empty (from config, no weights)".
- Record the full PR-4 design in diloco-architecture.md (new "Planned:
  configurable checkpoint state + empty-meta construction" section):
  motivation, the _resolve_checkpoint/_prepare_model interaction, the
  get_active_state_components() filter at the two consumers (5 impls
  untouched), construction derived from the component set, DiLoCo config
  defaults, pipeline-trainer impact, touch-points, and scope.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ded)

model_id_or_path was added to the base lm_training_project only to give
DiLoCo workers a way to point at the model directory — overloading the
mechanism finetune_v2.yaml already defined. This PR removes that need
(the worker fetches the definition from the server), and the base project's
non-DiLoCo path constructs from model_project (a from_project build), so
the arg was non-functional here.

Remove from templatelib/examples/projects/lm_training_project.yaml:
- the ns.model_id_or_path resolution block in [globals],
- the --model-id-or-path dynamic arg,
- the debug-echo line (replaced with ns.diloco_server_addr).

finetune_v2.yaml is unaffected: it defines its own --model-id-or-path arg
and sets ns.model_id_or_path itself (verified by render). Docs that mention
--model-id-or-path are finetune-specific (walkthrough) or the DiLoCo
"no model path" statement (diloco.md) — both still accurate, no change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
None of the DiLoCo work has shipped and the arg never existed in this base
template (it was a bootstrap to test model-dir wiring before this PR), so
the comments shouldn't reference it or explain a removal. Drop the NB note
and the model_id_or_path mention in the worker-id-suffix comment; the
remaining text just describes what the template does.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jdinalt jdinalt merged commit 1f9e35e into dev May 30, 2026
1 check passed
@jdinalt jdinalt deleted the feature/diloco-model-bundle branch May 30, 2026 20:25
jdinalt added a commit that referenced this pull request May 30, 2026
…ion)

Per project policy: permanent docs on dev must not narrate revision-to-
revision deltas or PR history (dev accumulates WIP across many PRs and makes
no dev-only compatibility guarantee). Rewrite the diloco-architecture.md
"Checkpoint state selection + empty-meta construction" section as plain
present-tense documentation — drop the "PR #102 was scoped to… / this
follow-up made it real / as implemented" framing and the stale "manifest-only
checkpoints are valid" line (it's the explicit MODEL_EXCLUDED_MARKER now).

Also record the policy in CLAUDE.md: feature branches base on origin/dev,
main is the release branch, and docs describe the feature as-is (WIP design
docs may track PR continuity but must carry a TODO-remove).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>