
[codex] Add sparse filesystem weight broadcast#2607

Draft
samsja wants to merge 5 commits into main from feat/sparse-filesystem-broadcast


samsja commented May 23, 2026

Summary

Adds sparse checkpoint-format transfer as an opt-in mode of the existing filesystem weight broadcast backend.

  • Adds shared, trainer, orchestrator, and inference config support for weight_broadcast.type = "filesystem" with weight_broadcast.sparse = true.
  • Implements a trainer backend that writes full checkpoints on first/forced syncs and layerwise sparse delta directories otherwise, without gathering the full model state dict on rank 0.
  • Adds an inference worker that materializes sparse deltas into a private local checkpoint cache before reusing the existing vLLM checkpoint reload path.
  • Logs sparse broadcast metrics from the trainer backend; orchestrator W&B logging emits only sparse_broadcast_ratio so the main run shows the delta/full-weight size ratio directly.
  • Preserves full base checkpoints needed by surviving sparse delta chains during broadcast cleanup.
  • Keeps LoRA rejected for sparse filesystem broadcast until adapter semantics are defined.
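
For readers unfamiliar with the idea, the sparse delta round trip described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the helper names (`compute_delta`, `apply_delta`, `element_ratio`), the per-tensor `(flat_index, value)` delta layout, and the use of plain Python lists in place of tensors are all assumptions. In particular, the ratio here counts elements only, whereas the `sparse_broadcast_ratio` logged by the PR may be measured differently (for example in bytes, including index overhead).

```python
from typing import Dict, List, Tuple

# Toy stand-ins for real tensors: each parameter is a flat list of floats.
StateDict = Dict[str, List[float]]
# Hypothetical delta layout: per-parameter list of (flat_index, new_value).
SparseDelta = Dict[str, List[Tuple[int, float]]]


def compute_delta(prev: StateDict, curr: StateDict) -> SparseDelta:
    """Record only the elements that changed, one parameter at a time."""
    delta: SparseDelta = {}
    for name, values in curr.items():
        changed = [
            (i, new)
            for i, (old, new) in enumerate(zip(prev[name], values))
            if old != new
        ]
        if changed:  # unchanged parameters are omitted from the delta entirely
            delta[name] = changed
    return delta


def apply_delta(base: StateDict, delta: SparseDelta) -> StateDict:
    """Materialize a full state dict from a base checkpoint plus one delta."""
    out = {name: list(values) for name, values in base.items()}
    for name, entries in delta.items():
        for i, value in entries:
            out[name][i] = value
    return out


def element_ratio(delta: SparseDelta, full: StateDict) -> float:
    """Changed elements divided by total elements (ignores index overhead)."""
    changed = sum(len(entries) for entries in delta.values())
    total = sum(len(values) for values in full.values())
    return changed / total


# Two consecutive "checkpoints" that differ in a single element.
prev = {"layer0.weight": [0.1, 0.2, 0.3, 0.4], "layer1.weight": [1.0, 2.0]}
curr = {"layer0.weight": [0.1, 0.25, 0.3, 0.4], "layer1.weight": [1.0, 2.0]}

delta = compute_delta(prev, curr)
restored = apply_delta(prev, delta)
ratio = element_ratio(delta, curr)
```

In this toy case the receiver rebuilds `curr` exactly from `prev` plus the delta while only one of six elements is transferred. It also shows why cleanup has to keep the full base checkpoint: a delta cannot be materialized without the base it was computed against.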

Validation

  • CLI parse check against examples/reverse_text/rl.toml with --weight-broadcast.type filesystem --weight-broadcast.sparse true --weight-broadcast.full-sync-interval 10
  • uv run ruff check packages/prime-rl-configs/src/prime_rl/configs/trainer.py packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py packages/prime-rl-configs/src/prime_rl/configs/inference.py packages/prime-rl-configs/src/prime_rl/configs/rl.py packages/prime-rl-configs/src/prime_rl/utils/validation.py src/prime_rl/trainer/rl/broadcast/__init__.py src/prime_rl/trainer/rl/broadcast/sparse_filesystem.py src/prime_rl/trainer/rl/train.py src/prime_rl/inference/vllm/server.py tests/unit/test_configs.py
  • uv run pytest tests/unit/test_configs.py tests/unit/utils/test_sparse_weights.py tests/unit/orchestrator/test_scheduler.py -q
  • uv run ruff check src/prime_rl/orchestrator/scheduler.py tests/unit/orchestrator/test_scheduler.py
  • uv run pytest tests/unit/orchestrator/test_scheduler.py tests/unit/utils/test_sparse_weights.py -q
  • uv run ruff check src/prime_rl/utils/sparse_weights.py src/prime_rl/trainer/rl/broadcast/base.py src/prime_rl/trainer/rl/broadcast/sparse_filesystem.py src/prime_rl/trainer/rl/train.py src/prime_rl/orchestrator/scheduler.py tests/unit/utils/test_sparse_weights.py
  • uv run ruff check src/prime_rl/utils/sparse_weights.py src/prime_rl/trainer/rl/broadcast/sparse_filesystem.py src/prime_rl/inference/vllm/worker/sparse_filesystem.py src/prime_rl/trainer/rl/train.py packages/prime-rl-configs/src/prime_rl/configs/{trainer,orchestrator,inference,rl}.py tests/unit/utils/test_sparse_weights.py tests/unit/test_configs.py
  • uv run pytest tests/unit/utils/test_sparse_weights.py tests/unit/test_configs.py -q
