Security for the DiLoCo subsystem (server + worker + webui proxy) is out of scope for current PRs including #89 (pipeline groups). This issue tracks the eventual full-security PR so we don't lose the requirement.
Current security posture
The DiLoCo server (`src/forgather/ml/diloco/server.py`) runs on a trusted local network with no authentication on its HTTP endpoints. Anyone who can reach the bind address can:
Register / deregister workers (including spoofing worker_ids of legitimate workers)
Submit pseudo-gradients (poisoning the global model)
Trigger `/control/shutdown`, `/control/update_optimizer`, etc.
Read `/status`, `/info`, `/work/queues` (model fingerprints, training progress)
The webui's DiLoCo panel proxies these endpoints (`tools/forgather_server/routes/diloco.py`) and the forgather_server itself has auth, but traffic to the upstream DiLoCo server itself is unprotected.
What a future security PR needs to cover
Server-side auth. Bearer-token or mTLS on every DiLoCo HTTP endpoint. The webui proxy's existing per-server auth-token plumbing (`diloco_server_registry.auth_token`) is the entry point; the server side needs to actually verify it.
Worker-side credentials. `DiLoCoWorker` / `DiLoCoClient` need to plumb a bearer through to every request (register, submit, heartbeat, deregister, work-unit dispatch, control).
Group-membership trust. Currently any worker can claim any `group_id` / `pp_rank` (issue #84, "DiLoCo + Pipeline Parallel: worker.start() fails on meta-device model"). A future trust model should bind group membership to an authenticated identity, e.g. a per-job token issued by the forgather_server when it schedules the workers.
Audit log. Outer-optimizer steps, control actions, and deregistrations should be loggable to a tamper-evident sink.
Webui surface. The DiLoCo registry's `auth_token` UI is already in place; the server needs to consume it.
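The server-side and worker-side items above could be sketched roughly as follows. This is only an illustration under stated assumptions: `bearer_header` and `is_authorized` are hypothetical helper names, not existing functions in `server.py` or `DiLoCoClient`, and how the expected token reaches the server (config, environment, the registry's `auth_token`) is left open.

```python
import hmac
from typing import Dict, Mapping, Optional


def bearer_header(token: str) -> Dict[str, str]:
    """Worker side (hypothetical helper): header to attach to every request."""
    return {"Authorization": f"Bearer {token}"}


def is_authorized(headers: Mapping[str, str], expected_token: Optional[str]) -> bool:
    """Server side (hypothetical helper): check the presented bearer token.

    Assumes `headers` is whatever mapping the HTTP framework exposes; real
    frameworks usually make this lookup case-insensitive, a plain dict does not.
    """
    if not expected_token:
        # Fail closed: with no token configured, nothing is authorized.
        return False
    value = headers.get("Authorization", "")
    scheme, _, presented = value.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(presented.strip().encode(), expected_token.encode())
```

A handler would call `is_authorized(request_headers, configured_token)` before touching the worker registry and respond with 401 otherwise. A bearer token only helps over a confidential channel, so this sketch would need to sit behind TLS (or be replaced by mTLS, as noted above).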
Threats explicitly out of scope today
PR #89 (pipeline groups) and prior DiLoCo PRs assume a trusted-LAN model. Items the reviews surfaced that are security-adjacent but deliberately deferred until this issue lands:
A worker can claim any `group_id` / `pp_rank` without authentication (per-group identity is operator-conventional, not enforced).
Evicted-but-alive workers' submissions are dropped at the registry boundary (post-#89, "diloco: per-rank workers with server-aware groups (pipeline parallel)"), but until then a determined attacker could spoof the worker_id of a recently-evicted member to inject pseudo-gradients into the next sync round.
The control-action endpoints (`/control/shutdown`, `/control/update_optimizer`, `/control/kick_worker`, etc.) accept unsigned requests.
The work-unit dispatch endpoint (`/work/request`) hands out work without checking the requester's identity.
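One possible way to close the identity gaps listed above is a per-job token that binds `worker_id`, `group_id`, and `pp_rank` together under an HMAC, so the server can reject a worker claiming a slot it was not issued. The sketch below is an assumption, not the project's design: the function names, the token layout, and the idea that the forgather_server and the DiLoCo server share `secret` are all hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def issue_membership_token(secret: bytes, worker_id: str, group_id: str,
                           pp_rank: int, ttl_s: int = 3600,
                           now: Optional[float] = None) -> str:
    """Scheduler side (hypothetical): sign the slot a worker may claim."""
    issued = time.time() if now is None else now
    claims = {"worker_id": worker_id, "group_id": group_id,
              "pp_rank": pp_rank, "exp": int(issued + ttl_s)}
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    sig = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{sig}"


def verify_membership_token(secret: bytes, token: str, worker_id: str,
                            group_id: str, pp_rank: int,
                            now: Optional[float] = None) -> bool:
    """Server side (hypothetical): accept only the exact slot that was signed."""
    payload_b64, sep, sig = token.rpartition(".")
    if not sep:
        return False
    expected = hmac.new(secret, payload_b64.encode(), hashlib.sha256).hexdigest()
    # Verify the signature before parsing anything the sender controls.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    current = time.time() if now is None else now
    return (claims["worker_id"] == worker_id
            and claims["group_id"] == group_id
            and claims["pp_rank"] == pp_rank
            and current < claims["exp"])
```

With something like this, registration and `/work/request` could require the token alongside the claimed identity, and a spoofed `worker_id` or `group_id` would fail verification. Revoking a token for an evicted worker before its expiry would still need separate server-side state.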
When this needs to land
Before any DiLoCo deployment that exposes the server beyond a single operator's trusted LAN — public cloud training, multi-tenant clusters, or cross-org collaborations.