DiLoCo: full security model (auth, mTLS, audit) — tracking #90

@jdinalt

Tracking issue

Security for the DiLoCo subsystem (server + worker + webui proxy) is out of scope for current PRs including #89 (pipeline groups). This issue tracks the eventual full-security PR so we don't lose the requirement.

Current security posture

The DiLoCo server (`src/forgather/ml/diloco/server.py`) assumes a trusted local network and performs no authentication on its HTTP endpoints. Anyone who can reach the bind address can:

  • Register / deregister workers (including spoofing worker_ids of legitimate workers)
  • Submit pseudo-gradients (poisoning the global model)
  • Trigger `/control/shutdown`, `/control/update_optimizer`, etc.
  • Read `/status`, `/info`, `/work/queues` (model fingerprints, training progress)

The webui's DiLoCo panel proxies these endpoints (`tools/forgather_server/routes/diloco.py`). The forgather_server itself has auth, but the connection between the proxy and the upstream DiLoCo server is unprotected, as is any direct request to the DiLoCo server.

What a future security PR needs to cover

  • Server-side auth. Bearer-token or mTLS on every DiLoCo HTTP endpoint. The webui proxy's existing per-server auth-token plumbing (`diloco_server_registry.auth_token`) is the entry point; the server side needs to actually verify it.
  • Worker-side credentials. `DiLoCoWorker` / `DiLoCoClient` need to plumb a bearer through to every request (register, submit, heartbeat, deregister, work-unit dispatch, control).
  • Group-membership trust. Currently any worker can claim any `group_id` / `pp_rank` (see #84, "DiLoCo + Pipeline Parallel: worker.start() fails on meta-device model"). A future trust model should bind group membership to an authenticated identity, e.g. a per-job token issued by the forgather_server when it schedules the workers.
  • Audit log. Outer-optimizer steps, control actions, and deregistrations should be loggable to a tamper-evident sink.
  • Webui surface. The DiLoCo registry's `auth_token` UI is already in place; the server needs to consume it.
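
As a rough illustration of the "tamper-evident sink" item above, one common approach is a hash chain, where each log entry commits to the hash of the entry before it. The sketch below is not taken from the forgather codebase; `AuditLog`, `GENESIS`, and the event fields are hypothetical names chosen for the example, and it assumes events are JSON-serializable dicts.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical fixed starting value for the chain


def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always produces the same hash.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log in which every entry commits to its predecessor."""

    def __init__(self) -> None:
        self.entries = []

    def append(self, event: dict) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS
        entry = {
            "event": event,
            "prev": prev_hash,
            "hash": _digest(prev_hash, event),
        }
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        # Recompute the chain from the start; any edited, removed, or
        # reordered entry breaks the link to the entries that follow it.
        prev_hash = GENESIS
        for entry in self.entries:
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != _digest(prev_hash, entry["event"]):
                return False
            prev_hash = entry["hash"]
        return True
```

A chain like this only detects tampering if the most recent hash is also stored somewhere the DiLoCo server cannot rewrite (for example, periodically shipped to the forgather_server); otherwise an attacker with write access could rebuild the whole chain.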

Threats explicitly out of scope today

PR #89 (pipeline groups) and prior DiLoCo PRs assume a trusted-LAN model. Items the reviews surfaced that are security-adjacent but deliberately deferred until this issue lands:

  • A worker can claim any `group_id` / `pp_rank` without authentication (per-group identity is operator-conventional, not enforced).
  • Evicted-but-alive workers' submissions are dropped at the registry boundary (post-#89, "diloco: per-rank workers with server-aware groups (pipeline parallel)"), but until then a determined attacker could spoof the worker_id of a recently evicted member to inject pseudo-gradients into the next sync round.
  • The control-action endpoints (`/control/shutdown`, `/control/update_optimizer`, `/control/kick_worker`, etc.) accept unsigned requests.
  • The work-unit dispatch endpoint (`/work/request`) hands out work without checking the requester's identity.
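
The identity gaps listed above (unauthenticated group claims, unsigned control requests, unchecked work requests) could in principle be narrowed with the per-job token idea mentioned earlier in this issue. The following is a minimal sketch under stated assumptions, not the project's design: it assumes the scheduler and the DiLoCo server share a per-job secret, and `issue_token`, `verify_token`, and `authorize` are hypothetical helpers that do not exist in forgather. Token expiry and revocation are deliberately left out.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def issue_token(secret: bytes, worker_id: str, group_id: str, pp_rank: int) -> str:
    """Scheduler side: sign the identity a worker is allowed to claim."""
    claims = {"worker_id": worker_id, "group_id": group_id, "pp_rank": pp_rank}
    body = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode("utf-8"))
    sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body.decode("ascii") + "." + sig


def verify_token(secret: bytes, token: str) -> Optional[dict]:
    """Server side: return the signed claims, or None if the token is invalid."""
    body, sep, sig = token.encode("utf-8").rpartition(b".")
    if not sep:
        return None
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest().encode("ascii")
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(sig, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(body))


def authorize(secret: bytes, authorization_header: str,
              worker_id: str, group_id: str, pp_rank: int) -> bool:
    """Accept a request only if its claimed identity matches the signed claims."""
    scheme, _, token = authorization_header.partition(" ")
    if scheme != "Bearer":
        return False
    claims = verify_token(secret, token)
    return claims == {"worker_id": worker_id, "group_id": group_id, "pp_rank": pp_rank}
```

With something along these lines, a worker would send `Authorization: Bearer <token>` on every request, and the server would reject a registration whose `group_id` / `pp_rank` differs from what the token was issued for. A symmetric HMAC secret is the simplest option; mTLS or asymmetric signatures would avoid having to share the secret with the server.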

When this needs to land

Before any DiLoCo deployment that exposes the server beyond a single operator's trusted LAN — public cloud training, multi-tenant clusters, or cross-org collaborations.
