DiLoCo: full security model (auth, mTLS, audit) — tracking #90

## Tracking issue

Security for the DiLoCo subsystem (server + worker + webui proxy) is **out of scope for current PRs** including #89 (pipeline groups). This issue tracks the eventual full-security PR so we don't lose the requirement.

## Current security posture

The DiLoCo server (`src/forgather/ml/diloco/server.py`) runs on a trusted local network with **no authentication** on its HTTP endpoints. Anyone who can reach the bind address can:

- Register / deregister workers (including spoofing worker_ids of legitimate workers)
- Submit pseudo-gradients (poisoning the global model)
- Trigger `/control/shutdown`, `/control/update_optimizer`, etc.
- Read `/status`, `/info`, `/work/queues` (model fingerprints, training progress)

The webui's DiLoCo panel proxies these endpoints (`tools/forgather_server/routes/diloco.py`). The forgather_server itself has auth, but the upstream DiLoCo server's own endpoints remain unprotected.

## What a future security PR needs to cover

- **Server-side auth.** Bearer-token or mTLS on every DiLoCo HTTP endpoint. The webui proxy's existing per-server auth-token plumbing (`diloco_server_registry.auth_token`) is the entry point; the server side needs to actually verify it.
- **Worker-side credentials.** `DiLoCoWorker` / `DiLoCoClient` need to plumb a bearer token through to every request (register, submit, heartbeat, deregister, work-unit dispatch, control).
- **Group-membership trust.** Currently any worker can claim any `group_id` / `pp_rank` (issue #84). A future trust model should bind group membership to an authenticated identity, e.g. a per-job token issued by the forgather_server when it schedules the workers.
- **Audit log.** Outer-optimizer steps, control actions, and deregistrations should be logged to a tamper-evident sink.
- **Webui surface.** The DiLoCo registry's `auth_token` UI is already in place; the server needs to consume it.
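
To make the audit-log item concrete, here is a minimal sketch of one way to get tamper evidence: a hash chain, where each entry commits to the hash of the previous one. Everything in it (`AuditLog`, `GENESIS`, the entry fields, the action names) is hypothetical and not part of the current codebase. It also keeps entries in memory, whereas a real sink would persist them and anchor the latest hash somewhere the server cannot rewrite.

```python
import hashlib
import json
import time

# Placeholder "previous hash" for the first entry (an assumption of this sketch).
GENESIS = "0" * 64


def _entry_hash(prev_hash: str, body: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON body."""
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


class AuditLog:
    """Hypothetical append-only log; each entry commits to its predecessor."""

    def __init__(self):
        self.entries = []

    def append(self, action, actor, detail=None, ts=None):
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS
        body = {
            "ts": time.time() if ts is None else ts,
            "action": action,
            "actor": actor,
            "detail": detail or {},
        }
        entry = {"prev": prev_hash, "body": body,
                 "hash": _entry_hash(prev_hash, body)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = GENESIS
        for entry in self.entries:
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != _entry_hash(prev_hash, entry["body"]):
                return False
            prev_hash = entry["hash"]
        return True


log = AuditLog()
log.append("outer_optimizer_step", actor="server", detail={"step": 1})
log.append("control/kick_worker", actor="operator", detail={"worker_id": "w3"})
chain_ok_before = log.verify()

# Rewriting history after the fact is detectable.
log.entries[0]["body"]["detail"]["step"] = 99
chain_ok_after = log.verify()
```

A chain like this only detects edits to earlier entries. An attacker who controls the whole log can still rebuild it from scratch, which is why the head hash would need to be published or stored out of band.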

## Threats explicitly out of scope today

PR #89 (pipeline groups) and prior DiLoCo PRs assume a trusted-LAN model. Items the reviews surfaced that are **security-adjacent but deliberately deferred** until this issue lands:

- A worker can claim any `group_id` / `pp_rank` without authentication (per-group identity is operator-conventional, not enforced).
- Evicted-but-alive workers' submissions are dropped at the registry boundary (post-#89), but until then a determined attacker could spoof the worker_id of a recently-evicted member to inject pseudo-gradients into the next sync round.
- The control-action endpoints (`/control/shutdown`, `/control/update_optimizer`, `/control/kick_worker`, etc.) accept unsigned requests.
- The work-unit dispatch endpoint (`/work/request`) hands out work without checking the requester's identity.
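
Several of the items above come down to the server trusting self-reported identity. As a rough illustration of the per-job token idea, the sketch below signs a worker's `worker_id`, `group_id`, and `pp_rank` with an HMAC so the server can reject mismatched claims. The function names, token layout, and shared-secret distribution are all assumptions made for illustration; nothing here reflects an existing forgather API, and a real design might prefer mTLS or an established token format.

```python
import base64
import hashlib
import hmac
import json
import time


def issue_job_token(secret: bytes, worker_id: str, group_id: str,
                    pp_rank: int, ttl_s: float = 3600.0, now=None) -> str:
    """Hypothetical issuer side: sign the identity claims for one worker."""
    now = time.time() if now is None else now
    claims = {"worker_id": worker_id, "group_id": group_id,
              "pp_rank": pp_rank, "exp": now + ttl_s}
    payload = base64.urlsafe_b64encode(
        json.dumps(claims, sort_keys=True).encode("utf-8"))
    sig = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + sig


def verify_job_token(secret: bytes, token: str, worker_id: str,
                     group_id: str, pp_rank: int, now=None) -> bool:
    """Hypothetical server side: check signature, expiry, and claimed identity."""
    now = time.time() if now is None else now
    payload_b64, sep, sig = token.partition(".")
    if not sep:
        return False
    expected = hmac.new(secret, payload_b64.encode("utf-8"),
                        hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode("utf-8"), sig.encode("utf-8")):
        return False
    # The payload is only parsed after its signature has been verified.
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return (claims["exp"] > now
            and claims["worker_id"] == worker_id
            and claims["group_id"] == group_id
            and claims["pp_rank"] == pp_rank)


secret = b"demo-only-secret"  # placeholder; a real secret would be provisioned securely
token = issue_job_token(secret, "worker-0", "group-a", 0)

accepted = verify_job_token(secret, token, "worker-0", "group-a", 0)
wrong_rank = verify_job_token(secret, token, "worker-0", "group-a", 1)
wrong_worker = verify_job_token(secret, token, "worker-9", "group-a", 0)
```

In this sketch the worker would send the token as a bearer credential on every request. The server would compare the signed claims against the `worker_id` / `group_id` / `pp_rank` in the request body, rather than trusting the body alone.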

## When this needs to land

Before any DiLoCo deployment that exposes the server beyond a single operator's trusted LAN — public cloud training, multi-tenant clusters, or cross-org collaborations.
