DiLoCo: full security model (auth, mTLS, audit) - closes #90 #93
Merged
The stdlib http.server / urllib analogues of the uvicorn / httpx helpers. Needed by DiLoCo (stdlib HTTP stack) for issue #90. Pure addition; no existing behavior changes. * ``stdlib_ssl_context(args, cfg)`` builds a server-side SSLContext ready for ``ctx.wrap_socket(sock, server_side=True)`` on ``http.server`` listening sockets. When a cluster CA bundle is present, configures ``CERT_OPTIONAL`` so the handshake validates any client cert that is presented (mTLS path). * ``urllib_ssl_context(cfg, verify=True)`` mirrors ``httpx_peer_kwargs``: builds one context carrying both the cluster CA bundle and this node's cert+key. ``verify=False`` returns an unverified context for SSH-tunneled remotes. Tests cover defaults, half-provisioned hosts, FileNotFoundError on missing cert/key, the verify=False opt-out path, and an end-to-end handshake between an ``http.server`` instance using the server context and a ``urllib.request`` call using the client context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
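A minimal sketch of what the two helpers described above might look like, using only the standard ``ssl`` module. The function names, parameters, and the idea of passing file paths directly are assumptions for illustration; the real ``stdlib_ssl_context`` / ``urllib_ssl_context`` take ``args`` / ``cfg`` objects whose shape is not shown in this PR.

```python
import ssl
from pathlib import Path
from typing import Optional


def sketch_server_context(cert: str, key: str,
                          ca_bundle: Optional[str] = None) -> ssl.SSLContext:
    """Server-side context for ctx.wrap_socket(sock, server_side=True).

    Hypothetical stand-in for stdlib_ssl_context; paths are passed directly.
    """
    for path in (cert, key):
        if not Path(path).is_file():
            raise FileNotFoundError(path)
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile=cert, keyfile=key)
    if ca_bundle is not None:
        ctx.load_verify_locations(cafile=ca_bundle)
        # Validate a client cert if one is presented, but do not require one.
        ctx.verify_mode = ssl.CERT_OPTIONAL
    return ctx


def sketch_client_context(ca_bundle: Optional[str] = None,
                          cert: Optional[str] = None,
                          key: Optional[str] = None,
                          verify: bool = True) -> ssl.SSLContext:
    """Client-side context suitable for urllib.request.urlopen(context=...)."""
    ctx = ssl.create_default_context(cafile=ca_bundle if verify else None)
    if not verify:
        # check_hostname must be cleared before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    if cert and key:
        # Present this node's cert so the server can take the mTLS path.
        ctx.load_cert_chain(certfile=cert, keyfile=key)
    return ctx
```

``CERT_OPTIONAL`` is the piece that lets one listener serve both cert-presenting peers and bearer-only clients.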
Issue #90. The DiLoCo HTTP wire ran cleartext + unauthenticated. Add the per-port bearer-token / TLS pattern dataset_server uses, but adapted to DiLoCo's stdlib http.server stack instead of FastAPI. * ``src/forgather/ml/diloco/auth.py`` (new): - ``add_auth_args`` / ``resolve_auth_token`` / ``format_auth_mode`` mirror dataset_server's CLI surface (--auth-token / --auth-token-file / --no-auth / --regen-token / --quiet-tokens). - Per-port token files at ``~/.config/forgather/diloco_server/<port>.token`` (mode 0600 in a 0700 dir). ``read_standalone_token`` does the loopback-only auto-discovery for clients. - ``verify_bearer(handler, expected)`` is a stdlib-friendly verifier for ``BaseHTTPRequestHandler``: constant-time compare, 401 + ``WWW-Authenticate: Bearer realm="forgather-diloco"`` on failure. * ``DiLoCoServer.__init__`` gains ``auth_token`` and ``ssl_context`` kwargs (both default None to keep existing in-process callers working). Request handler runs the bearer check at the top of every POST/GET dispatch; ``/health`` is added and intentionally exempt. * ``DiLoCoServer.run`` / ``start`` wrap the listening socket when an SSL context is configured; startup banner reports scheme + auth state. * ``forgather diloco server`` CLI: ``add_auth_args`` and ``add_server_tls_args`` plumbed in; ``_server_cmd`` resolves the token (persisting auto-generated/regenerated tokens), builds the SSL context via the new ``stdlib_ssl_context``, runs ``enforce_non_loopback_policy``, and passes both to DiLoCoServer. The ``weights_only=True`` torch.load hardening that #90 requires was already in place on both server and client (commits 19bcf83, 2e79aeb). Tests: 15 new in ``tests/unit/ml/diloco/test_server_auth.py`` cover missing/malformed/wrong-token 401s, /health open, no-auth mode, the verify_bearer unit, per-port token file mode 0600 + round-trip, and loopback-only auto-discovery. Full 245-test DiLoCo suite passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
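The bearer check described above could be sketched roughly as follows. This is an illustrative approximation, not the code in ``auth.py``: the helper names ``bearer_matches`` and ``check_bearer`` are hypothetical, and only the constant-time compare and the 401 + ``WWW-Authenticate`` response are taken from the commit message.

```python
import hmac
from http.server import BaseHTTPRequestHandler
from typing import Optional


def bearer_matches(header_value: Optional[str], expected: str) -> bool:
    """True only for a well-formed 'Bearer <token>' header with the right token."""
    if not header_value:
        return False
    scheme, _, presented = header_value.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return False
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(presented.strip().encode(), expected.encode())


def check_bearer(handler: BaseHTTPRequestHandler, expected: str) -> bool:
    """Return True when authorized; otherwise send a 401 and return False."""
    if bearer_matches(handler.headers.get("Authorization"), expected):
        return True
    handler.send_response(401)
    handler.send_header("WWW-Authenticate", 'Bearer realm="forgather-diloco"')
    handler.send_header("Content-Length", "0")
    handler.end_headers()
    return False
```

A handler would call ``check_bearer(self, token)`` at the top of ``do_GET`` / ``do_POST`` and return early on ``False``, leaving ``/health`` outside the check.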
Wires the server-side bearer/TLS infrastructure (commit 2) through to the client, worker, and DiLoCoCallback so the same auth surface works in-process and from the CLI. * ``DiLoCoClient.__init__`` gains ``token`` and ``verify_tls`` kwargs. When ``token=None`` the client falls back to (in order) ``FORGATHER_DILOCO_SERVER_TOKEN`` and the per-port loopback file — the locally-spawned case Just Works without extra plumbing. ``Authorization: Bearer <token>`` is attached to every request via a shared ``_headers`` helper; ``urlopen`` is invoked with ``context=self._ssl_ctx`` (a ``urllib_ssl_context`` for https URLs, None for cleartext). * ``DiLoCoWorker.__init__`` accepts and forwards ``auth_token`` / ``verify_tls`` to its embedded client. * ``DiLoCoCallback.__init__`` adds matching kwargs and a ``DILOCO_VERIFY_TLS`` env var fallback; the ``/status`` reachability probe at on_train_begin uses the same credentials. * ``forgather diloco status`` CLI grows ``--auth-token`` and ``--no-verify-tls`` flags so the operator can talk to a protected remote without environmental coupling. Tests: 5 new in ``test_client_auth.py`` exercise the full server↔client round-trip — matching token, missing token (401), loopback file discovery, env-var override, and the no-auth backwards-compat path. Existing ``test_diloco_callback.py::TestWorkerLifecycle`` assertions extended to cover the two new kwargs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
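The client-side fallback order (explicit token, then ``FORGATHER_DILOCO_SERVER_TOKEN``, then the per-port loopback file) might look like this in outline. The function name, its parameters, and the set of loopback hostnames are assumptions; only the ordering and the ``<port>.token`` file naming come from the commit messages.

```python
import os
from pathlib import Path
from typing import Optional

ENV_VAR = "FORGATHER_DILOCO_SERVER_TOKEN"
LOOPBACK_HOSTS = {"127.0.0.1", "localhost", "::1"}  # assumed loopback set


def resolve_client_token(explicit: Optional[str], host: str, port: int,
                         token_dir: str) -> Optional[str]:
    """Pick a bearer token: explicit kwarg, then env var, then loopback file."""
    if explicit:
        return explicit
    env = os.environ.get(ENV_VAR)
    if env:
        return env
    if host in LOOPBACK_HOSTS:
        # Only a locally spawned server can have written this file.
        try:
            return (Path(token_dir) / f"{port}.token").read_text().strip() or None
        except OSError:
            return None
    return None
```

Restricting file discovery to loopback hosts keeps a token minted for a local server from being sent to an unrelated remote that happens to use the same port.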
When the DiLoCo server is launched with a stdlib SSL context that loaded the cluster CA bundle (verify_mode = ssl.CERT_OPTIONAL), a client that presents a CA-signed cert at the TLS handshake is treated as cluster-authenticated and the bearer-token check is skipped. Same model as ``_PEER_ALLOWED_PATHS`` in ``tools/forgather_server/auth.py``. * New ``peer_cert_authenticated(handler)`` predicate reads ``handler.connection.getpeercert()``: a non-empty dict means the handshake validated a peer cert against the configured CA. Defensive defaults for cleartext sockets, missing ``connection``, and ``OSError`` from a torn-down socket. * New ``authenticate_request(handler, expected_token)`` is the combined entry point: mTLS first, then bearer fallback. The server's ``_authenticated`` helper switches to it. This means inter-cluster peer calls (forgather_server proxy → DiLoCo, DiLoCo → another forgather_server, etc.) don't have to share each server's per-port bearer — presenting the cluster cert is enough. Mirrors the inference_server / forgather_server pattern. Tests: 11 new in ``test_server_mtls.py``. End-to-end: a TLS server with a real provisioned cluster CA accepts a cluster-cert client without a bearer (200), rejects no-cert + no-bearer (401), and still serves no-cert + valid-bearer (200). Unit tests cover the peer-cert predicate against cleartext, missing-conn, empty-cert, populated-cert, and ``OSError`` edge cases, plus the combined authenticate_request resolution order. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
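As a rough illustration of the resolution order (mTLS first, then bearer), the two predicates could be approximated as below. The names ``peer_cert_present`` and ``authenticate`` are hypothetical, treating a ``None`` token as no-auth mode is an assumption, and catching ``ValueError`` alongside ``OSError`` is an extra precaution not mentioned in the commit.

```python
import hmac
from typing import Any, Optional


def peer_cert_present(handler: Any) -> bool:
    """True when the TLS handshake validated a client cert against the CA."""
    conn = getattr(handler, "connection", None)
    getpeercert = getattr(conn, "getpeercert", None)  # absent on cleartext sockets
    if getpeercert is None:
        return False
    try:
        # A non-empty dict means a cert was presented and verified.
        return bool(getpeercert())
    except (OSError, ValueError):
        return False


def authenticate(handler: Any, expected_token: Optional[str]) -> bool:
    """Combined check: a cluster cert skips the bearer; otherwise compare tokens."""
    if peer_cert_present(handler):
        return True
    if not expected_token:
        return True  # assumed: no token configured means no-auth mode
    header = handler.headers.get("Authorization", "")
    scheme, _, presented = header.partition(" ")
    return scheme.lower() == "bearer" and hmac.compare_digest(
        presented.encode(), expected_token.encode()
    )
```

Note that this relies on the server context using ``CERT_OPTIONAL`` with the cluster CA loaded; with ``CERT_NONE``, ``getpeercert()`` returns an empty dict even when the client presents a cert.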
Issue #90. Operators want the option to disable TLS+auth for bulk data transport (pseudo-gradients, model weights) on a trusted LAN where throughput is the bottleneck, while keeping registration, heartbeat, control, and status fully protected on the control port. The split also lets us cleanly bind the bulk plane to a different host/interface than the control plane in future work. * ``DiLoCoServer.__init__`` gains ``bulk_port``, ``bulk_ssl_context``, and ``bulk_auth_enabled``. When ``bulk_port`` is set, a second ``ThreadingHTTPServer`` runs in its own daemon thread with a ``role="bulk"`` handler that only serves the three bulk endpoints (and ``/health``). * The control-port handler refuses bulk paths with 404 + ``X-Forgather-Bulk-Url`` so misrouted clients can self-correct. Avoids two ways into the bulk plane (slow-but-secure vs fast-but-cleartext) which would let an attacker pick whichever was convenient. * ``/register`` advertises the bulk URL via the same response header. ``DiLoCoClient.register`` captures it; subsequent ``submit_pseudograd`` / ``submit_fragment_pseudograd`` / ``global_params`` calls route to the bulk URL automatically via ``_base_for_path``. Per-URL SSL-context selection (``_ssl_for_request``) lets the control URL be https and the bulk URL be http on the same client without crossed wires. * CLI: ``--bulk-port N``, plus ``--bulk-tls`` / ``--no-bulk-tls`` and ``--bulk-auth`` / ``--no-bulk-auth`` mutex groups. Defaults when ``--bulk-port`` is set: cleartext, no-auth (the user's stated "torch.distributed-equivalent" posture). RCE protection is independent of these knobs — every tensor blob deserializes via ``torch.load(..., weights_only=True)``, so an attacker on the open bulk plane can disrupt training but cannot take over the host. That guarantee was already in place from prior commits; this one preserves it across the new listener. 
Tests: 8 new in ``test_server_bulk_port.py`` cover get_bulk_url, control-port 404 + hint header, bulk-port serves /global_params without bearer, bulk-port refuses control endpoints, control still requires bearer, /register response carries X-Forgather-Bulk-Url, and end-to-end client routing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
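The two-port routing above can be sketched in a few lines. The endpoint paths for the two submit calls are assumptions (the commit names the client methods and ``/global_params`` only), and the helper names are illustrative rather than the actual ``_base_for_path`` implementation.

```python
from typing import Dict, Optional, Tuple

# Assumed endpoint paths; only /global_params is named explicitly in the PR.
BULK_PATHS = ("/submit_pseudograd", "/submit_fragment_pseudograd", "/global_params")


def is_bulk_path(path: str) -> bool:
    """Match on the path alone, ignoring any query string."""
    return path.split("?", 1)[0] in BULK_PATHS


def base_for_path(path: str, control_url: str, bulk_url: Optional[str]) -> str:
    """Client side: bulk endpoints go to the bulk URL once one was advertised."""
    if bulk_url and is_bulk_path(path):
        return bulk_url
    return control_url


def control_port_response(path: str,
                          bulk_url: Optional[str]) -> Optional[Tuple[int, Dict[str, str]]]:
    """Server side: refuse bulk paths on the control port with a routing hint.

    Returns (status, headers), or None when the request should dispatch normally.
    """
    if bulk_url and is_bulk_path(path):
        return 404, {"X-Forgather-Bulk-Url": bulk_url}
    return None
```

Because the control port never serves bulk paths once a bulk listener exists, there is exactly one way into the bulk plane.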
Append-only JSONL records at ``<output_dir>/diloco_audit.log`` for events worth reconstructing after the fact. Best-effort: write failures (disk full, permissions, missing dir) log a warning and keep the request going — the audit log is a record, not a guard. Instrumented sites: * ``register`` — worker_id, hostname, group_id, pp_rank, pp_world_size, num_registered. * ``deregister`` — worker_id (followed by an ``eviction`` record because deregister goes through _handle_worker_death). * ``eviction`` — trigger_worker_id, evicted list (group-aware), group_id, remaining workers. * ``outer_step`` — sync_round, contributors, missing_contributors. * ``control`` — action and the JSON payload of every ``/control/*`` call. No per-caller identity yet (phase 1 = job-level only; a future PR adding mTLS subject-bound identity will populate it). Records carry a UTC ISO-8601 timestamp (``+00:00`` suffix so plain sort works) and JSON-encode any non-stringable values via ``default=str``. Tokens are never logged — the regression guard in ``test_token_is_never_logged`` watches that. Tests: 7 new in ``test_audit_log.py`` cover register/deregister/ control records, ISO timestamp, no-token-in-log, write-failure graceful degradation, and the empty-output_dir no-op. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
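A best-effort JSONL writer in the spirit of the description above might look like this. The record field names (``ts``, ``event``, ``data``) and the function name are assumptions; the UTC ISO-8601 timestamp, ``default=str``, and the warn-and-continue behavior follow the commit message.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("diloco.audit.sketch")


def append_audit_record(log_path: str, event: str, **data) -> bool:
    """Append one JSON line; never raise on write failure."""
    record = {
        # isoformat() on an aware UTC datetime ends in "+00:00", so plain
        # string sort orders records chronologically.
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "data": data,
    }
    try:
        with open(log_path, "a", encoding="utf-8") as fh:
            # default=str stringifies anything json can't encode natively.
            fh.write(json.dumps(record, default=str) + "\n")
        return True
    except OSError as exc:
        # The audit log is a record, not a guard: warn and keep serving.
        logger.warning("audit write failed: %s", exc)
        return False
```

Callers can ignore the return value; it exists here mainly so the failure path is easy to test.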
Wires the dataset_server-style local-spawn flow for DiLoCo (issue #90). When the webui spawns a DiLoCo server through the queue, the scheduler now: 1. Calls ``_resolve_diloco_server_token(port, regen)`` — a new helper mirroring ``_resolve_dataset_server_token``. Reads the per-port persisted file if non-empty; mints + writes a fresh ``secrets.token_hex(32)`` otherwise. ``regen=True`` rotates. 2. Persists ``auth_token`` on the resulting ``JobRecord`` so the webui proxy can find it by job (next commit's wiring). 3. Builds the spawn command with ``--auth-token-file <per-port>`` so the token never touches argv — the spawned process reads it from the same file the standalone CLI does. ``build_diloco_server_command`` and ``spawn_diloco_server_process`` gain matching kwargs: ``auth_token_file``, ``no_auth``, ``bulk_port``, ``bulk_tls``, ``bulk_auth``. The bulk-port knobs let operators configure the two-port bulk plane through the same job spec the queue already accepts. URL-scheme stamping (the cosmetic ``scheme`` field used by Job cards) is extended to ``diloco_server`` jobs so https/http renders correctly when the operator provisioned TLS. Tests: 8 new in ``test_scheduler_diloco_server_token.py``: mint+ persist contract, reuse across calls, regen rotate, empty-file treated as missing, plus CLI-builder assertions confirming ``--auth-token-file`` / ``--no-auth`` / ``--bulk-port*`` surface correctly. Full 665-test forgather_server suite green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
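The mint-or-reuse contract in step 1 can be illustrated with a small sketch. The function name and the ``token_dir`` parameter are hypothetical (the real helper takes ``port`` and ``regen`` only and uses the per-port path under ``~/.config/forgather/diloco_server/``); the 0700/0600 modes and ``secrets.token_hex(32)`` come from the commit messages. It assumes a POSIX filesystem.

```python
import os
import secrets
from pathlib import Path


def resolve_server_token(token_dir: str, port: int, regen: bool = False) -> str:
    """Reuse the persisted per-port token, or mint and persist a fresh one."""
    directory = Path(token_dir)
    directory.mkdir(mode=0o700, parents=True, exist_ok=True)
    path = directory / f"{port}.token"
    if not regen and path.is_file():
        existing = path.read_text().strip()
        if existing:  # an empty file is treated as missing
            return existing
    token = secrets.token_hex(32)
    # Create with 0600 from the start so the token is never world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write(token + "\n")
    os.chmod(path, 0o600)  # tighten the mode if the file already existed
    return token
```

The spawned server is then pointed at the same file via ``--auth-token-file``, which keeps the secret out of the process argument list.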
Issue #90. The DiLoCo registry's ``auth_token`` / ``verify_tls`` fields existed but were inert — the proxy used hardcoded ``verify=True`` and never attached an Authorization header. Wire both ends now. * ``routes/diloco.py``: - New ``_token_for_local(base)`` walks running diloco_server JobRecords and returns the persisted bearer for the matching port — same shape as dataset_server's helper. - New ``_auth_headers_for(base, request)`` applies the standard precedence: ``X-Diloco-Auth-Token`` override header → JobRecord auto-lookup → registry's ``find_token`` → empty. - New ``_verify_for(target, base)`` honors per-registry ``verify_tls=False`` and otherwise defers to ``forgather.tls.httpx_verify_for_url``. - All proxy callers (status/info/work-queues/work-queue/control) now thread ``request`` through and attach headers + verify accordingly. - Module docstring rewritten to describe the now-active auth surface; ``AddRegistryEntryRequest`` doc no longer claims the fields are ignored. * ``webui/src/components/DiLoCoPanel.tsx``: - ``AddExternalServerForm`` gains a masked ``auth_token`` input and a ``verify_tls`` checkbox (defaults to checked). - Registry rows display a 🔒 indicator when ``has_auth_token`` is true, so operators can see at a glance which entries are protected. - API surface unchanged — ``api.addDiLoCoRegistryEntry`` already accepts ``auth_token`` and ``verify_tls``. Tests: 6 new in ``test_routes_diloco_auth.py`` cover the override header precedence, JobRecord-vs-registry fallback, no-auth (no Authorization header sent), ``verify_tls=False`` propagation, and the control endpoint attaching bearer the same way GETs do. Existing 21-test ``test_routes_diloco.py`` suite stays green. WebUI builds clean (``npm run build`` → tsc + vite). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
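The header precedence described for ``_auth_headers_for`` reduces to a short chain of fallbacks. The sketch below passes the two lookups in as callables, which is an assumption made to keep it self-contained; the real function takes ``base`` and ``request``.

```python
from typing import Callable, Dict, Mapping, Optional

TokenLookup = Callable[[str], Optional[str]]


def auth_headers_for(base: str,
                     request_headers: Mapping[str, str],
                     job_token_lookup: TokenLookup,
                     registry_token_lookup: TokenLookup) -> Dict[str, str]:
    """Override header, then JobRecord lookup, then registry, then nothing."""
    token = (
        request_headers.get("X-Diloco-Auth-Token")
        or job_token_lookup(base)
        or registry_token_lookup(base)
    )
    # No token at all means no Authorization header is sent (no-auth upstream).
    return {"Authorization": f"Bearer {token}"} if token else {}
```

Using ``or`` keeps the later lookups lazy, so the registry is only consulted when the earlier sources come up empty.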
* ``docs/operations/tls.md``: new "DiLoCo server" subsection covering the standalone CLI, per-port bearer token discovery, mTLS skip-bearer path, the ``--bulk-port`` two-port plane and its trusted-LAN defaults, the trade-off between throughput and security, and the audit log location. * ``docs/design/diloco-security.md`` (new): full design doc — control vs bulk plane, threat model, identity binding (job-level only in phase 1, mTLS subject-bound deferred), audit log format, spawn flow, wire-format additions, and the test surface. * ``docs/design/diloco-pipeline-groups.md``: cross-link to the new security doc. * ``tools/forgather_server/README.md``: new "DiLoCo server" section under the proxy threat-model area documenting the auth model, control-vs-bulk plane, and where to find the design notes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address findings from the three-agent review pass on PR #93. No correctness blockers; this commit folds in the recommended fixes.

Security review (L1, L2):
* Run ``_authenticated`` before ``_bulk_offloaded`` in do_POST / do_GET so unauthenticated callers don't learn the bulk-listener topology from the 404 hint.
* Add ``_CONTROL_AUDIT_FIELDS`` per-action allowlist and route ``_audit("control", ...)`` payloads through ``_audit_control_data``. Today's actions only carry intent metadata; the allowlist is the forward-compat guardrail against a future control endpoint that ships secret material. Two regression tests pin it.

Architecture review:
* On reconnect, clear ``DiLoCoClient.bulk_url`` when /register's X-Forgather-Bulk-Url header is absent. A server that drops or reshapes its bulk listener no longer leaves clients dialing a dead URL.
* Add ``fragment_outer_step`` audit event at both fragment-apply call sites (the submit path and the eviction-triggered path). ``docs/design/diloco-security.md`` updated.
* Fix docstring drift: ``X-Forgather-Bulk-Port`` → ``X-Forgather-Bulk-Url`` in two server.py comments (the actual header was always correct).

Code-quality review:
* F1: warn when ``DiLoCoServer(auth_token="")`` is constructed explicitly with the empty-string token (silent auth-disable was too quiet for an obvious misconfiguration).
* F3: add ``_log_auth_failure`` and call it on every 401 path. ``auth.py``'s ``logger`` was imported but unused, leaving 401s invisible in operator logs. Now they land at INFO with the client IP + path, no token leakage.
* F4: explicit ``self._bulk_ssl_ctx = None`` initialisation in ``DiLoCoClient.__init__``; replaces the getattr-lazy pattern.
* T4: validate the advertised bulk URL's scheme on intake; ignore anything that's not http/https with a WARNING log. Defense in depth against a misconfigured proxy or compromised server.

New tests (4):
* ``test_audit_log.py``: control-payload redaction + unknown-action empty-data fallback.
* ``test_server_bulk_port.py``: stray bearer on no-auth bulk port is ignored, not rejected.
* ``test_server_mtls.py``: full mTLS control + cleartext bulk matrix (the recommended production posture per the design doc).

Documentation:
* ``docs/operations/tls.md``: new "Migrating an existing no-auth deployment" subsection covering the loopback path, cross-host paths, and the deliberate stay-on-no-auth path.
* ``docs/design/diloco-security.md``: audit-log table extended with ``fragment_outer_step`` and a note on the control allowlist.

All 350 security-touching tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
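Two of the fixes above (clearing a stale bulk URL when the header is absent, and T4's scheme validation) could be combined into one intake function along these lines. The function name is hypothetical and the exact log wording is an assumption.

```python
import logging
from typing import Mapping, Optional
from urllib.parse import urlsplit

logger = logging.getLogger("diloco.client.sketch")


def bulk_url_from_register(headers: Mapping[str, str]) -> Optional[str]:
    """Return a validated bulk URL from a /register response, else None.

    The caller assigns the result unconditionally, so an absent or invalid
    header clears any bulk URL remembered from an earlier registration.
    """
    raw = headers.get("X-Forgather-Bulk-Url")
    if not raw:
        return None
    parts = urlsplit(raw)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        logger.warning("ignoring bulk URL with unsupported scheme: %r", raw)
        return None
    return raw.rstrip("/")
```

On reconnect the client would then do something like ``self.bulk_url = bulk_url_from_register(response_headers)``.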
Review response (commit a330f67)
Three review agents (security, architecture, code-quality) returned. Verdicts: clean for merge with no Critical/High/Medium findings. Folded the recommended fixes into a single review-response commit on top.
Security (L1, L2)
Architecture
Code quality
New tests (4 added; 350 total in security suite)
Docs
Deferred to follow-up (non-blocking)
Happy to file these as a tracking issue if you'd prefer them not get lost.
🤖 Generated with Claude Code
Three bugs in the operator-facing surface I missed when wiring up the DiLoCo security model:

1. **Scheme mismatch on local URLs.** ``_local_servers`` and ``_ever_local_base_urls`` in ``routes/diloco.py`` hardcoded ``http://``, even though the scheduler now stamps the actual scheme on the JobRecord (commit cd7aa8e). Result: a TLS-enabled server appeared as ``http://...`` in the Job card AND the proxy spoke HTTP at the TLS socket, producing ``502 Bad Gateway: ReadError`` on every status poll. Fix: respect ``job_params["scheme"]``, falling back to http for pre-stamping records.

2. **DiLoCoServerModal had no security fields.** The spawn modal exposed every other DiLoCo knob but none of the new ``--no-auth`` / ``--regen-token`` / ``--quiet-tokens`` / ``--bulk-port`` / ``--bulk-tls`` / ``--bulk-auth`` flags. Through the UI, operators couldn't disable auth, rotate the persisted token, suppress the launch banner for shared TTYs, or split off the bulk plane. The fix mirrors ``DatasetServerModal``'s auth layout: ``--no-auth`` at the top of a "Security (auth + bulk plane)" fieldset, ``--regen-token`` / ``--quiet-tokens`` underneath (disabled when ``--no-auth`` is set), then the bulk-port number input with the ``--bulk-tls`` / ``--bulk-auth`` checkboxes revealed when bulk_port > 0. ``PersistedAdHoc`` / edit-mode seed / state hooks / ``buildArgs`` / ``persistCurrent`` all extended to round-trip the new fields. ``quiet_tokens`` plumbed through ``build_diloco_server_command``, ``spawn_diloco_server_process``, and the scheduler's ``_build_diloco_server`` so the modal toggle reaches the actual spawn argv. ``--regen-token`` doesn't need a CLI plumb: the effect already lands via ``_resolve_diloco_server_token`` reading ``regen_token`` from ``job_params``.

3. **DiLoCo Job card didn't show the token (or honor demo mode).** The inference and dataset_server cards show ``token:`` for authenticated servers, ``auth: hidden (demo mode)`` when the webui is in demo mode, and ``auth: --no-auth`` when bearer is off. DiLoCo's card showed none of these. Added the same pattern (token redaction in demo mode is the user's stated requirement), plus a ``bulk:`` row showing the bulk port + TLS/auth posture when a two-port topology is configured.

Tests:
* New ``test_https_scheme_stamped_by_scheduler_is_respected``: pins the scheme fix as a regression guard.
* New ``test_missing_scheme_falls_back_to_http``: the pre-stamping fallback path.

WebUI builds clean (``npm run build`` → tsc + vite). All 35 affected backend tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When DiLoCo is spawned with the default ``--host 127.0.0.1`` binding, the proxy's existing loopback-only ``_token_for_local`` matches. When the operator binds to ``0.0.0.0`` or a specific LAN IP — or just browses the webui from a different host — the proxy builds the URL from the JobRecord's stamped ``routable_host`` (e.g. ``https://192.168.9.43:8512``), but ``_token_for_local`` rejected anything non-loopback. The bearer didn't attach; the upstream returned 401; the webui's blanket-401-means-reauth path bounced the operator to the login screen. Extend the matcher to accept three independent signals, *any* of which proves the URL points at one of our own JobRecords: 1. Loopback URL against a loopback or ``0.0.0.0`` bind (original). 2. URL hostname equals the record's stamped ``routable_host`` — the proxy synthesized this URL itself from the JobRecord, so trusting it for token lookup is consistent with trusting it for display. 3. URL hostname equals the record's explicit bind ``host`` (e.g. operator typed ``--host 10.0.0.5``). Terminated records (``status not in {starting, running}``) still return None — a just-died job can't keep handing out its token. Tests: 6 new in ``test_routes_diloco_auth.py``: * loopback + loopback bind * loopback + 0.0.0.0 bind * LAN URL + routable_host (regression for the user-reported bug) * LAN URL + explicit bind host * unrelated LAN host / wrong port returns None * terminated record returns None The auth-bounce symptom (proxy 401 → webui login bounce) is filed separately as issue #94 — even with this fix, an upstream that legitimately 401s for a different reason should land as an inline error, not a session-expired bounce. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
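The three-signal matcher can be sketched against a stand-in record type. ``JobRecordSketch`` and its field layout are assumptions (the real ``JobRecord`` carries more, and some values live in ``job_params``), as is the set of loopback hostnames; the three signals and the active-status filter follow the description above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional
from urllib.parse import urlsplit

LOOPBACK = {"127.0.0.1", "localhost", "::1"}  # assumed loopback set
ACTIVE = {"starting", "running"}


@dataclass
class JobRecordSketch:
    """Hypothetical stand-in for the fields the matcher reads."""
    port: int
    host: str
    status: str
    auth_token: Optional[str] = None
    routable_host: Optional[str] = None


def token_for_local(base_url: str,
                    records: Iterable[JobRecordSketch]) -> Optional[str]:
    """Return the bearer for a URL that provably points at one of our jobs."""
    parts = urlsplit(base_url)
    url_host, url_port = parts.hostname, parts.port
    for rec in records:
        # Terminated jobs never hand out their token.
        if rec.status not in ACTIVE or rec.port != url_port:
            continue
        loopback = url_host in LOOPBACK and (rec.host in LOOPBACK or rec.host == "0.0.0.0")
        routable = rec.routable_host is not None and url_host == rec.routable_host
        explicit_bind = url_host == rec.host
        if loopback or routable or explicit_bind:
            return rec.auth_token
    return None
```

Any one signal is sufficient; an unrelated LAN host on the same port matches none of them and gets no token.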
Closes #94. Two surgical changes to break the login-bounce loop and get visibility into the bearer-attachment path: 1. **X-Upstream-Auth-Failed tagging.** The webui's fetch wrapper (``webui/src/auth.ts:82-83``) already suppresses the ``AUTH_REQUIRED_EVENT`` when a response carries ``X-Upstream-Auth-Failed: 1``. Inference, dataset_server, and cluster proxies all set this header on upstream 401/403. The DiLoCo proxy (added in PR #93) didn't, so every upstream 401 looked like a session-expired event and bounced the operator to the login screen — making the panel unusable while we debug the underlying token-attachment issue. Now ``_proxy_get`` and ``proxy_control`` both stamp the tag via the existing ``_upstream_auth_headers`` pattern. The operator's session stays intact; the panel surfaces the 401 inline. 2. **Diagnostic logging in _token_for_local.** When the matcher finds a port-matching JobRecord but can't reconcile the URL's hostname against any of the record's host / routable_host fields, we now log at INFO showing exactly what each side looked like and whether the record carried an auth_token. The token itself is never logged. Lets an operator hitting the bounce on a non-loopback URL inspect the TTY and see whether the JobRecord is missing routable_host, missing auth_token, or if the hostnames truly don't match. 3. **Bearer-attached-or-not log on upstream 401/403** in ``_proxy_get`` so even when ``_token_for_local`` matches but the bearer is wrong, the operator sees which side of the issue they have. Tests: 2 new in ``test_routes_diloco_auth.py``: * Upstream 401 surfaces with ``x-upstream-auth-failed: 1``. * Upstream 200 does NOT carry the header (the negative case). Together with f4ecef1 (the routable_host matcher fix), this should: (a) stop the immediate bounce loop, (b) give the operator clear log signal on why bearer attachment failed if it still does. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
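The tagging contract between proxy and webui is small enough to sketch in full. Both function names are hypothetical, and the second function is a Python rendering of what the TypeScript fetch wrapper is described as doing, not a port of its code.

```python
from typing import Dict, Mapping


def upstream_auth_headers(upstream_status: int) -> Dict[str, str]:
    """Proxy side: mark 401/403 responses that originated upstream."""
    if upstream_status in (401, 403):
        return {"X-Upstream-Auth-Failed": "1"}
    return {}


def should_prompt_login(status: int, response_headers: Mapping[str, str]) -> bool:
    """Client side: only an untagged 401 means the webui session expired."""
    if status != 401:
        return False
    return response_headers.get("X-Upstream-Auth-Failed") != "1"
```

With the tag in place, an upstream DiLoCo 401 is shown inline in the panel instead of bouncing the operator to the login screen.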
The worker dialed http:// at a TLS-wrapped DiLoCo server and the
TLS layer slammed the connection with RST, surfacing as
``ConnectionResetError: [Errno 104] Connection reset by peer`` at
the ``/status`` probe in ``DiLoCoCallback.on_train_begin``. Two
contributing bugs:
1. **SubmitModal stripped the scheme.** ``buildDiLoCoPayload`` in
``SubmitModal.tsx`` did:
const serverAddr = s.base.replace(/^https?:\/\//, "").replace(/\/$/, "");
based on a stale assumption that the callback wanted bare
``host:port``. The callback just passes the value through to
``DiLoCoClient``, which handles both forms — and which can't
recover the scheme once it's gone. Fix: trim only the trailing
slash; preserve the scheme.
2. **DiLoCoClient defaulted bare host:port to http://.** Legacy
callers (the ``forgather diloco worker --server`` CLI) and any
path that hands a bare ``host:port`` were locked into HTTP.
Now: when no scheme is present, call
``forgather.tls.client_scheme()`` — the same helper the
scheduler uses to stamp the JobRecord — so the worker picks
``https://`` when TLS is locally provisioned and ``http://``
otherwise.
The two fixes are independent guardrails: even if the webui regresses
back to scheme-stripping, the client-side scheme picker keeps the
worker pointed at the right scheme on any host where TLS is
provisioned. Likewise, an explicit ``https://`` from the webui
overrides whatever the local TLS posture happens to be.
Tests: 2 new in ``test_client_auth.py``:
* Bare host:port picks scheme from client_scheme() (covers both
TLS-provisioned and not).
* Explicit ``http://`` / ``https://`` passes through unchanged.
WebUI builds clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
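The client-side half of the fix amounts to a small normalization step. In this sketch ``default_scheme`` stands in for the result of ``forgather.tls.client_scheme()``, whose implementation is not shown in this PR; the function name is illustrative.

```python
def normalize_server_url(addr: str, default_scheme: str) -> str:
    """Keep an explicit scheme; otherwise prepend the locally chosen one.

    default_scheme is an assumed stand-in for forgather.tls.client_scheme().
    """
    addr = addr.rstrip("/")
    if addr.startswith(("http://", "https://")):
        return addr  # an explicit scheme always wins
    return f"{default_scheme}://{addr}"
```

For example, ``normalize_server_url("10.0.0.5:8512", "https")`` yields ``https://10.0.0.5:8512``, while an explicit ``http://`` address passes through unchanged even on a TLS-provisioned host.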
This was referenced May 29, 2026
Review pass on PR #93 surfaced several issues; fixes here, ordered by severity:

High
- cli/diloco: reject --bulk-auth on a cleartext bulk listener. The bulk and control listeners share one bearer, so requiring it over cleartext would leak the control-plane credential to a LAN sniffer. Error out unless --bulk-tls (or run --no-bulk-auth).
- forgather_server/scheduler: inject FORGATHER_DILOCO_SERVER_TOKEN into the training worker's env when the DiLoCo URL is routable/non-loopback. Loopback per-port-file auto-discovery can't fire for a 0.0.0.0-bound server's stamped routable URL, so the worker 401'd. Matched by JobRecord host/port; operator-set token still wins.
- diloco/server: get_bulk_url() no longer advertises a wildcard bind host. A 0.0.0.0-bound server would hand remote workers an unroutable http://0.0.0.0:<port>; now it uses the worker's Host header, falling back to loopback.

Medium
- cli/diloco + server: resolve an ephemeral --port 0 to the concrete bound port before the token file is written/banner printed, so loopback token discovery matches the real port. _find_available_port made static.
- diloco/client: bare host:port scheme inference is now loopback-aware (loopback -> http; routable -> client_scheme()), and routable guesses carry a fail-loud hint on connection failure instead of a bare RST. This also fixes a pre-existing test regression from 72c3225: on a TLS-provisioned dev machine every localhost test client was dialing https against cleartext servers.
- diloco/server: audit-log writes never block the sync barrier. Barrier paths accumulate records and flush via _audit_many after releasing _sync_cond; a persistent append handle replaces per-record open/close.
- tls/runtime: urllib_ssl_context(verify=False) still loads the node's client cert, so --no-verify-tls doesn't silently disable the mTLS skip-bearer path.

Low
- routes/diloco: locally-spawned servers report has_auth_token (the lock indicator was inverted vs registered remotes).
- diloco/client: _ssl_for_request matches on the origin boundary, not a bare string prefix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
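The last Low item is easiest to see with an example: a string-prefix test treats ``http://h:8513/x`` as belonging to base ``http://h:85``, whereas comparing parsed origins does not. The helper names below are illustrative and not the actual ``_ssl_for_request`` code.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

_DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> Tuple[str, str, Optional[int]]:
    """(scheme, host, port) with the scheme's default port filled in."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    return scheme, (parts.hostname or "").lower(), parts.port or _DEFAULT_PORTS.get(scheme)


def same_origin(request_url: str, base_url: str) -> bool:
    """True only when scheme, host, and port all match exactly."""
    return origin(request_url) == origin(base_url)
```

A client holding both a control and a bulk base URL can use ``same_origin`` to decide which SSL context applies to a given request.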
The DiLoCo section was only open when a server was selected or a restore error was present, so a silent reset-to-None (prior selection no longer in the server list) collapsed the section and hid the fact that the job would run as vanilla training. Default it open like the model/dataset sections so the current state is always visible at a glance; still user-collapsible. Addresses the "folded by default" half of #95. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Closes #90. Lands the bearer + TLS + mTLS + audit layer for the DiLoCo subsystem (server, client, worker, callback, scheduler, webui proxy, webui form). Mirrors the dataset_server / inference_server pattern, adapted to DiLoCo's stdlib ``http.server`` stack.

9 logical commits, all individually green. Single bundled PR per the agreed phasing.
What it does
* ``~/.config/forgather/diloco_server/<port>.token``. Standard ``--auth-token`` / ``--auth-token-file`` / ``--no-auth`` / ``--regen-token`` / ``--quiet-tokens`` CLI surface.
* ``forgather.tls`` stack, with two new stdlib-flavored helpers (``stdlib_ssl_context``, ``urllib_ssl_context``) that don't depend on uvicorn/httpx.
* ``_PEER_ALLOWED_PATHS`` in ``forgather_server``.
* (``--bulk-port``): pseudo-gradients + ``/global_params`` move to a separate listener with cleartext + no-auth defaults (matching ``torch.distributed``'s LAN posture). Control port keeps full security.
* ``weights_only=True`` torch.load means even an open bulk plane can disrupt training but cannot RCE the host.
* ``<output_dir>/diloco_audit.log``: JSONL records for register / deregister / eviction / outer-step / control. Tokens are never written.
* ``_resolve_diloco_server_token(port, regen)`` mirrors the dataset_server flow. Token persisted on ``JobRecord.auth_token``; spawn command uses ``--auth-token-file`` so the token never lands in argv.
* ``Authorization`` via ``X-Diloco-Auth-Token`` override → JobRecord auto-lookup → registry → empty. Honors ``verify_tls=False`` for SSH-tunneled remotes. AddExternalServerForm now surfaces ``auth_token`` (masked) and ``verify_tls`` (default checked); registry rows show 🔒 when authed.

Test plan
* ``pytest tests/unit/ml/diloco/ tests/unit/forgather_server/ tests/unit/forgather/test_tls.py`` → 982 passed
* ``forgather diloco server --help`` shows the new auth + TLS + bulk-port flags
* ``npm run build`` in ``tools/forgather_server/webui/`` clean
* ``--bulk-port --no-bulk-tls --no-bulk-auth`` and confirm bulk routing

Docs
* ``docs/operations/tls.md`` gains a DiLoCo subsection with the standalone-CLI walkthrough, mTLS notes, and bulk-port trade-off.
* ``docs/design/diloco-security.md`` is the new design doc covering planes, threat model, identity binding, audit log format, spawn flow, and wire-format additions.
* ``docs/design/diloco-pipeline-groups.md`` and ``tools/forgather_server/README.md`` cross-link to the new doc.

Out of scope (follow-ups)
* ``http.server`` to FastAPI/uvicorn.

🤖 Generated with Claude Code