
DiLoCo: full security model (auth, mTLS, audit) - closes #90 (#93)

Merged
jdinalt merged 16 commits into dev from feature/diloco-security on May 29, 2026

jdinalt commented May 28, 2026

Summary

Closes #90. Lands the bearer + TLS + mTLS + audit layer for the DiLoCo subsystem (server, client, worker, callback, scheduler, webui proxy, webui form). Mirrors the dataset_server / inference_server pattern, adapted to DiLoCo's stdlib http.server stack.

9 logical commits, all individually green. Single bundled PR per the agreed phasing.

What it does

  • Bearer-token auth on the control plane with per-port persisted token files at ~/.config/forgather/diloco_server/<port>.token. Standard --auth-token / --auth-token-file / --no-auth / --regen-token / --quiet-tokens CLI surface.
  • TLS via the shared forgather.tls stack, with two new stdlib-flavored helpers (stdlib_ssl_context, urllib_ssl_context) that don't depend on uvicorn/httpx.
  • mTLS skip-bearer: when TLS is enabled with a cluster CA bundle, a client presenting a CA-signed cert at the handshake is cluster-authenticated without a bearer. Same model as _PEER_ALLOWED_PATHS in forgather_server.
  • Two-port bulk plane (--bulk-port): pseudo-gradients + /global_params move to a separate listener with cleartext+no-auth defaults (matching torch.distributed's LAN posture). Control port keeps full security. Because every tensor blob is loaded with torch.load(..., weights_only=True), an attacker on an open bulk plane can disrupt training but cannot achieve remote code execution on the host.
  • Audit log at <output_dir>/diloco_audit.log — JSONL records for register / deregister / eviction / outer-step / control. Tokens are never written.
  • Identity binding — job-level only (phase 1, per the planning discussion). Cross-job spoofing blocked by per-server tokens; per-rank binding deferred.
  • Scheduler integration: _resolve_diloco_server_token(port, regen) mirrors the dataset_server flow. Token persisted on JobRecord.auth_token; spawn command uses --auth-token-file so the token never lands in argv.
  • Webui proxy attaches an Authorization header, resolving the token in this order: X-Diloco-Auth-Token override → JobRecord auto-lookup → registry → none (no header sent). Honors verify_tls=False for SSH-tunneled remotes. AddExternalServerForm now surfaces auth_token (masked) and verify_tls (default checked); registry rows show 🔒 when authed.

Test plan

  • pytest tests/unit/ml/diloco/ tests/unit/forgather_server/ tests/unit/forgather/test_tls.py → 982 passed
  • forgather diloco server --help shows the new auth + TLS + bulk-port flags
  • npm run build in tools/forgather_server/webui/ clean
  • Manual smoke (operator): standalone DiLoCo server with TLS+bearer, worker against it, then bring up a second config with --bulk-port --no-bulk-tls --no-bulk-auth and confirm bulk routing
  • Manual smoke (operator): spawn DiLoCo via the webui modal, confirm proxy auto-uses the persisted token, confirm audit log populated

Docs

  • docs/operations/tls.md gains a DiLoCo subsection with the standalone-CLI walkthrough, mTLS notes, and bulk-port trade-off.
  • docs/design/diloco-security.md is the new design doc covering planes, threat model, identity binding, audit log format, spawn flow, and wire-format additions.
  • docs/design/diloco-pipeline-groups.md and tools/forgather_server/README.md cross-link to the new doc.

Out of scope (follow-ups)

  • Per-rank identity binding (mTLS subject-bound or pre-registered roster).
  • Token rotation while server is running.
  • Audit-log tamper-evidence (currently plain JSONL).
  • Migrating off stdlib http.server to FastAPI/uvicorn.

🤖 Generated with Claude Code

jdinalt and others added 10 commits May 28, 2026 19:28
The stdlib http.server / urllib analogues of the uvicorn / httpx
helpers. Needed by DiLoCo (stdlib HTTP stack) for issue #90. Pure
addition; no existing behavior changes.

* ``stdlib_ssl_context(args, cfg)`` builds a server-side SSLContext
  ready for ``ctx.wrap_socket(sock, server_side=True)`` on
  ``http.server`` listening sockets. When a cluster CA bundle is
  present, configures ``CERT_OPTIONAL`` so the handshake validates
  any client cert that is presented (mTLS path).
* ``urllib_ssl_context(cfg, verify=True)`` mirrors
  ``httpx_peer_kwargs``: builds one context carrying both the
  cluster CA bundle and this node's cert+key. ``verify=False``
  returns an unverified context for SSH-tunneled remotes.

Tests cover defaults, half-provisioned hosts, FileNotFoundError on
missing cert/key, the verify=False opt-out path, and an end-to-end
handshake between an ``http.server`` instance using the server
context and a ``urllib.request`` call using the client context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #90. The DiLoCo HTTP wire ran cleartext + unauthenticated.
Add the per-port bearer-token / TLS pattern dataset_server uses, but
adapted to DiLoCo's stdlib http.server stack instead of FastAPI.

* ``src/forgather/ml/diloco/auth.py`` (new):
  - ``add_auth_args`` / ``resolve_auth_token`` / ``format_auth_mode``
    mirror dataset_server's CLI surface (--auth-token / --auth-token-file
    / --no-auth / --regen-token / --quiet-tokens).
  - Per-port token files at
    ``~/.config/forgather/diloco_server/<port>.token`` (mode 0600 in
    a 0700 dir). ``read_standalone_token`` does the loopback-only
    auto-discovery for clients.
  - ``verify_bearer(handler, expected)`` is a stdlib-friendly verifier
    for ``BaseHTTPRequestHandler``: constant-time compare, 401 +
    ``WWW-Authenticate: Bearer realm="forgather-diloco"`` on failure.
* ``DiLoCoServer.__init__`` gains ``auth_token`` and ``ssl_context``
  kwargs (both default None to keep existing in-process callers
  working). Request handler runs the bearer check at the top of every
  POST/GET dispatch; ``/health`` is added and intentionally exempt.
* ``DiLoCoServer.run`` / ``start`` wrap the listening socket when an
  SSL context is configured; startup banner reports scheme + auth
  state.
* ``forgather diloco server`` CLI: ``add_auth_args`` and
  ``add_server_tls_args`` plumbed in; ``_server_cmd`` resolves the
  token (persisting auto-generated/regenerated tokens), builds the
  SSL context via the new ``stdlib_ssl_context``, runs
  ``enforce_non_loopback_policy``, and passes both to DiLoCoServer.
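As a rough illustration of the verifier described in the list above, the following sketch checks a bearer token against a `BaseHTTPRequestHandler`-style object. The function name and the exact header parsing are assumptions; the constant-time compare, the 401 status, and the `WWW-Authenticate` value come from the commit message.

```python
import hmac


def sketch_verify_bearer(handler, expected: str) -> bool:
    """Return True when the request carries the expected bearer token.

    On failure a 401 with a WWW-Authenticate challenge is written and
    False is returned, so the caller can simply stop dispatching.
    """
    header = handler.headers.get("Authorization", "")
    scheme, _, presented = header.partition(" ")
    presented = presented.strip()
    ok = (
        scheme.lower() == "bearer"
        and bool(presented)
        # compare_digest avoids leaking the match position through timing.
        and hmac.compare_digest(presented.encode(), expected.encode())
    )
    if not ok:
        handler.send_response(401)
        handler.send_header("WWW-Authenticate", 'Bearer realm="forgather-diloco"')
        handler.send_header("Content-Length", "0")
        handler.end_headers()
    return ok
```

A handler would call this at the top of `do_GET` / `do_POST` for every path except `/health` and return immediately when it yields False.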

The ``weights_only=True`` torch.load hardening that #90 requires was
already in place on both server and client (commits 19bcf83,
2e79aeb).

Tests: 15 new in ``tests/unit/ml/diloco/test_server_auth.py`` cover
missing/malformed/wrong-token 401s, /health open, no-auth mode, the
verify_bearer unit, per-port token file mode 0600 + round-trip, and
loopback-only auto-discovery. Full 245-test DiLoCo suite passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the server-side bearer/TLS infrastructure (commit 2) through to
the client, worker, and DiLoCoCallback so the same auth surface works
in-process and from the CLI.

* ``DiLoCoClient.__init__`` gains ``token`` and ``verify_tls`` kwargs.
  When ``token=None`` the client falls back to (in order)
  ``FORGATHER_DILOCO_SERVER_TOKEN`` and the per-port loopback file —
  the locally-spawned case Just Works without extra plumbing.
  ``Authorization: Bearer <token>`` is attached to every request via
  a shared ``_headers`` helper; ``urlopen`` is invoked with
  ``context=self._ssl_ctx`` (a ``urllib_ssl_context`` for https URLs,
  None for cleartext).
* ``DiLoCoWorker.__init__`` accepts and forwards ``auth_token`` /
  ``verify_tls`` to its embedded client.
* ``DiLoCoCallback.__init__`` adds matching kwargs and a
  ``DILOCO_VERIFY_TLS`` env var fallback; the ``/status`` reachability
  probe at on_train_begin uses the same credentials.
* ``forgather diloco status`` CLI grows ``--auth-token`` and
  ``--no-verify-tls`` flags so the operator can talk to a protected
  remote without environmental coupling.

Tests: 5 new in ``test_client_auth.py`` exercise the full
server↔client round-trip — matching token, missing token (401),
loopback file discovery, env-var override, and the no-auth
backwards-compat path. Existing
``test_diloco_callback.py::TestWorkerLifecycle`` assertions extended
to cover the two new kwargs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the DiLoCo server is launched with a stdlib SSL context that
loaded the cluster CA bundle (verify_mode = ssl.CERT_OPTIONAL), a
client that presents a CA-signed cert at the TLS handshake is
treated as cluster-authenticated and the bearer-token check is
skipped. Same model as ``_PEER_ALLOWED_PATHS`` in
``tools/forgather_server/auth.py``.

* New ``peer_cert_authenticated(handler)`` predicate reads
  ``handler.connection.getpeercert()``: a non-empty dict means the
  handshake validated a peer cert against the configured CA.
  Defensive defaults for cleartext sockets, missing ``connection``,
  and ``OSError`` from a torn-down socket.
* New ``authenticate_request(handler, expected_token)`` is the
  combined entry point: mTLS first, then bearer fallback. The
  server's ``_authenticated`` helper switches to it.
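The resolution order described above (mTLS first, then bearer) might look roughly like this. The function names carry a `sketch_` prefix because they are illustrative, and two details are assumptions rather than facts from the PR: an empty expected token is treated as auth disabled, and `ValueError` is caught alongside `OSError`.

```python
import hmac


def sketch_peer_cert_authenticated(handler) -> bool:
    """True when the TLS handshake validated a client certificate."""
    conn = getattr(handler, "connection", None)
    getpeercert = getattr(conn, "getpeercert", None)
    if getpeercert is None:
        # Cleartext socket, or the handler has no connection at all.
        return False
    try:
        # A non-empty dict means a peer cert was validated against the CA.
        return bool(getpeercert())
    except (OSError, ValueError):
        # Torn-down socket or incomplete handshake: treat as unauthenticated.
        return False


def sketch_authenticate_request(handler, expected_token) -> bool:
    """Combined entry point: cluster cert first, bearer token as fallback."""
    if sketch_peer_cert_authenticated(handler):
        return True
    if not expected_token:
        # Assumption: no configured token means the server runs without auth.
        return True
    header = handler.headers.get("Authorization", "")
    scheme, _, presented = header.partition(" ")
    return scheme.lower() == "bearer" and hmac.compare_digest(
        presented.strip().encode(), expected_token.encode()
    )
```

Checking the certificate first is what lets a cluster peer skip the per-port bearer entirely, while clients without a cert still go through the token check.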

This means intra-cluster peer calls (forgather_server proxy → DiLoCo,
DiLoCo → another forgather_server, etc.) don't have to share each
server's per-port bearer — presenting the cluster cert is enough.
Mirrors the inference_server / forgather_server pattern.

Tests: 11 new in ``test_server_mtls.py``. End-to-end: a TLS server
with a real provisioned cluster CA accepts a cluster-cert client
without a bearer (200), rejects no-cert + no-bearer (401), and
still serves no-cert + valid-bearer (200). Unit tests cover the
peer-cert predicate against cleartext, missing-conn, empty-cert,
populated-cert, and ``OSError`` edge cases, plus the combined
authenticate_request resolution order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #90. Operators want the option to disable TLS+auth for bulk
data transport (pseudo-gradients, model weights) on a trusted LAN
where throughput is the bottleneck, while keeping registration,
heartbeat, control, and status fully protected on the control port.
The split also lets us cleanly bind the bulk plane to a different
host/interface than the control plane in future work.

* ``DiLoCoServer.__init__`` gains ``bulk_port``, ``bulk_ssl_context``,
  and ``bulk_auth_enabled``. When ``bulk_port`` is set, a second
  ``ThreadingHTTPServer`` runs in its own daemon thread with a
  ``role="bulk"`` handler that only serves the three bulk endpoints
  (and ``/health``).
* The control-port handler refuses bulk paths with 404 +
  ``X-Forgather-Bulk-Url`` so misrouted clients can self-correct.
  Avoids two ways into the bulk plane (slow-but-secure vs
  fast-but-cleartext) which would let an attacker pick whichever
  was convenient.
* ``/register`` advertises the bulk URL via the same response
  header. ``DiLoCoClient.register`` captures it; subsequent
  ``submit_pseudograd`` / ``submit_fragment_pseudograd`` /
  ``global_params`` calls route to the bulk URL automatically via
  ``_base_for_path``. Per-URL SSL-context selection
  (``_ssl_for_request``) lets the control URL be https and the
  bulk URL be http on the same client without crossed wires.
* CLI: ``--bulk-port N``, plus ``--bulk-tls`` / ``--no-bulk-tls``
  and ``--bulk-auth`` / ``--no-bulk-auth`` mutex groups. Defaults
  when ``--bulk-port`` is set: cleartext, no-auth (the user's
  stated "torch.distributed-equivalent" posture).
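To make the client-side routing concrete, here is a small sketch of how a client could capture the advertised bulk URL and pick a base URL per path. The class and method names are hypothetical, and of the three bulk paths only `/global_params` is named in the PR; the two submit paths are placeholders. Treating an unusable advertisement as "no bulk URL" is also a choice made for this sketch.

```python
from typing import Mapping, Optional
from urllib.parse import urlsplit

# "/global_params" appears in the PR; the other two paths are hypothetical.
BULK_PATHS = ("/global_params", "/submit_pseudograd", "/submit_fragment_pseudograd")


class SketchRouter:
    def __init__(self, control_url: str) -> None:
        self.control_url = control_url.rstrip("/")
        self.bulk_url: Optional[str] = None

    def on_register_response(self, headers: Mapping[str, str]) -> None:
        """Capture, or clear, the bulk URL advertised by /register."""
        advertised = headers.get("X-Forgather-Bulk-Url")
        if advertised and urlsplit(advertised).scheme in ("http", "https"):
            self.bulk_url = advertised.rstrip("/")
        else:
            # Absent header or unexpected scheme: fall back to the control URL.
            self.bulk_url = None

    def base_for_path(self, path: str) -> str:
        """Bulk endpoints go to the bulk listener; everything else stays put."""
        route = path.split("?", 1)[0]
        if self.bulk_url and route in BULK_PATHS:
            return self.bulk_url
        return self.control_url
```

A real client would use a case-insensitive header lookup; a plain mapping keeps the sketch short.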

RCE protection is independent of these knobs — every tensor blob
deserializes via ``torch.load(..., weights_only=True)``, so an
attacker on the open bulk plane can disrupt training but cannot
take over the host. That guarantee was already in place from
prior commits; this one preserves it across the new listener.

Tests: 8 new in ``test_server_bulk_port.py`` cover get_bulk_url,
control-port 404 + hint header, bulk-port serves /global_params
without bearer, bulk-port refuses control endpoints, control still
requires bearer, /register response carries X-Forgather-Bulk-Url,
and end-to-end client routing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Append-only JSONL records at ``<output_dir>/diloco_audit.log`` for
events worth reconstructing after the fact. Best-effort: write
failures (disk full, permissions, missing dir) log a warning and
keep the request going — the audit log is a record, not a guard.

Instrumented sites:

* ``register`` — worker_id, hostname, group_id, pp_rank,
  pp_world_size, num_registered.
* ``deregister`` — worker_id (followed by an ``eviction`` record
  because deregister goes through _handle_worker_death).
* ``eviction`` — trigger_worker_id, evicted list (group-aware),
  group_id, remaining workers.
* ``outer_step`` — sync_round, contributors, missing_contributors.
* ``control`` — action and the JSON payload of every ``/control/*``
  call. No per-caller identity yet (phase 1 = job-level only; a
  future PR adding mTLS subject-bound identity will populate it).

Records carry a UTC ISO-8601 timestamp (``+00:00`` suffix so plain
sort works) and JSON-encode any non-stringable values via
``default=str``. Tokens are never logged — the regression guard in
``test_token_is_never_logged`` watches that.
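A best-effort JSONL writer along these lines could be sketched as follows. The record keys `ts` and `event`, the function name, and the logger name are assumptions for illustration; the UTC ISO-8601 timestamp, `default=str`, and warn-and-continue behavior follow the description above.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("diloco.audit.sketch")


def sketch_audit(path: str, event: str, **fields) -> bool:
    """Append one JSON record per line; never raise on write failure."""
    record = {
        # isoformat() on an aware UTC datetime ends in "+00:00".
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,
        **fields,
    }
    try:
        with open(path, "a", encoding="utf-8") as fh:
            # default=str stringifies values json cannot encode natively.
            fh.write(json.dumps(record, default=str) + "\n")
        return True
    except OSError as exc:
        # The audit log is a record, not a guard: warn and keep going.
        logger.warning("audit write failed for %s: %s", path, exc)
        return False
```

Because each record is a single line with a sortable timestamp, the file can be inspected with ordinary line-oriented tools.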

Tests: 7 new in ``test_audit_log.py`` cover register/deregister/
control records, ISO timestamp, no-token-in-log, write-failure
graceful degradation, and the empty-output_dir no-op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the dataset_server-style local-spawn flow for DiLoCo (issue
#90). When the webui spawns a DiLoCo server through the queue, the
scheduler now:

1. Calls ``_resolve_diloco_server_token(port, regen)`` — a new
   helper mirroring ``_resolve_dataset_server_token``. Reads the
   per-port persisted file if non-empty; mints + writes a fresh
   ``secrets.token_hex(32)`` otherwise. ``regen=True`` rotates.
2. Persists ``auth_token`` on the resulting ``JobRecord`` so the
   webui proxy can find it by job (next commit's wiring).
3. Builds the spawn command with ``--auth-token-file <per-port>``
   so the token never touches argv — the spawned process reads it
   from the same file the standalone CLI does.
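The read-or-mint contract in step 1 can be sketched with the standard library alone. The function name and the `token_dir` parameter are hypothetical stand-ins for the real helper and for `~/.config/forgather/diloco_server/`; the file modes and `secrets.token_hex(32)` come from the commit messages.

```python
import os
import secrets
from pathlib import Path


def sketch_resolve_token(token_dir: Path, port: int, regen: bool = False) -> str:
    """Reuse the persisted per-port token, or mint and persist a new one."""
    token_dir.mkdir(parents=True, exist_ok=True)
    os.chmod(token_dir, 0o700)
    path = token_dir / f"{port}.token"
    if not regen and path.exists():
        existing = path.read_text(encoding="utf-8").strip()
        if existing:
            # A non-empty file wins; an empty file is treated as missing.
            return existing
    token = secrets.token_hex(32)  # 32 random bytes, 64 hex characters
    # Create with 0600 from the start so the token is never world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(token)
    os.chmod(path, 0o600)  # also tightens a pre-existing file
    return token
```

The spawned server would then be pointed at the same file, which is how the token stays out of the process argument list.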

``build_diloco_server_command`` and ``spawn_diloco_server_process``
gain matching kwargs: ``auth_token_file``, ``no_auth``,
``bulk_port``, ``bulk_tls``, ``bulk_auth``. The bulk-port knobs let
operators configure the two-port bulk plane through the same job
spec the queue already accepts.

URL-scheme stamping (the cosmetic ``scheme`` field used by Job
cards) is extended to ``diloco_server`` jobs so https/http renders
correctly when the operator provisioned TLS.

Tests: 8 new in ``test_scheduler_diloco_server_token.py``: mint+
persist contract, reuse across calls, regen rotate, empty-file
treated as missing, plus CLI-builder assertions confirming
``--auth-token-file`` / ``--no-auth`` / ``--bulk-port*`` surface
correctly. Full 665-test forgather_server suite green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #90. The DiLoCo registry's ``auth_token`` / ``verify_tls``
fields existed but were inert — the proxy used hardcoded
``verify=True`` and never attached an Authorization header. Wire
both ends now.

* ``routes/diloco.py``:
  - New ``_token_for_local(base)`` walks running diloco_server
    JobRecords and returns the persisted bearer for the matching
    port — same shape as dataset_server's helper.
  - New ``_auth_headers_for(base, request)`` applies the
    standard precedence: ``X-Diloco-Auth-Token`` override header →
    JobRecord auto-lookup → registry's ``find_token`` → empty.
  - New ``_verify_for(target, base)`` honors per-registry
    ``verify_tls=False`` and otherwise defers to
    ``forgather.tls.httpx_verify_for_url``.
  - All proxy callers (status/info/work-queues/work-queue/control)
    now thread ``request`` through and attach headers + verify
    accordingly.
  - Module docstring rewritten to describe the now-active auth
    surface; ``AddRegistryEntryRequest`` doc no longer claims the
    fields are ignored.

* ``webui/src/components/DiLoCoPanel.tsx``:
  - ``AddExternalServerForm`` gains a masked ``auth_token`` input
    and a ``verify_tls`` checkbox (defaults to checked).
  - Registry rows display a 🔒 indicator when ``has_auth_token``
    is true, so operators can see at a glance which entries are
    protected.
  - API surface unchanged — ``api.addDiLoCoRegistryEntry`` already
    accepts ``auth_token`` and ``verify_tls``.

Tests: 6 new in ``test_routes_diloco_auth.py`` cover the override
header precedence, JobRecord-vs-registry fallback, no-auth (no
Authorization header sent), ``verify_tls=False`` propagation, and
the control endpoint attaching bearer the same way GETs do.
Existing 21-test ``test_routes_diloco.py`` suite stays green.

WebUI builds clean (``npm run build`` → tsc + vite).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ``docs/operations/tls.md``: new "DiLoCo server" subsection covering
  the standalone CLI, per-port bearer token discovery, mTLS skip-bearer
  path, the ``--bulk-port`` two-port plane and its trusted-LAN
  defaults, the trade-off between throughput and security, and the
  audit log location.
* ``docs/design/diloco-security.md`` (new): full design doc — control
  vs bulk plane, threat model, identity binding (job-level only in
  phase 1, mTLS subject-bound deferred), audit log format, spawn
  flow, wire-format additions, and the test surface.
* ``docs/design/diloco-pipeline-groups.md``: cross-link to the new
  security doc.
* ``tools/forgather_server/README.md``: new "DiLoCo server" section
  under the proxy threat-model area documenting the auth model,
  control-vs-bulk plane, and where to find the design notes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address findings from the three-agent review pass on PR #93. No
correctness blockers; this commit folds in the recommended fixes.

Security review (L1, L2):

* Run ``_authenticated`` before ``_bulk_offloaded`` in do_POST /
  do_GET so unauthenticated callers don't learn the bulk-listener
  topology from the 404 hint.
* Add ``_CONTROL_AUDIT_FIELDS`` per-action allowlist and route
  ``_audit("control", ...)`` payloads through ``_audit_control_data``.
  Today's actions only carry intent metadata; the allowlist is the
  forward-compat guardrail against a future control endpoint that
  ships secret material. Two regression tests pin it.
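A per-action allowlist of this kind is easy to picture in a few lines. The action names and field names below are invented for illustration, as are the `SKETCH_`/`sketch_` identifiers; only the idea (copy allowlisted keys, drop everything else, empty data for unknown actions) is taken from the review notes.

```python
from typing import Any, Dict, Tuple

# Hypothetical actions and fields; the real table lives in the server module.
SKETCH_CONTROL_AUDIT_FIELDS: Dict[str, Tuple[str, ...]] = {
    "pause": ("reason",),
    "set_outer_lr": ("value",),
}


def sketch_audit_control_data(action: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only allowlisted keys; unknown actions yield an empty dict."""
    allowed = SKETCH_CONTROL_AUDIT_FIELDS.get(action, ())
    return {key: payload[key] for key in allowed if key in payload}
```

With an allowlist, a future control endpoint that carries secret material is excluded from the audit log by default until someone opts its fields in.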

Architecture review:

* On reconnect, clear ``DiLoCoClient.bulk_url`` when /register's
  X-Forgather-Bulk-Url header is absent. A server that drops or
  reshapes its bulk listener no longer leaves clients dialing a
  dead URL.
* Add ``fragment_outer_step`` audit event at both fragment-apply
  call sites (the submit path and the eviction-triggered path).
  ``docs/design/diloco-security.md`` updated.
* Fix docstring drift: ``X-Forgather-Bulk-Port`` → ``X-Forgather-Bulk-Url``
  in two server.py comments (the actual header was always correct).

Code-quality review:

* F1: warn when ``DiLoCoServer(auth_token="")`` is constructed
  explicitly with the empty-string token (silent auth-disable was
  too quiet for an obvious misconfiguration).
* F3: add ``_log_auth_failure`` and call it on every 401 path —
  ``auth.py``'s ``logger`` was imported but unused, leaving 401s
  invisible in operator logs. Now they land at INFO with the
  client IP + path, no token leakage.
* F4: explicit ``self._bulk_ssl_ctx = None`` initialisation in
  ``DiLoCoClient.__init__``; replaces the getattr-lazy pattern.
* T4: validate the advertised bulk URL's scheme on intake; ignore
  anything that's not http/https with a WARNING log. Defense in
  depth against a misconfigured proxy or compromised server.

New tests (4):

* ``test_audit_log.py``: control-payload redaction + unknown-action
  empty-data fallback.
* ``test_server_bulk_port.py``: stray bearer on no-auth bulk port
  is ignored, not rejected.
* ``test_server_mtls.py``: full mTLS control + cleartext bulk
  matrix (the recommended production posture per the design doc).

Documentation:

* ``docs/operations/tls.md``: new "Migrating an existing no-auth
  deployment" subsection — loopback path, cross-host paths, and
  the deliberate stay-on-no-auth path.
* ``docs/design/diloco-security.md``: audit-log table extended with
  ``fragment_outer_step`` and a note on the control allowlist.

All 350 security-touching tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jdinalt commented May 28, 2026

Review response (commit a330f67)

Three review agents (security, architecture, code-quality) returned. Verdicts: clean for merge with no Critical/High/Medium findings. Folded the recommended fixes into a single review-response commit on top.

Security (L1, L2)

  • L1 Run _authenticated before _bulk_offloaded so unauthenticated callers don't learn the bulk-listener topology from the 404 hint header.
  • L2 Added _CONTROL_AUDIT_FIELDS per-action allowlist; _audit("control", ...) now redacts data through _audit_control_data. Two regression tests in test_audit_log.py.

Architecture

  • Clear DiLoCoClient.bulk_url on reconnect when the X-Forgather-Bulk-Url header is absent. A server that drops/reshapes its bulk listener no longer leaves clients dialing a dead URL.
  • Added fragment_outer_step audit event at both fragment-apply call sites (submit path + eviction-triggered path); design doc updated.
  • Fixed docstring drift: X-Forgather-Bulk-Port → X-Forgather-Bulk-Url.

Code quality

  • F1 Warn when DiLoCoServer(auth_token="") is constructed explicitly with empty string — silent auth-disable was too quiet for an obvious misconfiguration.
  • F3 Added _log_auth_failure and call on every 401 path — auth.py's logger was unused, leaving 401s invisible in operator logs. Now at INFO with client IP + path; no token leakage.
  • F4 Explicit self._bulk_ssl_ctx = None init in DiLoCoClient.__init__; replaces getattr-lazy pattern.
  • T4 Validate the advertised bulk URL's scheme on intake — ignore anything other than http/https with a WARNING. Defense in depth against a misconfigured proxy or compromised server.

New tests (4 added; 350 total in security suite)

  • test_audit_log.py: control-payload redaction + unknown-action empty-data fallback.
  • test_server_bulk_port.py: stray bearer on no-auth bulk port is ignored, not rejected.
  • test_server_mtls.py: full mTLS-control + cleartext-bulk matrix (the recommended production posture).

Docs

  • docs/operations/tls.md: new "Migrating an existing no-auth deployment" subsection — loopback path, cross-host paths, and the deliberate stay-on-no-auth path.
  • docs/design/diloco-security.md: audit-log table extended with fragment_outer_step and a note on the control allowlist.

Deferred to follow-up (non-blocking)

  • T5: concurrent register/deregister ordering stress test.
  • T6: audit-log file deleted mid-run.
  • T7: --regen-token race on two simultaneous launches.
  • T8: _token_for_local against a just-terminated JobRecord.
  • S2: consolidate _LOCALHOST_HOSTS constant (pre-existing tech debt).

Happy to file these as a tracking issue if you'd prefer them not get lost.

🤖 Generated with Claude Code

Three bugs in the operator-facing surface I missed when wiring up the
DiLoCo security model:

1. **Scheme mismatch on local URLs.** ``_local_servers`` and
   ``_ever_local_base_urls`` in ``routes/diloco.py`` hardcoded
   ``http://``, even though the scheduler now stamps the actual
   scheme on the JobRecord (commit cd7aa8e). Result: a TLS-enabled
   server appeared as ``http://...`` in the Job card AND the proxy
   spoke HTTP at the TLS socket, producing ``502 Bad Gateway:
   ReadError`` on every status poll. Fix: respect
   ``job_params["scheme"]``, falling back to http for pre-stamping
   records.

2. **DiLoCoServerModal had no security fields.** The spawn modal
   exposed every other DiLoCo knob but none of the new
   ``--no-auth`` / ``--regen-token`` / ``--quiet-tokens`` /
   ``--bulk-port`` / ``--bulk-tls`` / ``--bulk-auth`` flags.
   Operators had no way, through the UI, to disable auth, rotate the
   persisted token, suppress the launch banner for shared TTYs, or
   split off the bulk plane.

   The fix mirrors ``DatasetServerModal``'s auth layout: ``--no-auth`` at the
   top of a "Security (auth + bulk plane)" fieldset, ``--regen-token``
   / ``--quiet-tokens`` underneath (disabled when ``--no-auth`` is
   set), then the bulk-port number input with the ``--bulk-tls`` /
   ``--bulk-auth`` checkboxes revealed when bulk_port > 0.

   ``PersistedAdHoc`` / edit-mode seed / state hooks / ``buildArgs`` /
   ``persistCurrent`` all extended to round-trip the new fields.
   ``quiet_tokens`` plumbed through ``build_diloco_server_command``,
   ``spawn_diloco_server_process``, and the scheduler's
   ``_build_diloco_server`` so the modal toggle reaches the actual
   spawn argv. ``--regen-token`` doesn't need a CLI plumb — the
   effect already lands via ``_resolve_diloco_server_token`` reading
   ``regen_token`` from ``job_params``.

3. **DiLoCo Job card didn't show the token (or honor demo mode).**
   The inference and dataset_server cards show ``token:`` for
   authenticated servers, ``auth: hidden (demo mode)`` when the
   webui is in demo mode, and ``auth: --no-auth`` when bearer is
   off. DiLoCo's card showed none of these. Added the same pattern
   (token redaction in demo mode is the user's stated requirement),
   plus a ``bulk:`` row showing the bulk port + TLS/auth posture
   when a two-port topology is configured.

Tests:
* New ``test_https_scheme_stamped_by_scheduler_is_respected`` —
  pins the scheme fix as a regression guard.
* New ``test_missing_scheme_falls_back_to_http`` — the
  pre-stamping fallback path.

WebUI builds clean (``npm run build`` → tsc + vite).
All 35 affected backend tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jdinalt and others added 3 commits May 29, 2026 03:54
When DiLoCo is spawned with the default ``--host 127.0.0.1`` binding,
the proxy's existing loopback-only ``_token_for_local`` matches.
When the operator binds to ``0.0.0.0`` or a specific LAN IP — or
just browses the webui from a different host — the proxy builds the
URL from the JobRecord's stamped ``routable_host`` (e.g.
``https://192.168.9.43:8512``), but ``_token_for_local`` rejected
anything non-loopback. The bearer didn't attach; the upstream
returned 401; the webui's blanket-401-means-reauth path bounced the
operator to the login screen.

Extend the matcher to accept three independent signals, *any* of
which proves the URL points at one of our own JobRecords:

1. Loopback URL against a loopback or ``0.0.0.0`` bind (original).
2. URL hostname equals the record's stamped ``routable_host`` —
   the proxy synthesized this URL itself from the JobRecord, so
   trusting it for token lookup is consistent with trusting it for
   display.
3. URL hostname equals the record's explicit bind ``host`` (e.g.
   operator typed ``--host 10.0.0.5``).

Terminated records (``status not in {starting, running}``) still
return None — a just-died job can't keep handing out its token.
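The three matching signals can be expressed compactly. The sketch below uses a hypothetical `SketchJob` dataclass in place of the real `JobRecord` and a local loopback set in place of `_LOCALHOST_HOSTS`; field names follow the commit message, but the real record shape may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Optional
from urllib.parse import urlsplit

LOOPBACK_HOSTS = {"127.0.0.1", "localhost", "::1"}
ACTIVE_STATUSES = {"starting", "running"}


@dataclass
class SketchJob:
    host: str                      # bind host, e.g. "0.0.0.0" or "10.0.0.5"
    port: int
    status: str
    auth_token: Optional[str] = None
    routable_host: Optional[str] = None


def sketch_token_for_local(base_url: str, jobs: Iterable[SketchJob]) -> Optional[str]:
    """Return the persisted token of the live job that base_url points at."""
    parts = urlsplit(base_url)
    url_host, url_port = parts.hostname, parts.port
    for job in jobs:
        if job.status not in ACTIVE_STATUSES or job.port != url_port:
            continue  # terminated jobs never hand out their token
        loopback = url_host in LOOPBACK_HOSTS and (
            job.host in LOOPBACK_HOSTS or job.host == "0.0.0.0"
        )
        routable = job.routable_host is not None and url_host == job.routable_host
        explicit_bind = url_host == job.host
        if loopback or routable or explicit_bind:
            return job.auth_token
    return None
```

Any one of the three signals is sufficient, which is what lets a LAN URL built from `routable_host` find the token of a `0.0.0.0`-bound server.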

Tests: 6 new in ``test_routes_diloco_auth.py``:
* loopback + loopback bind
* loopback + 0.0.0.0 bind
* LAN URL + routable_host (regression for the user-reported bug)
* LAN URL + explicit bind host
* unrelated LAN host / wrong port returns None
* terminated record returns None

The auth-bounce symptom (proxy 401 → webui login bounce) is filed
separately as issue #94 — even with this fix, an upstream that
legitimately 401s for a different reason should land as an inline
error, not a session-expired bounce.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #94. Three surgical changes to break the login-bounce loop and
get visibility into the bearer-attachment path:

1. **X-Upstream-Auth-Failed tagging.** The webui's fetch wrapper
   (``webui/src/auth.ts:82-83``) already suppresses the
   ``AUTH_REQUIRED_EVENT`` when a response carries
   ``X-Upstream-Auth-Failed: 1``. Inference, dataset_server, and
   cluster proxies all set this header on upstream 401/403. The
   DiLoCo proxy (added in PR #93) didn't, so every upstream 401
   looked like a session-expired event and bounced the operator to
   the login screen — making the panel unusable while we debug the
   underlying token-attachment issue. Now ``_proxy_get`` and
   ``proxy_control`` both stamp the tag via the existing
   ``_upstream_auth_headers`` pattern. The operator's session
   stays intact; the panel surfaces the 401 inline.

2. **Diagnostic logging in _token_for_local.** When the matcher
   finds a port-matching JobRecord but can't reconcile the URL's
   hostname against any of the record's host / routable_host
   fields, we now log at INFO showing exactly what each side
   looked like and whether the record carried an auth_token. The
   token itself is never logged. Lets an operator hitting the
   bounce on a non-loopback URL inspect the TTY and see whether
   the JobRecord is missing routable_host, missing auth_token, or
   if the hostnames truly don't match.

3. **Bearer-attached-or-not log on upstream 401/403** in
   ``_proxy_get`` so even when ``_token_for_local`` matches but the
   bearer is wrong, the operator sees which side of the issue they
   have.
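The tagging in item 1 amounts to a small decision on the upstream status code. This sketch is not the existing `_upstream_auth_headers` helper, whose signature is not shown in the PR; the function name and return shape are assumptions, while the header name, its value, and the 401/403 trigger come from the text above.

```python
from typing import Dict


def sketch_upstream_auth_headers(upstream_status: int) -> Dict[str, str]:
    """Headers to add to the proxy response for a given upstream status."""
    if upstream_status in (401, 403):
        # Tells the webui this is an upstream auth failure, not an expired session.
        return {"X-Upstream-Auth-Failed": "1"}
    return {}
```

The proxy would merge these headers into its own response while passing the upstream status through, so the panel can show the 401 inline.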

Tests: 2 new in ``test_routes_diloco_auth.py``:
* Upstream 401 surfaces with ``x-upstream-auth-failed: 1``.
* Upstream 200 does NOT carry the header (the negative case).

Together with f4ecef1 (the routable_host matcher fix), this
should: (a) stop the immediate bounce loop, (b) give the operator
clear log signal on why bearer attachment failed if it still does.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The worker dialed http:// at a TLS-wrapped DiLoCo server and the
TLS layer slammed the connection with RST, surfacing as
``ConnectionResetError: [Errno 104] Connection reset by peer`` at
the ``/status`` probe in ``DiLoCoCallback.on_train_begin``. Two
contributing bugs:

1. **SubmitModal stripped the scheme.** ``buildDiLoCoPayload`` in
   ``SubmitModal.tsx`` did:
       const serverAddr = s.base.replace(/^https?:\/\//, "").replace(/\/$/, "");
   based on a stale assumption that the callback wanted bare
   ``host:port``. The callback just passes the value through to
   ``DiLoCoClient``, which handles both forms — and which can't
   recover the scheme once it's gone. Fix: trim only the trailing
   slash; preserve the scheme.

2. **DiLoCoClient defaulted bare host:port to http://.** Legacy
   callers (the ``forgather diloco worker --server`` CLI) and any
   path that hands a bare ``host:port`` were locked into HTTP.
   Now: when no scheme is present, call
   ``forgather.tls.client_scheme()`` — the same helper the
   scheduler uses to stamp the JobRecord — so the worker picks
   ``https://`` when TLS is locally provisioned and ``http://``
   otherwise.

The two fixes are independent guardrails: even if the webui regresses
back to scheme-stripping, the client-side scheme picker keeps the
worker pointed at the right scheme on any host where TLS is
provisioned. Likewise, an explicit ``https://`` from the webui
overrides whatever the local TLS posture happens to be.
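The client-side half of the fix can be sketched as a small normalizer. The function name is hypothetical, and the `client_scheme` argument is a stand-in for `forgather.tls.client_scheme()`, assumed here to return either `"http"` or `"https"` without the `://` suffix.

```python
from typing import Callable


def sketch_normalize_server_url(server: str, client_scheme: Callable[[], str]) -> str:
    """Keep an explicit scheme; otherwise infer one for a bare host:port."""
    server = server.rstrip("/")
    if server.startswith(("http://", "https://")):
        # An explicit scheme always wins over the local TLS posture.
        return server
    return f"{client_scheme()}://{server}"
```

For example, `sketch_normalize_server_url("10.0.0.5:8512", lambda: "https")` yields `https://10.0.0.5:8512`, while an explicit `http://` address passes through with only its trailing slash trimmed.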

Tests: 2 new in ``test_client_auth.py``:
* Bare host:port picks scheme from client_scheme() (covers both
  TLS-provisioned and not).
* Explicit ``http://`` / ``https://`` passes through unchanged.

WebUI builds clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jdinalt and others added 2 commits May 29, 2026 06:32
Review pass on PR #93 surfaced several issues; fixes here, ordered by
severity:

High
- cli/diloco: reject --bulk-auth on a cleartext bulk listener. The bulk
  and control listeners share one bearer, so requiring it over cleartext
  would leak the control-plane credential to a LAN sniffer. Error out
  unless --bulk-tls (or run --no-bulk-auth).
- forgather_server/scheduler: inject FORGATHER_DILOCO_SERVER_TOKEN into
  the training worker's env when the DiLoCo URL is routable/non-loopback.
  Loopback per-port-file auto-discovery can't fire for a 0.0.0.0-bound
  server's stamped routable URL, so the worker 401'd. Matched by
  JobRecord host/port; operator-set token still wins.
- diloco/server: get_bulk_url() no longer advertises a wildcard bind
  host. A 0.0.0.0-bound server would hand remote workers an unroutable
  http://0.0.0.0:<port>; now it uses the worker's Host header, falling
  back to loopback.
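The wildcard-bind fix in the last item could be approximated as below. This is a sketch under assumptions: the function name and parameters are hypothetical and do not reflect the real `get_bulk_url()` signature, and the set of wildcard hosts is a guess at what the server treats as unroutable.

```python
from typing import Optional
from urllib.parse import urlsplit

WILDCARD_HOSTS = {"", "0.0.0.0", "::"}


def sketch_bulk_url(
    bind_host: str,
    bulk_port: int,
    scheme: str = "http",
    host_header: Optional[str] = None,
) -> str:
    """Build an advertised bulk URL that a remote worker can actually dial."""
    host: Optional[str] = bind_host
    if bind_host in WILDCARD_HOSTS:
        host = None
        if host_header:
            try:
                # A "//" prefix lets urlsplit parse "host:port" and "[v6]:port".
                host = urlsplit(f"//{host_header}").hostname
            except ValueError:
                host = None  # malformed Host header
        host = host or "127.0.0.1"  # last resort: loopback
    if ":" in host:
        host = f"[{host}]"  # IPv6 literals need brackets in a URL
    return f"{scheme}://{host}:{bulk_port}"
```

Using the request's `Host` header means the worker gets back the same name it already used to reach the control port.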

Medium
- cli/diloco + server: resolve an ephemeral --port 0 to the concrete
  bound port before the token file is written/banner printed, so
  loopback token discovery matches the real port. _find_available_port
  made static.
- diloco/client: bare host:port scheme inference is now loopback-aware
  (loopback -> http; routable -> client_scheme()), and routable guesses
  carry a fail-loud hint on connection failure instead of a bare RST.
  This also fixes a pre-existing test regression from 72c3225: on a
  TLS-provisioned dev machine every localhost test client was dialing
  https against cleartext servers.
- diloco/server: audit-log writes never block the sync barrier. Barrier
  paths accumulate records and flush via _audit_many after releasing
  _sync_cond; a persistent append handle replaces per-record open/close.
- tls/runtime: urllib_ssl_context(verify=False) still loads the node's
  client cert, so --no-verify-tls doesn't silently disable the mTLS
  skip-bearer path.

Low
- routes/diloco: locally-spawned servers report has_auth_token (the lock
  indicator was inverted vs registered remotes).
- diloco/client: _ssl_for_request matches on the origin boundary, not a
  bare string prefix.
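To see why a bare string prefix is not enough, consider two URLs whose ports or hostnames share leading characters. The helper below is an illustrative origin comparison, not the actual `_ssl_for_request` code; its name and the default-port table are assumptions.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

_DEFAULT_PORTS = {"http": 80, "https": 443}


def _origin(url: str) -> Tuple[str, Optional[str], Optional[int]]:
    """(scheme, host, port) with the default port filled in."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    return scheme, parts.hostname, parts.port or _DEFAULT_PORTS.get(scheme)


def sketch_same_origin(url: str, base: str) -> bool:
    """True only when scheme, host, and port all match exactly."""
    return _origin(url) == _origin(base)
```

With this comparison, `http://node:8513/global_params` does not match a base of `http://node:851`, even though the string prefix test would say it does.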

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The DiLoCo section was only open when a server was selected or a
restore error was present, so a silent reset-to-None (prior selection
no longer in the server list) collapsed the section and hid the fact
that the job would run as vanilla training. Default it open like the
model/dataset sections so the current state is always visible at a
glance; still user-collapsible. Addresses the "folded by default" half
of #95.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jdinalt merged commit dbed18d into dev on May 29, 2026
1 check passed
jdinalt deleted the feature/diloco-security branch on May 29, 2026 06:33