feat(tailscale): auto-recover from stale tsnet state at runtime #394

Open
pdevito3 wants to merge 1 commit into almeidapaulopt:main from pdevito3:feat/auto-recover-stale-tsnet

@pdevito3 pdevito3 commented May 1, 2026

Summary

Layers runtime auto-recovery on top of the existing preventive cleanup in cleanStaleState. Today, if tsnet's backend reports "invalid key" while the proxy is already running, the proxy gets stuck in ProxyStatusError until a human notices and either restarts tsdproxy or manually deletes the data directory.

cleanStaleState (added in d4a2e3c) only runs at NewProxy time and only triggers on ephemeral flag mismatch, so it doesn't cover these runtime failure modes:

  • Hard crash / OOM / docker kill -9 before Close()'s ephemeral cleanup runs, leaving stale tailscaled.state
  • Host power loss with ephemeral node expiry on the tailnet side
  • Manual device deletion in the Tailscale admin console while tsdproxy is running
  • Control-plane revocation for any reason

After this change, those scenarios trigger a single automatic RemoveAll(datadir) + restart cycle. If the restart also fails, the proxy stops in ProxyStatusError (no infinite loop).

Changes

  • tailscale/proxy.go: when watchStatus sees "invalid key", os.RemoveAll the tsnet data directory (mirroring cleanStaleState's approach), set ProxyStatusError, and close the events channel so proxymanager can detect the unexpected termination.
  • proxymanager/proxy.go: when the events channel closes outside of a normal Stopping/Stopped flow, log and trigger a one-shot onRestart callback. The restartable flag is consumed (set to false) on first use to prevent restart loops.
  • proxymanager/proxymanager.go: newAndStartProxy now takes a restartable parameter. eventStart passes true; the auto-recovery path passes false when re-spawning, so a persistently-failing proxy can't loop.

Loop-prevention contract

restartable is consumed exactly once, before onRestart is called. The recursive newAndStartProxy(name, proxyConfig, false) ensures the second proxy instance has restartable=false, so if it also hits "invalid key", the channel close triggers the unexpected-termination logging but no further restart. The proxy ends in ProxyStatusError and surfaces normally on the dashboard.

Test plan

  • proxyproviders/tailscale/proxy_test.go — 4 cases: removeStaleState removes the datadir, handles missing/empty/nil cases without panic
  • proxymanager/proxy_restart_test.go — 3 cases: restart fires on unexpected close, doesn't fire when restartable=false, doesn't fire on normal Stopping shutdown
  • All 7 unit tests pass locally
  • go vet clean across both modified packages
  • Validated end-to-end on personal NAS deployment by force-killing the container mid-run; on next start the proxy auto-recovered exactly once. After the recovery, normal traffic flowed without intervention.

Related

Builds on top of d4a2e3c (preventive cleanup of tsnet state on ephemeral flag change). Complementary, not overlapping — that change prevents stale state at config-change time; this change recovers from stale state at runtime when prevention wasn't possible.

Tracked separately from #384 (env var overrides) to keep scopes small.

Layer runtime auto-recovery on top of the existing preventive cleanup in
cleanStaleState. Today, if tsnet's backend reports "invalid key" while the
proxy is already running (e.g. after a hard crash that bypassed Close()'s
ephemeral cleanup, ephemeral node expiry on the tailnet side, host power
loss, or admin-side device deletion), the proxy gets stuck in
ProxyStatusError until a human notices and either restarts tsdproxy or
manually deletes the data directory.

cleanStaleState only runs at NewProxy time and only triggers on ephemeral
flag mismatch, so it doesn't cover these runtime failure modes.

Changes:
- tailscale/proxy.go: when watchStatus sees "invalid key", os.RemoveAll
  the tsnet data directory (mirroring cleanStaleState's approach), set
  ProxyStatusError, and close the events channel so proxymanager can
  detect the unexpected termination.
- proxymanager/proxy.go: when the events channel closes outside of a
  normal Stopping/Stopped flow, log and trigger a one-shot onRestart
  callback. The restartable flag is consumed (set to false) on first use
  to prevent restart loops if the recovery itself fails.
- proxymanager/proxymanager.go: newAndStartProxy now takes a restartable
  parameter. eventStart passes true; the auto-recovery path passes
  false when re-spawning the proxy, so a misbehaving proxy can't loop
  forever.

Tests:
- proxyproviders/tailscale/proxy_test.go (4 cases): removeStaleState
  removes the datadir, handles missing/empty/nil cases without panic.
- proxymanager/proxy_restart_test.go (3 cases): restart fires on
  unexpected close, doesn't fire when restartable=false, doesn't fire
  on normal shutdown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@almeidapaulopt almeidapaulopt force-pushed the main branch 2 times, most recently from b720a3f to 28506e0 May 7, 2026 23:40