feat(tailscale): auto-recover from stale tsnet state at runtime by pdevito3 · Pull Request #394 · almeidapaulopt/tsdproxy

pdevito3 · 2026-05-01T21:09:25Z

Summary

Layers runtime auto-recovery on top of the existing preventive cleanup in cleanStaleState. Today, if tsnet's backend reports "invalid key" while the proxy is already running, the proxy gets stuck in ProxyStatusError until a human notices and either restarts tsdproxy or manually deletes the data directory.

cleanStaleState (added in d4a2e3c) only runs at NewProxy time and only triggers on ephemeral flag mismatch, so it doesn't cover these runtime failure modes:

- Hard crash / OOM / docker kill -9 before Close()'s ephemeral cleanup runs, leaving a stale tailscaled.state
- Host power loss with ephemeral node expiry on the tailnet side
- Manual device deletion in the Tailscale admin console while tsdproxy is running
- Control-plane revocation for any reason

After this change, those scenarios trigger a single automatic RemoveAll(datadir) + restart cycle. If the restart also fails, the proxy stops in ProxyStatusError (no infinite loop).

Changes

- tailscale/proxy.go: when watchStatus sees "invalid key", os.RemoveAll the tsnet data directory (mirroring cleanStaleState's approach), set ProxyStatusError, and close the events channel so proxymanager can detect the unexpected termination.
- proxymanager/proxy.go: when the events channel closes outside of a normal Stopping/Stopped flow, log and trigger a one-shot onRestart callback. The restartable flag is consumed (set to false) on first use to prevent restart loops.
- proxymanager/proxymanager.go: newAndStartProxy now takes a restartable parameter. eventStart passes true; the auto-recovery path passes false when re-spawning, so a persistently-failing proxy can't loop.

Loop-prevention contract

restartable is consumed exactly once, before onRestart is called. The recursive newAndStartProxy(name, proxyConfig, false) ensures the second proxy instance has restartable=false, so if it also hits "invalid key", the channel close triggers the unexpected-termination logging but no further restart. The proxy ends in ProxyStatusError and surfaces normally on the dashboard.

Test plan

- proxyproviders/tailscale/proxy_test.go (4 cases): removeStaleState removes the datadir, handles missing/empty/nil cases without panic
- proxymanager/proxy_restart_test.go (3 cases): restart fires on unexpected close, doesn't fire when restartable=false, doesn't fire on normal Stopping shutdown
- All 7 unit tests pass locally
- go vet clean across both modified packages
- Validated end-to-end on a personal NAS deployment by force-killing the container mid-run; on next start the proxy auto-recovered exactly once. After the recovery, normal traffic flowed without intervention.

Layer runtime auto-recovery on top of the existing preventive cleanup in cleanStaleState. Today, if tsnet's backend reports "invalid key" while the proxy is already running (e.g. after a hard crash that bypassed Close()'s ephemeral cleanup, ephemeral node expiry on the tailnet side, host power loss, or admin-side device deletion), the proxy gets stuck in ProxyStatusError until a human notices and either restarts tsdproxy or manually deletes the data directory. cleanStaleState only runs at NewProxy time and only triggers on ephemeral flag mismatch, so it doesn't cover these runtime failure modes.

Changes:
- tailscale/proxy.go: when watchStatus sees "invalid key", os.RemoveAll the tsnet data directory (mirroring cleanStaleState's approach), set ProxyStatusError, and close the events channel so proxymanager can detect the unexpected termination.
- proxymanager/proxy.go: when the events channel closes outside of a normal Stopping/Stopped flow, log and trigger a one-shot onRestart callback. The restartable flag is consumed (set to false) on first use to prevent restart loops if the recovery itself fails.
- proxymanager/proxymanager.go: newAndStartProxy now takes a restartable parameter. eventStart passes true; the auto-recovery path passes false when re-spawning the proxy, so a misbehaving proxy can't loop forever.

Tests:
- proxyproviders/tailscale/proxy_test.go (4 cases): removeStaleState removes the datadir, handles missing/empty/nil cases without panic.
- proxymanager/proxy_restart_test.go (3 cases): restart fires on unexpected close, doesn't fire when restartable=false, doesn't fire on normal shutdown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

almeidapaulopt force-pushed the main branch 2 times, most recently from b720a3f to 28506e0 Compare May 7, 2026 23:40

pdevito3 wants to merge 1 commit into almeidapaulopt:main from pdevito3:feat/auto-recover-stale-tsnet
