feat(tailscale): auto-recover from stale tsnet state at runtime (#394)
Open
pdevito3 wants to merge 1 commit into almeidapaulopt:main
Conversation
Layer runtime auto-recovery on top of the existing preventive cleanup in cleanStaleState. Today, if tsnet's backend reports "invalid key" while the proxy is already running (e.g. after a hard crash that bypassed Close()'s ephemeral cleanup, ephemeral node expiry on the tailnet side, host power loss, or admin-side device deletion), the proxy gets stuck in ProxyStatusError until a human notices and either restarts tsdproxy or manually deletes the data directory. cleanStaleState only runs at NewProxy time and only triggers on an ephemeral flag mismatch, so it doesn't cover these runtime failure modes.

Changes:

- tailscale/proxy.go: when watchStatus sees "invalid key", os.RemoveAll the tsnet data directory (mirroring cleanStaleState's approach), set ProxyStatusError, and close the events channel so proxymanager can detect the unexpected termination.
- proxymanager/proxy.go: when the events channel closes outside of a normal Stopping/Stopped flow, log and trigger a one-shot onRestart callback. The restartable flag is consumed (set to false) on first use to prevent restart loops if the recovery itself fails.
- proxymanager/proxymanager.go: newAndStartProxy now takes a restartable parameter. eventStart passes true; the auto-recovery path passes false when re-spawning the proxy, so a misbehaving proxy can't loop forever.

Tests:

- proxyproviders/tailscale/proxy_test.go (4 cases): removeStaleState removes the datadir; handles missing/empty/nil cases without panic.
- proxymanager/proxy_restart_test.go (3 cases): restart fires on unexpected close, doesn't fire when restartable=false, doesn't fire on normal shutdown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from b720a3f to 28506e0
Summary
Layers runtime auto-recovery on top of the existing preventive cleanup in cleanStaleState. Today, if tsnet's backend reports "invalid key" while the proxy is already running, the proxy gets stuck in ProxyStatusError until a human notices and either restarts tsdproxy or manually deletes the data directory. cleanStaleState (added in d4a2e3c) only runs at NewProxy time and only triggers on an ephemeral flag mismatch, so it doesn't cover runtime failure modes such as docker kill -9 before Close()'s ephemeral cleanup runs, which leaves a stale tailscaled.state behind.

After this change, those scenarios trigger a single automatic RemoveAll(datadir) + restart cycle. If the restart also fails, the proxy stops in ProxyStatusError (no infinite loop).

Changes
- tailscale/proxy.go: when watchStatus sees "invalid key", os.RemoveAll the tsnet data directory (mirroring cleanStaleState's approach), set ProxyStatusError, and close the events channel so proxymanager can detect the unexpected termination.
- proxymanager/proxy.go: when the events channel closes outside of a normal Stopping/Stopped flow, log and trigger a one-shot onRestart callback. The restartable flag is consumed (set to false) on first use to prevent restart loops.
- proxymanager/proxymanager.go: newAndStartProxy now takes a restartable parameter. eventStart passes true; the auto-recovery path passes false when re-spawning, so a persistently failing proxy can't loop.

Loop-prevention contract
restartable is consumed exactly once, before onRestart is called. The recursive newAndStartProxy(name, proxyConfig, false) ensures the second proxy instance has restartable=false, so if it also hits "invalid key", the channel close triggers the unexpected-termination logging but no further restart. The proxy ends in ProxyStatusError and surfaces normally on the dashboard.
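That contract can be sketched with minimal stand-in types. All names below (proxy, manager, handleUnexpectedClose) are hypothetical simplifications; the real proxymanager types carry far more state:

```go
package main

import "fmt"

// proxy is a minimal stand-in for proxymanager's proxy.
type proxy struct {
	restartable bool
	onRestart   func()
}

// handleUnexpectedClose mirrors the one-shot contract: restartable is
// consumed before onRestart runs, so a second unexpected close cannot
// trigger another restart.
func (p *proxy) handleUnexpectedClose() {
	if !p.restartable {
		fmt.Println("unexpected close, not restartable: staying in error state")
		return
	}
	p.restartable = false // consume first, even if the recovery fails
	if p.onRestart != nil {
		p.onRestart()
	}
}

type manager struct{ starts int }

// newAndStartProxy sketches the re-spawn wiring: the recovery path passes
// restartable=false, bounding a persistently failing proxy to two starts.
func (m *manager) newAndStartProxy(name string, restartable bool) *proxy {
	m.starts++
	p := &proxy{restartable: restartable}
	p.onRestart = func() { m.newAndStartProxy(name, false) }
	return p
}

func main() {
	m := &manager{}
	p := m.newAndStartProxy("web", true)
	p.handleUnexpectedClose() // simulated "invalid key": re-spawns once
	p.handleUnexpectedClose() // flag already consumed: no further restart
	fmt.Println("total starts:", m.starts) // bounded at 2, never infinite
}
```

Consuming the flag before invoking the callback (rather than after) is the key ordering: even if the restart itself panics or fails, the flag is already false and no loop is possible.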
Test plan

- proxyproviders/tailscale/proxy_test.go — 4 cases: removeStaleState removes the datadir; handles missing/empty/nil cases without panic
- proxymanager/proxy_restart_test.go — 3 cases: restart fires on unexpected close, doesn't fire when restartable=false, doesn't fire on normal Stopping shutdown
- go vet clean across both modified packages
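The normal-versus-unexpected distinction that the third restart test exercises can be sketched as a channel-drain loop. The types here are hypothetical stand-ins for the real event stream; the point is that a closed events channel only counts as unexpected when the last observed state was not Stopping/Stopped:

```go
package main

import "fmt"

type status int

const (
	running status = iota
	stopping
	stopped
)

// watchEvents drains the events channel until it closes, then decides
// whether the close was part of a normal shutdown. A close while the
// proxy was still running signals an unexpected termination.
func watchEvents(events <-chan status) (unexpected bool) {
	last := running
	for ev := range events {
		last = ev
	}
	// Channel closed: normal only if we saw a Stopping/Stopped event.
	return last != stopping && last != stopped
}

func main() {
	normal := make(chan status, 1)
	normal <- stopping
	close(normal)
	fmt.Println("normal shutdown unexpected?", watchEvents(normal)) // false

	crash := make(chan status, 1)
	crash <- running
	close(crash)
	fmt.Println("crash unexpected?", watchEvents(crash)) // true
}
```

Ranging over the channel exits cleanly on close, so no separate "channel closed" sentinel is needed; the last-seen state carries enough information to make the call.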
Related

Builds on top of d4a2e3c (preventive cleanup of tsnet state on ephemeral flag change). Complementary, not overlapping — that change prevents stale state at config-change time; this change recovers from stale state at runtime when prevention wasn't possible.

Tracked separately from #384 (env var overrides) to keep scopes small.