fix(ops): restore-test.sh success heartbeat / dead-man's-switch (#983) #1097
Merged
ToddHebebrand merged 3 commits on Jun 6, 2026
…heck findings)
govulncheck (Security Scanning workflow) is failing on main and every open PR with 2 Go standard-library vulnerabilities, both fixed in go1.25.11:
- GO-2026-5039 net/textproto (unescaped inputs in errors)
- GO-2026-5037 crypto/x509 (inefficient candidate hostname parsing)
These are stdlib-only (no third-party module bumps needed); a freshly-published advisory started failing the scan on unchanged code because CI pinned 1.25.10. Bump the GO_VERSION / go-version pins in ci.yml, codeql.yml and release.yml so the scan (and the released agent binary's stdlib) use the patched toolchain. go.mod's go directive (1.25.10 floor) is intentionally left as-is.
security.yml runs the Go Vulnerability Check job and had its own GO_VERSION: '1.25.10' (separate from ci.yml). Bumping it to 1.25.11 is what actually makes govulncheck use the patched stdlib.
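A pin that lives in a separate workflow file, as security.yml's did, is easy to miss during a bump. The following is a minimal sketch of one way to list leftover pins; stale_go_pins is a hypothetical helper, not part of this PR, and the .github/workflows default path is an assumption about the repository layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): print every Go toolchain pin
# under a workflows directory that does not mention the wanted version.
stale_go_pins() {
  local want="$1"
  local dir="${2:-.github/workflows}"   # assumed default location
  # Match both "GO_VERSION: ..." env pins and "go-version: ..." action inputs,
  # then drop the lines that already carry the wanted version.
  # Plain substring match: a sketch, not a strict version comparison.
  grep -rnE "(GO_VERSION|go-version):" "$dir" | grep -vF "$want" || true
}

# Usage: stale_go_pins 1.25.11
```

Empty output means every pin found already names the wanted version.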
…h) (LanternOps#983)
The restore test only signaled on failure: fail() POSTs to RESTORE_TEST_ALERT_URL and exits non-zero, while the happy path ended at a bare 'exit 0' and sent nothing. So a silently-stopped cron (host down, cron disabled, script removed, Spaces creds expired, runtime broken) emits no failure either, and 'no alert' then reads as 'DR is healthy' when the check hasn't run in days.
Add an optional RESTORE_TEST_HEARTBEAT_URL (healthchecks.io-style dead-man's-switch): on each PASS, ping it after the SUCCESS log. The monitor alerts when the ping goes stale, so absence of the check becomes the alarm rather than the all-clear. Optional and back-compatible: unset = no ping, same as before.
Closes #983.
Problem
scripts/ops/restore-test.sh only emits a signal on failure: fail() POSTs to RESTORE_TEST_ALERT_URL and exits non-zero; the happy path ends at a bare exit 0 and sends nothing. So if the cron silently stops (host down, cron disabled, script removed, Spaces creds expired, container runtime broken) it emits no failure either, and "no alert" reads as "DR is healthy" when the check hasn't run in days. Classic dead-man's-switch gap: absence of a failure signal is indistinguishable from absence of the check.
Fix
Add an optional RESTORE_TEST_HEARTBEAT_URL (healthchecks.io-style). On each PASS, after the SUCCESS log, ping it; the external monitor alerts when the ping goes stale, so silence becomes the alarm, not the all-clear.
- Follows the existing alert() style (same curl flags, quoted, default-expansion).
- Unset RESTORE_TEST_HEARTBEAT_URL = no ping, identical to today.
- bash -n clean; documented in the script header env-var block.
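For illustration, the success ping described above could look roughly like the sketch below. This is not the code from the PR: the heartbeat function name, the specific curl flags and the placement of the call are assumptions; only RESTORE_TEST_HEARTBEAT_URL and the "unset = no ping" behaviour come from the description.

```shell
#!/usr/bin/env bash
# Sketch only; the real restore-test.sh may differ.

# Hypothetical helper: ping the dead-man's-switch URL if one is configured.
heartbeat() {
  # Default-expansion keeps this safe under `set -u` when the var is unset.
  local url="${RESTORE_TEST_HEARTBEAT_URL:-}"
  # Unset or empty: no ping, same behaviour as before the change.
  [ -n "$url" ] || return 0
  # Assumed curl flags: fail on HTTP errors, stay quiet, bound the wait.
  # A failed ping must never turn a passing restore test into a failure.
  curl -fsS -m 10 --retry 3 -o /dev/null "$url" || true
}

# On the happy path, after the SUCCESS log and before the final exit 0:
echo "SUCCESS: restore test passed"   # stand-in for the script's SUCCESS log
heartbeat
```

Swallowing the curl exit status is a deliberate choice in this sketch: if the ping cannot be delivered, the monitor's stale-ping alert still fires, which is the signal the heartbeat exists to produce.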