
fix(ops): restore-test.sh success heartbeat / dead-man's-switch (#983) #1097

Merged
ToddHebebrand merged 3 commits into LanternOps:main from bdunncompany:fix/restore-test-success-heartbeat
Jun 6, 2026

Conversation

@bdunncompany

Collaborator

Closes #983.

Problem

scripts/ops/restore-test.sh only emits a signal on failure: fail() POSTs to RESTORE_TEST_ALERT_URL and exits non-zero, while the happy path ends at a bare exit 0 and sends nothing. So if the cron silently stops (host down, cron disabled, script removed, Spaces creds expired, container runtime broken), it emits no failure either, and "no alert" reads as "DR is healthy" when the check hasn't run in days. Classic dead-man's-switch gap: absence of a failure signal is indistinguishable from absence of the check.

Fix

Add an optional RESTORE_TEST_HEARTBEAT_URL (healthchecks.io-style). On each PASS, after the SUCCESS log, ping it; the external monitor alerts when the ping goes stale — so silence becomes the alarm, not the all-clear.

  • Mirrors the existing alert() style (same curl flags, quoted, default-expansion).
  • Fully back-compatible: unset RESTORE_TEST_HEARTBEAT_URL = no ping, identical to today.
  • bash -n clean; documented in the script header env-var block.
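The diff itself is not shown in this conversation, so the following is only a minimal sketch of what such a success ping could look like. The helper name `heartbeat`, the curl flags, and the 10-second timeout are assumptions for illustration; the merged script may differ.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: helper name, curl flags and timeout are assumptions,
# not the merged code.

# Ping the dead-man's-switch URL if one is configured; otherwise do nothing.
heartbeat() {
  # Unset or empty URL: skip silently, so behaviour matches the old script.
  [ -n "${RESTORE_TEST_HEARTBEAT_URL:-}" ] || return 0
  # A failed ping must not turn a passing restore test into a failure,
  # so the curl exit status is deliberately swallowed.
  curl -fsS --max-time 10 "${RESTORE_TEST_HEARTBEAT_URL}" >/dev/null 2>&1 || true
}
```

Under these assumptions, the script would call `heartbeat` right after the SUCCESS log line and before the final `exit 0`. A ping that fails to send then shows up as a stale check on the monitor rather than as a false restore failure.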

…heck findings)

govulncheck (Security Scanning workflow) is failing on main and every open PR with
2 Go standard-library vulnerabilities, both fixed in go1.25.11:
  - GO-2026-5039  net/textproto  (unescaped inputs in errors)
  - GO-2026-5037  crypto/x509    (inefficient candidate hostname parsing)

These are stdlib-only (no third-party module bumps needed); a freshly-published
advisory started failing the scan on unchanged code because CI pinned 1.25.10.
Bump the GO_VERSION / go-version pins in ci.yml, codeql.yml and release.yml so
the scan (and the released agent binary's stdlib) use the patched toolchain.
go.mod's go directive (1.25.10 floor) is intentionally left as-is.
security.yml runs the Go Vulnerability Check job and had its own GO_VERSION:
'1.25.10' (separate from ci.yml). Bumping it to 1.25.11 is what actually makes
govulncheck use the patched stdlib.
…h) (LanternOps#983)

The restore test only signaled on failure: fail() POSTs to RESTORE_TEST_ALERT_URL
and exits non-zero, while the happy path ended at a bare 'exit 0' and sent nothing.
So a silently-stopped cron (host down, cron disabled, script removed, Spaces creds
expired, runtime broken) emits no failure either — 'no alert' then reads as 'DR is
healthy' when the check hasn't run in days.

Add an optional RESTORE_TEST_HEARTBEAT_URL (healthchecks.io-style dead-man's-switch):
on each PASS, ping it after the SUCCESS log. The monitor alerts when the ping goes
stale, so absence of the check becomes the alarm rather than the all-clear. Optional
and back-compatible — unset = no ping, same as before.
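To make "the ping goes stale" concrete, here is a rough sketch of the rule a heartbeat monitor applies on its side. It is not part of this PR. The function name `is_stale` and the period/grace parameters are hypothetical, and a hosted service implements this for you.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of a monitor-side staleness rule; not part of this PR.

# is_stale LAST_PING NOW PERIOD GRACE
#   LAST_PING, NOW: epoch seconds; PERIOD, GRACE: seconds.
# Succeeds (exit 0) when the last ping is older than PERIOD + GRACE.
is_stale() {
  [ $(( $2 - $1 )) -gt $(( $3 + $4 )) ]
}

# Example: a daily restore test (86400 s) with a one-hour grace window.
# if is_stale "$last_ping" "$(date +%s)" 86400 3600; then echo "ALERT"; fi
```

With a daily cron, a host that has been down for three days trips this rule even though `fail()` never ran, which is the case the failure-only alert could not cover.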
@bdunncompany force-pushed the fix/restore-test-success-heartbeat branch from 6c2ded9 to 60e5cbd on June 3, 2026 01:15
@ToddHebebrand merged commit 23a61d4 into LanternOps:main on Jun 6, 2026
24 of 25 checks passed

Development

Successfully merging this pull request may close these issues.

restore-test.sh has no success signal — a silently-stopped cron reads as healthy

2 participants