
fix(patches): grace-period prune of stale patch tombstones (#1004) #1098

Merged: ToddHebebrand merged 3 commits into LanternOps:main from bdunncompany:fix/patch-tombstone-grace-prune on Jun 8, 2026

Conversation

@bdunncompany

Collaborator

Closes #1004.

Problem

Scan ingest marks every device_patches row status='missing' at the start of a scan, then re-upserts what the scan reported. Rows left 'missing' are stale tombstones, and they accumulate without bound: a Linux package upgraded to a new externalId orphans the old row forever. US prod had devices with 300–960 'missing' rows.

Fix — API-only grace-period prune (the safe option from the thread)

pruneStaleTombstones() runs after each scan commits and deletes this device's 'missing' rows whose updatedAt is older than a grace window (default 7 days; configurable in hours via PATCH_TOMBSTONE_PRUNE_AFTER_HOURS).
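For reviewers who want the predicate spelled out, here is a minimal in-memory sketch of the prune rule. It is not the code in this PR, which deletes through Drizzle with a make_interval filter. PatchRow, resolvePruneAfterHours and pruneStaleTombstonesInMemory are hypothetical names, and falling back to the default on a missing or invalid setting is an assumption of the sketch.

```typescript
// In-memory model of the grace-period prune. Illustrative only: the real
// pruneStaleTombstones() issues a SQL DELETE; all names here are hypothetical.
type PatchStatus = 'pending' | 'installed' | 'missing';

interface PatchRow {
  orgId: string;
  deviceId: string;
  externalId: string;
  status: PatchStatus;
  updatedAt: Date; // last time a scan actually reported this patch
}

const DEFAULT_PRUNE_AFTER_HOURS = 7 * 24; // 7 days, the default named in the PR

// Parse a PATCH_TOMBSTONE_PRUNE_AFTER_HOURS-style value. Falling back to the
// default on missing, non-numeric, or non-positive input is an assumption.
function resolvePruneAfterHours(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === '') return DEFAULT_PRUNE_AFTER_HOURS;
  const hours = Number(raw);
  return isFinite(hours) && hours > 0 ? hours : DEFAULT_PRUNE_AFTER_HOURS;
}

// Split rows into kept and pruned. A row is pruned only when it belongs to the
// target org and device, is 'missing', and was last sighted before the cutoff.
function pruneStaleTombstonesInMemory(
  rows: PatchRow[],
  orgId: string,
  deviceId: string,
  now: Date,
  pruneAfterHours: number = DEFAULT_PRUNE_AFTER_HOURS,
): { kept: PatchRow[]; pruned: PatchRow[] } {
  const cutoffMs = now.getTime() - pruneAfterHours * 60 * 60 * 1000;
  const kept: PatchRow[] = [];
  const pruned: PatchRow[] = [];
  for (const row of rows) {
    const stale =
      row.orgId === orgId &&
      row.deviceId === deviceId &&
      row.status === 'missing' &&
      row.updatedAt.getTime() < cutoffMs;
    (stale ? pruned : kept).push(row);
  }
  return { kept, pruned };
}
```

Running the function a second time over the kept rows prunes nothing, which mirrors the idempotency case in the test list.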

Why updatedAt (not lastCheckedAt): the bulk mark-missing sets status='missing' + lastCheckedAt=now on every row each scan, so lastCheckedAt is useless as an age signal. The same bulk update leaves updatedAt untouched; only a real upsert (a patch the scan actually reported) bumps it, so updatedAt dates the patch's last real sighting.

Why it's safe re: the destructive edge you flagged (source buckets coarser than providers; agent submits partial payloads on partial-provider failure): a winget failure under the shared third_party bucket while chocolatey succeeds leaves winget's rows 'missing' this cycle, but their updatedAt is still recent, so they're inside the window and not pruned. They self-heal on the next clean scan. Only genuinely-removed packages age out. Scoped to one device + org (cross-tenant safe).
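To make the timeline concrete, here is a small simulation of the scan cycle (mark-missing, upsert, prune), with timestamps as plain milliseconds. It is a sketch under assumptions, not the ingest code: Row and ingestScan are hypothetical names, 'pending' stands in for whatever status a real upsert writes, and the grace window is fixed at 7 days.

```typescript
// Simulated scan ingest. Illustrative only; names and the 'pending'
// placeholder status are assumptions of this sketch.
type Status = 'pending' | 'installed' | 'missing';

interface Row {
  externalId: string;
  status: Status;
  updatedAt: number; // ms; bumped only by a real upsert
  lastCheckedAt: number; // ms; bumped by every scan
}

const HOUR_MS = 60 * 60 * 1000;
const GRACE_MS = 7 * 24 * HOUR_MS;

function ingestScan(rows: Row[], reported: string[], now: number): Row[] {
  // Bulk mark-missing: touches status and lastCheckedAt, never updatedAt.
  const next: Row[] = rows.map(
    (row): Row => ({ ...row, status: 'missing', lastCheckedAt: now }),
  );
  // Upsert every patch the scan actually reported; this bumps updatedAt.
  for (const externalId of reported) {
    const fresh: Row = { externalId, status: 'pending', updatedAt: now, lastCheckedAt: now };
    const index = next.map((row) => row.externalId).indexOf(externalId);
    if (index >= 0) next[index] = fresh;
    else next.push(fresh);
  }
  // Grace-period prune: drop 'missing' rows last sighted before the window.
  return next.filter(
    (row) => !(row.status === 'missing' && now - row.updatedAt > GRACE_MS),
  );
}
```

Feeding it a clean scan on day 0, a partial scan on day 1 that omits a winget package, and a clean scan on day 2 leaves the winget row in place throughout: it is 'missing' for one cycle with a one-day-old updatedAt, then re-upserted. A package that stops being reported after day 0 survives through day 7 and is dropped by the day 8 scan.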

Tests

6 integration cases against a real DB (a Drizzle mock never runs the make_interval filter): stale row pruned; recent row kept (covers empty/zero-item payload + same-bucket partial-provider self-heal); pending/installed never touched; idempotency; target-device-only; org-scope guard. Existing patches.test.ts unit suite updated and green. Full apps/api tsc clean.

Note

Stacked on #1096 (the Go-1.25.11 govuln fix) so CI's Security Scanning is green — the 4 Go-workflow lines in this diff belong to #1096 and drop out once it merges (I'll rebase onto main then).

…heck findings)

govulncheck (Security Scanning workflow) is failing on main and every open PR with
2 Go standard-library vulnerabilities, both fixed in go1.25.11:
  - GO-2026-5039  net/textproto  (unescaped inputs in errors)
  - GO-2026-5037  crypto/x509    (inefficient candidate hostname parsing)

These are stdlib-only (no third-party module bumps needed); a freshly-published
advisory started failing the scan on unchanged code because CI pinned 1.25.10.
Bump the GO_VERSION / go-version pins in ci.yml, codeql.yml and release.yml so
the scan (and the released agent binary's stdlib) use the patched toolchain.
go.mod's go directive (1.25.10 floor) is intentionally left as-is.
security.yml runs the Go Vulnerability Check job and had its own GO_VERSION:
'1.25.10' (separate from ci.yml). Bumping it to 1.25.11 is what actually makes
govulncheck use the patched stdlib.
…rnOps#1004)

The scan ingest marks every device_patches row status='missing' at the start of
a scan, then re-upserts the rows the scan reported. Rows left 'missing' are stale
tombstones (e.g. a Linux package upgraded to a new externalId orphans the old
row) and accumulated unbounded — US prod had devices with 300-960 'missing' rows.

Add pruneStaleTombstones(): after each scan commits, delete this device's
'missing' rows whose updatedAt is older than a grace window (default 7d,
PATCH_TOMBSTONE_PRUNE_AFTER_HOURS). updatedAt is the right signal — it's bumped
only when a scan actually reports the patch; the bulk mark-missing leaves it
untouched (unlike lastCheckedAt, which is refreshed every scan). This realizes
the API-only 'grace-period' option from the issue thread safely: the destructive
edge Todd flagged (source buckets are coarser than providers; a winget failure
under the shared 'third_party' bucket while chocolatey succeeds) self-heals,
because the row is re-upserted on the next clean scan inside the window. Only
genuinely-removed packages age out. Scoped to one device + org.

Tests: 6 integration cases against a real DB (a Drizzle mock would never run the
make_interval filter) — stale row pruned; recent row kept (covers empty/zero-item
payload + same-bucket self-heal); pending/installed never touched; idempotency;
target-device-only; org-scope guard. Existing patches.test.ts unit suite updated
(db.delete + and() mocks) and green.
@ToddHebebrand merged commit 34da7fc into LanternOps:main on Jun 8, 2026
24 of 25 checks passed


Development

Successfully merging this pull request may close these issues.

Patch scan tombstones (device_patches.status='missing') accumulate unbounded

2 participants