Description
On busy instances, audit_log_verify_chain() reports breaks even with zero tampering. Surfaced on the US droplet immediately after the v0.68.2 deploy activated the audit hash chain for the first time on a high-traffic instance: ~20 false breaks accumulated within the first few minutes, all from live agent telemetry rows (agent.sessions.submit, agent.security_status.submit, agent.patches.submit, …), and the count grows continuously under load. The 89k historical (backfilled) rows verify clean — only concurrently-inserted live rows break.
This is a pre-existing design bug in the chain (PR #900), not a regression from the convert_to bytea fix (#994). The deploy merely ran the chain at write-concurrency for the first time. EU (low traffic, ~5k rows) shows 0 breaks but has the same latent bug.
No availability or data impact: every audit row is written and individually checksummed correctly. Only the chain linkage forks, which degrades the tamper-detection signal, because a break caused by real tampering would be indistinguishable from the false positives.
Root Cause
The BEFORE INSERT trigger audit_log_compute_checksum() selects its predecessor with no serialization:
SELECT checksum INTO prev
FROM audit_logs
WHERE org_id IS NOT DISTINCT FROM NEW.org_id
AND id <> NEW.id
ORDER BY timestamp DESC, id DESC
LIMIT 1;
When N transactions insert audit rows for the same org concurrently, each reads the same latest committed checksum as prev and chains off it → the chain forks. Verified on US: 6 distinct prev_checksum values were each used by 2–5 rows (a clean serial chain uses each exactly once), all within one 4-minute window of concurrent agent telemetry. audit_log_verify_chain walks timestamp, id order and flags every forked row as a break.
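The fork can be reproduced in a small in-memory model. The sketch below is not the trigger itself: the `Row` shape, `toyHash` (an FNV-1a stand-in for the real checksum) and the helper names are assumptions made for illustration, and org IDs are plain strings rather than nullable. It models N transactions that all read their predecessor before any of them commits, then walks the rows the way a chain verifier would.

```typescript
// Toy model of the audit chain. All names here are illustrative assumptions.
type Row = { id: number; orgId: string; prevChecksum: string | null; checksum: string };

// Stand-in for the real checksum (FNV-1a, 32-bit). NOT the production hash.
function toyHash(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// Mirrors the trigger's predecessor SELECT: the latest committed row for the org.
function latestChecksum(table: Row[], orgId: string): string | null {
  const last = [...table].reverse().find((r) => r.orgId === orgId);
  return last?.checksum ?? null;
}

function makeRow(id: number, orgId: string, prev: string | null): Row {
  return { id, orgId, prevChecksum: prev, checksum: toyHash(`${prev ?? ""}|${orgId}|${id}`) };
}

// Concurrent inserts: every transaction reads its predecessor before any commits,
// so all n rows chain off the same checksum.
function insertConcurrently(table: Row[], orgId: string, n: number): void {
  const firstId = table.length + 1;
  const prevSeen = Array.from({ length: n }, () => latestChecksum(table, orgId));
  prevSeen.forEach((prev, i) => table.push(makeRow(firstId + i, orgId, prev)));
}

// Walks one org's rows in insertion order and counts rows whose prevChecksum
// does not match the checksum of the row before them.
function countBreaks(table: Row[], orgId: string): number {
  let expected: string | null = null;
  let breaks = 0;
  for (const row of table.filter((r) => r.orgId === orgId)) {
    if (row.prevChecksum !== expected) breaks++;
    expected = row.checksum;
  }
  return breaks;
}
```

With one seed row followed by three concurrent inserts, the three new rows share the seed's checksum as their predecessor and this walk counts two breaks. The real audit_log_verify_chain may count forked rows differently.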
The same-timestamp variant of this was already noted as known future-work in the test suite:
apps/api/src/__tests__/integration/audit-checksum.integration.test.ts:110-111 — "requires a chain_seq bigserial + per-org advisory lock — tracked as future-task hardening." In practice it manifests under plain write concurrency, not just same-transaction batches.
Proposed Fix
- Serialize per-org inserts in the trigger with a transaction-scoped advisory lock taken before the predecessor SELECT, so each concurrent same-org insert waits for the previous one to commit and then chains off that row's checksum:
PERFORM pg_advisory_xact_lock(<namespace_const>, hashtext(COALESCE(NEW.org_id::text, 'NULL')));
(or the chain_seq bigserial approach). Ship this as a new migration that CREATE OR REPLACEs the trigger function.
- Heal migration to re-anchor already-forked rows (re-run the per-org backfill in timestamp, id order). The lock fix must deploy first; otherwise new forks keep appearing during and after the heal.
- Concurrency regression test (TDD): insert many same-org audit rows concurrently and assert that verify_chain returns 0 breaks.
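The three steps above can be sketched against an in-memory model. This is a minimal illustration under stated assumptions, not the migration or the integration test: `ChainRow`, `fnv1a`, `link`, `insertSerialized`, `healChain` and `chainBreaks` are hypothetical names, the hash is a toy stand-in, and the per-org lock is modeled by making "read predecessor, then insert" a single atomic step. It also assumes the row checksum covers the predecessor checksum, so re-anchoring a row recomputes its own checksum.

```typescript
// Illustrative model only. None of these names come from the codebase.
type ChainRow = { seq: number; orgId: string; prev: string | null; checksum: string };

// Toy 32-bit FNV-1a hash standing in for the real checksum function.
function fnv1a(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// Builds a row whose checksum covers its predecessor's checksum.
function link(seq: number, orgId: string, prev: string | null): ChainRow {
  return { seq, orgId, prev, checksum: fnv1a(`${prev ?? ""}|${orgId}|${seq}`) };
}

// Models the advisory lock: each insert reads the latest row and writes its own
// before the next insert for that org is allowed to read.
function insertSerialized(rows: ChainRow[], orgId: string, n: number): void {
  for (let i = 0; i < n; i++) {
    const last = [...rows].reverse().find((r) => r.orgId === orgId);
    rows.push(link(rows.length + 1, orgId, last?.checksum ?? null));
  }
}

// Heal: re-anchor every row onto the previous row of the same org.
// Assumes `rows` is already sorted in verification order.
function healChain(rows: ChainRow[]): ChainRow[] {
  const lastByOrg = new Map<string, string>();
  return rows.map((r) => {
    const healed = link(r.seq, r.orgId, lastByOrg.get(r.orgId) ?? null);
    lastByOrg.set(r.orgId, healed.checksum);
    return healed;
  });
}

// Regression check: counts rows whose prev does not match the preceding checksum.
function chainBreaks(rows: ChainRow[], orgId: string): number {
  let expected: string | null = null;
  let breaks = 0;
  for (const row of rows.filter((r) => r.orgId === orgId)) {
    if (row.prev !== expected) breaks++;
    expected = row.checksum;
  }
  return breaks;
}
```

In PostgreSQL itself, the lock only helps if the predecessor SELECT can see the row committed by the previous lock holder. That holds under the default READ COMMITTED isolation level; a transaction running at REPEATABLE READ or SERIALIZABLE would still read from its older snapshot, so the real regression test is worth running at the isolation level the API actually uses.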
Design tension to weigh
A per-org hash chain inherently serializes that org's audit writes. For high-frequency agent telemetry (the rows that forked here), a per-org advisory lock could become a write-throughput bottleneck. Worth deciding whether every telemetry event belongs in the tamper-evident chain, or only security-relevant events (auth, RBAC, config, retention), with high-volume telemetry excluded.
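A back-of-the-envelope bound makes the bottleneck concrete. Because a transaction-scoped advisory lock is held until the inserting transaction commits or rolls back, one org's chained inserts cannot run faster than one per lock-hold interval. The helper names and the numbers in the usage note below are illustrative assumptions, not measurements from either instance.

```typescript
// Upper bound on same-org chained inserts per second when each insert holds the
// per-org lock from the trigger until its transaction ends.
function maxSerializedInsertsPerSecond(lockHoldMs: number): number {
  if (lockHoldMs <= 0) throw new RangeError("lockHoldMs must be positive");
  return 1000 / lockHoldMs;
}

// True when the offered per-org insert rate exceeds that bound, meaning
// waiters queue behind the lock faster than they drain.
function lockWouldSaturate(offeredPerSecond: number, lockHoldMs: number): boolean {
  return offeredPerSecond > maxSerializedInsertsPerSecond(lockHoldMs);
}
```

For example, if an insert held the lock for 5 ms, a single org would top out at 200 chained inserts per second, and telemetry arriving at 300 per second for that org would back up. Measuring the actual lock-hold time on the US instance would show whether excluding high-volume telemetry from the chain is necessary.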
Affected Files
apps/api/migrations/2026-05-25-c-audit-log-checksum-canonical-fix.sql:38 — canonical audit_log_compute_checksum() trigger (predecessor SELECT, no lock)
apps/api/migrations/2026-05-25-b-audit-log-checksum-chain.sql:25 — original definition (same issue)
apps/api/src/__tests__/integration/audit-checksum.integration.test.ts:100-111 — existing note documenting the limitation
- New migrations: per-org advisory-lock trigger fix + forked-row heal
Reported By
Discovered during the v0.68.2 production deploy (2026-05-29). Both EU and US are on v0.68.2 and healthy; US's verify_chain accumulates false breaks under load until the lock fix ships.