[API] Audit hash-chain forks under concurrent writes — verify_chain false positives on busy instances #1002

@ToddHebebrand

Description

On busy instances, audit_log_verify_chain() reports breaks even with zero tampering. Surfaced on the US droplet immediately after the v0.68.2 deploy activated the audit hash chain for the first time on a high-traffic instance: ~20 false breaks accumulated within the first few minutes, all from live agent telemetry rows (agent.sessions.submit, agent.security_status.submit, agent.patches.submit, …), and the count grows continuously under load. The 89k historical (backfilled) rows verify clean — only concurrently-inserted live rows break.

This is a pre-existing design bug in the chain (PR #900), not a regression from the convert_to bytea fix (#994). The deploy merely exercised the chain under concurrent writes for the first time. EU (low traffic, ~5k rows) shows 0 breaks but has the same latent bug.

No availability or data impact: every audit row is written and individually checksummed correctly. Only the chain linkage forks, which degrades the tamper-detection signal: with false positives accumulating, a real tampering break cannot be told apart from a concurrency fork.

Root Cause

The BEFORE INSERT trigger audit_log_compute_checksum() selects its predecessor with no serialization:

SELECT checksum INTO prev
FROM audit_logs
WHERE org_id IS NOT DISTINCT FROM NEW.org_id
  AND id <> NEW.id
ORDER BY timestamp DESC, id DESC
LIMIT 1;

When N transactions insert audit rows for the same org concurrently, each reads the same latest committed checksum as prev and chains off it → the chain forks. Verified on US: 6 distinct prev_checksum values were each used by 2–5 rows (a clean serial chain uses each exactly once), all within one 4-minute window of concurrent agent telemetry. audit_log_verify_chain walks timestamp, id order and flags every forked row as a break.

The same-timestamp variant of this was already noted as known future work in the test suite:
apps/api/src/__tests__/integration/audit-checksum.integration.test.ts:110-111: "requires a chain_seq bigserial + per-org advisory lock — tracked as future-task hardening." In practice it manifests under plain write concurrency, not just same-transaction batches.
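The fork described above can be reproduced in a small in-memory model. This is only a sketch: the Row shape, the function names, and the FNV-1a stand-in hash are all assumptions made for illustration, not the real trigger or checksum. It contrasts inserts that each read after the previous commit with inserts that all read the same latest committed checksum.

```typescript
// Toy model only: Row, toyHash and the insert helpers are hypothetical.
type Row = { id: number; prev: string | null; checksum: string };

// 32-bit FNV-1a, used here as a stand-in for the real checksum function.
function toyHash(s: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts to stay in 32-bit range.
    h = (h + (h << 1) + (h << 4) + (h << 7) + (h << 8) + (h << 24)) >>> 0;
  }
  return h.toString(16);
}

// Models the predecessor SELECT: checksum of the latest committed row.
function latestChecksum(committed: Row[]): string | null {
  return committed.length > 0 ? committed[committed.length - 1].checksum : null;
}

function makeRow(id: number, prev: string | null): Row {
  return { id, prev, checksum: toyHash((prev === null ? "" : prev) + "|row-" + id) };
}

// Serial inserts: each transaction reads after the previous one has committed.
function insertSerial(committed: Row[], n: number): Row[] {
  const out = committed.slice();
  for (let i = 0; i < n; i++) {
    out.push(makeRow(out.length + 1, latestChecksum(out)));
  }
  return out;
}

// Concurrent inserts: all n transactions read prev before any of them commits.
function insertConcurrent(committed: Row[], n: number): Row[] {
  const prev = latestChecksum(committed);
  const out = committed.slice();
  for (let i = 0; i < n; i++) {
    out.push(makeRow(committed.length + 1 + i, prev));
  }
  return out;
}

// Walks rows in id order and counts rows whose prev is not the predecessor's checksum.
function countBreaks(rows: Row[]): number {
  let breaks = 0;
  for (let i = 1; i < rows.length; i++) {
    if (rows[i].prev !== rows[i - 1].checksum) breaks++;
  }
  return breaks;
}

// Returns prev values used by more than one row (a clean chain has none).
function prevReuseCounts(rows: Row[]): { [prev: string]: number } {
  const counts: { [prev: string]: number } = {};
  for (const r of rows) {
    if (r.prev !== null) counts[r.prev] = (counts[r.prev] || 0) + 1;
  }
  const reused: { [prev: string]: number } = {};
  for (const k of Object.keys(counts)) {
    if (counts[k] > 1) reused[k] = counts[k];
  }
  return reused;
}
```

In this model, a batch passed to insertConcurrent shares one prev value, so the walk flags the rows after the first in that batch, while the same batch passed to insertSerial verifies clean. prevReuseCounts mirrors the check used on US, where several prev_checksum values were each used by more than one row.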

Proposed Fix

  1. Serialize per-org inserts in the trigger by taking a transaction-scoped advisory lock before the predecessor SELECT, so each concurrent same-org insert waits for the previous one to commit and then chains off its checksum:
    PERFORM pg_advisory_xact_lock(<namespace_const>, hashtext(COALESCE(NEW.org_id::text, 'NULL')));
    (or use the chain_seq bigserial approach). This assumes the default READ COMMITTED isolation, where the SELECT issued after the lock is acquired takes a fresh snapshot. Ship it as a new migration that CREATE OR REPLACEs the trigger function.
  2. Heal migration to re-anchor already-forked rows (re-run the per-org backfill in timestamp, id order). Must deploy the lock fix first, otherwise new forks keep appearing during/after the heal.
  3. Concurrency regression test (TDD): insert many same-org audit rows concurrently, assert verify_chain returns 0 breaks.
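The heal in step 2 amounts to sorting one org's rows by (timestamp, id) and rebuilding the links in that order. The sketch below models that in memory. The AuditRow shape, the payload field, and the FNV-1a stand-in hash are assumptions for illustration; the real migration would do this in SQL with the trigger's actual checksum input.

```typescript
// Hypothetical row shape; the real audit_logs columns are not shown in this issue.
type AuditRow = {
  id: number;
  timestamp: number; // epoch milliseconds
  payload: string; // stand-in for whatever the real checksum covers
  prevChecksum: string | null;
  checksum: string;
};

// 32-bit FNV-1a as a stand-in for the real checksum function.
function toyHash(s: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts to stay in 32-bit range.
    h = (h + (h << 1) + (h << 4) + (h << 7) + (h << 8) + (h << 24)) >>> 0;
  }
  return h.toString(16);
}

// Order used by the chain walk: timestamp first, id as the tie-breaker.
function chainOrder(rows: AuditRow[]): AuditRow[] {
  return rows.slice().sort((a, b) => a.timestamp - b.timestamp || a.id - b.id);
}

// Re-anchors one org's rows: each row is linked to its predecessor in chain order
// and its checksum is recomputed from that link plus the row payload.
function healOrgChain(rows: AuditRow[]): AuditRow[] {
  const healed: AuditRow[] = [];
  let prev: string | null = null;
  for (const r of chainOrder(rows)) {
    const checksum = toyHash((prev === null ? "" : prev) + "|" + r.payload);
    healed.push({ ...r, prevChecksum: prev, checksum });
    prev = checksum;
  }
  return healed;
}

// Counts rows whose prevChecksum does not match the predecessor in chain order.
function countBreaks(rows: AuditRow[]): number {
  const sorted = chainOrder(rows);
  let breaks = 0;
  for (let i = 1; i < sorted.length; i++) {
    if (sorted[i].prevChecksum !== sorted[i - 1].checksum) breaks++;
  }
  return breaks;
}
```

Running countBreaks before and after healOrgChain is the in-memory analogue of running verify_chain around the heal migration. As the list above notes, the heal only stays clean if new inserts are already serialized.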

Design tension to weigh

A per-org hash chain inherently serializes that org's audit writes. For high-frequency agent telemetry (the rows that forked here), a per-org advisory lock could become a write-throughput bottleneck. Worth deciding whether every telemetry event belongs in the tamper-evident chain, or only security-relevant events (auth, RBAC, config, retention), with high-volume telemetry excluded.
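If the decision goes toward chaining only security-relevant events, the split could be expressed as a simple predicate on the action name. The prefixes below are hypothetical and chosen only to mirror the categories named above (auth, RBAC, config, retention); the real action naming scheme outside the agent.* examples is an assumption.

```typescript
// Hypothetical action-name prefixes for events that stay in the tamper-evident chain.
const CHAINED_PREFIXES: string[] = ["auth.", "rbac.", "config.", "retention."];

// Returns true when an audit action should be linked into the per-org hash chain.
// Anything else (for example high-volume agent telemetry) would still be written
// and individually checksummed, but would not take the per-org lock.
function belongsInChain(action: string): boolean {
  for (const prefix of CHAINED_PREFIXES) {
    if (action.indexOf(prefix) === 0) return true;
  }
  return false;
}
```

With this split, the telemetry rows that forked here (such as agent.sessions.submit) would bypass the chain, so the per-org lock would only be contended by the lower-volume security events.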

Affected Files

  • apps/api/migrations/2026-05-25-c-audit-log-checksum-canonical-fix.sql:38 — canonical audit_log_compute_checksum() trigger (predecessor SELECT, no lock)
  • apps/api/migrations/2026-05-25-b-audit-log-checksum-chain.sql:25 — original definition (same issue)
  • apps/api/src/__tests__/integration/audit-checksum.integration.test.ts:100-111 — existing note documenting the limitation
  • New migrations: per-org advisory-lock trigger fix + forked-row heal

Reported By

Discovered during the v0.68.2 production deploy (2026-05-29). Both EU and US are on v0.68.2 and healthy; US's verify_chain accumulates false breaks under load until the lock fix ships.
