[API] Audit hash-chain forks under concurrent writes — verify_chain false positives on busy instances #1002

## Description

On busy instances, `audit_log_verify_chain()` reports breaks even when no row has been tampered with. This surfaced on the **US droplet** immediately after the v0.68.2 deploy activated the audit hash chain for the first time on a high-traffic instance. About 20 false breaks accumulated within the first few minutes, **all** from live agent telemetry rows (`agent.sessions.submit`, `agent.security_status.submit`, `agent.patches.submit`, …), and the count keeps growing under load. The 89k historical (backfilled) rows verify clean; only concurrently inserted live rows break.

This is a **pre-existing design bug** in the chain (PR #900), not a regression from the `convert_to` bytea fix (#994). The deploy merely exercised the chain under concurrent writes for the first time. EU (low traffic, ~5k rows) shows 0 breaks but carries the same latent bug.

**No availability or data impact:** every audit row is written and individually checksummed correctly. Only the chain *linkage* forks, which degrades the tamper-detection signal: once false positives are present, a break caused by real tampering cannot be told apart from a false one.

## Root Cause

The BEFORE INSERT trigger `audit_log_compute_checksum()` selects its predecessor with no serialization:

```sql
SELECT checksum INTO prev
FROM audit_logs
WHERE org_id IS NOT DISTINCT FROM NEW.org_id
  AND id <> NEW.id
ORDER BY timestamp DESC, id DESC
LIMIT 1;
```

When N transactions insert audit rows for the same org concurrently, each reads the *same* latest committed checksum as `prev` and chains off it, so the chain **forks**. Verified on US: 6 distinct `prev_checksum` values were each used by 2–5 rows (a clean serial chain uses each exactly once), all within one 4-minute window of concurrent agent telemetry. `audit_log_verify_chain` walks the rows in `timestamp, id` order and flags every forked row as a break.
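
The race can be reproduced in a toy in-memory model, which may help when reasoning about the diagnostic above. This is only a sketch under stated assumptions: `fnv1a` is a stand-in for the real checksum, `insertConcurrently` and `reusedPredecessors` are hypothetical helpers, and "concurrent" is modeled as every transaction running its predecessor read before any of them commits.

```typescript
// Toy model of the fork. Not the real trigger: fnv1a only stands in for
// whatever checksum audit_log_compute_checksum() actually computes.
type AuditRow = { id: number; prevChecksum: string; checksum: string };

function fnv1a(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ("00000000" + h.toString(16)).slice(-8);
}

// All payloads are inserted by overlapping transactions: each one reads the
// latest committed checksum before any of the others has committed.
function insertConcurrently(rows: AuditRow[], payloads: string[]): void {
  const last = rows[rows.length - 1];
  const latestCommitted = last ? last.checksum : "genesis";
  const firstId = rows.length + 1;
  // Phase 1: every transaction runs its predecessor SELECT on the same state.
  const pending = payloads.map((payload, i) => ({
    id: firstId + i,
    prevChecksum: latestCommitted,
    checksum: fnv1a(latestCommitted + "|" + payload),
  }));
  // Phase 2: all of them commit.
  for (const row of pending) rows.push(row);
}

// Returns each prev_checksum used by more than one row, with its use count.
// A clean serial chain yields an empty result.
function reusedPredecessors(rows: AuditRow[]): Record<string, number> {
  const uses: Record<string, number> = {};
  for (const row of rows) {
    uses[row.prevChecksum] = (uses[row.prevChecksum] ?? 0) + 1;
  }
  const reused: Record<string, number> = {};
  for (const prev of Object.keys(uses)) {
    const n = uses[prev] ?? 0;
    if (n > 1) reused[prev] = n;
  }
  return reused;
}

const auditRows: AuditRow[] = [];
insertConcurrently(auditRows, ["seed"]);        // a lone insert: no race
insertConcurrently(auditRows, ["a", "b", "c"]); // three overlapping inserts
const forks = reusedPredecessors(auditRows);
console.log(forks); // one prev_checksum, shared by three rows
```

The grouping in `reusedPredecessors` mirrors the check run on US: any `prev_checksum` that appears on more than one row marks a fork point.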

The same-timestamp variant of this was already noted as known future-work in the test suite:
`apps/api/src/__tests__/integration/audit-checksum.integration.test.ts:110-111` — *"requires a chain_seq bigserial + per-org advisory lock — tracked as future-task hardening."* In practice it manifests under plain write concurrency, not just same-transaction batches.

## Proposed Fix

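1. **Serialize per-org inserts** in the trigger with a transaction-scoped advisory lock taken before the predecessor SELECT. The lock is held until the inserting transaction ends, so each concurrent same-org insert reads its predecessor only after the previous one has committed (assuming the inserts run at READ COMMITTED, PostgreSQL's default, where the SELECT that follows the lock sees newly committed rows):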
   ```sql
   PERFORM pg_advisory_xact_lock(<namespace_const>, hashtext(COALESCE(NEW.org_id::text, 'NULL')));
   ```
   (or the `chain_seq bigserial` approach). This ships as a new migration that `CREATE OR REPLACE`s the trigger function.
2. **Heal migration** to re-anchor already-forked rows (re-run the per-org backfill in `timestamp, id` order). The lock fix must deploy **first**; otherwise new forks keep appearing during and after the heal.
3. **Concurrency regression test** (TDD): insert many same-org audit rows concurrently, assert `verify_chain` returns 0 breaks.
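
The three steps above can be sanity-checked against a small in-memory model before any migration is written. This is a sketch, not the real implementation: `fnv1a` stands in for the real checksum, `countBreaks` assumes `verify_chain` flags a row whenever its `prev_checksum` differs from the preceding row's checksum, and `insertSerialized` and `heal` are hypothetical helpers. It does not exercise PostgreSQL; the actual regression test would still need concurrent connections against a real database.

```typescript
// In-memory model of: fork, heal (step 2), and serialized inserts (step 1).
type AuditRow = { seq: number; payload: string; prevChecksum: string; checksum: string };

const GENESIS = "genesis";

// Stand-in for the real checksum function.
function fnv1a(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ("00000000" + h.toString(16)).slice(-8);
}

function link(prev: string, payload: string): string {
  return fnv1a(prev + "|" + payload);
}

// Assumed verify_chain behavior: walk rows in order and count every row whose
// prevChecksum is not the checksum of the row immediately before it.
function countBreaks(rows: AuditRow[]): number {
  let expected = GENESIS;
  let breaks = 0;
  for (const row of rows) {
    if (row.prevChecksum !== expected) breaks++;
    expected = row.checksum;
  }
  return breaks;
}

// Step 1 model: under the per-org lock, an insert reads its predecessor only
// after the previous insert has committed.
function insertSerialized(rows: AuditRow[], payload: string): void {
  const last = rows[rows.length - 1];
  const prev = last ? last.checksum : GENESIS;
  rows.push({ seq: rows.length + 1, payload, prevChecksum: prev, checksum: link(prev, payload) });
}

// Step 2 model: re-link every row in chain order, recomputing each checksum.
function heal(rows: AuditRow[]): AuditRow[] {
  let prev = GENESIS;
  return rows.map((row) => {
    const healed = { ...row, prevChecksum: prev, checksum: link(prev, row.payload) };
    prev = healed.checksum;
    return healed;
  });
}

// Build a forked chain: three rows all chained off the same predecessor.
const forked: AuditRow[] = [];
insertSerialized(forked, "seed");
const shared = link(GENESIS, "seed");
["a", "b", "c"].forEach((p, i) => {
  forked.push({ seq: i + 2, payload: p, prevChecksum: shared, checksum: link(shared, p) });
});

const serial: AuditRow[] = [];
["seed", "a", "b", "c"].forEach((p) => insertSerialized(serial, p));

const forkedBreaks = countBreaks(forked);
const healedBreaks = countBreaks(heal(forked));
const serialBreaks = countBreaks(serial);
console.log({ forkedBreaks, healedBreaks, serialBreaks });
```

In this model the forked chain reports breaks, while both the healed chain and the serialized chain report none, which is the property the regression test in step 3 would assert against the real trigger.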

## Design tension to weigh

A per-org hash chain inherently serializes that org's audit writes. For high-frequency agent telemetry (the rows that forked here), a per-org advisory lock could become a write-throughput bottleneck. Worth deciding whether every telemetry event belongs in the tamper-evident chain, or only security-relevant events (auth, RBAC, config, retention), with high-volume telemetry excluded.
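
To put a rough number on the bottleneck concern: a transaction-scoped advisory lock is released only when the transaction ends, so same-org audit inserts cannot proceed faster than one per lock-hold interval. The sketch below computes that ceiling. The function names are hypothetical, and the 5 ms hold time and 500 events/s load are made-up figures for illustration, not measurements from US or EU.

```typescript
// Upper bound on same-org audit inserts per second when each insert holds the
// per-org lock from the trigger until its transaction ends.
function maxSerializedInsertsPerSecond(lockHoldMs: number): number {
  if (lockHoldMs <= 0) throw new RangeError("lockHoldMs must be positive");
  return 1000 / lockHoldMs;
}

// True when the offered per-org write rate exceeds what the lock can admit,
// meaning inserts would queue behind the lock.
function lockSaturated(offeredPerSecond: number, lockHoldMs: number): boolean {
  return offeredPerSecond > maxSerializedInsertsPerSecond(lockHoldMs);
}

// Illustrative figures only: 5 ms from lock acquisition to commit.
const ceiling = maxSerializedInsertsPerSecond(5);
const saturated = lockSaturated(500, 5);
console.log(ceiling, saturated);
```

Measuring the real lock-hold time per org would show whether telemetry volume comes anywhere near this ceiling, which in turn informs whether telemetry needs to be excluded from the chain.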

## Affected Files

- `apps/api/migrations/2026-05-25-c-audit-log-checksum-canonical-fix.sql:38` — canonical `audit_log_compute_checksum()` trigger (predecessor SELECT, no lock)
- `apps/api/migrations/2026-05-25-b-audit-log-checksum-chain.sql:25` — original definition (same issue)
- `apps/api/src/__tests__/integration/audit-checksum.integration.test.ts:100-111` — existing note documenting the limitation
- New migrations: per-org advisory-lock trigger fix + forked-row heal

## Reported By

Discovered during the v0.68.2 production deploy (2026-05-29). Both EU and US are on v0.68.2 and healthy; US's `verify_chain` accumulates false breaks under load until the lock fix ships.
