Merged
34 commits
24ed5ab
Fix _escape_libpq to use SQL-style '' for embedded quotes
jghoman May 4, 2026
64d90d6
Always restore target_file_size to DEFAULT_TARGET_FILE_SIZE
jghoman May 4, 2026
ea9adbf
Set temp_directory to /tmp/duckdb_spill in maintenance connect()
jghoman May 4, 2026
63fc7f5
Disable enable_http_logging and pg_debug_show_queries by default
jghoman May 4, 2026
dd39339
Centralize DuckLake ATTACH name and metadata schema in constants
jghoman May 4, 2026
a1c0ca0
Use CREATE SECRET for S3 in justfile shell setup
jghoman May 4, 2026
9a9fc2a
Pre-ATTACH upstream Postgres as 'pg' in justfile shell setup
jghoman May 4, 2026
5076f09
Add pytz dependency for DuckDB TIMESTAMPTZ conversion
jghoman May 4, 2026
8de874b
Document __ducklake_metadata_<attach> schema convention in AGENT.md
jghoman May 4, 2026
079cfc2
Bound DuckDB resource use during compact subcommand
jghoman May 4, 2026
9765fa2
Log cleanup throughput as a single structured line
jghoman May 4, 2026
4352b2c
Add dedup-deletions and SQL macro file for maintenance recipes
jghoman May 4, 2026
4bb9818
Add find-orphans subcommand and find_catalog_orphans macro
jghoman May 4, 2026
8502f86
Take a Postgres advisory lock around dedup-deletions
jghoman May 4, 2026
2d30e76
Add heal-orphans subcommand with B1/B3 safety gates
jghoman May 4, 2026
17128e6
Add cleanup-all-safe orchestrator subcommand
jghoman May 4, 2026
193c144
Add fsck end-to-end catalog-healthy recipe
jghoman May 4, 2026
31a805e
Fix two real-stack bugs in maintenance.py connect+lock paths
jghoman May 4, 2026
b5bea83
Document catalog-side orphan recovery in README and AGENT.md
jghoman May 4, 2026
424ed51
Drop str.format() templating from maintenance.sql
jghoman May 4, 2026
144bdc4
Honor DUCKDB_S3_ENDPOINT / URL_STYLE / USE_SSL in justfile shell setup
jghoman May 4, 2026
45c4cd2
Take heal-orphans advisory lock before materializing the orphan set
jghoman May 4, 2026
247563b
Run heal-orphans gates in fsck dry-run
jghoman May 4, 2026
4c3f846
Add unit tests for maintenance.py helpers and orchestrators
jghoman May 4, 2026
1dbeab1
Normalize path forms in heal-orphans B1 safety gate
jghoman May 4, 2026
2e1acfb
Preserve operator-configured target_file_size across compactions
jghoman May 4, 2026
cca9744
Restore parity between just shell setup and maintenance.py connect()
jghoman May 4, 2026
13d768e
Stop logging false-positive warning when prior target_file_size is th…
jghoman May 4, 2026
9b12a79
Revert _escape_libpq to libpq grammar (backslash escapes)
jghoman May 4, 2026
c91a06f
Default-off debug logging in just shell + add macro coverage
jghoman May 4, 2026
c5da4da
Filter heal-orphans gates to live rows + lock the schema-name coupling
jghoman May 4, 2026
e8504ae
Normalize trailing slash in data_path before joining relative keys
jghoman May 4, 2026
949fad6
Fix throughput-log race, validate s3_use_ssl, cover s3:// gate paths
jghoman May 4, 2026
fbab440
Sync README + AGENT.md cleanup-throughput-line description with code
jghoman May 4, 2026
1 change: 1 addition & 0 deletions .gitignore
@@ -9,3 +9,4 @@ dist/
.claude/settings.local.json
session-*.md
test/ssl/
FOLLOWUPS.md
18 changes: 15 additions & 3 deletions AGENT.md
@@ -24,15 +24,27 @@ Prefer fixup commits over amending and force-pushing.

## Maintenance Tooling

`tools/maintenance.py` is a self-contained Python script for DuckLake maintenance operations (snapshot expiry, file cleanup, orphan deletion, checkpoint, tiered compaction). It connects to DuckLake using the same env vars as the main app (`DUCKLAKE_RDS_*`, `DUCKDB_S3_*`, `DUCKLAKE_DATA_PATH`) but does not import from the `millpond` package. Designed to run as a K8s CronJob reusing the same Docker image.
`tools/maintenance.py` is a self-contained Python script for DuckLake maintenance operations (snapshot expiry, file cleanup, orphan deletion, checkpoint, tiered compaction, deletion-queue dedup, catalog-side orphan recovery, fsck). It connects to DuckLake using the same env vars as the main app (`DUCKLAKE_RDS_*`, `DUCKDB_S3_*`, `DUCKLAKE_DATA_PATH`) but does not import from the `millpond` package. Designed to run as a K8s CronJob reusing the same Docker image.

The `compact` subcommand implements tiered compaction: each tier wraps `ducklake_merge_adjacent_files()` with a save/restore of the catalog's `target_file_size` option (which `ducklake_set_option` writes durably and cannot be unset). Tier specs and the steady-state default live in `TIERS` and `DEFAULT_TARGET_FILE_SIZE` at the top of the file. Bin semantics on DuckLake 1.4.x are `min_file_size` inclusive, `max_file_size` exclusive — so the tier ranges `[0, 1 MiB)`, `[1 MiB, 10 MiB)`, `[10 MiB, 64 MiB)` partition the file-size space without overlap. The `compact-probe` subcommand runs `ducklake_merge_adjacent_files()` against one table with a small `max_compacted_files` cap and no `target_file_size` change — used as a lightweight diagnostic from a periodic CronJob.
`connect()` does two ATTACHes against the same upstream Postgres: the DuckLake catalog as `lake` (the `ATTACH_NAME` constant) and a direct Postgres ATTACH as `pg` (the `PG_ATTACH_NAME` constant) used by `postgres_execute` / `postgres_query` for ctid-based DML and advisory-lock acquisition. S3 access is configured via both the legacy `SET s3_*` settings (honored by the ducklake catalog driver) and a `CREATE OR REPLACE SECRET s3` (required for ad-hoc httpfs ops like `glob('s3://...')` and `read_parquet('s3://...')` in DuckDB 1.4 — the legacy form alone returns 403 for those paths). Spills are pointed at `/tmp/duckdb_spill` so they land on the writable cron-pod emptyDir, not the read-only rootfs.

The `compact` subcommand implements tiered compaction: each tier wraps `ducklake_merge_adjacent_files()` with a save/restore of the catalog's `target_file_size` option (which `ducklake_set_option` writes durably and cannot be unset). Tier specs and the steady-state default live in `TIERS` and `DEFAULT_TARGET_FILE_SIZE` at the top of the file. Bin semantics on DuckLake 1.4.x are `min_file_size` inclusive, `max_file_size` exclusive — so the tier ranges `[0, 1 MiB)`, `[1 MiB, 10 MiB)`, `[10 MiB, 64 MiB)` partition the file-size space without overlap. The `compact-probe` subcommand runs `ducklake_merge_adjacent_files()` against one table with a small `max_compacted_files` cap and no `target_file_size` change — used as a lightweight diagnostic from a periodic CronJob. `compact` exposes `--threads` (default 2) and `--memory-limit` (default 4GB) because `ducklake_merge_adjacent_files` over-uses memory relative to input volume in 1.4 and the conservative defaults keep the cron pod alive on real lakes; raise them when the lake fits.

DuckLake stores its catalog tables (`ducklake_data_file`, `ducklake_table`, `ducklake_files_scheduled_for_deletion`, etc.) in a Postgres schema named `__ducklake_metadata_<attach_name>`, where `<attach_name>` is the alias from the `ATTACH … AS <name>` statement. `maintenance.py` exposes the alias as the `ATTACH_NAME` constant and the derived `METADATA_SCHEMA = f"__ducklake_metadata_{ATTACH_NAME}"`; any new code that reads the catalog directly (rather than through the `ducklake_*` SQL functions) must reference `METADATA_SCHEMA` so the attach name and schema name never drift.
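The convention above compresses to a couple of lines of code. The two constants are quoted from the description; the `catalog_table` helper is a hypothetical illustration of how new catalog-reading code should consume them:

```python
# The ATTACH alias and the Postgres schema DuckLake derives from it,
# as documented for tools/maintenance.py.
ATTACH_NAME = "lake"
METADATA_SCHEMA = f"__ducklake_metadata_{ATTACH_NAME}"

# Hypothetical helper (not in the real script): qualify a catalog table
# through the constant so the attach name and schema name can never drift.
def catalog_table(name):
    return f"{METADATA_SCHEMA}.{name}"
```

`catalog_table("ducklake_data_file")` then yields `__ducklake_metadata_lake.ducklake_data_file`.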

`tools/maintenance.sql` is executed verbatim at every session start, both by `maintenance.py`'s `connect()` and by the `just shell` recipe (via the duckdb CLI's `.read` meta-command). The file contains no templating — `__ducklake_metadata_lake` and the rest are written literally so both load paths stay consistent. It defines runtime macros (`count_pending_dups()`, `find_catalog_orphans(data_path)`) and documents the constraints any new recipe must follow: no `LEFT ANTI JOIN` (DuckDB 1.4 limitation; use `LEFT JOIN ... WHERE rhs IS NULL`), no Postgres `ctid` from duckdb-side SQL (use `postgres_execute` / `postgres_query` instead), no literal `glob('s3://...')` inside `CREATE MACRO` bodies (DuckDB 1.4 evaluates them eagerly at macro creation, which would S3-LIST the lake on every connect — pass the path as a parameter), and the advisory-lock key.
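The `LEFT JOIN ... WHERE rhs IS NULL` rewrite is worth spelling out once. The two SQL forms appear in the comments below; the pure-Python function is only a model of the semantics, with invented names:

```python
# DuckDB 1.4 rejects:
#   SELECT l.* FROM lhs l LEFT ANTI JOIN rhs r ON l.path = r.path
# Equivalent form that maintenance.sql recipes must use instead:
#   SELECT l.* FROM lhs l LEFT JOIN rhs r ON l.path = r.path
#   WHERE r.path IS NULL
# The same semantics in Python: keep left rows with no match on the right.
def anti_join(left_rows, right_rows, key):
    right_keys = {r[key] for r in right_rows}
    return [row for row in left_rows if row[key] not in right_keys]

catalog = [{"path": "a.parquet"}, {"path": "b.parquet"}]
s3_listing = [{"path": "a.parquet"}]
assert anti_join(catalog, s3_listing, "path") == [{"path": "b.parquet"}]
```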

The catalog-side orphan-recovery subcommands (`dedup-deletions`, `find-orphans`, `heal-orphans`, `cleanup-all-safe`, `fsck`) form a self-contained recovery toolkit for the failure mode where an interrupted `cleanup-all` leaves the catalog with rows pointing at S3 keys that no longer exist (because the upstream txn rolled back but the S3 deletes are permanent). `heal-orphans` runs two safety gates before deleting: B1 proves `ducklake_data_file` is non-empty AND no orphan path is still live (no vacuous pass on an empty catalog, no false positive that would delete a live file), and B3 aborts if any positional-delete vector references an "orphan" id (such a file is still live for vector lookups). `cleanup-all-safe` is the orchestrator that loops dedup + heal + cleanup-all under one advisory lock until cleanup-all exits clean; `fsck` adds the `ducklake_delete_orphaned_files` S3-side sweep on top. All destructive subcommands take `pg_try_advisory_lock(hashtext('millpond-ducklake-maintenance')::bigint)` on the `pg` ATTACH; the lock is held by the `pg` connection (not the `lake` connection that DuckLake uses internally), so it provides mutual exclusion between maintenance invocations but not catalog-write atomicity against arbitrary writers — document this caveat anywhere the lock is mentioned.
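A hedged sketch of the B1 gate's shape — catalog access is injected as a `query` callable, and the table/column names follow the documented schema convention but are assumptions, not the script's actual code:

```python
def b1_gate(orphan_paths, query):
    """Refuse to delete when the gate can't prove safety: an empty
    ducklake_data_file would make the orphan check vacuously true, and an
    'orphan' path still referenced by a live row would be a false positive
    whose deletion removes a live file."""
    (n_live,) = query(
        "SELECT count(*) FROM __ducklake_metadata_lake.ducklake_data_file")[0]
    if n_live == 0:
        raise RuntimeError("B1: ducklake_data_file is empty; refusing to heal")
    live_paths = {row[0] for row in query(
        "SELECT path FROM __ducklake_metadata_lake.ducklake_data_file")}
    overlap = orphan_paths & live_paths
    if overlap:
        raise RuntimeError(f"B1: {len(overlap)} orphan path(s) still live; aborting")
```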

`main()` logs `millpond <version> (maintenance)` on startup using `importlib.metadata.version("millpond")`, mirroring `millpond/main.py`. If `MILLPOND_IMAGE` is set in the env, the image identifier is appended (`image=<value>`). The chart-side wiring is optional — without it, the package version is sufficient to identify which build is running.

`cleanup` and `cleanup-all` log a single structured throughput line on completion: `cleanup throughput: files_processed=N elapsed_s=T rate_obj_s=R queue_depth_after=A`. `files_processed` comes directly from `len(result)` — the count of rows `ducklake_cleanup_old_files` returned — rather than a queue-depth delta. A delta would be misleading whenever any other writer enqueues deletions during the call (the maintenance advisory lock by design only mutexes maintenance invocations, not arbitrary writers); `len(result)` is accurate regardless. `queue_depth_after` is queried with a single post-call snapshot and shows remaining work but is not used in the rate calculation. The line is intentionally suppressed on `--dry-run` because dry-run returns preview rows and a rate computed from those would falsely claim work was done.

If `PUSHGATEWAY_URL` is set, the script pushes two metrics: `maintenance_start_time{operation}` (pushed immediately on start) and `maintenance_duration_seconds{operation, status}` (pushed on completion). This enables Grafana annotation queries for maintenance windows.

`tools/justfile` wraps the script for interactive use and is copied to `/justfile` in the Docker image. The `shell` and `drop` recipes still use the DuckDB CLI directly. The `DUCKDB` env var can override the duckdb binary path.
DuckDB's HTTP logging and the postgres extension's `pg_debug_show_queries` are off by default — both add per-call overhead that compounds across tens of thousands of S3 deletes. Pass `--debug` at the maintenance.py command level to opt back into both for short-lived debugging.

`tools/justfile` wraps the script for interactive use and is copied to `/justfile` in the Docker image. The `shell` recipe pre-ATTACHes both `lake` and `pg`, configures the S3 SECRET, and `.read`s `tools/maintenance.sql` so every session starts with the macros loaded; the `drop` recipe still uses the DuckDB CLI directly. The `DUCKDB` env var can override the duckdb binary path; `MAINTENANCE_SCRIPT` and `MAINTENANCE_SQL` env vars override the in-image paths for dev use.

## What This Is

1 change: 1 addition & 0 deletions Dockerfile
@@ -24,6 +24,7 @@ COPY --from=builder /app/millpond /app/millpond
COPY --from=builder /root/.local/bin/duckdb /usr/local/bin/duckdb
COPY tools/justfile /justfile
COPY tools/maintenance.py /app/tools/maintenance.py
COPY tools/maintenance.sql /app/tools/maintenance.sql

ENV PATH="/app/.venv/bin:$PATH"

58 changes: 46 additions & 12 deletions README.md
@@ -94,28 +94,60 @@ Requires Docker (uses `keytool` from the Kafka container image for cert generation)

### DuckLake Maintenance

`tools/maintenance.py` is a self-contained Python script for DuckLake maintenance operations (snapshot expiry, file cleanup, orphan deletion, checkpoint). It is baked into the Docker image at `/app/tools/maintenance.py` and designed to run as a K8s CronJob reusing the same image and credentials as the main application.
`tools/maintenance.py` is a self-contained Python script for DuckLake maintenance operations (snapshot expiry, file cleanup, orphan deletion, checkpoint, tiered compaction, deletion-queue dedup, catalog-side orphan recovery). It is baked into the Docker image at `/app/tools/maintenance.py` and designed to run as a K8s CronJob reusing the same image and credentials as the main application.

```bash
python /app/tools/maintenance.py maintain --days 7 # expire snapshots + cleanup files
python /app/tools/maintenance.py maintain --days 7 # expire snapshots + cleanup files
python /app/tools/maintenance.py maintain --days 7 --dry-run # preview only
python /app/tools/maintenance.py expire --days 3 # expire snapshots only
python /app/tools/maintenance.py cleanup --days 1 # cleanup scheduled files only
python /app/tools/maintenance.py checkpoint # integrated merge + expire + cleanup
python /app/tools/maintenance.py orphans # delete orphaned S3 files
python /app/tools/maintenance.py expire --days 3 # expire snapshots only
python /app/tools/maintenance.py cleanup --days 1 # cleanup scheduled files only
python /app/tools/maintenance.py cleanup-all # cleanup all scheduled files regardless of age
python /app/tools/maintenance.py dedup-deletions # drop duplicate rows in the pending-deletion queue
python /app/tools/maintenance.py find-orphans # list catalog rows whose S3 key no longer exists
python /app/tools/maintenance.py heal-orphans # delete those catalog rows (gated B1/B3 safety checks)
python /app/tools/maintenance.py cleanup-all-safe # dedup + heal-orphans + cleanup-all in a loop until clean
python /app/tools/maintenance.py fsck # cleanup-all-safe + ducklake_delete_orphaned_files
python /app/tools/maintenance.py checkpoint # integrated merge + expire + cleanup
python /app/tools/maintenance.py orphans # delete S3-side orphaned files (catalog has no row)
python /app/tools/maintenance.py compact --tier 1 # tiered compaction (see "When to add a merge job")
```

The script logs `cleanup throughput: files_processed=N elapsed_s=T rate_obj_s=R queue_depth_after=A` after every `cleanup` / `cleanup-all` (skipped on `--dry-run`), so you can confirm steady-state throughput without enabling debug logging. `files_processed` is the actual count of files the call returned, not a queue-depth delta, so the number is accurate even when other writers enqueue deletions during the run. Pass `--debug` to opt back into DuckDB's HTTP and Postgres-extension query logging — both are off by default because they add per-call overhead that compounds across tens of thousands of S3 deletes.
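As a sketch (not the script's actual code), the line reduces to a few fields computed from the call result:

```python
def cleanup_throughput_line(result, elapsed_s, queue_depth_after):
    # files_processed is the row count ducklake_cleanup_old_files returned,
    # not a before/after queue-depth delta -- concurrent writers enqueueing
    # new deletions during the call therefore can't skew the number.
    files = len(result)
    rate = files / elapsed_s if elapsed_s > 0 else 0.0
    return (f"cleanup throughput: files_processed={files} "
            f"elapsed_s={elapsed_s:.1f} rate_obj_s={rate:.1f} "
            f"queue_depth_after={queue_depth_after}")
```

For example, 50 files in 10 s with 7 rows left yields `cleanup throughput: files_processed=50 elapsed_s=10.0 rate_obj_s=5.0 queue_depth_after=7`.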

If `PUSHGATEWAY_URL` is set, the script pushes `maintenance_start_time` (on start) and `maintenance_duration_seconds` (on completion) to a Prometheus Pushgateway, enabling Grafana annotation queries for maintenance windows.

#### Catalog-side orphan recovery

If a `cleanup-all` run is interrupted (DuckLake bug: an S3 NoSuchKey on DELETE rolls back the whole transaction, but the already-completed S3 deletes are permanent), the catalog ends up with rows in `ducklake_files_scheduled_for_deletion` that point at S3 keys that no longer exist. Every subsequent `cleanup-all` will crash on those orphans until they're cleaned up. The catalog-recovery subcommands handle this without manual SQL surgery:

| Subcommand | Action |
|---|---|
| `find-orphans` | List orphan rows on stdout (read-only). |
| `heal-orphans` | Delete the orphan rows. Two safety gates: B1 proves `ducklake_data_file` is non-empty AND no orphan path is still live; B3 aborts if any positional-delete vector references an orphan id. `--dry-run` runs the gates but skips the DELETE. |
| `cleanup-all-safe` | Loop dedup-deletions + heal-orphans + cleanup-all under one advisory lock until cleanup-all exits clean. Caps at `--max-iterations` (default 10). |
| `fsck` | `cleanup-all-safe` followed by `ducklake_delete_orphaned_files` (S3-side orphan sweep). The end-to-end "lake catalog is healthy" recipe. |
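The orchestrator's control flow can be sketched with the three steps injected as callables (a hypothetical shape, not the script's code; `cleanup_all` here returns `True` once it exits clean):

```python
def cleanup_all_safe(dedup_deletions, heal_orphans, cleanup_all, max_iterations=10):
    # One advisory lock is assumed to be held around the whole loop.
    for iteration in range(1, max_iterations + 1):
        dedup_deletions()
        heal_orphans()
        if cleanup_all():
            return iteration  # deletion queue drained cleanly
    raise RuntimeError(f"cleanup-all still dirty after {max_iterations} iterations")
```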

Mutual exclusion comes from `pg_try_advisory_lock(hashtext('millpond-ducklake-maintenance')::bigint)` taken on the `pg` ATTACH; concurrent maintenance invocations bail with a clear error rather than racing each other's DELETEs.
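The acquisition itself is one round-trip; this sketch wraps the documented SQL (the function name and the injected query runner are illustrative):

```python
LOCK_SQL = (
    "SELECT pg_try_advisory_lock("
    "hashtext('millpond-ducklake-maintenance')::bigint)"
)

def acquire_maintenance_lock(postgres_query):
    # Must run on the direct `pg` ATTACH: the lock belongs to that session,
    # so it mutexes maintenance invocations but says nothing about other
    # catalog writers.
    (acquired,) = postgres_query(LOCK_SQL)[0]
    if not acquired:
        raise SystemExit("another maintenance invocation holds the advisory lock")
```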

`tools/maintenance.sql` is loaded at every session start (both by `maintenance.py` and by the `just shell` recipe) and defines small DuckDB macros for ad-hoc inspection — `SELECT count_pending_dups()` for queue dup count, `SELECT * FROM find_catalog_orphans('s3://bucket/lake/data')` for the orphan list. The header documents the conventions (no `LEFT ANTI JOIN`, no duckdb-side `ctid`, advisory-lock key) that any new recipe must follow.

`tools/justfile` wraps the script and is also baked into the image at `/justfile` for interactive use:

```bash
just --list # see available recipes
just maintain-dry-run 3 # preview: expire >3 day snapshots + cleanup
just maintain 3 # execute it
just shell # interactive DuckDB shell connected to DuckLake
just drop events # drop a table (data files remain until cleanup)
just orphans-dry-run # preview orphaned S3 files
just --list # see available recipes
just maintain-dry-run 3 # preview: expire >3 day snapshots + cleanup
just maintain 3 # execute it
just dedup-deletions-dry-run # preview duplicate rows in the pending-deletion queue
just dedup-deletions # drop them
just find-orphans # list catalog-side orphan rows
just heal-orphans-dry-run # preview heal-orphans (gates only, no DELETE)
just heal-orphans # delete catalog-side orphan rows
just cleanup-all-safe # dedup + heal + cleanup-all in a loop
just fsck-dry-run # preview fsck end-to-end
just fsck # bring catalog to known-good state
just shell # interactive DuckDB shell with lake + pg ATTACHed and macros loaded
just drop events # drop a table (data files remain until cleanup)
just orphans-dry-run # preview S3-side orphaned files
```

All commands use the pod's existing env vars (`DUCKLAKE_RDS_*`, `DUCKDB_S3_*`, `DUCKLAKE_DATA_PATH`).
@@ -242,6 +274,8 @@ Tier ranges (verified semantics: `min_file_size` inclusive, `max_file_size` exclusive):
| `compact-to-tier-2` | `[1 MiB, 10 MiB)` | ~32 MiB |
| `compact-to-tier-3` | `[10 MiB, 64 MiB)` | ~128 MiB |

The `compact` subcommand bounds DuckDB resource use during the merge — `--threads` (default 2) and `--memory-limit` (default 4GB) — because `ducklake_merge_adjacent_files` isn't fully streaming today and over-uses memory relative to input size. The defaults are conservative; raise them on lakes that fit comfortably in pod memory.

This is an out-of-band maintenance operation, not part of the hot path.
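Under the verified semantics, tier membership is a half-open-interval lookup. The bounds below mirror the tier table above; the helper itself is illustrative:

```python
MIB = 2**20
# (min inclusive, max exclusive) per tier, matching the table above.
TIERS = {1: (0, 1 * MIB), 2: (1 * MIB, 10 * MIB), 3: (10 * MIB, 64 * MIB)}

def tier_for(size_bytes):
    for tier, (lo, hi) in TIERS.items():
        if lo <= size_bytes < hi:  # min inclusive, max exclusive
            return tier
    return None  # >= 64 MiB: already past the tiers; leave the file alone
```

A file of exactly 1 MiB falls in tier 2, not tier 1 — the exclusive upper bound is what keeps the ranges from overlapping.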

See the [sizing calculator](https://posthog.github.io/millpond/sizing-calculator.html) for interactive estimates.
11 changes: 9 additions & 2 deletions millpond/ducklake.py
@@ -16,8 +16,15 @@
def _escape_libpq(value: str | None) -> str:
"""Escape a value for a libpq connection string.

Wraps in single quotes and backslash-escapes internal single quotes and backslashes.
See: https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING
Wraps in single quotes and backslash-escapes internal single quotes and
backslashes, per the libpq connstring grammar:

https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING

Note this is *not* the same parser as Postgres SQL string literals — the
SQL parser uses ``''`` for embedded quotes and is governed by
``standard_conforming_strings``; the libpq connstring parser is a
separate grammar that has always required ``\\'`` and ``\\\\``.
"""
if value is None:
return "''"
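From the docstring, the escaping rule sketches as follows (hedged — the real `_escape_libpq` may differ in detail, but the grammar it implements is this one):

```python
def escape_libpq(value):
    """Quote a value for a libpq connection string: wrap in single quotes
    and backslash-escape embedded backslashes and single quotes. This is
    the connstring grammar, not the SQL string-literal grammar (which
    doubles the quote instead)."""
    if value is None:
        return "''"
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"
```

So `escape_libpq("o'brien")` produces `'o\'brien'`, where SQL-literal quoting would have produced `'o''brien'`.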
4 changes: 4 additions & 0 deletions pyproject.toml
@@ -20,6 +20,9 @@ dependencies = [
"pyarrow>=18.0",
"orjson>=3.10",
"prometheus-client>=0.21",
# Required by duckdb's Python conversion for TIMESTAMPTZ columns; the
# stdlib zoneinfo module isn't accepted as a substitute as of 1.4.x.
"pytz>=2024.1",
]

[project.optional-dependencies]
@@ -47,6 +50,7 @@ markers = [
"integration: marks tests as integration tests (deselect with '-m \"not integration\"')",
"e2e: marks tests as E2E tests requiring docker-compose stack",
]
pythonpath = ["tools"]

[tool.uv]
constraint-dependencies = [
2 changes: 2 additions & 0 deletions tests/unit/test_ducklake.py
@@ -74,6 +74,8 @@ def test_quoted_string_rejected(self):


class TestEscapeLibpq:
"""libpq connstring grammar: backslash escapes, NOT SQL-style doubled quotes."""

def test_plain_value(self):
assert _escape_libpq("ducklake") == "'ducklake'"
