Merged
34 commits
24ed5ab
Fix _escape_libpq to use SQL-style '' for embedded quotes
jghoman May 4, 2026
64d90d6
Always restore target_file_size to DEFAULT_TARGET_FILE_SIZE
jghoman May 4, 2026
ea9adbf
Set temp_directory to /tmp/duckdb_spill in maintenance connect()
jghoman May 4, 2026
63fc7f5
Disable enable_http_logging and pg_debug_show_queries by default
jghoman May 4, 2026
dd39339
Centralize DuckLake ATTACH name and metadata schema in constants
jghoman May 4, 2026
a1c0ca0
Use CREATE SECRET for S3 in justfile shell setup
jghoman May 4, 2026
9a9fc2a
Pre-ATTACH upstream Postgres as 'pg' in justfile shell setup
jghoman May 4, 2026
5076f09
Add pytz dependency for DuckDB TIMESTAMPTZ conversion
jghoman May 4, 2026
8de874b
Document __ducklake_metadata_<attach> schema convention in AGENT.md
jghoman May 4, 2026
079cfc2
Bound DuckDB resource use during compact subcommand
jghoman May 4, 2026
9765fa2
Log cleanup throughput as a single structured line
jghoman May 4, 2026
4352b2c
Add dedup-deletions and SQL macro file for maintenance recipes
jghoman May 4, 2026
4bb9818
Add find-orphans subcommand and find_catalog_orphans macro
jghoman May 4, 2026
8502f86
Take a Postgres advisory lock around dedup-deletions
jghoman May 4, 2026
2d30e76
Add heal-orphans subcommand with B1/B3 safety gates
jghoman May 4, 2026
17128e6
Add cleanup-all-safe orchestrator subcommand
jghoman May 4, 2026
193c144
Add fsck end-to-end catalog-healthy recipe
jghoman May 4, 2026
31a805e
Fix two real-stack bugs in maintenance.py connect+lock paths
jghoman May 4, 2026
b5bea83
Document catalog-side orphan recovery in README and AGENT.md
jghoman May 4, 2026
424ed51
Drop str.format() templating from maintenance.sql
jghoman May 4, 2026
144bdc4
Honor DUCKDB_S3_ENDPOINT / URL_STYLE / USE_SSL in justfile shell setup
jghoman May 4, 2026
45c4cd2
Take heal-orphans advisory lock before materializing the orphan set
jghoman May 4, 2026
247563b
Run heal-orphans gates in fsck dry-run
jghoman May 4, 2026
4c3f846
Add unit tests for maintenance.py helpers and orchestrators
jghoman May 4, 2026
1dbeab1
Normalize path forms in heal-orphans B1 safety gate
jghoman May 4, 2026
2e1acfb
Preserve operator-configured target_file_size across compactions
jghoman May 4, 2026
cca9744
Restore parity between just shell setup and maintenance.py connect()
jghoman May 4, 2026
13d768e
Stop logging false-positive warning when prior target_file_size is th…
jghoman May 4, 2026
9b12a79
Revert _escape_libpq to libpq grammar (backslash escapes)
jghoman May 4, 2026
c91a06f
Default-off debug logging in just shell + add macro coverage
jghoman May 4, 2026
c5da4da
Filter heal-orphans gates to live rows + lock the schema-name coupling
jghoman May 4, 2026
e8504ae
Normalize trailing slash in data_path before joining relative keys
jghoman May 4, 2026
949fad6
Fix throughput-log race, validate s3_use_ssl, cover s3:// gate paths
jghoman May 4, 2026
fbab440
Sync README + AGENT.md cleanup-throughput-line description with code
jghoman May 4, 2026
1 change: 1 addition & 0 deletions .gitignore
@@ -9,3 +9,4 @@ dist/
.claude/settings.local.json
session-*.md
test/ssl/
FOLLOWUPS.md
18 changes: 15 additions & 3 deletions AGENT.md
@@ -24,15 +24,27 @@ Prefer fixup commits over amending and force-pushing.

## Maintenance Tooling

`tools/maintenance.py` is a self-contained Python script for DuckLake maintenance operations (snapshot expiry, file cleanup, orphan deletion, checkpoint, tiered compaction). It connects to DuckLake using the same env vars as the main app (`DUCKLAKE_RDS_*`, `DUCKDB_S3_*`, `DUCKLAKE_DATA_PATH`) but does not import from the `millpond` package. Designed to run as a K8s CronJob reusing the same Docker image.
`tools/maintenance.py` is a self-contained Python script for DuckLake maintenance operations (snapshot expiry, file cleanup, orphan deletion, checkpoint, tiered compaction, deletion-queue dedup, catalog-side orphan recovery, fsck). It connects to DuckLake using the same env vars as the main app (`DUCKLAKE_RDS_*`, `DUCKDB_S3_*`, `DUCKLAKE_DATA_PATH`) but does not import from the `millpond` package. Designed to run as a K8s CronJob reusing the same Docker image.

The `compact` subcommand implements tiered compaction: each tier wraps `ducklake_merge_adjacent_files()` with a save/restore of the catalog's `target_file_size` option (which `ducklake_set_option` writes durably and cannot be unset). Tier specs and the steady-state default live in `TIERS` and `DEFAULT_TARGET_FILE_SIZE` at the top of the file. Bin semantics on DuckLake 1.4.x are `min_file_size` inclusive, `max_file_size` exclusive — so the tier ranges `[0, 1 MiB)`, `[1 MiB, 10 MiB)`, `[10 MiB, 64 MiB)` partition the file-size space without overlap. The `compact-probe` subcommand runs `ducklake_merge_adjacent_files()` against one table with a small `max_compacted_files` cap and no `target_file_size` change — used as a lightweight diagnostic from a periodic CronJob.
`connect()` does two ATTACHes against the same upstream Postgres: the DuckLake catalog as `lake` (the `ATTACH_NAME` constant) and a direct Postgres ATTACH as `pg` (the `PG_ATTACH_NAME` constant) used by `postgres_execute` / `postgres_query` for ctid-based DML and advisory-lock acquisition. S3 access is configured via both the legacy `SET s3_*` settings (honored by the ducklake catalog driver) and a `CREATE OR REPLACE SECRET s3` (required for ad-hoc httpfs ops like `glob('s3://...')` and `read_parquet('s3://...')` in DuckDB 1.4 — the legacy form alone returns 403 for those paths). Spills are pointed at `/tmp/duckdb_spill` so they land on the writable cron-pod emptyDir, not the read-only rootfs.

The `compact` subcommand implements tiered compaction: each tier wraps `ducklake_merge_adjacent_files()` with a save/restore of the catalog's `target_file_size` option (which `ducklake_set_option` writes durably and cannot be unset). Tier specs and the steady-state default live in `TIERS` and `DEFAULT_TARGET_FILE_SIZE` at the top of the file. Bin semantics on DuckLake 1.4.x are `min_file_size` inclusive, `max_file_size` exclusive — so the tier ranges `[0, 1 MiB)`, `[1 MiB, 10 MiB)`, `[10 MiB, 64 MiB)` partition the file-size space without overlap. The `compact-probe` subcommand runs `ducklake_merge_adjacent_files()` against one table with a small `max_compacted_files` cap and no `target_file_size` change — used as a lightweight diagnostic from a periodic CronJob. `compact` exposes `--threads` (default 2) and `--memory-limit` (default 4GB) because `ducklake_merge_adjacent_files` over-uses memory relative to input volume in 1.4 and the conservative defaults keep the cron pod alive on real lakes; raise them when the lake fits.

DuckLake stores its catalog tables (`ducklake_data_file`, `ducklake_table`, `ducklake_files_scheduled_for_deletion`, etc.) in a Postgres schema named `__ducklake_metadata_<attach_name>`, where `<attach_name>` is the alias from the `ATTACH … AS <name>` statement. `maintenance.py` exposes the alias as the `ATTACH_NAME` constant and the derived `METADATA_SCHEMA = f"__ducklake_metadata_{ATTACH_NAME}"`; any new code that reads the catalog directly (rather than through the `ducklake_*` SQL functions) must reference `METADATA_SCHEMA` so the attach name and schema name never drift.
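The convention above compresses to a couple of lines of code. The two constants are quoted from the description; the `catalog_table` helper is a hypothetical illustration of how new catalog-reading code should consume them:

```python
# The ATTACH alias and the Postgres schema DuckLake derives from it,
# as documented for tools/maintenance.py.
ATTACH_NAME = "lake"
METADATA_SCHEMA = f"__ducklake_metadata_{ATTACH_NAME}"

# Hypothetical helper (not in the real script): qualify a catalog table
# through the constant so the attach name and schema name can never drift.
def catalog_table(name):
    return f"{METADATA_SCHEMA}.{name}"
```

`catalog_table("ducklake_data_file")` then yields `__ducklake_metadata_lake.ducklake_data_file`.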

`tools/maintenance.sql` is executed verbatim at every session start, both by `maintenance.py`'s `connect()` and by the `just shell` recipe (via the duckdb CLI's `.read` meta-command). The file contains no templating — `__ducklake_metadata_lake` and the rest are written literally so both load paths stay consistent. It defines runtime macros (`count_pending_dups()`, `find_catalog_orphans(data_path)`) and documents the constraints any new recipe must follow: no `LEFT ANTI JOIN` (DuckDB 1.4 limitation; use `LEFT JOIN ... WHERE rhs IS NULL`), no Postgres `ctid` from duckdb-side SQL (use `postgres_execute` / `postgres_query` instead), no literal `glob('s3://...')` inside `CREATE MACRO` bodies (DuckDB 1.4 evaluates them eagerly at macro creation, which would S3-LIST the lake on every connect — pass the path as a parameter), and the advisory-lock key.
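The `LEFT JOIN ... WHERE rhs IS NULL` rewrite is worth spelling out once. The two SQL forms appear in the comments below; the pure-Python function is only a model of the semantics, with invented names:

```python
# DuckDB 1.4 rejects:
#   SELECT l.* FROM lhs l LEFT ANTI JOIN rhs r ON l.path = r.path
# Equivalent form that maintenance.sql recipes must use instead:
#   SELECT l.* FROM lhs l LEFT JOIN rhs r ON l.path = r.path
#   WHERE r.path IS NULL
# The same semantics in Python: keep left rows with no match on the right.
def anti_join(left_rows, right_rows, key):
    right_keys = {r[key] for r in right_rows}
    return [row for row in left_rows if row[key] not in right_keys]

catalog = [{"path": "a.parquet"}, {"path": "b.parquet"}]
s3_listing = [{"path": "a.parquet"}]
assert anti_join(catalog, s3_listing, "path") == [{"path": "b.parquet"}]
```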

The catalog-side orphan-recovery subcommands (`dedup-deletions`, `find-orphans`, `heal-orphans`, `cleanup-all-safe`, `fsck`) form a self-contained recovery toolkit for the failure mode where an interrupted `cleanup-all` leaves the catalog with rows pointing at S3 keys that no longer exist (because the upstream txn rolled back but the S3 deletes are permanent). `heal-orphans` runs two safety gates before deleting: B1 proves `ducklake_data_file` is non-empty AND no orphan path is still live (no vacuous pass on an empty catalog, no false positive that would delete a live file), and B3 aborts if any positional-delete vector references an "orphan" id (such a file is still live for vector lookups). `cleanup-all-safe` is the orchestrator that loops dedup + heal + cleanup-all under one advisory lock until cleanup-all exits clean; `fsck` adds the `ducklake_delete_orphaned_files` S3-side sweep on top. All destructive subcommands take `pg_try_advisory_lock(hashtext('millpond-ducklake-maintenance')::bigint)` on the `pg` ATTACH; the lock is held by the `pg` connection (not the `lake` connection that DuckLake uses internally), so it provides mutual exclusion between maintenance invocations but not catalog-write atomicity against arbitrary writers — document this caveat anywhere the lock is mentioned.
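A hedged sketch of the B1 gate's shape — catalog access is injected as a `query` callable, and the table/column names follow the documented schema convention but are assumptions, not the script's actual code:

```python
def b1_gate(orphan_paths, query):
    """Refuse to delete when the gate can't prove safety: an empty
    ducklake_data_file would make the orphan check vacuously true, and an
    'orphan' path still referenced by a live row would be a false positive
    whose deletion removes a live file."""
    (n_live,) = query(
        "SELECT count(*) FROM __ducklake_metadata_lake.ducklake_data_file")[0]
    if n_live == 0:
        raise RuntimeError("B1: ducklake_data_file is empty; refusing to heal")
    live_paths = {row[0] for row in query(
        "SELECT path FROM __ducklake_metadata_lake.ducklake_data_file")}
    overlap = orphan_paths & live_paths
    if overlap:
        raise RuntimeError(f"B1: {len(overlap)} orphan path(s) still live; aborting")
```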

`main()` logs `millpond <version> (maintenance)` on startup using `importlib.metadata.version("millpond")`, mirroring `millpond/main.py`. If `MILLPOND_IMAGE` is set in the env, the image identifier is appended (`image=<value>`). The chart-side wiring is optional — without it, the package version is sufficient to identify which build is running.

`cleanup` and `cleanup-all` log a single structured throughput line on completion: `cleanup throughput: files_processed=N elapsed_s=T rate_obj_s=R queue_depth_after=A`. `files_processed` comes directly from `len(result)` — the count of rows `ducklake_cleanup_old_files` returned — rather than a queue-depth delta. A delta would be misleading whenever any other writer enqueues deletions during the call (the maintenance advisory lock by design only mutexes maintenance invocations, not arbitrary writers); `len(result)` is accurate regardless. `queue_depth_after` is queried with a single post-call snapshot and shows remaining work but is not used in the rate calculation. The line is intentionally suppressed on `--dry-run` because dry-run returns preview rows and a rate computed from those would falsely claim work was done.

If `PUSHGATEWAY_URL` is set, the script pushes two metrics: `maintenance_start_time{operation}` (pushed immediately on start) and `maintenance_duration_seconds{operation, status}` (pushed on completion). This enables Grafana annotation queries for maintenance windows.

`tools/justfile` wraps the script for interactive use and is copied to `/justfile` in the Docker image. The `shell` and `drop` recipes still use the DuckDB CLI directly. The `DUCKDB` env var can override the duckdb binary path.
DuckDB's HTTP logging and the postgres extension's `pg_debug_show_queries` are off by default — both add per-call overhead that compounds across tens of thousands of S3 deletes. Pass `--debug` at the maintenance.py command level to opt back into both for short-lived debugging.

`tools/justfile` wraps the script for interactive use and is copied to `/justfile` in the Docker image. The `shell` recipe pre-ATTACHes both `lake` and `pg`, configures the S3 SECRET, and `.read`s `tools/maintenance.sql` so every session starts with the macros loaded; the `drop` recipe still uses the DuckDB CLI directly. The `DUCKDB` env var can override the duckdb binary path; `MAINTENANCE_SCRIPT` and `MAINTENANCE_SQL` env vars override the in-image paths for dev use.

## What This Is

1 change: 1 addition & 0 deletions Dockerfile
@@ -24,6 +24,7 @@ COPY --from=builder /app/millpond /app/millpond
COPY --from=builder /root/.local/bin/duckdb /usr/local/bin/duckdb
COPY tools/justfile /justfile
COPY tools/maintenance.py /app/tools/maintenance.py
COPY tools/maintenance.sql /app/tools/maintenance.sql

ENV PATH="/app/.venv/bin:$PATH"

58 changes: 46 additions & 12 deletions README.md
@@ -94,28 +94,60 @@ Requires Docker (uses `keytool` from the Kafka container image for cert generation)

### DuckLake Maintenance

`tools/maintenance.py` is a self-contained Python script for DuckLake maintenance operations (snapshot expiry, file cleanup, orphan deletion, checkpoint). It is baked into the Docker image at `/app/tools/maintenance.py` and designed to run as a K8s CronJob reusing the same image and credentials as the main application.
`tools/maintenance.py` is a self-contained Python script for DuckLake maintenance operations (snapshot expiry, file cleanup, orphan deletion, checkpoint, tiered compaction, deletion-queue dedup, catalog-side orphan recovery). It is baked into the Docker image at `/app/tools/maintenance.py` and designed to run as a K8s CronJob reusing the same image and credentials as the main application.

```bash
python /app/tools/maintenance.py maintain --days 7 # expire snapshots + cleanup files
python /app/tools/maintenance.py maintain --days 7 # expire snapshots + cleanup files
python /app/tools/maintenance.py maintain --days 7 --dry-run # preview only
python /app/tools/maintenance.py expire --days 3 # expire snapshots only
python /app/tools/maintenance.py cleanup --days 1 # cleanup scheduled files only
python /app/tools/maintenance.py checkpoint # integrated merge + expire + cleanup
python /app/tools/maintenance.py orphans # delete orphaned S3 files
python /app/tools/maintenance.py expire --days 3 # expire snapshots only
python /app/tools/maintenance.py cleanup --days 1 # cleanup scheduled files only
python /app/tools/maintenance.py cleanup-all # cleanup all scheduled files regardless of age
python /app/tools/maintenance.py dedup-deletions # drop duplicate rows in the pending-deletion queue
python /app/tools/maintenance.py find-orphans # list catalog rows whose S3 key no longer exists
python /app/tools/maintenance.py heal-orphans # delete those catalog rows (gated B1/B3 safety checks)
python /app/tools/maintenance.py cleanup-all-safe # dedup + heal-orphans + cleanup-all in a loop until clean
python /app/tools/maintenance.py fsck # cleanup-all-safe + ducklake_delete_orphaned_files
python /app/tools/maintenance.py checkpoint # integrated merge + expire + cleanup
python /app/tools/maintenance.py orphans # delete S3-side orphaned files (catalog has no row)
python /app/tools/maintenance.py compact --tier 1 # tiered compaction (see "When to add a merge job")
```

The script logs `cleanup throughput: files_processed=N elapsed_s=T rate_obj_s=R queue_depth_after=A` after every `cleanup` / `cleanup-all` (skipped on `--dry-run`), so you can confirm steady-state throughput without enabling debug logging. `files_processed` is the actual count of files the call returned, not a queue-depth delta, so the number is accurate even when other writers enqueue deletions during the run. Pass `--debug` to opt back into DuckDB's HTTP and Postgres-extension query logging — both are off by default because they add per-call overhead that compounds across tens of thousands of S3 deletes.
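As a sketch (not the script's actual code), the line reduces to a few fields computed from the call result:

```python
def cleanup_throughput_line(result, elapsed_s, queue_depth_after):
    # files_processed is the row count ducklake_cleanup_old_files returned,
    # not a before/after queue-depth delta -- concurrent writers enqueueing
    # new deletions during the call therefore can't skew the number.
    files = len(result)
    rate = files / elapsed_s if elapsed_s > 0 else 0.0
    return (f"cleanup throughput: files_processed={files} "
            f"elapsed_s={elapsed_s:.1f} rate_obj_s={rate:.1f} "
            f"queue_depth_after={queue_depth_after}")
```

For example, 50 files in 10 s with 7 rows left yields `cleanup throughput: files_processed=50 elapsed_s=10.0 rate_obj_s=5.0 queue_depth_after=7`.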

If `PUSHGATEWAY_URL` is set, the script pushes `maintenance_start_time` (on start) and `maintenance_duration_seconds` (on completion) to a Prometheus Pushgateway, enabling Grafana annotation queries for maintenance windows.

#### Catalog-side orphan recovery

If a `cleanup-all` run is interrupted (DuckLake bug: an S3 NoSuchKey on DELETE rolls back the whole transaction, but the already-completed S3 deletes are permanent), the catalog ends up with rows in `ducklake_files_scheduled_for_deletion` that point at S3 keys that no longer exist. Every subsequent `cleanup-all` will crash on those orphans until they're cleaned up. The catalog-recovery subcommands handle this without manual SQL surgery:

| Subcommand | Action |
|---|---|
| `find-orphans` | List orphan rows on stdout (read-only). |
| `heal-orphans` | Delete the orphan rows. Two safety gates: B1 proves `ducklake_data_file` is non-empty AND no orphan path is still live; B3 aborts if any positional-delete vector references an orphan id. `--dry-run` runs the gates but skips the DELETE. |
| `cleanup-all-safe` | Loop dedup-deletions + heal-orphans + cleanup-all under one advisory lock until cleanup-all exits clean. Caps at `--max-iterations` (default 10). |
| `fsck` | `cleanup-all-safe` followed by `ducklake_delete_orphaned_files` (S3-side orphan sweep). The end-to-end "lake catalog is healthy" recipe. |
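The orchestrator's control flow can be sketched with the three steps injected as callables (a hypothetical shape, not the script's code; `cleanup_all` here returns `True` once it exits clean):

```python
def cleanup_all_safe(dedup_deletions, heal_orphans, cleanup_all, max_iterations=10):
    # One advisory lock is assumed to be held around the whole loop.
    for iteration in range(1, max_iterations + 1):
        dedup_deletions()
        heal_orphans()
        if cleanup_all():
            return iteration  # deletion queue drained cleanly
    raise RuntimeError(f"cleanup-all still dirty after {max_iterations} iterations")
```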

Mutual exclusion comes from `pg_try_advisory_lock(hashtext('millpond-ducklake-maintenance')::bigint)` taken on the `pg` ATTACH; concurrent maintenance invocations bail with a clear error rather than racing each other's DELETEs.
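The acquisition itself is one round-trip; this sketch wraps the documented SQL (the function name and the injected query runner are illustrative):

```python
LOCK_SQL = (
    "SELECT pg_try_advisory_lock("
    "hashtext('millpond-ducklake-maintenance')::bigint)"
)

def acquire_maintenance_lock(postgres_query):
    # Must run on the direct `pg` ATTACH: the lock belongs to that session,
    # so it mutexes maintenance invocations but says nothing about other
    # catalog writers.
    (acquired,) = postgres_query(LOCK_SQL)[0]
    if not acquired:
        raise SystemExit("another maintenance invocation holds the advisory lock")
```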

`tools/maintenance.sql` is loaded at every session start (both by `maintenance.py` and by the `just shell` recipe) and defines small DuckDB macros for ad-hoc inspection — `SELECT count_pending_dups()` for queue dup count, `SELECT * FROM find_catalog_orphans('s3://bucket/lake/data')` for the orphan list. The header documents the conventions (no `LEFT ANTI JOIN`, no duckdb-side `ctid`, advisory-lock key) that any new recipe must follow.

`tools/justfile` wraps the script and is also baked into the image at `/justfile` for interactive use:

```bash
just --list # see available recipes
just maintain-dry-run 3 # preview: expire >3 day snapshots + cleanup
just maintain 3 # execute it
just shell # interactive DuckDB shell connected to DuckLake
just drop events # drop a table (data files remain until cleanup)
just orphans-dry-run # preview orphaned S3 files
just --list # see available recipes
just maintain-dry-run 3 # preview: expire >3 day snapshots + cleanup
just maintain 3 # execute it
just dedup-deletions-dry-run # preview duplicate rows in the pending-deletion queue
just dedup-deletions # drop them
just find-orphans # list catalog-side orphan rows
just heal-orphans-dry-run # preview heal-orphans (gates only, no DELETE)
just heal-orphans # delete catalog-side orphan rows
just cleanup-all-safe # dedup + heal + cleanup-all in a loop
just fsck-dry-run # preview fsck end-to-end
just fsck # bring catalog to known-good state
just shell # interactive DuckDB shell with lake + pg ATTACHed and macros loaded
just drop events # drop a table (data files remain until cleanup)
just orphans-dry-run # preview S3-side orphaned files
```

All commands use the pod's existing env vars (`DUCKLAKE_RDS_*`, `DUCKDB_S3_*`, `DUCKLAKE_DATA_PATH`).
@@ -242,6 +274,8 @@ Tier ranges (verified semantics: `min_file_size` inclusive, `max_file_size` exclusive):
| `compact-to-tier-2` | `[1 MiB, 10 MiB)` | ~32 MiB |
| `compact-to-tier-3` | `[10 MiB, 64 MiB)` | ~128 MiB |

The `compact` subcommand bounds DuckDB resource use during the merge — `--threads` (default 2) and `--memory-limit` (default 4GB) — because `ducklake_merge_adjacent_files` isn't fully streaming today and over-uses memory relative to input size. The defaults are conservative; raise them on lakes that fit comfortably in pod memory.

This is an out-of-band maintenance operation, not part of the hot path.
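Under the verified semantics, tier membership is a half-open-interval lookup. The bounds below mirror the tier table above; the helper itself is illustrative:

```python
MIB = 2**20
# (min inclusive, max exclusive) per tier, matching the table above.
TIERS = {1: (0, 1 * MIB), 2: (1 * MIB, 10 * MIB), 3: (10 * MIB, 64 * MIB)}

def tier_for(size_bytes):
    for tier, (lo, hi) in TIERS.items():
        if lo <= size_bytes < hi:  # min inclusive, max exclusive
            return tier
    return None  # >= 64 MiB: already past the tiers; leave the file alone
```

A file of exactly 1 MiB falls in tier 2, not tier 1 — the exclusive upper bound is what keeps the ranges from overlapping.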

See the [sizing calculator](https://posthog.github.io/millpond/sizing-calculator.html) for interactive estimates.
11 changes: 9 additions & 2 deletions millpond/ducklake.py
@@ -16,8 +16,15 @@
def _escape_libpq(value: str | None) -> str:
"""Escape a value for a libpq connection string.

Wraps in single quotes and backslash-escapes internal single quotes and backslashes.
See: https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING
Wraps in single quotes and backslash-escapes internal single quotes and
backslashes, per the libpq connstring grammar:

https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING

Note this is *not* the same parser as Postgres SQL string literals — the
SQL parser uses ``''`` for embedded quotes and is governed by
``standard_conforming_strings``; the libpq connstring parser is a
separate grammar that has always required ``\\'`` and ``\\\\``.
"""
if value is None:
return "''"
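From the docstring, the escaping rule sketches as follows (hedged — the real `_escape_libpq` may differ in detail, but the grammar it implements is this one):

```python
def escape_libpq(value):
    """Quote a value for a libpq connection string: wrap in single quotes
    and backslash-escape embedded backslashes and single quotes. This is
    the connstring grammar, not the SQL string-literal grammar (which
    doubles the quote instead)."""
    if value is None:
        return "''"
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"
```

So `escape_libpq("o'brien")` produces `'o\'brien'`, where SQL-literal quoting would have produced `'o''brien'`.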
4 changes: 4 additions & 0 deletions pyproject.toml
@@ -20,6 +20,9 @@ dependencies = [
"pyarrow>=18.0",
"orjson>=3.10",
"prometheus-client>=0.21",
# Required by duckdb's Python conversion for TIMESTAMPTZ columns; the
# stdlib zoneinfo module isn't accepted as a substitute as of 1.4.x.
"pytz>=2024.1",
]

[project.optional-dependencies]
@@ -47,6 +50,7 @@ markers = [
"integration: marks tests as integration tests (deselect with '-m \"not integration\"')",
"e2e: marks tests as E2E tests requiring docker-compose stack",
]
pythonpath = ["tools"]

[tool.uv]
constraint-dependencies = [
2 changes: 2 additions & 0 deletions tests/unit/test_ducklake.py
@@ -74,6 +74,8 @@ def test_quoted_string_rejected(self):


class TestEscapeLibpq:
"""libpq connstring grammar: backslash escapes, NOT SQL-style doubled quotes."""

def test_plain_value(self):
assert _escape_libpq("ducklake") == "'ducklake'"
