
Maintenance script followups: compaction tuning, dedup, orphan recovery, fsck #49

Merged

jghoman merged 34 commits into main from maintenance-followups on May 4, 2026

Conversation

@jghoman (Collaborator) commented May 4, 2026

Summary

Eighteen small, mostly-mechanical improvements to tools/maintenance.py and its surrounding scaffolding (justfile, SQL macro file, image, README/AGENT). Three themes: (1) plug latent bugs we hit operating cleanup against real lakes, (2) add a SQL-first orphan-recovery library so the manual recovery dance becomes a recipe, (3) make compact not OOM the cron pod by default.

Every commit is one item from the team followups doc.

Bug fixes & defaults

| # | Change |
|---|--------|
| `_escape_libpq` | No behavior change to the escape rules (libpq connstring grammar uses `\'` / `\\`, distinct from the Postgres SQL parser's `''` doubling — `standard_conforming_strings` governs the SQL parser, not libpq). An earlier commit on this branch mistakenly switched to SQL-style doubling and was reverted. Docstrings now name the grammar explicitly to prevent a re-occurrence. |
| `_scoped_target_file_size` | Read the prior GLOBAL `target_file_size` (filtered to `scope = 'GLOBAL'` to avoid the original implementation's empty-tuple bug) and restore it on exit. DuckLake stores the value as a raw byte count; convert via a new `_bytes_to_human` helper that picks the largest 1024^i unit that divides cleanly. Fall back to `DEFAULT_TARGET_FILE_SIZE` only when conversion fails or the prior value is unset (operator-configured non-default values are preserved). |
| `temp_directory` | `SET temp_directory = '/tmp/duckdb_spill'` in `connect()` so spills land on the writable emptyDir, not the read-only rootfs. |
| `enable_http_logging` / `pg_debug_show_queries` | Both off by default; `--debug` flag opts back in. Both add per-call overhead that compounds across tens of thousands of S3 deletes. |
| cleanup throughput log | One structured key=value line per cleanup (skipped on `--dry-run`): `files_processed=N elapsed_s=T rate_obj_s=R queue_depth_after=B`, where `files_processed` is the operation's return-row count. |
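
For concreteness, a minimal sketch of the `_bytes_to_human` conversion. The helper name comes from this PR; the body below is assumed, inferred from the "largest 1024^i unit that divides cleanly" rule and the later test pinning 134217728 → 128MiB:

```python
def _bytes_to_human(raw: str) -> str | None:
    """Hypothetical sketch: render a raw byte count as a SET-able size string.

    None signals conversion failure (non-integer or non-power-of-1024
    input); the caller then falls back to DEFAULT_TARGET_FILE_SIZE.
    """
    try:
        n = int(raw)
    except (TypeError, ValueError):
        return None
    for unit, step in (("TiB", 1024**4), ("GiB", 1024**3),
                       ("MiB", 1024**2), ("KiB", 1024)):
        if n > 0 and n % step == 0:
            return f"{n // step}{unit}"
    return None

assert _bytes_to_human("134217728") == "128MiB"  # the documented default
```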

Compaction tuning

| # | Change |
|---|--------|
| `--threads` (default 2), `--memory-limit` (default 4GB) on `compact` | Cap merge resource use. `ducklake_merge_adjacent_files` isn't fully streaming in 1.4 and over-uses memory relative to input volume; conservative defaults keep the cron pod alive on real lakes. |
| Per-session `preserve_insertion_order=false`, `http_timeout=600000` | Skip the implicit sort and survive S3 hiccups on multi-MB GETs/PUTs. |
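
The tuning above is all plain session SETs. A sketch of what the `_set_compaction_tuning` helper mentioned in the commits plausibly issues (flag plumbing and the injection validation elided):

```python
def _set_compaction_tuning(conn, threads: int = 2, memory_limit: str = "4GB") -> None:
    # Conservative defaults so a catalog-wide merge can't OOM the cron pod;
    # operators raise the ceiling via --threads / --memory-limit.
    conn.execute(f"SET threads = {int(threads)}")
    conn.execute(f"SET memory_limit = '{memory_limit}'")
    conn.execute("SET preserve_insertion_order = false")  # skip the implicit sort
    conn.execute("SET http_timeout = 600000")             # ms; ride out S3 hiccups
```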

SQL-first orphan-recovery library

DuckLake bug c5 (duplicate paths in ducklake_files_scheduled_for_deletion) combined with c1 (S3 NoSuchKey rolls back the whole txn) means every interrupted cleanup-all can leave the catalog in a worse state than it started in. The new subcommands form the fix-it-yourself toolkit until c1/c5 land upstream:

| Subcommand | What |
|---|---|
| `dedup-deletions` | Drop duplicate rows from the queue via `DELETE … USING MIN(ctid)` through `postgres_execute`. |
| `find-orphans` | List catalog rows whose S3 key no longer exists. Prints `data_file_id<TAB>path` to stdout. |
| `heal-orphans` | Materialize the orphan set, run B1 (positive-proof: `ducklake_data_file` non-empty AND zero orphan paths still live) and B3 (zero positional-delete vectors reference orphan ids) safety gates, then `DELETE … WHERE path IN (...)` through `postgres_execute`. |
| `cleanup-all-safe` | Loop dedup + heal + cleanup-all under one advisory lock until cleanup-all exits clean. Caps at `--max-iterations` (default 10). |
| `fsck` | `cleanup-all-safe` + `ducklake_delete_orphaned_files` (catalog-healthy end-to-end). |
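
The dedup DELETE has to run on the upstream Postgres (duckdb has no ctid), which is why it goes through `postgres_execute`. A sketch, assuming the upstream search_path resolves the queue table (schema qualification omitted here):

```python
DEDUP_SQL = """\
DELETE FROM ducklake_files_scheduled_for_deletion
WHERE ctid NOT IN (
    SELECT MIN(ctid)
    FROM ducklake_files_scheduled_for_deletion
    GROUP BY path
)"""

def dedup_deletions(conn) -> None:
    # One postgres_execute call; any quotes in the inner statement (none
    # here) would be doubled for the duckdb-side parse, see _sql_string_literal.
    conn.execute(f"CALL postgres_execute('pg', '{DEDUP_SQL}')")
```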

tools/maintenance.sql (new) holds the runtime-loaded macros (count_pending_dups(), find_catalog_orphans(data_path)) plus a header documenting the constraints (no LEFT ANTI JOIN, no duckdb-side ctid, advisory-lock key). Loaded by both maintenance.py connect and the just shell recipe.

Advisory lock keyed off hashtext('millpond-ducklake-maintenance')::bigint, taken on the pg ATTACH; provides mutual exclusion between maintenance invocations (not catalog-write atomicity against arbitrary writers — documented).
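
A sketch of the lock acquisition, using the key above; the helper shape is assumed. Note the doubled quotes so the inner literal survives the duckdb-side parse before `postgres_query` ships it upstream:

```python
def _acquire_advisory_lock(conn) -> None:
    # Quotes inside the inner SQL are doubled; duckdb unwraps them back to
    # single quotes before postgres_query runs the statement upstream.
    inner = ("SELECT pg_try_advisory_lock("
             "hashtext(''millpond-ducklake-maintenance'')::bigint)")
    row = conn.execute(f"SELECT * FROM postgres_query('pg', '{inner}')").fetchone()
    if not row or not row[0]:
        raise RuntimeError("another maintenance session holds the advisory lock")
    # Released implicitly when the session (and its pg ATTACH) closes.
```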

Plumbing

| # | Change |
|---|--------|
| `ATTACH_NAME = "lake"` / `METADATA_SCHEMA` constants | Single source of truth so the attach alias and the metadata-schema name never drift. |
| `pg` (TYPE postgres) ATTACH | Added to `connect()` and to `just shell` so `postgres_execute` / `postgres_query` are available everywhere. |
| `CREATE OR REPLACE SECRET s3` | Replaces / supplements legacy `SET s3_*` in both `connect()` and `just shell` — the legacy form isn't honored by ad-hoc `glob('s3://...')` / `read_parquet('s3://...')` in 1.4. |
| `pytz` dependency | DuckDB 1.4 raises `Required module 'pytz' failed to import` on any TIMESTAMPTZ fetch; stdlib zoneinfo is not accepted. ~500 KB. |
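
The SECRET takes roughly this shape; all values below are placeholders, and in the tool they are read from env with AWS-style defaults:

```python
conn.execute("""
    CREATE OR REPLACE SECRET s3 (
        TYPE S3,
        KEY_ID 'AKIA...',                        -- placeholder
        SECRET '...',                            -- placeholder
        ENDPOINT 's3.us-east-1.amazonaws.com',   -- region placeholder
        URL_STYLE 'vhost',
        USE_SSL true
    )
""")
```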

Real-stack canary (docker-compose lake)

Caught two bugs I'd otherwise have shipped:

| Bug | Fix |
|---|---|
| `find-orphans` 403 because `connect()` only set legacy S3 env vars | Add `CREATE OR REPLACE SECRET s3` |
| `cleanup-all-safe` ParserException on advisory-lock SQL | Wrap inner SQL via `_sql_string_literal` |
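
The `_sql_string_literal` fix is plain SQL quote-doubling applied to the whole nested statement. A sketch:

```python
def _sql_string_literal(s: str) -> str:
    # Double embedded single quotes so the nested SQL survives the
    # duckdb-side parse before postgres_execute/postgres_query unwraps it.
    return "'" + s.replace("'", "''") + "'"
```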

End-to-end verified against a populated lake:

| Path | Result |
|---|---|
| `connect()` (ATTACH lake + pg, load macros, set SECRET) | ✓ |
| `find-orphans` on a clean lake | ✓ — returned 0 |
| `dedup-deletions --dry-run` / `heal-orphans --dry-run` / `fsck --dry-run` | ✓ |
| `cleanup-all-safe` against a 3,977-row queue | ✓ — drained on attempt 1, throughput line fired |
| Idempotent re-run on clean queue | ✓ |
| `count_pending_dups()` / `find_catalog_orphans()` macros | ✓ |
| Path matching for relative-form keys | ✓ |
| Advisory lock acquire | ✓ |

Not exercised in canary: heal-orphans against manufactured orphans (skipped after mc tooling friction), the NoSuchKey-retry loop in cleanup-all-safe (would need to deliberately trigger DuckLake bug c1).

Test plan

  • just lint + just fmt-check clean
  • 152 unit tests pass
  • Image rebuild copies tools/maintenance.sql to /app/tools/maintenance.sql
  • On a real lake: python /app/tools/maintenance.py fsck --dry-run reports counts without erroring
  • On a real lake: cleanup-all-safe against a non-empty queue drains it and emits the throughput log line
  • Compaction CronJob's compact --tier 1 honors --threads 2 --memory-limit 4GB defaults (visible in pod resource use)

jghoman added 18 commits May 4, 2026 12:08
Modern libpq with standard_conforming_strings=on (default since 9.1)
rejects the old \' escape form. Switch to '' doubling and drop the
no-longer-needed backslash handling. Same fix in millpond/ducklake.py
and tools/maintenance.py; unit tests updated to match.

The prior-value read in _scoped_target_file_size was unreliable: DuckLake
stores target_file_size as a byte-count string and ducklake_options('lake')
can return multiple rows across GLOBAL/SCHEMA/TABLE scopes, so the restore
SET intermittently failed with ParserException. Drop the read and always
restore to the documented steady-state default.

DuckDB's default spill path '.tmp' relative to CWD is on the read-only
rootfs in the millpond image, so any compaction that needed to spill
crashed. Point spills at /tmp, the writable emptyDir in the cron pod.

Both add per-call overhead that compounds across 30k+ S3 deletes and
many catalog reads. Add a top-level --debug flag to opt back in for
short-lived debugging.

DuckLake's Postgres metadata schema is named __ducklake_metadata_<attach>,
so the attach name and the schema name must be derived from one source or
queries silently target the wrong schema. Introduce ATTACH_NAME and
METADATA_SCHEMA constants and route every DuckLake call through them.
No behavior change.
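
The derivation itself is two lines; a sketch of the constants this commit introduces:

```python
ATTACH_NAME = "lake"
# DuckLake names the metadata schema after the attach alias, so deriving
# one from the other means the two names can't drift.
METADATA_SCHEMA = f"__ducklake_metadata_{ATTACH_NAME}"
```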

The legacy SET s3_* form is honored for some duckdb/httpfs paths but
not for ad-hoc glob('s3://...') / read_parquet('s3://...') in 1.4, so
operators got 403s mid-incident until they ran CREATE SECRET by hand.
Switch the shared _setup block to CREATE OR REPLACE SECRET s3 so every
shell session starts with S3 access wired up.

Every maintenance recipe we keep adding (dedup, orphan-clean, ctid-based
ops) needs the upstream Postgres ATTACHed via the duckdb postgres
extension. Bake it in alongside the lake ATTACH so operators don't have
to type the connection string by hand. Extract pg_connstr to share the
libpq form between the ducklake-catalog and pg ATTACHes.

DuckDB 1.4 raises 'Required module pytz failed to import' on any Python
fetch from a TIMESTAMPTZ column; stdlib zoneinfo is not accepted as a
substitute. Verified empirically against duckdb 1.4.4. ~500 KB on the
image; cheap insurance against the next time a maintenance recipe
SELECTs a snapshot timestamp.

Future tooling that reads the catalog directly (not through ducklake_*
SQL functions) needs to know the schema name is derived from the ATTACH
alias. Point readers at the ATTACH_NAME / METADATA_SCHEMA constants in
maintenance.py so the two never drift.

ducklake_merge_adjacent_files isn't fully streaming today (DuckLake bug
c8), so the catalog-wide merge happily consumes 24+ GiB / 12 threads on
a real lake and OOMKills the cron pod. Set conservative defaults via a
new _set_compaction_tuning helper (threads=2, memory_limit=4GB,
preserve_insertion_order=false, http_timeout=600s) and expose --threads
/ --memory-limit so operators can raise the ceiling on lakes that fit.
compact-probe is unchanged — it caps work via max_compacted_files.

Snapshot the ducklake_files_scheduled_for_deletion queue depth before
and after cleanup, time the call, emit one key=value line:

  cleanup-all throughput: files_processed=N queue_depth_before=A
    queue_depth_after=B elapsed_s=T rate_obj_s=R

Lets an operator confirm the steady-state ~5 obj/s and feed the number
back to upstream DuckLake bug c2 (per-file vs per-batch commits in
cleanup_old_files) without enabling debug logging.

The same path can land in ducklake_files_scheduled_for_deletion across
multiple snapshots (DuckLake bug c5); the second visit by cleanup-all
hits NoSuchKey on S3, which currently rolls back the whole transaction
(DuckLake bug c1) and corrupts the queue. Until those land upstream,
operators need to dedup the queue before every cleanup-all run.

This commit adds:

* tools/maintenance.sql — runtime-loaded SQL file shared by maintenance.py
  and the just shell recipe. Header documents the constraints any new
  recipe must follow (no LEFT ANTI JOIN, no duckdb-side ctid, advisory
  lock convention) and defines the count_pending_dups() macro.
* connect() ATTACHes the upstream Postgres as 'pg' for postgres_execute
  / postgres_query, then sources maintenance.sql so macros are available
  in every session (cron and interactive).
* dedup-deletions subcommand and justfile recipes — DELETE … WHERE ctid
  NOT IN (SELECT MIN(ctid) … GROUP BY path) via postgres_execute.
* Dockerfile copies the SQL file alongside maintenance.py.
* README updated to reflect the current maintenance surface (dedup,
  cleanup-all, compact tuning flags, throughput log line, debug flag,
  maintenance.sql macros).

Catalog-side orphans — rows in ducklake_files_scheduled_for_deletion
whose S3 key no longer exists — poison cleanup-all because the S3 DELETE
returns NoSuchKey and the whole txn rolls back (DuckLake bug c1). Operators
need to be able to enumerate them before deciding to heal.

The macro takes the data path as a parameter rather than baking it in:
DuckDB 1.4 evaluates a literal glob() inside CREATE MACRO eagerly, so a
no-arg macro with a templated path would fire an S3 LIST on every
connect(), even for cron runs that don't care. The Python find-orphans
subcommand passes DUCKLAKE_DATA_PATH for you; ad-hoc users from just
shell pass the path themselves.

Adds tools/maintenance.sql header note about the eager-eval gotcha so
the next person doesn't repeat the experiment.

Add _acquire_advisory_lock helper that calls pg_try_advisory_lock on the
pg ATTACH and raises if another maintenance session already holds it.
Wire it into dedup-deletions before the DELETE so concurrent invocations
of the same recipe can't race the same rows.

The lock is keyed off hashtext('millpond-ducklake-maintenance')::bigint
and released implicitly on session close, so single-subcommand runs need
no explicit release. The helper is the shared entry point for the rest
of the SQL-first recipes (heal-orphans, cleanup-all-safe, fsck) that
will land in subsequent commits.

Five-step gated procedure that deletes catalog rows from
ducklake_files_scheduled_for_deletion whose S3 key no longer exists.
The set is materialized into a TEMP TABLE so the safety gates and the
final DELETE all see the same snapshot:

  B1: ducklake_data_file is non-empty AND zero of the orphan paths
      appear there (no vacuous pass on an empty catalog; no false
      positive that would delete a still-live file)
  B3: zero positional delete vectors (ducklake_delete_file) reference
      an orphan data_file_id (such a file is still live for vector
      lookups)
  H4: advisory lock acquired before the DELETE
  B2: DELETE matches on path, which is UUIDv7-globally-unique per
      quirk r3
  H1: single postgres_execute call so the DELETE is atomic at the
      upstream Postgres layer

Adds a small _sql_string_literal helper for safe quoting of the
nested SQL strings sent through postgres_execute, and threads it
through the dedup-deletions DELETE for consistency.

Loop dedup-deletions + heal-orphans + cleanup-all under the advisory
lock until cleanup-all exits clean. Until DuckLake bug c1 lands, every
crashed cleanup-all (NoSuchKey on S3 DELETE rolls back the whole txn)
creates fresh catalog-side orphans, so the next cleanup-all attempt
needs the queue healed before it can succeed. Cap at --max-iterations
(default 10) and raise non-zero if exhausted; the lock is taken once
for the whole orchestration so the steps share mutual exclusion.
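
A sketch of that loop, reusing the helpers sketched earlier in this description; heal_orphans and cleanup_all here are hypothetical stand-ins for this PR's step implementations:

```python
import duckdb

def cleanup_all_safe(conn, max_iterations: int = 10) -> None:
    _acquire_advisory_lock(conn)       # one lock for the whole orchestration
    for _attempt in range(1, max_iterations + 1):
        dedup_deletions(conn)          # drop duplicate queue rows (bug c5)
        heal_orphans(conn)             # drop catalog rows whose S3 key is gone
        try:
            cleanup_all(conn)
            return                     # exited clean; queue drained
        except duckdb.IOException:     # NoSuchKey rolled the txn back (bug c1)
            continue                   # fresh orphans; heal and retry
    raise SystemExit(f"queue not drained after {max_iterations} iterations")
```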

Wraps cleanup-all-safe (dedup + heal-orphans + cleanup-all in a loop)
followed by ducklake_delete_orphaned_files to mop up S3-side orphans
from any prior interrupted writes. Tiered compaction is intentionally
not bundled — run compact-to-tier-N separately.

Dry-run reports duplicate-queue depth, catalog-side orphan count, and
the S3-orphan dry-run output without mutating anything; pair with
fsck-dry-run in the justfile for ad-hoc audits.

Found by canary against the docker-compose lake:

1. connect() configured S3 access via the legacy SET s3_* form only.
   The ducklake catalog driver honors that form, but DuckDB 1.4's httpfs
   does not — ad-hoc glob('s3://...') / read_parquet('s3://...') go
   through the SECRET manager. find-orphans (which uses glob() inside
   the find_catalog_orphans macro) failed with HTTP 403 on every real
   call. Mirror the same credentials into a CREATE OR REPLACE SECRET
   alongside the existing SETs so both paths authenticate.

2. _acquire_advisory_lock embedded the inner SQL into postgres_query
   without doubling the single quotes around 'millpond-ducklake-
   maintenance', producing a ParserException at the duckdb-side parser.
   Use the existing _sql_string_literal helper to wrap the inner SQL,
   matching the pattern dedup-deletions and heal-orphans already use.

Copilot AI left a comment


Pull request overview

Adds a broad set of maintenance-tool follow-ups around DuckLake operations: safer cleanup/orphan recovery workflows, lower-default resource usage for compaction, and supporting SQL/docs/image plumbing so operators can run these recipes from the shipped container.

Changes:

  • Extends tools/maintenance.py with new maintenance subcommands, advisory locking, throughput logging, compaction tuning defaults, and SQL macro loading.
  • Adds tools/maintenance.sql, wires it into the image and interactive tooling, and updates docs/agent guidance around the new maintenance flow.
  • Adds pytz to runtime dependencies and updates libpq escaping/tests for current Postgres/libpq behavior.

Reviewed changes

Copilot reviewed 9 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| uv.lock | Locks the new pytz runtime dependency. |
| tools/maintenance.sql | Adds reusable DuckDB macros and maintenance SQL conventions. |
| tools/maintenance.py | Implements the new maintenance/orphan-recovery/compaction behavior. |
| tools/justfile | Expands operator recipes and interactive shell setup. |
| tests/unit/test_ducklake.py | Updates _escape_libpq expectations. |
| README.md | Documents the expanded maintenance workflow and commands. |
| pyproject.toml | Declares pytz as a runtime dependency. |
| millpond/ducklake.py | Updates libpq escaping implementation/docs in app code. |
| Dockerfile | Copies the new SQL macro file into the image. |
| AGENT.md | Records new maintenance-tooling conventions and metadata-schema guidance. |
| .gitignore | Ignores FOLLOWUPS.md. |


Comment thread tools/justfile
Comment thread README.md Outdated
Comment thread tools/justfile Outdated
jghoman added 3 commits May 4, 2026 14:06
Five subcommands and the supporting plumbing landed without docs:
find-orphans, heal-orphans (with B1/B3 safety gates), cleanup-all-safe,
fsck, plus the advisory-lock convention and the SECRET-vs-legacy-S3
split in connect(). Adds a 'Catalog-side orphan recovery' subsection
to the README's DuckLake Maintenance section and expands AGENT.md's
Maintenance Tooling section to cover the recovery toolkit, the SQL
macro file's eager-eval gotcha, and the throughput log line.

The .sql file is loaded two ways: maintenance.py runs it through
str.format() before conn.execute(), but the just shell recipe sources
it via the duckdb CLI's .read meta-command, which reads it verbatim.
The {schema} placeholders survived the second path, so opening the
shell hit a parser error instead of loading the advertised macros.

Drop the templating: write __ducklake_metadata_lake literally, update
the file's header to document the no-templating rule, drop the
.format() call in connect(). Both paths now execute the same bytes.
ATTACH_NAME / METADATA_SCHEMA still exist for Python-side queries.
AGENT.md updated to match.

Previously the SECRET was hardcoded to s3.<region>.amazonaws.com /
URL_STYLE 'vhost' / USE_SSL true, which broke just shell against the
repo's own MinIO compose stack and any non-AWS S3-compatible target,
even though maintenance.py picked up these env vars correctly.

Read all three from env with the same AWS-style defaults
maintenance.py uses, so AWS deployments still work without any extra
config and MinIO / R2 / etc. work out of the box once the env is set.

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 11 changed files in this pull request and generated 3 comments.



Comment thread tools/maintenance.py
Comment thread tools/maintenance.py Outdated
Comment thread tools/maintenance.py
jghoman added 3 commits May 4, 2026 14:20
The docstring claimed step 1 was 'take the advisory lock' but the code
took it at step 5 (right before the DELETE), after the scan and the
B1/B3 gates had already run. For the standalone heal-orphans
invocation, that left a window where another maintenance writer could
change ducklake_files_scheduled_for_deletion between scan and DELETE,
so the gates could have passed against state that no longer held by
the time we mutated.

(The cleanup-all-safe orchestrator wasn't affected — it took the lock
once for the whole loop — but the standalone subcommand is exactly the
case the docstring's snapshot claim was made for.)

Move the acquire to the top of heal_orphans, before materializing.
Skip on --dry-run since dry-run doesn't mutate and the dry-run output
is informational only.

The dry-run path inlined raw count queries against count_pending_dups()
and find_catalog_orphans() and then called orphans(dry_run=True). It
never went through heal_orphans, so the B1 (vacuous-pass + 'orphan'
still live) and B3 (positional delete vector references) gates were
skipped — a real fsck that would abort during heal-orphans showed as
healthy in the dry-run.

Delegate to the dry-run forms of the individual steps. heal_orphans
with dry_run=True runs both gates and raises on failure, so the
dry-run now produces the same go/no-go signal the real fsck would.

29 tests covering tier A (pure helpers and argparse plumbing) and
tier B (orchestrator logic with mocked sub-calls):

* _sql_string_literal quote-doubling
* _log_cleanup_throughput message shape, divide-by-zero guard
* argparse plumbing for the new subcommands (dedup-deletions,
  find-orphans, heal-orphans, cleanup-all-safe, fsck) plus
  --threads / --memory-limit on compact
* cleanup_all_safe retry loop: succeed-on-first, retry-after-IOException,
  exhaust-iterations
* fsck: dry-run delegates to dry-run sub-calls (so heal-orphans gates
  actually run), real run goes through cleanup_all_safe, gate failure
  propagates out of dry-run
* _set_compaction_tuning emits expected SETs and rejects injection
* _acquire_advisory_lock emits the doubled-quote postgres_query form
  and raises when another session holds the lock

Tier C (macros against a stubbed catalog) and tier D (full e2e against
the docker-compose stack) remain follow-up work — they need either an
in-process duckdb with stubbed schemas or a real lake.

Adds pythonpath = ["tools"] to pytest config so 'import maintenance'
resolves.

…e default

The previous restore logic checked 'restore == DEFAULT_TARGET_FILE_SIZE'
to decide whether to log the conversion-failure warning, but
_bytes_to_human('134217728') returns '128MiB', which is exactly
DEFAULT_TARGET_FILE_SIZE. Healthy installs that already use the
default emitted a warning on every compaction run, making it look
like the restore was broken when it wasn't.

Gate the warning on 'converted is None' (the genuine conversion
failure case) instead. Add 6 unit tests pinning the four cases:
default-value round-trip → no warning; non-default convertible value
→ no warning; non-power-of-1024 input → warning; prior unset → no
warning; plus assertions that the restore SET targets the converted
form and the default fallback respectively.

Also tidy the awkward 'if v else None' validation pattern in
connect()'s SECRET path to a plain 'if v: _sanitize_setting_value(v)'.

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 12 changed files in this pull request and generated 4 comments.



Comment thread tools/maintenance.py Outdated
Comment thread millpond/ducklake.py Outdated
Comment thread tools/justfile Outdated
Comment thread README.md

Earlier in this branch a7 changed _escape_libpq to SQL-style ''
doubling on the premise that "modern libpq with
standard_conforming_strings=on requires '' doubling". That premise
conflated two different parsers:

  * The Postgres SQL parser interprets string literals; '' is
    the embedded-quote form there, governed by
    standard_conforming_strings.
  * The libpq connection-string parser is a separate grammar that
    has always required backslash escapes (\' and \\) inside
    quoted values. It is not affected by
    standard_conforming_strings.

Applying SQL rules at the libpq layer broke every code path that
builds a connstring once a value contains an apostrophe — which
never showed up in canary because every test password was alphanumeric.

Restore the original backslash escaping in millpond/ducklake.py
and tools/maintenance.py with a docstring that names the grammar
explicitly so this doesn't get re-broken. Revert the unit tests
to assert the libpq form. In tools/justfile, rewrite the inner
(libpq) layer to backslash-escape \ and ', then keep the outer
SQL-wrap layer unchanged (every ' in the resulting connstring
still gets doubled because the whole thing is wrapped in
'ATTACH '...' AS lake').

Verified end-to-end against a postgres:17 container booted with
POSTGRES_PASSWORD=O'Brien:

  * maintenance.py connect() authenticates and ATTACHes
  * just shell (run inside the millpond image so duckdb-cli +
    duckdb-python versions match) authenticates via the pg
    ATTACH and returns 'PostgreSQL 17.9 ...' from
    postgres_query('pg', 'SELECT version()').
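
At the branch tip the escaping is back to the libpq form. A sketch matching the behavior described (the O'Brien case is the one verified above; the exact helper body is assumed):

```python
def _escape_libpq(value: str) -> str:
    # libpq connstring grammar (not the Postgres SQL parser): inside a
    # quoted value, backslash-escape \ and '; standard_conforming_strings
    # has no effect at this layer.
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

assert _escape_libpq("O'Brien") == r"'O\'Brien'"
```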

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 12 changed files in this pull request and generated 2 comments.



Comment thread tools/justfile Outdated
Comment thread tools/maintenance.sql

Two real issues from the latest review:

* tools/justfile baked SET pg_debug_show_queries = true into the
  shell setup, so every interactive session paid the per-call
  overhead this PR's a14 commit had explicitly defaulted to off in
  maintenance.py. Mirror the maintenance.py policy: default both
  enable_http_logging and pg_debug_show_queries to false in _setup;
  operators who want them temporarily can SET them after entering
  the shell.

* The find_catalog_orphans / count_pending_dups macros had no
  automated coverage, even though path-normalization in that area
  is exactly where heal-orphans's B1 gate had a real regression.
  Add tests/unit/test_maintenance_macros.py: 11 tests against an
  in-memory DuckDB with a stub __ducklake_metadata_lake schema and
  a real local-filesystem glob, covering count_pending_dups
  arithmetic and find_catalog_orphans for absolute / relative /
  mixed path forms (per quirk r1) plus the empty-queue and
  empty-listing edge cases.

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 3 comments.



Comment thread tools/maintenance.py Outdated
Comment thread tools/maintenance.sql
Comment thread tools/maintenance.py

Three issues from review:

1. heal-orphans's B1 gate counted every row in ducklake_data_file
   without filtering on end_snapshot. Historical (expired) rows
   matching an orphan path made would_be_live non-zero and aborted
   heal-orphans on perfectly valid queue entries. The compact
   subcommand uses 'end_snapshot IS NULL' as the live-row marker
   throughout — apply the same filter to the B1 gate's data_abs CTE
   and to its 'vacuous catalog' guard so an empty live set is
   distinguished from an empty catalog. Same fix to B3 against
   ducklake_delete_file: only LIVE positional-delete vectors should
   block heal-orphans; expired ones are not load-bearing.

2. tools/maintenance.sql hard-codes __ducklake_metadata_lake because
   the .read load path in 'just shell' is verbatim and templating is not
   compatible with that load path (we tried; it produced parser
   errors). Add a unit test that fails loudly if anyone changes
   ATTACH_NAME without updating the .sql file's references — the
   .sql header documented the coupling but had no enforcement, and
   the bot correctly flagged that as a single-source-of-truth gap.

3. Add 6 unit tests against the stub schema covering both gates with
   live, expired, and mixed end_snapshot values, plus the schema-
   coupling assertion.

Also updated the PR description to reflect what _escape_libpq and
_scoped_target_file_size actually do at the current branch tip
(the earlier table claimed behavior that was reverted partway
through review).

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 3 comments.



Comment thread tools/maintenance.sql Outdated
Comment thread tools/maintenance.py Outdated
Comment thread tests/unit/test_maintenance_macros.py

A DUCKLAKE_DATA_PATH ending in '/' (e.g. s3://bucket/lake/data/) is a
common operator-supplied form. Both find_catalog_orphans (in
maintenance.sql) and the B1 safety gate (in heal_orphans) concatenated
data_path || '/' || key without stripping a trailing slash, so a
relative-form queue path normalized to '.../data//file.parquet' that
never matched the absolute form '.../data/file.parquet'. Net result:
find-orphans flagged live files as orphans, and heal-orphans's gate
let through queue entries pointing at still-live files.

Wrap data_path in rtrim(data_path, '/') everywhere it concatenates a
key. Also broaden the B1 gate's absolute-path detection from
path LIKE 's3://%' to LIKE 's3://%' OR LIKE '/%', so a /-rooted path
(stored locally during tests; defensible if it ever appears in
production via misconfiguration) isn't doubly-prefixed by the data
path.

Adds 4 tests pinning the trailing-slash variant: find_catalog_orphans
with absolute-form queue, with relative-form queue, with a real
orphan present, and the B1 gate against a still-live file via the
relative-form queue path.
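
The normalization this commit describes reduces to something like the following SQL fragment (shape assumed from the description; shown as the constant the Python side would template in):

```python
NORMALIZED_PATH = """
CASE WHEN path LIKE 's3://%' OR path LIKE '/%'
     THEN path                                    -- already absolute
     ELSE rtrim(data_path, '/') || '/' || path    -- relative-form key
END
"""
```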

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 3 comments.



Comment thread tools/maintenance.py Outdated
Comment thread tools/maintenance.py Outdated
Comment thread tools/justfile Outdated

Three issues from review plus two adjacent ones I caught in audit:

* The cleanup throughput line derived files_processed from a
  queue-depth delta. Without holding the advisory lock (and the lock
  doesn't serialize against arbitrary writers anyway) the delta is
  wrong whenever any other writer enqueues during the call — could
  even go negative. Switch to len(result), the actual rows
  ducklake_cleanup_old_files returned. Accurate regardless of
  concurrent writers. Drop queue_depth_before; keep queue_depth_after
  as a remaining-work signal.

* Skip the throughput log on cleanup --dry-run. dry-run returns
  preview rows, so len(result) would claim work was done when none
  was — the old delta version printed 0 by accident.

* Validate DUCKDB_S3_USE_SSL in tools/justfile against a strict
  whitelist. Previously forwarded raw into both SET s3_use_ssl
  (SQL string literal, vulnerable to a quote) and CREATE OR REPLACE
  SECRET USE_SSL (unquoted, raw SQL injection). Just-side if/else
  error fails fast on bad input; matches the equivalent guard in
  maintenance.py via _sanitize_setting_value.

* Extract _heal_orphans_b1_counts as a module-level helper so
  production heal_orphans and unit tests share one source of truth
  for the gate query.

* Add 5 unit tests covering the gates' LIKE 's3://%' branch with literal
  s3:// paths in queue / data_file (matched, mixed forms, trailing-
  slash data_path, expired rows). Pre-existing tests used local-
  filesystem paths only.

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 2 comments.



Comment thread README.md Outdated
Comment thread AGENT.md Outdated

The previous commit changed the throughput log shape (dropped
queue_depth_before; switched files_processed from a queue-depth delta
to len(result); skipped on --dry-run) but the README example and the
AGENT.md rationale still described the old shape. Operators copying
the README example would have searched for a field that doesn't exist.

Update both to the current shape and call out that files_processed is
the operation's return-row count (race-free) rather than a delta.
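
Putting the final shape together, a sketch of the logger as described here and pinned by the review snippet below (the helper signature is assumed):

```python
import logging

log = logging.getLogger("maintenance")

def _log_cleanup_throughput(operation: str, result: list,
                            elapsed_s: float, queue_depth_after: int) -> None:
    # files_processed is the operation's return-row count, not a queue
    # delta, so concurrent writers can't skew (or negate) it.
    rate = len(result) / elapsed_s if elapsed_s > 0 else 0.0  # divide-by-zero guard
    log.info(
        "%s throughput: files_processed=%d elapsed_s=%.1f rate_obj_s=%.1f queue_depth_after=%d",
        operation, len(result), elapsed_s, rate, queue_depth_after,
    )
```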

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 2 comments.



Comment thread tools/maintenance.py
Comment on lines +274 to +275
"%s throughput: files_processed=%d elapsed_s=%.1f rate_obj_s=%.1f queue_depth_after=%d",
operation,
Comment thread tools/maintenance.py
Comment on lines +172 to +173
conn.execute(f"SET enable_http_logging = {str(debug).lower()}")
conn.execute(f"SET pg_debug_show_queries = {str(debug).lower()}")
@jghoman jghoman merged commit 2876a88 into main May 4, 2026
19 checks passed
@jghoman jghoman deleted the maintenance-followups branch May 4, 2026 23:15
