
API, Core: Add ValidateRewriteTablePath action interface #16967

Open
Priyadarshini-Mitra wants to merge 1 commit into apache:main from Priyadarshini-Mitra:validate-action-interface

Priyadarshini-Mitra commented Jun 26, 2026


Proposal #16868

This PR adds the ValidateRewriteTablePath action interface that verifies every metadata, data, and delete file the source table references is present at the destination. The most common pairing is with RewriteTablePath, but the action is designed to operate against any source/destination pair where the destination is supposed to be a copy of the source.

The source table is the source of truth: implementations walk source metadata to enumerate expected files, apply any configured prefix rewrite, and check each path at the destination. Files referenced by the source but missing at the destination are reported regardless of whether the destination's own metadata is internally consistent — the validator does not rely on the destination metadata to enumerate expected files.
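
The walk described above can be sketched in plain Java. This is only an illustration of the source-of-truth idea, not the implementation that lands in the follow-up PR: the class name, the `rewrite` and `missingFiles` helpers, the single-prefix rewrite, and the in-memory set standing in for the destination file system are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SourceOfTruthCheck {

  // Hypothetical helper: map a source path to its expected destination path
  // by swapping one prefix. The real action may accept several prefixes.
  public static String rewrite(String path, String sourcePrefix, String destPrefix) {
    if (!path.startsWith(sourcePrefix)) {
      throw new IllegalArgumentException("Path is outside the source prefix: " + path);
    }
    return destPrefix + path.substring(sourcePrefix.length());
  }

  // Enumerates expected files from the SOURCE side only and reports the
  // rewritten paths that are absent at the destination. The destination's own
  // metadata is never consulted; only file existence is checked.
  public static List<String> missingFiles(
      List<String> sourceReferencedFiles,
      String sourcePrefix,
      String destPrefix,
      Set<String> filesPresentAtDestination) {
    List<String> missing = new ArrayList<>();
    for (String sourcePath : sourceReferencedFiles) {
      String expected = rewrite(sourcePath, sourcePrefix, destPrefix);
      if (!filesPresentAtDestination.contains(expected)) {
        missing.add(expected);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    List<String> source = List.of(
        "s3://src/tbl/metadata/v2.metadata.json",
        "s3://src/tbl/data/a.parquet",
        "s3://src/tbl/data/b.parquet");
    // Stand-in for "what exists at the destination".
    Set<String> destination = Set.of(
        "s3://dst/tbl/metadata/v2.metadata.json",
        "s3://dst/tbl/data/a.parquet");
    // prints [s3://dst/tbl/data/b.parquet]
    System.out.println(missingFiles(source, "s3://src/tbl", "s3://dst/tbl", destination));
  }
}
```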

This is kept as a separate action rather than folded into RewriteTablePath so that validation can run against any source/destination pair, regardless of how the destination was produced.

We are splitting the implementation into three sequential PRs:

  1. API contract (interface + factory + Result base)
  2. Spark implementation + core helpers + exception
  3. validate_rewrite_table_path Spark procedure + user docs

This PR establishes the API contract only.

Two common uses (target use cases for the implementation):

Verify a copy before registering it — after producing a destination via RewriteTablePath, run this action to confirm the destination contains every expected file before registering the new metadata.json with a catalog. A failed validation prevents an incomplete copy from becoming a live table.

Audit any source/destination pair for sync: the destination does not have to have been produced by RewriteTablePath. This works against DR replicas, migration targets, backups, DistCp output, manual file copies, or any other out-of-band copy. Run this action to confirm the destination still contains every file the source references, catching missing or stale copies before they cause data loss on failover, wrong query results from a migrated table, or failed restores. This requires loadable source and destination Tables plus rewriteLocationPrefix(...) to map source paths to destination paths. Choose the mode based on what is being audited:

  • Pass destinationSnapshotId (the destination's snapshot id from the previous successful copy/validation) for an incremental check that validates only the delta accumulated since then.
  • Call validateFullTable(true) for an audit-from-scratch that re-validates every source file at the destination regardless of prior state.

What this PR adds

  • API (ValidateRewriteTablePath, org.apache.iceberg.actions): fluent builder for source/destination metadata version, source/destination snapshot id, prefix rewrite (rewriteLocationPrefix(String, String) or rewriteLocationPrefix(Map<String, String>) — entries accumulate, applied longest-prefix-match), validateScope (ALL/LATEST), and validateFullTable override.
  • Core (BaseValidateRewriteTablePath): Immutables-based Result implementation with derived isValid(), missingFileCount(), and validationSummary() fields.
  • ActionsProvider#validateRewriteTablePath(Table) factory method.
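
To make "entries accumulate, applied longest-prefix-match" concrete, here is a minimal standalone sketch. It mirrors the two rewriteLocationPrefix overloads by name only; the class, its `rewrite` method, and the choice to return an empty Optional for unmatched paths are assumptions for illustration, not the behavior of the interface in this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class LongestPrefixRewrite {

  // Accumulated source-prefix -> destination-prefix entries.
  private final Map<String, String> prefixes = new HashMap<>();

  // Single-pair overload: adds one entry and keeps earlier ones.
  public LongestPrefixRewrite rewriteLocationPrefix(String sourcePrefix, String destPrefix) {
    prefixes.put(sourcePrefix, destPrefix);
    return this;
  }

  // Map overload: adds every entry and keeps earlier ones.
  public LongestPrefixRewrite rewriteLocationPrefix(Map<String, String> mappings) {
    prefixes.putAll(mappings);
    return this;
  }

  // Picks the longest configured source prefix that the path starts with.
  // Returning Optional.empty() for an unmatched path is an assumption.
  public Optional<String> rewrite(String path) {
    String best = null;
    for (String candidate : prefixes.keySet()) {
      if (path.startsWith(candidate) && (best == null || candidate.length() > best.length())) {
        best = candidate;
      }
    }
    if (best == null) {
      return Optional.empty();
    }
    return Optional.of(prefixes.get(best) + path.substring(best.length()));
  }

  public static void main(String[] args) {
    LongestPrefixRewrite rewriter = new LongestPrefixRewrite()
        .rewriteLocationPrefix("s3://a/tbl/", "s3://b/tbl/")
        .rewriteLocationPrefix(Map.of("s3://a/tbl/data/", "s3://c/data/"));
    // The more specific data prefix wins over the table-level prefix.
    System.out.println(rewriter.rewrite("s3://a/tbl/data/f.parquet").orElse("<unmapped>"));
    System.out.println(rewriter.rewrite("s3://a/tbl/metadata/v1.metadata.json").orElse("<unmapped>"));
  }
}
```

The sketch matches with a plain `startsWith`, so prefixes are written with a trailing slash here to avoid `s3://a/tbl` also matching `s3://a/tbl2/...`; how the real implementation handles that boundary is not specified in this PR.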

The action operates in three modes selected by which parameters are configured: backfill (source only), incremental (source plus destination), and forced full validation via validateFullTable(true). Implementation lands with the next PR.
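
One way to read the mode selection is sketched below. It assumes that a configured destination snapshot id is what signals "source plus destination" and that validateFullTable(true) takes precedence over it; the enum, the `selectMode` helper, and that precedence are illustrative guesses, since the actual selection logic arrives with the implementation PR.

```java
public class ValidationModeSketch {

  // Hypothetical names for the three modes described above.
  public enum Mode { BACKFILL, INCREMENTAL, FULL }

  // destinationSnapshotId == null models "source only" configuration.
  public static Mode selectMode(Long destinationSnapshotId, boolean validateFullTable) {
    if (validateFullTable) {
      // Forced full validation ignores any prior destination state.
      return Mode.FULL;
    }
    if (destinationSnapshotId != null) {
      // Source plus destination: validate only the delta since that snapshot.
      return Mode.INCREMENTAL;
    }
    // Source only: nothing to diff against, so validate from the source alone.
    return Mode.BACKFILL;
  }

  public static void main(String[] args) {
    System.out.println(selectMode(null, false));  // prints BACKFILL
    System.out.println(selectMode(42L, false));   // prints INCREMENTAL
    System.out.println(selectMode(42L, true));    // prints FULL
  }
}
```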

Constraints

  • Source and destination tables must implement HasTableOperations to resolve metadata file locations.
  • validateScope (ALL vs LATEST) affects backfill mode only; incremental mode always computes the source-vs-destination snapshot diff.

Out of scope (intentional)

  • Content equivalence: the validator checks file existence at the destination, not file content. A destination file whose bytes differ from its source counterpart still passes validation as long as it exists at the expected path.
  • Validation of the destination's metadata internal consistency — separate concern from "are all source-referenced files present at the destination."
  • Multi-prefix copy in RewriteTablePath itself — the validator's rewriteLocationPrefix(Map<String, String>) setter accepts multiple source→destination mappings (resolved via longest-prefix match) once the implementation lands, but RewriteTablePath still rewrites with a single source/target pair. Extending the copy-side action to accept a Map is a planned follow-up.
