Summary
UPDATE COLUMNS FROM commits a new dataset version but does not advance
_row_last_updated_at_version on rows whose column values were rewritten.
ALTER TABLE ... UPDATE COLUMNS ... FROM (implemented by UpdateColumnsBackfill in
lance-spark) commits through Lance's CommitBuilder and Transaction as a normal
Update with UpdateMode.RewriteColumns. The table version increases, but the
per-row change-data metadata column _row_last_updated_at_version can keep its
pre-commit value (for example, still equal to _row_created_at_version), even
though data in the updated columns changed.
Expected behavior
According to the Lance row lineage and change-data feed (CDF) docs, _row_last_updated_at_version
is the dataset version at which the row was last modified. If a write creates a
new dataset version and changes visible row data for matched rows, those rows
should get _row_last_updated_at_version set to that new version.
_row_created_at_version should stay at the version where the row first appeared.
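
The expected semantics can be sketched with a small in-memory model. This is only a toy illustration in plain Python, not Lance's implementation: the Row class, the apply_update helper, and the sample values are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Row:
    id: int
    value: str
    created_at_version: int       # models _row_created_at_version
    last_updated_at_version: int  # models _row_last_updated_at_version


def apply_update(rows, updates, current_version):
    """Apply {id: new_value} updates and commit them as current_version + 1.

    Only rows whose id appears in `updates` get last_updated_at_version
    set to the new version; created_at_version is never touched.
    """
    new_version = current_version + 1
    new_rows = []
    for row in rows:
        if row.id in updates:
            new_rows.append(
                Row(row.id, updates[row.id], row.created_at_version, new_version)
            )
        else:
            new_rows.append(
                Row(row.id, row.value, row.created_at_version,
                    row.last_updated_at_version)
            )
    return new_rows, new_version


# Three rows inserted at (hypothetical) version 1, then only id = 2 is updated.
rows_v1 = [Row(i, f"old-{i}", 1, 1) for i in (1, 2, 3)]
rows_v2, v2 = apply_update(rows_v1, {2: "new-2"}, current_version=1)
```

In this model only id = 2 ends up with last_updated_at_version equal to the new version, while all three rows keep their original created_at_version.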
Actual behavior
After UPDATE COLUMNS FROM, rows that had columns rewritten can still show the
same _row_last_updated_at_version as before, while the dataset version has moved
forward on commit.
Reproduction
- Create a Lance table with stable row IDs enabled (enable_stable_row_ids).
- Insert several rows (e.g. id 1, 2, 3) so CDF columns exist; note dataset version V0 and
_row_created_at_version / _row_last_updated_at_version for each row.
- Run ALTER TABLE ... UPDATE COLUMNS ... FROM with a source view that updates only one row
(e.g. id = 2); leave id 1 and 3 unchanged in the source.
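- Read _row_last_updated_at_version for id = 2: it may still equal the pre-update value (for
example, still matching _row_created_at_version) even though the dataset version advanced past V0.
- Check id 1 and 3: their _row_last_updated_at_version should keep its pre-update value, since
those rows were not changed.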
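
The checks in the steps above could be automated along the following lines. This is a minimal sketch that assumes per-row snapshots have already been read before and after the commit; the find_lineage_violations helper, the snapshot layout, and the sample values are hypothetical and only mirror the symptom described in this ticket, not actual output.

```python
def find_lineage_violations(before, after, new_version):
    """Compare per-row snapshots taken before and after the commit.

    Each snapshot maps id -> (value, created_at_version, last_updated_at_version).
    Returns a list of human-readable violation messages.
    """
    violations = []
    for row_id, (old_val, old_created, old_updated) in before.items():
        new_val, new_created, new_updated = after[row_id]
        # created_at must never change for an existing row.
        if new_created != old_created:
            violations.append(
                f"id={row_id}: created_at changed {old_created} -> {new_created}"
            )
        # A row whose data changed must carry the new dataset version.
        if new_val != old_val and new_updated != new_version:
            violations.append(
                f"id={row_id}: data changed but last_updated_at is "
                f"{new_updated}, expected {new_version}"
            )
        # A row whose data did not change must not be bumped.
        if new_val == old_val and new_updated != old_updated:
            violations.append(
                f"id={row_id}: data unchanged but last_updated_at moved "
                f"{old_updated} -> {new_updated}"
            )
    return violations


# Hypothetical snapshots: V0 = 1, the update commits version 2.
before = {1: ("a", 1, 1), 2: ("b", 1, 1), 3: ("c", 1, 1)}
after_buggy = {1: ("a", 1, 1), 2: ("B", 1, 1), 3: ("c", 1, 1)}     # symptom in this ticket
after_expected = {1: ("a", 1, 1), 2: ("B", 1, 2), 3: ("c", 1, 1)}  # desired outcome

buggy_report = find_lineage_violations(before, after_buggy, new_version=2)
expected_report = find_lineage_violations(before, after_expected, new_version=2)
```

With these sample snapshots, the buggy case yields a single violation for id = 2 and the expected case yields none.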
Note
This ticket builds on the following issues:
- lance#6464, "JNI loses version metadata and row-ID lookup is incorrect for stable row IDs
during updates": JNI loses version metadata, and the stable row ID lookup in Operation::Update
is incorrect (see fix PRs such as lance#6465, "fix: serialize version metadata through JNI and
correct row-ID lookup").
- #406, "Spark connector cannot preserve stable row IDs across updates": the Spark connector
cannot preserve stable row IDs across standard SQL UPDATE (native DeltaWriter.update path on
Spark 3.5+, etc.).