Row Lineage and Change Data Feed #4935
Replies: 4 comments 4 replies
-
@jackye1995 Thanks for summarizing and writing down the design for row lineage and CDF. I agree that distinguishing between the two concepts of row lineage and CDF is a good idea. Do we need to split the work into multiple subtasks, or combine them with CDF as in my PR or your experimental PR? The benefit of combining is that we can use CDF to test row lineage; the disadvantage is that the PR becomes huge.
Is this metadata also new for Lance?
-
What's more, shall we introduce an API design to expose the row lineage capability, or just reuse the
-
Another aspect of CDF is the changed columns in a row. Traditionally, this requires getting the pre- and post-image of the row and then comparing each column to determine which columns changed in value. I think with Lance's 2-dimensional table concept and the row lineage feature we are building here, there could be a better solution. Similar to how we track rows, we can also track the creation time, update time, and deletion time of a column at the fragment level, and at CDF generation time we can use the 2x2 information to produce an accurate result. This is just a thought at the moment; I'm putting it here in case anyone would like to explore it more.
-
I have two questions – could you reply when you're available?
-
Vino (@yanghua) and I have been designing the row lineage and change data feed features for some time now (#4895, #4741). I think it's better for me to write down the full design I have in mind after experimenting with various approaches.
Prior Art
In general, row lineage and CDF are different features, and both Iceberg and Delta implement them separately. In fact, CDF predates row lineage in both formats, and the two are not very compatible.
However, I think there is a path to completely reuse row lineage to perform CDF. A key benefit is that CDF becomes completely decoupled from table versioning. This is important because you might want to consume CDF at any point in time (e.g., a downstream job refreshing an MV based on it), and if the table maintenance process removes old versions, the CDF should remain consumable.
Today Iceberg relies on historical versions being present, which makes coordination difficult. Also, using historical versions to construct CDF can be quite inefficient. Delta's CDF produces additional, independent CDF files, but this essentially mirrors all writes twice, so the write performance impact is large.
Design
Row Lineage
We want to introduce the following row lineage columns:
1. `_rowid` with stable row ID
2. `_row_created_at_version`
3. `_row_last_updated_at_version` (new in Lance)

For 2 and 3, in general we track these metadata in the manifest's fragments. These metadata are typically very small, because most rows in a fragment are added and updated at the same time. We use a run-length encoded sequence to record them:
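As a minimal illustration of the idea (this is a sketch in Python, not Lance's actual encoding; the function names and the `(run_length, version)` pair layout are assumptions):

```python
def rle_encode(versions):
    """Compress a per-row version list into (run_length, version) runs."""
    runs = []
    for v in versions:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1
        else:
            runs.append([1, v])
    return [(n, v) for n, v in runs]

def rle_decode(runs):
    """Expand runs back into one version per row."""
    return [v for n, v in runs for _ in range(n)]

# Most rows in a fragment are created/updated at the same version,
# so the encoded sequence is typically a single run.
created = [3, 3, 3, 3, 3]
updated = [3, 3, 7, 7, 3]
print(rle_encode(created))  # [(5, 3)]
print(rle_encode(updated))  # [(2, 3), (2, 7), (1, 3)]
```

The common case compresses to one pair per fragment, which is why the metadata stays very small.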
By tracking both the created and updated versions, we can now get the rows inserted or updated between v1 and v2 easily as:
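A minimal sketch of that lookup, assuming the lineage columns are materialized per row; the function and the exact filter semantics (v1 exclusive, v2 inclusive) are assumptions for illustration, not Lance's actual API:

```python
def changed_rows(rows, v1, v2):
    """rows: dicts carrying _row_created_at_version and
    _row_last_updated_at_version (per the lineage columns above)."""
    # Inserted: the row first appeared after v1 and by v2.
    inserted = [r for r in rows
                if v1 < r["_row_created_at_version"] <= v2]
    # Updated: the row existed at v1 but was last touched in (v1, v2].
    updated = [r for r in rows
               if r["_row_created_at_version"] <= v1
               and v1 < r["_row_last_updated_at_version"] <= v2]
    return inserted, updated
```

Because the check is a pure predicate on the two lineage columns, it needs no access to historical table versions.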
Handling Deleted Rows
Similarly, we can also track `_row_deleted_at_version`. However, we face 2 challenges:

For 1, we can introduce a mode to query the deleted rows. When this mode is enabled, we will not remove the rows with a delete marker, but instead show them with a non-null `_row_deleted_at_version`. For this design, let's say in SQL we have a special system table `table$deleted_rows` which behaves like this.

For 2, we can introduce a system index, let's call it the `deleted_rows_index`. When either case in 2 happens, we will update the `deleted_rows_index` with pointers to the fragments that contain those deleted rows. This is pretty much similar to how we already record an extra fragment reuse index during compaction. And similar to the fragment reuse index, this index can be maintained independently to fulfill the business need. For example, a table can declare that it provides CDF for the past N versions or X days as a service level agreement, without needing to worry about how the table maintenance process trims old table versions.

Then, going back to 1, `SELECT * FROM table$deleted_rows WHERE _row_deleted_at_version IS NOT NULL` basically becomes combining all the rows in the `deleted_rows_index` plus the rows in the table that are marked as deleted in deletion vectors.

And we can now answer the question of "what rows are deleted between v1 and v2" by:
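A sketch of that combination, assuming both sources expose `_row_deleted_at_version`; all names here are illustrative, not Lance's actual API:

```python
def deleted_between(index_rows, dv_marked_rows, v1, v2):
    """index_rows: rows reachable through the deleted_rows_index.
    dv_marked_rows: rows still in the table but marked deleted in
    deletion vectors. Both carry _row_deleted_at_version."""
    # Union the two sources, then keep only deletions in (v1, v2].
    candidates = index_rows + dv_marked_rows
    return [r for r in candidates
            if r["_row_deleted_at_version"] is not None
            and v1 < r["_row_deleted_at_version"] <= v2]
```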
Constructing CDF
Now with these 2 prerequisites, we can construct the full CDF between any 2 versions. First, we union the results of the previous 2 steps. Then, for each row ID, we need to deal with the situation that it contains either:

The output schema is `row_id, version, operation, pre_image, post_image`; a delete is the pre-image of its next delete or update.
The key is that all of this can be described as a SQL transformation, which means we can now just run a DataFusion SQL query to get the CDF:
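A minimal sketch of such a transformation, assuming a `changes` relation that unions the inserted, updated, and deleted rows; both the SQL string and the Python equivalent are illustrative, not Lance's actual query:

```python
from itertools import groupby
from operator import itemgetter

# Illustrative DataFusion-style SQL: LAG() supplies each change's
# pre-image from the previous change of the same row_id.
CDF_SQL = """
SELECT row_id, version, operation,
       LAG(post_image) OVER (PARTITION BY row_id ORDER BY version) AS pre_image,
       post_image
FROM changes
"""

def build_cdf(changes):
    """Pure-Python equivalent of the LAG() query above.

    changes: dicts with row_id, version, operation, post_image,
    i.e. the union of the inserted, updated, and deleted rows."""
    out = []
    ordered = sorted(changes, key=itemgetter("row_id", "version"))
    for _, group in groupby(ordered, key=itemgetter("row_id")):
        prev_image = None  # previous change's image is this change's pre-image
        for change in group:
            out.append({**change, "pre_image": prev_image})
            prev_image = change["post_image"]
    return out
```

Expressing the whole thing as one SQL transformation is what lets the engine (DataFusion here) do the per-row ordering and pre-image pairing without any bespoke CDF machinery.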