Contributor

@vaibhavk1992 vaibhavk1992 commented Jun 30, 2025

Important Read

  • Please ensure the GitHub issue is mentioned at the beginning of the PR

What is the purpose of the pull request

This PR introduces a Delta Kernel based reader, so that Delta tables are read through the Kernel library instead of through a Spark session.

Brief change log

  • Implemented DeltaKernelConversionSource to read the table using kernel library
  • Added Delta Kernel API and Delta Kernel Defaults as compile-time dependencies
  • Added DeltaKernelTableExtractor for metadata extraction
  • Implemented DeltaKernelSchemaExtractor for schema conversion
  • Added DeltaKernelPartitionExtractor for partition handling
  • Implemented DeltaKernelStatsExtractor for statistics extraction
  • Added unit tests and an integration test to cover the edge cases

Verify this pull request

This change adds unit tests and an integration test to verify the edge cases:

-Unit Tests
-TestDeltaKernelSchemaExtractor

  • Tested primitive type conversions
  • Validated complex type handling (structs, arrays, maps)
  • Verified metadata preservation
  • Tested timestamp and decimal handling
  • Validated field metadata (comments, nullability)

-TestDeltaKernelPartitionExtractor

  • Tested partitioned and unpartitioned tables
  • Verified partition value extraction
  • Validated date and timestamp partition handling
  • Tested nested partition fields

-TestDeltaKernelStatsExtractor

  • Tested statistics extraction
  • Validated range and value statistics
  • Tested null count handling

-Integration Tests

  • Added end-to-end integration tests

@vaibhavk1992 vaibhavk1992 marked this pull request as draft June 30, 2025 10:42
@vaibhavk1992 vaibhavk1992 marked this pull request as ready for review June 30, 2025 15:40
Contributor

@vinishjail97 vinishjail97 left a comment

This is great progress @vaibhavk1992, added some comments.

@vaibhavk1992
Contributor Author

@the-other-tim-brown @vinishjail97 @rahil-c All the comments have been addressed and the build is passing too.
Please review for the final merge.

myScan.getScanFiles(engine, includeColumnStats);

List<InternalDataFile> dataFiles = new ArrayList<>();
while (scanFiles.hasNext()) {
Contributor

This is eagerly pulling all the files into memory instead of using the iterator pattern. Can the code be updated to iterate through these values when next is called instead of eagerly materializing them?

Contributor Author

@the-other-tim-brown
scanFiles.hasNext() returns true if the iteration has more elements (in other words, it returns true if next() would return an element rather than throw an exception). I am not sure what you mean by materializing here, or what exactly needs to change.
I followed the convention suggested in the Kernel community's user guide. Please check it:
https://github.com/delta-io/delta/blob/master/kernel/USER_GUIDE.md

Contributor

The idea is that the DeltaDataFileIterator is an iterator so we don't need to compute all the values in the constructor. We can instead compute the next value when next is called.
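To illustrate the difference, here is a minimal sketch of the lazy pattern, not the actual DeltaDataFileIterator. The class name LazyMappingIterator, the converter function, and the file names are hypothetical and chosen only for this example. In the real code the source would be the Kernel scan-file iterator, which presumably also needs to be closed; that part is left out here.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical helper: converts one element at a time, only when next() is called. */
final class LazyMappingIterator<S, T> implements Iterator<T> {
  private final Iterator<S> source;
  private final Function<S, T> converter;

  LazyMappingIterator(Iterator<S> source, Function<S, T> converter) {
    // The constructor only stores references; nothing is read from the source here.
    this.source = source;
    this.converter = converter;
  }

  @Override
  public boolean hasNext() {
    // Delegates to the source, so no conversion work happens in hasNext().
    return source.hasNext();
  }

  @Override
  public T next() {
    if (!source.hasNext()) {
      throw new NoSuchElementException();
    }
    // The conversion happens here, for exactly one element per call.
    return converter.apply(source.next());
  }
}

final class LazyIteratorDemo {
  public static void main(String[] args) {
    AtomicInteger conversions = new AtomicInteger();
    // Stand-in for the scan-file iterator; the names are made up.
    Iterator<String> files = List.of("a.parquet", "b.parquet", "c.parquet").iterator();
    Iterator<Integer> lengths =
        new LazyMappingIterator<>(
            files,
            name -> {
              conversions.incrementAndGet();
              return name.length();
            });

    System.out.println(conversions.get()); // prints 0: nothing converted yet
    lengths.next();
    System.out.println(conversions.get()); // prints 1: one element converted on demand
  }
}
```

With this shape, memory use stays proportional to one element rather than to the whole file listing. An eager version would instead fill an ArrayList inside the constructor, before the caller asks for anything.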

Contributor Author

I have updated the code. Please check whether it makes sense. @the-other-tim-brown

@the-other-tim-brown
Contributor

@vaibhavk1992 make sure to fill out the PR template
