create PageIndexPolicy to allow optional indexes #8071

Merged
merged 11 commits into from
Aug 15, 2025

Conversation

@kczimm kczimm commented Aug 6, 2025

Which issue does this PR close?

Rationale for this change

This change introduces a more flexible way to handle page indexes (the column index and the offset index) in Parquet files. Previously, reading these indexes was controlled by boolean flags, which could express only two behaviors: read the index and treat it as required, or do not read it. The new PageIndexPolicy enum (Off, Optional, Required) provides finer control: an index can be skipped entirely, read if present (with no error if it is missing), or strictly required (an error if it is missing).

What changes are included in this PR?

  • Introduced a new PageIndexPolicy enum with Off, Optional, and Required variants.
  • Replaced the boolean column_index and offset_index fields in ParquetMetaDataReader with the new PageIndexPolicy enum.
  • Updated the ParquetMetaDataReader::new() function to initialize page index policies to Off, preserving previous defaults.
  • Modified existing with_page_indexes, with_column_indexes, and with_offset_indexes methods to utilize the new PageIndexPolicy, defaulting to Required when enabling indexes.
  • Added new methods: with_page_index_policy, with_column_index_policy, and with_offset_index_policy to allow direct setting of the page index policy.
  • Adjusted the internal logic for parsing column and offset indexes to respect the specified PageIndexPolicy, including returning an error if a Required index is not found.
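The three-way behavior described in the list above can be sketched in isolation. This is a simplified stand-in, not the code from this PR: the enum here is redefined locally, and `resolve_index` is a hypothetical helper name used only for illustration. It assumes the missing-index case is the only one where Optional and Required differ.

```rust
/// Simplified stand-in for the enum described above. The description calls the
/// first variant `Off`; a later commit in this PR renames it to `Skip`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PageIndexPolicy {
    Off,
    Optional,
    Required,
}

/// Hypothetical helper (not part of the parquet crate): decide what to do with
/// an index that may or may not have been found in the file.
fn resolve_index<T>(policy: PageIndexPolicy, found: Option<T>) -> Result<Option<T>, String> {
    match (policy, found) {
        // Off: never surface the index, even if the file contains one.
        (PageIndexPolicy::Off, _) => Ok(None),
        // Optional or Required, and the index is present: return it.
        (_, Some(index)) => Ok(Some(index)),
        // Optional and missing: not an error, just no index.
        (PageIndexPolicy::Optional, None) => Ok(None),
        // Required and missing: report an error.
        (PageIndexPolicy::Required, None) => Err("required page index not found".to_string()),
    }
}
```

With this model, only the `(Required, None)` combination produces an error; every other combination yields `Ok`, with or without an index.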

Are these changes tested?

Yes, a new test file parquet/tests/page_index.rs has been added to cover the functionality of the new PageIndexPolicy and its integration with ParquetMetaDataReader.

Are there any user-facing changes?

Yes, there are user-facing changes to the ParquetMetaDataReader API. The with_column_indexes and with_offset_indexes methods now implicitly use PageIndexPolicy::Required when enabling page indexes. New methods with_page_index_policy, with_column_index_policy, and with_offset_index_policy have been added.
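To make the user-facing behavior concrete, here is a toy model of the builder methods named above. `ReaderConfig` is a hypothetical struct standing in for ParquetMetaDataReader, which carries far more state; the method names follow the description, but the bodies are an assumption about how the boolean and policy setters relate.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PageIndexPolicy {
    Off,
    Optional,
    Required,
}

/// Hypothetical stand-in for the reader's page index settings.
#[derive(Debug)]
struct ReaderConfig {
    column_index: PageIndexPolicy,
    offset_index: PageIndexPolicy,
}

impl ReaderConfig {
    /// Both policies start as Off, matching the previous defaults.
    fn new() -> Self {
        Self {
            column_index: PageIndexPolicy::Off,
            offset_index: PageIndexPolicy::Off,
        }
    }

    /// Boolean setter: `true` now means Required, `false` means Off.
    fn with_column_indexes(mut self, enable: bool) -> Self {
        self.column_index = if enable { PageIndexPolicy::Required } else { PageIndexPolicy::Off };
        self
    }

    /// Boolean setter for the offset index, with the same mapping.
    fn with_offset_indexes(mut self, enable: bool) -> Self {
        self.offset_index = if enable { PageIndexPolicy::Required } else { PageIndexPolicy::Off };
        self
    }

    /// Policy setter for the column index only.
    fn with_column_index_policy(mut self, policy: PageIndexPolicy) -> Self {
        self.column_index = policy;
        self
    }

    /// Policy setter for the offset index only.
    fn with_offset_index_policy(mut self, policy: PageIndexPolicy) -> Self {
        self.offset_index = policy;
        self
    }

    /// Policy setter that applies one policy to both indexes.
    fn with_page_index_policy(self, policy: PageIndexPolicy) -> Self {
        self.with_column_index_policy(policy)
            .with_offset_index_policy(policy)
    }
}
```

In this model, callers that already pass `true` to the boolean setters keep the strict behavior, while callers that want to tolerate files without indexes opt in through `with_page_index_policy(PageIndexPolicy::Optional)`.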

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 6, 2025
kczimm added 2 commits August 7, 2025 15:03
- Rename PageIndexPolicy::Off to PageIndexPolicy::Skip
- impl From<bool> for PageIndexPolicy for DRY
- Expose PageIndexPolicy to Arrow
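
As a rough illustration of the first two commit messages, the sketch below uses the renamed `Skip` variant and a `From<bool>` conversion. The enum is redefined locally, and `policy_for_flag` is a hypothetical helper; the mapping of `true` to Required follows the PR description, while the exact body of the conversion is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PageIndexPolicy {
    Skip,
    Optional,
    Required,
}

/// One conversion shared by every boolean setter, so the
/// "true means Required, false means Skip" rule lives in a single place.
impl From<bool> for PageIndexPolicy {
    fn from(enabled: bool) -> Self {
        if enabled {
            PageIndexPolicy::Required
        } else {
            PageIndexPolicy::Skip
        }
    }
}

/// Hypothetical helper showing how a boolean setter could delegate to the conversion.
fn policy_for_flag(enabled: bool) -> PageIndexPolicy {
    enabled.into()
}
```

Optional has no boolean equivalent in this sketch, so it can only be selected by naming the policy explicitly.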

alamb commented Aug 7, 2025

FWIW, I think this is a good idea and a nice change. Is this PR ready for review, @kczimm? (It is currently marked as a draft.)

@etseidl etseidl left a comment

I can see the desire for this, but I think some discussion is warranted to suss out what the desired behavior is for the Optional case.

Thanks for raising the issue @kczimm.

@kczimm kczimm marked this pull request as ready for review August 8, 2025 00:59

@alamb alamb left a comment

Thanks @kczimm !

I think this looks good to me, though it would be good if @etseidl also had a look before we merge this. I had a few suggestions, but nothing that I think is required before merging.

The CI isn't passing. I think if you merge up (or rebase) from main it should be clean.

Thanks again for your patience

etseidl commented Aug 13, 2025

Sorry, I've been somewhat taken over by thrift and life 😬. I'll try to take another look today, but don't hold this up for me either. My main concern was addressed.

alamb commented Aug 13, 2025

> I've been somewhat taken over by thrift

That is what we like to hear ! I love fostering the ability to obsess over some low level technical thing to make it really cool!

kczimm commented Aug 13, 2025

working on the CI issues...

@etseidl etseidl left a comment

Looks good, thanks! Just a few minor nits left.

etseidl commented Aug 15, 2025

@alamb any last words before merging this?

alamb commented Aug 15, 2025

> @alamb any last words before merging this?

Nope. DO IT!

@etseidl etseidl merged commit f87f60e into apache:main Aug 15, 2025
16 checks passed

kczimm commented Aug 15, 2025

Thank you @alamb and @etseidl for your feedback and time.

Successfully merging this pull request may close these issues.

Optionally read parquet page indexes
4 participants