Skip to content

[thrift-remodel] Complete decoding of FileMetaData and RowGroupMetaData #8111

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 21 commits into
base: gh5854_thrift_remodel
Choose a base branch
from

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Aug 11, 2025

Which issue does this PR close?

Note: this targets a feature branch, not main

Rationale for this change

What changes are included in this PR?

This PR completes reading of the FileMetaData and RowGroupMetaData pieces of the ParquetMetaData. Column indexes and encryption will be follow-on work.

This replaces the macro for generating structs with a more general one that can take visibility and lifetime specifiers. Also (temporarily) adds a new function ParquetMetaDataReader::decode_file_metadata which should be a drop-in replacement for ParquetMetaDataReader::decode_metadata.

Still todo:

  1. Add some tests that verify this produces the same output as ParquetMetaDataReader::decode_metadata
  2. Read column indexes with new decoder
  3. Read page headers with new decoder
  4. Integrate with @alamb's push decoder work [Parquet] Add ParquetMetadataPushDecoder #8080

Are these changes tested?

Not yet

Are there any user-facing changes?

Yes

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 11, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't pretend to really understand this entire PR, but the basics look good to me

In general, it would be really nice to consolidate the thrift decoding somewhere separate rather than sprinkling it throughout the crate

I wonder if you could implement a Trait or something

pub trait FromThrift {
  /// reads an instance of `self` from the input prot
  fn from_thrift(prot: &mut ThriftCompactInputProtocol) -> Result<Self>;
}

And then you could implement that trait for all the various structures that are read from the header 🤔


thrift_struct!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a pretty sweet way of generating structs BTW

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jhorstmann! I agree! 😄

@@ -2203,9 +2247,9 @@ mod tests {
.build();

#[cfg(not(feature = "encryption"))]
let base_expected_size = 2312;
let base_expected_size = 2280;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know why this changes? It isn't clear to me why the metadata size gets smaller

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tracked it down yet, but I suspect it might be some of the enums shrinking in size (they're back to rust enums rather than structs). I need to add some prints in the HeapSize calls to see exactly what's different.

pub fn decode_file_metadata(buf: &[u8]) -> Result<ParquetMetaData> {
let mut prot = ThriftCompactInputProtocol::new(buf);

// components of the FileMetaData
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this would be better / easier to find if it were a method on ParquetMetaData ?

impl ParquetMetaData {
  fn from_thrift_bytes(&[u8]) -> Result<Self> {
...
  }
}

🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added this for benchmarking purposes...I think eventually it would simply replace the existing decode_metadata.

let prot = &mut prot;

match field_ident.id {
1 => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some way to use one of the magic thrift_struct! macro here rather than hard coding the field ids? It isn't clear to me why this code is different

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been punting on that for now...I have tried to simplify where I can (such as hiding the complexity of reading lists). The issue here is that the thrift FileMetaData contains the row group metadata, while in ParquetMetaData the crate FileMetaData has the schema and the row group meta is held separately. Similarly, thrift has ColumnChunk that contains ColumnMetaData while we collapse those two structures into a single ColumnChunkMetaData. I can go back to decoding to a private FileMetaData that is then pulled apart (as I've wound up doing for RowGroupMetaData), but was trying to skip that step thinking it would be faster. (For instance...the processing of the schema is quite expensive, so rather than allocating a vector of schema elements, parsing them, and then translating to TypePtr I here pull the schema elements one at a time. That did cut down on the processing time, but by enough to justify the complexity? I'll have to revisit that).

Back to the original question...hand coding is going to have some warts that can't be avoided. There may be a way to pretty it up some where we need custom parsers. Suggestions welcome :D

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back to the original question...hand coding is going to have some warts that can't be avoided. There may be a way to pretty it up some where we need custom parsers. Suggestions welcome :D

I think the key thing in my mind is that if/when something new is added to the spec, there is some straightforward pattern to follow to add support for the new parser

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now moot...I wound up rolling back to using a temp FileMetaData as well. Since there are no string allocations done, the cost of decoding to the temp structures and then pretty much just taking ownership of their members during conversion winds up being acceptably fast.

@@ -1102,10 +1218,388 @@ fn get_file_decryptor(
}
}

// temp structs used to construct RowGroupMetaData
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some way to avoid reading into a temp structure and directly construct a RowGroupMetaData?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above about FileMetaData 😄

I actually went down that road and it was slower. :( That may have been my incompetence, or it may be that the code got too bloated and the optimizer gave up.

}
);

#[cfg(feature = "encryption")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I suggest moving the code that does the thrift decoding into a different module from the ParquetMetaDataReader (which is basically a state machine for reading parts of the file into local buffers and decoding it)

Maybe parquet/src/file/metadata/from_thrift.rs 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I want to consolidate at some point...I'm just not sure where it should live. There's basic already for the low level enums and unions, with structs scattered about where they're used. Pulling them all into a single module (or maybe two) is a good idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now (largely) done. There are some exceptions that I've left as TODOs.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 14, 2025
@github-actions github-actions bot removed the arrow Changes to the arrow crate label Aug 14, 2025
@etseidl
Copy link
Contributor Author

etseidl commented Aug 14, 2025

I wonder if you could implement a Trait or something

Yes, that's the path @jhorstmann took originally, and I'm now seeing the wisdom of it. I'll try that in the next iteration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants