-
Notifications
You must be signed in to change notification settings - Fork 990
[thrift-remodel] Complete decoding of FileMetaData
and RowGroupMetaData
#8111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: gh5854_thrift_remodel
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I won't pretend to really understand this entire PR, but the basics look good to me
In general, it would be really nice to consolidate the thrift decoding somewhere separate rather than sprinkling it throughout the crate
I wonder if you could implement a Trait or something
pub trait FromThrift {
/// reads an instance of `self` from the input prot
fn from_thrift(prot: &mut ThriftCompactInputProtocol) -> Result<Self>;
}
And then you could implement that trait for all the various structures that are read from the header 🤔
|
||
thrift_struct!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a pretty sweet way of generating structs BTW
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jhorstmann! I agree! 😄
@@ -2203,9 +2247,9 @@ mod tests { | |||
.build(); | |||
|
|||
#[cfg(not(feature = "encryption"))] | |||
let base_expected_size = 2312; | |||
let base_expected_size = 2280; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you know why this changes? It isn't clear to me why the metadata size gets smaller
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't tracked it down yet, but I suspect it might be some of the enums shrinking in size (they're back to rust enums rather than structs). I need to add some prints in the HeapSize
calls to see exactly what's different.
parquet/src/file/metadata/reader.rs
Outdated
pub fn decode_file_metadata(buf: &[u8]) -> Result<ParquetMetaData> { | ||
let mut prot = ThriftCompactInputProtocol::new(buf); | ||
|
||
// components of the FileMetaData |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this would be better / easier to find if it were a method on ParquetMetaData ?
impl ParquetMetaData {
fn from_thrift_bytes(&[u8]) -> Result<Self> {
...
}
}
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added this for benchmarking purposes...I think eventually it would simply replace the existing decode_metadata
.
parquet/src/file/metadata/reader.rs
Outdated
let prot = &mut prot; | ||
|
||
match field_ident.id { | ||
1 => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some way to use one of the magic thrift_struct!
macro here rather than hard coding the field ids? It isn't clear to me why this code is different
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been punting on that for now...I have tried to simplify where I can (such as hiding the complexity of reading lists). The issue here is that the thrift FileMetaData
contains the row group metadata, while in ParquetMetaData
the crate FileMetaData
has the schema and the row group meta is held separately. Similarly, thrift has ColumnChunk
that contains ColumnMetaData
while we collapse those two structures into a single ColumnChunkMetaData
. I can go back to decoding to a private FileMetaData
that is then pulled apart (as I've wound up doing for RowGroupMetaData
), but was trying to skip that step thinking it would be faster. (For instance...the processing of the schema is quite expensive, so rather than allocating a vector of schema elements, parsing them, and then translating to TypePtr
I here pull the schema elements one at a time. That did cut down on the processing time, but by enough to justify the complexity? I'll have to revisit that).
Back to the original question...hand coding is going to have some warts that can't be avoided. There may be a way to pretty it up some where we need custom parsers. Suggestions welcome :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back to the original question...hand coding is going to have some warts that can't be avoided. There may be a way to pretty it up some where we need custom parsers. Suggestions welcome :D
I think the key thing in my mind is that if/when something new is added to the spec, there is some straightforward pattern to follow to add support for the new parser
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now moot...I wound up rolling back to using a temp FileMetaData
as well. Since there are no string allocations done, the cost of decoding to the temp structures and then pretty much just taking ownership of their members during conversion winds up being acceptably fast.
parquet/src/file/metadata/reader.rs
Outdated
@@ -1102,10 +1218,388 @@ fn get_file_decryptor( | |||
} | |||
} | |||
|
|||
// temp structs used to construct RowGroupMetaData |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some way to avoid reading into a temp structure and directly construct a RowGroupMetaData?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above about FileMetaData
😄
I actually went down that road and it was slower. :( That may have been my incompetence, or it may be that the code got too bloated and the optimizer gave up.
parquet/src/file/metadata/reader.rs
Outdated
} | ||
); | ||
|
||
#[cfg(feature = "encryption")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I suggest moving the code that does the thrift decoding into a different module from the ParquetMetaDataReader (which is basically a state machine for reading parts of the file into local buffers and decoding it)
Maybe parquet/src/file/metadata/from_thrift.rs 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I want to consolidate at some point...I'm just not sure where it should live. There's basic
already for the low level enums and unions, with structs scattered about where they're used. Pulling them all into a single module (or maybe two) is a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now (largely) done. There are some exceptions that I've left as TODOs.
to our own metadata structs. This reduces the amount of non-macro generated code.
Yes, that's the path @jhorstmann took originally, and I'm now seeing the wisdom of it. I'll try that in the next iteration. |
Which issue does this PR close?
Note: this targets a feature branch, not main
Rationale for this change
What changes are included in this PR?
This PR completes reading of the
FileMetaData
andRowGroupMetaData
pieces of theParquetMetaData
. Column indexes and encryption will be follow-on work.This replaces the macro for generating structs with a more general one that can take visibility and lifetime specifiers. Also (temporarily) adds a new function
ParquetMetaDataReader::decode_file_metadata
which should be a drop-in replacement forParquetMetaDataReader::decode_metadata
.Still todo:
ParquetMetaDataReader::decode_metadata
Are these changes tested?
Not yet
Are there any user-facing changes?
Yes