feat: make parquet row groups size configurable #158
Conversation
Thank you @kevinjqliu -- I left some suggestions for your review
@@ -19,6 +19,7 @@
//! --part <N> Which part to generate (1-based, default: 1)
//! -n, --num-threads <N> Number of threads to use (default: number of CPUs)
//! -c, --parquet-compression <C> Parquet compression codec, e.g., SNAPPY, ZSTD(1), UNCOMPRESSED (default: SNAPPY)
//! --parquet-row-group-size <N> Number of rows per row group in Parquet files (default: 1048576)
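For context on what such a flag typically controls: in the parquet crate, row group size is a cap set via `WriterProperties`, and the writer closes a row group once that many rows are buffered. A minimal sketch of that wiring (the function and parameter names are mine, not the PR's):

```rust
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

/// Illustrative helper (not the PR's code): map the CLI options onto
/// parquet writer properties. `row_group_size` corresponds to
/// `--parquet-row-group-size`, `compression` to `--parquet-compression`.
fn writer_properties(row_group_size: usize, compression: Compression) -> WriterProperties {
    WriterProperties::builder()
        // Upper bound on rows per row group: the writer closes a group
        // once this many rows are buffered, so actual groups can be smaller.
        .set_max_row_group_size(row_group_size)
        .set_compression(compression)
        .build()
}
```

Note this setting is an upper bound, not an exact size; one hedged reading of the smaller groups observed below is that the generator flushes each buffered chunk as its own row group before the cap is reached.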
this looks great
@@ -16,6 +16,9 @@ use std::sync::Arc;
use tokio::sync::mpsc::{Receiver, Sender};
use tpchgen_arrow::RecordBatchIterator;

/// Type alias for a collection of row groups, where each row group contains [`ArrowColumnChunk`]s
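The alias itself is cut off in the quoted diff; based on that doc comment it is presumably something like the following (a guess, using the parquet crate's `ArrowColumnChunk` type; the alias name is mine):

```rust
use parquet::arrow::arrow_writer::ArrowColumnChunk;

// Presumed shape of the truncated alias: the outer Vec is the row groups,
// the inner Vec holds one ArrowColumnChunk per column of the row group.
type RowGroups = Vec<Vec<ArrowColumnChunk>>;
```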
Thank you @kevinjqliu
I am worried that this code doesn't account for the 32k limit on the number of row groups per file, nor for how the parts work.
I tested two configurations.
Running with scale factor 1000 results in a file with ~183k rows per group, even though the default is 1M:
RUST_LOG=debug nice cargo run --release -- --scale-factor=1000 --format=parquet --tables=lineitem
# check row group size:
andrewlamb@Andrews-MacBook-Pro-3:~/Software/tpchgen-rs$ parquet meta lineitem.parquet | grep 'Row group' | head
Row group 0: count: 183603 44.08 B records start: 4 total(compressed): 7.718 MB total(uncompressed):12.408 MB
Row group 1: count: 182366 44.13 B records start: 8093084 total(compressed): 7.676 MB total(uncompressed):12.330 MB
Row group 2: count: 183754 44.03 B records start: 16141473 total(compressed): 7.717 MB total(uncompressed):12.411 MB
...
Running with 100 rows per group causes a panic:
RUST_LOG=debug nice cargo run --release -- --scale-factor=1000 --format=parquet --parquet-row-group-size=100 --tables=lineitem
thread 'tokio-runtime-worker' panicked at tpchgen-cli/src/parquet.rs:111:68:
called `Result::unwrap()` on an `Err` value: General("Parquet does not support more than 32767 row groups per file (currently: 32768)")
Instead of breaking up individual parts to limit row group size, what do you think about updating the code to generate more parts? I tried to explain how this works in the following PR:
If we went with the parts approach, we could update this PR to be an adjustment to the parts calculation, along the lines of the sketch below.
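To make the suggestion concrete, here is my reading of the parts approach (a hypothetical helper, not code from either PR): choose the number of parts so that no single file can exceed Parquet's 32,767-row-group limit at the requested row group size.

```rust
/// Parquet's hard limit on row groups per file.
const MAX_ROW_GROUPS_PER_FILE: u64 = 32_767;

/// Minimum number of parts (output files) so that no file exceeds the
/// row group limit. `total_rows` is the table's row count at the chosen
/// scale factor; `row_group_size` is `--parquet-row-group-size`.
/// (Hypothetical helper, not from the PR.)
fn min_parts(total_rows: u64, row_group_size: u64) -> u64 {
    let max_rows_per_file = MAX_ROW_GROUPS_PER_FILE * row_group_size;
    // Ceiling division: any remainder needs one more part.
    total_rows.div_ceil(max_rows_per_file).max(1)
}

fn main() {
    // SF1000 lineitem is roughly 6 billion rows. At the 1M-row default a
    // single file fits; at 100 rows per group the single-file run above
    // would need ~60M row groups, hence the panic.
    assert_eq!(min_parts(6_000_000_000, 1_048_576), 1);
    assert_eq!(min_parts(6_000_000_000, 100), 1832);
}
```

The row-group-size flag would then feed into the parts calculation rather than splitting row groups inside a single part.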
}

#[cfg(test)]
mod tests {
Thank you for these tests. I think it is also important to do an "end to end" test -- that is, invoke the CLI with appropriate options and then verify that the resulting Parquet file is as expected.
Here is a proposed way to do this kind of test. What would you think about testing using that approach? A rough sketch follows.
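For illustration, a sketch of such a test using the assert_cmd and tempfile crates. The binary name, the flag spellings, and the assumption that output lands in the working directory are inferred from the logs above; everything else (including the fractional scale factor) is an assumption:

```rust
use assert_cmd::Command;

#[test]
fn cli_respects_parquet_row_group_size() {
    let dir = tempfile::tempdir().unwrap();

    // Run the real CLI end to end with a small scale factor
    // (assuming fractional scale factors are supported).
    Command::cargo_bin("tpchgen-cli")
        .unwrap()
        .current_dir(dir.path())
        .args([
            "--scale-factor=0.01",
            "--format=parquet",
            "--tables=lineitem",
            "--parquet-row-group-size=1000",
        ])
        .assert()
        .success();

    // The logs above suggest the output file lands in the working directory.
    assert!(dir.path().join("lineitem.parquet").exists());
}
```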
"Test setup verification failed" | ||
); | ||
|
||
assert_parquet_generation_succeeds( |
I may have missed it, but I can't find anywhere in these tests that actually verifies the output row group sizes are as requested.
I think the tests should also open the resulting files, read the metadata, and then verify that the row groups are all within bounds -- for example, with a helper like the sketch below.
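Something along these lines, using the parquet crate's metadata API (the helper name is mine):

```rust
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::{fs::File, path::Path};

/// Assert every row group in the file at `path` holds at most `max_rows`
/// rows. (Hypothetical helper; pairs with the end-to-end test sketched
/// earlier.)
fn assert_row_groups_within(path: &Path, max_rows: i64) {
    let reader = SerializedFileReader::new(File::open(path).unwrap()).unwrap();
    for rg in reader.metadata().row_groups() {
        assert!(
            rg.num_rows() <= max_rows,
            "row group has {} rows, expected at most {}",
            rg.num_rows(),
            max_rows
        );
    }
}
```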