feat: make parquet row groups size configurable #158

Open: wants to merge 8 commits into main

Conversation

@kevinjqliu (Collaborator)

No description provided.

@alamb (Collaborator) left a comment

Thank you @kevinjqliu -- I left some suggestions for your review

@@ -19,6 +19,7 @@
//! --part <N> Which part to generate (1-based, default: 1)
//! -n, --num-threads <N> Number of threads to use (default: number of CPUs)
//! -c, --parquet-compression <C> Parquet compression codec, e.g., SNAPPY, ZSTD(1), UNCOMPRESSED (default: SNAPPY)
//! --parquet-row-group-size <N> Number of rows per row group in Parquet files (default: 1048576)

this looks great
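
For context, a minimal sketch of how a flag like --parquet-row-group-size is typically declared, assuming the CLI uses clap for argument parsing; the struct and field names below are illustrative assumptions, not the PR's actual code.

use clap::Parser;

/// Hypothetical subset of the CLI options (names are assumptions).
#[derive(Parser, Debug)]
struct Cli {
    /// Parquet compression codec, e.g., SNAPPY, ZSTD(1), UNCOMPRESSED
    #[arg(short = 'c', long, default_value = "SNAPPY")]
    parquet_compression: String,

    /// Number of rows per row group in Parquet files
    #[arg(long, default_value_t = 1_048_576)]
    parquet_row_group_size: usize,
}

fn main() {
    let cli = Cli::parse();
    println!("rows per row group: {}", cli.parquet_row_group_size);
}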

@@ -16,6 +16,9 @@ use std::sync::Arc;
use tokio::sync::mpsc::{Receiver, Sender};
use tpchgen_arrow::RecordBatchIterator;

/// Type alias for a collection of row groups, where each row group contains [`ArrowColumnChunk`]s
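
The quoted hunk stops at the doc comment, so as a hedged sketch only, an alias of that description would typically look like the following; EncodedRowGroups is a hypothetical name, while ArrowColumnChunk is the type from the parquet crate's low-level Arrow writer API.

use parquet::arrow::arrow_writer::ArrowColumnChunk;

// Hypothetical shape: the outer Vec is the row groups, the inner Vec holds
// one encoded ArrowColumnChunk per column of that row group.
type EncodedRowGroups = Vec<Vec<ArrowColumnChunk>>;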

Thank you @kevinjqliu

I am worried that this code doesn't account for the 32k limit on the number of row groups, or for how the parts work

I tested two configurations:

Running with scale factor 1000 results in a file with 186k rows per group, even though the default says 1M

RUST_LOG=debug nice cargo run --release -- --scale-factor=1000 --format=parquet --tables=lineitem
# check row group size:
andrewlamb@Andrews-MacBook-Pro-3:~/Software/tpchgen-rs$ parquet  meta lineitem.parquet  | grep 'Row group' | head
Row group 0:  count: 183603  44.08 B records  start: 4  total(compressed): 7.718 MB total(uncompressed):12.408 MB
Row group 1:  count: 182366  44.13 B records  start: 8093084  total(compressed): 7.676 MB total(uncompressed):12.330 MB
Row group 2:  count: 183754  44.03 B records  start: 16141473  total(compressed): 7.717 MB total(uncompressed):12.411 MB
...

Running with 100 rows per group causes a panic:

RUST_LOG=debug nice cargo run --release -- --scale-factor=1000 --format=parquet --parquet-row-group-size=100  --tables=lineitem


thread 'tokio-runtime-worker' panicked at tpchgen-cli/src/parquet.rs:111:68:
called `Result::unwrap()` on an `Err` value: General("Parquet does not support more than 32767 row groups per file (currently: 32768)")

Instead of breaking up individual parts to limit row group size, what do you think about updating the code to generate parts? I tried to explain how this works in the following PR:

If we went with the parts approach, we could update this PR to be an adjustment to parts calculation
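
To make that concrete: at scale factor 1000, lineitem has roughly 6 billion rows, so 100-row groups would mean tens of millions of row groups, far past the 32,767-per-file limit. Below is a hedged sketch of a parts calculation that respects both the requested row group size and that limit; num_parts_for is a hypothetical helper for illustration, not code from this PR or the linked one.

/// Parquet's hard limit on row groups per file.
const MAX_ROW_GROUPS_PER_FILE: u64 = 32_767;

/// Hypothetical helper: given the total rows for a table and the requested
/// rows per row group, return how many parts (output files) are needed so
/// that no single file exceeds MAX_ROW_GROUPS_PER_FILE row groups.
fn num_parts_for(total_rows: u64, row_group_size: u64) -> u64 {
    // total row groups across all files, rounding up
    let total_row_groups = total_rows.div_ceil(row_group_size);
    // spread those row groups over enough files to stay under the limit
    total_row_groups.div_ceil(MAX_ROW_GROUPS_PER_FILE).max(1)
}

fn main() {
    // ~6 billion lineitem rows at SF1000 with 100 rows per group
    // needs about 1832 parts instead of panicking on a single file.
    println!("{}", num_parts_for(6_000_000_000, 100));
}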

}

#[cfg(test)]
mod tests {

Thank you for these tests. I think it is also important to do an "end to end" test -- that is, invoke the CLI with the appropriate options and then verify that the resulting Parquet file is as expected

Here is a proposed way to do this kind of test

What would you think about testing using that approach?
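
One common way to write such an end-to-end test is with the assert_cmd and tempfile crates. The sketch below is only an assumption about how it could look (the tpchgen-cli binary name and the output location are guesses), not the approach from the linked proposal.

use assert_cmd::Command;
use tempfile::tempdir;

#[test]
fn generates_parquet_end_to_end() {
    // Run in a scratch directory; assumes the CLI writes <table>.parquet
    // into the current working directory.
    let out_dir = tempdir().expect("create temp dir");

    Command::cargo_bin("tpchgen-cli")
        .expect("tpchgen-cli binary is built")
        .current_dir(out_dir.path())
        .args([
            "--scale-factor=1",
            "--format=parquet",
            "--tables=lineitem",
            "--parquet-row-group-size=100000",
        ])
        .assert()
        .success();

    // Then open out_dir.path().join("lineitem.parquet") and verify its
    // metadata, e.g. with a helper like the one sketched further below.
    assert!(out_dir.path().join("lineitem.parquet").exists());
}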

"Test setup verification failed"
);

assert_parquet_generation_succeeds(

I may have missed it, but I can't find anything in these tests that actually verifies the output row group sizes are as requested

I think the tests should also open the resulting files, read the metadata, and then verify that the row groups are all within bounds
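
A hedged sketch of such a check, using the parquet crate's SerializedFileReader to read the footer metadata; assert_row_groups_within is a hypothetical helper name.

use std::fs::File;
use std::path::Path;

use parquet::file::reader::{FileReader, SerializedFileReader};

/// Hypothetical helper: open the generated file, read its footer metadata,
/// and assert that every row group holds between 1 and max_rows rows.
fn assert_row_groups_within(path: &Path, max_rows: i64) {
    let file = File::open(path).expect("open generated parquet file");
    let reader = SerializedFileReader::new(file).expect("read parquet footer");
    let metadata = reader.metadata();

    assert!(metadata.num_row_groups() >= 1, "expected at least one row group");
    for i in 0..metadata.num_row_groups() {
        let num_rows = metadata.row_group(i).num_rows();
        assert!(
            num_rows > 0 && num_rows <= max_rows,
            "row group {i} has {num_rows} rows, expected at most {max_rows}"
        );
    }
}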
