feat: make parquet row groups size configurable #158
Conversation
Thank you @kevinjqliu -- I left some suggestions for your review
@@ -19,6 +19,7 @@
//! --part <N> Which part to generate (1-based, default: 1)
//! -n, --num-threads <N> Number of threads to use (default: number of CPUs)
//! -c, --parquet-compression <C> Parquet compression codec, e.g., SNAPPY, ZSTD(1), UNCOMPRESSED (default: SNAPPY)
//! --parquet-row-group-size <N> Number of rows per row group in Parquet files (default: 1048576)
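For context on what such a flag typically controls: in the parquet crate, row group size is a cap set via `WriterProperties`, and the writer closes a row group once that many rows are buffered. A minimal sketch of that wiring (the function and parameter names are mine, not the PR's):

```rust
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

/// Illustrative helper (not the PR's code): map the CLI options onto
/// parquet writer properties. `row_group_size` corresponds to
/// `--parquet-row-group-size`, `compression` to `--parquet-compression`.
fn writer_properties(row_group_size: usize, compression: Compression) -> WriterProperties {
    WriterProperties::builder()
        // Upper bound on rows per row group: the writer closes a group
        // once this many rows are buffered, so actual groups can be smaller.
        .set_max_row_group_size(row_group_size)
        .set_compression(compression)
        .build()
}
```

Note this setting is an upper bound, not an exact size; one hedged reading of the smaller groups observed below is that the generator flushes each buffered chunk as its own row group before the cap is reached.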
this looks great
@@ -16,6 +16,9 @@ use std::sync::Arc;
use tokio::sync::mpsc::{Receiver, Sender};
use tpchgen_arrow::RecordBatchIterator;

/// Type alias for a collection of row groups, where each row group contains [`ArrowColumnChunk`]s
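The alias itself is cut off in the quoted diff; based on that doc comment it is presumably something like the following (a guess, using the parquet crate's `ArrowColumnChunk` type; the alias name is mine):

```rust
use parquet::arrow::arrow_writer::ArrowColumnChunk;

// Presumed shape of the truncated alias: the outer Vec is the row groups,
// the inner Vec holds one ArrowColumnChunk per column of the row group.
type RowGroups = Vec<Vec<ArrowColumnChunk>>;
```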
Thank you @kevinjqliu
I am worried that this code doesn't account for the 32k limit on the number of row groups per file, nor for how the parts work.
I tested two configurations.
Running with scale factor 1000 results in a file with ~183k rows per group, even though the default is 1M:
RUST_LOG=debug nice cargo run --release -- --scale-factor=1000 --format=parquet --tables=lineitem
# check row group size:
andrewlamb@Andrews-MacBook-Pro-3:~/Software/tpchgen-rs$ parquet meta lineitem.parquet | grep 'Row group' | head
Row group 0: count: 183603 44.08 B records start: 4 total(compressed): 7.718 MB total(uncompressed):12.408 MB
Row group 1: count: 182366 44.13 B records start: 8093084 total(compressed): 7.676 MB total(uncompressed):12.330 MB
Row group 2: count: 183754 44.03 B records start: 16141473 total(compressed): 7.717 MB total(uncompressed):12.411 MB
...
Running with 100 rows per group causes a panic:
RUST_LOG=debug nice cargo run --release -- --scale-factor=1000 --format=parquet --parquet-row-group-size=100 --tables=lineitem
thread 'tokio-runtime-worker' panicked at tpchgen-cli/src/parquet.rs:111:68:
called `Result::unwrap()` on an `Err` value: General("Parquet does not support more than 32767 row groups per file (currently: 32768)")
Instead of breaking up individual parts to limit row group size, what do you think about updating the code to generate more parts? I tried to explain how this works in the following PR:
If we went with the parts approach, we could update this PR to be an adjustment to the parts calculation, along the lines of the sketch below.
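To make the suggestion concrete, here is my reading of the parts approach (a hypothetical helper, not code from either PR): choose the number of parts so that no single file can exceed Parquet's 32,767-row-group limit at the requested row group size.

```rust
/// Parquet's hard limit on row groups per file.
const MAX_ROW_GROUPS_PER_FILE: u64 = 32_767;

/// Minimum number of parts (output files) so that no file exceeds the
/// row group limit. `total_rows` is the table's row count at the chosen
/// scale factor; `row_group_size` is `--parquet-row-group-size`.
/// (Hypothetical helper, not from the PR.)
fn min_parts(total_rows: u64, row_group_size: u64) -> u64 {
    let max_rows_per_file = MAX_ROW_GROUPS_PER_FILE * row_group_size;
    // Ceiling division: any remainder needs one more part.
    total_rows.div_ceil(max_rows_per_file).max(1)
}

fn main() {
    // SF1000 lineitem is roughly 6 billion rows. At the 1M-row default a
    // single file fits; at 100 rows per group the single-file run above
    // would need ~60M row groups, hence the panic.
    assert_eq!(min_parts(6_000_000_000, 1_048_576), 1);
    assert_eq!(min_parts(6_000_000_000, 100), 1832);
}
```

The row-group-size flag would then feed into the parts calculation rather than splitting row groups inside a single part.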
}

#[cfg(test)]
mod tests {
Thank you for these tests. I think it is also important to do an "end to end" test -- that is, invoke the CLI with appropriate options and then verify that the resulting Parquet file is as expected.
Here is a proposed way to do this kind of test. What would you think about testing using that approach? A rough sketch follows.
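For illustration, a sketch of such a test using the assert_cmd and tempfile crates. The binary name, the flag spellings, and the assumption that output lands in the working directory are inferred from the logs above; everything else (including the fractional scale factor) is an assumption:

```rust
use assert_cmd::Command;

#[test]
fn cli_respects_parquet_row_group_size() {
    let dir = tempfile::tempdir().unwrap();

    // Run the real CLI end to end with a small scale factor
    // (assuming fractional scale factors are supported).
    Command::cargo_bin("tpchgen-cli")
        .unwrap()
        .current_dir(dir.path())
        .args([
            "--scale-factor=0.01",
            "--format=parquet",
            "--tables=lineitem",
            "--parquet-row-group-size=1000",
        ])
        .assert()
        .success();

    // The logs above suggest the output file lands in the working directory.
    assert!(dir.path().join("lineitem.parquet").exists());
}
```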
"Test setup verification failed" | ||
); | ||
|
||
assert_parquet_generation_succeeds( |
I may have missed it, but I can't find anywhere in these tests that actually verifies the output row group sizes are as requested.
I think the tests should also open the resulting files, read the metadata, and then verify that the row groups are all within bounds -- for example, with a helper like the sketch below.
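Something along these lines, using the parquet crate's metadata API (the helper name is mine):

```rust
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::{fs::File, path::Path};

/// Assert every row group in the file at `path` holds at most `max_rows`
/// rows. (Hypothetical helper; pairs with the end-to-end test sketched
/// earlier.)
fn assert_row_groups_within(path: &Path, max_rows: i64) {
    let reader = SerializedFileReader::new(File::open(path).unwrap()).unwrap();
    for rg in reader.metadata().row_groups() {
        assert!(
            rg.num_rows() <= max_rows,
            "row group has {} rows, expected at most {}",
            rg.num_rows(),
            max_rows
        );
    }
}
```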