Incremental FileWriter with explicit row‑group control for efficient S3 range reads#32
Open
shayonj wants to merge 2 commits into njaremko:main from
This is a follow-up to #31.
This adds a new `Parquet::FileWriter` API that gives deterministic control over Parquet row‑group boundaries. You can now set a target size and explicitly flush row groups, which enables predictable, small row groups that work well with S3 Range GETs and statistics‑based pruning.

Problem
`Parquet.write_rows` exposes batching and a memory-based flush threshold, but it does not explicitly seal a Parquet row group per batch when a user wants to stream data to a Parquet file and segment/shard the incoming data.

Solution
- `row_group_target_bytes`: automatically flushes the underlying writer when the buffered data reaches the target, sealing a row group.
- `flush_row_group`: explicitly seals a row group at application-defined boundaries (e.g., per shard, time window, or batch).

Usage
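The two flush modes can be illustrated with a small, self-contained sketch. This is a toy model of the semantics described above, not the gem's actual `Parquet::FileWriter` implementation; the constructor keyword and method names mirror the PR's described options (`row_group_target_bytes`, `flush_row_group`), but the class body here is purely illustrative.

```ruby
# Toy model of the proposed row-group semantics. A real writer would
# serialize rows to Parquet; here a "row group" is just a sealed batch.
class ToyFileWriter
  attr_reader :row_groups

  def initialize(row_group_target_bytes:)
    @target = row_group_target_bytes
    @buffer = []
    @buffered_bytes = 0
    @row_groups = []
  end

  # Buffers a row; seals a row group once the size target is reached.
  def write_row(row, approx_bytes)
    @buffer << row
    @buffered_bytes += approx_bytes
    flush_row_group if @buffered_bytes >= @target
  end

  # Explicitly seals whatever is currently buffered as one row group.
  def flush_row_group
    return if @buffer.empty?

    @row_groups << @buffer
    @buffer = []
    @buffered_bytes = 0
  end

  def close
    flush_row_group
  end
end

writer = ToyFileWriter.new(row_group_target_bytes: 100)
3.times { |i| writer.write_row(["row#{i}"], 40) } # third row crosses 100 bytes: group 1 sealed
writer.write_row(["tail"], 10)
writer.flush_row_group                            # explicit seal: group 2 holds one row
writer.close
```

The key property for S3 range reads: each sealed group is an independent, contiguous unit, so a reader can fetch exactly the groups it needs via Range GETs and skip the rest using per-group statistics.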
Benefits
Implementation notes
- `IncrementalWriter` with a thread‑local registry, per‑row memory accounting, and sealing of row groups via the core writer's flush.
- New native entry points: `_fw_create`, `_fw_write_rows`, `_fw_flush_row_group`, `_fw_close`.
- `Parquet.write_rows` is unchanged; the new API is opt‑in.
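The thread‑local registry pattern mentioned above can be sketched in pure Ruby. This is a stand-in for illustration only: the PR's real registry lives behind the native `_fw_*` entry points, and the `FwRegistry` module, its method names, and the symbolic handles here are hypothetical.

```ruby
# Illustrative thread-local registry: handles map to live writer objects,
# and each thread sees only the writers it created.
module FwRegistry
  # Per-thread handle table, created lazily on first use.
  def self.registry
    Thread.current[:fw_registry] ||= {}
  end

  # Registers a writer and returns an opaque integer handle
  # (a per-thread counter, so no cross-thread locking is needed).
  def self.create(writer)
    handle = (Thread.current[:fw_next_handle] = (Thread.current[:fw_next_handle] || 0) + 1)
    registry[handle] = writer
    handle
  end

  # Looks up a writer; raises on a stale or foreign handle.
  def self.fetch(handle)
    registry.fetch(handle) { raise ArgumentError, "unknown writer handle #{handle}" }
  end

  # Drops the handle so the writer can be finalized.
  def self.close(handle)
    registry.delete(handle)
  end
end
```

Keeping the registry thread-local avoids sharing native writer state across threads, which sidesteps synchronization on the hot write path; the trade-off is that a handle is only valid on the thread that created it.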