
Incremental FileWriter with explicit row‑group control for efficient S3 range reads#32

Open
shayonj wants to merge 2 commits into njaremko:main from shayonj:s/writer-incremental

Conversation


shayonj commented Oct 2, 2025

This is a follow-up to #31.

This adds a new Parquet::FileWriter API that gives deterministic control over Parquet row‑group boundaries. You can now set a target size and explicitly flush row groups, which enables predictable, small row groups that work well with S3 Range GETs and statistics‑based pruning.

Problem

  • Today’s Parquet.write_rows exposes batching and a memory-based flush threshold, but it does not explicitly seal a Parquet row group per batch when a user wants to stream data to a Parquet file and segment/shard incoming data.
  • Without explicit row‑group boundaries, batches can be merged until the final writer flush/close, producing large, unpredictable row groups that defeat efficient range-based reads.
  • For S3 range reads with metadata pruning, large row groups force larger downloads and higher latency: tens of MB fetched to read a few rows, even with column projection.

Solution

  • New incremental writer with explicit row‑group control:
    • row_group_target_bytes: automatically flushes the underlying writer when the buffered data reaches the target, sealing a row group.
    • flush_row_group: explicitly seals a row group at application-defined boundaries (e.g., per shard, time window, or batch).

Usage

writer =
  Parquet::FileWriter.new(
    schema: schema,
    write_to: path,
    compression: 'snappy',
    row_group_target_bytes: 4 * 1024 * 1024, # ~4MB
  )

writer.write_rows(chunk1)
writer.flush_row_group # optional explicit boundary
writer.write_rows(chunk2)
writer.close # writes footer
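To make the flush-per-boundary pattern concrete without depending on this PR's build, here is a minimal stand-in sketch: StubWriter is a hypothetical in-memory class that mimics the write_rows/flush_row_group/close shape shown above, and the per-hour grouping is an illustrative application-defined boundary, not part of the library.

```ruby
# Minimal stand-in for the writer, used only to illustrate the
# flush-per-boundary pattern; it is NOT the real Parquet::FileWriter.
class StubWriter
  attr_reader :row_groups

  def initialize
    @row_groups = []
    @buffer = []
  end

  def write_rows(rows)
    @buffer.concat(rows)
    self
  end

  def flush_row_group
    @row_groups << @buffer unless @buffer.empty?
    @buffer = []
    self
  end

  def close
    flush_row_group
  end
end

# Seal one row group per hour so each hour of data is independently fetchable.
rows = [
  { ts: Time.utc(2025, 10, 2, 10, 0),  v: 1 },
  { ts: Time.utc(2025, 10, 2, 10, 30), v: 2 },
  { ts: Time.utc(2025, 10, 2, 11, 5),  v: 3 },
]

writer = StubWriter.new
rows.chunk { |r| r[:ts].hour }.each do |_hour, chunk|
  writer.write_rows(chunk).flush_row_group
end
writer.close
# two hours of data -> two sealed row groups
```

With the real writer the loop body would be identical; only the class backing `writer` changes.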

Benefits

  • Efficient S3 range reads: footer + the exact row‑group span, instead of fetching most of the file.
  • Deterministic boundaries that align with your application’s logical partitions.
  • Smaller, predictable row groups improve throughput for point lookups and catalog-driven queries.
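As a sketch of the range-read benefit: once a reader has the footer, each row group's file offset and compressed size are known, so fetching one group is a single byte-range request. The helper and the example offset/size below are illustrative assumptions, not part of this PR.

```ruby
# Sketch: build the HTTP Range header for one row group, assuming the
# reader already parsed the footer and knows the group's file offset and
# total compressed byte size (hypothetical values below).
def range_header_for(offset, total_byte_size)
  # HTTP byte ranges are inclusive on both ends.
  "bytes=#{offset}-#{offset + total_byte_size - 1}"
end

# e.g. a ~4 MB row group starting just past the 4-byte "PAR1" magic
header = range_header_for(4, 4 * 1024 * 1024)
# header == "bytes=4-4194307"
```

This is the request an S3 client would issue via GetObject's Range parameter instead of downloading the whole object.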

Implementation notes

  • Adapter: IncrementalWriter with a thread‑local registry, per‑row memory accounting, and sealing row groups via core writer flush.
  • FFI: _fw_create, _fw_write_rows, _fw_flush_row_group, _fw_close.
  • Ruby: Parquet::FileWriter wrapper with a small, chainable API.
  • Backwards compatible: Parquet.write_rows is unchanged; this is opt‑in.
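A rough sketch of the per-row accounting idea mentioned above: estimate each buffered row's size and seal a group once the running total crosses the target. The estimator, class name, and threshold logic here are illustrative assumptions, not the adapter's actual implementation.

```ruby
# Illustrative target-bytes accounting: NOT the real IncrementalWriter.
class TargetBytesAccounting
  attr_reader :flushes

  def initialize(target_bytes)
    @target_bytes = target_bytes
    @buffered_bytes = 0
    @flushes = 0
  end

  # Crude per-row estimate: string byte sizes plus a fixed 8-byte
  # overhead for every non-string field.
  def estimate(row)
    row.values.sum { |v| v.is_a?(String) ? v.bytesize : 8 }
  end

  def account(row)
    @buffered_bytes += estimate(row)
    return unless @buffered_bytes >= @target_bytes

    @flushes += 1        # stands in for sealing a row group
    @buffered_bytes = 0
  end
end

acct = TargetBytesAccounting.new(100)
30.times { acct.account({ id: 1, name: "abcd" }) } # ~12 bytes per row
```

Because each row estimates to 12 bytes, the 100-byte target seals a group every 9 rows, so 30 rows yield three flushes with a partial buffer left for close to seal.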
