Remote source (S3 Range GET) support with row‑group pruning #31
Open
shayonj wants to merge 1 commit into njaremko:main from
This adds first‑class support for reading Parquet from remote, random‑access sources (for example, AWS S3 or any HTTP Range endpoint) without downloading entire files, and it introduces explicit row‑group selection and pruning to minimize bytes fetched. The Ruby API remains familiar; internally we add a `RemoteSource` path and extend the core reader to accept row‑group filters alongside column projection, so callers can target only the data they need.
Understanding of the problem
Solution
- An object that responds to `byte_length -> Integer` and `read_range(offset, length) -> binary String` can now be passed anywhere a path or IO was accepted. This enables true random‑access reads and makes it possible to fetch only the row groups you need.
- `RemoteSource` validates the Ruby object; `ThreadSafeRemoteSource` synchronizes concurrent calls; `RemoteRangeReader` implements `Read + Seek` over repeated `read_range` calls with exact‑length guarantees; and `CloneableChunkReader` gains a `Remote` variant with a `from_remote` factory.
- `row_groups: [Integer, ...]` is now supported on `Parquet.each_row` and `Parquet.each_column`, and the core reader provides `read_rows_with_selection` / `read_columns_with_selection` to combine row‑group filters with column projection.
- Footer statistics (`min_bytes`, `max_bytes`, `null_count`) are returned to Ruby so callers can prune candidates before any large reads.
- `lib/parquet.rb` coerces `columns` to `String[]` and `row_groups` to `Integer[]` while preserving enumerator behavior; all new features are opt‑in and backwards compatible.
Example: AWS S3 remote source
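A minimal sketch of an S3-backed source that satisfies the two-method interface described above (`byte_length` and `read_range`). The class name `S3RangeSource` is hypothetical, not part of this PR. It assumes a client that behaves like `Aws::S3::Client` from the `aws-sdk-s3` gem (`head_object`, and `get_object` with a `range:` option), and it mirrors the exact-length expectation by raising on a short read.

```ruby
# Hypothetical adapter (not part of this PR): wraps any client that behaves
# like Aws::S3::Client in the byte_length / read_range interface.
class S3RangeSource
  def initialize(client:, bucket:, key:)
    @client = client
    @bucket = bucket
    @key = key
  end

  # Total object size in bytes, fetched once via HEAD and memoized.
  def byte_length
    @byte_length ||= @client.head_object(bucket: @bucket, key: @key).content_length
  end

  # Returns exactly `length` bytes starting at `offset` as a binary String.
  def read_range(offset, length)
    return "".b if length.zero?

    last = offset + length - 1 # HTTP Range end positions are inclusive
    resp = @client.get_object(bucket: @bucket, key: @key, range: "bytes=#{offset}-#{last}")
    data = resp.body.read.b
    unless data.bytesize == length
      raise IOError, "short read: wanted #{length} bytes, got #{data.bytesize}"
    end
    data
  end
end

# Usage (bucket and key are placeholders; needs the aws-sdk-s3 and parquet gems):
#   require "aws-sdk-s3"
#   source = S3RangeSource.new(client: Aws::S3::Client.new,
#                              bucket: "my-bucket", key: "data/events.parquet")
#   Parquet.each_row(source, columns: ["id"]) { |row| p row }
```

Taking the client as a constructor argument keeps the adapter easy to exercise with a stub, and the same shape should work for any HTTP Range endpoint by swapping the two client calls.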
Case A: you already know the row group ordinal (from a catalog)
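A sketch of how Case A might look, assuming a catalog that maps a partition key to a row-group ordinal. The `CATALOG` hash, its keys, and the helper `each_row_for_key` are hypothetical; only the `columns:` and `row_groups:` keywords on `Parquet.each_row` come from this PR.

```ruby
# Hypothetical catalog: partition key => ordinal of the row group that holds it.
CATALOG = { "2024-01" => 0, "2024-02" => 1, "2024-03" => 2 }.freeze

# Reads only the row group the catalog points at. `reader` defaults to the
# Parquet module; it is a parameter so the helper can be exercised with a stub.
def each_row_for_key(source, key, columns:, reader: Parquet, &block)
  ordinal = CATALOG.fetch(key) # raises KeyError for an unknown key
  reader.each_row(source, columns: columns, row_groups: [ordinal], &block)
end

# Usage (source is any path, IO, or remote source object):
#   each_row_for_key(source, "2024-03", columns: ["id", "name"]) { |row| p row }
```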
Case B: you don’t have a catalog; pick candidates via footer stats (id example)
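A sketch of the pruning step for Case B. How the statistics are exposed to Ruby is not shown in this description, so the input shape here (one hash per row group with `:min_bytes` and `:max_bytes` for the `id` column) is an assumption, as is the helper name `candidate_row_groups`. It also assumes `id` is an INT64 column, whose Parquet statistics are stored as 8-byte little-endian values.

```ruby
# stats: one entry per row group for the `id` column, in row-group order.
# Assumed shape: { min_bytes: String or nil, max_bytes: String or nil }.
# Returns the ordinals of row groups that may contain target_id.
def candidate_row_groups(stats, target_id)
  stats.each_index.select do |i|
    min = stats[i][:min_bytes]&.unpack1("q<") # signed 64-bit, little-endian
    max = stats[i][:max_bytes]&.unpack1("q<")
    # Without statistics the row group cannot be ruled out, so keep it.
    min.nil? || max.nil? || (min..max).cover?(target_id)
  end
end

# Usage (stats obtained from the footer; the lookup itself is not shown here):
#   groups = candidate_row_groups(stats, 12_345)
#   Parquet.each_row(source, columns: ["id"], row_groups: groups) { |row| p row }
```

Row groups with missing statistics are kept rather than dropped, which trades a few extra bytes fetched for never skipping a group that might hold the target.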
Notes:
Performance characteristics
Tests
- `test/enumerator_test.rb`: adds `test_remote_source_integration` and a `row_groups` smoke check.
- `test/column_test.rb`: adds a `row_groups` smoke check for column batches.
Backwards compatibility