Add bloom filter support for optimized Parquet querying#28
Open
rurabe wants to merge 2 commits into njaremko:main from
Conversation
Add comprehensive bloom filter support to enable efficient row group
filtering when reading Parquet files. Bloom filters are probabilistic
data structures that allow query engines to skip row groups that
definitely don't contain searched values.
Key features:
- Configure bloom filters per column with customizable false positive
probability (FPP) and number of distinct values (NDV)
- Path parameter uses consistent array format for both top-level and
nested columns
- Automatic NDV capping to row group size (1M rows) to prevent
unnecessarily large filters
- Full support for both write_rows and write_columns methods
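The NDV capping rule above can be sketched in a few lines. This is an illustrative sketch only; `effective_ndv` and `ROW_GROUP_SIZE` are hypothetical names, not the library's internals — the PR only states that the requested NDV is capped at the 1M-row group size.

```ruby
# Hypothetical sketch of the NDV capping described above: the requested
# number of distinct values is capped at the row group size, so a filter
# is never sized for more values than a single row group can hold.
ROW_GROUP_SIZE = 1_000_000

def effective_ndv(requested_ndv)
  [requested_ndv, ROW_GROUP_SIZE].min
end
```

So a configured NDV of 10,000 is used as-is, while a configured NDV of 5,000,000 would be capped to 1,000,000.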
API:
```ruby
bloom_filters: [
{ path: ['uuid'], false_positive_probability: 0.01, n_distinct_values: 10_000 },
{ path: ['user', 'email'] } # Nested column with defaults
]
```
The implementation leverages the Rust parquet crate's native bloom
filter support, creating Split Block Bloom Filters (SBBF) per row
group for optimal query performance.
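For context, a hedged usage sketch of how the option might be passed to a write call. This assumes a `Parquet.write_rows` entry point taking `schema:` and `path:` keywords; the exact signature and schema format may differ from the released gem, and the rows and column names are illustrative only.

```ruby
require 'parquet'

# Illustrative sketch: write rows with a bloom filter enabled on the
# uuid column. The bloom_filters: option is the one added by this PR;
# everything else here is an assumption about the surrounding API.
rows = [['550e8400-e29b-41d4-a716-446655440000']].each

Parquet.write_rows(
  rows,
  schema: [{ 'uuid' => 'string' }],
  path: 'out.parquet',
  bloom_filters: [
    { path: ['uuid'], false_positive_probability: 0.01, n_distinct_values: 10_000 }
  ]
)
```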
Author (rurabe)
Distributed Bloom Filter Test Results

The test simulated a data lake with 20 files × 3M rows each (60M total rows), where bloom filters help avoid unnecessary I/O.

Key performance wins:
- File/row group skipping: each file has 3 row groups (~1M rows each). The bloom filters (2MB per row group) allow the query engine to:
- Storage overhead:
Owner
Thanks for this @rurabe. The wall of text reads to me as "AI-assisted", which is fine, but makes me feel like I need to review it more strictly. Before I can merge this, can you please update the tests to have stronger assertions? Right now they're kind of hand-wave-y.
Summary
This PR adds comprehensive bloom filter support to parquet-ruby, enabling significant query performance improvements when reading Parquet files with filter predicates.
What are Bloom Filters?
Bloom filters are space-efficient probabilistic data structures that can tell us if an element is definitely NOT in a set. In Parquet, they allow query engines to skip entire row groups
(up to 1M rows) that don't contain the values being searched for, dramatically improving query performance for high-cardinality columns.
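The "definitely NOT in a set" guarantee can be illustrated with a toy bloom filter in Ruby. This is just the core idea — a bit array probed by several hash positions — not the Split Block Bloom Filter layout that Parquet actually writes to disk; the class and its parameters are invented for illustration.

```ruby
require 'digest'

# Toy bloom filter illustrating the skip guarantee described above.
class ToyBloomFilter
  def initialize(bits: 1024, hashes: 3)
    @bits = bits
    @hashes = hashes
    @bitset = Array.new(bits, false)
  end

  def add(value)
    positions(value).each { |p| @bitset[p] = true }
  end

  # false => value is definitely absent (safe to skip the row group)
  # true  => value may be present (the row group must be read)
  def may_contain?(value)
    positions(value).all? { |p| @bitset[p] }
  end

  private

  # Derive several bit positions from one value by salting the hash.
  def positions(value)
    (0...@hashes).map do |i|
      Digest::SHA256.hexdigest("#{i}:#{value}").to_i(16) % @bits
    end
  end
end
```

Every value that was added always reports "may be present" (no false negatives), which is why a negative answer is safe to act on: the query engine can skip the row group without ever missing a match.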
Implementation Details
API Design
The bloom filter configuration uses a consistent array-based path format:
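Based on the example earlier in this PR, both top-level and nested columns are addressed with the same array form:

```ruby
bloom_filters: [
  { path: ['uuid'] },          # top-level column: single-element array
  { path: ['user', 'email'] }  # nested column: one element per path segment
]
```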