Add bloom filter support for optimized Parquet querying#28

Open
rurabe wants to merge 2 commits into njaremko:main from rurabe:bloom

Conversation

@rurabe rurabe commented Sep 14, 2025

Summary

This PR adds comprehensive bloom filter support to parquet-ruby, enabling significant query performance improvements when reading Parquet files with filter predicates.

What are Bloom Filters?

Bloom filters are space-efficient probabilistic data structures that can report that an element is definitely NOT in a set (a positive answer only means the element might be present). In Parquet, they let query engines skip entire row groups (up to 1M rows) that don't contain the values being searched for, dramatically improving query performance for high-cardinality columns.
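As a minimal illustration of the idea (a toy, not the Split Block Bloom Filter format Parquet actually writes; the `TinyBloom` class and its methods are hypothetical), a bloom filter hashes each value into a few bit positions and answers membership queries with no false negatives:

```ruby
require 'zlib'

# Toy bloom filter for illustration only; Parquet uses the Split Block
# Bloom Filter (SBBF) format, not this layout.
class TinyBloom
  def initialize(bits: 1024, hashes: 3)
    @bits   = bits
    @hashes = hashes
    @bitset = Array.new(bits, false)
  end

  def add(value)
    positions(value).each { |i| @bitset[i] = true }
  end

  # false => value is definitely absent; true => value might be present
  def maybe_include?(value)
    positions(value).all? { |i| @bitset[i] }
  end

  private

  # Simulate k independent hash functions by salting CRC32 with a seed
  def positions(value)
    (0...@hashes).map { |seed| Zlib.crc32("#{seed}:#{value}") % @bits }
  end
end

filter = TinyBloom.new
filter.add("uuid-123")
filter.maybe_include?("uuid-123")  # => true (no false negatives)
```

A query engine exploits exactly the "definitely absent" answer: when the filter says no, the whole row group can be skipped without touching its data pages.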

Implementation Details

API Design

The bloom filter configuration uses a consistent array-based path format:

```ruby
Parquet.write_rows(data,
  schema: schema,
  write_to: "output.parquet",
  bloom_filters: [
    { path: ['uuid'], false_positive_probability: 0.01, n_distinct_values: 10_000 },
    { path: ['device_id'] },  # Uses defaults: FPP=0.05, NDV=1M
    { path: ['user', 'email'] }  # Nested column support
  ]
)
```

Key Features

1. Consistent API: path is always an array for type consistency
  - Top-level: ['column_name']
  - Nested: ['parent', 'child']
2. Smart NDV Capping: Automatically caps n_distinct_values to row group size (1M) to prevent unnecessarily large bloom filters
3. Per Row Group: Bloom filters are created per row group, not per file, enabling fine-grained filtering
4. Configurable Parameters:
  - false_positive_probability: Target FPP (default: 0.05)
  - n_distinct_values: Expected distinct values (default: 1M)
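The NDV capping in point 2 amounts to a one-line clamp (the constant and helper below are illustrative, not the gem's internal names):

```ruby
# A row group holds at most 1M rows, so a column within one row group
# can't have more than 1M distinct values; larger NDV requests are capped.
ROW_GROUP_SIZE = 1_000_000

def effective_ndv(requested_ndv)
  [requested_ndv, ROW_GROUP_SIZE].min
end

effective_ndv(10_000_000)  # => 1_000_000
effective_ndv(10_000)      # => 10_000
```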

Performance Characteristics

| Distinct Values | FPP = 0.01 | FPP = 0.05 | FPP = 0.1 |
|-----------------|------------|------------|-----------|
| 1,000           | 1.5 KB     | 1 KB       | 0.7 KB    |
| 10,000          | 16 KB      | 10 KB      | 7 KB      |
| 100,000         | 150 KB     | 100 KB     | 70 KB     |
| 1,000,000       | 1.5 MB     | 1 MB       | 700 KB    |
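The sizes above roughly track the classic bloom filter sizing formula m = -n·ln(p)/(ln 2)² bits, which the sketch below computes (the Rust parquet crate's SBBF implementation additionally rounds the byte count, so sizes in real files will differ somewhat):

```ruby
# Ballpark bloom filter size in bytes from NDV (n) and target FPP (p),
# using m = -n * ln(p) / (ln 2)^2 bits. Actual SBBF sizes are rounded.
def bloom_filter_bytes(ndv:, fpp:)
  bits = -ndv * Math.log(fpp) / (Math.log(2)**2)
  (bits / 8.0).ceil
end

bloom_filter_bytes(ndv: 10_000, fpp: 0.01)     # ~12 KB
bloom_filter_bytes(ndv: 1_000_000, fpp: 0.05)  # ~780 KB
```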

Testing

Added comprehensive test suite covering:
- Basic bloom filter usage with write_rows and write_columns
- Default value handling
- Single-element array paths
- Nested column support
- NDV capping behavior
- Verification that bloom filters are omitted when not configured

Documentation

Updated README with:
- Detailed bloom filter section explaining usage and configuration
- When to use bloom filters (high-cardinality columns)
- Trade-offs and performance considerations
- Size estimation table

Breaking Changes

None. This is a purely additive feature that doesn't affect existing code.

Future Enhancements

Potential future improvements could include:
- Custom row group size configuration
- Bloom filter statistics in metadata API
- Per-column-type automatic bloom filter recommendations

Checklist

- Implementation complete
- Tests passing
- Documentation updated
- No breaking changes

  Add comprehensive bloom filter support to enable efficient row group
  filtering when reading Parquet files. Bloom filters are probabilistic
  data structures that allow query engines to skip row groups that
  definitely don't contain searched values.

  Key features:
  - Configure bloom filters per column with customizable false positive
    probability (FPP) and number of distinct values (NDV)
  - Path parameter uses consistent array format for both top-level and
    nested columns
  - Automatic NDV capping to row group size (1M rows) to prevent
    unnecessarily large filters
  - Full support for both write_rows and write_columns methods

  API:
  ```ruby
  bloom_filters: [
    { path: ['uuid'], false_positive_probability: 0.01, n_distinct_values: 10_000 },
    { path: ['user', 'email'] }  # Nested column with defaults
  ]

  The implementation leverages the Rust parquet crate's native bloom
  filter support, creating Split Block Bloom Filters (SBBF) per row
  group for optimal query performance.
rurabe commented Sep 15, 2025

Distributed Bloom Filter Test Results

The test simulated a data lake with 20 files × 3M rows each (60M total rows), where bloom filters help avoid unnecessary I/O:

Key Performance Wins:

  1. Finding existing UUIDs (Test 1-2):
    - WITH bloom: 36-37ms
    - WITHOUT bloom: 171-356ms
    - ~5-10x speedup when searching for specific records
  2. Non-existent UUID search (Test 3) - The best case for bloom filters:
    - WITH bloom: 1-3ms (can skip files without reading data)
    - WITHOUT bloom: 164-173ms (must scan all files)
    - ~55-170x speedup for negative lookups
  3. Critical insight from EXPLAIN ANALYZE:
    - Both queries read "Total Files Read: 20" but bloom filters allow DuckDB to skip row groups within files
    - WITH bloom TABLE_SCAN: 0.36s
    - WITHOUT bloom TABLE_SCAN: 5.17s
    - ~14x faster at the scan level

File/Row Group Skipping:

Each file has 3 row groups (~1M rows each). The bloom filters (2MB per row group) allow the query engine to:

  • Check bloom filter metadata first (very fast)
  • Skip entire row groups that definitely don't contain the target value
  • Only read actual data from row groups that might contain matches
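The three steps above amount to a pruning loop like the following sketch (the `RowGroup`/`BloomStub` stand-ins and all method names here are hypothetical, not this gem's or DuckDB's API):

```ruby
# Stand-in structures: a real reader would build these from file metadata.
RowGroup = Struct.new(:name, :filters) do
  def bloom_filter(column)
    filters[column]  # nil when no bloom filter was written for the column
  end
end

# Exact-set stand-in for a bloom filter (a real one can false-positive).
BloomStub = Struct.new(:members) do
  def maybe_include?(value)
    members.include?(value)
  end
end

# Keep only row groups that might contain `value`; a missing filter
# means the row group can't be ruled out and must be read.
def matching_row_groups(row_groups, column, value)
  row_groups.select do |rg|
    bf = rg.bloom_filter(column)
    bf.nil? || bf.maybe_include?(value)
  end
end

groups = [
  RowGroup.new("rg0", { "uuid" => BloomStub.new(["a", "b"]) }),
  RowGroup.new("rg1", { "uuid" => BloomStub.new(["c"]) }),
  RowGroup.new("rg2", {})  # no bloom filter: always read
]
matching_row_groups(groups, "uuid", "c").map(&:name)  # => ["rg1", "rg2"]
```

Because checking the filter only touches metadata, a definite miss costs almost nothing, which is why the non-existent-UUID case above shows the largest speedup.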

Storage Overhead:

  • Bloom filter overhead: 120MB for 2.7GB of data (4.39%)
  • Each row group has a 2MB bloom filter
  • Total: 60 row groups × 2MB = 120MB of bloom filters


njaremko commented Sep 23, 2025

Thanks for this @rurabe

The wall of text reads to me as "AI-assisted", which is fine, but makes me feel like I need to review it more strictly.

Before I can merge this, can you please update the tests to have stronger assertions? Right now they're kind of hand-wave-y

