Add bloom filter support for optimized Parquet querying#28

Open
rurabe wants to merge 2 commits into njaremko:main from rurabe:bloom

Conversation

@rurabe rurabe commented Sep 14, 2025

Summary

This PR adds comprehensive bloom filter support to parquet-ruby, enabling significant query performance improvements when reading Parquet files with filter predicates.

What are Bloom Filters?

Bloom filters are space-efficient probabilistic data structures that can report that an element is definitely NOT in a set (a positive answer only means the element might be present). In Parquet, they let query engines skip entire row groups (up to 1M rows) that don't contain the values being searched for, dramatically improving query performance for high-cardinality columns.
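As a minimal illustration of the idea (a toy, not the Split Block Bloom Filter format Parquet actually writes; the `TinyBloom` class and its methods are hypothetical), a bloom filter hashes each value into a few bit positions and answers membership queries with no false negatives:

```ruby
require 'zlib'

# Toy bloom filter for illustration only; Parquet uses the Split Block
# Bloom Filter (SBBF) format, not this layout.
class TinyBloom
  def initialize(bits: 1024, hashes: 3)
    @bits   = bits
    @hashes = hashes
    @bitset = Array.new(bits, false)
  end

  def add(value)
    positions(value).each { |i| @bitset[i] = true }
  end

  # false => value is definitely absent; true => value might be present
  def maybe_include?(value)
    positions(value).all? { |i| @bitset[i] }
  end

  private

  # Simulate k independent hash functions by salting CRC32 with a seed
  def positions(value)
    (0...@hashes).map { |seed| Zlib.crc32("#{seed}:#{value}") % @bits }
  end
end

filter = TinyBloom.new
filter.add("uuid-123")
filter.maybe_include?("uuid-123")  # => true (no false negatives)
```

A query engine exploits exactly the "definitely absent" answer: when the filter says no, the whole row group can be skipped without touching its data pages.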

Implementation Details

API Design

The bloom filter configuration uses a consistent array-based path format:

```ruby
Parquet.write_rows(data,
  schema: schema,
  write_to: "output.parquet",
  bloom_filters: [
    { path: ['uuid'], false_positive_probability: 0.01, n_distinct_values: 10_000 },
    { path: ['device_id'] },  # Uses defaults: FPP=0.05, NDV=1M
    { path: ['user', 'email'] }  # Nested column support
  ]
)
```

Key Features

1. Consistent API: path is always an array for type consistency
  - Top-level: ['column_name']
  - Nested: ['parent', 'child']
2. Smart NDV Capping: Automatically caps n_distinct_values to row group size (1M) to prevent unnecessarily large bloom filters
3. Per Row Group: Bloom filters are created per row group, not per file, enabling fine-grained filtering
4. Configurable Parameters:
  - false_positive_probability: Target FPP (default: 0.05)
  - n_distinct_values: Expected distinct values (default: 1M)
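The NDV capping in point 2 amounts to a one-line clamp (the constant and helper below are illustrative, not the gem's internal names):

```ruby
# A row group holds at most 1M rows, so a column within one row group
# can't have more than 1M distinct values; larger NDV requests are capped.
ROW_GROUP_SIZE = 1_000_000

def effective_ndv(requested_ndv)
  [requested_ndv, ROW_GROUP_SIZE].min
end

effective_ndv(10_000_000)  # => 1_000_000
effective_ndv(10_000)      # => 10_000
```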

Performance Characteristics

| Distinct Values | FPP = 0.01 | FPP = 0.05 | FPP = 0.1 |
|-----------------|------------|------------|-----------|
| 1,000           | 1.5 KB     | 1 KB       | 0.7 KB    |
| 10,000          | 16 KB      | 10 KB      | 7 KB      |
| 100,000         | 150 KB     | 100 KB     | 70 KB     |
| 1,000,000       | 1.5 MB     | 1 MB       | 700 KB    |
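The sizes above roughly track the classic bloom filter sizing formula m = -n·ln(p)/(ln 2)² bits, which the sketch below computes (the Rust parquet crate's SBBF implementation additionally rounds the byte count, so sizes in real files will differ somewhat):

```ruby
# Ballpark bloom filter size in bytes from NDV (n) and target FPP (p),
# using m = -n * ln(p) / (ln 2)^2 bits. Actual SBBF sizes are rounded.
def bloom_filter_bytes(ndv:, fpp:)
  bits = -ndv * Math.log(fpp) / (Math.log(2)**2)
  (bits / 8.0).ceil
end

bloom_filter_bytes(ndv: 10_000, fpp: 0.01)     # ~12 KB
bloom_filter_bytes(ndv: 1_000_000, fpp: 0.05)  # ~780 KB
```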

Testing

Added comprehensive test suite covering:
- Basic bloom filter usage with write_rows and write_columns
- Default value handling
- Single-element array paths
- Nested column support
- NDV capping behavior
- Verification that bloom filters are omitted when not configured

Documentation

Updated README with:
- Detailed bloom filter section explaining usage and configuration
- When to use bloom filters (high-cardinality columns)
- Trade-offs and performance considerations
- Size estimation table

Breaking Changes

None. This is a purely additive feature that doesn't affect existing code.

Future Enhancements

Potential future improvements could include:
- Custom row group size configuration
- Bloom filter statistics in metadata API
- Per-column-type automatic bloom filter recommendations

Checklist

- Implementation complete
- Tests passing
- Documentation updated
- No breaking changes

  Add comprehensive bloom filter support to enable efficient row group
  filtering when reading Parquet files. Bloom filters are probabilistic
  data structures that allow query engines to skip row groups that
  definitely don't contain searched values.

  Key features:
  - Configure bloom filters per column with customizable false positive
    probability (FPP) and number of distinct values (NDV)
  - Path parameter uses consistent array format for both top-level and
    nested columns
  - Automatic NDV capping to row group size (1M rows) to prevent
    unnecessarily large filters
  - Full support for both write_rows and write_columns methods

  API:
  ```ruby
  bloom_filters: [
    { path: ['uuid'], false_positive_probability: 0.01, n_distinct_values: 10_000 },
    { path: ['user', 'email'] }  # Nested column with defaults
  ]

  The implementation leverages the Rust parquet crate's native bloom
  filter support, creating Split Block Bloom Filters (SBBF) per row
  group for optimal query performance.
rurabe commented Sep 15, 2025

Distributed Bloom Filter Test Results

The test simulated a data lake with 20 files × 3M rows each (60M total rows), where bloom filters help avoid unnecessary I/O:

Key Performance Wins:

  1. Finding existing UUIDs (Test 1-2):
    - WITH bloom: 36-37ms
    - WITHOUT bloom: 171-356ms
    - ~5-10x speedup when searching for specific records
  2. Non-existent UUID search (Test 3) - The best case for bloom filters:
    - WITH bloom: 1-3ms (can skip files without reading data)
    - WITHOUT bloom: 164-173ms (must scan all files)
    - ~55-170x speedup for negative lookups
  3. Critical insight from EXPLAIN ANALYZE:
    - Both queries read "Total Files Read: 20" but bloom filters allow DuckDB to skip row groups within files
    - WITH bloom TABLE_SCAN: 0.36s
    - WITHOUT bloom TABLE_SCAN: 5.17s
    - ~14x faster at the scan level

File/Row Group Skipping:

Each file has 3 row groups (~1M rows each). The bloom filters (2MB per row group) allow the query engine to:

  • Check bloom filter metadata first (very fast)
  • Skip entire row groups that definitely don't contain the target value
  • Only read actual data from row groups that might contain matches
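The three steps above amount to a pruning loop like the following sketch (the `RowGroup`/`BloomStub` stand-ins and all method names here are hypothetical, not this gem's or DuckDB's API):

```ruby
# Stand-in structures: a real reader would build these from file metadata.
RowGroup = Struct.new(:name, :filters) do
  def bloom_filter(column)
    filters[column]  # nil when no bloom filter was written for the column
  end
end

# Exact-set stand-in for a bloom filter (a real one can false-positive).
BloomStub = Struct.new(:members) do
  def maybe_include?(value)
    members.include?(value)
  end
end

# Keep only row groups that might contain `value`; a missing filter
# means the row group can't be ruled out and must be read.
def matching_row_groups(row_groups, column, value)
  row_groups.select do |rg|
    bf = rg.bloom_filter(column)
    bf.nil? || bf.maybe_include?(value)
  end
end

groups = [
  RowGroup.new("rg0", { "uuid" => BloomStub.new(["a", "b"]) }),
  RowGroup.new("rg1", { "uuid" => BloomStub.new(["c"]) }),
  RowGroup.new("rg2", {})  # no bloom filter: always read
]
matching_row_groups(groups, "uuid", "c").map(&:name)  # => ["rg1", "rg2"]
```

Because checking the filter only touches metadata, a definite miss costs almost nothing, which is why the non-existent-UUID case above shows the largest speedup.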

Storage Overhead:

  • Bloom filter overhead: 120MB for 2.7GB of data (4.39%)
  • Each row group has a 2MB bloom filter
  • Total: 60 row groups × 2MB = 120MB of bloom filters


njaremko commented Sep 23, 2025

Thanks for this @rurabe

The wall of text reads to me as "AI-assisted", which is fine, but makes me feel like I need to review it more strictly.

Before I can merge this, can you please update the tests to have stronger assertions? Right now they're kind of hand-wave-y

