@sonhmai sonhmai commented Jan 8, 2026

Which issue does this PR close?

Rationale for this change

The RowFilter API already exists and can evaluate predicates while a Parquet file is being read, but it has no usage examples.
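
To make the idea behind a row filter concrete, here is a minimal toy sketch of filtering during a read: the predicate is evaluated on the filter column first, and other columns are materialized only for the rows that pass. It deliberately does not use the parquet crate; `evaluate_predicate`, `materialize`, and the sample data are hypothetical names and values chosen for illustration, and the real reader works on Arrow arrays rather than `Vec`s.

```rust
// Toy model only: none of these names come from the parquet or arrow crates.

/// Phase 1: evaluate the predicate on the filter column alone,
/// producing one boolean per row.
fn evaluate_predicate(filter_col: &[i32], pred: impl Fn(i32) -> bool) -> Vec<bool> {
    filter_col.iter().map(|&v| pred(v)).collect()
}

/// Phase 2: materialize another column only for the rows the mask selected.
fn materialize<T: Clone>(col: &[T], mask: &[bool]) -> Vec<T> {
    col.iter()
        .zip(mask)
        .filter_map(|(v, &keep)| keep.then(|| v.clone()))
        .collect()
}

fn main() {
    // Hypothetical two-column table.
    let id = vec![0, 1, 2, 3, 4, 5, 6, 7];
    let label = vec!["a", "b", "c", "d", "e", "f", "g", "h"];

    // Same shape of predicate as the doc example under review: id > 4.
    let mask = evaluate_predicate(&id, |v| v > 4);
    let selected = materialize(&label, &mask);

    println!("{:?}", selected); // prints ["f", "g", "h"]
}
```

In this model the `label` values for rows 0 through 4 are never copied, which is the kind of work a row filter is meant to avoid.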

What changes are included in this PR?

  • Added a rustdoc example and blog link to ParquetRecordBatchReaderBuilder::with_row_filter.
  • Added a running example in parquet/examples/read_with_row_filter.rs

Are these changes tested?

Yes

cargo run -p parquet --example read_with_row_filter
cargo test -p parquet --doc

Are there any user-facing changes?

Yes, doc only. No API changes.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jan 8, 2026
@sonhmai sonhmai force-pushed the doc/row-filter-usage-9096 branch from 37be4e1 to f286dfd Compare January 8, 2026 06:59
@sonhmai sonhmai changed the title doc: add example of RowFilter usage draft: doc: add example of RowFilter usage Jan 8, 2026
@sonhmai sonhmai force-pushed the doc/row-filter-usage-9096 branch from f286dfd to bc8e06f Compare January 8, 2026 07:32
@sonhmai sonhmai changed the title draft: doc: add example of RowFilter usage doc: add example of RowFilter usage Jan 8, 2026
sonhmai commented Jan 8, 2026

@alamb would you mind reviewing this? Thanks!

use parquet::errors::Result;
use std::fs::File;

// RowFilter / with_row_filter usage. For background and more

Are we better off removing this and keeping only the doctest to reduce duplication?

Yes, I agree -- I think the doc examples are easier to find so I recommend removing this example file

Actually, looking at the existing examples I think many of them are redundant / would be easier to find if we moved them into the documentation:
https://github.com/apache/arrow-rs/tree/main/parquet/examples

/// more efficient skipping over data pages. See [`ArrowReaderOptions::with_page_index`].
///
/// For a running example see `parquet/examples/read_with_row_filter.rs`.
/// See <https://arrow.apache.org/blog/2025/12/11/parquet-late-materialization-deep-dive/>

/// See the [blog post on late materialization] for a more technical explanation.
///
/// ...
///
/// [blog post on late materialization]: https://arrow.apache.org/blog/2025/12/11/parquet-late-materialization-deep-dive/

Slightly nicer formatting this way

+1

@alamb alamb left a comment

Thank you @sonhmai and @Jefffrey -- this is great work and a nice addition.

I think @Jefffrey's suggestions and mine would make this PR better, but we could also merge it as is and iterate in a follow-up. Just let us know what you would like to do, @sonhmai.

/// let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
/// let schema_desc = builder.metadata().file_metadata().schema_descr_ptr();
///
/// // Create predicate: column id > 4. This col has index 0.

Suggested change
/// // Create predicate: column id > 4. This col has index 0.
/// // Create predicate that evaluates `id > 4`. The `id` column has index 0.

/// // Create predicate: column id > 4. This col has index 0.
/// let projection = ProjectionMask::leaves(&schema_desc, [0]);
/// let predicate = ArrowPredicateFn::new(projection, |batch| {
///     let id_col = batch.column(0);

As a minor suggestion, I think the example would be nicer if it used a column other than index 0, so it is clear that the batch passed to the predicate contains only the columns selected by the projection.

For example, perhaps you could use int_col (column index 4):

> select * from './parquet-testing/data/alltypes_plain.parquet';
+----+----------+-------------+--------------+---------+------------+-----------+------------+------------------+------------+---------------------+
| id | bool_col | tinyint_col | smallint_col | int_col | bigint_col | float_col | double_col | date_string_col  | string_col | timestamp_col       |
+----+----------+-------------+--------------+---------+------------+-----------+------------+------------------+------------+---------------------+
| 4  | true     | 0           | 0            | 0       | 0          | 0.0       | 0.0        | 30332f30312f3039 | 30         | 2009-03-01T00:00:00 |
| 5  | false    | 1           | 1            | 1       | 10         | 1.1       | 10.1       | 30332f30312f3039 | 31         | 2009-03-01T00:01:00 |
| 6  | true     | 0           | 0            | 0       | 0          | 0.0       | 0.0        | 30342f30312f3039 | 30         | 2009-04-01T00:00:00 |
| 7  | false    | 1           | 1            | 1       | 10         | 1.1       | 10.1       | 30342f30312f3039 | 31         | 2009-04-01T00:01:00 |
| 2  | true     | 0           | 0            | 0       | 0          | 0.0       | 0.0        | 30322f30312f3039 | 30         | 2009-02-01T00:00:00 |
| 3  | false    | 1           | 1            | 1       | 10         | 1.1       | 10.1       | 30322f30312f3039 | 31         | 2009-02-01T00:01:00 |
| 0  | true     | 0           | 0            | 0       | 0          | 0.0       | 0.0        | 30312f30312f3039 | 30         | 2009-01-01T00:00:00 |
| 1  | false    | 1           | 1            | 1       | 10         | 1.1       | 10.1       | 30312f30312f3039 | 31         | 2009-01-01T00:01:00 |
+----+----------+-------------+--------------+---------+------------+-----------+------------+------------------+------------+---------------------+
8 row(s) fetched.
Elapsed 0.039 seconds.

> describe './parquet-testing/data/alltypes_plain.parquet';
+-----------------+---------------+-------------+
| column_name     | data_type     | is_nullable |
+-----------------+---------------+-------------+
| id              | Int32         | YES         |
| bool_col        | Boolean       | YES         |
| tinyint_col     | Int32         | YES         |
| smallint_col    | Int32         | YES         |
| int_col         | Int32         | YES         |
| bigint_col      | Int64         | YES         |
| float_col       | Float32       | YES         |
| double_col      | Float64       | YES         |
| date_string_col | BinaryView    | YES         |
| string_col      | BinaryView    | YES         |
| timestamp_col   | Timestamp(ns) | YES         |
+-----------------+---------------+-------------+
11 row(s) fetched.
Elapsed 0.005 seconds.
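
The indexing point in this suggestion can be sketched with a small toy model: a column selected by its file-level index (4 for int_col) shows up at position 0 inside the batch the predicate receives. `Batch`, `project`, `sample_file`, and `ids_where_int_col_positive` are hypothetical and are not parquet or arrow types; the data is the first five columns of the query output above, with bool_col encoded as 0/1 to keep every column an `i32`.

```rust
// Toy model only: `Batch` and `project` are hypothetical, not parquet/arrow APIs.

/// Columns stored in file order; every column is i32 to keep the sketch small.
struct Batch {
    columns: Vec<Vec<i32>>,
}

impl Batch {
    /// Positional access, relative to this batch rather than to the file schema.
    fn column(&self, i: usize) -> &[i32] {
        &self.columns[i]
    }
}

/// Builds a new batch holding only the columns at the given file-level indices.
fn project(file: &Batch, leaves: &[usize]) -> Batch {
    Batch {
        columns: leaves.iter().map(|&i| file.columns[i].clone()).collect(),
    }
}

/// First five columns of alltypes_plain.parquet as printed above.
fn sample_file() -> Batch {
    Batch {
        columns: vec![
            vec![4, 5, 6, 7, 2, 3, 0, 1], // 0: id
            vec![1, 0, 1, 0, 1, 0, 1, 0], // 1: bool_col (true = 1, false = 0)
            vec![0, 1, 0, 1, 0, 1, 0, 1], // 2: tinyint_col
            vec![0, 1, 0, 1, 0, 1, 0, 1], // 3: smallint_col
            vec![0, 1, 0, 1, 0, 1, 0, 1], // 4: int_col
        ],
    }
}

/// Evaluates `int_col > 0` and returns the ids of the rows that pass.
fn ids_where_int_col_positive(file: &Batch) -> Vec<i32> {
    // The predicate's batch contains only file column 4 ...
    let predicate_batch = project(file, &[4]);
    // ... so inside the predicate it is column 0, not column 4.
    let mask: Vec<bool> = predicate_batch.column(0).iter().map(|&v| v > 0).collect();

    file.column(0)
        .iter()
        .zip(&mask)
        .filter_map(|(&id, &keep)| keep.then_some(id))
        .collect()
}

fn main() {
    println!("{:?}", ids_where_int_col_positive(&sample_file())); // prints [5, 7, 3, 1]
}
```

Calling `predicate_batch.column(4)` in this model would panic with an out-of-bounds index, which is the confusion a non-zero column in the doc example would help readers avoid.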

Development

Successfully merging this pull request may close these issues.

Document / Add an example of RowFilter usage

3 participants