Row group limit pruning #18868

xudong963 · 2025-11-21T09:43:44Z

Which issue does this PR close?

Part of Support limit pruning #18860

Rationale for this change

See #18860 (comment)

What changes are included in this PR?

How to decide if we can do limit pruning without messing up the sql semantics.
Add logic to decide if a row group is fully matched, all rows in the row group are matched the predicated.
Use the fully matched row groups to return limit rows.

Are these changes tested?

Yes

Are there any user-facing changes?

No, no new configs, or API change

datafusion/datasource-parquet/src/opener.rs

datafusion/datasource-parquet/src/row_group_filter.rs

datafusion/core/tests/parquet/row_group_pruning.rs

datafusion/core/tests/parquet/mod.rs

datafusion/datasource-parquet/src/opener.rs

xudong963 · 2025-11-25T05:35:43Z

datafusion/datasource-parquet/src/row_group_filter.rs

    /// which row groups should be accessed
    access_plan: ParquetAccessPlan,
+    /// which row groups are fully contained within the pruning predicate
+    is_fully_matched: Vec<bool>,


The field tracks the row groups and marks if they're fully matched

I recommend also defining "full contained" (aka it is known for sure that ALL rows within the RowGoup pass the filter -- or conversely that no rows are filtered)

Do you mean to rename is_fully_matched as fully_contained?

xudong963 · 2025-11-25T05:37:47Z

I also like to construct some good-fit cases to show some benchmark results.

alamb · 2025-11-25T22:37:10Z

I plan to review this PR carefully tomorrow

alamb

Thank you very much for this PR @xudong963

The idea is very cool, and a nicely done PR; I have several suggestions, but I think it is very close.

Suggestions:

Rename the flag to preserve_order
Make this new flag on TableScan visible in EXPLAIN plans so we have a better chance of understanding when it is being applied/not applied
Implement an end to end test for the behavior you want to see (specifically, a test that shows a query on a parquet file and demonstrates that only a single row group is fetched with this flag, and more than one is fetched without the flag)

In my mind, 3 is the most important

alamb · 2025-11-26T20:48:50Z

datafusion/catalog/src/table.rs

+    ///
+    /// # Arguments
+    /// * `order_sensitive` - Whether the scan's limit should be order-sensitive
+    pub fn with_limit_order_sensitive(mut self, order_sensitive: bool) -> Self {


I was pretty confused by this name for awhile. After reading more of the PR I undersand now that this flag basically means that the order of the rows in the file is important for the query. When the order is not required for the query, it gives the datasource the freedom to reorder the rows if there is some way that is more efficient to apply filters / limits

I think this is a nice general concept (it doesn't only apply to limits -- it could be used to parallelize decode within a CSV reader, for example)

What do you think about renaming this field / other references to preserve_order instead of limit_order_sensitive .

The semantics would be that the scan must preserve the order of rows (basically the same practical effect of this flag now)

Yes, exactly, I like preserve_order

done 24ae747

datafusion/expr/src/logical_plan/plan.rs

alamb · 2025-11-26T20:51:07Z

datafusion/optimizer/src/optimize_projections/mod.rs

-            )
-            .map(LogicalPlan::TableScan)
-            .map(Transformed::yes);
+            let mut new_scan =


I worry people will start to miss these fields. I wonder if we should make a TableScanBuilder or something (as a follow on PR) and deprecate TableScan::try_new to make it less likely that these fields will get forgotten

yeah, I agree

alamb · 2025-11-26T20:56:23Z

datafusion/optimizer/src/push_down_limit.rs

                })),

            LogicalPlan::Sort(mut sort) => {
+                let marked_input =


Since this optimizer is already doing a top-down traversal, I think we can avoid this call (which will rewrite all inputs a second time) and just set a flag on on the PushDownLimit structure itself and reuse the existing tree rewrite

Makes sense, I refactored it e65f5a3

alamb · 2025-11-26T20:59:36Z

datafusion/optimizer/src/push_down_limit.rs

            LogicalPlan::Sort(mut sort) => {
+                let marked_input =
+                    mark_fetch_order_sensitive(Arc::unwrap_or_clone(sort.input))?;
+                sort.input = Arc::new(marked_input);


this is probably a conservative choice, but I think it will over eagerly mark some inputs -- for example I think it will mark this scan, even when it doesn't matter

SELECT COUNT(*) as cnt, x FROM t GROUP BY x ORDER BY cnt DESC LIMIT 10;

I think if you used a flag approach, you could probably avoid this (by resetting the flag when you hit an aggregate, for example)

yes, I added a test for it in the commit e65f5a3.

alamb · 2025-11-26T21:09:46Z

datafusion/datasource-parquet/src/row_group_filter.rs

+                            for (i, &original_row_group_idx) in
+                                fully_contained_candidates_original_idx.iter().enumerate()
+                            {
+                                // If the inverted predicate *also* prunes this row group (meaning inverted_values[i] is false),


This is a very clever idea -- basically if not(predicate) would prune the predicate it means that predicate evaluated to (known) true for all rows

I tried thinking of counter examples, but I couldn't come up with any and the reasoning is sound

alamb · 2025-11-26T21:10:24Z

datafusion/physical-expr/src/simplifier/not.rs

+use crate::expressions::{lit, BinaryExpr, Literal, NotExpr};
+use crate::PhysicalExpr;
+
+/// Attempts to simplify NOT expressions


I recommend we pull this code into its own PR for easier review

alamb · 2025-11-26T21:11:27Z

datafusion/physical-expr/src/simplifier/mod.rs

    type Node = Arc<dyn PhysicalExpr>;

    fn f_up(&mut self, node: Self::Node) -> Result<Transformed<Self::Node>> {
+        // Apply NOT expression simplification first


I think we should pull the NOT simplification into its own PR

I opened a PR for it #18970

alamb · 2025-11-26T21:13:52Z

datafusion/datasource-parquet/src/row_group_filter.rs

+        &self.is_fully_matched
+    }
+
+    /// Prunes the access plan based on the limit and fully contained row groups.


I think we should explain the rationale here and leave a link to the paper -- I think you can adapt the very nice description on #18860

Basically:

Why is this optimization important: (can reduce the number of RowGroups needed with a known limit)

Why is it correct (the rationale on Support limit pruning #18860)

done in fdef903

alamb · 2025-11-26T21:14:35Z

datafusion/datasource-parquet/src/row_group_filter.rs

    /// which row groups should be accessed
    access_plan: ParquetAccessPlan,
+    /// which row groups are fully contained within the pruning predicate
+    is_fully_matched: Vec<bool>,


I recommend also defining "full contained" (aka it is known for sure that ALL rows within the RowGoup pass the filter -- or conversely that no rows are filtered)

alamb · 2025-11-27T11:56:22Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing row_group_limit_pruning (24ae747) to f1ecacc diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb · 2025-11-27T12:54:32Z

🤖: Benchmark completed

Details

Comparing HEAD and row_group_limit_pruning
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ row_group_limit_pruning ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │  2564.84 ms │              2711.60 ms │ 1.06x slower │
│ QQuery 1     │  1177.86 ms │              1306.25 ms │ 1.11x slower │
│ QQuery 2     │  2412.87 ms │              2423.96 ms │    no change │
│ QQuery 3     │  1156.85 ms │              1145.44 ms │    no change │
│ QQuery 4     │  2318.20 ms │              2308.57 ms │    no change │
│ QQuery 5     │ 28372.67 ms │             27912.28 ms │    no change │
│ QQuery 6     │  4041.22 ms │              4170.72 ms │    no change │
│ QQuery 7     │  3812.31 ms │              3670.54 ms │    no change │
└──────────────┴─────────────┴─────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 45856.83ms │
│ Total Time (row_group_limit_pruning)   │ 45649.37ms │
│ Average Time (HEAD)                    │  5732.10ms │
│ Average Time (row_group_limit_pruning) │  5706.17ms │
│ Queries Faster                         │          0 │
│ Queries Slower                         │          2 │
│ Queries with No Change                 │          6 │
│ Queries with Failure                   │          0 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ row_group_limit_pruning ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.11 ms │                 2.20 ms │     no change │
│ QQuery 1     │    49.14 ms │                49.42 ms │     no change │
│ QQuery 2     │   134.43 ms │               137.18 ms │     no change │
│ QQuery 3     │   165.90 ms │               166.30 ms │     no change │
│ QQuery 4     │  1082.62 ms │              1093.44 ms │     no change │
│ QQuery 5     │  1509.20 ms │              1523.29 ms │     no change │
│ QQuery 6     │     2.23 ms │                 2.12 ms │     no change │
│ QQuery 7     │    53.46 ms │                53.16 ms │     no change │
│ QQuery 8     │  1464.07 ms │              1443.88 ms │     no change │
│ QQuery 9     │  1920.32 ms │              1861.46 ms │     no change │
│ QQuery 10    │   368.92 ms │               369.59 ms │     no change │
│ QQuery 11    │   425.35 ms │               415.20 ms │     no change │
│ QQuery 12    │  1360.27 ms │              1366.53 ms │     no change │
│ QQuery 13    │  2088.62 ms │              2099.38 ms │     no change │
│ QQuery 14    │  1291.80 ms │              1276.05 ms │     no change │
│ QQuery 15    │  1287.95 ms │              1249.38 ms │     no change │
│ QQuery 16    │  2687.28 ms │              2705.34 ms │     no change │
│ QQuery 17    │  2666.35 ms │              2685.82 ms │     no change │
│ QQuery 18    │  5334.10 ms │              5051.38 ms │ +1.06x faster │
│ QQuery 19    │   128.58 ms │               129.60 ms │     no change │
│ QQuery 20    │  2025.19 ms │              1974.90 ms │     no change │
│ QQuery 21    │  2326.74 ms │              2256.83 ms │     no change │
│ QQuery 22    │  3910.56 ms │              3883.88 ms │     no change │
│ QQuery 23    │ 16847.81 ms │             12664.80 ms │ +1.33x faster │
│ QQuery 24    │   218.91 ms │               218.97 ms │     no change │
│ QQuery 25    │   470.06 ms │               470.62 ms │     no change │
│ QQuery 26    │   211.18 ms │               219.75 ms │     no change │
│ QQuery 27    │  2828.80 ms │              2821.30 ms │     no change │
│ QQuery 28    │ 23333.65 ms │             24002.05 ms │     no change │
│ QQuery 29    │  1010.28 ms │               990.00 ms │     no change │
│ QQuery 30    │  1361.32 ms │              1340.64 ms │     no change │
│ QQuery 31    │  1401.38 ms │              1386.63 ms │     no change │
│ QQuery 32    │  4833.23 ms │              4590.26 ms │ +1.05x faster │
│ QQuery 33    │  5922.93 ms │              5861.58 ms │     no change │
│ QQuery 34    │  6218.31 ms │              6093.69 ms │     no change │
│ QQuery 35    │  1945.21 ms │              1892.18 ms │     no change │
│ QQuery 36    │   118.71 ms │               122.83 ms │     no change │
│ QQuery 37    │    52.92 ms │                53.12 ms │     no change │
│ QQuery 38    │   119.74 ms │               120.43 ms │     no change │
│ QQuery 39    │   195.76 ms │               198.67 ms │     no change │
│ QQuery 40    │    40.90 ms │                45.00 ms │  1.10x slower │
│ QQuery 41    │    39.19 ms │                40.69 ms │     no change │
│ QQuery 42    │    31.96 ms │                33.45 ms │     no change │
└──────────────┴─────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 99487.42ms │
│ Total Time (row_group_limit_pruning)   │ 94963.00ms │
│ Average Time (HEAD)                    │  2313.66ms │
│ Average Time (row_group_limit_pruning) │  2208.44ms │
│ Queries Faster                         │          3 │
│ Queries Slower                         │          1 │
│ Queries with No Change                 │         39 │
│ Queries with Failure                   │          0 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ row_group_limit_pruning ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 130.90 ms │               136.37 ms │    no change │
│ QQuery 2     │  28.85 ms │                29.40 ms │    no change │
│ QQuery 3     │  35.92 ms │                38.73 ms │ 1.08x slower │
│ QQuery 4     │  29.91 ms │                29.59 ms │    no change │
│ QQuery 5     │  87.85 ms │                88.68 ms │    no change │
│ QQuery 6     │  19.75 ms │                19.33 ms │    no change │
│ QQuery 7     │ 224.01 ms │               217.44 ms │    no change │
│ QQuery 8     │  33.21 ms │                32.74 ms │    no change │
│ QQuery 9     │ 103.44 ms │               100.86 ms │    no change │
│ QQuery 10    │  64.34 ms │                64.25 ms │    no change │
│ QQuery 11    │  17.90 ms │                19.12 ms │ 1.07x slower │
│ QQuery 12    │  52.62 ms │                53.57 ms │    no change │
│ QQuery 13    │  47.65 ms │                47.74 ms │    no change │
│ QQuery 14    │  13.71 ms │                13.79 ms │    no change │
│ QQuery 15    │  24.87 ms │                24.60 ms │    no change │
│ QQuery 16    │  25.17 ms │                25.16 ms │    no change │
│ QQuery 17    │ 149.35 ms │               151.78 ms │    no change │
│ QQuery 18    │ 278.08 ms │               276.50 ms │    no change │
│ QQuery 19    │  37.01 ms │                38.28 ms │    no change │
│ QQuery 20    │  49.78 ms │                48.67 ms │    no change │
│ QQuery 21    │ 317.59 ms │               310.12 ms │    no change │
│ QQuery 22    │  18.42 ms │                18.49 ms │    no change │
└──────────────┴───────────┴─────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 1790.33ms │
│ Total Time (row_group_limit_pruning)   │ 1785.22ms │
│ Average Time (HEAD)                    │   81.38ms │
│ Average Time (row_group_limit_pruning) │   81.15ms │
│ Queries Faster                         │         0 │
│ Queries Slower                         │         2 │
│ Queries with No Change                 │        20 │
│ Queries with Failure                   │         0 │
└────────────────────────────────────────┴───────────┘

…t pruning

xudong963 marked this pull request as draft November 21, 2025 09:43

github-actions bot added physical-expr Changes to the physical-expr crates core Core DataFusion crate execution Related to the execution crate datasource Changes to the datasource crate physical-plan Changes to the physical-plan crate labels Nov 21, 2025

xudong963 commented Nov 21, 2025

View reviewed changes

datafusion/datasource-parquet/src/opener.rs Outdated Show resolved Hide resolved

xudong963 commented Nov 21, 2025

View reviewed changes

datafusion/datasource-parquet/src/row_group_filter.rs Outdated Show resolved Hide resolved

github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules catalog Related to the catalog crate proto Related to proto crate labels Nov 25, 2025

xudong963 force-pushed the row_group_limit_pruning branch from 52f012f to d075c86 Compare November 25, 2025 05:26

xudong963 marked this pull request as ready for review November 25, 2025 05:31

xudong963 commented Nov 25, 2025

View reviewed changes

xudong963 force-pushed the row_group_limit_pruning branch from aaf76c0 to cb5711d Compare November 25, 2025 06:32

alamb mentioned this pull request Nov 25, 2025

Andrew Lamb Weekly-ish Open Source plan - 2025-11-24 #18888

Open

31 tasks

alamb reviewed Nov 26, 2025

View reviewed changes

xudong963 added 8 commits November 27, 2025 21:24

Support row group limit pruning

20cecfa

Support row group limit pruning

f847f8b

Add fetch_order_sensitive during limit pushdown to decide if use limi…

49a5c28

…t pruning

fix test formant

c6b19c2

Rename to preserve_order

ecbfff7

refactor pushdown limit

e919879

extract some logic into identify_fully_matched_row_groups

647618a

resolve conflicts

7e1fc4f

xudong963 force-pushed the row_group_limit_pruning branch from fdef903 to 7e1fc4f Compare November 27, 2025 13:31

xudong963 mentioned this pull request Nov 27, 2025

Support simplify not for physical expr #18970

Open

Row group limit pruning #18868

Are you sure you want to change the base?

Row group limit pruning #18868

Conversation

xudong963 commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xudong963 commented Nov 25, 2025

Uh oh!

alamb commented Nov 25, 2025

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Nov 27, 2025

Uh oh!

alamb commented Nov 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

xudong963 commented Nov 21, 2025 •

edited

Loading