@xudong963 (Member) commented Nov 21, 2025

Which issue does this PR close?

Rationale for this change

See #18860 (comment)

What changes are included in this PR?

  1. Decide when limit pruning can be applied without changing the SQL semantics.
  2. Add logic to determine whether a row group is fully matched, i.e. all rows in the row group match the predicate.
  3. Use the fully matched row groups to satisfy the limit.

Are these changes tested?

Yes

Are there any user-facing changes?

No. There are no new configs or API changes.

@xudong963 xudong963 marked this pull request as draft November 21, 2025 09:43
@github-actions github-actions bot added physical-expr Changes to the physical-expr crates core Core DataFusion crate execution Related to the execution crate datasource Changes to the datasource crate physical-plan Changes to the physical-plan crate labels Nov 21, 2025
@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules catalog Related to the catalog crate proto Related to proto crate labels Nov 25, 2025
@xudong963 xudong963 force-pushed the row_group_limit_pruning branch from 52f012f to d075c86 Compare November 25, 2025 05:26
@xudong963 xudong963 marked this pull request as ready for review November 25, 2025 05:31
/// which row groups should be accessed
access_plan: ParquetAccessPlan,
/// which row groups are fully contained within the pruning predicate
is_fully_matched: Vec<bool>,
Member Author

The field tracks the row groups and marks if they're fully matched

Contributor

I recommend also defining "fully contained" (i.e. it is known for sure that ALL rows within the RowGroup pass the filter -- or, conversely, that no rows are filtered)

Member Author

Do you mean to rename is_fully_matched as fully_contained?

@xudong963
Member Author

I'd also like to construct some well-suited cases to show benchmark results.

@alamb
Contributor

alamb commented Nov 25, 2025

I plan to review this PR carefully tomorrow

Contributor

@alamb alamb left a comment

Thank you very much for this PR @xudong963

The idea is very cool, and a nicely done PR; I have several suggestions, but I think it is very close.

Suggestions:

  1. Rename the flag to preserve_order
  2. Make this new flag on TableScan visible in EXPLAIN plans so we have a better chance of understanding when it is being applied/not applied
  3. Implement an end to end test for the behavior you want to see (specifically, a test that shows a query on a parquet file and demonstrates that only a single row group is fetched with this flag, and more than one is fetched without the flag)

In my mind, 3 is the most important

///
/// # Arguments
/// * `order_sensitive` - Whether the scan's limit should be order-sensitive
pub fn with_limit_order_sensitive(mut self, order_sensitive: bool) -> Self {
Contributor

I was pretty confused by this name for a while. After reading more of the PR, I understand now that this flag basically means that the order of the rows in the file is important for the query. When the order is not required for the query, it gives the datasource the freedom to reorder the rows if there is some way that is more efficient to apply filters / limits.

I think this is a nice general concept (it doesn't only apply to limits -- it could be used to parallelize decode within a CSV reader, for example)

What do you think about renaming this field / other references to preserve_order instead of limit_order_sensitive?

The semantics would be that the scan must preserve the order of rows (basically the same practical effect of this flag now)

Member Author

Yes, exactly, I like preserve_order

Member Author

done 24ae747

)
.map(LogicalPlan::TableScan)
.map(Transformed::yes);
let mut new_scan =
Contributor

I worry people will start to miss these fields. I wonder if we should make a TableScanBuilder or something (as a follow on PR) and deprecate TableScan::try_new to make it less likely that these fields will get forgotten

Member Author

yeah, I agree
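
For illustration only, a builder along these lines could look roughly like the sketch below. It uses a toy struct, not DataFusion's real TableScan; `ToyTableScan`, `ToyTableScanBuilder`, and all of their methods are hypothetical names. The point is that a rewrite which starts from an existing scan carries every field over by default, so a newly added field such as preserve_order is less likely to be forgotten.

```rust
/// Toy stand-in for a table scan node (hypothetical, not DataFusion's TableScan).
#[derive(Debug, Clone, PartialEq)]
struct ToyTableScan {
    table_name: String,
    projection: Option<Vec<usize>>,
    fetch: Option<usize>,
    preserve_order: bool,
}

/// Hypothetical builder: every field has a default, and rewrites start from
/// an existing scan so that untouched fields are preserved.
#[derive(Debug, Clone)]
struct ToyTableScanBuilder {
    scan: ToyTableScan,
}

impl ToyTableScanBuilder {
    /// Start a new scan with default values for all optional fields.
    fn new(table_name: impl Into<String>) -> Self {
        Self {
            scan: ToyTableScan {
                table_name: table_name.into(),
                projection: None,
                fetch: None,
                preserve_order: false,
            },
        }
    }

    /// Start from an existing scan, keeping all of its fields.
    fn from_scan(scan: ToyTableScan) -> Self {
        Self { scan }
    }

    fn with_projection(mut self, projection: Option<Vec<usize>>) -> Self {
        self.scan.projection = projection;
        self
    }

    fn with_fetch(mut self, fetch: Option<usize>) -> Self {
        self.scan.fetch = fetch;
        self
    }

    fn with_preserve_order(mut self, preserve_order: bool) -> Self {
        self.scan.preserve_order = preserve_order;
        self
    }

    fn build(self) -> ToyTableScan {
        self.scan
    }
}
```

With this shape, an optimizer rule that only wants to change the fetch would call `ToyTableScanBuilder::from_scan(scan).with_fetch(Some(10)).build()` and the preserve_order value set earlier survives the rewrite.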

})),

LogicalPlan::Sort(mut sort) => {
let marked_input =
Contributor

Since this optimizer is already doing a top-down traversal, I think we can avoid this call (which will rewrite all inputs a second time) and just set a flag on the PushDownLimit structure itself and reuse the existing tree rewrite

Member Author

Makes sense, I refactored it e65f5a3

LogicalPlan::Sort(mut sort) => {
let marked_input =
mark_fetch_order_sensitive(Arc::unwrap_or_clone(sort.input))?;
sort.input = Arc::new(marked_input);
Contributor

This is probably a conservative choice, but I think it will overeagerly mark some inputs -- for example, I think it will mark this scan, even when it doesn't matter:

SELECT COUNT(*) as cnt, x 
FROM t 
GROUP BY x 
ORDER BY cnt DESC
LIMIT 10;

I think if you used a flag approach, you could probably avoid this (by resetting the flag when you hit an aggregate, for example)

Member Author

yes, I added a test for it in the commit e65f5a3.
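
To make the flag approach concrete, here is a minimal sketch on a toy plan type. It is not DataFusion's LogicalPlan or the PushDownLimit rule; `Plan`, `mark`, `scan`, and `scan_preserve_order` are hypothetical names, and the per-node rules are assumptions that mirror this thread: a Sort marks its input as order sensitive (as in the diff above), and an Aggregate resets the flag because the order of its input rows does not affect its output.

```rust
/// Toy logical plan (hypothetical, much simpler than DataFusion's LogicalPlan).
#[derive(Debug)]
enum Plan {
    Scan { table: String, preserve_order: bool },
    Filter(Box<Plan>),
    Aggregate(Box<Plan>),
    Sort(Box<Plan>),
    Limit(usize, Box<Plan>),
}

/// Convenience constructor for a scan that does not need to preserve order.
fn scan(table: &str) -> Plan {
    Plan::Scan { table: table.to_string(), preserve_order: false }
}

/// Single top-down pass that threads an `order_matters` flag through the tree.
fn mark(plan: Plan, order_matters: bool) -> Plan {
    match plan {
        // The scan records whatever the flag says when the traversal reaches it.
        Plan::Scan { table, .. } => Plan::Scan { table, preserve_order: order_matters },
        // Filters and limits pass the flag through unchanged.
        Plan::Filter(input) => Plan::Filter(Box::new(mark(*input, order_matters))),
        Plan::Limit(n, input) => Plan::Limit(n, Box::new(mark(*input, order_matters))),
        // Assumption mirroring the diff: a Sort marks its input as order sensitive.
        Plan::Sort(input) => Plan::Sort(Box::new(mark(*input, true))),
        // An aggregate consumes all of its input, so input order no longer matters.
        Plan::Aggregate(input) => Plan::Aggregate(Box::new(mark(*input, false))),
    }
}

/// Walks down to the (single) scan and reports its flag.
fn scan_preserve_order(plan: &Plan) -> bool {
    match plan {
        Plan::Scan { preserve_order, .. } => *preserve_order,
        Plan::Filter(input) | Plan::Aggregate(input) | Plan::Sort(input) => {
            scan_preserve_order(input)
        }
        Plan::Limit(_, input) => scan_preserve_order(input),
    }
}
```

For a plan shaped like the query above, Limit over Sort over Aggregate over Scan, calling `mark(plan, false)` leaves the scan with preserve_order set to false, whereas Limit over Sort over Filter over Scan ends up with it set to true.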

for (i, &original_row_group_idx) in
fully_contained_candidates_original_idx.iter().enumerate()
{
// If the inverted predicate *also* prunes this row group (meaning inverted_values[i] is false),
Contributor

This is a very clever idea -- basically, if not(predicate) would prune the row group, it means that predicate evaluated to (known) true for all rows

I tried thinking of counter examples, but I couldn't come up with any and the reasoning is sound
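
A small sketch of that reasoning, using a toy min/max statistics model rather than DataFusion's PruningPredicate (`ColumnStats`, `Predicate`, `may_match`, and this standalone `is_fully_matched` function are hypothetical names). As an extra conservative assumption of this sketch, a row group with any nulls in the column is never reported as fully matched, since a comparison against NULL is not true.

```rust
/// Min/max statistics for one integer column of one row group (toy model).
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    min: i64,
    max: i64,
    null_count: u64,
}

/// A tiny predicate language over a single column: `col > c` or `col <= c`.
#[derive(Debug, Clone, Copy)]
enum Predicate {
    Gt(i64),
    LtEq(i64),
}

impl Predicate {
    /// Logical negation: NOT (col > c) is (col <= c), and vice versa.
    fn invert(self) -> Predicate {
        match self {
            Predicate::Gt(c) => Predicate::LtEq(c),
            Predicate::LtEq(c) => Predicate::Gt(c),
        }
    }

    /// Returns false only when the statistics prove that no row can match,
    /// i.e. when the row group could be pruned for this predicate.
    fn may_match(self, stats: &ColumnStats) -> bool {
        match self {
            Predicate::Gt(c) => stats.max > c,
            Predicate::LtEq(c) => stats.min <= c,
        }
    }
}

/// A row group is fully matched when the predicate keeps it and the
/// inverted predicate would prune it (no row can fail the predicate).
fn is_fully_matched(pred: Predicate, stats: &ColumnStats) -> bool {
    stats.null_count == 0 && pred.may_match(stats) && !pred.invert().may_match(stats)
}
```

For example, with the predicate `col > 10`, a row group with min 11 and max 20 is fully matched because `col <= 10` cannot hold for any row, while a row group with min 5 and max 20 is only partially matched.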

use crate::expressions::{lit, BinaryExpr, Literal, NotExpr};
use crate::PhysicalExpr;

/// Attempts to simplify NOT expressions
Contributor

I recommend we pull this code into its own PR for easier review

Member Author

Agree

type Node = Arc<dyn PhysicalExpr>;

fn f_up(&mut self, node: Self::Node) -> Result<Transformed<Self::Node>> {
// Apply NOT expression simplification first
Contributor

I think we should pull the NOT simplification into its own PR

Member Author

I opened a PR for it #18970

&self.is_fully_matched
}

/// Prunes the access plan based on the limit and fully contained row groups.
Contributor

I think we should explain the rationale here and leave a link to the paper -- I think you can adapt the very nice description on #18860

Basically:

  1. Why is this optimization important: (can reduce the number of RowGroups needed with a known limit)
  2. Why is it correct (the rationale on Support limit pruning #18860)

Member Author

done in fdef903
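
As a rough illustration of the limit pruning step discussed in this thread (a sketch under stated assumptions, not the code in this PR): if the row groups that survive statistics pruning are known, along with which of them are fully matched, then a limit can be satisfied from fully matched row groups alone whenever their row counts add up to at least the limit. `RowGroupInfo` and `prune_by_limit` are hypothetical names, and the sketch assumes this is only legal when the scan does not have to preserve row order.

```rust
/// One row group that survived statistics pruning (toy model).
#[derive(Debug, Clone, Copy)]
struct RowGroupInfo {
    num_rows: usize,
    /// True when every row in the row group is known to pass the filter.
    fully_matched: bool,
}

/// Returns the indices of the row groups to scan for `LIMIT limit`.
///
/// If the fully matched row groups contain at least `limit` rows, only as many
/// of them as needed are kept; otherwise the original plan is left unchanged.
fn prune_by_limit(row_groups: &[RowGroupInfo], limit: usize, preserve_order: bool) -> Vec<usize> {
    let all: Vec<usize> = (0..row_groups.len()).collect();
    if preserve_order {
        // Skipping earlier, partially matched row groups would change which
        // rows come first, so do nothing when order must be preserved.
        return all;
    }

    let mut selected = Vec::new();
    let mut rows = 0usize;
    for (idx, rg) in row_groups.iter().enumerate() {
        if !rg.fully_matched {
            continue;
        }
        selected.push(idx);
        rows = rows.saturating_add(rg.num_rows);
        if rows >= limit {
            // Every row in the selected groups passes the filter, so they are
            // guaranteed to yield at least `limit` rows.
            return selected;
        }
    }

    // Not enough guaranteed rows: fall back to scanning everything.
    all
}
```

With row groups of 100 (partial), 50 (full), 80 (full), and 30 (partial) rows and a limit of 60, only the two fully matched row groups need to be read; with a limit of 200 the plan is left as it was.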

@alamb
Contributor

alamb commented Nov 27, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing row_group_limit_pruning (24ae747) to f1ecacc diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Contributor

alamb commented Nov 27, 2025

🤖: Benchmark completed

Details

Comparing HEAD and row_group_limit_pruning
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ row_group_limit_pruning ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │  2564.84 ms │              2711.60 ms │ 1.06x slower │
│ QQuery 1     │  1177.86 ms │              1306.25 ms │ 1.11x slower │
│ QQuery 2     │  2412.87 ms │              2423.96 ms │    no change │
│ QQuery 3     │  1156.85 ms │              1145.44 ms │    no change │
│ QQuery 4     │  2318.20 ms │              2308.57 ms │    no change │
│ QQuery 5     │ 28372.67 ms │             27912.28 ms │    no change │
│ QQuery 6     │  4041.22 ms │              4170.72 ms │    no change │
│ QQuery 7     │  3812.31 ms │              3670.54 ms │    no change │
└──────────────┴─────────────┴─────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 45856.83ms │
│ Total Time (row_group_limit_pruning)   │ 45649.37ms │
│ Average Time (HEAD)                    │  5732.10ms │
│ Average Time (row_group_limit_pruning) │  5706.17ms │
│ Queries Faster                         │          0 │
│ Queries Slower                         │          2 │
│ Queries with No Change                 │          6 │
│ Queries with Failure                   │          0 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ row_group_limit_pruning ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.11 ms │                 2.20 ms │     no change │
│ QQuery 1     │    49.14 ms │                49.42 ms │     no change │
│ QQuery 2     │   134.43 ms │               137.18 ms │     no change │
│ QQuery 3     │   165.90 ms │               166.30 ms │     no change │
│ QQuery 4     │  1082.62 ms │              1093.44 ms │     no change │
│ QQuery 5     │  1509.20 ms │              1523.29 ms │     no change │
│ QQuery 6     │     2.23 ms │                 2.12 ms │     no change │
│ QQuery 7     │    53.46 ms │                53.16 ms │     no change │
│ QQuery 8     │  1464.07 ms │              1443.88 ms │     no change │
│ QQuery 9     │  1920.32 ms │              1861.46 ms │     no change │
│ QQuery 10    │   368.92 ms │               369.59 ms │     no change │
│ QQuery 11    │   425.35 ms │               415.20 ms │     no change │
│ QQuery 12    │  1360.27 ms │              1366.53 ms │     no change │
│ QQuery 13    │  2088.62 ms │              2099.38 ms │     no change │
│ QQuery 14    │  1291.80 ms │              1276.05 ms │     no change │
│ QQuery 15    │  1287.95 ms │              1249.38 ms │     no change │
│ QQuery 16    │  2687.28 ms │              2705.34 ms │     no change │
│ QQuery 17    │  2666.35 ms │              2685.82 ms │     no change │
│ QQuery 18    │  5334.10 ms │              5051.38 ms │ +1.06x faster │
│ QQuery 19    │   128.58 ms │               129.60 ms │     no change │
│ QQuery 20    │  2025.19 ms │              1974.90 ms │     no change │
│ QQuery 21    │  2326.74 ms │              2256.83 ms │     no change │
│ QQuery 22    │  3910.56 ms │              3883.88 ms │     no change │
│ QQuery 23    │ 16847.81 ms │             12664.80 ms │ +1.33x faster │
│ QQuery 24    │   218.91 ms │               218.97 ms │     no change │
│ QQuery 25    │   470.06 ms │               470.62 ms │     no change │
│ QQuery 26    │   211.18 ms │               219.75 ms │     no change │
│ QQuery 27    │  2828.80 ms │              2821.30 ms │     no change │
│ QQuery 28    │ 23333.65 ms │             24002.05 ms │     no change │
│ QQuery 29    │  1010.28 ms │               990.00 ms │     no change │
│ QQuery 30    │  1361.32 ms │              1340.64 ms │     no change │
│ QQuery 31    │  1401.38 ms │              1386.63 ms │     no change │
│ QQuery 32    │  4833.23 ms │              4590.26 ms │ +1.05x faster │
│ QQuery 33    │  5922.93 ms │              5861.58 ms │     no change │
│ QQuery 34    │  6218.31 ms │              6093.69 ms │     no change │
│ QQuery 35    │  1945.21 ms │              1892.18 ms │     no change │
│ QQuery 36    │   118.71 ms │               122.83 ms │     no change │
│ QQuery 37    │    52.92 ms │                53.12 ms │     no change │
│ QQuery 38    │   119.74 ms │               120.43 ms │     no change │
│ QQuery 39    │   195.76 ms │               198.67 ms │     no change │
│ QQuery 40    │    40.90 ms │                45.00 ms │  1.10x slower │
│ QQuery 41    │    39.19 ms │                40.69 ms │     no change │
│ QQuery 42    │    31.96 ms │                33.45 ms │     no change │
└──────────────┴─────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 99487.42ms │
│ Total Time (row_group_limit_pruning)   │ 94963.00ms │
│ Average Time (HEAD)                    │  2313.66ms │
│ Average Time (row_group_limit_pruning) │  2208.44ms │
│ Queries Faster                         │          3 │
│ Queries Slower                         │          1 │
│ Queries with No Change                 │         39 │
│ Queries with Failure                   │          0 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ row_group_limit_pruning ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 130.90 ms │               136.37 ms │    no change │
│ QQuery 2     │  28.85 ms │                29.40 ms │    no change │
│ QQuery 3     │  35.92 ms │                38.73 ms │ 1.08x slower │
│ QQuery 4     │  29.91 ms │                29.59 ms │    no change │
│ QQuery 5     │  87.85 ms │                88.68 ms │    no change │
│ QQuery 6     │  19.75 ms │                19.33 ms │    no change │
│ QQuery 7     │ 224.01 ms │               217.44 ms │    no change │
│ QQuery 8     │  33.21 ms │                32.74 ms │    no change │
│ QQuery 9     │ 103.44 ms │               100.86 ms │    no change │
│ QQuery 10    │  64.34 ms │                64.25 ms │    no change │
│ QQuery 11    │  17.90 ms │                19.12 ms │ 1.07x slower │
│ QQuery 12    │  52.62 ms │                53.57 ms │    no change │
│ QQuery 13    │  47.65 ms │                47.74 ms │    no change │
│ QQuery 14    │  13.71 ms │                13.79 ms │    no change │
│ QQuery 15    │  24.87 ms │                24.60 ms │    no change │
│ QQuery 16    │  25.17 ms │                25.16 ms │    no change │
│ QQuery 17    │ 149.35 ms │               151.78 ms │    no change │
│ QQuery 18    │ 278.08 ms │               276.50 ms │    no change │
│ QQuery 19    │  37.01 ms │                38.28 ms │    no change │
│ QQuery 20    │  49.78 ms │                48.67 ms │    no change │
│ QQuery 21    │ 317.59 ms │               310.12 ms │    no change │
│ QQuery 22    │  18.42 ms │                18.49 ms │    no change │
└──────────────┴───────────┴─────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 1790.33ms │
│ Total Time (row_group_limit_pruning)   │ 1785.22ms │
│ Average Time (HEAD)                    │   81.38ms │
│ Average Time (row_group_limit_pruning) │   81.15ms │
│ Queries Faster                         │         0 │
│ Queries Slower                         │         2 │
│ Queries with No Change                 │        20 │
│ Queries with Failure                   │         0 │
└────────────────────────────────────────┴───────────┘
