
Conversation


@hazefully hazefully commented Oct 16, 2025

This PR improves the rewriting cost model by making it consider the number of predicates in each expression at each level of the expression graph, instead of comparing graphs based on the height of the highest expression in the graph that contains any predicates.

The reasoning for this is to make the cost model prefer, among expressions with the same number of predicates, those in which some predicates have been pushed from higher levels of the QGM down to lower levels, which can lead to plans with more specific index key comparisons in the planning phase.

Example: the rewriting cost model should consider the expression:

SELECT sq1.a FROM (SELECT a, b FROM T WHERE a = 42) sq1,  (SELECT a FROM T2) sq2

to have a lower cost than:

SELECT a FROM (SELECT a, b, d FROM T) WHERE a = 42 AND EXISTS (SELECT a FROM T2)

The rewriting cost model is improved by considering two ExpressionProperty instances when comparing two expressions: the existing NormalizedResidualPredicateProperty, which calculates the total number of conjuncts in the combined query predicate across the entire expression tree at all levels, and a new PredicateCountByLevelProperty, which calculates the number of predicates at each level of the expression graph. In addition, considering PredicateComplexityProperty is no longer necessary, as what it checks for (the most complex predicate across the entire expression tree) is already checked indirectly by NormalizedResidualPredicateProperty.
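
Roughly, the resulting comparison can be pictured with the following simplified sketch (hypothetical names and types, not the actual record-layer API; the real code operates on RelationalExpression via the properties named above, and the exact per-level tie-breaking direction is a detail of the actual implementation):

import java.util.SortedMap;
import java.util.TreeMap;

// Simplified sketch: prefer the candidate whose combined, normalized query predicate has
// fewer conjuncts; only on a tie, compare how the predicates are distributed over the
// levels of the expression graph.
final class Candidate {
    final int normalizedConjunctCount;                       // NormalizedResidualPredicateProperty
    final SortedMap<Integer, Integer> predicateCountByLevel; // PredicateCountByLevelProperty (level -> count)

    Candidate(final int normalizedConjunctCount,
              final SortedMap<Integer, Integer> predicateCountByLevel) {
        this.normalizedConjunctCount = normalizedConjunctCount;
        this.predicateCountByLevel = predicateCountByLevel;
    }

    static int compare(final Candidate a, final Candidate b) {
        // Fewer conjuncts in the normalized residual predicate wins outright.
        final int byConjuncts = Integer.compare(a.normalizedConjunctCount, b.normalizedConjunctCount);
        if (byConjuncts != 0) {
            return byConjuncts;
        }
        // Tie: walk the union of levels in ascending order and compare counts level by level.
        // The level numbering and the preferred direction are assumptions made for this sketch.
        final TreeMap<Integer, Integer> levels = new TreeMap<>();
        a.predicateCountByLevel.forEach(levels::put);
        b.predicateCountByLevel.forEach(levels::put);
        for (final int level : levels.keySet()) {
            final int byCount = Integer.compare(a.predicateCountByLevel.getOrDefault(level, 0),
                    b.predicateCountByLevel.getOrDefault(level, 0));
            if (byCount != 0) {
                return byCount;
            }
        }
        return 0;
    }
}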

Performance impact

Some basic analysis of the performance impact of considering those properties can be found in #3681 (comment).

@hazefully hazefully requested a review from normen662 October 16, 2025 14:57
@hazefully hazefully added the enhancement New feature or request label Oct 16, 2025
@hazefully hazefully force-pushed the improve-rewriting-cost-model branch from 3b65482 to 65694f8 Compare October 16, 2025 15:09
@hazefully hazefully force-pushed the improve-rewriting-cost-model branch from 65694f8 to ca8d8c9 Compare December 9, 2025 23:33
Comment on lines 117 to 124
task_count: 2476
task_total_time_ms: 212
transform_count: 467
transform_time_ms: 67
transform_yield_count: 143
insert_time_ms: 21
insert_new_count: 315
insert_reused_count: 44
task_count: 1325
task_total_time_ms: 267
transform_count: 270
transform_time_ms: 153
transform_yield_count: 83
insert_time_ms: 12
insert_new_count: 165
insert_reused_count: 24
Contributor Author

I think the planning metrics for the two queries here have improved because the rewriting cost model now prefers expressions with a lower total number of predicates, so when the predicates in those two queries are simplified to remove the duplicated predicate, we end up with a smaller search space. Before this change, the two expressions (the original expression with the duplicated predicate and the simplified one) were considered equal, and the semantic hashcode tie-breaking resulted in the original expression being chosen.

Contributor

It looks almost suspicious that the time spent is higher even though all the counts are lower. Can you do a test for me? Remove all metrics files (metrics.binpb and metrics.yaml) and try to run the entire suite in correction mode. Just glance over the time spent to see if there is a trend upwards or downwards. In general this sort of thing happens; we see confusing time durations because everyone uses a different set-up, may have had their Mac throttled or not, etc. -- so a sample size of one is pretty naive to comment on, but let's still run this just to rule out the case that a rule has degraded for some reason. Also, you may want to do this for downstream.

Contributor Author

See #3681 (comment), I did multiple runs of correcting the entire test suite (including downstream) and there doesn't seem to be any consistent downtrend/uptrend in the time metrics.

@hazefully hazefully force-pushed the improve-rewriting-cost-model branch from ca8d8c9 to cd4e175 Compare December 9, 2025 23:50
@hazefully hazefully marked this pull request as ready for review December 10, 2025 00:30
@normen662 normen662 left a comment

Hi, good stuff! I left detailed (I hope) comments on your changes. In general I would say that the following should be focused on:

  • take the new properties out of the expression property map, i.e. make them untracked
  • investigate the interaction with predicate complexity (would it be better to reorder and maybe get rid of predicate height)
  • evaluate if there is a planner performance regression

 * deeper level or if {@code a} has a higher number of levels with predicates.
 */
public static int compare(final PredicateCountByLevelInfo a, final PredicateCountByLevelInfo b) {
    final int highestLevel = Integer.max(a.getHighestLevel(), b.getHighestLevel());
Contributor

Wouldn't it be more concise to use two SortedMaps (TreeMap)? You can then pop them off smallest-first on both sides. That way you can get rid of the highestLevel field, which is redundant in your data structure.

Contributor Author

Thanks for the suggestion, that is indeed better! I modified the code to do that and removed the highestLevel field from the data structure. There is a way to do this with the Java Streams API, but I found it less readable than a simple for-loop; let me know if you think otherwise or if there is a better way to implement this.

I did have to keep the getHighestLevel method on the data structure because ImmutableSortedMap throws an exception if lastKey is called on an empty map, so having the method is a neat way to wrap that check.
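
For reference, a rough sketch of the shape described above (names assumed, not the actual class):

import com.google.common.collect.ImmutableSortedMap;

// Sketch only: ImmutableSortedMap.lastKey() (inherited from SortedMap) throws
// NoSuchElementException on an empty map, so the highest level is exposed through a small
// accessor that handles the empty case instead of keeping a redundant highestLevel field.
final class PredicateCountByLevelInfoSketch {
    private final ImmutableSortedMap<Integer, Integer> predicateCountByLevel;

    PredicateCountByLevelInfoSketch(final ImmutableSortedMap<Integer, Integer> predicateCountByLevel) {
        this.predicateCountByLevel = predicateCountByLevel;
    }

    int getHighestLevel() {
        // Returning 0 for the empty map is an assumption made for this sketch.
        return predicateCountByLevel.isEmpty() ? 0 : predicateCountByLevel.lastKey();
    }
}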

ImmutableSet.of(),
ImmutableSet.of(ExpressionCountProperty.selectCount(), ExpressionCountProperty.tableFunctionCount(),
PredicateComplexityProperty.predicateComplexity()),
ImmutableSet.of(
Contributor

Ah, I think this should really have selectCount and tableFunctionCount in there. I think predicateComplexity should not be in here, but 🤷. I think the new ones you are introducing should not be here.

Contributor Author

Just a note that I had to leave predicateComplexity here because it is used in the SelectMergeRule in a code path that expects it to be tracked (

public static <E extends RelationalExpression> BiFunction<ExpressionPartition<E>, ? super E, Tuple> comparisonByPropertyList(@Nonnull ExpressionProperty<?>... expressionProperties) {
    return (partition, expression) ->
            Tuple.fromItems(Arrays.stream(expressionProperties)
                    .map(property -> partition.getNonPartitioningPropertyValue(expression, property))
                    .collect(Collectors.toList()));
).

int bPredicateHeight = predicateHeight().evaluate(b);
if (aPredicateHeight != bPredicateHeight) {
    return Integer.compare(aPredicateHeight, bPredicateHeight);
int aPredicateCount = predicateCount().evaluate(a);
Contributor

How does all of this (not picking out this particular line but this entire added block here) interact with the predicate complexity? Wouldn't predicate complexity be subsuming the predicate height property you are introducing? Could we put the predicate complexity in front of the count-by-layer property here?

Contributor Author

As discussed offline, I changed this to use the NormalizedResidualPredicateProperty as the first thing to check. This works nicely because it takes care of the predicate-count comparison (in a better way, since it naturally prefers simpler predicates), and it also makes it unnecessary to consider the PredicateComplexityProperty.

The reason we no longer need to consider the PredicateComplexityProperty is that if an expression has a query predicate with worse predicate complexity (i.e. a larger tree diameter), that complexity contributes correspondingly to the number of conjuncts in the normalized form of the combined query predicate. Otherwise a predicate with a higher tree diameter (i.e. lots of nested predicates) would have fewer conjuncts in the normal form than another predicate with a smaller tree diameter, which doesn't make sense.

I also confirmed downstream that removing the check for PredicateComplexityProperty doesn't change anything.
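
As a rough worked illustration (assuming the normal form is essentially a conjunction of simpler terms, which is an assumption made here for illustration): the nested predicate (a AND b) OR (c AND d) normalizes to (a OR c) AND (a OR d) AND (b OR c) AND (b OR d), i.e. four conjuncts, while the flat predicate a AND b contributes only two, so the additional structural complexity already shows up in the conjunct count that NormalizedResidualPredicateProperty measures.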

@Test
void compareReturnsInfoWithMoreLevelsInCaseOfEquality() {
    final PredicateCountByLevelProperty.PredicateCountByLevelInfo aInfo = new PredicateCountByLevelProperty.PredicateCountByLevelInfo(
            Map.of(1, 1, 2, 3, 3, 1), 3);
Contributor

Another weird idiosyncrasy: all maps in the record layer have to be ImmutableMap (Guava) instead of java.util.Map. The reason is that their copy-constructors avoid a copy if the source is already an ImmutableMap. java.util.Map.copyOf does that too, but only if the source is itself such an immutable map. So while it would be better to have the entire codebase on Map rather than on ImmutableMap, someone would have to change everything first. The same applies to lists and sets as well.
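
A minimal demonstration of that behavior (the exact conditions under which Guava skips the copy are an implementation detail, so treat this as illustrative):

import com.google.common.collect.ImmutableMap;

import java.util.HashMap;
import java.util.Map;

public final class CopyOfDemo {
    public static void main(final String[] args) {
        final ImmutableMap<Integer, Integer> immutable = ImmutableMap.of(1, 1, 2, 3);
        final Map<Integer, Integer> mutable = new HashMap<>(immutable);

        // copyOf() may return the very same instance when the source is already an ImmutableMap.
        System.out.println(ImmutableMap.copyOf(immutable) == immutable); // typically true
        // Copying from a plain HashMap always materializes a new ImmutableMap.
        System.out.println(ImmutableMap.copyOf(mutable) == mutable);     // false
    }
}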

Contributor Author

I replaced all usages with that, thanks for the clarification!

Using this property is better for multiple reasons:

1) It would naturally prefer simpler query predicates over complex ones,
   as the simpler predicates would result in a simpler combined
   normalized query predicate for the entire QGM.
2) It takes into consideration predicates that were simplified to be a
   tautology, making sure these predicates are preferred over
   unsimplified ones. A simpler predicate count is not able to do this.

By using the NormalizedResidualPredicateProperty, it is no longer
necessary to use the PredicateComplexityProperty in the
RewritingCostModel, as the NormalizedResidualPredicateProperty takes
care of choosing the expression that has the least maximal query
predicate across its QGM (as that would lead to fewer conjuncts in the
normalized form of the combined query predicate).
After testing against existing yaml-tests, this leads to slightly worse
performance with no gain, due to having to calculate this property for
many expressions that will end up being pruned by other properties
considered in the rewriting cost model earlier.
For consistency with the rest of the codebase.
@github-actions

📊 Metrics Diff Analysis Report

Summary

  • New queries: 1
  • Dropped queries: 0
  • Plan changed + metrics changed: 2
  • Plan unchanged + metrics changed: 2
ℹ️ About this analysis

This automated analysis compares query planner metrics between the base branch and this PR. It categorizes changes into:

  • New queries: Queries added in this PR
  • Dropped queries: Queries removed in this PR. These should be reviewed to ensure we are not losing coverage.
  • Plan changed + metrics changed: The query plan has changed along with planner metrics.
  • Metrics only changed: Same plan but different metrics

The last category in particular may indicate planner regressions that should be investigated.

New Queries

Count of new queries by file:

  • yaml-tests/src/test/resources/subquery-tests.metrics.yaml: 1

Plan and Metrics Changed

These queries experienced both plan and metrics changes. This generally indicates that there was some planner change
that means the planning for this query may be substantially different. Some amount of query plan metrics change is expected,
but the reviewer should still validate that these changes are not excessive.

Total: 2 queries

Statistical Summary (Plan and Metrics Changed)

task_count:

  • Average change: -14.0
  • Median change: -14
  • Standard deviation: 0.0
  • Range: -14 to -14
  • Queries changed: 2
  • No regressions! 🎉

insert_new_count:

  • Average change: -2.0
  • Median change: -2
  • Standard deviation: 0.0
  • Range: -2 to -2
  • Queries changed: 2
  • No regressions! 🎉

Significant Regressions (Plan and Metrics Changed)

There were 2 outliers detected. Outlier queries have a significant regression in at least one field. Statistically, this represents either an increase of more than two standard deviations above the mean or a large absolute increase (e.g., 100).

Only Metrics Changed

These queries experienced only metrics changes without any plan changes. If these metrics have substantially changed,
then a planner change has been made which affects planner performance but does not correlate with any new outcomes,
which could indicate a regression.

Total: 2 queries

Statistical Summary (Only Metrics Changed)

task_count:

  • Average change: -1088.5
  • Median change: -1026
  • Standard deviation: 62.5
  • Range: -1151 to -1026
  • Queries changed: 2
  • No regressions! 🎉

transform_count:

  • Average change: -187.0
  • Median change: -177
  • Standard deviation: 10.0
  • Range: -197 to -177
  • Queries changed: 2
  • No regressions! 🎉

transform_yield_count:

  • Average change: -50.5
  • Median change: -41
  • Standard deviation: 9.5
  • Range: -60 to -41
  • Queries changed: 2
  • No regressions! 🎉

insert_new_count:

  • Average change: -132.5
  • Median change: -115
  • Standard deviation: 17.5
  • Range: -150 to -115
  • Queries changed: 2
  • No regressions! 🎉

insert_reused_count:

  • Average change: -20.5
  • Median change: -20
  • Standard deviation: 0.5
  • Range: -21 to -20
  • Queries changed: 2
  • No regressions! 🎉

Significant Regressions (Only Metrics Changed)

There were 2 outliers detected. Outlier queries have a significant regression in at least one field. Statistically, this represents either an increase of more than two standard deviations above the mean or a large absolute increase (e.g., 100).

  • yaml-tests/src/test/resources/standard-tests.metrics.yaml:112: EXPLAIN select * from T1 where (COL1 = 20 OR COL1 = 10) AND (COL1 = 20 OR COL1 = 10)
    • explain: COVERING(I1 [EQUALS promote(@c9 AS LONG)] -> [COL1: KEY[0], ID: KEY[2]]) ⊎ COVERING(I1 [EQUALS promote(@c13 AS LONG)] -> [COL1: KEY[0], ID: KEY[2]]) | DISTINCT BY PK | FETCH
    • task_count: 2476 -> 1325 (-1151)
    • transform_count: 467 -> 270 (-197)
    • transform_yield_count: 143 -> 83 (-60)
    • insert_new_count: 315 -> 165 (-150)
    • insert_reused_count: 44 -> 24 (-20)
  • yaml-tests/src/test/resources/standard-tests.metrics.yaml:125: EXPLAIN select * from T1 where (COL1 = 20 OR COL1 = 10) AND (COL1 = 20 OR COL1 = 10) ORDER BY COL1
    • explain: ISCAN(I1 [EQUALS promote(@c9 AS LONG)]) ∪ ISCAN(I1 [EQUALS promote(@c13 AS LONG)]) COMPARE BY (_.COL1, recordType(_), _.ID)
    • task_count: 2218 -> 1192 (-1026)
    • transform_count: 418 -> 241 (-177)
    • transform_yield_count: 105 -> 64 (-41)
    • insert_new_count: 247 -> 132 (-115)
    • insert_reused_count: 42 -> 21 (-21)

@hazefully hazefully requested a review from normen662 December 18, 2025 18:29