Exponential runtime complexity for `Expr.substitute`

Let's assume for the sake of simplicity that we have an expression graph where every expression depends on D other expressions, except for the root/source expressions, which do not depend on any other expression. In total, we have N expressions.

Ignoring `Fused` and literals and removing boilerplate, `substitute` currently looks like this:

```python
def substitute(self, old, new):
    new_exprs = []
    for op in self.operands:  # D operands
        if isinstance(op, Expr):
            # Recurse into the operand and keep the substituted result
            new_exprs.append(op.substitute(old, new))
        else:
            new_exprs.append(op)
    return type(self)(*new_exprs)  # O(D), assuming tokenization/names are cached
```


The runtime of this is then `T(N) = D * T(M) + O(D)`, where N is the total number of expressions in the graph and M is the number of nodes each individual operand depends on. In the very simple case of a tree-like structure, M is approximately N/D, which gives us `T(N) = D * T(N / D) + O(D)`.
Using the master theorem (a = b = D, so c_crit = log_b(a) = 1), we get `T(N) = O(N)`, which is fine.
However, if the subproblem size M is not reduced as strongly but only by some constant amount, e.g. M = N - 1, we get `T(N) = D * T(N - 1) + O(D)`, which reduces (by induction) to `T(N) = O(D^N)`, i.e. this is **exponential growth**, which is catastrophic (whether the constant is 1 or something else doesn't matter).
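
For reference, the induction step can be written out by unrolling the recurrence. This is only a sketch: the constant `c` standing in for the `O(D)` term and the base case `T(1)` are assumptions introduced here, and D ≥ 2 is assumed so that the geometric sum has a closed form.

```latex
\begin{aligned}
T(N) &= D\,T(N-1) + cD \\
     &= D^{2}\,T(N-2) + cD\,(1 + D) \\
     &\;\;\vdots \\
     &= D^{N-1}\,T(1) + cD \sum_{k=0}^{N-2} D^{k} \\
     &= D^{N-1}\,T(1) + cD\,\frac{D^{N-1} - 1}{D - 1}
      \;=\; O\!\left(D^{N}\right)
\end{aligned}
```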

This may sound artificial, but since we're dealing with generic DAGs, this condition is not impossible and not even uncommon. Whenever there is a cycle in our graph structure (in the undirected sense, i.e. an expression that can be reached from the root via more than one path), the reduction is only by a constant amount. Assuming that our cycles are often diamond-like structures (i.e. D = 2, two branches) and C is the number of cycles in an expression graph, this gives us a worst-case runtime of `O(2^C)`.
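
The diamond case can be reproduced with a toy model. The sketch below does not use dask-expr: `Node`, `build_filter_chain` and `count_substitute_calls` are hypothetical names made up for illustration, and the "filter" shape (an expression that depends on its input and on a predicate over that same input) is an assumption about how such diamonds arise.

```python
class Node:
    """Toy stand-in for an expression; not the dask-expr Expr class."""

    calls = 0  # total number of substitute() invocations

    def __init__(self, *operands):
        self.operands = operands

    def substitute(self, old, new):
        Node.calls += 1
        if self is old:
            return new
        new_operands = []
        for op in self.operands:
            if isinstance(op, Node):
                # No memoization: a shared operand is visited once per path
                new_operands.append(op.substitute(old, new))
            else:
                new_operands.append(op)
        return Node(*new_operands)


def build_filter_chain(num_filters):
    """Stack diamonds: each 'filter' uses its input and a predicate on that input."""
    expr = Node("source")
    for _ in range(num_filters):
        predicate = Node(expr)
        expr = Node(expr, predicate)
    return expr


def count_substitute_calls(num_filters):
    """Count substitute() calls for a full traversal that matches nothing."""
    Node.calls = 0
    expr = build_filter_chain(num_filters)
    expr.substitute(Node("absent"), Node("replacement"))
    return Node.calls


for c in (1, 2, 4, 8, 16):
    # The graph has only 2 * c + 1 distinct nodes
    print(c, 2 * c + 1, count_substitute_calls(c))
```

In this toy model the number of distinct nodes grows linearly with the number of diamonds, while the call count follows `3 * 2**C - 2`, i.e. it roughly doubles with every additional diamond.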

That this is not just a theoretical problem but also a practical one can be seen in Query 21 of the TPC-H benchmark suite as it is [currently implemented in the coiled/benchmarks repo](https://github.com/coiled/benchmarks/blob/6709cc6a54062dc72e6709e02bfd07b4a33fe722/tests/tpch/dask_queries.py#L1128-L1213), which takes a relatively long time to run the substitution; see also https://github.com/dask-contrib/dask-expr/pull/798#issuecomment-1921785817. (If my math checks out, adding another filter to the query would double the optimize runtime; I haven't tested this yet.)

I instrumented the code and could measure ~13.4M invocations of `Expr.substitute`. In this particular example it seems that these cycles are introduced by `Filter` expressions. We had a similar problem in the past with `Assign` expressions.
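
One way to take this kind of measurement is to wrap the method with a counter. This is a minimal sketch and not the instrumentation actually used for the number above: `count_calls`, `call_counts` and `ToyExpr` are hypothetical names, and `ToyExpr` only mimics the simplified `substitute` shown earlier.

```python
import functools
from collections import Counter

call_counts = Counter()


def count_calls(func):
    """Wrap func so that every invocation is tallied under its qualified name."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_counts[func.__qualname__] += 1
        return func(*args, **kwargs)

    return wrapper


class ToyExpr:
    """Hypothetical stand-in for an expression class."""

    def __init__(self, *operands):
        self.operands = operands

    def substitute(self, old, new):
        if self is old:
            return new
        new_operands = [
            op.substitute(old, new) if isinstance(op, ToyExpr) else op
            for op in self.operands
        ]
        return type(self)(*new_operands)


# Patch the class attribute so that recursive calls are counted as well
ToyExpr.substitute = count_calls(ToyExpr.substitute)

leaf = ToyExpr("a")
mid = ToyExpr(leaf)
top = ToyExpr(leaf, mid)  # `leaf` is reachable via two paths
top.substitute(ToyExpr("absent"), ToyExpr("replacement"))
print(call_counts["ToyExpr.substitute"])
```

Patching the class attribute rather than a single instance matters here, because the recursive calls look the method up on the class and would otherwise go uncounted.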
