Exponential runtime complexity for Expr.substitute #835

@fjetter

Let's assume, for the sake of simplicity, that we have an expression graph where every expression depends on D other expressions, except for the root/source expressions, which do not depend on any other expression. In total, we have N expressions.

Simplifying by ignoring Fused, literals and removing boilerplate, substitute currently looks like

def substitute(self, old, new):
    new_exprs = []
    for op in self.operands:  # D operands
        if isinstance(op, Expr):
            # recurse into every operand; results are not memoized
            new_exprs.append(op.substitute(old, new))
        else:
            new_exprs.append(op)
    return type(self)(*new_exprs)  # with caching of tokenization/names O(D)

The runtime of this is then T(N) = D * T(M) + O(D), where N is the total number of expressions in the graph and M is the number of nodes each individual operand depends on. In the very simple case of a tree-like structure, M is approximately N/D, which gives us T(N) = D * T(N / D) + O(D).
Using the master theorem for recurrences (a = b = D; c_crit = log_b(a) = 1), we get T(N) = O(N), which is fine.
However, if the subproblem size M does not shrink by a factor but only by a constant amount, e.g. M = N - 1, we get T(N) = D * T(N - 1) + O(D), which (by induction) reduces to T(N) = O(D^N). This is exponential growth, which is catastrophic (whether the constant is 1 or something else doesn't matter).
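For reference, the induction step can be spelled out by unrolling the recurrence. This is a sketch under the assumptions that the per-node overhead is exactly c·D for some constant c (the symbol c is introduced here and is not in the original), that D ≥ 2, and that T(1) is the constant cost of a source expression:

```latex
\begin{aligned}
T(N) &= D\,T(N-1) + cD \\
     &= D^{2}\,T(N-2) + cD^{2} + cD \\
     &\;\;\vdots \\
     &= D^{N-1}\,T(1) + c\sum_{k=1}^{N-1} D^{k} \\
     &= D^{N-1}\,T(1) + cD\,\frac{D^{N-1}-1}{D-1} \\
     &= O\!\left(D^{N}\right)
\end{aligned}
```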

This may sound artificial, but since we're dealing with generic DAGs, this condition is not impossible and not even uncommon. Whenever there is a cycle in our graph structure (a cycle in the undirected sense, i.e. several operands of an expression share a common dependency, so that the same node is reachable via more than one path), the subproblem size shrinks only by a constant amount. Assuming that our cycles are often diamond-like structures (i.e. D=2 / two branches) and that C is the number of cycles in an expression graph, this gives us a worst-case runtime of O(2^C).
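The diamond argument can be checked with a small, self-contained toy model. The `Expr` class below is a hypothetical stand-in, not the real dask-expr class: it only stores operands, counts calls in a class attribute, and uses an identity check (`self is old`) to decide when to substitute, which is an assumption made for illustration. `diamond_chain` and `count_calls` are likewise illustrative helpers.

```python
class Expr:
    """Toy expression node (hypothetical, not the real dask-expr Expr)."""

    calls = 0  # counts every invocation of substitute

    def __init__(self, *operands):
        self.operands = operands

    def substitute(self, old, new):
        Expr.calls += 1
        if self is old:  # assumption: match by identity
            return new
        new_exprs = []
        for op in self.operands:
            if isinstance(op, Expr):
                # no memoization: shared operands are visited once per path
                new_exprs.append(op.substitute(old, new))
            else:
                new_exprs.append(op)
        return type(self)(*new_exprs)


def diamond_chain(root, n_diamonds):
    """Stack n_diamonds diamonds on top of root (D=2 branches each)."""
    top = root
    for _ in range(n_diamonds):
        left, right = Expr(top, "left"), Expr(top, "right")
        top = Expr(left, right)
    return top


def count_calls(n_diamonds):
    """Number of substitute calls needed to replace the root."""
    root = Expr("source")
    top = diamond_chain(root, n_diamonds)
    Expr.calls = 0
    top.substitute(root, Expr("replacement"))
    return Expr.calls


if __name__ == "__main__":
    for c in (1, 5, 10, 15):
        print(c, count_calls(c))
```

In this model each additional diamond roughly doubles the number of calls (the count follows 2^(C+2) - 3), even though the graph itself only grows by three nodes per diamond.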

That this is not just a theoretical problem but also a practical one can be seen in Query 21 of the TPCH benchmark suite as it is currently implemented in the coiled/benchmarks repo, where the substitution takes a relatively long time to run; see also #798 (comment). (If my math checks out, adding another filter to the query would double the optimize runtime. I haven't tested this yet.)

I instrumented the code and could measure ~13.4M invocations of Expr.substitute. In this particular example it seems that these cycles are introduced by Filter expressions. We had a similar problem in the past with Assign expressions.
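One possible way to avoid revisiting shared operands is to memoize the result per node during a single substitution pass. The following is only a sketch of that idea on a toy model, not a description of how dask-expr does or should implement it: the `Expr` class, the free function `substitute_memo`, the `memo` and `stats` parameters, and the identity-based matching are all hypothetical.

```python
class Expr:
    """Toy expression node (hypothetical, not the real dask-expr Expr)."""

    def __init__(self, *operands):
        self.operands = operands


def substitute_memo(expr, old, new, memo=None, stats=None):
    """Replace `old` with `new` below `expr`, visiting each node once.

    memo maps id(node) -> already substituted node. Using id() as a key
    is safe here because the caller keeps the original graph alive for
    the duration of the call.
    """
    if memo is None:
        memo = {}
    key = id(expr)
    if key in memo:
        return memo[key]  # shared operand: reuse the earlier result
    if stats is not None:
        stats["visited"] += 1
    if expr is old:  # assumption: match by identity
        result = new
    else:
        result = type(expr)(*[
            substitute_memo(op, old, new, memo, stats)
            if isinstance(op, Expr) else op
            for op in expr.operands
        ])
    memo[key] = result
    return result


def diamond_chain(root, n_diamonds):
    """Stack n_diamonds diamonds on top of root (D=2 branches each)."""
    top = root
    for _ in range(n_diamonds):
        left, right = Expr(top, "left"), Expr(top, "right")
        top = Expr(left, right)
    return top


def visited_nodes(n_diamonds):
    """Number of nodes actually processed when replacing the root."""
    root = Expr("source")
    top = diamond_chain(root, n_diamonds)
    stats = {"visited": 0}
    substitute_memo(top, root, Expr("replacement"), stats=stats)
    return stats["visited"]
```

With the memo in place, the work in this toy model is proportional to the number of distinct nodes (3 per diamond plus the root) rather than to the number of paths, and shared operands stay shared in the rewritten graph.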
