
Conversation


@rjzamora rjzamora commented May 19, 2023

Supersedes #84

  • Implements __len__
  • Adds a new Lengths expression to return a tuple of partition lengths.
  • Adds a _lengths property and a _partitioning method to Expr. These are essentially mechanisms to track "known" partition lengths and partitioning information.
  • Adds new logic to ReadParquet to collect and use parquet-metadata statistics to implement ReadParquet._lengths and ReadParquet._partitioning (see the usage sketch below).
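
A rough usage sketch of what this enables (the dataset path is illustrative, and dx is assumed to be the dask_expr namespace, as elsewhere in this thread):

import dask_expr as dx

# When parquet row-count statistics are available, len() can be answered from
# metadata via the new Len/Lengths expressions instead of computing the data.
df = dx.read_parquet("dataset/")
n_rows = len(df)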

@rjzamora rjzamora changed the title Simple statistics Use parquet statistics for length and partitioning information May 19, 2023
Return
------
partitioning: dict
"""
Member

I'd welcome a conversation about this.

My initial thought was that it made sense to store some baseline information like ...

  1. row counts of each partition
  2. min/max values of each column in each partition

These are similar to what comes out of parquet. Then, when we wanted to ask something, we would consult that raw data.

This feels like we're now storing derivative values off of that data. This makes me slightly nervous because it opens the door to tracking lots of state. I would be more comfortable if we were to track the underlying state (counts, mins, maxes) and then decided to compute quantities like these on the fly. That feels more tightly scoped to me.
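
A minimal sketch of what storing only the raw per-partition statistics and deriving quantities on the fly could look like (all names here are hypothetical, not part of this PR):

from dataclasses import dataclass
from typing import Any


@dataclass
class PartitionStats:
    num_rows: int         # row count of the partition
    mins: dict[str, Any]  # per-column minimum values
    maxes: dict[str, Any] # per-column maximum values


def total_length(stats: list[PartitionStats]) -> int:
    # Derived on the fly from the raw row counts.
    return sum(s.num_rows for s in stats)


def sorted_on(stats: list[PartitionStats], column: str) -> bool:
    # Derived on the fly from the raw min/max values: partitions must not overlap.
    bounds = [(s.mins[column], s.maxes[column]) for s in stats]
    return all(lo <= hi for lo, hi in bounds) and all(
        bounds[i][1] <= bounds[i + 1][0] for i in range(len(bounds) - 1)
    )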

Thoughts?

Member Author

This PR does a few different things, and some of those "things" I am much more confident in than others.

  1. We add an optional _lengths attribute to Expr to store "known" partition lengths.
  2. We add an optional _partitioning method to Expr so that an expression can check if the underlying collection is partitioned by a specific set of columns (even if those columns do not correspond to an index with known divisions).
  3. We add logic to some Expr classes (mostly ReadParquet) to "lazily" collect the necessary statistics when _lengths or _partitioning information is requested.

The primary reason this PR is still marked as "draft" is that the current iteration will always attempt to go back and collect statistics in ReadParquet when _lengths or _partitioning are called (and the necessary statistics are missing). While it will always make sense to collect partition-length statistics in support of something like len(df), it may not always be the best idea to collect statistics. In fact, I'm already a bit uncomfortable with the fact that column-projection and predicate-pushdown optimizations currently require us to repeat the initial dataset processing, which can be slow on some systems (this is something I'd like to address separately).

Note that I also think the specific API can be improved, but the "eagerness" of the lazy-metadata collection feels like the most challenging short-term blocker.

What you seem to be uncomfortable with is the fact that we are not adding something like Expr._mins and Expr._maxes, but are instead exposing a method to provide more general (derivative) information about how the collection is partitioned. I'm very open to other approaches. My current proposal here was just the natural result of attempting to store mins/maxes, and finding that my personal attempt at doing so was not particularly clean or useful. In most cases, the original ReadParquet expression will not collect useful min/max statistics. When the expression does collect min/max statistics, the only reason we care about them is to tell us how/if the collection is partitioned. For this reason, I found it most natural to allow specific classes (like ReadParquet and Shuffle) to worry about what kinds of statistics they want to collect/track (if any).

I'll think a bit more about this.

Member

I'm already a bit uncomfortable with the fact that column-projection and predicate-pushdown optimizations currently require us to repeat the initial dataset processing, which can be slow on some systems (this is something I'd like to address separately).

I noticed this recently. I wonder if the parquet code could benefit from a module-level lru-cache
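
For illustration, a module-level cache along those lines might look like this (the function below is a hypothetical stand-in for the expensive dataset-processing step, not the reader's real API):

from functools import lru_cache


@lru_cache(maxsize=16)
def dataset_info(path: str, columns: tuple = ()) -> dict:
    # Arguments must be hashable (tuples, not lists) for lru_cache to apply.
    print(f"processing {path} ...")  # only runs on a cache miss
    return {"path": path, "columns": columns}


dataset_info("data/", ("x", "y"))  # cache miss: does the expensive work
dataset_info("data/", ("x", "y"))  # cache hit: repeated optimization passes stay cheap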

Member

What you seem to be uncomfortable with is the fact that we are not adding something like Expr._mins and Expr._maxes, but are instead exposing a method to provide more general (derivative) information about how the collection is partitioned

Yeah, I'm comparing this to the database world, where you have a reference table which is the single point of truth (SPOT) and then views on that table. This feels like we're storing the views as concrete tables. Bad things tend to result from that behavior.

As an example, I could imagine future applications aside from sortedness. We've mentioned a couple of these, including filtering / partition pruning and optimizations that are based on the values. I think that storing the underlying data is more future-proof.

I probably wouldn't have separate protocols for _maxes and _mins, but maybe a single protocol that includes _min_maxes or all column-based statistics, provided those statistics are likely to be consistent across the systems that expose this information (it might make sense to look at what Snowflake, Parquet, and Delta all provide, for example).
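
For illustration, such a protocol might look roughly like this (the name and the statistics payload are assumptions; nothing this concrete was proposed here):

from typing import Any, Protocol


class SupportsColumnStatistics(Protocol):
    def _column_statistics(self) -> dict[str, dict[str, list[Any]]]:
        """Return already-known per-partition statistics for each column,
        e.g. {"x": {"min": [...], "max": [...], "null_count": [...]}}."""
        ...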

Member Author

After thinking about this a bit more, I'm planning to split this work into two distinct proposals: (1) Tracking and using partition-length statistics, and (2) tracking and using min/max statistics.

I'm expecting that we will be able to agree on a design for (1) a lot faster than (2).

I also expect (1) to be a bit more valuable than (2) in the short term. In my experience, it can be useful to know column mins/maxes immediately after IO. However, it would be much more valuable to have a _partitioning-like method/utility to tell us if a collection is partitioned by a given set of columns. I'd expect such a method to consult min/max statistics (if known), but the more-common case would be that the collection was recently shuffled/joined/grouped on the columns in question.

To summarize: I think storing/using length-based statistics is useful and easier to agree on in the short term, so I will probably focus on that first. I don't personally care much about min/max statistics unless they are in support of a _partitioning-like method. So, I'll probably hold off on that work until there is some consensus on what that API should look/behave like.
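
A rough sketch of how such a _partitioning-style check might behave (the attribute names and statistics layout below are purely illustrative):

def partitioned_by(expr, columns: tuple) -> bool:
    # Common case: the collection was recently shuffled/joined/grouped on these columns.
    if getattr(expr, "_shuffled_on", None) == columns:
        return True
    # Fallback: consult min/max statistics only if they are already known;
    # never trigger new IO from inside an optimization pass.
    stats = getattr(expr, "_known_minmax", None)  # e.g. {"x": [(lo0, hi0), (lo1, hi1), ...]}
    if stats is None or any(c not in stats for c in columns):
        return False
    # Partitioned-by holds when per-partition ranges are non-overlapping for each column.
    return all(
        all(stats[c][i][1] <= stats[c][i + 1][0] for i in range(len(stats[c]) - 1))
        for c in columns
    )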

@rjzamora rjzamora changed the title Use parquet statistics for length and partitioning information Implement __len__ and leverage parquet statistics May 23, 2023
@rjzamora rjzamora marked this pull request as ready for review May 23, 2023 18:34
@rjzamora
Member Author

@mrocklin - Let me know if you are concerned by any of the changes remaining in this PR. I agree with your off-line suggestion that we could also introduce another method like Expr._maxima to track min/max statistics. However, I'd like to tackle that in a follow-up.

@mrocklin mrocklin left a comment

OK, I'm sorry to have this drag on for so long, but please bear with me here.

What if we don't create a new protocol on expressions here, but instead make a new expression, Lengths:

class Lengths(Expr):
    _parameters = ["frame"]

    @property
    def _meta(self):
        return []

    def _simplify_down(self):
        if isinstance(self.frame, Elemwise):
            return Lengths(self.frame.operands[0])

        if isinstance(self.frame, ReadParquet):
            return Literal(self.frame._get_lengths())

    def _layer(self):
        # A single output task whose value is a list of (len, partition)
        # subtasks; dask evaluates these to the per-partition lengths.
        return {
            (self._name, 0): [
                (len, (self.frame._name, i))
                for i in range(self.frame.npartitions)
            ]
        }

Does this give you what you want? If so, I like it because we use an existing extension mechanism (Operations/Expr subclasses) rather than make something new. I'm also ok adding new methods, but this seems like it might work and invent one fewer thing.

Comment on lines 349 to 353
_lengths = self.frame._lengths(force=True)
if _lengths:
    return Literal(sum(_lengths))
elif isinstance(self.frame, Elemwise):
    child = max(self.frame.dependencies(), key=lambda expr: expr.npartitions)
Member

It looks like we're doing a forced length computation before pushing through Elemwise. This seems unwise to me. I would probably do the cheap/free thing first of pushing through Elemwise operations before doing anything else.

Member

In general the force keyword opens up some questions I think. When do we use it, when don't we, when do we expect users to do this explicitly?

@rjzamora
Member Author

Does this give you what you want?

I'm not completely sure. I was pushing on a similar design earlier on, but moved away from it when I started considering how I would need min/max statistics to be accessed/used.

To understand the limitations of an Expr-based approach, consider the case that the user calls set_index, and you want to be able to check if the collection is already sorted by the selected column. If we rely on something like MinMax(Expr), (I think) we would also need some other mechanism to tell us if computing this expression would require "real" data to be read into memory. That is, I would only want to use MinMax to optimize a set_index operation if I knew it would never require any real IO/computation.

Perhaps one reasonable approach would be to call these expressions something like MinMaxStats and LengthStats, and explicitly prohibit them from reading in "real" data. This way, optimization logic can use/compute a MinMaxStats expression to query for known statistics without risking any real/expensive IO. We could also introduce the Lengths expression you have suggested, but allow ReadParquet to replace Lengths with LengthStats in _simplify_up.

Overall, I think I'm saying that I like the idea of leveraging the Expr foundation to query/leverage statistics. However, I think it is important that we clearly distinguish "pure" statistics-scanning expressions in some way. What do you think?
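
For concreteness, a minimal sketch of the "pure" statistics-expression idea, built on the Expr and Literal classes already used in this repo (the class name and the cached attribute are hypothetical):

class LengthStats(Expr):
    """Simplify to a Literal of partition lengths only if they are already known."""

    _parameters = ["frame"]

    def _simplify_down(self):
        # Consult metadata that has already been collected; never trigger IO
        # or real computation from inside an optimization pass.
        known = getattr(self.frame, "_cached_partition_lengths", None)
        if known is not None:
            return Literal(tuple(known))
        # Otherwise stay un-simplified, signalling that the statistics are unknown.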

@mrocklin
Member

Yeah, so I think that this gets at my other question of "exactly what are the semantics around the force= keyword?" I agree generally with your framing.

Using the Expr approach in an operation like set_index we would probably do something like the following:

def set_index(self, column):
    minmaxes = self.minmax(column).optimize()
    if isinstance(minmaxes, Literal):  # 🎉
        ...  # do something easy
    if hasattr(minmaxes.frame, "minmaxes"):
        ...  # consider calling this method
    ...

If we rely on something like MinMax(Expr), (I think) we would also need some other mechanism to tell us if computing this expression would require "real" data to be read into memory. That is, I would only want to use MinMax to optimize a set_index operation if I knew it would never require any real IO/computation.

Agreed. I think that that information is present in the optimized expression tree.
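
For example, a caller could inspect the optimized expression for remaining IO before deciding whether to compute it. A rough sketch, assuming a tree-traversal helper like Expr.walk() and the df.minmax() method from the sketch above (both hypothetical here):

def requires_io(expr) -> bool:
    # True if any IO expression (e.g. ReadParquet) survives optimization;
    # walk() is assumed to yield every node in the expression tree.
    return any(isinstance(e, ReadParquet) for e in expr.walk())


minmaxes = df.minmax("x").optimize()
if isinstance(minmaxes, Literal) or not requires_io(minmaxes):
    ...  # safe: computing the statistics will not read real data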

Comment on lines 348 to 354
def _simplify_down(self):
    _lengths = Lengths(self.frame).optimize()
    if isinstance(_lengths, Literal):
        return Literal(sum(_lengths.value))
    elif isinstance(self.frame, Elemwise):
        child = max(self.frame.dependencies(), key=lambda expr: expr.npartitions)
        return Len(child)
Member Author

I'm not sure if there is a better way to design the interaction between Len and Lengths. It seemed reasonable to me to switch from Len to Lengths when we know that the Lengths expression can be optimized down to a Literal expression.

Member

It may be that there are cases where we can move Len but not Lengths. For example:

df = dx.read_csv(...)
df = df.set_index("...")
len(df)

I'm not going to be able to pass Lengths through the set_index call, but I can pass Len through (the number of rows is the same after the full shuffle). Because of this, I think that they probably need to remain separate operations with separate optimization paths. I don't think that we can define one in terms of the other.

Also, I suspect that Len is likely to be more common than Lengths, so I'm disinclined to have it inherit any weaknesses from the other.
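
A hypothetical rule along those lines (the SetIndex class name and its frame operand are assumptions, not code from this PR):

# Sketch of a branch inside Len._simplify_down:
def _simplify_down(self):
    if isinstance(self.frame, SetIndex):
        # A full shuffle changes partition boundaries but not the total row
        # count, so Len can be pushed straight through; Lengths cannot.
        return Len(self.frame.frame)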

Member

I think it would be better to keep the Len(Elemwise) branch of this on top. Otherwise we're constructing a Lengths object and optimizing it every pass through optimizations (and there are likely to be several of these).

In principle, I wouldn't mind removing Lengths from this entirely, and instead have a ReadParquet._simplify_up(Len) case which is similar to but simpler than the Lengths case.
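
A sketch of that alternative, reusing the _get_lengths helper from the Lengths proposal above (treat the exact method placement and names as assumptions):

# Sketch of a case inside ReadParquet._simplify_up:
def _simplify_up(self, parent):
    if isinstance(parent, Len):
        lengths = self._get_lengths()     # row counts from parquet metadata, if available
        if lengths is not None:
            return Literal(sum(lengths))  # answer len() directly, no Lengths expression needed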


def _simplify_down(self):
    if isinstance(self.frame, Elemwise):
        return Lengths(self.frame.operands[0])
Member

I was simplifying things before. You'll have to watch out for situations like x.sum() + x. I think that if you look at the implementation for Len we handle this well.


@mrocklin mrocklin left a comment

Minor comment. Generally I'm good to merge.

Does this satisfy your needs in projects like Merlin?


@rjzamora
Member Author

Does this satisfy your needs in projects like Merlin?

Yes, this optimized len approach should satisfy the needs of Merlin (and other users who want to stream partitions into PyTorch/TensorFlow-based data loaders).

@rjzamora rjzamora merged commit 75b8eb2 into dask:main May 31, 2023
@rjzamora rjzamora deleted the simple-statistics branch May 31, 2023 14:44