Parquet Dataset cache not reliable #800

@fjetter

Description

The parquet dataset cache (see https://github.com/dask-contrib/dask-expr/blob/13af21d1be9e0a3393f4971a0d95382188f6f248/dask_expr/io/parquet.py#L54-L57)
is currently keyed by a token that is deterministic in the user-supplied input arguments but insensitive to any external state. In particular, this means there is no user-accessible way to invalidate the cache: any mutation of the dataset, whether to its schema or metadata or simply an append, goes unnoticed.
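For illustration, here is a minimal sketch of the problematic pattern, assuming a process-local dict keyed with dask's `tokenize`; the names `_dataset_info_cache`, `_scan_dataset` and `_get_dataset_info` are hypothetical stand-ins, not the actual dask-expr internals:

```python
import os

from dask.base import tokenize

# Hypothetical stand-in for the module-level cache: process-local and keyed
# purely by the user-supplied input arguments.
_dataset_info_cache = {}


def _scan_dataset(path, **kwargs):
    # Stand-in for the expensive part: listing files and reading parquet metadata.
    return sorted(os.listdir(path))


def _get_dataset_info(path, **kwargs):
    key = tokenize(path, kwargs)  # deterministic in the inputs, blind to on-disk state
    if key not in _dataset_info_cache:
        _dataset_info_cache[key] = _scan_dataset(path, **kwargs)
    return _dataset_info_cache[key]  # stale if files were added, removed or rewritten since
```

Once a key is populated, every later call with the same arguments returns the first result, regardless of what has happened on disk since.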

This problem is amplified by the fact that we additionally use a cached property for everything derived from this information, which makes it impossible to reuse an existing instance correctly once the dataset has been mutated (this can cause errors, but also data loss).
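Roughly, and reusing the hypothetical cache from the sketch above, the instance-level layer looks like this:

```python
from functools import cached_property


class ReadParquet:  # simplified stand-in for the dask-expr IO expression
    def __init__(self, path):
        self.path = path

    @cached_property
    def _dataset_info(self):
        # Resolved once per instance and never refreshed, so even clearing the
        # global cache does not help an instance that has already read it.
        return _get_dataset_info(self.path)

    @cached_property
    def _meta(self):
        # Everything derived from the dataset info inherits its staleness.
        return {"columns": list(self._dataset_info)}
```

An instance that has already resolved `_dataset_info` keeps its stale copy for the rest of its lifetime, no matter what happens to the global cache.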

Both of these caching layers would need to be removed to allow Expr objects to become singletons as proposed in #798.

This scenario is actually exercised in our unit tests, see test_to_parquet. That test writes a dataset, reads it back, and overwrites it again. The only reason the test currently passes is that the cache is invalidated when the overwrite keyword is used, see here. However, this kind of cache invalidation will not work reliably in any multi-interpreter environment, let alone in a complex system where we are merely ingesting a dataset.
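A sketch of why the overwrite-based invalidation only works by accident (again using the hypothetical cache from above): clearing the cache is local to the interpreter that performs the write.

```python
def to_parquet(df, path, overwrite=False, **kwargs):
    # Invalidation tied to the overwrite keyword: it only clears the cache in
    # this interpreter.
    if overwrite:
        _dataset_info_cache.clear()
    ...  # write the files


# Any other interpreter (a worker, a second client, an ingestion pipeline)
# holds its own _dataset_info_cache and keeps serving the pre-overwrite entry.
```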

The global cache should use a modified-at timestamp or a similar external signal that allows invalidation. If something like that is not possible, we need at the very least a mechanism that allows us to ignore or invalidate the cache.
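One possible shape for this, as a rough sketch building on the hypothetical cache above (`_cache_key` and `invalidate_dataset_cache` are made-up names):

```python
import os

from dask.base import tokenize


def _cache_key(path, **kwargs):
    # Fold an external signal into the token: a local filesystem could use the
    # directory's modification time, an object store an ETag or snapshot id.
    # Directory mtime is only a heuristic: it changes when files are added or
    # removed, but not necessarily when an existing file is rewritten in place.
    mtime = os.path.getmtime(path)
    return tokenize(path, kwargs, mtime)


def invalidate_dataset_cache():
    # Explicit escape hatch for backends where no reliable external signal exists.
    _dataset_info_cache.clear()
```

Keying on an external signal turns a mutation into a natural cache miss; the explicit invalidation function covers backends where no such signal is available.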
