Parquet Dataset cache not reliable #800

@fjetter

Description

The parquet dataset cache (see https://github.com/dask-contrib/dask-expr/blob/13af21d1be9e0a3393f4971a0d95382188f6f248/dask_expr/io/parquet.py#L54-L57)
is currently keyed by a token that is deterministic in the user-supplied input arguments but insensitive to any external state. In particular, this means there is no user-accessible way to invalidate the cache: any mutation of the dataset, whether to its schema or metadata or simply an append, goes unnoticed.
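For illustration, here is a minimal sketch of the problematic pattern, assuming a process-local dict keyed with dask's `tokenize`; the names `_dataset_info_cache`, `_scan_dataset` and `_get_dataset_info` are hypothetical stand-ins, not the actual dask-expr internals:

```python
import os

from dask.base import tokenize

# Hypothetical stand-in for the module-level cache: process-local and keyed
# purely by the user-supplied input arguments.
_dataset_info_cache = {}


def _scan_dataset(path, **kwargs):
    # Stand-in for the expensive part: listing files and reading parquet metadata.
    return sorted(os.listdir(path))


def _get_dataset_info(path, **kwargs):
    key = tokenize(path, kwargs)  # deterministic in the inputs, blind to on-disk state
    if key not in _dataset_info_cache:
        _dataset_info_cache[key] = _scan_dataset(path, **kwargs)
    return _dataset_info_cache[key]  # stale if files were added, removed or rewritten since
```

Once a key is populated, every later call with the same arguments returns the first result, regardless of what has happened on disk since.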

This problem is amplified by the fact that we additionally use a cached property for everything derived from this information, which makes it impossible to reuse an existing instance correctly once the dataset has been mutated (this can cause errors, but also data loss).
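Roughly, and reusing the hypothetical cache from the sketch above, the instance-level layer looks like this:

```python
from functools import cached_property


class ReadParquet:  # simplified stand-in for the dask-expr IO expression
    def __init__(self, path):
        self.path = path

    @cached_property
    def _dataset_info(self):
        # Resolved once per instance and never refreshed, so even clearing the
        # global cache does not help an instance that has already read it.
        return _get_dataset_info(self.path)

    @cached_property
    def _meta(self):
        # Everything derived from the dataset info inherits its staleness.
        return {"columns": list(self._dataset_info)}
```

An instance that has already resolved `_dataset_info` keeps its stale copy for the rest of its lifetime, no matter what happens to the global cache.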

Both of these caching layers would need to be removed to allow Expr objects to become singletons as proposed in #798.

This scenario is actually exercised in our unit tests, see test_to_parquet. That test writes a dataset, reads it back, and overwrites it again. The only reason the test currently passes is that the cache is invalidated when the overwrite keyword is used, see here. However, this kind of cache invalidation will not work reliably in any multi-interpreter environment, let alone in a complex system where we are merely ingesting a dataset.
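A sketch of why the overwrite-based invalidation only works by accident (again using the hypothetical cache from above): clearing the cache is local to the interpreter that performs the write.

```python
def to_parquet(df, path, overwrite=False, **kwargs):
    # Invalidation tied to the overwrite keyword: it only clears the cache in
    # this interpreter.
    if overwrite:
        _dataset_info_cache.clear()
    ...  # write the files


# Any other interpreter (a worker, a second client, an ingestion pipeline)
# holds its own _dataset_info_cache and keeps serving the pre-overwrite entry.
```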

The global cache should use a modified-at timestamp or a similar external signal that allows invalidation. If something like that is not possible, we need at the very least a mechanism that allows us to ignore or invalidate the cache.
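One possible shape for this, as a rough sketch building on the hypothetical cache above (`_cache_key` and `invalidate_dataset_cache` are made-up names):

```python
import os

from dask.base import tokenize


def _cache_key(path, **kwargs):
    # Fold an external signal into the token: a local filesystem could use the
    # directory's modification time, an object store an ETag or snapshot id.
    # Directory mtime is only a heuristic: it changes when files are added or
    # removed, but not necessarily when an existing file is rewritten in place.
    mtime = os.path.getmtime(path)
    return tokenize(path, kwargs, mtime)


def invalidate_dataset_cache():
    # Explicit escape hatch for backends where no reliable external signal exists.
    _dataset_info_cache.clear()
```

Keying on an external signal turns a mutation into a natural cache miss; the explicit invalidation function covers backends where no such signal is available.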
