The parquet dataset cache (see https://github.com/dask-contrib/dask-expr/blob/13af21d1be9e0a3393f4971a0d95382188f6f248/dask_expr/io/parquet.py#L54-L57) is currently keyed by a token that is deterministic given the user's input arguments but is not sensitive to any external state. In particular, this means there is no user-accessible way to invalidate this cache: any mutation of the dataset, whether a schema change, a metadata change, or simply an append, goes unnoticed.
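A minimal sketch of the failure mode, assuming the cache key is derived via `tokenize()` from the call arguments alone (the specific arguments shown are illustrative):

```python
from dask.base import tokenize

# If the key depends only on the user-supplied arguments, two reads of the
# same path map to the same cache entry even if the files on disk changed
# in between.
key_before = tokenize("s3://bucket/dataset", columns=["a", "b"])

# ... an append or schema change happens on disk here ...

key_after = tokenize("s3://bucket/dataset", columns=["a", "b"])
assert key_before == key_after  # same key -> cache hit, mutation goes unnoticed
```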
This problem is amplified by the fact that everything derived from this cached dataset information is additionally stored in a cached property, making it impossible to reuse an existing instance correctly once the dataset is mutated (this can cause errors but also data loss).
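A rough illustration of that second caching layer, with hypothetical names standing in for the real expression class and its attributes:

```python
import os
from functools import cached_property


class ReadParquet:  # hypothetical stand-in for the real expression class
    def __init__(self, path):
        self.path = path

    @cached_property
    def _dataset_info(self):
        # Expensive metadata scan, reduced here to a file listing. With
        # cached_property this is computed once per instance and never again.
        return sorted(os.listdir(self.path))

    @cached_property
    def _meta(self):
        # Derived from _dataset_info, so it is frozen along with it. Files
        # appended to `path` after the first access are never reflected here.
        return len(self._dataset_info)
```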
Both of these caching layers would need to be removed to enable Expr objects to become singletons as proposed in #798.
This scenario is actually exercised in our unit tests, see test_to_parquet: the test writes a dataset, reads it back, and then overwrites it. The only reason this test currently passes is that the cache is invalidated when the overwrite keyword is used, see here. However, this kind of cache invalidation will not work reliably in any multi-interpreter environment, let alone in a more complex system where we are merely ingesting a dataset written elsewhere.
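A reproduction sketch of the unreliable case, assuming a local path and dask's default part naming; the pyarrow write stands in for any external writer that does not go through `to_parquet(..., overwrite=True)`:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import dask.dataframe as dd

path = "/tmp/dataset"

# First version of the dataset, written and read through dask.
dd.from_pandas(pd.DataFrame({"x": [1, 2, 3]}), npartitions=1).to_parquet(path)
before = dd.read_parquet(path)

# The dataset is mutated by a different writer (here: a new column, i.e. a
# schema change). This never goes through to_parquet(..., overwrite=True),
# so the cached dataset information is not invalidated.
pq.write_table(
    pa.table({"x": [4, 5, 6], "y": ["a", "b", "c"]}),
    f"{path}/part.0.parquet",
)

# A fresh read_parquet in the same interpreter can hit the stale cache entry
# and plan against the old schema and file listing.
after = dd.read_parquet(path)
```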
The global cache should incorporate a modified-at timestamp or a similar fingerprint that allows invalidation. If that is not possible, we need at the very least a mechanism that lets us ignore or invalidate the cache explicitly.
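One possible shape for such a key, sketched for a local filesystem (the helper name `_dataset_cache_key` is hypothetical; remote filesystems would need something like fsspec's `modified()` or `checksum()` instead of `os.scandir`):

```python
import os
from dask.base import tokenize


def _dataset_cache_key(path, columns, filters):
    # Sketch: fold a "last modified" fingerprint of the dataset into the
    # cache key, so any mutation on disk yields a new key and the stale
    # entry is simply never hit again.
    mtimes = tuple(
        sorted((entry.name, entry.stat().st_mtime) for entry in os.scandir(path))
    )
    return tokenize(path, columns, filters, mtimes)
```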