Parquet checksum calculation horribly slow with arrow FileSystem wrapper #856

@fjetter

We calculate a checksum of the parquet files here: https://github.com/dask-contrib/dask-expr/blob/d1c4ed1da01642df6802881d62998a5a81519b85/dask_expr/io/parquet.py#L550-L567. The calculation relies on the fsspec dir_cache, which is implemented for the ordinary S3FS filesystem but not for the arrow wrapper. Other filesystem implementations may be missing it as well.

Without this cache, the file metadata has to be fetched again for every file, and calculating the checksum becomes too slow to be feasible.

The checksum calculation was introduced in #798
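
To illustrate why the directory cache matters, here is a minimal toy model. It is not the dask-expr or fsspec code: `CountingFS`, `dataset_checksum`, and the `remote_calls` counter are hypothetical names made up for this sketch, and the `ls`/`info`/`dircache` names only loosely mirror fsspec. The assumption is that each metadata lookup that misses the cache costs one remote round trip.

```python
import hashlib


class CountingFS:
    """Hypothetical stand-in for a remote filesystem (not the fsspec API)."""

    def __init__(self, files, use_dir_cache):
        self._files = files              # path -> (size, etag)
        self.use_dir_cache = use_dir_cache
        self.dircache = {}               # directory -> {path: info}
        self.remote_calls = 0            # simulated remote round trips

    def ls(self, directory):
        # A single round trip returns metadata for every file in the directory.
        self.remote_calls += 1
        listing = {
            path: {"name": path, "size": size, "ETag": etag}
            for path, (size, etag) in self._files.items()
            if path.rsplit("/", 1)[0] == directory
        }
        if self.use_dir_cache:
            self.dircache[directory] = listing
        return sorted(listing)

    def info(self, path):
        directory = path.rsplit("/", 1)[0]
        cached = self.dircache.get(directory, {})
        if path in cached:
            return cached[path]          # answered from the listing cache
        self.remote_calls += 1           # cache miss: one round trip per file
        size, etag = self._files[path]
        return {"name": path, "size": size, "ETag": etag}


def dataset_checksum(fs, directory):
    """Hash the per-file metadata of every file in the directory."""
    digest = hashlib.sha256()
    for path in fs.ls(directory):
        info = fs.info(path)
        digest.update(repr(sorted(info.items())).encode())
    return digest.hexdigest()


files = {
    f"bucket/data/part.{i}.parquet": (1024 + i, f"etag-{i}") for i in range(1000)
}
cached_fs = CountingFS(files, use_dir_cache=True)
uncached_fs = CountingFS(files, use_dir_cache=False)

cached_sum = dataset_checksum(cached_fs, "bucket/data")
uncached_sum = dataset_checksum(uncached_fs, "bucket/data")

print(cached_sum == uncached_sum)   # True: same metadata, same checksum
print(cached_fs.remote_calls)       # 1
print(uncached_fs.remote_calls)     # 1001
```

In this model both filesystems produce the same checksum, but the one without a directory cache pays one extra round trip per file, which is the kind of overhead that would make the checksum impractical on large datasets.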
