Reading a list of S3 parquet files with query planning enabled is ~25x slower #1061

@b-phi

I was struggling to understand why creating a Dask DataFrame from a large list of parquet files was taking so long. Eventually I tried disabling query planning and saw normal timing again. The files are all on S3 and relatively small, ~1MB each. There is no metadata file or similar.

[Image: Screenshot 2024-05-10 at 3 21 23 PM]

Environment:

  • dask==2024.5.0
  • dask-expr==1.1.0
  • python==3.10
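
For anyone trying to reproduce this, a minimal sketch of the comparison might look like the following. The bucket name, file names, and the helpers `make_paths` and `time_graph_construction` are hypothetical placeholders, not taken from my actual setup. It assumes the `dataframe.query-planning` config option available in the Dask versions listed above, and it only times the `dd.read_parquet` call itself, with nothing computed.

```python
import time


def make_paths(n, prefix="s3://example-bucket/data"):
    """Build a list of n hypothetical S3 parquet paths (placeholder names)."""
    return [f"{prefix}/part-{i:05d}.parquet" for i in range(n)]


def time_graph_construction(paths, query_planning):
    """Time dd.read_parquet on a list of paths, without computing anything.

    Run this in a fresh interpreter for each setting: the query-planning
    option is only picked up if it is set before dask.dataframe is
    imported for the first time in the process.
    """
    import dask

    dask.config.set({"dataframe.query-planning": query_planning})
    import dask.dataframe as dd  # imported only after the option is set

    start = time.perf_counter()
    ddf = dd.read_parquet(paths)
    elapsed = time.perf_counter() - start
    return ddf, elapsed
```

Calling `time_graph_construction(make_paths(1000), True)` in one process and `time_graph_construction(make_paths(1000), False)` in another should show whether the slowdown reproduces. Setting the environment variable `DASK_DATAFRAME__QUERY_PLANNING=False` before starting Python should have the same effect as the `dask.config.set` call.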
