Open
Description
Describe the issue:
When I run read_parquet on several files, I get a DataFrame without known divisions. The issue is: if I then increase npartitions with .repartition(), the partitions of the resulting DataFrame are very slow to compute.
For example, calling head() starts reading from many files instead of one. This is not the case when using dask without dask-expr.
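To make the expectation concrete, here is a simplified model of what "increase npartitions without touching other files" could look like. It is only a sketch, not the actual implementation in dask or dask-expr: `split_plan` is a hypothetical helper, and the rule it uses (each input partition is cut into consecutive pieces, with any remainder going to the last input partition) is an assumption made for illustration.

```python
def split_plan(n_in: int, n_out: int) -> list[list[int]]:
    """Return, for each output partition, the input partitions it depends on.

    Hypothetical model (an assumption, not dask's code): every input
    partition is split into consecutive pieces, and the remainder of the
    division is assigned to the last input partition.
    """
    if n_in < 1 or n_out < n_in:
        raise ValueError("this sketch only covers increasing npartitions")
    pieces, extra = divmod(n_out, n_in)
    plan: list[list[int]] = []
    for i in range(n_in):
        # The last input partition absorbs the remainder.
        count = pieces + (extra if i == n_in - 1 else 0)
        plan.extend([i] for _ in range(count))
    return plan


# Two files repartitioned into five partitions, as in the example below.
plan = split_plan(2, 5)
print(plan)     # [[0], [0], [1], [1], [1]]
print(plan[0])  # [0]
```

Under this model the first output partition depends only on the first input partition, so head() would need to read a single file.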
Minimal Complete Verifiable Example:
minimal reproducible example:
import fsspec
import pyarrow as pa
import pyarrow.parquet as pq
fs = fsspec.filesystem("memory")
pq.write_table(pa.table({"i": [0]}), "0.parquet", filesystem=fs)
pq.write_table(pa.table({"i": [1]}), "1.parquet", filesystem=fs)
import dask.dataframe as dd
df = dd.read_parquet("memory://*.parquet")
df.npartitions # 2
df.divisions # (None, None, None)
# Now calling .head() should only read the first partition, so it should only read the first file
fs.delete("1.parquet")
df.head() # works
df.repartition(npartitions=5).head() # fails with dask-expr, and works without dask-expr
# FileNotFoundError
real world issue:
import dask.dataframe as dd
df = dd.read_parquet("hf://datasets/HuggingFaceTB/finemath/finemath-3plus/train-*.parquet") # 128 files
df.npartitions # 512
df.divisions # (None,) * 513
df = df.repartition(npartitions=2048)
df.head() # super slow with dask-expr (downloads data from many many files), fast without dask-expr
Anything else we need to know?:
I need this to work correctly to showcase Dask to the Hugging Face community :)
There are >200k datasets on HF in Parquet format, and making partitions fast again can be a big improvement. E.g. a fast .head() can be a big plus for DX.
Environment:
- Dask version: 2024.12.1, dask-expr 1.1.21 (also on main with 2024.12.1+8.g4fb8993c and 1.1.21+1.gedb6fd5)
- Python version: 3.12.2
- Operating System: MacOS
- Install method (conda, pip, source): pip