Open
Description
Investigate dask for simplified parallel writer.
Consider removing the N_SHARD argument: it complicates the API and is used only by hf_dataset. It plays no role in the data processing, only in writing, and in the current code the number of shards is inferred automatically.
For HF_dataset support, the call to a generator must be kept. This is possible with dask:

    import dask
    from functools import partial

    def plaid_sample_generator(bag):
        # Compute one partition at a time and yield its samples one by one.
        for d in bag.to_delayed():
            samples = dask.compute(d)[0]
            for s in samples:
                yield s

then

    gen = {"train": partial(plaid_sample_generator, bag)}
    plaid.storage.save_to_disk(output_folder="./test", generators=gen)

where bag is a dask.bag object. I have not yet understood how parallelism is actually controlled with this approach.
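
On the parallelism question: a generator that calls dask.compute on one delayed partition at a time hands the scheduler a single partition per call, so partitions are produced one after another. One possible way to make the degree of parallelism explicit is to compute several partitions per call and pass `scheduler` and `num_workers` to `dask.compute`. The following is a minimal sketch under that assumption, using a local scheduler; `batched_sample_generator`, `batch_size` and the default values are hypothetical names chosen for illustration and are not part of plaid.

```python
import dask
import dask.bag as db


def batched_sample_generator(bag, batch_size=4, scheduler="threads", num_workers=None):
    """Yield samples from a dask bag, computing `batch_size` partitions per call.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    partitions = bag.to_delayed()  # one Delayed object per partition
    compute_kwargs = {"scheduler": scheduler}
    if num_workers is not None:
        compute_kwargs["num_workers"] = num_workers
    for start in range(0, len(partitions), batch_size):
        batch = partitions[start:start + batch_size]
        # All partitions of the batch are submitted in a single compute call,
        # so the local scheduler may run them concurrently.
        results = dask.compute(*batch, **compute_kwargs)
        for samples in results:
            for s in samples:
                yield s
```

It could be plugged in the same way as the generator above, for example `gen = {"train": partial(batched_sample_generator, bag, batch_size=8, num_workers=4)}`. The trade-off is memory: each batch of partitions is fully materialized before its samples are yielded, so `batch_size` bounds both the concurrency and the number of partitions held in memory at once.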