
[Storage] writer: parallelization evolutions #288

@casenave

Description

Investigate dask for a simplified parallel writer.
Consider removing the N_SHARD argument: it complicates the API, and it is only used by hf_dataset, and only at write time rather than during data processing. In the current code, the number of shards is already inferred automatically.
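As a minimal sketch of how the inference could work without an N_SHARD argument (the one-shard-per-partition policy here is an assumption for illustration, not the actual plaid behavior), the shard count can be read off the bag itself:

```python
import dask.bag as db

samples = list(range(100))
bag = db.from_sequence(samples, npartitions=8)

# Assumed policy: one shard per bag partition, so the shard
# count is derived from the bag instead of a user-supplied N_SHARD
n_shards = bag.npartitions
```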

For HF_dataset support, the generator-based call must be kept; this is possible with dask:

import dask

def plaid_sample_generator(bag):
    # Materialize one partition at a time to keep memory bounded
    for d in bag.to_delayed():
        samples = dask.compute(d)[0]
        for s in samples:
            yield s

then

from functools import partial

gen = {"train": partial(plaid_sample_generator, bag)}
plaid.storage.save_to_disk(output_folder="./test", generators=gen)

where bag is a dask.bag.Bag object. It is not yet clear how parallelism is actually controlled in this setup.
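For reference, dask's parallelism is generally controlled by two knobs, independent of plaid: the bag's partition count (the unit of parallel work) and the scheduler passed at compute time. A minimal sketch, also showing the per-partition delayed objects the generator above iterates over:

```python
import dask
import dask.bag as db

# The partition count sets the granularity of parallelism
bag = db.from_sequence(range(8), npartitions=4)

# The scheduler chosen at compute time controls how partitions run:
# "threads", "processes", or "synchronous" (single-threaded, for debugging)
doubled = bag.map(lambda x: x * 2).compute(scheduler="threads")

# to_delayed() exposes one delayed object per partition
parts = bag.to_delayed()
first = dask.compute(parts[0])[0]
```

Note that when iterating partitions one by one, as plaid_sample_generator does, each dask.compute call only parallelizes within that single partition, which may explain why the degree of parallelism is hard to observe.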
