Memory-efficient removal of duplicated rows in Dask Dataframe #5

@jeroarenas

Description

In topicmodeling.py (main --preproc), if no Spark cluster is available, Dask dataframes are built by concatenating all training datasets. In some cases there may be duplicated documents that should be removed according to their id. This can happen, e.g., if we use the concatenation of Semantic Scholar - Health and Semantic Scholar - Biology, since their overlap is not empty.

In this case it would be good to remove duplicates before preprocessing and computing the actual training corpus. I tried dask.DataFrame.drop_duplicates(), but it does not scale well, even when duplicates are searched for using just the corpus id; a rough sketch of what was tried is below.
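
For reference, the non-scaling approach looks roughly like this (the frame names, paths, and the `id` column name are assumptions for illustration):

```python
import dask.dataframe as dd

# Hypothetical input frames for the two overlapping corpora.
ddf_health = dd.read_parquet("semantic_scholar_health/")    # path assumed
ddf_biology = dd.read_parquet("semantic_scholar_biology/")  # path assumed

# Naive deduplication: concatenate, then drop duplicates globally.
# drop_duplicates() compares rows across all partitions, which does
# not scale well on large corpora, even restricted to the id column.
corpus = dd.concat([ddf_health, ddf_biology])
corpus = corpus.drop_duplicates(subset=["id"])  # "id" column name assumed
```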

Right now, duplicates are not removed, but it would be good to do so. This Stack Overflow entry could help (hash-based shuffle before drop_duplicates?):
https://stackoverflow.com/questions/68019990/dask-dataframe-remove-duplicates-by-columns-a-keeping-the-row-with-the-highest
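
A minimal sketch of the shuffle-based idea from that entry, assuming the documents carry an `id` column (the actual column name in our corpora may differ):

```python
import dask.dataframe as dd

def dedupe_by_id(ddf: dd.DataFrame, id_col: str = "id") -> dd.DataFrame:
    """Drop duplicate rows, keeping one row per id value.

    A hash-based shuffle routes every row sharing the same id to the
    same partition, so drop_duplicates can then run partition-locally
    and never has to materialize the whole frame on one worker.
    """
    shuffled = ddf.shuffle(on=id_col)
    return shuffled.map_partitions(
        lambda pdf: pdf.drop_duplicates(subset=id_col)
    )

# Hypothetical usage with the two concatenated corpora from above:
# corpus = dedupe_by_id(dd.concat([ddf_health, ddf_biology]), id_col="id")
```

Each partition then only holds the rows that hash to it, so memory use stays bounded by the partition size rather than by the full corpus.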
