Memory-efficient removal of duplicated rows in Dask Dataframe #5

@jeroarenas

Description

In topicmodeling.py (main --preproc), if no Spark cluster is available, Dask dataframes are built by concatenating all training datasets. In some cases there may be duplicated documents that should be removed according to their id. This can happen, e.g., if we use the concatenation of Semantic Scholar - Health and Semantic Scholar - Biology, since their overlap is not empty.

In this case it would be good to remove duplicates before preprocessing and computing the actual training corpus. I tried dask.DataFrame.drop_duplicates(), but it does not scale well, even when duplicates are searched for using just the corpus id; a rough sketch of what was tried is below.
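
For reference, the non-scaling approach looks roughly like this (the frame names, paths, and the `id` column name are assumptions for illustration):

```python
import dask.dataframe as dd

# Hypothetical input frames for the two overlapping corpora.
ddf_health = dd.read_parquet("semantic_scholar_health/")    # path assumed
ddf_biology = dd.read_parquet("semantic_scholar_biology/")  # path assumed

# Naive deduplication: concatenate, then drop duplicates globally.
# drop_duplicates() compares rows across all partitions, which does
# not scale well on large corpora, even restricted to the id column.
corpus = dd.concat([ddf_health, ddf_biology])
corpus = corpus.drop_duplicates(subset=["id"])  # "id" column name assumed
```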

Right now, duplicates are not removed, but it would be good to do so. This Stack Overflow entry could help (hash-based shuffle before drop_duplicates?):
https://stackoverflow.com/questions/68019990/dask-dataframe-remove-duplicates-by-columns-a-keeping-the-row-with-the-highest
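
A minimal sketch of the shuffle-based idea from that entry, assuming the documents carry an `id` column (the actual column name in our corpora may differ):

```python
import dask.dataframe as dd

def dedupe_by_id(ddf: dd.DataFrame, id_col: str = "id") -> dd.DataFrame:
    """Drop duplicate rows, keeping one row per id value.

    A hash-based shuffle routes every row sharing the same id to the
    same partition, so drop_duplicates can then run partition-locally
    and never has to materialize the whole frame on one worker.
    """
    shuffled = ddf.shuffle(on=id_col)
    return shuffled.map_partitions(
        lambda pdf: pdf.drop_duplicates(subset=id_col)
    )

# Hypothetical usage with the two concatenated corpora from above:
# corpus = dedupe_by_id(dd.concat([ddf_health, ddf_biology]), id_col="id")
```

Each partition then only holds the rows that hash to it, so memory use stays bounded by the partition size rather than by the full corpus.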
