Description
In topicmodeling.py, main --preproc, if no Spark cluster is available, Dask dataframes are built by concatenating all training datasets. In some cases there can be duplicated documents that should be removed according to their id. This can happen, e.g., if we use the concatenation of Semantic Scholar - Health and Semantic Scholar - Biology, since their overlap is not null.
In this case it would be good to remove duplicates before preprocessing and before computing the actual training corpus. I tried dask.DataFrame.drop_duplicates(), but it does not scale well, even when duplicates are searched for using just the corpus id.
Right now duplicates are not removed, but it would be good to do so. This Stack Overflow entry could help (hash-based shuffle before drop_duplicates?):
https://stackoverflow.com/questions/68019990/dask-dataframe-remove-duplicates-by-columns-a-keeping-the-row-with-the-highest