Merged
Added a drop duplicates feature that lets the user drop duplicate rows from the dataframe. Options can be passed to the method: custom column names to check for duplicates, whether to keep the first, the last or none of the duplicated rows, and whether to modify the original dataframe or create another dataframe with the modified rows.
Added drop duplicates tests for all of the options, including subset and keep set in the same option struct.
Drop Duplicates method
DropDuplicatesOption struct to allow users to further fine-tune the drop duplicates method, with the options:
Subset -> Custom column names to check for duplicates.
Keep -> Which row to keep when duplicates occur.
Inplace -> Whether to modify the existing dataframe or create a new dataframe with the modified rows.
getRowKey method is implemented to build the unique key of a row. The key is made up of the column names and values in string format and is kept in a map to recognise duplicates.
getSubslice method is implemented to get the rows of each column at the row indexes to keep, which are decided by the logic that runs before this method is called.
Syntax:
Important things to note
Default option values:
Subset -> All columns
Keep -> Keep only the first occurrence
Inplace -> False, creates a new dataframe without the duplicate rows.
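
The options and defaults above can be illustrated with a minimal sketch. It is not the code from this pull request: the Frame type, the string-only columns, the field types of DropDuplicatesOption, the "first"/"last"/"none" values for Keep, and the bodies of getRowKey and getSubslice are all assumptions made for the example. It only shows one way a row key stored in a map can drive the three Keep behaviours.

```go
package main

import (
	"fmt"
	"strings"
)

// Frame is a hypothetical, minimal column-oriented table used only in this sketch.
type Frame struct {
	Columns []string            // column order
	Data    map[string][]string // column name -> values, one per row
}

// DropDuplicatesOption mirrors the three options described above (field types assumed).
type DropDuplicatesOption struct {
	Subset  []string // columns to compare; empty means all columns
	Keep    string   // "first" (default), "last" or "none"
	Inplace bool     // false (default) returns a new Frame
}

// getRowKey joins column names and values of one row into a string key.
// It assumes every name in cols exists in the Frame.
func (f *Frame) getRowKey(row int, cols []string) string {
	var b strings.Builder
	for _, c := range cols {
		fmt.Fprintf(&b, "%q=%q;", c, f.Data[c][row])
	}
	return b.String()
}

// getSubslice copies the rows at the given indexes into new column slices.
func (f *Frame) getSubslice(rows []int) map[string][]string {
	out := make(map[string][]string, len(f.Columns))
	for _, c := range f.Columns {
		vals := make([]string, 0, len(rows))
		for _, r := range rows {
			vals = append(vals, f.Data[c][r])
		}
		out[c] = vals
	}
	return out
}

// DropDuplicates decides which row indexes to keep, then builds the result.
func (f *Frame) DropDuplicates(opt DropDuplicatesOption) *Frame {
	cols := opt.Subset
	if len(cols) == 0 {
		cols = f.Columns
	}
	n := 0
	if len(f.Columns) > 0 {
		n = len(f.Data[f.Columns[0]])
	}
	keys := make([]string, n)
	count := make(map[string]int) // occurrences of each key
	last := make(map[string]int)  // last row index of each key
	for r := 0; r < n; r++ {
		k := f.getRowKey(r, cols)
		keys[r] = k
		count[k]++
		last[k] = r
	}
	seen := make(map[string]bool)
	var rows []int
	for r, k := range keys {
		switch opt.Keep {
		case "last":
			if last[k] == r {
				rows = append(rows, r)
			}
		case "none":
			if count[k] == 1 {
				rows = append(rows, r)
			}
		default: // "first", also used when Keep is empty or unknown
			if !seen[k] {
				rows = append(rows, r)
			}
		}
		seen[k] = true
	}
	data := f.getSubslice(rows)
	if opt.Inplace {
		f.Data = data
		return f
	}
	return &Frame{Columns: f.Columns, Data: data}
}

func main() {
	f := &Frame{
		Columns: []string{"name", "city"},
		Data: map[string][]string{
			"name": {"ann", "bob", "ann", "ann"},
			"city": {"oslo", "rome", "oslo", "lima"},
		},
	}
	all := f.DropDuplicates(DropDuplicatesOption{}) // defaults: all columns, keep first, new Frame
	byName := f.DropDuplicates(DropDuplicatesOption{Subset: []string{"name"}, Keep: "last"})
	fmt.Println(all.Data["name"], byName.Data["city"])
}
```

In this sketch an empty option struct reproduces the defaults listed above, so the zero value of DropDuplicatesOption is directly usable. Whether the real method takes the struct by value, by pointer or as a variadic argument is not stated here.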
Things to improve
please add sort, describe, toDatabase, readFromDatabase, dropDuplicates #26