Initial tutorial for e2e fuzzy deduplication #1242
base: main
Conversation
Signed-off-by: Ayush Dattagupta <[email protected]>
Greptile Overview
Greptile Summary
This PR adds a comprehensive Jupyter notebook tutorial for end-to-end fuzzy deduplication using the NeMo Curator API. The tutorial demonstrates a complete workflow for identifying and removing duplicate documents from the TinyStories dataset. It walks users through the entire pipeline: loading data, executing the fuzzy deduplication workflow to identify duplicates, and running the removal workflow to produce a deduplicated dataset.
PR Description Notes:
Important Files Changed
Confidence score: 4/5
Sequence Diagram
sequenceDiagram
participant User
participant FuzzyDeduplicationWorkflow
participant TextDuplicatesRemovalWorkflow
participant RayClient
participant IDGenerator
participant MinHashStage
participant LSHStage
participant ConnectedComponentsStage
participant RemovalStage
participant DataStore
User->>RayClient: "start()"
RayClient-->>User: "Ray cluster started"
User->>FuzzyDeduplicationWorkflow: "run()"
FuzzyDeduplicationWorkflow->>IDGenerator: "create_id_generator_actor()"
IDGenerator-->>FuzzyDeduplicationWorkflow: "ID generator ready"
FuzzyDeduplicationWorkflow->>DataStore: "read input data"
DataStore-->>FuzzyDeduplicationWorkflow: "text documents"
FuzzyDeduplicationWorkflow->>IDGenerator: "assign unique integer IDs"
IDGenerator-->>FuzzyDeduplicationWorkflow: "documents with _curator_dedup_id"
FuzzyDeduplicationWorkflow->>MinHashStage: "compute MinHash signatures"
MinHashStage-->>FuzzyDeduplicationWorkflow: "documents with _minhash_signature"
FuzzyDeduplicationWorkflow->>LSHStage: "perform LSH bucketing"
LSHStage-->>FuzzyDeduplicationWorkflow: "bucket_id to doc_id mappings"
FuzzyDeduplicationWorkflow->>ConnectedComponentsStage: "convert buckets to edges and find connected components"
ConnectedComponentsStage-->>FuzzyDeduplicationWorkflow: "duplicate groups with _duplicate_group_id"
FuzzyDeduplicationWorkflow->>DataStore: "save duplicate IDs list"
DataStore-->>FuzzyDeduplicationWorkflow: "FuzzyDuplicateIds saved"
FuzzyDeduplicationWorkflow-->>User: "identification complete"
User->>TextDuplicatesRemovalWorkflow: "run()"
TextDuplicatesRemovalWorkflow->>DataStore: "read original input data"
DataStore-->>TextDuplicatesRemovalWorkflow: "original documents"
TextDuplicatesRemovalWorkflow->>DataStore: "read duplicate IDs list"
DataStore-->>TextDuplicatesRemovalWorkflow: "IDs to remove"
TextDuplicatesRemovalWorkflow->>RemovalStage: "filter out duplicate documents"
RemovalStage-->>TextDuplicatesRemovalWorkflow: "deduplicated documents"
TextDuplicatesRemovalWorkflow->>DataStore: "save deduplicated dataset"
DataStore-->>TextDuplicatesRemovalWorkflow: "deduplicated dataset saved"
TextDuplicatesRemovalWorkflow-->>User: "removal complete"
User->>RayClient: "stop()"
RayClient-->>User: "Ray cluster stopped"
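Read as code, the diagram boils down to roughly the sketch below. The class names and call order come straight from the diagram and the notebook cells quoted later in this thread; the import paths for the two workflow classes and all constructor arguments are illustrative assumptions, not the notebook's exact API.

```python
from nemo_curator.core.client import RayClient  # import shown in the notebook cell quoted below

# FuzzyDeduplicationWorkflow / TextDuplicatesRemovalWorkflow imports omitted:
# their module paths are not shown in this thread, so treat this as pseudocode.

client = RayClient(num_cpus=64, num_gpus=2)  # change as needed
client.start()

# Stage 1 (GPU): assign _curator_dedup_id on the fly, compute MinHash signatures,
# run LSH bucketing and connected components, then save the duplicate ID list.
fuzzy = FuzzyDeduplicationWorkflow(       # constructor arguments are placeholders
    input_path="./input",
    cache_path="./cache",
    output_path="./output",
)
fuzzy.run()

# Stage 2 (CPU-only): re-read the original input, drop every document whose ID
# appears in the duplicate list, and save the deduplicated dataset.
removal = TextDuplicatesRemovalWorkflow(  # constructor arguments are placeholders
    input_path="./input",
    ids_to_remove_path="./output",
    output_path="./deduplicated",
)
removal.run()

client.stop()
```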
1 file reviewed, 6 comments
| "1. Read original dataset\n", | ||
| "2. Compute MinHashes signatures of these documents\n", | ||
| "3. Perform LSH - Group Minhashes into bands/buckets and shuffle these bands/buckets so that documents in the same bucket are in the same batch/file.\n", | ||
| "4. Convert the LSH outputs (bucket_id -> doc_id mapping) into a edgelist in preperation for connected components. \n", |
syntax: Typo in 'preperation' should be 'preparation'
| "4. Convert the LSH outputs (bucket_id -> doc_id mapping) into a edgelist in preperation for connected components. \n", | |
| "4. Convert the LSH outputs (bucket_id -> doc_id mapping) into a edgelist in preparation for connected components. \n", |
| "6. During removal, reading the same file groups will give the same integer IDs, using the min/max ID values, we can find all corresponding duplicates in that range making the process faster.\n", | ||
| "\n", | ||
| "#### Performance Considerations\n", | ||
| "1. LSH - Configuring bands_per_iteration controls how many bands to process simultanesouly in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.\n", |
syntax: Typo in 'simultanesouly' should be 'simultaneously'
| "1. LSH - Configuring bands_per_iteration controls how many bands to process simultanesouly in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.\n", | |
| "1. LSH - Configuring bands_per_iteration controls how many bands to process simultaneously in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.\n", |
| "4. The removal workflow is CPU only and can be run on machines that don't have GPUs\n", | ||
| "\n", | ||
| "#### Hyperparameter Considerations\n", | ||
| "1. The current defaults for Fuzzy deduplication (260 hashes, 13 hashes per band) approximate finding documents with a jaccard similarity of 0.8. For more information on selecting the number of bands/hashes it's recommended to analyze the S curve and tolerable threshold for false postives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).\n", |
syntax: Typo in 'postives' should be 'positives'
| "1. The current defaults for Fuzzy deduplication (260 hashes, 13 hashes per band) approximate finding documents with a jaccard similarity of 0.8. For more information on selecting the number of bands/hashes it's recommended to analyze the S curve and tolerable threshold for false postives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).\n", | |
| "1. The current defaults for Fuzzy deduplication (260 hashes, 13 hashes per band) approximate finding documents with a jaccard similarity of0.8. For more information on selecting the number of bands/hashes it's recommended to analyze the S curve and tolerable threshold for false positives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).\n", |
| "### Looking at Intermediate Results and Output\n", | ||
| "\n", | ||
| "#### MinHash Results\n", | ||
| "1. `_curator_dedup_id` - The IDs assigned to this dataset on the fly during the intial read.\n", |
syntax: Typo in 'intial' should be 'initial'
| "1. `_curator_dedup_id` - The IDs assigned to this dataset on the fly during the intial read.\n", | |
| "1. `_curator_dedup_id` - The IDs assigned to this dataset on the fly during the initial read.\n", |
| "source": [ | ||
| "#### Advanced: Looking at examples of duplicate documents\n", | ||
| "\n", | ||
| "1. This analysis involves re-reading the input data with the same ID mapping that was used during duplicate identification.\n", |
syntax: Typo in 'analsis' should be 'analysis'
| "1. This analysis involves re-reading the input data with the same ID mapping that was used during duplicate identification.\n", | ||
| "2. Merging the input data with the connected components results on the `_curator_dedup_id` column to associate each document which the duplicate group it belongs to which can be used for further analysis.\n", | ||
| "\n", | ||
| "NOTE: This analsis approach is itended for smaller datasets and only works for cases where the connected components dataframe is small and fits comfortable in memory. It is not recommended for larger datasets." |
syntax: Typo in 'itended' should be 'intended'
| "NOTE: This analsis approach is itended for smaller datasets and only works for cases where the connected components dataframe is small and fits comfortable in memory. It is not recommended for larger datasets." | |
| "NOTE: This analysis approach is intended for smaller datasets and only works for cases where the connected components dataframe is small and fits comfortable in memory. It is not recommended for larger datasets." |
One suggestion that'll make it easier in future to run this for the cloud too (I have to make the same change for SemDedup, for which I'll create a PR): where you define your paths, you can do this

base_path = "./"
input_path = os.path.join(base_path, "input")
output_path = os.path.join(base_path, "output")
cache_path = os.path.join(base_path, "cache")
fs = fsspec.url_to_fs(base_path)

And then where you have

df = pd.read_parquet(os.path.join(input_path, os.listdir(input_path)[0]))

Change it to

df = pd.read_parquet(fs.unstrip_protocol(fs.find(input_path)[0]))

This should make the tutorial also cloud compatible, as the user just has to change base_path.
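One caveat for anyone copying that snippet: `fsspec.url_to_fs` returns a `(filesystem, path)` tuple, so the filesystem object needs to be unpacked. A self-contained version of the idea, using the tutorial's assumed local layout:

```python
import os

import fsspec
import pandas as pd

base_path = "./"  # swap for e.g. "s3://bucket/prefix" to run against cloud storage
input_path = os.path.join(base_path, "input")
output_path = os.path.join(base_path, "output")
cache_path = os.path.join(base_path, "cache")

# url_to_fs returns (filesystem, normalized_path); only the filesystem is needed here
fs, _ = fsspec.url_to_fs(base_path)

# fs.find lists files recursively; unstrip_protocol restores the URL scheme so
# pandas can open the file on any backend (local, s3, gcs, ...)
df = pd.read_parquet(fs.unstrip_protocol(fs.find(input_path)[0]))
```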
1 file reviewed, no comments
sarahyurick left a comment
I think this matches the style of the semantic tutorial nicely. Added some small requests.
| "\n", | ||
| "GPU accelerated implementation of a MinHash-LSH based fuzzy deduplication. For more information about semantic deduplication in NeMo Curator, refer to the [Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/index.html) section of the documentation page.\n", | ||
| "\n", | ||
| "The tutorial here shows how to run Fuzzy Duplication on text data by executing a 2 end to end workflows which does the following:\n", |
| "The tutorial here shows how to run Fuzzy Duplication on text data by executing a 2 end to end workflows which does the following:\n", | |
| "The tutorial here shows how to run Fuzzy Duplication on text data by executing 2 end to end workflows which does the following:\n", |
| "\n", | ||
| "The tutorial here shows how to run Fuzzy Duplication on text data by executing a 2 end to end workflows which does the following:\n", | ||
| "\n", | ||
| "1. Read original dataset\n", |
The above comment says we are executing 2 workflows, but then lists 7 steps here, which could be confusing.
| "6. During removal, reading the same file groups will give the same integer IDs, using the min/max ID values, we can find all corresponding duplicates in that range making the process faster.\n", | ||
| "\n", | ||
| "#### Performance Considerations\n", | ||
| "1. LSH - Configuring bands_per_iteration controls how many bands to process simultanesouly in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.\n", |
| "1. LSH - Configuring bands_per_iteration controls how many bands to process simultanesouly in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.\n", | |
| "1. LSH - Configuring `bands_per_iteration` controls how many bands to process simultaneously in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.\n", |
| "4. The removal workflow is CPU only and can be run on machines that don't have GPUs\n", | ||
| "\n", | ||
| "#### Hyperparameter Considerations\n", | ||
| "1. The current defaults for Fuzzy deduplication (260 hashes, 13 hashes per band) approximate finding documents with a jaccard similarity of 0.8. For more information on selecting the number of bands/hashes it's recommended to analyze the S curve and tolerable threshold for false postives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).\n", |
| "1. The current defaults for Fuzzy deduplication (260 hashes, 13 hashes per band) approximate finding documents with a jaccard similarity of 0.8. For more information on selecting the number of bands/hashes it's recommended to analyze the S curve and tolerable threshold for false postives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).\n", | |
| "1. The current defaults for fuzzy deduplication (260 hashes, 13 hashes per band) approximate finding documents with a Jaccard similarity of 0.8. For more information on selecting the number of bands/hashes, it's recommended to analyze the S curve and tolerable threshold for false positives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).\n", |
Do you think we would be able to include an example analysis in this tutorial?
Maybe in the step-by-step tutorial.
| "from nemo_curator.core.client import RayClient\n", | ||
| "\n", | ||
| "client = RayClient(num_cpus=64, num_gpus=2) # change as needed\n", | ||
| "client.start()\n", |
Missing a stop for this client.
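For example, the notebook could end with the matching call once both workflows have finished (RayClient's `stop()` appears in the sequence diagram above):

```python
# shut down the local Ray cluster started at the top of the notebook
client.stop()
```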
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", |
Lingering empty cell at the end of the notebook. Can you add a summary conclusion sentence/paragraph for the tutorial?
Description
Adds an initial tutorial notebook that showcases fuzzy deduplication using the end-to-end workflow API.
Usage
# Add snippet demonstrating usage
Checklist