Initial tutorial for e2e fuzzy deduplication #1242
base: main
Conversation
Signed-off-by: Ayush Dattagupta <[email protected]>
Greptile Overview
Greptile Summary
This PR adds a comprehensive Jupyter notebook tutorial for end-to-end fuzzy deduplication using the NeMo Curator API. The tutorial demonstrates a complete workflow for identifying and removing duplicate documents from the TinyStories dataset. It walks users through the entire pipeline: loading data, executing the fuzzy deduplication workflow to identify duplicates, and running the removal workflow to produce a deduplicated dataset.
PR Description Notes:
Important Files Changed
Confidence score: 4/5
Sequence Diagram
sequenceDiagram
participant User
participant FuzzyDeduplicationWorkflow
participant TextDuplicatesRemovalWorkflow
participant RayClient
participant IDGenerator
participant MinHashStage
participant LSHStage
participant ConnectedComponentsStage
participant RemovalStage
participant DataStore
User->>RayClient: "start()"
RayClient-->>User: "Ray cluster started"
User->>FuzzyDeduplicationWorkflow: "run()"
FuzzyDeduplicationWorkflow->>IDGenerator: "create_id_generator_actor()"
IDGenerator-->>FuzzyDeduplicationWorkflow: "ID generator ready"
FuzzyDeduplicationWorkflow->>DataStore: "read input data"
DataStore-->>FuzzyDeduplicationWorkflow: "text documents"
FuzzyDeduplicationWorkflow->>IDGenerator: "assign unique integer IDs"
IDGenerator-->>FuzzyDeduplicationWorkflow: "documents with _curator_dedup_id"
FuzzyDeduplicationWorkflow->>MinHashStage: "compute MinHash signatures"
MinHashStage-->>FuzzyDeduplicationWorkflow: "documents with _minhash_signature"
FuzzyDeduplicationWorkflow->>LSHStage: "perform LSH bucketing"
LSHStage-->>FuzzyDeduplicationWorkflow: "bucket_id to doc_id mappings"
FuzzyDeduplicationWorkflow->>ConnectedComponentsStage: "convert buckets to edges and find connected components"
ConnectedComponentsStage-->>FuzzyDeduplicationWorkflow: "duplicate groups with _duplicate_group_id"
FuzzyDeduplicationWorkflow->>DataStore: "save duplicate IDs list"
DataStore-->>FuzzyDeduplicationWorkflow: "FuzzyDuplicateIds saved"
FuzzyDeduplicationWorkflow-->>User: "identification complete"
User->>TextDuplicatesRemovalWorkflow: "run()"
TextDuplicatesRemovalWorkflow->>DataStore: "read original input data"
DataStore-->>TextDuplicatesRemovalWorkflow: "original documents"
TextDuplicatesRemovalWorkflow->>DataStore: "read duplicate IDs list"
DataStore-->>TextDuplicatesRemovalWorkflow: "IDs to remove"
TextDuplicatesRemovalWorkflow->>RemovalStage: "filter out duplicate documents"
RemovalStage-->>TextDuplicatesRemovalWorkflow: "deduplicated documents"
TextDuplicatesRemovalWorkflow->>DataStore: "save deduplicated dataset"
DataStore-->>TextDuplicatesRemovalWorkflow: "deduplicated dataset saved"
TextDuplicatesRemovalWorkflow-->>User: "removal complete"
User->>RayClient: "stop()"
RayClient-->>User: "Ray cluster stopped"
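Read as code, the diagram boils down to roughly the sketch below. The class names and call order come straight from the diagram and the notebook cells quoted later in this thread; the import paths for the two workflow classes and all constructor arguments are illustrative assumptions, not the notebook's exact API.

```python
from nemo_curator.core.client import RayClient  # import shown in the notebook cell quoted below

# FuzzyDeduplicationWorkflow / TextDuplicatesRemovalWorkflow imports omitted:
# their module paths are not shown in this thread, so treat this as pseudocode.

client = RayClient(num_cpus=64, num_gpus=2)  # change as needed
client.start()

# Stage 1 (GPU): assign _curator_dedup_id on the fly, compute MinHash signatures,
# run LSH bucketing and connected components, then save the duplicate ID list.
fuzzy = FuzzyDeduplicationWorkflow(       # constructor arguments are placeholders
    input_path="./input",
    cache_path="./cache",
    output_path="./output",
)
fuzzy.run()

# Stage 2 (CPU-only): re-read the original input, drop every document whose ID
# appears in the duplicate list, and save the deduplicated dataset.
removal = TextDuplicatesRemovalWorkflow(  # constructor arguments are placeholders
    input_path="./input",
    ids_to_remove_path="./output",
    output_path="./deduplicated",
)
removal.run()

client.stop()
```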
1 file reviewed, 6 comments
| "1. Read original dataset\n", | ||
| "2. Compute MinHashes signatures of these documents\n", | ||
| "3. Perform LSH - Group Minhashes into bands/buckets and shuffle these bands/buckets so that documents in the same bucket are in the same batch/file.\n", | ||
| "4. Convert the LSH outputs (bucket_id -> doc_id mapping) into a edgelist in preperation for connected components. \n", |
syntax: Typo in 'preperation' should be 'preparation'
| "4. Convert the LSH outputs (bucket_id -> doc_id mapping) into a edgelist in preperation for connected components. \n", | |
| "4. Convert the LSH outputs (bucket_id -> doc_id mapping) into a edgelist in preparation for connected components. \n", |
| "6. During removal, reading the same file groups will give the same integer IDs, using the min/max ID values, we can find all corresponding duplicates in that range making the process faster.\n", | ||
| "\n", | ||
| "#### Performance Considerations\n", | ||
| "1. LSH - Configuring bands_per_iteration controls how many bands to process simultanesouly in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.\n", |
syntax: Typo in 'simultanesouly' should be 'simultaneously'
| "1. LSH - Configuring bands_per_iteration controls how many bands to process simultanesouly in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.\n", | |
| "1. LSH - Configuring bands_per_iteration controls how many bands to process simultaneously in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.\n", |
| "4. The removal workflow is CPU only and can be run on machines that don't have GPUs\n", | ||
| "\n", | ||
| "#### Hyperparameter Considerations\n", | ||
| "1. The current defaults for Fuzzy deduplication (260 hashes, 13 hashes per band) approximate finding documents with a jaccard similarity of 0.8. For more information on selecting the number of bands/hashes it's recommended to analyze the S curve and tolerable threshold for false postives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).\n", |
syntax: Typo in 'postives' should be 'positives'
| "1. The current defaults for Fuzzy deduplication (260 hashes, 13 hashes per band) approximate finding documents with a jaccard similarity of 0.8. For more information on selecting the number of bands/hashes it's recommended to analyze the S curve and tolerable threshold for false postives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).\n", | |
| "1. The current defaults for Fuzzy deduplication (260 hashes, 13 hashes per band) approximate finding documents with a jaccard similarity of0.8. For more information on selecting the number of bands/hashes it's recommended to analyze the S curve and tolerable threshold for false positives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).\n", |
| "### Looking at Intermediate Results and Output\n", | ||
| "\n", | ||
| "#### MinHash Results\n", | ||
| "1. `_curator_dedup_id` - The IDs assigned to this dataset on the fly during the intial read.\n", |
syntax: Typo in 'intial' should be 'initial'
| "1. `_curator_dedup_id` - The IDs assigned to this dataset on the fly during the intial read.\n", | |
| "1. `_curator_dedup_id` - The IDs assigned to this dataset on the fly during the initial read.\n", |
| "source": [ | ||
| "#### Advanced: Looking at examples of duplicate documents\n", | ||
| "\n", | ||
| "1. This analysis involves re-reading the input data with the same ID mapping that was used during duplicate identification.\n", |
syntax: Typo in 'analsis' should be 'analysis'
| "1. This analysis involves re-reading the input data with the same ID mapping that was used during duplicate identification.\n", | ||
| "2. Merging the input data with the connected components results on the `_curator_dedup_id` column to associate each document which the duplicate group it belongs to which can be used for further analysis.\n", | ||
| "\n", | ||
| "NOTE: This analsis approach is itended for smaller datasets and only works for cases where the connected components dataframe is small and fits comfortable in memory. It is not recommended for larger datasets." |
syntax: Typo in 'itended' should be 'intended'
| "NOTE: This analsis approach is itended for smaller datasets and only works for cases where the connected components dataframe is small and fits comfortable in memory. It is not recommended for larger datasets." | |
| "NOTE: This analysis approach is intended for smaller datasets and only works for cases where the connected components dataframe is small and fits comfortable in memory. It is not recommended for larger datasets." |
One suggestion that'll make it easier in future to run this for the cloud too (I have to make the same change for SemDedup, for which I'll create a PR): where you define your paths, you can do this

base_path = "./"
input_path = os.path.join(base_path, "input")
output_path = os.path.join(base_path, "output")
cache_path = os.path.join(base_path, "cache")
fs = fsspec.url_to_fs(base_path)

And then where you have

df = pd.read_parquet(os.path.join(input_path, os.listdir(input_path)[0]))

Change it to

df = pd.read_parquet(fs.unstrip_protocol(fs.find(input_path)[0]))

This should make the tutorial also cloud compatible, as the user just has to change base_path.
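One caveat for anyone copying that snippet: `fsspec.url_to_fs` returns a `(filesystem, path)` tuple, so the filesystem object needs to be unpacked. A self-contained version of the idea, using the tutorial's assumed local layout:

```python
import os

import fsspec
import pandas as pd

base_path = "./"  # swap for e.g. "s3://bucket/prefix" to run against cloud storage
input_path = os.path.join(base_path, "input")
output_path = os.path.join(base_path, "output")
cache_path = os.path.join(base_path, "cache")

# url_to_fs returns (filesystem, normalized_path); only the filesystem is needed here
fs, _ = fsspec.url_to_fs(base_path)

# fs.find lists files recursively; unstrip_protocol restores the URL scheme so
# pandas can open the file on any backend (local, s3, gcs, ...)
df = pd.read_parquet(fs.unstrip_protocol(fs.find(input_path)[0]))
```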
1 file reviewed, no comments
sarahyurick left a comment
I think this matches the style of the semantic tutorial nicely. Added some small requests.
| "\n", | ||
| "GPU accelerated implementation of a MinHash-LSH based fuzzy deduplication. For more information about semantic deduplication in NeMo Curator, refer to the [Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/index.html) section of the documentation page.\n", | ||
| "\n", | ||
| "The tutorial here shows how to run Fuzzy Duplication on text data by executing a 2 end to end workflows which does the following:\n", |
| "The tutorial here shows how to run Fuzzy Duplication on text data by executing a 2 end to end workflows which does the following:\n", | |
| "The tutorial here shows how to run Fuzzy Duplication on text data by executing 2 end to end workflows which does the following:\n", |
| "\n", | ||
| "The tutorial here shows how to run Fuzzy Duplication on text data by executing a 2 end to end workflows which does the following:\n", | ||
| "\n", | ||
| "1. Read original dataset\n", |
The above comment says we are executing 2 workflows, but then lists 7 steps here, which could be confusing.
| "6. During removal, reading the same file groups will give the same integer IDs, using the min/max ID values, we can find all corresponding duplicates in that range making the process faster.\n", | ||
| "\n", | ||
| "#### Performance Considerations\n", | ||
| "1. LSH - Configuring bands_per_iteration controls how many bands to process simultanesouly in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.\n", |
| "1. LSH - Configuring bands_per_iteration controls how many bands to process simultanesouly in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.\n", | |
| "1. LSH - Configuring `bands_per_iteration` controls how many bands to process simultaneously in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.\n", |
| "4. The removal workflow is CPU only and can be run on machines that don't have GPUs\n", | ||
| "\n", | ||
| "#### Hyperparameter Considerations\n", | ||
| "1. The current defaults for Fuzzy deduplication (260 hashes, 13 hashes per band) approximate finding documents with a jaccard similarity of 0.8. For more information on selecting the number of bands/hashes it's recommended to analyze the S curve and tolerable threshold for false postives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).\n", |
| "1. The current defaults for Fuzzy deduplication (260 hashes, 13 hashes per band) approximate finding documents with a jaccard similarity of 0.8. For more information on selecting the number of bands/hashes it's recommended to analyze the S curve and tolerable threshold for false postives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).\n", | |
| "1. The current defaults for fuzzy deduplication (260 hashes, 13 hashes per band) approximate finding documents with a Jaccard similarity of 0.8. For more information on selecting the number of bands/hashes, it's recommended to analyze the S curve and tolerable threshold for false positives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).\n", |
Do you think we would be able to include an example analysis in this tutorial?
Maybe in the step-by-step tutorial.
| "from nemo_curator.core.client import RayClient\n", | ||
| "\n", | ||
| "client = RayClient(num_cpus=64, num_gpus=2) # change as needed\n", | ||
| "client.start()\n", |
Missing a stop for this client.
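For example, the notebook could end with the matching call once both workflows have finished (RayClient's `stop()` appears in the sequence diagram above):

```python
# shut down the local Ray cluster started at the top of the notebook
client.stop()
```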
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", |
Lingering empty cell at the end of the notebook. Can you add a summary conclusion sentence/paragraph for the tutorial?
Description
Adds an initial tutorial notebook that showcases fuzzy deduplication using the end-to-end workflow API.
Usage
# Add snippet demonstrating usage
Checklist