
Conversation

@ayushdg (Contributor) commented Nov 18, 2025

Description

Adds an initial tutorial notebook that showcases fuzzy deduplication using the end-to-end workflow API.

Usage

# Add snippet demonstrating usage
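The usage snippet is still a placeholder; as a stop-gap, here is a rough sketch of the flow the notebook follows, pieced together from the class names and sequence diagram in this PR. The `RayClient` usage is taken from the notebook itself; the workflow import path and constructor arguments are not shown here and are left as placeholders, so treat this as an illustration rather than the confirmed API.

# Rough sketch only -- workflow import path and constructor arguments are placeholders,
# not the confirmed API; RayClient usage is copied from the notebook.
from nemo_curator.core.client import RayClient
# from nemo_curator... import FuzzyDeduplicationWorkflow, TextDuplicatesRemovalWorkflow

client = RayClient(num_cpus=64, num_gpus=2)  # change as needed
client.start()

# 1. Identify fuzzy duplicates (MinHash -> LSH -> connected components)
#    and write the list of duplicate IDs to the cache/output location.
fuzzy_workflow = FuzzyDeduplicationWorkflow(...)  # input/cache/output paths omitted here
fuzzy_workflow.run()

# 2. Remove the identified duplicates from the original dataset.
removal_workflow = TextDuplicatesRemovalWorkflow(...)  # points at the duplicate-ID list
removal_workflow.run()

client.stop()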

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@greptile-apps bot (Contributor) commented Nov 18, 2025

Greptile Overview

Greptile Summary

This PR adds a comprehensive Jupyter notebook tutorial for end-to-end fuzzy deduplication using the NeMo Curator API. The tutorial demonstrates a complete workflow for identifying and removing duplicate documents from the TinyStories dataset. It walks users through the entire pipeline: loading data, executing FuzzyDeduplicationWorkflow to identify duplicates through MinHash computation and LSH bucketing, followed by TextDuplicatesRemovalWorkflow to remove them. The notebook includes detailed analysis of intermediate results, visualization of duplicate document examples, and explanations of key hyperparameters and performance considerations. This tutorial integrates with the existing NeMo Curator framework by showcasing the high-level workflow API designed to simplify complex deduplication operations for users.

PR Description Notes:

  • Title has a typo: "Inital" should be "Initial"

Important Files Changed

Filename: tutorials/text/deduplication/fuzzy/fuzzy_e2e.ipynb
Score: 4/5
Overview: New comprehensive tutorial notebook demonstrating end-to-end fuzzy deduplication workflow with TinyStories dataset examples

Confidence score: 4/5

  • This PR is generally safe to merge with minor documentation improvements needed
  • Score reflects solid technical content; one point was deducted for multiple spelling errors that should be fixed to meet professional documentation standards
  • Pay close attention to the tutorial notebook for spelling corrections before merging

Sequence Diagram

sequenceDiagram
    participant User
    participant FuzzyDeduplicationWorkflow
    participant TextDuplicatesRemovalWorkflow
    participant RayClient
    participant IDGenerator
    participant MinHashStage
    participant LSHStage
    participant ConnectedComponentsStage
    participant RemovalStage
    participant DataStore

    User->>RayClient: "start()"
    RayClient-->>User: "Ray cluster started"
    
    User->>FuzzyDeduplicationWorkflow: "run()"
    FuzzyDeduplicationWorkflow->>IDGenerator: "create_id_generator_actor()"
    IDGenerator-->>FuzzyDeduplicationWorkflow: "ID generator ready"
    
    FuzzyDeduplicationWorkflow->>DataStore: "read input data"
    DataStore-->>FuzzyDeduplicationWorkflow: "text documents"
    
    FuzzyDeduplicationWorkflow->>IDGenerator: "assign unique integer IDs"
    IDGenerator-->>FuzzyDeduplicationWorkflow: "documents with _curator_dedup_id"
    
    FuzzyDeduplicationWorkflow->>MinHashStage: "compute MinHash signatures"
    MinHashStage-->>FuzzyDeduplicationWorkflow: "documents with _minhash_signature"
    
    FuzzyDeduplicationWorkflow->>LSHStage: "perform LSH bucketing"
    LSHStage-->>FuzzyDeduplicationWorkflow: "bucket_id to doc_id mappings"
    
    FuzzyDeduplicationWorkflow->>ConnectedComponentsStage: "convert buckets to edges and find connected components"
    ConnectedComponentsStage-->>FuzzyDeduplicationWorkflow: "duplicate groups with _duplicate_group_id"
    
    FuzzyDeduplicationWorkflow->>DataStore: "save duplicate IDs list"
    DataStore-->>FuzzyDeduplicationWorkflow: "FuzzyDuplicateIds saved"
    
    FuzzyDeduplicationWorkflow-->>User: "identification complete"
    
    User->>TextDuplicatesRemovalWorkflow: "run()"
    TextDuplicatesRemovalWorkflow->>DataStore: "read original input data"
    DataStore-->>TextDuplicatesRemovalWorkflow: "original documents"
    
    TextDuplicatesRemovalWorkflow->>DataStore: "read duplicate IDs list"
    DataStore-->>TextDuplicatesRemovalWorkflow: "IDs to remove"
    
    TextDuplicatesRemovalWorkflow->>RemovalStage: "filter out duplicate documents"
    RemovalStage-->>TextDuplicatesRemovalWorkflow: "deduplicated documents"
    
    TextDuplicatesRemovalWorkflow->>DataStore: "save deduplicated dataset"
    DataStore-->>TextDuplicatesRemovalWorkflow: "deduplicated dataset saved"
    
    TextDuplicatesRemovalWorkflow-->>User: "removal complete"
    
    User->>RayClient: "stop()"
    RayClient-->>User: "Ray cluster stopped"

@greptile-apps bot left a comment


1 file reviewed, 6 comments


"1. Read original dataset\n",
"2. Compute MinHashes signatures of these documents\n",
"3. Perform LSH - Group Minhashes into bands/buckets and shuffle these bands/buckets so that documents in the same bucket are in the same batch/file.\n",
"4. Convert the LSH outputs (bucket_id -> doc_id mapping) into a edgelist in preperation for connected components. \n",

syntax: Typo in 'preperation' should be 'preparation'

Suggested change
"4. Convert the LSH outputs (bucket_id -> doc_id mapping) into a edgelist in preperation for connected components. \n",
"4. Convert the LSH outputs (bucket_id -> doc_id mapping) into a edgelist in preparation for connected components. \n",

"6. During removal, reading the same file groups will give the same integer IDs, using the min/max ID values, we can find all corresponding duplicates in that range making the process faster.\n",
"\n",
"#### Performance Considerations\n",
"1. LSH - Configuring bands_per_iteration controls how many bands to process simultanesouly in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.\n",

syntax: Typo in 'simultanesouly' should be 'simultaneously'

Suggested change
"1. LSH - Configuring bands_per_iteration controls how many bands to process simultanesouly in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.\n",
"1. LSH - Configuring bands_per_iteration controls how many bands to process simultaneously in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.\n",

"4. The removal workflow is CPU only and can be run on machines that don't have GPUs\n",
"\n",
"#### Hyperparameter Considerations\n",
"1. The current defaults for Fuzzy deduplication (260 hashes, 13 hashes per band) approximate finding documents with a jaccard similarity of 0.8. For more information on selecting the number of bands/hashes it's recommended to analyze the S curve and tolerable threshold for false postives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).\n",

syntax: Typo in 'postives' should be 'positives'

Suggested change
"1. The current defaults for Fuzzy deduplication (260 hashes, 13 hashes per band) approximate finding documents with a jaccard similarity of 0.8. For more information on selecting the number of bands/hashes it's recommended to analyze the S curve and tolerable threshold for false postives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).\n",
"1. The current defaults for Fuzzy deduplication (260 hashes, 13 hashes per band) approximate finding documents with a jaccard similarity of0.8. For more information on selecting the number of bands/hashes it's recommended to analyze the S curve and tolerable threshold for false positives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).\n",

"### Looking at Intermediate Results and Output\n",
"\n",
"#### MinHash Results\n",
"1. `_curator_dedup_id` - The IDs assigned to this dataset on the fly during the intial read.\n",

syntax: Typo in 'intial' should be 'initial'

Suggested change
"1. `_curator_dedup_id` - The IDs assigned to this dataset on the fly during the intial read.\n",
"1. `_curator_dedup_id` - The IDs assigned to this dataset on the fly during the initial read.\n",

"source": [
"#### Advanced: Looking at examples of duplicate documents\n",
"\n",
"1. This analysis involves re-reading the input data with the same ID mapping that was used during duplicate identification.\n",

syntax: Typo in 'analsis' should be 'analysis'

"1. This analysis involves re-reading the input data with the same ID mapping that was used during duplicate identification.\n",
"2. Merging the input data with the connected components results on the `_curator_dedup_id` column to associate each document which the duplicate group it belongs to which can be used for further analysis.\n",
"\n",
"NOTE: This analsis approach is itended for smaller datasets and only works for cases where the connected components dataframe is small and fits comfortable in memory. It is not recommended for larger datasets."

syntax: Typo in 'itended' should be 'intended'

Suggested change
"NOTE: This analsis approach is itended for smaller datasets and only works for cases where the connected components dataframe is small and fits comfortable in memory. It is not recommended for larger datasets."
"NOTE: This analysis approach is intended for smaller datasets and only works for cases where the connected components dataframe is small and fits comfortable in memory. It is not recommended for larger datasets."

@praateekmahajan (Contributor) commented:

One suggestion that'll make it easier in the future to run this against cloud storage too. I have to make the same change for SemDedup, for which I'll create a PR.

Where you define your paths, you can do this:

base_path = "./"
input_path = os.path.join(base_path, "input")
output_path = os.path.join(base_path, "output")
cache_path = os.path.join(base_path, "cache")

fs = fsspec.url_to_fs(base_path)

And then where you have

df = pd.read_parquet(os.path.join(input_path, os.listdir(input_path)[0]))

Change it to

df = pd.read_parquet(fs.unstrip_protocol(fs.find(input_path)[0]))

This should also make the tutorial cloud compatible, as the user just has to change base_path and the rest of the tutorial should work as is.
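For example (bucket name hypothetical, and the matching fsspec backend such as s3fs would need to be installed), the only change for cloud storage would then be:

base_path = "s3://my-bucket/fuzzy-dedup-tutorial"  # everything else stays the same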

@ayushdg requested a review from sarahyurick November 20, 2025 17:35
@greptile-apps bot left a comment


1 file reviewed, no comments


@sarahyurick (Contributor) left a comment


I think this matches the style of the semantic tutorial nicely. Added some small requests.

"\n",
"GPU accelerated implementation of a MinHash-LSH based fuzzy deduplication. For more information about semantic deduplication in NeMo Curator, refer to the [Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/index.html) section of the documentation page.\n",
"\n",
"The tutorial here shows how to run Fuzzy Duplication on text data by executing a 2 end to end workflows which does the following:\n",

Suggested change
"The tutorial here shows how to run Fuzzy Duplication on text data by executing a 2 end to end workflows which does the following:\n",
"The tutorial here shows how to run Fuzzy Duplication on text data by executing 2 end to end workflows which does the following:\n",

"\n",
"The tutorial here shows how to run Fuzzy Duplication on text data by executing a 2 end to end workflows which does the following:\n",
"\n",
"1. Read original dataset\n",

The above comment says we are executing 2 workflows, but then lists 7 steps here, which could be confusing.

"6. During removal, reading the same file groups will give the same integer IDs, using the min/max ID values, we can find all corresponding duplicates in that range making the process faster.\n",
"\n",
"#### Performance Considerations\n",
"1. LSH - Configuring bands_per_iteration controls how many bands to process simultanesouly in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.\n",

Suggested change
"1. LSH - Configuring bands_per_iteration controls how many bands to process simultanesouly in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.\n",
"1. LSH - Configuring `bands_per_iteration` controls how many bands to process simultaneously in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.\n",

"4. The removal workflow is CPU only and can be run on machines that don't have GPUs\n",
"\n",
"#### Hyperparameter Considerations\n",
"1. The current defaults for Fuzzy deduplication (260 hashes, 13 hashes per band) approximate finding documents with a jaccard similarity of 0.8. For more information on selecting the number of bands/hashes it's recommended to analyze the S curve and tolerable threshold for false postives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).\n",

Suggested change
"1. The current defaults for Fuzzy deduplication (260 hashes, 13 hashes per band) approximate finding documents with a jaccard similarity of 0.8. For more information on selecting the number of bands/hashes it's recommended to analyze the S curve and tolerable threshold for false postives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).\n",
"1. The current defaults for fuzzy deduplication (260 hashes, 13 hashes per band) approximate finding documents with a Jaccard similarity of 0.8. For more information on selecting the number of bands/hashes, it's recommended to analyze the S curve and tolerable threshold for false positives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).\n",

Do you think we would be able to include an example analysis in this tutorial?


Maybe in the step-by-step tutorial.

"from nemo_curator.core.client import RayClient\n",
"\n",
"client = RayClient(num_cpus=64, num_gpus=2) # change as needed\n",
"client.start()\n",

Missing a stop for this client.
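For reference, the matching teardown (mirroring the stop() call in the sequence diagram above) would be a final cell along the lines of:

client.stop()  # shut down the Ray cluster once both workflows have completed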

]
},
{
"cell_type": "code",

Lingering empty cell at the end of the notebook. Can you add a summary conclusion sentence/paragraph for the tutorial?
