NVIDIA-NeMo · huvunvidia · Nov 24, 2025 · Nov 26, 2025 · Nov 26, 2025 · Nov 26, 2025
diff --git a/docs/about/concepts/video/abstractions.md b/docs/about/concepts/video/abstractions.md
@@ -27,7 +27,7 @@ NeMo Curator introduces core abstractions to organize and scale video curation w
 A pipeline orchestrates stages into an end-to-end workflow. Key characteristics:
 
 - **Stage Sequence**: Stages must follow a logical order where each stage's output feeds into the next
-- **Input Configuration**: Specifies the data source location
+- **Input Configuration**: Specify the data source location
 - **Stage Configuration**: Stages accept their own parameters, including model paths and algorithm settings
 - **Execution Mode**: Supports streaming and batch processing through the executor
 

diff --git a/docs/about/concepts/video/architecture.md b/docs/about/concepts/video/architecture.md
@@ -13,7 +13,7 @@ only: not ga
 
 # Architecture
 
-NeMo Curator's video curation system builds on Ray, a distributed computing framework that enables scalable, high-throughput data processing across clusters of machines.
+NeMo Curator's video curation system builds on Ray, a distributed framework for scalable, high-throughput data processing across machine clusters.
 
 ## Ray Foundation
 

diff --git a/docs/about/concepts/video/data-flow.md b/docs/about/concepts/video/data-flow.md
@@ -15,11 +15,11 @@ modality: "video-only"
 Understanding how data moves through NeMo Curator's video curation pipelines is key to optimizing performance and resource usage.
 
 - Data moves between stages via Ray's distributed object store, enabling efficient, in-memory transfer between distributed actors.
-- In streaming mode, the executor returns final stage outputs while intermediate state stays in memory, reducing I/O overhead and improving throughput.
+- In streaming mode (where stages operate continuously rather than in batches), the executor returns only final-stage outputs while keeping intermediate state in memory. This reduces I/O overhead and significantly improves throughput.
 - The auto-scaling component continuously balances resources to maximize pipeline throughput, dynamically allocating workers to stages as needed.
 - Writer stages persist outputs at the end of the pipeline, including clip media, embeddings (pickle and parquet variants), and metadata JSON files.
 
-This architecture enables efficient processing of large-scale video datasets, with minimal data movement and optimal use of available hardware.
+Together, these components enable efficient processing of large-scale video datasets with minimal data movement and optimal use of available hardware.
 
 ## Writer Output Layout
 
@@ -28,10 +28,10 @@ Writer stages produce the following directories under the configured output path
 - `clips/`: MP4 clip files
 - `filtered_clips/`: MP4 files for filtered clips
 - `previews/`: WebP preview images for windows
-- `metas/v0/`: Per-clip JSON metadata
+- `metas/v0/`: Per-clip JSON metadata files
 - `iv2_embd/`: Per-clip embeddings (pickle) for InternVideo2
-- `iv2_embd_parquet/`: Per-video embeddings (parquet) for InternVideo2
+- `iv2_embd_parquet/`: Aggregated per-video embeddings (parquet) for InternVideo2
 - `ce1_embd/`: Per-clip embeddings (pickle) for Cosmos-Embed1
-- `ce1_embd_parquet/`: Per-video embeddings (parquet) for Cosmos-Embed1
-- `processed_videos/`: Per-video JSON metadata
+- `ce1_embd_parquet/`: Aggregated per-video embeddings (parquet) for Cosmos-Embed1
+- `processed_videos/`: Per-video JSON metadata files
 - `processed_clip_chunks/`: Per-clip-chunk JSON statistics
diff --git a/docs/about/concepts/video/index.md b/docs/about/concepts/video/index.md
@@ -39,7 +39,7 @@ Stages, pipelines, and execution modes in video curation workflows
 :link: about-concepts-video-data-flow
 :link-type: ref
 
-How data moves through the system, from ingestion to output
+How data moves through the system from ingestion to output
 :::
 
 ::::
@@ -58,7 +58,7 @@ The video curation concepts build on NVIDIA NeMo Curator's core infrastructure c
 :::{grid-item-card} {octicon}`database;1.5em;sd-mr-1` Memory Management
 :link: reference-infra-memory-management
 :link-type: ref
-Optimize memory usage when processing large datasets
+Optimize memory usage for large datasets
 +++
 {bdg-secondary}`partitioning`
 {bdg-secondary}`batching`
@@ -78,7 +78,7 @@ Leverage NVIDIA GPU acceleration for faster data processing
 :::{grid-item-card} {octicon}`sync;1.5em;sd-mr-1` Resumable Processing
 :link: reference-infra-resumable-processing
 :link-type: ref
-Continue interrupted operations across large datasets
+Continue interrupted operations on large datasets
 +++
 {bdg-secondary}`checkpoints`
 {bdg-secondary}`recovery`

diff --git a/docs/curate-text/process-data/deduplication/exact.md b/docs/curate-text/process-data/deduplication/exact.md
@@ -1,3 +1,4 @@
+---
 description: "Identify and remove exact duplicates using MD5 hashing in a Ray-based workflow"
 categories: ["how-to-guides"]
 tags: ["exact-dedup", "hashing", "md5", "gpu", "ray"]
@@ -110,6 +111,8 @@ exact_workflow = ExactDeduplicationWorkflow(
 exact_workflow.run()
 ```
 
+**Note**: `perform_removal` is reserved for a future feature; always set it to `False`. Exact-duplicate removal is performed with `TextDuplicatesRemovalWorkflow`.
+
 :::
 
 ::::

diff --git a/docs/curate-text/process-data/deduplication/fuzzy.md b/docs/curate-text/process-data/deduplication/fuzzy.md
@@ -187,7 +187,7 @@ Configure fuzzy deduplication using these key parameters:
 
 ### Similarity Threshold
 
-Control the strictness of matching with `num_bands` and `minhashes_per_band`:
+Control matching strictness with `num_bands` and `minhashes_per_band`:
 
 - **More strict matching**: Increase `num_bands` or decrease `minhashes_per_band`
 - **Less strict matching**: Decrease `num_bands` or increase `minhashes_per_band`

diff --git a/docs/curate-text/process-data/deduplication/index.md b/docs/curate-text/process-data/deduplication/index.md
@@ -418,7 +418,7 @@ For large-scale duplicate removal, persist the ID Generator for consistent docum
 
 ```python
 from nemo_curator.stages.deduplication.id_generator import (
-    create_id_generator_actor, 
+    create_id_generator_actor,
     write_id_generator_to_disk,
     kill_id_generator_actor
 )

diff --git a/docs/curate-text/process-data/deduplication/semdedup.md b/docs/curate-text/process-data/deduplication/semdedup.md
@@ -323,6 +323,7 @@ workflow = TextSemanticDeduplicationWorkflow(
 - Ensure compatibility with your data type
 - Adjust `embedding_model_inference_batch_size` for memory requirements
 - Choose models appropriate for your language or domain
+- Avoid generic decoder-only LLMs (e.g., OPT/GPT) for embeddings; prefer models trained for sentence embeddings (e.g., E5/BGE/SBERT)
 :::
 
 :::{dropdown} Advanced Configuration
@@ -362,7 +363,7 @@ workflow = TextSemanticDeduplicationWorkflow(
 
 The semantic deduplication process produces the following directory structure in your configured `cache_path`:
 
-```s
+```text
 cache_path/
 ├── embeddings/                           # Embedding outputs
 │   └── *.parquet                         # Parquet files containing document embeddings
@@ -394,8 +395,8 @@ The workflow produces these output files:
    - `embs_by_nearest_center/`: Parquet files containing cluster members
    - Format: Parquet files with columns: `[id_column, embedding_column, cluster_id]`
 
-3. **Deduplicated Results** (`output_path/duplicates/*.parquet`):
-   - Final output containing document IDs to remove after deduplication
+3. **Duplicate IDs** (`output_path/duplicates/*.parquet`):
+   - IDs of documents identified as duplicates for removal
    - Format: Parquet file with columns: `["id"]`
    - **Important**: Contains only the IDs of documents to remove, not the full document content
    - When `perform_removal=True`, clean dataset is saved to `output_path/deduplicated/`

diff --git a/docs/curate-text/process-data/language-management/language.md b/docs/curate-text/process-data/language-management/language.md
@@ -203,7 +203,7 @@ pipeline.add_stage(create_extract_language_fields_stage(min_confidence=0.7))
 A higher confidence score indicates greater certainty in the language identification. The `ScoreFilter` automatically filters documents below your specified `min_langid_score` threshold. The `extract_language_fields` stage shows how to further parse results and apply a higher threshold if needed.
 
 :::{note}
-Pipeline outputs may use the `language` field differently depending on the stage:
+Pipeline outputs may use the `language` field differently depending on the stage. For example:
 
 - In the FastText classification path (`ScoreFilter(FastTextLangId)`), the selected `score_field` (often `language`) stores a string representation of a list: `[score, code]`.
 - In HTML extraction pipelines (for example, Common Crawl), CLD2 assigns a language name (for example, "ENGLISH") to the `language` column.
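Reviewer sketch for the note in this hunk: since the FastText path is described as storing a string representation of `[score, code]`, downstream code has to parse that string before it can compare scores. The helper names and the sample value below are hypothetical, and the sketch assumes the language code inside the string is quoted so that `ast.literal_eval` can read it; actual field contents should be checked against real pipeline output.

```python
import ast


def parse_langid_field(value: str) -> tuple[float, str]:
    """Parse a string such as "[0.98, 'EN']" into a (score, code) pair.

    Hypothetical helper: assumes the field holds a Python-literal list.
    """
    score, code = ast.literal_eval(value)
    return float(score), str(code)


def passes_threshold(value: str, min_confidence: float = 0.7) -> bool:
    """Return True when the parsed score meets the confidence threshold."""
    score, _ = parse_langid_field(value)
    return score >= min_confidence


# Illustrative value only; not taken from real pipeline output.
sample = "[0.98, 'EN']"
score, code = parse_langid_field(sample)
```

`ast.literal_eval` is used rather than `eval` because it only accepts literals, which is the safer choice for data read from files.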

diff --git a/docs/curate-text/process-data/quality-assessment/classifier.md b/docs/curate-text/process-data/quality-assessment/classifier.md
@@ -28,47 +28,47 @@ NeMo Curator supports a variety of classifier models for different filtering and
 * - FastTextQualityFilter
   - fastText (binary classifier)
   - Quality filtering, high/low quality document classification (available as filter, not distributed classifier)
-  - https://fasttext.cc/
+  - [fastText](https://fasttext.cc/)
 * - FastTextLangId
   - fastText (language identification)
   - Language identification (available as filter, not distributed classifier)
-  - https://fasttext.cc/docs/en/language-identification.html
+  - [fastText LangID](https://fasttext.cc/docs/en/language-identification.html)
 * - QualityClassifier
   - DeBERTa (transformers, HF)
   - Document quality classification (multi-class, e.g., for curation)
-  - https://huggingface.co/nvidia/quality-classifier-deberta
+  - [nvidia/quality-classifier-deberta](https://huggingface.co/nvidia/quality-classifier-deberta)
 * - DomainClassifier
   - DeBERTa (transformers, HF)
   - Domain classification (English)
-  - https://huggingface.co/nvidia/domain-classifier
+  - [nvidia/domain-classifier](https://huggingface.co/nvidia/domain-classifier)
 * - MultilingualDomainClassifier
   - mDeBERTa (transformers, HF)
   - Domain classification (multilingual, 52 languages)
-  - https://huggingface.co/nvidia/multilingual-domain-classifier
+  - [nvidia/multilingual-domain-classifier](https://huggingface.co/nvidia/multilingual-domain-classifier)
 * - ContentTypeClassifier
   - DeBERTa (transformers, HF)
   - Content type classification (11 speech types)
-  - https://huggingface.co/nvidia/content-type-classifier-deberta
+  - [nvidia/content-type-classifier-deberta](https://huggingface.co/nvidia/content-type-classifier-deberta)
 * - AegisClassifier
   - LlamaGuard-7b (LLM, PEFT, HF)
   - Safety classification (AI content safety, requires access to LlamaGuard-7b)
-  - https://huggingface.co/meta-llama/LlamaGuard-7b
+  - [meta-llama/LlamaGuard-7b](https://huggingface.co/meta-llama/LlamaGuard-7b)
 * - InstructionDataGuardClassifier
   - Custom neural net (used with Aegis)
   - Detects instruction data poisoning
-  - https://huggingface.co/nvidia/instruction-data-guard
+  - [nvidia/instruction-data-guard](https://huggingface.co/nvidia/instruction-data-guard)
 * - FineWebEduClassifier
   - SequenceClassification (transformers, HF)
   - Educational content quality scoring (FineWeb)
-  - https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier
+  - [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)
 * - FineWebMixtralEduClassifier
   - SequenceClassification (transformers, HF)
   - Educational content quality scoring (Mixtral variant)
-  - https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier
+  - [nvidia/nemocurator-fineweb-mixtral-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier)
 * - FineWebNemotronEduClassifier
   - SequenceClassification (transformers, HF)
   - Educational content quality scoring (Nemotron-4 variant)
-  - https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier
+  - [nvidia/nemocurator-fineweb-nemotron-4-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier)
 ```
 
 ## How It Works
@@ -109,7 +109,6 @@ You can prepare training data using Python scripts:
 ```python
 from nemo_curator.stages.text.io.reader import JsonlReader
 from nemo_curator.pipeline import Pipeline
-import random
 
 # Sample from low-quality dataset (e.g., raw Common Crawl)
 def sample_documents(input_path, output_path, num_samples, label):

diff --git a/docs/curate-text/process-data/quality-assessment/distributed-classifier.md b/docs/curate-text/process-data/quality-assessment/distributed-classifier.md
@@ -122,7 +122,7 @@ results = pipeline.run()  # Uses XennaExecutor by default
 The exact label categories returned by the Quality Classifier depend on the model configuration. Check the prediction column in your results to see the available labels for filtering with the `filter_by` parameter.
 :::
 
-### AEGIS Safety Model
+### AEGIS Safety Classifier
 
 The AEGIS classifier detects unsafe content across 13 critical risk categories. It requires a HuggingFace token for access to Llama Guard.
 

diff --git a/docs/curate-text/process-data/quality-assessment/index.md b/docs/curate-text/process-data/quality-assessment/index.md
@@ -14,7 +14,7 @@ modality: "text-only"
 
 Score and remove low-quality content using heuristics and ML classifiers to prepare your data for model training using NeMo Curator's tools and utilities.
 
-Large datasets often contain many documents considered to be "low quality." In this context, "low quality" data simply means data we don't want a downstream model to learn from, and "high quality" data is data that we do want a downstream model to learn from. The metrics that define quality can vary widely.
+Large datasets often contain many documents considered "low quality." In this context, "low quality" means data we do not want downstream models to learn from, and "high quality" is data we do want them to learn from. The metrics that define quality can vary widely.
 
 ## How It Works
 
@@ -112,7 +112,7 @@ You can combine these modules in pipelines:
 ```python
 from nemo_curator.pipeline import Pipeline
 from nemo_curator.stages.text.modules import Score, Filter
-
+# Assume `word_counter` and `symbol_counter` are callables that return numeric scores
 pipeline = Pipeline(name="multi_stage_filtering")
 pipeline.add_stage(Score(word_counter, score_field="word_count"))
 pipeline.add_stage(Score(symbol_counter, score_field="symbol_ratio"))
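Reviewer sketch for the comment added in this hunk: the snippet asks readers to assume `word_counter` and `symbol_counter` are callables returning numeric scores, so it may help to see what such callables could look like. Both definitions below are hypothetical, the choice of symbols is an assumption, and whether `Score` expects exactly this one-string-in, one-number-out signature is not confirmed by the diff.

```python
# Hypothetical scoring callables; names match the doc snippet, bodies are assumed.
SYMBOLS = ("#", "...", "…")  # assumed symbol set, chosen for illustration


def word_counter(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def symbol_counter(text: str) -> float:
    """Ratio of symbol occurrences to words; 0.0 for empty text."""
    words = text.split()
    if not words:
        return 0.0
    n_symbols = sum(text.count(symbol) for symbol in SYMBOLS)
    return n_symbols / len(words)
```

Guarding the empty-text case avoids a division by zero on blank documents.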

diff --git a/docs/curate-text/process-data/specialized-processing/code.md b/docs/curate-text/process-data/specialized-processing/code.md
@@ -82,7 +82,7 @@ NeMo Curator offers several specialized filters for code content:
 | **PythonCommentToCodeFilter** | Filters Python files based on comment-to-code ratio | `min_comment_to_code_ratio`, `max_comment_to_code_ratio` | min=0.01, max=0.85 |
 | **GeneralCommentToCodeFilter** | Similar filter for other languages | `language`, `min_comment_to_code_ratio`, `max_comment_to_code_ratio` | min=0.01, max=0.85 |
 
-The comment-to-code ratio is an important metric for code quality. Low comment ratios may indicate poor documentation, while high comment ratios might suggest automatically generated code or tutorials:
+The comment-to-code ratio is an important metric for code quality. Low comment ratios may indicate poor documentation, while high comment ratios might suggest automatically generated code or tutorials. Adjust these ratios for the specific programming language:
 
 ```python
 # For Python files with docstrings
@@ -112,14 +112,16 @@ The `GeneralCommentToCodeFilter` supports various language MIME types:
 - `text/javascript` for JavaScript
 - `text/x-ruby` for Ruby
 - `text/x-csharp` for C#
+- `text/x-c` for C
+- `text/x-asm` for Assembly
 
 ### Code Structure Filters
 
 | Filter | Description | Key Parameters | Default Values |
 |--------|-------------|----------------|---------------|
-| **NumberOfLinesOfCodeFilter** | Filters based on the number of lines | `min_lines`, `max_lines` | min=10, max=20000 |
-| **AlphaFilter** | Ensures code has sufficient alphabetic content | `min_alpha_ratio` | 0.25 |
-| **TokenizerFertilityFilter** | Measures token efficiency | `path_to_tokenizer` (required), `min_char_to_token_ratio` | ratio=2.5 |
+| **NumberOfLinesOfCodeFilter** | Filters based on the number of lines | `min_lines`, `max_lines` | min_lines=10, max_lines=20000 |
+| **AlphaFilter** | Ensures code has sufficient alphabetic content | `min_alpha_ratio` | min_alpha_ratio=0.25 |
+| **TokenizerFertilityFilter** | Measures token efficiency | `path_to_tokenizer` (required), `min_char_to_token_ratio` | min_char_to_token_ratio=2.5 |
 
 Code structure filters help identify problematic patterns:
 
@@ -237,6 +239,9 @@ When filtering code datasets, consider these best practices:
 1. **Language-specific configurations**: Adjust thresholds based on the programming language
 
    ```python
+   from nemo_curator.stages.text.modules import ScoreFilter
+   from nemo_curator.stages.text.filters import PythonCommentToCodeFilter, GeneralCommentToCodeFilter
+
    # Python tends to have more comments than C
    python_comment_filter = ScoreFilter(
        score_fn=PythonCommentToCodeFilter(min_comment_to_code_ratio=0.05),
@@ -251,6 +256,9 @@ When filtering code datasets, consider these best practices:
 2. **Preserve code structure**: Ensure filters don't inadvertently remove valid coding patterns
 
    ```python
+   from nemo_curator.stages.text.modules import ScoreFilter
+   from nemo_curator.stages.text.filters import GeneralCommentToCodeFilter
+
    # Some languages naturally have low comment ratios
    assembly_filter = ScoreFilter(
        score_fn=GeneralCommentToCodeFilter(
@@ -267,6 +275,7 @@ When filtering code datasets, consider these best practices:
    # First check if the content is actually Python using FastText language ID
    from nemo_curator.stages.text.filters import FastTextLangId
    from nemo_curator.pipeline import Pipeline
+   from nemo_curator.stages.text.modules import ScoreFilter
 
    # Create pipeline for Python code filtering with language detection
    pipeline = Pipeline(name="python_code_filtering")
@@ -321,6 +330,7 @@ When filtering code datasets, consider these best practices:
 ```python
 from nemo_curator.pipeline import Pipeline
 from nemo_curator.stages.text.modules import ScoreFilter
+from nemo_curator.stages.text.filters import NumberOfLinesOfCodeFilter, XMLHeaderFilter, GeneralCommentToCodeFilter
 
 # Create pipeline to filter non-functional code snippets
 pipeline = Pipeline(name="code_cleaning")
@@ -351,6 +361,7 @@ pipeline.add_stage(ScoreFilter(
 ```python
 from nemo_curator.pipeline import Pipeline
 from nemo_curator.stages.text.modules import ScoreFilter
+from nemo_curator.stages.text.filters import AlphaFilter, TokenizerFertilityFilter, HTMLBoilerplateFilter
 
 # Create pipeline for training data preparation
 pipeline = Pipeline(name="training_data_prep")
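Reviewer sketch for the comment-to-code ratio discussed in this file: the function below is a simplified, standard-library illustration of the idea, not the implementation behind `PythonCommentToCodeFilter`. It counts only `#` comments (docstrings are ignored), and the function names are hypothetical. The default bounds reuse the `min=0.01, max=0.85` values from the table in the diff.

```python
import io
import tokenize


def comment_to_code_ratio(source: str) -> float:
    """Fraction of characters in `source` that belong to `#` comments.

    Simplified assumption: docstrings are not counted as comments.
    """
    if not source:
        return 0.0
    comment_chars = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comment_chars += len(tok.string)
    return comment_chars / len(source)


def within_bounds(source: str, min_ratio: float = 0.01, max_ratio: float = 0.85) -> bool:
    """Keep a file only when its comment ratio falls inside the bounds."""
    return min_ratio <= comment_to_code_ratio(source) <= max_ratio
```

`tokenize` can raise an error on syntactically incomplete input, so a production filter would need to handle files that do not tokenize cleanly.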

diff --git a/docs/curate-text/process-data/specialized-processing/index.md b/docs/curate-text/process-data/specialized-processing/index.md
@@ -16,7 +16,7 @@ Domain-specific processing for code and advanced curation tasks using NeMo Curat
 
 This section covers advanced processing techniques for specific data types and use cases that require specialized handling beyond general text processing. These tools are designed for specific domains like programming content.
 
-## How it Works
+## How It Works
 
 Specialized processing modules in NeMo Curator are designed for specific data types and use cases:
 

diff --git a/docs/curate-video/tutorials/split-dedup.md b/docs/curate-video/tutorials/split-dedup.md
@@ -149,6 +149,12 @@ pipe.add_stage(
 pipe.run()
 ```
 
+`which_to_keep` selects the representative within each cluster: `"hard"` keeps outliers far from the centroid, `"easy"` keeps the items nearest to the centroid, and `"random"` ignores distance and picks randomly.
+
+`sim_metric` sets the distance used for similarity: `"cosine"` uses cosine distance (1 − cosine similarity), while `"l2"` uses Euclidean distance.
+
+`pairwise_batch_size` controls how many items are processed per GPU batch during pairwise similarity; larger values can be faster but require more GPU memory.
+
 ---
 
 ## 3. Inspect Results
@@ -167,8 +173,8 @@ After duplicate removal, export curated clips and metadata for training. Common
 Video-specific pointers:
 
 - Use `ClipWriterStage` path helpers to locate outputs: `nemo_curator/stages/video/io/clip_writer.py`.
-  - Processed videos: `get_output_path_processed_videos(OUT_DIR)`
-  - Clip chunks and previews: `get_output_path_processed_clip_chunks(OUT_DIR)`, `get_output_path_previews(OUT_DIR)`
+  - Processed videos: `get_output_path_processed_videos(${OUT_DIR})`
+  - Clip chunks and previews: `get_output_path_processed_clip_chunks(${OUT_DIR})`, `get_output_path_previews(${OUT_DIR})`
   - Embeddings parquet: `${OUT_DIR}/iv2_embd_parquet` (or `${OUT_DIR}/ce1_embd_parquet`)
 
 ### Example Export
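Reviewer sketch for the `which_to_keep` and `sim_metric` notes added in the first hunk of this file: the code below illustrates the two distance definitions and how each `which_to_keep` option could order cluster members relative to the centroid. It is a plain-Python illustration under stated assumptions, not the stage's GPU implementation; `order_cluster` and its `seed` parameter are hypothetical, and vectors are assumed to be non-zero.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def order_cluster(members, centroid, which_to_keep="hard", sim_metric="cosine", seed=0):
    """Order members so the preferred representative comes first (hypothetical helper)."""
    metrics = {"cosine": cosine_distance, "l2": l2_distance}
    if sim_metric not in metrics:
        raise ValueError(f"unknown sim_metric: {sim_metric}")
    if which_to_keep == "random":
        shuffled = list(members)
        random.Random(seed).shuffle(shuffled)  # distance is ignored
        return shuffled
    if which_to_keep not in ("hard", "easy"):
        raise ValueError(f"unknown which_to_keep: {which_to_keep}")
    dist = metrics[sim_metric]
    # "hard": farthest from the centroid first; "easy": nearest first.
    return sorted(members, key=lambda m: dist(m, centroid), reverse=(which_to_keep == "hard"))
```

With centroid `(1, 0)` and members `(1, 0)`, `(0, 1)`, `(1, 1)`, the `"easy"` ordering starts at `(1, 0)` and the `"hard"` ordering starts at `(0, 1)`.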