Skip to content
Open
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion docs/about/concepts/video/abstractions.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ NeMo Curator introduces core abstractions to organize and scale video curation w
A pipeline orchestrates stages into an end-to-end workflow. Key characteristics:

- **Stage Sequence**: Stages must follow a logical order where each stage's output feeds into the next
- **Input Configuration**: Specifies the data source location
- **Input Configuration**: Specify the data source location
- **Stage Configuration**: Stages accept their own parameters, including model paths and algorithm settings
- **Execution Mode**: Supports streaming and batch processing through the executor

Expand Down
2 changes: 1 addition & 1 deletion docs/about/concepts/video/architecture.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ only: not ga

# Architecture

NeMo Curator's video curation system builds on Ray, a distributed computing framework that enables scalable, high-throughput data processing across clusters of machines.
NeMo Curator's video curation system builds on Ray, a distributed framework for scalable, highthroughput data processing across machine clusters.

## Ray Foundation

Expand Down
12 changes: 6 additions & 6 deletions docs/about/concepts/video/data-flow.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,11 +15,11 @@ modality: "video-only"
Understanding how data moves through NeMo Curator's video curation pipelines is key to optimizing performance and resource usage.

- Data moves between stages via Ray's distributed object store, enabling efficient, in-memory transfer between distributed actors.
- In streaming mode, the executor returns final stage outputs while intermediate state stays in memory, reducing I/O overhead and improving throughput.
- In streaming mode (where stages operate continuously rather than in batches), the executor returns only final-stage outputs while keeping intermediate state in memory. This reduces I/O overhead and significantly improves throughput.
- The auto-scaling component continuously balances resources to maximize pipeline throughput, dynamically allocating workers to stages as needed.
- Writer stages persist outputs at the end of the pipeline, including clip media, embeddings (pickle and parquet variants), and metadata JSON files.

This architecture enables efficient processing of large-scale video datasets, with minimal data movement and optimal use of available hardware.
Together, these components enable efficient processing of large-scale video datasets with minimal data movement and optimal use of available hardware.

## Writer Output Layout

Expand All @@ -28,10 +28,10 @@ Writer stages produce the following directories under the configured output path
- `clips/`: MP4 clip files
- `filtered_clips/`: MP4 files for filtered clips
- `previews/`: WebP preview images for windows
- `metas/v0/`: Per-clip JSON metadata
- `metas/v0/`: Per-clip JSON metadata files
- `iv2_embd/`: Per-clip embeddings (pickle) for InternVideo2
- `iv2_embd_parquet/`: Per-video embeddings (parquet) for InternVideo2
- `iv2_embd_parquet/`: Aggregated per-video embeddings (parquet) for InternVideo2
- `ce1_embd/`: Per-clip embeddings (pickle) for Cosmos-Embed1
- `ce1_embd_parquet/`: Per-video embeddings (parquet) for Cosmos-Embed1
- `processed_videos/`: Per-video JSON metadata
- `ce1_embd_parquet/`: Aggregated per-video embeddings (parquet) for Cosmos-Embed1
- `processed_videos/`: Per-video JSON metadata files
- `processed_clip_chunks/`: Per-clip-chunk JSON statistics
6 changes: 3 additions & 3 deletions docs/about/concepts/video/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@ Stages, pipelines, and execution modes in video curation workflows
:link: about-concepts-video-data-flow
:link-type: ref

How data moves through the system, from ingestion to output
How data moves through the system from ingestion to output
:::

::::
Expand All @@ -58,7 +58,7 @@ The video curation concepts build on NVIDIA NeMo Curator's core infrastructure c
:::{grid-item-card} {octicon}`database;1.5em;sd-mr-1` Memory Management
:link: reference-infra-memory-management
:link-type: ref
Optimize memory usage when processing large datasets
Optimize memory usage for large datasets
+++
{bdg-secondary}`partitioning`
{bdg-secondary}`batching`
Expand All @@ -78,7 +78,7 @@ Leverage NVIDIA GPU acceleration for faster data processing
:::{grid-item-card} {octicon}`sync;1.5em;sd-mr-1` Resumable Processing
:link: reference-infra-resumable-processing
:link-type: ref
Continue interrupted operations across large datasets
Continue interrupted operations on large datasets
+++
{bdg-secondary}`checkpoints`
{bdg-secondary}`recovery`
Expand Down
3 changes: 3 additions & 0 deletions docs/curate-text/process-data/deduplication/exact.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
---
description: "Identify and remove exact duplicates using MD5 hashing in a Ray-based workflow"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally haven't tested TextDuplicatesRemovalWorkflow with exact deduplication. Were you able to verify it?

categories: ["how-to-guides"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This page still needs more updates imo.

tags: ["exact-dedup", "hashing", "md5", "gpu", "ray"]
Expand Down Expand Up @@ -110,6 +111,8 @@ exact_workflow = ExactDeduplicationWorkflow(
exact_workflow.run()
```

**Note**: `perform_removal` is reserved for future feature, always set to `False`. Exact removal is performed with `TextDuplicatesRemovalWorkflow`.

:::

::::
Expand Down
2 changes: 1 addition & 1 deletion docs/curate-text/process-data/deduplication/fuzzy.md
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we confirmed cloud storage works with the deduplication modules?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested with cloud storage due to it's require more complicated settings.
Although I have discussed to Lawrence Lane, due to the cloud storage is not Dedup-exclusive, but for text-curator in general, it is not needed in Dedup docs. This would help the Dedup docs more concise and to the point.

Original file line number Diff line number Diff line change
Expand Up @@ -187,7 +187,7 @@ Configure fuzzy deduplication using these key parameters:

### Similarity Threshold

Control the strictness of matching with `num_bands` and `minhashes_per_band`:
Control matching strictness with `num_bands` and `minhashes_per_band`:

- **More strict matching**: Increase `num_bands` or decrease `minhashes_per_band`
- **Less strict matching**: Decrease `num_bands` or increase `minhashes_per_band`
Expand Down
2 changes: 1 addition & 1 deletion docs/curate-text/process-data/deduplication/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -418,7 +418,7 @@ For large-scale duplicate removal, persist the ID Generator for consistent docum

```python
from nemo_curator.stages.deduplication.id_generator import (
create_id_generator_actor,
create_id_generator_actor,
write_id_generator_to_disk,
kill_id_generator_actor
)
Expand Down
7 changes: 4 additions & 3 deletions docs/curate-text/process-data/deduplication/semdedup.md
Original file line number Diff line number Diff line change
Expand Up @@ -323,6 +323,7 @@ workflow = TextSemanticDeduplicationWorkflow(
- Ensure compatibility with your data type
- Adjust `embedding_model_inference_batch_size` for memory requirements
- Choose models appropriate for your language or domain
- Avoid generic decoder-only LLMs (e.g., OPT/GPT) for embeddings; prefer models trained for sentence embeddings (e.g., E5/BGE/SBERT)
:::

:::{dropdown} Advanced Configuration
Expand Down Expand Up @@ -362,7 +363,7 @@ workflow = TextSemanticDeduplicationWorkflow(

The semantic deduplication process produces the following directory structure in your configured `cache_path`:

```s
```text
cache_path/
├── embeddings/ # Embedding outputs
│ └── *.parquet # Parquet files containing document embeddings
Expand Down Expand Up @@ -394,8 +395,8 @@ The workflow produces these output files:
- `embs_by_nearest_center/`: Parquet files containing cluster members
- Format: Parquet files with columns: `[id_column, embedding_column, cluster_id]`

3. **Deduplicated Results** (`output_path/duplicates/*.parquet`):
- Final output containing document IDs to remove after deduplication
3. **Duplicate IDs** (`output_path/duplicates/*.parquet`):
- IDs of documents identified as duplicates for removal
- Format: Parquet file with columns: `["id"]`
- **Important**: Contains only the IDs of documents to remove, not the full document content
- When `perform_removal=True`, clean dataset is saved to `output_path/deduplicated/`
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -203,7 +203,7 @@ pipeline.add_stage(create_extract_language_fields_stage(min_confidence=0.7))
A higher confidence score indicates greater certainty in the language identification. The `ScoreFilter` automatically filters documents below your specified `min_langid_score` threshold. The `extract_language_fields` stage shows how to further parse results and apply a higher threshold if needed.

:::{note}
Pipeline outputs may use the `language` field differently depending on the stage:
Pipeline outputs may use the `language` field differently depending on different stage. For example:

- In the FastText classification path (`ScoreFilter(FastTextLangId)`), the selected `score_field` (often `language`) stores a string representation of a list: `[score, code]`.
- In HTML extraction pipelines (for example, Common Crawl), CLD2 assigns a language name (for example, "ENGLISH") to the `language` column.
Expand Down
23 changes: 11 additions & 12 deletions docs/curate-text/process-data/quality-assessment/classifier.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,47 +28,47 @@ NeMo Curator supports a variety of classifier models for different filtering and
* - FastTextQualityFilter
- fastText (binary classifier)
- Quality filtering, high/low quality document classification (available as filter, not distributed classifier)
- https://fasttext.cc/
- [fastText](https://fasttext.cc/)
* - FastTextLangId
- fastText (language identification)
- Language identification (available as filter, not distributed classifier)
- https://fasttext.cc/docs/en/language-identification.html
- [fastText LangID](https://fasttext.cc/docs/en/language-identification.html)
* - QualityClassifier
- DeBERTa (transformers, HF)
- Document quality classification (multi-class, e.g., for curation)
- https://huggingface.co/nvidia/quality-classifier-deberta
- [nvidia/quality-classifier-deberta](https://huggingface.co/nvidia/quality-classifier-deberta)
* - DomainClassifier
- DeBERTa (transformers, HF)
- Domain classification (English)
- https://huggingface.co/nvidia/domain-classifier
- [nvidia/domain-classifier](https://huggingface.co/nvidia/domain-classifier)
* - MultilingualDomainClassifier
- mDeBERTa (transformers, HF)
- Domain classification (multilingual, 52 languages)
- https://huggingface.co/nvidia/multilingual-domain-classifier
- [nvidia/multilingual-domain-classifier](https://huggingface.co/nvidia/multilingual-domain-classifier)
* - ContentTypeClassifier
- DeBERTa (transformers, HF)
- Content type classification (11 speech types)
- https://huggingface.co/nvidia/content-type-classifier-deberta
- [nvidia/content-type-classifier-deberta](https://huggingface.co/nvidia/content-type-classifier-deberta)
* - AegisClassifier
- LlamaGuard-7b (LLM, PEFT, HF)
- Safety classification (AI content safety, requires access to LlamaGuard-7b)
- https://huggingface.co/meta-llama/LlamaGuard-7b
- [meta-llama/LlamaGuard-7b](https://huggingface.co/meta-llama/LlamaGuard-7b)
* - InstructionDataGuardClassifier
- Custom neural net (used with Aegis)
- Detects instruction data poisoning
- https://huggingface.co/nvidia/instruction-data-guard
- [nvidia/instruction-data-guard](https://huggingface.co/nvidia/instruction-data-guard)
* - FineWebEduClassifier
- SequenceClassification (transformers, HF)
- Educational content quality scoring (FineWeb)
- https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier
- [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)
* - FineWebMixtralEduClassifier
- SequenceClassification (transformers, HF)
- Educational content quality scoring (Mixtral variant)
- https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier
- [nvidia/nemocurator-fineweb-mixtral-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier)
* - FineWebNemotronEduClassifier
- SequenceClassification (transformers, HF)
- Educational content quality scoring (Nemotron-4 variant)
- https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier
- [nvidia/nemocurator-fineweb-nemotron-4-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier)
```

## How It Works
Expand Down Expand Up @@ -109,7 +109,6 @@ You can prepare training data using Python scripts:
```python
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.pipeline import Pipeline
import random

# Sample from low-quality dataset (e.g., raw Common Crawl)
def sample_documents(input_path, output_path, num_samples, label):
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -122,7 +122,7 @@ results = pipeline.run() # Uses XennaExecutor by default
The exact label categories returned by the Quality Classifier depend on the model configuration. Check the prediction column in your results to see the available labels for filtering with the `filter_by` parameter.
:::

### AEGIS Safety Model
### AEGIS Safety Classifier

The AEGIS classifier detects unsafe content across 13 critical risk categories. It requires a HuggingFace token for access to Llama Guard.

Expand Down
4 changes: 2 additions & 2 deletions docs/curate-text/process-data/quality-assessment/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ modality: "text-only"

Score and remove low-quality content using heuristics and ML classifiers to prepare your data for model training using NeMo Curator's tools and utilities.

Large datasets often contain many documents considered to be "low quality." In this context, "low quality" data simply means data we don't want a downstream model to learn from, and "high quality" data is data that we do want a downstream model to learn from. The metrics that define quality can vary widely.
Large datasets often contain many documents considered "low quality." In this context, "low quality" means data we do not want downstream models to learn from, and "high quality" is data we do want them to learn from. The metrics that define quality can vary widely.

## How It Works

Expand Down Expand Up @@ -112,7 +112,7 @@ You can combine these modules in pipelines:
```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.modules import Score, Filter

# Assume `word_counter` and `symbol_counter` are callables that return numeric scores
pipeline = Pipeline(name="multi_stage_filtering")
pipeline.add_stage(Score(word_counter, score_field="word_count"))
pipeline.add_stage(Score(symbol_counter, score_field="symbol_ratio"))
Expand Down
19 changes: 15 additions & 4 deletions docs/curate-text/process-data/specialized-processing/code.md
Original file line number Diff line number Diff line change
Expand Up @@ -82,7 +82,7 @@ NeMo Curator offers several specialized filters for code content:
| **PythonCommentToCodeFilter** | Filters Python files based on comment-to-code ratio | `min_comment_to_code_ratio`, `max_comment_to_code_ratio` | min=0.01, max=0.85 |
| **GeneralCommentToCodeFilter** | Similar filter for other languages | `language`, `min_comment_to_code_ratio`, `max_comment_to_code_ratio` | min=0.01, max=0.85 |

The comment-to-code ratio is an important metric for code quality. Low comment ratios may indicate poor documentation, while high comment ratios might suggest automatically generated code or tutorials:
The comment-to-code ratio is an important metric for code quality. Low comment ratios may indicate poor documentation, while high comment ratios might suggest automatically generated code or tutorials. These ratios should be adjusted based on specific progamming language:

```python
# For Python files with docstrings
Expand Down Expand Up @@ -112,14 +112,16 @@ The `GeneralCommentToCodeFilter` supports various language MIME types:
- `text/javascript` for JavaScript
- `text/x-ruby` for Ruby
- `text/x-csharp` for C#
- `text/x-c` for C
- `text/x-asm` for Assembly

### Code Structure Filters

| Filter | Description | Key Parameters | Default Values |
|--------|-------------|----------------|---------------|
| **NumberOfLinesOfCodeFilter** | Filters based on the number of lines | `min_lines`, `max_lines` | min=10, max=20000 |
| **AlphaFilter** | Ensures code has sufficient alphabetic content | `min_alpha_ratio` | 0.25 |
| **TokenizerFertilityFilter** | Measures token efficiency | `path_to_tokenizer` (required), `min_char_to_token_ratio` | ratio=2.5 |
| **NumberOfLinesOfCodeFilter** | Filters based on the number of lines | `min_lines`, `max_lines` | min_lines=10, max_lines=20000 |
| **AlphaFilter** | Ensures code has sufficient alphabetic content | `min_alpha_ratio` | min_alpha_ratio0.25 |
| **TokenizerFertilityFilter** | Measures token efficiency | `path_to_tokenizer` (required), `min_char_to_token_ratio` | min_char_to_token_ratio=2.5 |

Code structure filters help identify problematic patterns:

Expand Down Expand Up @@ -237,6 +239,9 @@ When filtering code datasets, consider these best practices:
1. **Language-specific configurations**: Adjust thresholds based on the programming language

```python
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import PythonCommentToCodeFilter, GeneralCommentToCodeFilter

# Python tends to have more comments than C
python_comment_filter = ScoreFilter(
score_fn=PythonCommentToCodeFilter(min_comment_to_code_ratio=0.05),
Expand All @@ -251,6 +256,9 @@ When filtering code datasets, consider these best practices:
2. **Preserve code structure**: Ensure filters don't inadvertently remove valid coding patterns

```python
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import GeneralCommentToCodeFilter

# Some languages naturally have low comment ratios
assembly_filter = ScoreFilter(
score_fn=GeneralCommentToCodeFilter(
Expand All @@ -267,6 +275,7 @@ When filtering code datasets, consider these best practices:
# First check if the content is actually Python using FastText language ID
from nemo_curator.stages.text.filters import FastTextLangId
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.modules import ScoreFilter

# Create pipeline for Python code filtering with language detection
pipeline = Pipeline(name="python_code_filtering")
Expand Down Expand Up @@ -321,6 +330,7 @@ When filtering code datasets, consider these best practices:
```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import NumberOfLinesOfCodeFilter, XMLHeaderFilter, GeneralCommentToCodeFilter

# Create pipeline to filter non-functional code snippets
pipeline = Pipeline(name="code_cleaning")
Expand Down Expand Up @@ -351,6 +361,7 @@ pipeline.add_stage(ScoreFilter(
```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import AlphaFilter, TokenizerFertilityFilter, HTMLBoilerplateFilter

# Create pipeline for training data preparation
pipeline = Pipeline(name="training_data_prep")
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ Domain-specific processing for code and advanced curation tasks using NeMo Curat

This section covers advanced processing techniques for specific data types and use cases that require specialized handling beyond general text processing. These tools are designed for specific domains like programming content.

## How it Works
## How It Works

Specialized processing modules in NeMo Curator are designed for specific data types and use cases:

Expand Down
10 changes: 8 additions & 2 deletions docs/curate-video/tutorials/split-dedup.md
Original file line number Diff line number Diff line change
Expand Up @@ -149,6 +149,12 @@ pipe.add_stage(
pipe.run()
```

`which_to_keep` selects the representative within each cluster: "hard" keeps outliers far from the centroid, "easy" keeps the nearest to the centroid, and "random" ignores distance and picks randomly.

`sim_metric` sets the distance used for similarity: "cosine" uses cosine distance (1 − cosine similarity), while "l2" uses Euclidean distance.

`pairwise_batch_size` controls how many items are processed per GPU batch during pairwise similarity; larger values can be faster but require more GPU memory.

---

## 3. Inspect Results
Expand All @@ -167,8 +173,8 @@ After duplicate removal, export curated clips and metadata for training. Common
Video-specific pointers:

- Use `ClipWriterStage` path helpers to locate outputs: `nemo_curator/stages/video/io/clip_writer.py`.
- Processed videos: `get_output_path_processed_videos(OUT_DIR)`
- Clip chunks and previews: `get_output_path_processed_clip_chunks(OUT_DIR)`, `get_output_path_previews(OUT_DIR)`
- Processed videos: `get_output_path_processed_videos(${OUT_DIR})`
- Clip chunks and previews: `get_output_path_processed_clip_chunks(${OUT_DIR})`, `get_output_path_previews(${OUT_DIR})`
- Embeddings parquet: `${OUT_DIR}/iv2_embd_parquet` (or `${OUT_DIR}/ce1_embd_parquet`)

### Example Export
Expand Down