diff --git a/docs/about/concepts/video/architecture.md b/docs/about/concepts/video/architecture.md index 017ed8e2a..cbcb1bd64 100644 --- a/docs/about/concepts/video/architecture.md +++ b/docs/about/concepts/video/architecture.md @@ -13,7 +13,7 @@ only: not ga # Architecture -NeMo Curator's video curation system builds on Ray, a distributed computing framework that enables scalable, high-throughput data processing across clusters of machines. +NeMo Curator's video curation system builds on Ray, a distributed framework for scalable, high-throughput data processing across machine clusters. ## Ray Foundation diff --git a/docs/about/concepts/video/data-flow.md b/docs/about/concepts/video/data-flow.md index 6cd1fac64..cce9ef88a 100644 --- a/docs/about/concepts/video/data-flow.md +++ b/docs/about/concepts/video/data-flow.md @@ -15,11 +15,11 @@ modality: "video-only" Understanding how data moves through NeMo Curator's video curation pipelines is key to optimizing performance and resource usage. - Data moves between stages via Ray's distributed object store, enabling efficient, in-memory transfer between distributed actors. -- In streaming mode, the executor returns final stage outputs while intermediate state stays in memory, reducing I/O overhead and improving throughput. +- In streaming mode (where stages operate continuously rather than in batches), the executor returns only final-stage outputs while keeping intermediate state in memory. This reduces I/O overhead and improves throughput. - The auto-scaling component continuously balances resources to maximize pipeline throughput, dynamically allocating workers to stages as needed. - Writer stages persist outputs at the end of the pipeline, including clip media, embeddings (pickle and parquet variants), and metadata JSON files. -This architecture enables efficient processing of large-scale video datasets, with minimal data movement and optimal use of available hardware.
+Together, these components enable efficient processing of large-scale video datasets with minimal data movement and optimal use of available hardware. ## Writer Output Layout @@ -28,10 +28,10 @@ Writer stages produce the following directories under the configured output path - `clips/`: MP4 clip files - `filtered_clips/`: MP4 files for filtered clips - `previews/`: WebP preview images for windows -- `metas/v0/`: Per-clip JSON metadata +- `metas/v0/`: Per-clip JSON metadata files - `iv2_embd/`: Per-clip embeddings (pickle) for InternVideo2 -- `iv2_embd_parquet/`: Per-video embeddings (parquet) for InternVideo2 +- `iv2_embd_parquet/`: Aggregated per-video embeddings (parquet) for InternVideo2 - `ce1_embd/`: Per-clip embeddings (pickle) for Cosmos-Embed1 -- `ce1_embd_parquet/`: Per-video embeddings (parquet) for Cosmos-Embed1 -- `processed_videos/`: Per-video JSON metadata +- `ce1_embd_parquet/`: Aggregated per-video embeddings (parquet) for Cosmos-Embed1 +- `processed_videos/`: Per-video JSON metadata files - `processed_clip_chunks/`: Per-clip-chunk JSON statistics diff --git a/docs/about/concepts/video/index.md b/docs/about/concepts/video/index.md index a93506334..ed7d3d8f9 100644 --- a/docs/about/concepts/video/index.md +++ b/docs/about/concepts/video/index.md @@ -39,7 +39,7 @@ Stages, pipelines, and execution modes in video curation workflows :link: about-concepts-video-data-flow :link-type: ref -How data moves through the system, from ingestion to output +How data moves through the system from ingestion to output ::: :::: @@ -58,7 +58,7 @@ The video curation concepts build on NVIDIA NeMo Curator's core infrastructure c :::{grid-item-card} {octicon}`database;1.5em;sd-mr-1` Memory Management :link: reference-infra-memory-management :link-type: ref -Optimize memory usage when processing large datasets +Optimize memory usage for large datasets +++ {bdg-secondary}`partitioning` {bdg-secondary}`batching` @@ -78,7 +78,7 @@ Leverage NVIDIA GPU acceleration for faster 
data processing :::{grid-item-card} {octicon}`sync;1.5em;sd-mr-1` Resumable Processing :link: reference-infra-resumable-processing :link-type: ref -Continue interrupted operations across large datasets +Continue interrupted operations on large datasets +++ {bdg-secondary}`checkpoints` {bdg-secondary}`recovery` diff --git a/docs/curate-text/process-data/deduplication/exact.md b/docs/curate-text/process-data/deduplication/exact.md index c183ca6cb..82f74de9a 100644 --- a/docs/curate-text/process-data/deduplication/exact.md +++ b/docs/curate-text/process-data/deduplication/exact.md @@ -1,3 +1,4 @@ +--- description: "Identify and remove exact duplicates using MD5 hashing in a Ray-based workflow" categories: ["how-to-guides"] tags: ["exact-dedup", "hashing", "md5", "gpu", "ray"] diff --git a/docs/curate-text/process-data/deduplication/fuzzy.md b/docs/curate-text/process-data/deduplication/fuzzy.md index 64492199a..a9590eb87 100644 --- a/docs/curate-text/process-data/deduplication/fuzzy.md +++ b/docs/curate-text/process-data/deduplication/fuzzy.md @@ -187,7 +187,7 @@ Configure fuzzy deduplication using these key parameters: ### Similarity Threshold -Control the strictness of matching with `num_bands` and `minhashes_per_band`: +Control matching strictness with `num_bands` and `minhashes_per_band`: - **More strict matching**: Increase `num_bands` or decrease `minhashes_per_band` - **Less strict matching**: Decrease `num_bands` or increase `minhashes_per_band` diff --git a/docs/curate-text/process-data/deduplication/index.md b/docs/curate-text/process-data/deduplication/index.md index 838309edc..c200aac51 100644 --- a/docs/curate-text/process-data/deduplication/index.md +++ b/docs/curate-text/process-data/deduplication/index.md @@ -418,7 +418,7 @@ For large-scale duplicate removal, persist the ID Generator for consistent docum ```python from nemo_curator.stages.deduplication.id_generator import ( - create_id_generator_actor, + create_id_generator_actor, 
write_id_generator_to_disk, kill_id_generator_actor ) diff --git a/docs/curate-text/process-data/deduplication/semdedup.md b/docs/curate-text/process-data/deduplication/semdedup.md index 7a6bf3ee9..79880782c 100644 --- a/docs/curate-text/process-data/deduplication/semdedup.md +++ b/docs/curate-text/process-data/deduplication/semdedup.md @@ -323,6 +323,7 @@ workflow = TextSemanticDeduplicationWorkflow( - Ensure compatibility with your data type - Adjust `embedding_model_inference_batch_size` for memory requirements - Choose models appropriate for your language or domain +- Avoid generic decoder-only LLMs (e.g., OPT/GPT) for embeddings; prefer models trained for sentence embeddings (e.g., E5/BGE/SBERT) ::: :::{dropdown} Advanced Configuration @@ -362,7 +363,7 @@ workflow = TextSemanticDeduplicationWorkflow( The semantic deduplication process produces the following directory structure in your configured `cache_path`: -```s +```text cache_path/ ├── embeddings/ # Embedding outputs │ └── *.parquet # Parquet files containing document embeddings @@ -394,8 +395,8 @@ The workflow produces these output files: - `embs_by_nearest_center/`: Parquet files containing cluster members - Format: Parquet files with columns: `[id_column, embedding_column, cluster_id]` -3. **Deduplicated Results** (`output_path/duplicates/*.parquet`): - - Final output containing document IDs to remove after deduplication +3. 
**Duplicate IDs** (`output_path/duplicates/*.parquet`): + - IDs of documents identified as duplicates for removal - Format: Parquet file with columns: `["id"]` - **Important**: Contains only the IDs of documents to remove, not the full document content - When `perform_removal=True`, clean dataset is saved to `output_path/deduplicated/` diff --git a/docs/curate-text/process-data/language-management/language.md b/docs/curate-text/process-data/language-management/language.md index a149b9237..473c77367 100644 --- a/docs/curate-text/process-data/language-management/language.md +++ b/docs/curate-text/process-data/language-management/language.md @@ -203,7 +203,7 @@ pipeline.add_stage(create_extract_language_fields_stage(min_confidence=0.7)) A higher confidence score indicates greater certainty in the language identification. The `ScoreFilter` automatically filters documents below your specified `min_langid_score` threshold. The `extract_language_fields` stage shows how to further parse results and apply a higher threshold if needed. :::{note} -Pipeline outputs may use the `language` field differently depending on the stage: +Pipeline outputs may use the `language` field differently depending on the stage. For example: - In the FastText classification path (`ScoreFilter(FastTextLangId)`), the selected `score_field` (often `language`) stores a string representation of a list: `[score, code]`. - In HTML extraction pipelines (for example, Common Crawl), CLD2 assigns a language name (for example, "ENGLISH") to the `language` column.
diff --git a/docs/curate-text/process-data/quality-assessment/classifier.md b/docs/curate-text/process-data/quality-assessment/classifier.md index cc871a11f..8998c2dc4 100644 --- a/docs/curate-text/process-data/quality-assessment/classifier.md +++ b/docs/curate-text/process-data/quality-assessment/classifier.md @@ -28,47 +28,47 @@ NeMo Curator supports a variety of classifier models for different filtering and * - FastTextQualityFilter - fastText (binary classifier) - Quality filtering, high/low quality document classification (available as filter, not distributed classifier) - - https://fasttext.cc/ + - [fastText](https://fasttext.cc/) * - FastTextLangId - fastText (language identification) - Language identification (available as filter, not distributed classifier) - - https://fasttext.cc/docs/en/language-identification.html + - [fastText LangID](https://fasttext.cc/docs/en/language-identification.html) * - QualityClassifier - DeBERTa (transformers, HF) - Document quality classification (multi-class, e.g., for curation) - - https://huggingface.co/nvidia/quality-classifier-deberta + - [nvidia/quality-classifier-deberta](https://huggingface.co/nvidia/quality-classifier-deberta) * - DomainClassifier - DeBERTa (transformers, HF) - Domain classification (English) - - https://huggingface.co/nvidia/domain-classifier + - [nvidia/domain-classifier](https://huggingface.co/nvidia/domain-classifier) * - MultilingualDomainClassifier - mDeBERTa (transformers, HF) - Domain classification (multilingual, 52 languages) - - https://huggingface.co/nvidia/multilingual-domain-classifier + - [nvidia/multilingual-domain-classifier](https://huggingface.co/nvidia/multilingual-domain-classifier) * - ContentTypeClassifier - DeBERTa (transformers, HF) - Content type classification (11 speech types) - - https://huggingface.co/nvidia/content-type-classifier-deberta + - [nvidia/content-type-classifier-deberta](https://huggingface.co/nvidia/content-type-classifier-deberta) * - AegisClassifier - 
LlamaGuard-7b (LLM, PEFT, HF) - Safety classification (AI content safety, requires access to LlamaGuard-7b) - - https://huggingface.co/meta-llama/LlamaGuard-7b + - [meta-llama/LlamaGuard-7b](https://huggingface.co/meta-llama/LlamaGuard-7b) * - InstructionDataGuardClassifier - Custom neural net (used with Aegis) - Detects instruction data poisoning - - https://huggingface.co/nvidia/instruction-data-guard + - [nvidia/instruction-data-guard](https://huggingface.co/nvidia/instruction-data-guard) * - FineWebEduClassifier - SequenceClassification (transformers, HF) - Educational content quality scoring (FineWeb) - - https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier + - [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) * - FineWebMixtralEduClassifier - SequenceClassification (transformers, HF) - Educational content quality scoring (Mixtral variant) - - https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier + - [nvidia/nemocurator-fineweb-mixtral-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier) * - FineWebNemotronEduClassifier - SequenceClassification (transformers, HF) - Educational content quality scoring (Nemotron-4 variant) - - https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier + - [nvidia/nemocurator-fineweb-nemotron-4-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier) ``` ## How It Works @@ -109,7 +109,6 @@ You can prepare training data using Python scripts: ```python from nemo_curator.stages.text.io.reader import JsonlReader from nemo_curator.pipeline import Pipeline -import random # Sample from low-quality dataset (e.g., raw Common Crawl) def sample_documents(input_path, output_path, num_samples, label): diff --git a/docs/curate-text/process-data/quality-assessment/distributed-classifier.md b/docs/curate-text/process-data/quality-assessment/distributed-classifier.md index 
3d54975d6..50b005054 100644 --- a/docs/curate-text/process-data/quality-assessment/distributed-classifier.md +++ b/docs/curate-text/process-data/quality-assessment/distributed-classifier.md @@ -122,7 +122,7 @@ results = pipeline.run() # Uses XennaExecutor by default The exact label categories returned by the Quality Classifier depend on the model configuration. Check the prediction column in your results to see the available labels for filtering with the `filter_by` parameter. ::: -### AEGIS Safety Model +### AEGIS Safety Classifier The AEGIS classifier detects unsafe content across 13 critical risk categories. It requires a HuggingFace token for access to Llama Guard. diff --git a/docs/curate-text/process-data/quality-assessment/index.md b/docs/curate-text/process-data/quality-assessment/index.md index b5c2f8d71..266a4bdc1 100644 --- a/docs/curate-text/process-data/quality-assessment/index.md +++ b/docs/curate-text/process-data/quality-assessment/index.md @@ -14,7 +14,7 @@ modality: "text-only" Score and remove low-quality content using heuristics and ML classifiers to prepare your data for model training using NeMo Curator's tools and utilities. -Large datasets often contain many documents considered to be "low quality." In this context, "low quality" data simply means data we don't want a downstream model to learn from, and "high quality" data is data that we do want a downstream model to learn from. The metrics that define quality can vary widely. +Large datasets often contain many documents considered "low quality." In this context, "low quality" means data we do not want downstream models to learn from, and "high quality" is data we do want them to learn from. The metrics that define quality can vary widely. 
## How It Works @@ -112,7 +112,7 @@ You can combine these modules in pipelines: ```python from nemo_curator.pipeline import Pipeline from nemo_curator.stages.text.modules import Score, Filter - +# Assume `word_counter` and `symbol_counter` are callables that return numeric scores pipeline = Pipeline(name="multi_stage_filtering") pipeline.add_stage(Score(word_counter, score_field="word_count")) pipeline.add_stage(Score(symbol_counter, score_field="symbol_ratio")) diff --git a/docs/curate-text/process-data/specialized-processing/code.md b/docs/curate-text/process-data/specialized-processing/code.md index cfa639767..d8ad8b5c7 100644 --- a/docs/curate-text/process-data/specialized-processing/code.md +++ b/docs/curate-text/process-data/specialized-processing/code.md @@ -45,6 +45,7 @@ pipeline.add_stage(reader) # Add filter stages for code quality pipeline.add_stage(ScoreFilter( + ## NEED FIX: TypeError: ScoreFilter.__init__() got an unexpected keyword argument 'score_fn' score_fn=PythonCommentToCodeFilter( min_comment_to_code_ratio=0.01, max_comment_to_code_ratio=0.8 @@ -82,7 +83,7 @@ NeMo Curator offers several specialized filters for code content: | **PythonCommentToCodeFilter** | Filters Python files based on comment-to-code ratio | `min_comment_to_code_ratio`, `max_comment_to_code_ratio` | min=0.01, max=0.85 | | **GeneralCommentToCodeFilter** | Similar filter for other languages | `language`, `min_comment_to_code_ratio`, `max_comment_to_code_ratio` | min=0.01, max=0.85 | -The comment-to-code ratio is an important metric for code quality. Low comment ratios may indicate poor documentation, while high comment ratios might suggest automatically generated code or tutorials: +The comment-to-code ratio is an important metric for code quality. Low comment ratios may indicate poor documentation, while high comment ratios might suggest automatically generated code or tutorials. 
These ratios should be adjusted based on specific programming languages: ```python # For Python files with docstrings @@ -112,14 +113,16 @@ The `GeneralCommentToCodeFilter` supports various language MIME types: - `text/javascript` for JavaScript - `text/x-ruby` for Ruby - `text/x-csharp` for C# +- `text/x-c` for C +- `text/x-asm` for Assembly ### Code Structure Filters | Filter | Description | Key Parameters | Default Values | |--------|-------------|----------------|---------------| -| **NumberOfLinesOfCodeFilter** | Filters based on the number of lines | `min_lines`, `max_lines` | min=10, max=20000 | -| **AlphaFilter** | Ensures code has sufficient alphabetic content | `min_alpha_ratio` | 0.25 | -| **TokenizerFertilityFilter** | Measures token efficiency | `path_to_tokenizer` (required), `min_char_to_token_ratio` | ratio=2.5 | +| **NumberOfLinesOfCodeFilter** | Filters based on the number of lines | `min_lines`, `max_lines` | min_lines=10, max_lines=20000 | +| **AlphaFilter** | Ensures code has sufficient alphabetic content | `min_alpha_ratio` | min_alpha_ratio=0.25 | +| **TokenizerFertilityFilter** | Measures token efficiency | `path_to_tokenizer` (required), `min_char_to_token_ratio` | min_char_to_token_ratio=2.5 | Code structure filters help identify problematic patterns: @@ -237,6 +240,9 @@ When filtering code datasets, consider these best practices: 1. **Language-specific configurations**: Adjust thresholds based on the programming language ```python + from nemo_curator.stages.text.modules import ScoreFilter + from nemo_curator.stages.text.filters import PythonCommentToCodeFilter, GeneralCommentToCodeFilter + # Python tends to have more comments than C python_comment_filter = ScoreFilter( score_fn=PythonCommentToCodeFilter(min_comment_to_code_ratio=0.05), @@ -251,6 +257,9 @@ When filtering code datasets, consider these best practices: 2. 
**Preserve code structure**: Ensure filters don't inadvertently remove valid coding patterns ```python + from nemo_curator.stages.text.modules import ScoreFilter + from nemo_curator.stages.text.filters import GeneralCommentToCodeFilter + # Some languages naturally have low comment ratios assembly_filter = ScoreFilter( score_fn=GeneralCommentToCodeFilter( @@ -267,6 +276,7 @@ When filtering code datasets, consider these best practices: # First check if the content is actually Python using FastText language ID from nemo_curator.stages.text.filters import FastTextLangId from nemo_curator.pipeline import Pipeline + from nemo_curator.stages.text.modules import ScoreFilter # Create pipeline for Python code filtering with language detection pipeline = Pipeline(name="python_code_filtering") @@ -321,6 +331,7 @@ When filtering code datasets, consider these best practices: ```python from nemo_curator.pipeline import Pipeline from nemo_curator.stages.text.modules import ScoreFilter +from nemo_curator.stages.text.filters import NumberOfLinesOfCodeFilter, XMLHeaderFilter, GeneralCommentToCodeFilter # Create pipeline to filter non-functional code snippets pipeline = Pipeline(name="code_cleaning") @@ -351,6 +362,7 @@ pipeline.add_stage(ScoreFilter( ```python from nemo_curator.pipeline import Pipeline from nemo_curator.stages.text.modules import ScoreFilter +from nemo_curator.stages.text.filters import AlphaFilter, TokenizerFertilityFilter, HTMLBoilerplateFilter # Create pipeline for training data preparation pipeline = Pipeline(name="training_data_prep") diff --git a/docs/curate-text/process-data/specialized-processing/index.md b/docs/curate-text/process-data/specialized-processing/index.md index 7a98149e0..c6e8e2e3f 100644 --- a/docs/curate-text/process-data/specialized-processing/index.md +++ b/docs/curate-text/process-data/specialized-processing/index.md @@ -16,7 +16,7 @@ Domain-specific processing for code and advanced curation tasks using NeMo Curat This section covers 
advanced processing techniques for specific data types and use cases that require specialized handling beyond general text processing. These tools are designed for specific domains like programming content. -## How it Works +## How It Works Specialized processing modules in NeMo Curator are designed for specific data types and use cases: @@ -76,6 +76,7 @@ code_pipeline = Pipeline( ) ]) +## NEED FIX: NameError: name 'code_dataset' is not defined filtered_code = code_pipeline(code_dataset) ``` diff --git a/docs/curate-video/tutorials/split-dedup.md b/docs/curate-video/tutorials/split-dedup.md index 4846ad934..c8523562b 100644 --- a/docs/curate-video/tutorials/split-dedup.md +++ b/docs/curate-video/tutorials/split-dedup.md @@ -149,6 +149,12 @@ pipe.add_stage( pipe.run() ``` +`which_to_keep` selects the representative within each cluster: "hard" keeps outliers far from the centroid, "easy" keeps the nearest to the centroid, and "random" ignores distance and picks randomly. + +`sim_metric` sets the distance used for similarity: "cosine" uses cosine distance (1 − cosine similarity), while "l2" uses Euclidean distance. + +`pairwise_batch_size` controls how many items are processed per GPU batch during pairwise similarity; larger values can be faster but require more GPU memory. + --- ## 3. Inspect Results @@ -167,8 +173,8 @@ After duplicate removal, export curated clips and metadata for training. Common Video-specific pointers: - Use `ClipWriterStage` path helpers to locate outputs: `nemo_curator/stages/video/io/clip_writer.py`. 
- - Processed videos: `get_output_path_processed_videos(OUT_DIR)` - - Clip chunks and previews: `get_output_path_processed_clip_chunks(OUT_DIR)`, `get_output_path_previews(OUT_DIR)` + - Processed videos: `get_output_path_processed_videos(${OUT_DIR})` + - Clip chunks and previews: `get_output_path_processed_clip_chunks(${OUT_DIR})`, `get_output_path_previews(${OUT_DIR})` - Embeddings parquet: `${OUT_DIR}/iv2_embd_parquet` (or `${OUT_DIR}/ce1_embd_parquet`) ### Example Export