Merged
27 changes: 12 additions & 15 deletions docs/curate-audio/tutorials/beginner.md
@@ -40,19 +40,17 @@ The complete working code for this tutorial is located at:

```
<nemo_curator_repository>/tutorials/audio/fleurs/
-├── run.py # Main tutorial script
-├── README.md # Tutorial documentation
-└── requirements.txt # Python dependencies
+├── pipeline.py # Main tutorial script
+├── pipeline.yaml # Configuration file for run.py
+└── run.py # Same as pipeline.py, but defines pipeline using YAML file instead
```

**Accessing the code:**
```bash
# Clone NeMo Curator repository
git clone https://github.com/NVIDIA/NeMo-Curator.git
cd NeMo-Curator/tutorials/audio/fleurs/

-# Install dependencies
-pip install -r requirements.txt
```

## Prerequisites
@@ -222,16 +220,15 @@ To run the working tutorial:
```bash
cd tutorials/audio/fleurs/

-# Basic run with default settings
-python run.py --raw_data_dir /data/fleurs_output
-
-# Customize parameters
-python run.py \
---raw_data_dir /data/fleurs_output \
---lang ko_kr \
---split train \
---model_name nvidia/stt_ko_fastconformer_hybrid_large_pc \
---wer_threshold 50.0
+python tutorials/audio/fleurs/pipeline.py \
+--raw_data_dir ./example_audio/fleurs \
+--model_name nvidia/stt_hy_fastconformer_hybrid_large_pc \
+--lang hy_am \
+--split dev \
+--wer_threshold 75 \
+--gpus 1 \
+--clean \
+--verbose
```

**Command-line options:**
4 changes: 2 additions & 2 deletions docs/get-started/video.md
@@ -266,10 +266,10 @@ Organize input videos and output locations before running the pipeline.

## Run the Splitting Pipeline Example

-Use the following example script to read videos, split into clips, and write outputs. This runs a Ray pipeline with `XennaExecutor` under the hood.
+Use the example script from https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/video/getting-started to read videos, split into clips, and write outputs. This runs a Ray pipeline with `XennaExecutor` under the hood.

```bash
-python -m nemo_curator.examples.video.video_split_clip_example \
+python tutorials/video/getting-started/video_split_clip_example.py \
--video-dir "$DATA_DIR" \
--model-dir "$MODEL_DIR" \
--output-clip-path "$OUT_DIR" \
16 changes: 8 additions & 8 deletions docs/reference/infrastructure/execution-backends.md
@@ -108,9 +108,7 @@ results = pipeline.run(executor)

For more details, refer to the official [NVIDIA Cosmos-Xenna project](https://github.com/nvidia-cosmos/cosmos-xenna/tree/main).

-### `RayActorPoolExecutor`
-
-Executor using Ray Actor pools for custom distributed processing patterns such as deduplication.
+### `RayDataExecutor`

`RayDataExecutor` uses Ray Data, a scalable data processing library built on Ray Core. Ray Data provides a familiar DataFrame-like API for distributed data transformations. This executor is experimental and best suited for large-scale batch processing tasks that benefit from Ray Data's optimized data loading and transformation pipelines.

@@ -120,21 +118,23 @@ Executor using Ray Actor pools for custom distributed processing patterns such a
- **Experimental status**: API and performance characteristics may change

```python
-from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor
+from nemo_curator.backends.experimental.ray_data import RayDataExecutor

-executor = RayActorPoolExecutor()
+executor = RayDataExecutor()
results = pipeline.run(executor)
```

:::{note}`RayDataExecutor` currently has limited configuration options. For more control over execution, consider using `XennaExecutor` or `RayActorPoolExecutor`.
:::

-### `RayActorPoolExecutor` (experimental)
+### `RayActorPoolExecutor`
+
+Executor using Ray Actor pools for custom distributed processing patterns such as deduplication.

```python
-from nemo_curator.backends.experimental.ray_data import RayDataExecutor
+from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor

-executor = RayDataExecutor()
+executor = RayActorPoolExecutor()
results = pipeline.run(executor)
```
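
Reviewer illustration, not part of the change set: since this hunk swaps two snippets whose imports and class names were paired with the wrong headings, a small lookup table can make the heading-to-import pairing explicit. The sketch below is a minimal example under stated assumptions: the two module paths and class names are copied from the snippets above, while the short keys (`ray_data`, `ray_actor_pool`) and the helper names `resolve_executor_path` and `make_executor` are hypothetical and do not come from NeMo Curator.

```python
import importlib

# Module paths and class names are taken from the documentation snippets above.
# The short string keys are hypothetical labels chosen for this illustration.
EXECUTOR_PATHS = {
    "ray_data": (
        "nemo_curator.backends.experimental.ray_data",
        "RayDataExecutor",
    ),
    "ray_actor_pool": (
        "nemo_curator.backends.experimental.ray_actor_pool",
        "RayActorPoolExecutor",
    ),
}


def resolve_executor_path(name):
    """Return (module_path, class_name) for a known executor label."""
    if name not in EXECUTOR_PATHS:
        known = ", ".join(sorted(EXECUTOR_PATHS))
        raise ValueError(f"unknown executor {name!r}; expected one of: {known}")
    return EXECUTOR_PATHS[name]


def make_executor(name):
    """Import the executor class lazily and instantiate it with no arguments.

    Assumes NeMo Curator is installed; the import happens only when called.
    """
    module_path, class_name = resolve_executor_path(name)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)()
```

With NeMo Curator installed, `pipeline.run(make_executor("ray_data"))` would then correspond to the `RayDataExecutor` snippet, and `make_executor("ray_actor_pool")` to the `RayActorPoolExecutor` one. The lazy import keeps the lookup usable for checks even where the package is absent.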
