
# Language Management

Identify document languages, filter multilingual content, and apply language-specific processing to create high-quality monolingual or multilingual text datasets.

## Overview

NeMo Curator provides robust tools for managing multilingual text datasets through language detection, stop word management, and specialized handling for language-specific requirements. These capabilities are essential for:

- **Monolingual Dataset Creation**: Filter documents by language to create single-language training datasets
- **Multilingual Dataset Curation**: Identify and tag languages for balanced multilingual corpora
- **Quality Filtering**: Apply language-specific quality checks and stop word filtering
- **Non-Spaced Language Support**: Handle Chinese, Japanese, Thai, and Korean text with specialized tokenization

---
## Language Processing Capabilities

### Language Detection

- **FastText Model**: Supports 176 languages with confidence scores
- **CLD2 Integration**: Used automatically in Common Crawl text extraction pipeline
- **Configurable Thresholds**: Filter documents by minimum confidence scores

### Stop Word Management

- **Built-in Stop Word Lists**: Pre-configured lists for common languages
- **Customizable Filtering**: Adjust thresholds for stop word density
- **Content Quality Enhancement**: Remove low-information documents
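The idea behind stop-word-density filtering can be shown with a small standalone sketch (plain Python, not the Curator API; the word list and threshold here are illustrative assumptions):

```python
# Illustrative stop-word density scoring: natural-language prose tends to
# contain a sizable fraction of stop words, so a very low density often
# signals boilerplate, code, or other low-information text.
ENGLISH_STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def stop_word_density(text: str) -> float:
    """Fraction of whitespace-separated tokens that are stop words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in ENGLISH_STOP_WORDS for t in tokens) / len(tokens)

print(stop_word_density("the cat sat in the hat"))  # 3 of 6 tokens -> 0.5
```

A pipeline would keep documents whose density exceeds some minimum (for example, 0.1) and drop the rest.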

### Special Language Handling

- **Non-Spaced Languages**: Specialized tokenization for Chinese, Japanese, Thai, Korean
- **Script Detection**: Identify and process different writing systems
- **Language-Specific Processing**: Apply custom rules per language
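Script detection can be approximated with Unicode code-point ranges. The following is a rough standalone sketch of that idea, not Curator's implementation:

```python
# Flag text containing scripts that are written without spaces between
# words (CJK ideographs, Japanese kana, Thai, Korean Hangul), which need
# specialized tokenization rather than whitespace splitting.
def has_non_spaced_script(text: str) -> bool:
    for ch in text:
        cp = ord(ch)
        if (0x4E00 <= cp <= 0x9FFF        # CJK Unified Ideographs
                or 0x3040 <= cp <= 0x30FF  # Hiragana and Katakana
                or 0x0E00 <= cp <= 0x0E7F  # Thai
                or 0xAC00 <= cp <= 0xD7AF):  # Hangul syllables
            return True
    return False

print(has_non_spaced_script("こんにちは"))  # True (Japanese kana)
print(has_non_spaced_script("hello"))       # False
```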

## Prerequisites

Before implementing language management in your pipeline:

### Required Resources

* **FastText Model File**: Download the language identification model
- Model options: `lid.176.bin` (full model, ~131MB) or `lid.176.ftz` (compressed model, ~917KB)
- Download from: [FastText Language Identification](https://fasttext.cc/docs/en/language-identification.html)
- Save to an accessible location (local path or shared storage)

* **Data Format**: JSONL (JSON Lines) input with text content
- Default field name: `text`
- Custom field support: Specify with `text_field` parameter

* **Cluster Setup** (if applicable):
- Ensure FastText model file is accessible to all workers
- Use shared filesystem, network storage, or object storage (S3, GCS, etc.)

## Basic Language Filtering

### Quick Start Example

Filter documents by language using FastText:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import FastTextLangId

# Create language filtering pipeline
pipeline = Pipeline(name="language_filtering")

# 1. Read JSONL input files
pipeline.add_stage(
    JsonlReader(
        file_paths="input_data/",
        files_per_partition=2  # Process 2 files per partition
    )
)

# 2. Identify languages and filter by confidence threshold
pipeline.add_stage(
    ScoreFilter(
        FastTextLangId(
            model_path="/path/to/lid.176.bin",  # Path to FastText model
            min_langid_score=0.3  # Minimum confidence (0.0-1.0)
        ),
        score_field="language"  # Output field for language code
    )
)

# 3. Write filtered results
pipeline.add_stage(
    JsonlWriter(path="output_filtered/")
)

# Execute pipeline (uses XennaExecutor by default)
results = pipeline.run()
```

**Parameters explained:**
- `model_path`: Absolute path to FastText model file (`lid.176.bin` or `lid.176.ftz`)
- `min_langid_score`: Minimum confidence score (0.0 to 1.0). Documents below this threshold are filtered out
- `score_field`: Field name to store the detection result. Note that `FastTextLangId` stores the confidence score and the upper-case language code together as a stringified pair (e.g., `"[0.95, 'EN']"`), not a bare code like `"en"`
- `files_per_partition`: Number of files to process per partition (tune based on file sizes)

**Output format:**
Each document that passes the filter keeps its original fields, and the detection result is recorded in the score field as a stringified `[score, 'CODE']` pair:
```json
{"text": "This is an English document.", "language": "[0.95, 'EN']"}
{"text": "Este es un documento en español.", "language": "[0.92, 'ES']"}
```
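Because the detection result is stored as a stringified pair, downstream stages need to parse it before comparing language codes. A small stdlib helper (a sketch; the field layout follows the `[score, 'CODE']` format noted above):

```python
import ast

def parse_langid(value):
    """Unpack a "[score, 'CODE']" string into (score, code)."""
    if isinstance(value, str):
        value = ast.literal_eval(value)  # e.g. "[0.95, 'EN']" -> [0.95, 'EN']
    score, code = value
    return float(score), str(code)

print(parse_langid("[0.95, 'EN']"))  # (0.95, 'EN')
```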
## Integration with HTML Extraction

When processing HTML content (e.g., Common Crawl), CLD2 provides language hints automatically:

```python
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage

# HTML extraction automatically uses CLD2 for language hints
pipeline.add_stage(
    CommonCrawlDownloadExtractStage(
        start_snapshot="2024-01",
        end_snapshot="2024-01",
        download_dir="/tmp/cc_downloads"
    )
)

# Additional FastText filtering for refined language detection
pipeline.add_stage(
    ScoreFilter(
        FastTextLangId(model_path="/path/to/lid.176.bin", min_langid_score=0.5),
        score_field="language"
    )
)
```

**CLD2 vs FastText:**
- **CLD2**: Fast, lightweight, used for initial hints during HTML extraction
- **FastText**: More accurate, supports 176 languages, recommended for final filtering

## Complete Language Management Example

Here's a comprehensive pipeline demonstrating language detection and filtering:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import FastTextLangId
from nemo_curator.stages.function_decorators import processing_stage
from nemo_curator.tasks import DocumentBatch

# Create comprehensive language management pipeline
pipeline = Pipeline(name="language_management_complete")

# 1. Load input data
pipeline.add_stage(
    JsonlReader(file_paths="raw_data/", files_per_partition=4)
)

# 2. Detect languages with FastText
pipeline.add_stage(
    ScoreFilter(
        FastTextLangId(
            model_path="/models/lid.176.bin",
            min_langid_score=0.6  # Medium-high confidence
        ),
        score_field="language"
    )
)

# 3. Filter to English documents only.
# FastTextLangId stores results as a stringified "[score, 'CODE']" pair
# (e.g., "[0.95, 'EN']"), so parse the value before comparing codes.
@processing_stage(name="keep_english")
def filter_english(batch: DocumentBatch) -> DocumentBatch:
    import ast
    df = batch.data
    parsed = df["language"].apply(lambda v: ast.literal_eval(v) if isinstance(v, str) else v)
    df["lang_code"] = parsed.apply(lambda p: str(p[1]))
    df = df[df["lang_code"] == "EN"]
    return DocumentBatch(data=df, task_id=batch.task_id, dataset_name=batch.dataset_name)

pipeline.add_stage(filter_english)

# 4. Export filtered, high-quality English documents
pipeline.add_stage(JsonlWriter(path="curated_english/"))

# Execute pipeline (uses XennaExecutor by default)
results = pipeline.run()
print("Language management pipeline completed!")
```

**Expected workflow:**
1. Load multilingual JSONL documents
2. Detect language with 60% minimum confidence
3. Keep only English documents
4. Export high-quality English dataset
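For multilingual (rather than monolingual) curation, the same language tags can drive corpus balancing. A minimal stdlib sketch of downsampling over-represented languages (document shape and cap value are illustrative assumptions):

```python
from collections import defaultdict

def balance_by_language(docs, max_per_language=2):
    """Cap each language at max_per_language documents."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc["language"]].append(doc)
    balanced = []
    for lang, items in buckets.items():
        balanced.extend(items[:max_per_language])
    return balanced

docs = [
    {"text": "a", "language": "en"},
    {"text": "b", "language": "en"},
    {"text": "c", "language": "en"},
    {"text": "d", "language": "es"},
]
print(len(balance_by_language(docs)))  # 2 en + 1 es = 3
```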

## Troubleshooting

### Common Issues

**FastText model not found:**
```
FileNotFoundError: [Errno 2] No such file or directory: '/path/to/lid.176.bin'
```
**Solution:** Download the model from [FastText Language Identification](https://fasttext.cc/docs/en/language-identification.html) and provide the correct absolute path.

**Low detection accuracy:**
```
Many documents classified incorrectly
```
**Solution:**
- Increase `min_langid_score` to filter low-confidence predictions
- Ensure input text is clean (remove HTML tags, special characters)
- Check for very short documents (<50 chars) which are harder to classify
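Since very short documents are hard to classify, one option is to drop them before language identification. A minimal sketch (field name and threshold are assumptions, not Curator defaults):

```python
# Prefilter documents below a minimum character count before running
# language identification, since short fragments yield unreliable scores.
def long_enough(doc: dict, min_chars: int = 50) -> bool:
    return len(doc.get("text", "")) >= min_chars

docs = [{"text": "hi"}, {"text": "x" * 80}]
kept = [d for d in docs if long_enough(d)]
print(len(kept))  # 1
```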

## Available Tools
