Update language-management/index.md #1260

arhamm1 · 2025-11-20T20:53:19Z

Description

Usage

# Add snippet demonstrating usage

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Arham Mehta <[email protected]>

greptile-apps · 2025-11-20T20:54:27Z

Greptile Overview

Greptile Summary

This PR significantly expands the language management documentation with improved structure, detailed explanations, and practical code examples. The changes include:

Added comprehensive overview of language processing capabilities
Expanded prerequisites section with clearer resource requirements
Added multiple code examples demonstrating FastText language filtering
Added troubleshooting section for common issues

Issues found:

Incorrect import path: CommonCrawlWarcDownloader doesn't exist (should be CommonCrawlDownloadExtractStage)
Unused import: UrlFilter is imported but never used
Logic error in English filter example: The language field contains "[score, 'CODE']" format, not just the language code. The filter df['language'] == 'en' will never match.

Confidence Score: 2/5

Documentation-only changes that won't break runtime, but contain code examples that won't work as written
The PR improves documentation structure significantly, but contains an incorrect import path and a logic error in the English filter example that will cause code to not function as documented
docs/curate-text/process-data/language-management/index.md - lines 127-131 (incorrect import) and lines 180-183 (filter logic error)

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| docs/curate-text/process-data/language-management/index.md | 2/5 | Documentation rewrite with improved structure but contains incorrect import path and incorrect usage example for filtering by language code |

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant Pipeline as Pipeline
    participant JsonlReader as JsonlReader
    participant ScoreFilter as ScoreFilter
    participant FastTextLangId as FastTextLangId
    participant JsonlWriter as JsonlWriter

    User->>Pipeline: Create pipeline
    User->>Pipeline: add_stage(JsonlReader)
    User->>Pipeline: add_stage(ScoreFilter(FastTextLangId))
    User->>Pipeline: add_stage(JsonlWriter)
    User->>Pipeline: run()
    
    Pipeline->>JsonlReader: Read JSONL files
    JsonlReader-->>Pipeline: DocumentBatch
    
    Pipeline->>ScoreFilter: Process batch
    ScoreFilter->>FastTextLangId: score_document(text)
    FastTextLangId-->>ScoreFilter: "[score, 'LANG_CODE']"
    ScoreFilter->>FastTextLangId: keep_document(score)
    FastTextLangId-->>ScoreFilter: bool (score >= threshold)
    ScoreFilter-->>Pipeline: Filtered DocumentBatch
    
    Pipeline->>JsonlWriter: Write filtered results
    JsonlWriter-->>Pipeline: Complete
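
The score-then-keep handshake in the diagram can be sketched in plain Python. This is a toy illustration only: `ToyLangId` and `score_filter` are hypothetical stand-ins, not the NeMo Curator classes, and the ASCII check merely fakes a model prediction. The only things taken from the diagram are the two method names, the `"[score, 'LANG_CODE']"` string format, and the `score >= threshold` decision.

```python
import ast


class ToyLangId:
    """Hypothetical stand-in for a language-ID filter (not the real FastTextLangId)."""

    def __init__(self, min_score=0.3):
        self.min_score = min_score

    def score_document(self, text):
        # A real filter would query a FastText model; this fakes a prediction.
        is_ascii = all(ord(ch) < 128 for ch in text)
        result = [0.9, "EN"] if is_ascii else [0.2, "UNKNOWN"]
        # Serialize as "[score, 'LANG_CODE']", the format shown in the diagram.
        return str(result)

    def keep_document(self, score):
        # Parse the serialized pair and compare the confidence to the threshold.
        confidence, _code = ast.literal_eval(score)
        return confidence >= self.min_score


def score_filter(docs, filter_obj, score_field="language"):
    """Score each document, keep those that pass, and store the score string."""
    kept = []
    for doc in docs:
        score = filter_obj.score_document(doc["text"])
        if filter_obj.keep_document(score):
            kept.append({**doc, score_field: score})
    return kept


docs = [{"text": "hello world"}, {"text": "h\u00e9llo"}]
filtered = score_filter(docs, ToyLangId(min_score=0.3))
```

Note that the value stored in the output field is the whole serialized pair, which is why a later comparison against a bare language code cannot match.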

greptile-apps

Additional Comments (1)

docs/curate-text/process-data/language-management/index.md, line 221-226 (link)

style: Duplicate content - this Language Processing Capabilities section repeats information already covered in lines 24-42

1 file reviewed, 3 comments

docs/curate-text/process-data/language-management/index.md

sarahyurick

Content mostly looks good. I added some mostly minor requests.

docs/curate-text/process-data/language-management/index.md

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Arham Mehta <[email protected]>

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Arham Mehta <[email protected]>

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Arham Mehta <[email protected]>

greptile-apps

Additional Comments (1)

docs/curate-text/process-data/language-management/index.md, line 1-21 (link)

syntax: Duplicate front matter sections will cause documentation build errors. Remove the duplicate YAML front matter block (lines 13-21) and the duplicate reference label on line 23.
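
A quick way to catch this kind of duplication before a docs build is to count the `---`-delimited blocks at the very top of the file. The helper below is a heuristic sketch, not part of any documentation toolchain: the function name is hypothetical, it assumes front matter is delimited by lines containing only `---`, and it only counts blocks that sit back to back (blank lines allowed between them), so a label or other text between two blocks would stop the count.

```python
def count_leading_front_matter_blocks(markdown_text):
    """Count consecutive '---'-delimited blocks at the top of a Markdown file."""
    lines = markdown_text.splitlines()
    i = 0
    blocks = 0
    while True:
        # Skip blank lines before the next candidate block.
        while i < len(lines) and not lines[i].strip():
            i += 1
        # Stop when the next non-blank line is not an opening delimiter.
        if i >= len(lines) or lines[i].strip() != "---":
            return blocks
        # Look for the matching closing delimiter.
        j = i + 1
        while j < len(lines) and lines[j].strip() != "---":
            j += 1
        if j >= len(lines):
            return blocks  # Unterminated block: do not count it.
        blocks += 1
        i = j + 1


duplicated = "---\ntitle: a\n---\n---\ntitle: a\n---\n\n# Heading\n"
clean = "---\ntitle: a\n---\n# Heading\n"
```

A result greater than 1 suggests the file carries more than one front matter block and is worth checking by hand.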

1 file reviewed, 1 comment

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: L.B. <[email protected]>

greptile-apps

Additional Comments (1)

docs/curate-text/process-data/language-management/index.md, line 1-23 (link)

syntax: Duplicate metadata headers - lines 1-12 and 13-23 are identical. Remove one of them.

1 file reviewed, 2 comments

docs/curate-text/process-data/language-management/index.md

Signed-off-by: Lawrence Lane <[email protected]>

greptile-apps

1 file reviewed, no comments

Signed-off-by: Lawrence Lane <[email protected]>

greptile-apps

1 file reviewed, no comments

Signed-off-by: Arham Mehta <[email protected]>

greptile-apps

1 file reviewed, 3 comments

greptile-apps · 2025-12-01T18:42:12Z

docs/curate-text/process-data/language-management/index.md

+from nemo_curator.stages.text.download import CommonCrawlWarcDownloader
+from nemo_curator.stages.text.filters import UrlFilter


syntax: CommonCrawlWarcDownloader does not exist. The class is CommonCrawlWARCDownloader (internal) or CommonCrawlDownloadExtractStage (public API). Also UrlFilter is imported but never used.

Suggested change

-from nemo_curator.stages.text.download import CommonCrawlWarcDownloader
-from nemo_curator.stages.text.filters import UrlFilter
+from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage

greptile-apps · 2025-12-01T18:42:13Z

docs/curate-text/process-data/language-management/index.md

+from nemo_curator.stages.text.filters import UrlFilter
+
+# HTML extraction automatically uses CLD2 for language hints
+pipeline.add_stage(CommonCrawlWarcDownloader(...))


syntax: Use CommonCrawlDownloadExtractStage with required parameters:

Suggested change

-pipeline.add_stage(CommonCrawlWarcDownloader(...))
+pipeline.add_stage(CommonCrawlDownloadExtractStage(
+    start_snapshot="2024-01",
+    end_snapshot="2024-01",
+    download_dir="/tmp/cc_downloads"
+))

greptile-apps · 2025-12-01T18:42:13Z

docs/curate-text/process-data/language-management/index.md

+def filter_english(batch: DocumentBatch) -> DocumentBatch:
+    df = batch.data
+    df = df[df['language'] == 'en']
+    return DocumentBatch(data=df, task_id=batch.task_id, dataset_name=batch.dataset_name)


logic: This filter won't work. FastTextLangId stores results as "[score, 'CODE']" (e.g., "[0.95, 'EN']"), not just the language code. The language code is also uppercase. See language.md for the correct pattern using ast.literal_eval.

Suggested change

-def filter_english(batch: DocumentBatch) -> DocumentBatch:
-    df = batch.data
-    df = df[df['language'] == 'en']
-    return DocumentBatch(data=df, task_id=batch.task_id, dataset_name=batch.dataset_name)
+@processing_stage(name="keep_english")
+def filter_english(batch: DocumentBatch) -> DocumentBatch:
+    import ast
+    df = batch.data
+    parsed = df["language"].apply(lambda v: ast.literal_eval(v) if isinstance(v, str) else v)
+    df["lang_code"] = parsed.apply(lambda p: str(p[1]))
+    df = df[df['lang_code'] == 'EN']
+    return DocumentBatch(data=df, task_id=batch.task_id, dataset_name=batch.dataset_name)
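
The parsing step in this suggestion can be tried in isolation, without NeMo Curator or pandas. The sketch below is illustrative only: the sample rows and the `parse_language` helper are made up for the example, and the `"[score, 'CODE']"` cell format is taken from the review comment rather than verified against the library. It shows why the original `== 'en'` comparison matches nothing and how `ast.literal_eval` recovers the score and the uppercase code.

```python
import ast


def parse_language(value):
    """Return (score, code) from a "[score, 'CODE']" string or an already-parsed pair."""
    pair = ast.literal_eval(value) if isinstance(value, str) else value
    return float(pair[0]), str(pair[1])


# Hypothetical rows mimicking the stored format described in the comment.
rows = [
    {"text": "Hello world", "language": "[0.95, 'EN']"},
    {"text": "Bonjour le monde", "language": "[0.91, 'FR']"},
    {"text": "Good morning", "language": "[0.88, 'EN']"},
]

# The original comparison: no cell equals the bare string 'en'.
naive = [row for row in rows if row["language"] == "en"]

# Parse each cell first, then compare against the uppercase code.
english = [row for row in rows if parse_language(row["language"])[1] == "EN"]
```

`ast.literal_eval` is used rather than `eval` because it only accepts Python literals, so a malformed or hostile cell raises an error instead of executing code.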

sarahyurick · 2025-12-01T19:23:10Z

docs/curate-text/process-data/language-management/index.md

+  - Use shared filesystem, network storage, or object storage (S3, GCS, etc.)
+
+### Installation Dependencies
+
+## Basic Language Filtering


Suggested change

-  - Use shared filesystem, network storage, or object storage (S3, GCS, etc.)
-
-### Installation Dependencies
-
-## Basic Language Filtering
+  - Use shared filesystem, network storage, or object storage (S3, GCS, etc.)
+
+## Basic Language Filtering

Remove empty section header.

Update language-management/index.md

0c745d3

Signed-off-by: Arham Mehta <[email protected]>

arhamm1 requested review from lbliii and sarahyurick November 20, 2025 20:53

greptile-apps bot reviewed Nov 20, 2025

docs/curate-text/process-data/language-management/index.md (resolved)

docs/curate-text/process-data/language-management/index.md (outdated, resolved)

sarahyurick requested changes Nov 20, 2025

arhamm1 and others added 3 commits November 20, 2025 16:40

Update docs/curate-text/process-data/language-management/index.md

737e274

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Arham Mehta <[email protected]>

Update docs/curate-text/process-data/language-management/index.md

ac00ea6

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Arham Mehta <[email protected]>

Update docs/curate-text/process-data/language-management/index.md

cdfc53a

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Arham Mehta <[email protected]>

greptile-apps bot reviewed Nov 21, 2025

lbliii and others added 2 commits November 25, 2025 11:15

Apply suggestion from @sarahyurick

5fd09b0

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: L.B. <[email protected]>

Merge branch 'main' into arhamm1-patch-12

4345024

greptile-apps bot reviewed Nov 25, 2025

docs/curate-text/process-data/language-management/index.md (outdated, resolved)

feedback

1426d5f

Signed-off-by: Lawrence Lane <[email protected]>

greptile-apps bot reviewed Nov 25, 2025

lbliii approved these changes Nov 25, 2025

updates

fe77c2b

Signed-off-by: Lawrence Lane <[email protected]>

lbliii requested a review from sarahyurick November 25, 2025 16:27

greptile-apps bot reviewed Nov 25, 2025

Update index.md

b28a74b

Signed-off-by: Arham Mehta <[email protected]>

copy-pr-bot bot temporarily deployed to test December 1, 2025 18:39 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci December 1, 2025 18:39 Inactive

greptile-apps bot reviewed Dec 1, 2025

copy-pr-bot bot temporarily deployed to nemo-ci December 1, 2025 18:56 Inactive

sarahyurick reviewed Dec 1, 2025

 from nemo_curator.stages.text.download import CommonCrawlWarcDownloader
 from nemo_curator.stages.text.filters import UrlFilter

-pipeline.add_stage(CommonCrawlWarcDownloader(...))
+pipeline.add_stage(CommonCrawlDownloadExtractStage(
+    start_snapshot="2024-01",
+    end_snapshot="2024-01",
+    download_dir="/tmp/cc_downloads"
+))
