Skip to content

Conversation

@arhamm1
Copy link
Contributor

@arhamm1 arhamm1 commented Nov 20, 2025

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Nov 20, 2025

Greptile Overview

Greptile Summary

This PR significantly expands the language management documentation with improved structure, detailed explanations, and practical code examples. The changes include:

  • Added comprehensive overview of language processing capabilities
  • Expanded prerequisites section with clearer resource requirements
  • Added multiple code examples demonstrating FastText language filtering
  • Added troubleshooting section for common issues

Issues found:

  • Incorrect import path: CommonCrawlWarcDownloader doesn't exist (should be CommonCrawlDownloadExtractStage)
  • Unused import: UrlFilter is imported but never used
  • Logic error in English filter example: The language field contains "[score, 'CODE']" format, not just the language code. The filter df['language'] == 'en' will never match.

Confidence Score: 2/5

  • Documentation-only changes that won't break runtime, but contain code examples that won't work as written
  • The PR improves documentation structure significantly, but contains an incorrect import path and a logic error in the English filter example that will cause code to not function as documented
  • docs/curate-text/process-data/language-management/index.md - lines 127-131 (incorrect import) and lines 180-183 (filter logic error)

Important Files Changed

File Analysis

Filename Score Overview
docs/curate-text/process-data/language-management/index.md 2/5 Documentation rewrite with improved structure but contains incorrect import path and incorrect usage example for filtering by language code

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant Pipeline as Pipeline
    participant JsonlReader as JsonlReader
    participant ScoreFilter as ScoreFilter
    participant FastTextLangId as FastTextLangId
    participant JsonlWriter as JsonlWriter

    User->>Pipeline: Create pipeline
    User->>Pipeline: add_stage(JsonlReader)
    User->>Pipeline: add_stage(ScoreFilter(FastTextLangId))
    User->>Pipeline: add_stage(JsonlWriter)
    User->>Pipeline: run()
    
    Pipeline->>JsonlReader: Read JSONL files
    JsonlReader-->>Pipeline: DocumentBatch
    
    Pipeline->>ScoreFilter: Process batch
    ScoreFilter->>FastTextLangId: score_document(text)
    FastTextLangId-->>ScoreFilter: "[score, 'LANG_CODE']"
    ScoreFilter->>FastTextLangId: keep_document(score)
    FastTextLangId-->>ScoreFilter: bool (score >= threshold)
    ScoreFilter-->>Pipeline: Filtered DocumentBatch
    
    Pipeline->>JsonlWriter: Write filtered results
    JsonlWriter-->>Pipeline: Complete
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (1)

  1. docs/curate-text/process-data/language-management/index.md, line 221-226 (link)

    style: Duplicate content - this Language Processing Capabilities section repeats information already covered in lines 24-42

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

1 file reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Content mostly looks good. I added some mostly minor requests.

arhamm1 and others added 3 commits November 20, 2025 16:40
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Arham Mehta <[email protected]>
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (1)

  1. docs/curate-text/process-data/language-management/index.md, line 1-21 (link)

    syntax: Duplicate front matter sections will cause documentation build errors. Remove the duplicate YAML front matter block (lines 13-21) and the duplicate reference label on line 23.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (1)

  1. docs/curate-text/process-data/language-management/index.md, line 1-23 (link)

    syntax: Duplicate metadata headers - lines 1-12 and 13-23 are identical. Remove one of them.

1 file reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Signed-off-by: Lawrence Lane <[email protected]>
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

Signed-off-by: Lawrence Lane <[email protected]>
@lbliii lbliii requested a review from sarahyurick November 25, 2025 16:27
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

Signed-off-by: Arham Mehta <[email protected]>
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +127 to +128
from nemo_curator.stages.text.download import CommonCrawlWarcDownloader
from nemo_curator.stages.text.filters import UrlFilter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syntax: CommonCrawlWarcDownloader does not exist. The class is CommonCrawlWARCDownloader (internal) or CommonCrawlDownloadExtractStage (public API). Also UrlFilter is imported but never used.

Suggested change
from nemo_curator.stages.text.download import CommonCrawlWarcDownloader
from nemo_curator.stages.text.filters import UrlFilter
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage

from nemo_curator.stages.text.filters import UrlFilter

# HTML extraction automatically uses CLD2 for language hints
pipeline.add_stage(CommonCrawlWarcDownloader(...))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syntax: Use CommonCrawlDownloadExtractStage with required parameters:

Suggested change
pipeline.add_stage(CommonCrawlWarcDownloader(...))
pipeline.add_stage(CommonCrawlDownloadExtractStage(
start_snapshot="2024-01",
end_snapshot="2024-01",
download_dir="/tmp/cc_downloads"
))

Comment on lines +180 to +183
def filter_english(batch: DocumentBatch) -> DocumentBatch:
df = batch.data
df = df[df['language'] == 'en']
return DocumentBatch(data=df, task_id=batch.task_id, dataset_name=batch.dataset_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: This filter won't work. FastTextLangId stores results as "[score, 'CODE']" (e.g., "[0.95, 'EN']"), not just the language code. The language code is also uppercase. See language.md for the correct pattern using ast.literal_eval.

Suggested change
def filter_english(batch: DocumentBatch) -> DocumentBatch:
df = batch.data
df = df[df['language'] == 'en']
return DocumentBatch(data=df, task_id=batch.task_id, dataset_name=batch.dataset_name)
@processing_stage(name="keep_english")
def filter_english(batch: DocumentBatch) -> DocumentBatch:
import ast
df = batch.data
parsed = df["language"].apply(lambda v: ast.literal_eval(v) if isinstance(v, str) else v)
df["lang_code"] = parsed.apply(lambda p: str(p[1]))
df = df[df['lang_code'] == 'EN']
return DocumentBatch(data=df, task_id=batch.task_id, dataset_name=batch.dataset_name)

Comment on lines +63 to +67
- Use shared filesystem, network storage, or object storage (S3, GCS, etc.)

### Installation Dependencies

## Basic Language Filtering
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Use shared filesystem, network storage, or object storage (S3, GCS, etc.)
### Installation Dependencies
## Basic Language Filtering
- Use shared filesystem, network storage, or object storage (S3, GCS, etc.)
## Basic Language Filtering

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove empty section header.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants