Update language-management/index.md #1260
base: main
Signed-off-by: Arham Mehta <[email protected]>
Greptile Overview
Greptile Summary
This PR significantly expands the language management documentation with improved structure, detailed explanations, and practical code examples. The changes include:
Issues found:
Confidence Score: 2/5
Important Files Changed
File Analysis
Sequence Diagram
sequenceDiagram
participant User as User Code
participant Pipeline as Pipeline
participant JsonlReader as JsonlReader
participant ScoreFilter as ScoreFilter
participant FastTextLangId as FastTextLangId
participant JsonlWriter as JsonlWriter
User->>Pipeline: Create pipeline
User->>Pipeline: add_stage(JsonlReader)
User->>Pipeline: add_stage(ScoreFilter(FastTextLangId))
User->>Pipeline: add_stage(JsonlWriter)
User->>Pipeline: run()
Pipeline->>JsonlReader: Read JSONL files
JsonlReader-->>Pipeline: DocumentBatch
Pipeline->>ScoreFilter: Process batch
ScoreFilter->>FastTextLangId: score_document(text)
FastTextLangId-->>ScoreFilter: "[score, 'LANG_CODE']"
ScoreFilter->>FastTextLangId: keep_document(score)
FastTextLangId-->>ScoreFilter: bool (score >= threshold)
ScoreFilter-->>Pipeline: Filtered DocumentBatch
Pipeline->>JsonlWriter: Write filtered results
JsonlWriter-->>Pipeline: Complete
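
The score/keep handshake in the diagram can be sketched in plain Python. This is a minimal sketch and not the NeMo Curator implementation: `ToyLangId`, `score_filter`, and the `min_langid_score` parameter name are hypothetical, and the ASCII check stands in for a real FastText model. It assumes only what the diagram states: scoring returns the string form of `[score, 'LANG_CODE']`, and the keep decision is `score >= threshold`.

```python
import ast
from dataclasses import dataclass


@dataclass
class ToyLangId:
    """Hypothetical stand-in for a language-ID filter (not the real class)."""

    min_langid_score: float = 0.3  # assumed name for the threshold

    def score_document(self, text: str) -> str:
        # A real filter would query a FastText model here. This toy rule only
        # checks for ASCII so the example runs without a model file.
        is_ascii = all(ord(ch) < 128 for ch in text)
        score, code = (0.9, "EN") if is_ascii else (0.2, "UNKNOWN")
        # Mirror the diagram: the result is the string form of [score, 'CODE'].
        return str([score, code])

    def keep_document(self, score: str) -> bool:
        # Parse the stringified list and compare the confidence to the threshold.
        confidence, _code = ast.literal_eval(score)
        return confidence >= self.min_langid_score


def score_filter(texts, lang_filter):
    """Score each text, then keep only those the filter accepts."""
    kept = []
    for text in texts:
        result = lang_filter.score_document(text)
        if lang_filter.keep_document(result):
            kept.append({"text": text, "language": result})
    return kept


kept = score_filter(["hello world", "\u4f60\u597d"], ToyLangId())
print(kept)
```

Because the stored value is a string rather than a bare language code, any downstream stage that filters on the language has to parse it first.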
Additional Comments (1)
- docs/curate-text/process-data/language-management/index.md, lines 221-226. style: Duplicate content. This Language Processing Capabilities section repeats information already covered in lines 24-42.
1 file reviewed, 3 comments
sarahyurick left a comment
Content mostly looks good. I added some mostly minor requests.
Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Arham Mehta <[email protected]>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Arham Mehta <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Arham Mehta <[email protected]>
Additional Comments (1)
- docs/curate-text/process-data/language-management/index.md, lines 1-21. syntax: Duplicate front matter sections will cause documentation build errors. Remove the duplicate YAML front matter block (lines 13-21) and the duplicate reference label on line 23.
1 file reviewed, 1 comment
Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: L.B. <[email protected]>
Additional Comments (1)
- docs/curate-text/process-data/language-management/index.md, lines 1-23. syntax: Duplicate metadata headers. Lines 1-12 and 13-23 are identical. Remove one of them.
1 file reviewed, 2 comments
Signed-off-by: Lawrence Lane <[email protected]>
1 file reviewed, no comments
Signed-off-by: Lawrence Lane <[email protected]>
1 file reviewed, no comments
Signed-off-by: Arham Mehta <[email protected]>
1 file reviewed, 3 comments
    from nemo_curator.stages.text.download import CommonCrawlWarcDownloader
    from nemo_curator.stages.text.filters import UrlFilter
syntax: CommonCrawlWarcDownloader does not exist. The class is CommonCrawlWARCDownloader (internal) or CommonCrawlDownloadExtractStage (public API). Also UrlFilter is imported but never used.
Suggested change:
- from nemo_curator.stages.text.download import CommonCrawlWarcDownloader
- from nemo_curator.stages.text.filters import UrlFilter
+ from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage
    from nemo_curator.stages.text.filters import UrlFilter

    # HTML extraction automatically uses CLD2 for language hints
    pipeline.add_stage(CommonCrawlWarcDownloader(...))
syntax: Use CommonCrawlDownloadExtractStage with required parameters:
Suggested change:
- pipeline.add_stage(CommonCrawlWarcDownloader(...))
+ pipeline.add_stage(CommonCrawlDownloadExtractStage(
+     start_snapshot="2024-01",
+     end_snapshot="2024-01",
+     download_dir="/tmp/cc_downloads"
+ ))
    def filter_english(batch: DocumentBatch) -> DocumentBatch:
        df = batch.data
        df = df[df['language'] == 'en']
        return DocumentBatch(data=df, task_id=batch.task_id, dataset_name=batch.dataset_name)
logic: This filter won't work. FastTextLangId stores results as "[score, 'CODE']" (e.g., "[0.95, 'EN']"), not just the language code. The language code is also uppercase. See language.md for the correct pattern using ast.literal_eval.
Suggested change:
- def filter_english(batch: DocumentBatch) -> DocumentBatch:
-     df = batch.data
-     df = df[df['language'] == 'en']
-     return DocumentBatch(data=df, task_id=batch.task_id, dataset_name=batch.dataset_name)
+ @processing_stage(name="keep_english")
+ def filter_english(batch: DocumentBatch) -> DocumentBatch:
+     import ast
+     df = batch.data
+     parsed = df["language"].apply(lambda v: ast.literal_eval(v) if isinstance(v, str) else v)
+     df["lang_code"] = parsed.apply(lambda p: str(p[1]))
+     df = df[df['lang_code'] == 'EN']
+     return DocumentBatch(data=df, task_id=batch.task_id, dataset_name=batch.dataset_name)
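
To see why the original comparison matches nothing, the parsing step can be tried in isolation. The snippet below is a standalone sketch that uses only the standard library; `parse_language`, `keep_language`, and the sample rows are hypothetical and are not part of NeMo Curator. It assumes the stored format described in the comment above, a string such as "[0.95, 'EN']", and also accepts a value that is already a list.

```python
import ast


def parse_language(value):
    """Return (score, code) from "[0.95, 'EN']" or an already-parsed pair."""
    parsed = ast.literal_eval(value) if isinstance(value, str) else value
    score, code = parsed
    return float(score), str(code)


def keep_language(rows, wanted="EN", field="language"):
    """Keep rows whose parsed language code equals `wanted` (uppercase)."""
    return [row for row in rows if parse_language(row[field])[1] == wanted]


# Hypothetical sample records in the stored format.
rows = [
    {"text": "Hello there", "language": "[0.95, 'EN']"},
    {"text": "Bonjour", "language": "[0.91, 'FR']"},
    {"text": "Good morning", "language": [0.88, "EN"]},
]

# The naive check from the original snippet matches no row at all.
naive = [row for row in rows if row["language"] == "en"]
english = keep_language(rows)

print(len(naive))                         # 0
print([row["text"] for row in english])   # ['Hello there', 'Good morning']
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for parsing values read back from a dataset.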
    - Use shared filesystem, network storage, or object storage (S3, GCS, etc.)

    ### Installation Dependencies

    ## Basic Language Filtering
Suggested change:
- - Use shared filesystem, network storage, or object storage (S3, GCS, etc.)
- ### Installation Dependencies
- ## Basic Language Filtering
+ - Use shared filesystem, network storage, or object storage (S3, GCS, etc.)
+ ## Basic Language Filtering
Remove empty section header.
Description
Usage
# Add snippet demonstrating usage
Checklist