@huvunvidia huvunvidia commented Nov 24, 2025

📄 Pages to Review

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Updated section title to indicate review process.

Signed-off-by: Huy Vu <[email protected]>

greptile-apps bot commented Nov 24, 2025

Greptile Overview

Greptile Summary

This PR implements documentation improvements across video concepts and text processing sections, enhancing clarity, consistency, and completeness. Most changes are minor editorial improvements including wording clarifications, grammar fixes, markdown link formatting, and added explanatory content for parameters and workflows.

Key improvements:

  • Enhanced video concept documentation with clearer explanations of streaming mode and data flow
  • Added valuable guidance on embedding model selection (avoiding decoder-only LLMs)
  • Improved code filtering documentation with additional language support and parameter clarifications
  • Converted raw URLs to proper markdown links for better documentation formatting
  • Fixed frontmatter formatting and whitespace issues

Critical issue:

  • Two debugging comments (## NEED FIX:) were left in production documentation files and must be removed before merge

Confidence Score: 3/5

  • PR cannot be merged with debugging comments present in production docs
  • Score reflects high-quality editorial improvements throughout, offset by two blocking issues: debugging comments with "NEED FIX" markers were left in code.md and index.md and must be removed
  • Pay close attention to docs/curate-text/process-data/specialized-processing/code.md and docs/curate-text/process-data/specialized-processing/index.md - both contain debugging comments that must be removed
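Leftover markers like these can be caught mechanically before merge. The following is a minimal sketch, assuming the marker always starts a line in the form `## NEED FIX:`. The function names `find_debug_markers` and `scan_docs` and the `docs` root are hypothetical and are not part of this repository's tooling.

```python
import re
from pathlib import Path

# Assumption: a leftover marker is a line starting with one or more '#'
# followed by "NEED FIX:", as in the two lines flagged in this review.
MARKER = re.compile(r"^\s*#+\s*NEED FIX:")


def find_debug_markers(text):
    """Return (line_number, stripped_line) pairs for lines that carry the marker."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if MARKER.match(line)
    ]


def scan_docs(root):
    """Map each Markdown file under root to the markers found in it."""
    hits = {}
    for path in sorted(Path(root).rglob("*.md")):
        found = find_debug_markers(path.read_text(encoding="utf-8"))
        if found:
            hits[str(path)] = found
    return hits


# Small in-memory example; the error text is shortened for illustration.
sample = "pipeline.add_stage(ScoreFilter(\n## NEED FIX: TypeError: example\n)\n"
print(find_debug_markers(sample))
```

Calling `scan_docs("docs")` from the repository root would report any file that still contains a marker; a non-empty result could be used to fail a pre-merge check.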

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| docs/about/concepts/video/data-flow.md | 5/5 | Enhanced explanations of streaming mode, writer outputs, and data flow |
| docs/curate-text/process-data/deduplication/semdedup.md | 5/5 | Added embedding model guidance, improved output descriptions, fixed code block syntax |
| docs/curate-text/process-data/specialized-processing/code.md | 2/5 | Added language support, parameter clarity, imports - but contains debug comments that must be removed |
| docs/curate-text/process-data/specialized-processing/index.md | 2/5 | Fixed heading capitalization but contains debug comment that must be removed |

Sequence Diagram

sequenceDiagram
    participant Author as Documentation Author
    participant PR as Pull Request
    participant Review as Review Process
    participant Docs as Documentation System
    
    Author->>PR: Submit doc improvements
    Note over PR: 14 files changed<br/>Video concepts + Text processing
    
    PR->>Review: Trigger review
    Review->>Docs: Check video/architecture.md
    Docs-->>Review: ✓ Wording improvements
    
    Review->>Docs: Check video/data-flow.md
    Docs-->>Review: ✓ Enhanced explanations
    
    Review->>Docs: Check deduplication docs
    Docs-->>Review: ✓ Frontmatter fix, clarifications
    
    Review->>Docs: Check quality-assessment docs
    Docs-->>Review: ✓ URL formatting, improvements
    
    Review->>Docs: Check specialized-processing/code.md
    Docs-->>Review: ⚠️ Debug comment found (line 48)
    
    Review->>Docs: Check specialized-processing/index.md
    Docs-->>Review: ⚠️ Debug comment found (line 79)
    
    Review->>PR: Report findings
    Note over PR: 2 critical issues<br/>Must remove debug comments
    
    PR-->>Author: Request changes

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 1 comment


@greptile-apps greptile-apps bot left a comment

1 file reviewed, 1 comment


copy-pr-bot bot commented Nov 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


@greptile-apps greptile-apps bot left a comment

9 files reviewed, 2 comments


@huvunvidia huvunvidia changed the title from "[Doc Review 25.09] Huy - Video Concepts + Text Advanced (17 pages)" to "[Doc Review 25.09] Huy - Video Concepts + Text Advanced" on Nov 26, 2025

@greptile-apps greptile-apps bot left a comment

14 files reviewed, no comments


@greptile-apps greptile-apps bot left a comment

15 files reviewed, 3 comments


@greptile-apps greptile-apps bot left a comment

15 files reviewed, 2 comments


@greptile-apps greptile-apps bot left a comment

15 files reviewed, 6 comments


@greptile-apps greptile-apps bot left a comment

14 files reviewed, 1 comment


@huvunvidia huvunvidia commented

/ok to test 93ad6f6


@greptile-apps greptile-apps bot left a comment

14 files reviewed, no comments


@greptile-apps greptile-apps bot left a comment

14 files reviewed, 2 comments



# Add filter stages for code quality
pipeline.add_stage(ScoreFilter(
## NEED FIX: TypeError: ScoreFilter.__init__() got an unexpected keyword argument 'score_fn'

syntax: debugging comment left in production docs

Suggested change (delete this line):
## NEED FIX: TypeError: ScoreFilter.__init__() got an unexpected keyword argument 'score_fn'

)
])

## NEED FIX: NameError: name 'code_dataset' is not defined

syntax: debugging comment left in production docs

Suggested change (delete this line):
## NEED FIX: NameError: name 'code_dataset' is not defined


@sarahyurick sarahyurick left a comment

Thanks! There seems to be a lot of unnecessary repetition in the deduplication pages; maybe it can be cut down?

@@ -1,3 +1,4 @@
---
description: "Identify and remove exact duplicates using MD5 hashing in a Ray-based workflow"

I personally haven't tested TextDuplicatesRemovalWorkflow with exact deduplication. Were you able to verify it?
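For reference, the idea named in the page description (exact duplicates identified by MD5 hashing) can be sketched in plain Python. This is a stand-alone illustration built on the standard-library `hashlib` module. It is not the `TextDuplicatesRemovalWorkflow` API and does not reproduce the Ray-based workflow; the function names and sample documents are hypothetical.

```python
import hashlib


def md5_of(text):
    """Return the hex MD5 digest of a document's UTF-8 bytes."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def remove_exact_duplicates(documents):
    """Keep the first document seen for each distinct digest, preserving order."""
    seen = set()
    unique = []
    for doc in documents:
        digest = md5_of(doc)
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Hypothetical sample: the first and third documents are byte-identical.
docs = ["def f(): pass", "print('hi')", "def f(): pass"]
print(remove_exact_duplicates(docs))
```

Only byte-identical texts share a digest here, so documents that differ by whitespace or casing are kept as distinct.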

@@ -1,3 +1,4 @@
---
description: "Identify and remove exact duplicates using MD5 hashing in a Ray-based workflow"
categories: ["how-to-guides"]

This page still needs more updates imo.


Have we confirmed cloud storage works with the deduplication modules?

@huvunvidia (author) replied:

I haven't tested with cloud storage because it requires more complicated settings.
However, I discussed this with Lawrence Lane: since cloud storage is not exclusive to Dedup but applies to text-curator in general, it is not needed in the Dedup docs. Leaving it out keeps the Dedup docs more concise and to the point.
