Tutorials showing how to add a SentenceTransformer model and modify GPU resources for the Embedding Generator #1266
Conversation
Signed-off-by: Praateek <[email protected]>
Greptile Summary

This PR adds three educational tutorials demonstrating advanced embedding generation techniques: GPU resource optimization through fractional GPU allocation (achieving a 28% speedup), a workaround implementation for SentenceTransformer models, and a comprehensive step-by-step semantic deduplication workflow.

Key Changes:
Issues Found:
Confidence Score: 4/5
Important Files Changed

File Analysis
Sequence Diagram

    sequenceDiagram
        participant User
        participant Pipeline
        participant EmbeddingCreatorStage
        participant TokenizerStage
        participant ModelStage
        participant GPU
        Note over User,GPU: Tutorial 1: Fast Embedding Generation
        User->>Pipeline: Configure with Resources(gpus=0.5)
        User->>Pipeline: Run with RayDataExecutor
        Pipeline->>EmbeddingCreatorStage: Process documents
        EmbeddingCreatorStage->>GPU: Schedule 2 actors per GPU
        GPU-->>User: Complete in 153s vs 213s (28% speedup)
        Note over User,GPU: Tutorial 2: SentenceTransformer Integration
        User->>EmbeddingCreatorStage: Create custom stage
        EmbeddingCreatorStage->>TokenizerStage: Tokenize text
        TokenizerStage-->>EmbeddingCreatorStage: Return input_ids, attention_mask
        EmbeddingCreatorStage->>ModelStage: SentenceTransformerEmbeddingModelStage
        ModelStage->>ModelStage: setup() loads SentenceTransformer
        ModelStage->>GPU: Run inference with unpack_inference_batch=False
        GPU-->>ModelStage: Return outputs["sentence_embedding"]
        ModelStage-->>User: Generate embeddings
        Note over User,GPU: Tutorial 3: Semantic Deduplication Workflow
        User->>Pipeline: Step 1: Create ID Generator
        User->>Pipeline: Step 2: Generate Embeddings
        Pipeline->>EmbeddingCreatorStage: Process with SentenceTransformer
        EmbeddingCreatorStage-->>User: Write embeddings to cache
        User->>Pipeline: Step 3: K-Means + Pairwise Similarity
        Pipeline-->>User: Cluster and compute similarities
        User->>Pipeline: Step 4: Identify Duplicates (eps=0.1)
        Pipeline-->>User: Find 583,721 duplicates (27.54%)
        User->>Pipeline: Step 5: Remove Duplicates
        Pipeline-->>User: Output 1,535,998 deduplicated rows
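The duplicate counts quoted in the Tutorial 3 steps above are internally consistent; a quick sanity check in pure Python, using only the numbers from the diagram:

```python
# Numbers quoted in the Tutorial 3 sequence diagram above.
duplicates = 583_721  # rows flagged as duplicates at eps=0.1
kept = 1_535_998      # deduplicated rows written out

total = duplicates + kept                 # rows entering deduplication
pct = round(100 * duplicates / total, 2)  # duplicate fraction as a percent

print(total)  # 2119719
print(pct)    # 27.54 -- matches the 27.54% reported in the diagram
```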
2 files reviewed, 1 comment
        model_inference_batch_size: int = 1024,
        has_seq_order: bool = True,
        padding_side: Literal["left", "right"] = "right",
        autocast: bool = True,
syntax: typo in markdown cell: "composte" should be "composite"
Signed-off-by: Praateek <[email protected]>
@sarahyurick / @ayushdg I'm okay either way on merging this; needed this tutorial to show the customer how to do it if they have to.
2 files reviewed, 2 comments
        hf_token=hf_token,
        model_inference_batch_size=model_inference_batch_size,
        has_seq_order=has_seq_order,
        padding_side=padding_side,
        autocast=autocast,
logic: SentenceTransformer doesn't return a dict with sentence_embedding key when called directly. The parent ModelStage._model_forward passes the batch as self.model(model_input_batch) (without unpacking), but SentenceTransformer doesn't accept input_ids/attention_mask directly. Need to override _model_forward to use the .encode() method instead.
Suggested change:

            hf_token=hf_token,
            model_inference_batch_size=model_inference_batch_size,
            has_seq_order=has_seq_order,
            padding_side=padding_side,
            autocast=autocast,

        def _model_forward(self, model_input_batch: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
            """Forward pass through SentenceTransformer model."""
            # SentenceTransformer.encode expects input_ids and attention_mask as dict
            embeddings = self.model.encode(
                model_input_batch,
                convert_to_tensor=True,
                show_progress_bar=False,
            )
            return {"sentence_embedding": embeddings}

        def process_model_output(
            self,
            outputs: dict[str, torch.Tensor],
            model_input_batch: dict[str, torch.Tensor] | None = None,  # noqa: ARG002
        ) -> torch.Tensor:
            return outputs["sentence_embedding"].cpu()
    class SentenceTransformerEmbeddingModelStage(EmbeddingModelStage):
        def __init__(  # noqa: PLR0913
style: setting unpack_inference_batch = False causes the parent to call self.model(model_input_batch) where model_input_batch is a dict. This relies on SentenceTransformer.__call__ accepting a dict, which is non-standard. The standard API is .encode() or .forward() with unpacked kwargs. Consider documenting this behavior or using the standard API.
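One way to reconcile the two APIs is a thin adapter that hides `.encode()` behind the dict-in/dict-out call the parent stage performs. The sketch below is pure Python with a stand-in model; the class names `EncodeToDictAdapter` and `FakeSentenceTransformer` are hypothetical, not part of the codebase, and it only illustrates the assumed contract: the stage calls `self.model(model_input_batch)` and reads `outputs["sentence_embedding"]`.

```python
class FakeSentenceTransformer:
    """Stand-in for a model that only exposes .encode():
    returns one fixed-size vector per input sequence."""

    def encode(self, batch: dict) -> list[list[float]]:
        return [[0.0, 1.0, 2.0] for _ in batch["input_ids"]]


class EncodeToDictAdapter:
    """Hypothetical adapter: makes an encode-only model satisfy the
    'callable with a tokenized dict, returns a dict holding a
    sentence_embedding key' contract the parent ModelStage relies on
    when unpack_inference_batch=False."""

    def __init__(self, model: FakeSentenceTransformer):
        self.model = model

    def __call__(self, model_input_batch: dict) -> dict:
        return {"sentence_embedding": self.model.encode(model_input_batch)}


model = EncodeToDictAdapter(FakeSentenceTransformer())
out = model({"input_ids": [[1, 2], [3, 4]], "attention_mask": [[1, 1], [1, 1]]})
print(len(out["sentence_embedding"]))  # one embedding per input sequence: 2
```

With a real SentenceTransformer, `encode()` would also handle batching and device placement, so the adapter keeps the custom stage decoupled from how the parent invokes the model.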
Love how easy this is. IMO we should add it to the codebase (and eventually our documentation) instead of a tutorial, since it is so straightforward.
Signed-off-by: Sarah Yurick <[email protected]>
3 files reviewed, no comments
Closing in favor of #1346, tysm!
Description
Speed up the Embedding Generator by using half a GPU (0.5 GPU per actor)
Sentence Transformers
Use SentenceTransformer inside our existing framework

Usage

    # Add snippet demonstrating usage

Checklist
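The half-GPU speedup described above (153 s vs. 213 s when two actors share each GPU, per the sequence diagram) can be sanity-checked with simple arithmetic; a small pure-Python sketch:

```python
def actors_per_gpu(gpus_per_actor: float) -> int:
    """How many actors fit on one GPU when each requests a fraction,
    e.g. Resources(gpus=0.5) lets 2 actors share a GPU."""
    return int(1 / gpus_per_actor)


def speedup_pct(baseline_s: float, optimized_s: float) -> float:
    """Percent reduction in wall-clock time relative to the baseline."""
    return 100 * (baseline_s - optimized_s) / baseline_s


print(actors_per_gpu(0.5))           # 2
print(round(speedup_pct(213, 153)))  # 28 -- the ~28% speedup reported
```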