Conversation

@praateekmahajan
Contributor

Description

Speed up the Embedding Generator by using half a GPU

  1. Assuming the loaded embedding model uses only a fraction of a GPU's resources, we can schedule more than one actor on the same GPU to get a speedup.
  2. Please note this is dependent on the model used and the GPU SKU.
  3. The tutorial shows how to modify the stage and also measures the time taken.
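The scheduling math behind point 1 can be sketched in a few lines. This is a back-of-the-envelope model of fractional-GPU packing, not the framework's actual scheduler:

```python
import math

def actors_per_gpu(gpu_fraction: float) -> int:
    """How many actors a scheduler can pack onto one GPU when each
    actor reserves only a fraction of the device (e.g. gpus=0.5)."""
    return math.floor(1 / gpu_fraction)

def total_actors(num_gpus: int, gpu_fraction: float) -> int:
    """Total concurrent actors across the whole node."""
    return num_gpus * actors_per_gpu(gpu_fraction)

print(actors_per_gpu(0.5))   # 2 actors share each GPU
print(total_actors(8, 0.5))  # 16 actors on an 8-GPU node
```

Whether doubling the actor count actually improves throughput depends on the model's memory footprint and compute utilization, which is why point 2 hedges on the model and GPU SKU.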

Sentence Transformers

  1. Until Support Sentence Transformer Models in Embedding Generation and possibly other places #1265 is resolved, this tutorial shows how to use SentenceTransformer inside our existing framework.

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Praateek <[email protected]>
@greptile-apps
Contributor

greptile-apps bot commented Nov 21, 2025

Greptile Overview

Greptile Summary

This PR adds three educational tutorials demonstrating advanced embedding generation techniques: GPU resource optimization through fractional GPU allocation (achieving 28% speedup), workaround implementation for SentenceTransformer models, and a comprehensive step-by-step semantic deduplication workflow.

Key Changes:

  • fast_embedding_generation.ipynb: Shows how to use Resources(gpus=0.5) to schedule multiple actors per GPU when models have low GPU utilization
  • implement_sentence_transformer.ipynb: Provides a temporary workaround for issue Support Sentence Transformer Models in Embedding Generation and possibly other places #1265 by extending EmbeddingModelStage to work with SentenceTransformer library
  • semantic_step_by_step.ipynb: Breaks down the semantic deduplication process into discrete stages (ID generation, embedding creation, K-means clustering, duplicate identification, and removal) for better understanding and control
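The duplicate-identification idea in the third tutorial can be illustrated with a toy NumPy sketch. The real pipeline first clusters with K-means and only compares pairs within clusters; the greedy helper below (`identify_duplicates` is an illustrative name, not a framework API) shows only the eps-on-cosine-distance criterion:

```python
import numpy as np

def identify_duplicates(embeddings: np.ndarray, eps: float = 0.1) -> list[int]:
    """Greedy duplicate detection: a row is a duplicate of an earlier kept
    row when their cosine distance (1 - cosine similarity) is <= eps."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    duplicates: list[int] = []
    for i, vec in enumerate(normed):
        if kept and np.max(normed[kept] @ vec) >= 1 - eps:
            duplicates.append(i)
        else:
            kept.append(i)
    return duplicates

emb = np.array([
    [1.0, 0.0],     # kept
    [0.999, 0.01],  # near-duplicate of row 0
    [0.0, 1.0],     # kept (orthogonal to row 0)
])
print(identify_duplicates(emb, eps=0.1))  # [1]
```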

Issues Found:

  • Previous review comments note potential API compatibility concerns in the SentenceTransformer implementation, though the notebook shows it working correctly with test data

Confidence Score: 4/5

Important Files Changed

File Analysis

File | Score | Overview
tutorials/text/embedding-generation/fast_embedding_generation.ipynb | 4/5 | Tutorial demonstrating GPU resource optimization by using 0.5 GPU per actor to speed up embedding generation, achieving a 28% speedup (213s to 153s)
tutorials/text/embedding-generation/implement_sentence_transformer.ipynb | 3/5 | Tutorial showing SentenceTransformer integration by extending EmbeddingModelStage; existing review comments note potential API compatibility issues with direct model calling
tutorials/text/embedding-generation/semantic_step_by_step.ipynb | 4/5 | Comprehensive semantic deduplication tutorial breaking the workflow into discrete steps (ID generation, embeddings, K-means, duplicate identification, removal), achieving a 27.54% data reduction
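As a quick consistency check on the figures reported above (28% speedup, 27.54% reduction), using only numbers from the notebooks:

```python
# Speedup: baseline 213s vs fractional-GPU run at 153s.
baseline_s, fractional_s = 213, 153
speedup_pct = (baseline_s - fractional_s) / baseline_s * 100
print(round(speedup_pct, 1))  # ~28.2, reported as "28% speedup"

# Reduction: 583,721 duplicates removed, 1,535,998 rows kept.
duplicates, kept = 583_721, 1_535_998
reduction_pct = duplicates / (duplicates + kept) * 100
print(round(reduction_pct, 2))  # 27.54
```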

Sequence Diagram

sequenceDiagram
    participant User
    participant Pipeline
    participant EmbeddingCreatorStage
    participant TokenizerStage
    participant ModelStage
    participant GPU

    Note over User,GPU: Tutorial 1: Fast Embedding Generation
    User->>Pipeline: Configure with Resources(gpus=0.5)
    User->>Pipeline: Run with RayDataExecutor
    Pipeline->>EmbeddingCreatorStage: Process documents
    EmbeddingCreatorStage->>GPU: Schedule 2 actors per GPU
    GPU-->>User: Complete in 153s vs 213s (28% speedup)

    Note over User,GPU: Tutorial 2: SentenceTransformer Integration
    User->>EmbeddingCreatorStage: Create custom stage
    EmbeddingCreatorStage->>TokenizerStage: Tokenize text
    TokenizerStage-->>EmbeddingCreatorStage: Return input_ids, attention_mask
    EmbeddingCreatorStage->>ModelStage: SentenceTransformerEmbeddingModelStage
    ModelStage->>ModelStage: setup() loads SentenceTransformer
    ModelStage->>GPU: Run inference with unpack_inference_batch=False
    GPU-->>ModelStage: Return outputs["sentence_embedding"]
    ModelStage-->>User: Generate embeddings

    Note over User,GPU: Tutorial 3: Semantic Deduplication Workflow
    User->>Pipeline: Step 1: Create ID Generator
    User->>Pipeline: Step 2: Generate Embeddings
    Pipeline->>EmbeddingCreatorStage: Process with SentenceTransformer
    EmbeddingCreatorStage-->>User: Write embeddings to cache
    User->>Pipeline: Step 3: K-Means + Pairwise Similarity
    Pipeline-->>User: Cluster and compute similarities
    User->>Pipeline: Step 4: Identify Duplicates (eps=0.1)
    Pipeline-->>User: Find 583,721 duplicates (27.54%)
    User->>Pipeline: Step 5: Remove Duplicates
    Pipeline-->>User: Output 1,535,998 deduplicated rows


@greptile-apps greptile-apps bot left a comment


2 files reviewed, 1 comment


    model_inference_batch_size: int = 1024,
    has_seq_order: bool = True,
    padding_side: Literal["left", "right"] = "right",
    autocast: bool = True,


syntax: typo in markdown cell: "composte" should be "composite"

Signed-off-by: Praateek <[email protected]>
@praateekmahajan
Copy link
Contributor Author

@sarahyurick / @ayushdg I'm okay with merging this or not; I needed this tutorial to show the customer how to do it if they have to.


@greptile-apps greptile-apps bot left a comment


2 files reviewed, 2 comments


Comment on lines +48 to +52

    hf_token=hf_token,
    model_inference_batch_size=model_inference_batch_size,
    has_seq_order=has_seq_order,
    padding_side=padding_side,
    autocast=autocast,


logic: SentenceTransformer doesn't return a dict with sentence_embedding key when called directly. The parent ModelStage._model_forward passes the batch as self.model(model_input_batch) (without unpacking), but SentenceTransformer doesn't accept input_ids/attention_mask directly. Need to override _model_forward to use the .encode() method instead.

Suggested change

Replace:

    hf_token=hf_token,
    model_inference_batch_size=model_inference_batch_size,
    has_seq_order=has_seq_order,
    padding_side=padding_side,
    autocast=autocast,

With:

    def _model_forward(self, model_input_batch: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        """Forward pass through SentenceTransformer model."""
        # SentenceTransformer.encode expects input_ids and attention_mask as dict
        embeddings = self.model.encode(
            model_input_batch,
            convert_to_tensor=True,
            show_progress_bar=False,
        )
        return {"sentence_embedding": embeddings}

    def process_model_output(
        self,
        outputs: dict[str, torch.Tensor],
        model_input_batch: dict[str, torch.Tensor] | None = None,  # noqa: ARG002
    ) -> torch.Tensor:
        return outputs["sentence_embedding"].cpu()
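The failure mode this review describes can be illustrated with stub classes. Everything below is a stand-in sketch of the pattern (not the real framework or SentenceTransformer API): the parent stage hands the packed batch dict straight to the model, so a subclass must reroute inference through `.encode()` and repack the output under the expected key:

```python
from typing import Any

class ModelStage:
    """Stand-in for the framework's parent stage: its forward pass
    passes the packed batch dict straight to the model."""

    def __init__(self, model: Any) -> None:
        self.model = model

    def _model_forward(self, model_input_batch: dict) -> dict:
        return self.model(model_input_batch)

class StubSentenceTransformer:
    """Stub exposing only an .encode() surface, as the review assumes."""

    def encode(self, batch: dict, convert_to_tensor: bool = True,
               show_progress_bar: bool = False) -> list[list[float]]:
        # Pretend every input row embeds to the same 2-dim vector.
        return [[0.0, 1.0] for _ in batch["input_ids"]]

class SentenceTransformerStage(ModelStage):
    """Override routes inference through .encode() and repacks the
    result under the key downstream stages expect."""

    def _model_forward(self, model_input_batch: dict) -> dict:
        embeddings = self.model.encode(model_input_batch)
        return {"sentence_embedding": embeddings}

stage = SentenceTransformerStage(StubSentenceTransformer())
batch = {"input_ids": [[1, 2], [3, 4]], "attention_mask": [[1, 1], [1, 1]]}
out = stage._model_forward(batch)
print(out["sentence_embedding"])  # [[0.0, 1.0], [0.0, 1.0]]
```

Calling the base class's `_model_forward` here would invoke the stub directly and raise a TypeError, which mirrors the bug being reported.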

Comment on lines +34 to +36

    class SentenceTransformerEmbeddingModelStage(EmbeddingModelStage):
        def __init__(  # noqa: PLR0913


style: setting unpack_inference_batch = False causes the parent to call self.model(model_input_batch) where model_input_batch is a dict. This relies on SentenceTransformer.__call__ accepting a dict, which is non-standard. The standard API is .encode() or .forward() with unpacked kwargs. Consider documenting this behavior or using the standard API.



Love how easy this is. IMO we should add it to the codebase (and eventually our documentation) instead of a tutorial, since it is so straightforward.


@greptile-apps greptile-apps bot left a comment


3 files reviewed, no comments


@sarahyurick
Contributor

Closing in favor of #1346, tysm!

@sarahyurick sarahyurick closed this Jan 7, 2026
