Add nucleotide tokenizer for llama3 model to recipes #1314
Open
savitha-eng wants to merge 6 commits into main from savitha/llama3-recipes-dataloader-add-tokenizer
Conversation
pstjohn reviewed Nov 12, 2025
Commits:
- …nd unit test (Signed-off-by: savitha-eng <[email protected]>)
- Signed-off-by: savitha-eng <[email protected]>
- …ts to the suite (Signed-off-by: savitha-eng <[email protected]>)
- Signed-off-by: savitha-eng <[email protected]>
Force-pushed from f9ee9f8 to e88bf5c
pstjohn approved these changes Nov 17, 2025
Collaborator: /ok to test bc7100f
…1318)

### Description

This PR adds a genomic dataset module for training Llama3 models on nucleotide sequences, following the ESM2 native TE pattern (much of the code is similar to the ESM2 dataset).

**Key features:**

- **Streaming Parquet datasets**: Efficient loading of large genomic datasets using the HuggingFace `datasets` library with `streaming=True` to avoid loading entire datasets into memory
- **Windowing/strided sampling**: Automatic creation of overlapping windows from long genomic sequences using the tokenizer's built-in `return_overflowing_tokens=True` parameter
- **Shuffle buffer**: Large shuffle buffer (500K samples by default) for better randomization during streaming
- **Distributed training support**: Built-in support for multi-GPU training with `split_dataset_by_node`
- **Causal LM collation**: Uses `DataCollatorForLanguageModeling` with `mlm=False` for next-token prediction

**Implementation details:**

- `create_tokenized_dataset()`: Loads Parquet data, handles dataset splits, applies tokenization with windowing
- `create_bshd_dataloader()`: Creates a PyTorch DataLoader with the appropriate sampler and collator
- Supports both streaming and non-streaming modes
- Supports both lazy and eager tokenization

**Testing:**

- 8 dataset tests covering windowing, streaming, lazy tokenization, and batch structure
- Mock data fixtures in `conftest.py` for CI/CD compatibility
- Note that the tests focus on the single-node behavior of the dataloader; distributed dataset tests following the ESM2 pattern are a TODO

#### Usage

```python
from dataset import create_bshd_dataloader
from distributed_config import DistributedConfig

# Configure distributed training (defaults to single GPU if env vars not set)
distributed_config = DistributedConfig()

# Create dataloader for genomic sequences
dataloader, sampler = create_bshd_dataloader(
    distributed_config=distributed_config,
    tokenizer_path="/path/to/nucleotide_fast_tokenizer",
    load_dataset_kwargs={
        "path": "parquet",
        "data_files": "/path/to/genomic_sequences.parquet",
        "split": "train",
        "streaming": True,  # Memory-efficient streaming
    },
    micro_batch_size=4,
    max_seq_length=8192,  # Window size
    stride=200,  # Overlap between windows (200 tokens)
    buffer_size=500_000,  # Shuffle buffer size
)

# Train
for batch in dataloader:
    # batch contains: input_ids, attention_mask, labels
    # labels = input_ids (for causal LM, DataCollator handles shifting)
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
```

**Key parameters:**

- `max_seq_length`: Window size for genomic sequences (e.g., 8192 for the full Llama3 context)
- `stride`: Overlap between consecutive windows in tokens (e.g., 200 for 200 bp overlap)
- `streaming`: Set to `True` for large datasets to avoid loading everything into memory
- `buffer_size`: Shuffle buffer size for streaming mode (larger = better randomization)

### Type of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
- [ciflow:skip](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:skip) - Skip all CI tests for this PR
- [ciflow:notebooks](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:notebooks) - Run Jupyter notebook execution tests for bionemo2
- [ciflow:slow](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:slow) - Run slow single-GPU integration tests marked as `@pytest.mark.slow` for bionemo2
- [ciflow:all](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all) - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running tests for all of bionemo2.
- [ciflow:all-recipes](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all-recipes) - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline. For more details, see [CONTRIBUTING](CONTRIBUTING.md)

> [!NOTE]
> By default, only basic unit tests are run. Add appropriate labels to enable additional test coverage.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

- If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
- If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.

### Pre-submit Checklist

- [x] I have tested these changes locally
- [x] I have updated the documentation accordingly
- [x] I have added/updated tests as needed
- [x] All existing tests pass successfully

---------

Signed-off-by: savitha-eng <[email protected]>
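As a hedged illustration of the windowing and collation mechanisms the description above refers to, the sketch below uses the standard HuggingFace APIs directly; the tokenizer path, window size, and sequence are placeholders rather than the recipe's actual configuration, and it assumes the tokenizer defines a PAD token as the description indicates.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Placeholder path; assumes the nucleotide tokenizer directory from this PR.
tokenizer = AutoTokenizer.from_pretrained("nucleotide_fast_tokenizer")

# A long nucleotide sequence is split into overlapping windows by the
# tokenizer itself via return_overflowing_tokens, as described above.
windows = tokenizer(
    "ACGT" * 5000,
    max_length=512,                  # window size (8192 in the recipe)
    stride=200,                      # overlap between consecutive windows
    truncation=True,
    return_overflowing_tokens=True,  # emit every window, not just the first
)

# Causal LM collation: with mlm=False the collator pads the batch and copies
# input_ids into labels (padding positions masked to -100); the model itself
# applies the one-token shift when computing the loss.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
batch = collator([{"input_ids": ids} for ids in windows["input_ids"]])
print(batch["input_ids"].shape, batch["labels"].shape)
```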
pull bot pushed a commit to mahdi-shafiei/bionemo-framework that referenced this pull request on Nov 17, 2025:

merges NVIDIA#1318 and NVIDIA#1314 to main to start the llama3 recipe, fixes a few pre-commit lints

Signed-off-by: Peter St. John <[email protected]>
Co-authored-by: savitha-eng <[email protected]>
Description
This PR adds a nucleotide tokenizer for the llama3 model, following HuggingFace's PreTrainedTokenizerFast pattern with a Rust backend for performance. It follows NeMo special token conventions (i.e., for EOS, BOS, PAD, etc.).
Key features:
- Tokenizer artifacts saved in a `nucleotide_fast_tokenizer/` directory
- A `create_tokenizer.py` script for reproducibility

This tokenizer will be used in genomic data loading pipelines for training llama3 models on nucleotide sequences.
Usage
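The original usage snippet did not survive extraction; below is a minimal sketch of loading the saved tokenizer through the standard HuggingFace API, assuming the `nucleotide_fast_tokenizer/` directory produced by `create_tokenizer.py` and an illustrative input sequence.

```python
from transformers import AutoTokenizer

# Hypothetical local path: the directory written by create_tokenizer.py.
tokenizer = AutoTokenizer.from_pretrained("nucleotide_fast_tokenizer")

# Nucleotide sequences tokenize to per-base IDs plus NeMo-style
# special tokens (BOS/EOS/PAD), per the description above.
encoded = tokenizer("ACGTACGTACGT")
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```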
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline. For more details, see [CONTRIBUTING](CONTRIBUTING.md).
Note
By default, only basic unit tests are run. Add appropriate labels to enable additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources.

- If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
- If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.

Pre-submit Checklist