@jomitchellnv jomitchellnv commented Nov 20, 2025

Add context parallelism (CP) to ESM2 by adding a CP-aware dataloader

  • Adds a unit test file, test_cp_dataloader, that checks the dataloader outputs.
  • Adds CPAwareDataloader to dataset.py, which brings in the CP functionality.
  • Adds train_ddp_cp.py, a training script that runs DDP and context parallelism (CP) together.
  • Updates the collator file (docs).
  • Adds CP docs to the readme.md (a separate .md file for CP will follow soon).
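
To illustrate what a CP-aware dataloader has to do, here is a minimal pure-Python sketch of one common sharding scheme: each sequence is cut into 2 * cp_world_size chunks, and rank r keeps chunk r plus chunk 2 * cp_world_size - 1 - r (the load-balanced split used in Megatron-style context parallelism). The helper name shard_for_cp_rank is hypothetical, and whether CPAwareDataloader uses this exact scheme is an assumption; this description does not say.

```python
def shard_for_cp_rank(tokens, cp_rank, cp_world_size):
    """Return the part of `tokens` that one context-parallel rank would keep.

    Hypothetical helper for illustration only; not the API added in this PR.
    Assumes len(tokens) is divisible by 2 * cp_world_size (pad beforehand).
    """
    num_chunks = 2 * cp_world_size
    if len(tokens) % num_chunks != 0:
        raise ValueError("sequence length must be divisible by 2 * cp_world_size")
    chunk = len(tokens) // num_chunks
    # Pair an early chunk with its mirror-image late chunk.
    first = cp_rank
    second = num_chunks - 1 - cp_rank
    return (tokens[first * chunk:(first + 1) * chunk]
            + tokens[second * chunk:(second + 1) * chunk])


if __name__ == "__main__":
    seq = list(range(8))  # stand-in for token ids
    for rank in range(2):
        print(rank, shard_for_cp_rank(seq, rank, cp_world_size=2))
```

Taken together, the shards from all CP ranks cover every token exactly once, which is the kind of property a dataloader output test can check.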

Usage
train_dataloader, dataset_or_sampler = create_cp_dataloader(
    dist_config,
    cp_world_size=torch.distributed.get_world_size(group=cp_group),
    cp_group=cp_group,
    cp_rank=cp_rank,
    **args.dataset,
)
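
The usage line takes cp_group and cp_rank as given. As a rough sketch of where they could come from, the snippet below maps global ranks to a (DDP rank, CP rank) pair, assuming the CP ranks of one data-parallel replica are contiguous. The helper names and the layout are assumptions for illustration; train_ddp_cp.py may arrange ranks differently. In a real script, each rank list would be turned into a process group, for example with torch.distributed.new_group.

```python
def rank_layout(world_size, cp_world_size):
    """Map each global rank to (ddp_rank, cp_rank).

    Hypothetical helper; assumes contiguous CP ranks within each replica.
    """
    if world_size % cp_world_size != 0:
        raise ValueError("world_size must be divisible by cp_world_size")
    return {
        rank: (rank // cp_world_size, rank % cp_world_size)
        for rank in range(world_size)
    }


def cp_group_ranks(world_size, cp_world_size):
    """List the global ranks that would form each CP group (hypothetical)."""
    if world_size % cp_world_size != 0:
        raise ValueError("world_size must be divisible by cp_world_size")
    return [
        list(range(start, start + cp_world_size))
        for start in range(0, world_size, cp_world_size)
    ]


if __name__ == "__main__":
    # 4 GPUs with cp_world_size=2: two data-parallel replicas, each split over 2 CP ranks.
    print(rank_layout(4, 2))
    print(cp_group_ranks(4, 2))
```

Under this layout, ranks in the same CP group see different slices of the same batch, while ranks with the same cp_rank in different groups see different batches.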

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebooks execution tests for bionemo2
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running tests for all bionemo2.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

copy-pr-bot bot commented Nov 20, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jomitchellnv jomitchellnv force-pushed the jm/context-parallel-esm2-recipe branch from e1b0489 to ce56fce Compare November 20, 2025 21:02
@jomitchellnv jomitchellnv force-pushed the jm/context-parallel-esm2-recipe branch from 75e4023 to e0038dd Compare November 24, 2025 18:39
- removed models dir
- rebased 11/24/2025

Signed-off-by: Jonathan Mitchell <[email protected]>
@jomitchellnv jomitchellnv force-pushed the jm/context-parallel-esm2-recipe branch from e0038dd to cc4824b Compare November 24, 2025 19:20
jomitchellnv and others added 3 commits November 24, 2025 11:33
Signed-off-by: Jonathan Mitchell <[email protected]>
Signed-off-by: Jonathan Mitchell <[email protected]>
Signed-off-by: Jonathan Mitchell <[email protected]>
Jonathan Mitchell and others added 3 commits November 24, 2025 14:10
Signed-off-by: Jonathan Mitchell <[email protected]>
Signed-off-by: Jonathan Mitchell <[email protected]>
Signed-off-by: Jonathan Mitchell <[email protected]>
@jomitchellnv (Collaborator, Author) commented

/ok to test 4f26cd8

@jomitchellnv jomitchellnv added this pull request to the merge queue Nov 24, 2025
Merged via the queue into main with commit e005649 Nov 24, 2025
19 checks passed
@jomitchellnv jomitchellnv deleted the jm/context-parallel-esm2-recipe branch November 24, 2025 22:59