
Conversation

@jomitchellnv jomitchellnv commented Nov 13, 2025

Description

Adds context parallelism (CP) to ESM2 by introducing a CP-aware dataloader, along with several edits to the test model file.

Usage

    train_dataloader, dataset_or_sampler = create_cp_dataloader(
        dist_config,
        cp_world_size=torch.distributed.get_world_size(group=cp_group),
        cp_group=cp_group,
        cp_rank=cp_rank,
        **args.dataset,
    )

Several other changes are also needed to run this end to end; see the diff for details.
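For orientation, here is a hedged sketch of how the cp_group and cp_rank arguments above might be obtained, assuming a torchrun launch with torch.distributed already initialized and a ("dp", "cp") device mesh; the mesh shape and dimension names are illustrative, not part of this PR:

    import torch
    from torch.distributed.device_mesh import init_device_mesh

    # Illustrative 2D mesh: a data-parallel axis and a context-parallel axis.
    cp_size = 2
    world_size = torch.distributed.get_world_size()
    mesh = init_device_mesh("cuda", (world_size // cp_size, cp_size), mesh_dim_names=("dp", "cp"))

    cp_group = mesh.get_group("cp")          # process group for this rank's CP slice
    cp_rank = mesh.get_local_rank("cp")      # this rank's position within the CP group

These are the values that get passed into create_cp_dataloader in the usage snippet above.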

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebooks execution tests for bionemo2
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running tests for all bionemo2.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

Jonathan Mitchell added 6 commits November 12, 2025 11:59
copy-pr-bot bot commented Nov 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Collaborator

No files named utils.py 😆 !!

But more seriously, this looks like a mix of test-data utilities (those should go in the tests folder) and actual usable code.

Collaborator Author

I can put the get_batch_on_this_cp_rank function inside dataset.py then? Since it's going to take ~2 months to get it from TE after I push it: NVIDIA/TransformerEngine#2387
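For reference, a rough sketch of the kind of per-sequence chunking such a helper performs for THD inputs, assuming the usual load-balanced split into 2 * cp_size chunks where rank r keeps chunks r and (2 * cp_size - 1 - r); the function name and arguments here are illustrative, not the TE implementation:

    import torch

    def shard_thd_for_cp_rank(input_ids, labels, cu_seqlens_padded, cp_size, cp_rank):
        """Keep the two load-balanced chunks of each padded sequence that belong to cp_rank."""
        ids_out, labels_out = [], []
        starts = cu_seqlens_padded[:-1].tolist()
        ends = cu_seqlens_padded[1:].tolist()
        for start, end in zip(starts, ends):
            # Assumes each padded sequence length is divisible by 2 * cp_size.
            chunk = (end - start) // (2 * cp_size)
            for c in sorted((cp_rank, 2 * cp_size - 1 - cp_rank)):
                ids_out.append(input_ids[start + c * chunk : start + (c + 1) * chunk])
                labels_out.append(labels[start + c * chunk : start + (c + 1) * chunk])
        return torch.cat(ids_out), torch.cat(labels_out)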

Jonathan Mitchell added 3 commits November 14, 2025 11:30
@jomitchellnv jomitchellnv changed the title from "[DRAFT] Jm/context parallel esm2" to "Adds THD + CP for ESM2" on Nov 14, 2025
Jonathan Mitchell added 2 commits November 14, 2025 13:39
"""A dataloader that is aware of context parallelism."""
def __init__(self, dataloader: StatefulDataLoader,
cp_group: torch.distributed.ProcessGroup,
cp_rank: int,
Collaborator

this you could probably get from torch.distributed right? rather than asking for it here?

Collaborator Author
@jomitchellnv jomitchellnv Nov 14, 2025

cp_rank comes from the device_mesh which isn't available here

Comment on lines 279 to 294
            combined_batch = []
            for cp_rank in range(self.num_cp_ranks):
                input_ids_sharded, labels_sharded = get_batch_on_this_cp_rank(
                    cu_seqlens_padded=batch["cu_seq_lens_q_padded"],
                    input_ids_padded=batch["input_ids"],
                    labels_padded=batch["labels"],
                    cp_group=self.cp_group,
                    qvk_format="thd",
                    cp_rank=cp_rank,
                )
                batch_shard = dict(batch)
                batch_shard["input_ids"] = input_ids_sharded
                batch_shard["labels"] = labels_sharded
                combined_batch.append(batch_shard)
        else:
            combined_batch = None
Collaborator

We wanted to do this as a dataset.map call, right? Otherwise this won't be done as part of the dataloader's prefetch.

Collaborator Author

Uh yea -- also I didn't use a generator.

Collaborator Author

This isn't a real dataloader though; it's a wrapper class. In order to use map, wouldn't that need to be a legit dataloader?
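For the record, a hedged sketch of what the map-based variant could look like if the sharding moved into the dataset pipeline (assumes a Hugging Face datasets object and reuses the sharding helper discussed above; the key names follow the collator, everything else is illustrative):

    def add_cp_shards(example, cp_size):
        # Precompute one shard per CP rank inside the dataset pipeline, so the
        # work happens during the dataloader's prefetch instead of in the wrapper.
        example["cp_shards"] = [
            get_batch_on_this_cp_rank(
                cu_seqlens_padded=example["cu_seq_lens_q_padded"],
                input_ids_padded=example["input_ids"],
                labels_padded=example["labels"],
                cp_group=None,  # assumed unused for pure slicing
                qvk_format="thd",
                cp_rank=rank,
            )
            for rank in range(cp_size)
        ]
        return example

    dataset = dataset.map(add_cp_shards, fn_kwargs={"cp_size": cp_world_size})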

Comment on lines 227 to 239
        if self.config.use_cp:
            hidden_states = layer_module(
                hidden_states,
                attention_mask,
                rotary_pos_emb=te_rope_emb,
                cu_seqlens_q=kwargs.get("cu_seq_lens_q", None),
                cu_seqlens_kv=kwargs.get("cu_seq_lens_k", None),
                cu_seqlens_q_padded=kwargs.get("cu_seq_lens_q_padded", None),
                cu_seqlens_kv_padded=kwargs.get("cu_seq_lens_k_padded", None),
                pad_between_seqs=kwargs.get("pad_between_seqs", None),
                max_seqlen_q=kwargs.get("max_length_q", None),
                max_seqlen_kv=kwargs.get("max_length_k", None),
            )
Collaborator

why does this have to be a separate block? we can just pass those None values through in the non-CP case, right?

Collaborator Author

done

Comment on lines 611 to 615

    n_masked_per_seq = torch.nested.nested_tensor_from_jagged(
        is_masked, offsets=kwargs["cu_seq_lens_q"]
    ).sum(1)
    mask_ratio_observed = n_masked_per_seq.float() / src_lengths
Collaborator

revert

Collaborator Author

done

Comment on lines 557 to 572
    def test_sanity_convergence_ddp_cp(tmp_path, recipe_path):
        """Test that the main function can be invoked wrapping the model in DDP."""

        # Run the training script with Hydra configuration overrides
        with initialize_config_dir(config_dir=str(recipe_path / "hydra_config"), version_base="1.2"):
            sanity_config = compose(
                config_name="L0_sanity_cp",
                overrides=[
                    f"+wandb_init_args.dir={tmp_path}",
                    f"checkpoint.ckpt_dir={tmp_path}",
                    f"cp_size=2",
                ],
            )

        final_loss = main_ddp(sanity_config)
        assert final_loss < 3.0, f"Final loss {final_loss} is too high"
Collaborator

wait this doesn't make any sense -- wouldn't we need two GPUs for this convergence test to work? shouldn't this just hang on a single device?

Collaborator Author

Sorry, this was WIP; I haven't written this part yet.

Collaborator Author

All I did was change the function name, lawl.

        batch['pad_between_seqs'] = True
        return batch

    def _get_data_scatter_sharded(self):
Collaborator

maybe like, send_data_to_cp_ranks

Collaborator Author

done

Jonathan Mitchell added 3 commits November 14, 2025 14:15
    buffer_size: int = 10_000,
    use_stateful_dataloader: bool = False,
    mlm_probability: float = 0.15,
    pad_sequences_to_be_divisible_by: int | None = None,
Collaborator

since you know the cp_world_size, can't we initialize this to the correct value for folks?

Collaborator Author

We could, but if you also want to do FP8 + CP then this would need to be higher, right? With CP=2 the divisibility factor is 4, but you'd need a divisibility factor of 16 for MXFP8, right? I can set it, but also make it toggleable.

Collaborator Author

It's currently just set from the config
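A hedged sketch of how the default could combine both constraints mentioned above (the 2 * cp_world_size requirement and an assumed 16-token alignment for MXFP8; the use_mxfp8 flag is hypothetical, not an existing config option):

    import math

    if pad_sequences_to_be_divisible_by is None:
        cp_factor = 2 * cp_world_size
        fp8_factor = 16 if use_mxfp8 else 1  # 'use_mxfp8' is a hypothetical config flag
        pad_sequences_to_be_divisible_by = math.lcm(cp_factor, fp8_factor)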

            group=self.cp_group,
            group_src=0,
        )
        torch.distributed.barrier(group=self.cp_group)  # TODO(@jomitchell): Might not need this since it's sync.
Collaborator

I also don't think this is the right call for an async op; I'd remove it.

Collaborator Author

removed



class CPAwareDataloader:
    """A dataloader that is aware of context parallelism."""
Collaborator

Again, just a quick summary of the main steps here: this class handles synchronizing a single dataloader across multiple CP ranks. It materializes a dataloader instance on CP rank 0, which is responsible for splitting its inputs into sub-batches for each CP rank. It then uses torch.distributed.scatter to send the data to all CP ranks.

Collaborator Author

added
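For readers following along, a stripped-down sketch of that pattern (rank 0 owns the real dataloader, shards each batch, and scatters one shard to each CP rank; shard_for_cp_rank is a placeholder for the real THD splitting helper, and this is not the class as merged):

    import torch

    class CPAwareDataloaderSketch:
        """Illustrative only: synchronize one dataloader across a CP group."""

        def __init__(self, dataloader, cp_group, cp_rank, cp_world_size):
            self.dataloader = dataloader  # only consumed on CP rank 0
            self.cp_group = cp_group
            self.cp_rank = cp_rank
            self.cp_world_size = cp_world_size

        def __iter__(self):
            data_iter = iter(self.dataloader) if self.cp_rank == 0 else None
            src = torch.distributed.get_global_rank(self.cp_group, 0)
            while True:
                if self.cp_rank == 0:
                    batch = next(data_iter, None)
                    if batch is None:
                        shards = [None] * self.cp_world_size  # signal exhaustion to every rank
                    else:
                        shards = [shard_for_cp_rank(batch, r, self.cp_world_size) for r in range(self.cp_world_size)]
                else:
                    shards = None
                out = [None]
                torch.distributed.scatter_object_list(out, shards, src=src, group=self.cp_group)
                if out[0] is None:
                    break
                yield out[0]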

Comment on lines 228 to 237
            hidden_states = layer_module(
                hidden_states,
                attention_mask,
                rotary_pos_emb=te_rope_emb,
                cu_seqlens_q=kwargs.get("cu_seq_lens_q", None),
                cu_seqlens_kv=kwargs.get("cu_seq_lens_k", None),
                cu_seqlens_q_padded=kwargs.get("cu_seq_lens_q_padded", None),
                cu_seqlens_kv_padded=kwargs.get("cu_seq_lens_k_padded", None),
                pad_between_seqs=kwargs.get("pad_between_seqs", None),
                # TODO(@jomitchell): Add `max_seqlen_q` and `max_seqlen_kv` by finding the largest padded sequence length: torch.diff(cu_seqlens_q_padded).max().item()
Collaborator

does this not also work for non-cp cases?

Comment on lines 218 to 220
            te_rope_emb = self.rotary_embeddings(max_seq_len=kwargs["cu_seq_lens_q_padded"][-1])
        else:
            te_rope_emb = self.rotary_embeddings(max_seq_len=kwargs["cu_seq_lens_q"][-1])
Collaborator

I'd just check for cu_seq_lens_q_padded and use cu_seq_lens_q if it's not there.

Collaborator Author

added

    micro_batch_size: Optional[int] = None,
    max_seq_length: Optional[int] = None,
    padded_vocab_size: Optional[int] = 64,
    use_cp: bool = False,
Collaborator

I'm not sure you need this.

Collaborator Author

removed

Collaborator

context_parallel.py

    else:
        assert micro_batch_size is None, "Only one of micro_batch_size or token_micro_batch_size can be provided."
        assert token_micro_batch_size >= max_seq_length, "token_micro_batch_size must be at least max_seq_length."

Collaborator

Suggested change
    # For context parallelism, we need each sequence...
    if pad_sequences_to_be_divisible_by is None:
        pad_sequences_to_be_divisible_by = 2 * cp_world_size

batch["labels"] = labels_padded.unsqueeze(0)
batch["cu_seq_lens_q_padded"] = cu_seqlens_padded.to(torch.int32)
batch["cu_seq_lens_k_padded"] = cu_seqlens_padded.to(torch.int32)

Collaborator

pop the max_seq_lens stuff here, rather than in model.forward()
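i.e., something along these lines in the collator (a small sketch following the TODO in the model; the key names mirror the ones the layers already read):

    # Compute the max (padded) sequence lengths once in the collator so that
    # model.forward() can read them instead of deriving them per step.
    batch["max_length_q"] = int(torch.diff(cu_seqlens_padded).max().item())
    batch["max_length_k"] = batch["max_length_q"]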

    if self.config.use_cp:
        te_rope_emb = self.rotary_embeddings(max_seq_len=kwargs["cu_seq_lens_q_padded"][-1])
    else:
        te_rope_emb = self.rotary_embeddings(max_seq_len=kwargs["cu_seq_lens_q"][-1])
Collaborator

max_seq_len = kwargs["cu_seq_lens_q_padded"][-1] if "cu_seq_lens_q_padded" in kwargs else ...

Collaborator Author

added
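For completeness, a minimal version of that fallback (assumes the padded key is simply absent, rather than present-but-None, when CP is off):

    max_seq_len = kwargs.get("cu_seq_lens_q_padded", kwargs["cu_seq_lens_q"])[-1]
    te_rope_emb = self.rotary_embeddings(max_seq_len=max_seq_len)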

)

# Create an empty ESM-2 model with a masked language model head, e.g. "nvidia/esm2_t6_8M_UR50D".
config = AutoConfig.from_pretrained(args.model_tag, trust_remote_code=True, token_dropout=False, use_cp=True, dtype=torch.bfloat16)
Collaborator

Suggested change
config = AutoConfig.from_pretrained(args.model_tag, trust_remote_code=True, token_dropout=False, use_cp=True, dtype=torch.bfloat16)
config = AutoConfig.from_pretrained(
    args.model_tag,
    trust_remote_code=True,
    token_dropout=False,  # Token dropout isn't supported with CP, since it requires reduction over the entire sequence.
    dtype=torch.bfloat16,
)


pstjohn commented Nov 17, 2025

for models/esm2:

  • update collator, add test that the _padded keys are set correctly

test_thd.py:

def test_thd_vs_padded_thd_equivalence(input_data_thd):
    input_data_padded_thd = ...

    outputs_thd = model(**input_data_thd)
    outputs_padded_thd = model(**input_data_padded_thd)
    
    torch.testing.assert_close(...)

test_cp.py:

@requires_multi_gpu
def test_grads_are_equal():
    cmd = "torchrun --standalone --nproc-per-node 2 {__file__} { ?? }"

...

if __name__ == "__main__":

    # argparse?

    data = "some mock input protein sequence"

    # run the model with no CP, get gradients
    model()

    # run the model with CP = 2, get gradients
    # set the model layers to use cp
    # use the collator, copy the scatter code over.

    # compare gradients and logits where you have them 
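A hedged sketch of how the torchrun wrapper above could be filled in (the marker name comes from the CI notes earlier; the timeout and invocation details are illustrative):

    import subprocess

    import pytest

    @pytest.mark.multi_gpu
    def test_grads_are_equal():
        # Re-run this file under torchrun with 2 processes and fail if the
        # distributed comparison script exits non-zero.
        cmd = f"torchrun --standalone --nproc-per-node 2 {__file__}"
        result = subprocess.run(cmd.split(), capture_output=True, text=True, timeout=600)
        assert result.returncode == 0, result.stderr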

Jonathan Mitchell and others added 18 commits November 17, 2025 14:07