
Commit 1de5e9d
1 parent: f7fdd1e

starting work for swqa

Signed-off-by: Peter St. John <[email protected]>

updated multistage dockerfile
Signed-off-by: Peter St. John <[email protected]>

revert to single stage image
Signed-off-by: Peter St. John <[email protected]>

add flash-attn
Signed-off-by: Peter St. John <[email protected]>

update readme with local dataset download
Signed-off-by: Peter St. John <[email protected]>

update readme with local dataset download
Signed-off-by: Peter St. John <[email protected]>

update readme with local dataset download
Signed-off-by: Peter St. John <[email protected]>

File tree: 10 files changed (+129 −5 lines)

.devcontainer/recipes/devcontainer.json

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@
       "ms-toolsai.jupyter",
       "eamodio.gitlens",
       "tamasfe.even-better-toml",
-      "streetsidesoftware.code-spell-checker"
+      "streetsidesoftware.code-spell-checker", "ms-azuretools.vscode-docker",
     ]
   }
 }

bionemo-recipes/recipes/esm2_accelerate_te/.dockerignore

Lines changed: 1 addition & 0 deletions
@@ -7,3 +7,4 @@ __pycache__
 .pytest_cache
 .ruff.toml
 .dockerignore
+.venv/

bionemo-recipes/recipes/esm2_native_te/.dockerignore

Lines changed: 1 addition & 0 deletions
@@ -7,3 +7,4 @@ __pycache__
 .pytest_cache
 .ruff.toml
 .dockerignore
+.venv/

bionemo-recipes/recipes/esm2_native_te/Dockerfile

Lines changed: 1 addition & 2 deletions
@@ -1,8 +1,7 @@
 # syntax=docker/dockerfile:1.4
 FROM nvcr.io/nvidia/pytorch:25.10-py3
 
-RUN --mount=type=secret,id=netrc,target=/root/.netrc \
-    --mount=type=cache,target=/root/.cache/pip \
+RUN --mount=type=cache,target=/root/.cache/pip \
     --mount=type=bind,source=requirements.txt,target=/requirements.txt \
     PIP_CONSTRAINT= pip install -r /requirements.txt
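With the `.netrc` secret mount gone, the image now builds without any build-time secrets; the plain build command documented in the README change below is all that is needed:

```bash
docker build -t esm2_native_te .
```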

bionemo-recipes/recipes/esm2_native_te/Dockerfile.cuda

Lines changed: 31 additions & 0 deletions

@@ -0,0 +1,31 @@
+# An example, minimal Dockerfile to install dependencies in a fresh python environment with CUDA support. This image
+# ends up with two copies of CUDA libraries; the first is the one installed by the base image, and the second is brought
+# in when we pip install torch.
+
+FROM nvcr.io/nvidia/cuda:13.0.2-cudnn-devel-ubuntu24.04
+
+ENV UV_LINK_MODE=copy
+SHELL ["/bin/bash", "-c"]
+
+# Install torch, transformer-engine, and flash-attn
+RUN --mount=type=cache,target=/root/.cache/uv \
+    --mount=type=cache,target=/root/.cache/pip \
+    --mount=from=ghcr.io/astral-sh/uv,source=/uv,target=/bin/uv \
+    <<EOF
+uv venv --python 3.12 --seed /workspace/.venv
+source /workspace/.venv/bin/activate
+uv pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cu130
+uv pip install wheel packaging psutil
+pip install --no-build-isolation "flash-attn>=2.1.1,<=2.8.1"
+pip install --no-build-isolation transformer-engine[pytorch]==2.9.0
+EOF
+
+# Install recipe-specific dependencies
+RUN --mount=type=cache,target=/root/.cache/uv \
+    --mount=type=cache,target=/root/.cache/pip \
+    --mount=type=bind,source=requirements.txt,target=/requirements.txt \
+    --mount=from=ghcr.io/astral-sh/uv,source=/uv,target=/bin/uv \
+    source /workspace/.venv/bin/activate && uv pip install -r /requirements.txt
+
+ENV PATH="/workspace/.venv/bin:$PATH"
+WORKDIR /workspace/bionemo
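As a quick sanity check that the image assembled by this Dockerfile exposes the intended stack, the following hypothetical one-liner (not part of the commit) imports the key packages after building with the command the README adds below:

```bash
docker build -t esm2_native_te_cuda -f Dockerfile.cuda .
docker run --rm --gpus all esm2_native_te_cuda \
  python -c "import torch, flash_attn, transformer_engine.pytorch; print(torch.__version__, torch.cuda.is_available(), flash_attn.__version__)"
```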

bionemo-recipes/recipes/esm2_native_te/README.md

Lines changed: 74 additions & 1 deletion
@@ -27,11 +27,50 @@ bionemo-framework repository. You can download a zipped directory of this folder
 \[1\]: Requires [compute capability](https://developer.nvidia.com/cuda-gpus) 9.0 and above (Hopper+) <br/>
 \[2\]: Requires [compute capability](https://developer.nvidia.com/cuda-gpus) 10.0 and 10.3 (Blackwell), 12.0 support pending <br/>
 
+### Installing Dependencies
+
+The easiest way to get started with this recipe is to use the provided Dockerfile, which is based on the latest NVIDIA
+PyTorch image and provides optimized versions of PyTorch and TransformerEngine. To build the container, run:
+
+```bash
+docker build -t esm2_native_te .
+```
+
+To run the container, run:
+
+```bash
+docker run -it --gpus all --network host --ipc=host --rm -v ${PWD}:/workspace/bionemo esm2_native_te /bin/bash
+```
+
+Alternatively, the dependencies can be installed manually in an environment with CUDA support. See
+[Dockerfile.cuda](Dockerfile.cuda) for the process of installing dependencies in a fresh Python environment (e.g., for
+CUDA 13.0):
+
+```bash
+uv venv --python 3.12 --seed /workspace/.venv
+source /workspace/.venv/bin/activate
+uv pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cu130
+uv pip install wheel packaging psutil
+pip install --no-build-isolation "flash-attn>=2.1.1,<=2.8.1"
+pip install --no-build-isolation transformer-engine[pytorch]==2.9.0
+uv pip install -r requirements.txt
+```
+
+To build and run the CUDA base container, run:
+
+```bash
+docker build -t esm2_native_te_cuda -f Dockerfile.cuda .
+docker run -it --gpus all --network host --ipc=host --rm -v ${PWD}:/workspace/bionemo esm2_native_te_cuda /bin/bash -c "pytest -v ."
+```
+
 ### Performance Benchmarks
 
 ![Performance Benchmarks](../../../docs/docs/assets/images/esm2/esm2_native_te_benchmarks.svg)
 
-Note: "compiled" refers to `torch.compile`. "fa2" is [FlashAttention2](https://github.com/Dao-AILab/flash-attention). Recently, we measured 2800 tokens/second/GPU training speed on H100 with HuggingFace Transformers's ESM-2 implementation of THD sequence packing, however we have not been able to make this configuration work on Blackwell and this work is still in progress.
+Note: "compiled" refers to `torch.compile`. "fa2" is [FlashAttention2](https://github.com/Dao-AILab/flash-attention).
+Recently, we measured a training speed of 2,800 tokens/second/GPU on H100 with Hugging Face Transformers' ESM-2
+implementation of THD sequence packing; however, we have not yet been able to make this configuration work on Blackwell,
+and this work is still in progress.
 
 ### Distributed Training
 
@@ -97,6 +136,40 @@ model tag:
 python train_fsdp2.py --config-name L0_sanity model_tag=facebook/esm2_t6_8M_UR50D
 ```
 
+## Downloading Pre-Training Data For Offline Training
+
+An example pre-training dataset for ESM-2 is available in the
+[`nvidia/esm2_uniref_pretraining_data`](https://huggingface.co/datasets/nvidia/esm2_uniref_pretraining_data) Hugging
+Face dataset. This dataset can be [streamed](https://huggingface.co/docs/datasets/en/stream) from the Hugging Face Hub via:
+
+```python
+>>> from datasets import load_dataset
+>>> dataset = load_dataset('nvidia/esm2_uniref_pretraining_data', split='train', streaming=True)
+>>> print(next(iter(dataset)))
+{'sequence': 'MSPRRTGGARPPGPCTPCGPRPRCPSRRSAAARPAPSAAPARRARPGRRPGCRPGTDCPGTARRPGGGP...',
+ 'ur50_id': 'UniRef50_A0A081XN86',
+ 'ur90_id': 'UniRef90_UPI002FBE17D9'}
+```
+
+For large-scale training, the dataset should be downloaded locally via the [huggingface
+CLI](https://huggingface.co/docs/huggingface_hub/guides/download#download-from-the-cli), with appropriate values set for
+the `HF_HOME` and `HF_TOKEN` environment variables. If the CLI is not already installed, install it with
+`uv tool install huggingface_hub`.
+
+```bash
+export HF_TOKEN=<your_huggingface_token>
+hf download nvidia/esm2_uniref_pretraining_data --repo-type dataset --local-dir /path/to/download/directory
+# Test to ensure the dataset can be loaded correctly
+python -c "import datasets; datasets.load_dataset('/path/to/download/directory', split='train', streaming=True)"
+```
+
+Pass the downloaded dataset directory to the training script via the `dataset.load_dataset_kwargs.path` configuration parameter:
+
+```bash
+HF_DATASETS_OFFLINE=1 python train_fsdp2.py --config-name L0_sanity \
+  dataset.load_dataset_kwargs.path=/path/to/download/directory
+```
+
 ## Saving and Loading Checkpoints
 
 To enable checkpoint saving, ensure that `checkpoint.ckpt_dir` is set to a writable directory. Checkpointing frequency is

bionemo-recipes/recipes/esm2_native_te/hydra_config/defaults.yaml

Lines changed: 1 addition & 1 deletion
@@ -69,7 +69,7 @@ checkpoint:
   ckpt_dir: ???
   save_final_model: true
   resume_from_checkpoint: true
-  save_every_n_steps: 50
+  save_every_n_steps: 1_000
 
 logger:
   frequency: 100
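The default checkpoint interval moves from every 50 steps to every 1,000 steps. For short experiments it can still be tightened per run using the same Hydra override syntax used elsewhere in this recipe (the value below is illustrative):

```bash
python train_fsdp2.py --config-name L0_sanity checkpoint.save_every_n_steps=50
```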

bionemo-recipes/recipes/esm2_native_te/requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 datasets
 hydra-core
 megatron-fsdp
+pytest
 torch
 torchao!=0.14.0
 torchdata
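With `pytest` now part of the pinned requirements, the test suite can be run inside the container; the README's CUDA-image smoke test added in this commit does exactly that:

```bash
docker run -it --gpus all --network host --ipc=host --rm -v ${PWD}:/workspace/bionemo esm2_native_te_cuda /bin/bash -c "pytest -v ."
```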

bionemo-recipes/recipes/esm2_native_te/tests/test_distributed_checkpointing.py

Lines changed: 17 additions & 0 deletions
@@ -73,6 +73,7 @@ def test_checkpoint_save_and_load_single_process_ddp(recipe_path, tmp_path):
         overrides=[
             f"checkpoint.ckpt_dir={temp_dir}",
             f"+wandb_init_args.dir={tmp_path}",
+            f"hydra.run.dir={tmp_path}",
             "num_train_steps=10",
             "checkpoint.save_every_n_steps=5",
             "checkpoint.resume_from_checkpoint=false",  # Start fresh
@@ -121,6 +122,7 @@ def test_checkpoint_save_and_load_single_process_ddp(recipe_path, tmp_path):
         overrides=[
             f"checkpoint.ckpt_dir={temp_dir}",
             f"+wandb_init_args.dir={tmp_path}",
+            f"hydra.run.dir={tmp_path}",
             "num_train_steps=15",
             "checkpoint.save_every_n_steps=5",
             "checkpoint.resume_from_checkpoint=true",  # Resume from checkpoint
@@ -205,6 +207,7 @@ def test_checkpoint_save_and_load_two_processes_ddp(recipe_path, tmp_path):
         "checkpoint.save_every_n_steps=5",
         "checkpoint.resume_from_checkpoint=false",  # Start fresh
         "dataset.use_stateful_dataloader=true",
+        f"hydra.run.dir={tmp_path}",
     ]
 
     result1 = subprocess.run(cmd_phase1, check=False, capture_output=True, text=True, env=env)
@@ -268,6 +271,7 @@
         "checkpoint.save_every_n_steps=5",
         "checkpoint.resume_from_checkpoint=true",  # Resume from checkpoint
         "dataset.use_stateful_dataloader=true",
+        f"hydra.run.dir={tmp_path}",
     ]
 
     result2 = subprocess.run(cmd_phase2, check=False, capture_output=True, text=True, env=env)
@@ -346,6 +350,7 @@ def test_checkpoint_save_and_load_single_process_mfsdp(recipe_path, tmp_path):
         overrides=[
             f"checkpoint.ckpt_dir={temp_dir}",
             f"+wandb_init_args.dir={tmp_path}",
+            f"hydra.run.dir={tmp_path}",
             "num_train_steps=10",
             "checkpoint.save_every_n_steps=5",
             "checkpoint.resume_from_checkpoint=false",  # Start fresh
@@ -390,6 +395,7 @@
         overrides=[
             f"checkpoint.ckpt_dir={temp_dir}",
             f"+wandb_init_args.dir={tmp_path}",
+            f"hydra.run.dir={tmp_path}",
             "num_train_steps=15",
             "checkpoint.save_every_n_steps=5",
             "checkpoint.resume_from_checkpoint=true",  # Resume from checkpoint
@@ -457,6 +463,7 @@ def test_checkpoint_save_and_load_two_processes_mfsdp(recipe_path, tmp_path):
         "checkpoint.save_every_n_steps=5",
         "checkpoint.resume_from_checkpoint=false",  # Start fresh
         "dataset.use_stateful_dataloader=true",
+        f"hydra.run.dir={tmp_path}",
     ]
 
     result1 = subprocess.run(cmd_phase1, check=False, capture_output=True, text=True, env=env)
@@ -503,6 +510,7 @@
         "checkpoint.save_every_n_steps=5",
         "checkpoint.resume_from_checkpoint=true",  # Resume from checkpoint
         "dataset.use_stateful_dataloader=true",
+        f"hydra.run.dir={tmp_path}",
     ]
 
     result2 = subprocess.run(cmd_phase2, check=False, capture_output=True, text=True, env=env)
@@ -559,6 +567,7 @@ def test_checkpoint_save_and_load_single_process_fsdp2(recipe_path, tmp_path):
         overrides=[
             f"checkpoint.ckpt_dir={temp_dir}",
             f"+wandb_init_args.dir={tmp_path}",
+            f"hydra.run.dir={tmp_path}",
             "num_train_steps=10",
             "checkpoint.save_every_n_steps=5",
             "checkpoint.resume_from_checkpoint=false",  # Start fresh
@@ -668,6 +677,7 @@ def test_checkpoint_save_and_load_two_processes_fsdp2(recipe_path, tmp_path):
         "num_train_steps=10",
         "checkpoint.save_every_n_steps=5",
         "dataset.use_stateful_dataloader=true",
+        f"hydra.run.dir={tmp_path}",
     ]
 
     result1 = subprocess.run(cmd_phase1, check=False, capture_output=True, text=True, env=env)
@@ -714,6 +724,7 @@
         "checkpoint.save_every_n_steps=5",
         "checkpoint.resume_from_checkpoint=true",  # Resume from checkpoint
         "dataset.use_stateful_dataloader=true",
+        f"hydra.run.dir={tmp_path}",
     ]
 
     result2 = subprocess.run(cmd_phase2, check=False, capture_output=True, text=True, env=env)
@@ -797,6 +808,7 @@ def test_final_model_save_mfsdp(recipe_path, tmp_path):
         overrides=[
             f"checkpoint.ckpt_dir={temp_dir}",
             f"+wandb_init_args.dir={tmp_path}",
+            f"hydra.run.dir={tmp_path}",
             "num_train_steps=3",
             "checkpoint.save_final_model=true",
         ],
@@ -831,6 +843,7 @@ def test_final_model_save_fsdp2(recipe_path, tmp_path):
         overrides=[
             f"checkpoint.ckpt_dir={temp_dir}",
             f"+wandb_init_args.dir={tmp_path}",
+            f"hydra.run.dir={tmp_path}",
             "checkpoint.save_final_model=true",
             "num_train_steps=3",
         ],
@@ -874,6 +887,7 @@ def test_scheduler_resume_single_gpu(recipe_path, tmp_path):
         overrides=[
             f"checkpoint.ckpt_dir={temp_dir}",
             f"+wandb_init_args.dir={tmp_path}",
+            f"hydra.run.dir={tmp_path}",
             "num_train_steps=10",
             "checkpoint.save_every_n_steps=5",
             "checkpoint.resume_from_checkpoint=false",  # Start fresh, don't look for checkpoints
@@ -891,6 +905,7 @@
         overrides=[
             f"checkpoint.ckpt_dir={temp_dir}",
             f"+wandb_init_args.dir={tmp_path}",
+            f"hydra.run.dir={tmp_path}",
             "num_train_steps=15",
             "checkpoint.save_every_n_steps=5",
             "checkpoint.resume_from_checkpoint=true",  # Resume from checkpoint
@@ -951,6 +966,7 @@ def test_scheduler_resume_two_gpu(recipe_path, tmp_path):
         "checkpoint.resume_from_checkpoint=false",  # Start fresh, don't look for checkpoints
         "lr_scheduler_kwargs.num_warmup_steps=20",
         "lr_scheduler_kwargs.num_training_steps=100",
+        f"hydra.run.dir={tmp_path}",
     ]
 
     result1 = subprocess.run(cmd_phase1, check=False, capture_output=True, text=True, env=env)
@@ -974,6 +990,7 @@
         "checkpoint.resume_from_checkpoint=true",  # Resume from checkpoint
         "lr_scheduler_kwargs.num_warmup_steps=20",
         "lr_scheduler_kwargs.num_training_steps=100",
+        f"hydra.run.dir={tmp_path}",
     ]
 
     result2 = subprocess.run(cmd_phase2, check=False, capture_output=True, text=True, env=env)
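Each of these hunks adds the same `hydra.run.dir={tmp_path}` override so that Hydra's run directory (the `.hydra/` config snapshot and log files) is created under pytest's temporary directory instead of the repository checkout. The equivalent override on a manual run looks like this (paths and step counts illustrative):

```bash
python train_fsdp2.py --config-name L0_sanity \
  checkpoint.ckpt_dir=/tmp/esm2_ckpt \
  hydra.run.dir=/tmp/esm2_hydra_outputs \
  num_train_steps=10 checkpoint.save_every_n_steps=5
```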

bionemo-recipes/recipes/geneformer_native_te_mfsdp_fp8/.dockerignore

Lines changed: 1 addition & 0 deletions
@@ -11,3 +11,4 @@ __pycache__
 wandb
 .**
 *.sqsh
+.venv/
