
DLWP Indexing and memory consumption fix #859


Closed · wants to merge 48 commits
3b15d51
Add workflow to automatically sync changes from nvidia/modulus into m…
nathanielcresswellclay Nov 23, 2024
8f0cc02
add WeightedOceanMSE to criterion
zacespinosa Nov 25, 2024
c439e00
Merge pull request #1 from AtmosSci-DLESM/WeightedOceanMSE
zacespinosa Dec 2, 2024
47fdce8
add optional gaussian noise to inputs and coupled variables during tr…
zacespinosa Nov 26, 2024
2826a02
add random seed - still need to test
zacespinosa Dec 2, 2024
e44dcc9
remove datatransformer code - shouldn't be part of this PR
zacespinosa Dec 2, 2024
f73e149
move logging
zacespinosa Dec 2, 2024
50041ef
Removed blossom-ci workflow from modulus-uw fork, updated automatic sync
nathanielcresswellclay Dec 2, 2024
e2f3376
Fix the training and inference problem in nvidia modulus
ivanauyeung Dec 13, 2024
b8951be
Fix indexing in constant coupler
nathanielcresswellclay Dec 14, 2024
20170ad
add 'Multi_SymmetricConvNeXtBlock'
Bwformer Dec 16, 2024
68ca5bf
Replace 'n_layers' with 'n_conv_blocks' for clarity
Bwformer Dec 16, 2024
0c3288a
Merge pull request #8 from AtmosSci-DLESM/fix_indexing_coupler
Bwformer Dec 16, 2024
c81405e
Merge pull request #2 from AtmosSci-DLESM/GaussianNoiseCoupled
zacespinosa Dec 16, 2024
1317c02
Fix indexing in constant coupler
ivanauyeung Dec 18, 2024
027473b
Fix indexing in constant coupler
ivanauyeung Dec 18, 2024
aff387e
change back to 'n_layers' to match the old models
Bwformer Dec 19, 2024
80821ec
Merge pull request #9 from Bwformer/bw/DoubleConv
Bwformer Dec 19, 2024
5726c7b
enforce precedence of upstream modulus changes when auto syncing.
nathanielcresswellclay Dec 26, 2024
4574532
Merge pull request #3 from AtmosSci-DLESM/update_workflow
nathanielcresswellclay Dec 26, 2024
be854fd
Merge branch 'main' into dlwp_coupled_training_inference_fix
yairchn Jan 6, 2025
957b00a
set scaling for mean: 0, std: 1 where no change is needed
yairchn Jan 6, 2025
d519a84
Merge branch 'dev' into dlwp_coupled_training_inference_fix
yairchn Jan 6, 2025
afb4e63
Merge pull request #7 from AtmosSci-DLESM/dlwp_coupled_training_infer…
yairchn Jan 6, 2025
3a3e9d1
fix memory leak in coupled timeseries
yairchn Mar 2, 2025
cd81224
Merge pull request #14 from AtmosSci-DLESM/yc/mem_leak_coupledtimeseries
yairchn Mar 3, 2025
6d34145
Add workflow to automatically sync changes from nvidia/modulus into m…
nathanielcresswellclay Nov 23, 2024
2b9008d
Fix the training and inference problem in nvidia modulus
ivanauyeung Dec 13, 2024
c8ff21a
Fix indexing in constant coupler
ivanauyeung Dec 18, 2024
e0e9d9b
Fix indexing in constant coupler
ivanauyeung Dec 18, 2024
e12861c
Removed blossom-ci workflow from modulus-uw fork, updated automatic sync
nathanielcresswellclay Dec 2, 2024
1dc4783
enforce precedence of upstream modulus changes when auto syncing.
nathanielcresswellclay Dec 26, 2024
5fac4e8
set scaling for mean: 0, std: 1 where no change is needed
yairchn Jan 6, 2025
9c5c9e8
add 'Multi_SymmetricConvNeXtBlock'
Bwformer Dec 16, 2024
5c27fba
Replace 'n_layers' with 'n_conv_blocks' for clarity
Bwformer Dec 16, 2024
2cd0cec
change back to 'n_layers' to match the old models
Bwformer Dec 19, 2024
7c92ab2
fix memory leak in coupled timeseries
yairchn Mar 2, 2025
bfa1489
Merge branch 'dev' into rebase_physicsnemo
daviddpruitt Apr 24, 2025
65d8483
add coupler fixes, var and time selection
daviddpruitt Apr 24, 2025
4d7f5c9
Fix for ordering on coupler
daviddpruitt Apr 24, 2025
3541167
batch size fix in coupler
daviddpruitt Apr 24, 2025
ed33b0c
broken workflow cleanup
daviddpruitt Apr 24, 2025
6bd06dc
Merge pull request #19 from AtmosSci-DLESM/rebase_physicsnemo
daviddpruitt Apr 24, 2025
f5ba5ff
cleanup for upstream merge (#20)
daviddpruitt Apr 26, 2025
fe01de4
Merge physics nemo (#21)
daviddpruitt May 7, 2025
dbbf67d
Add conditional loss for precip model training and option to disable …
raulantmor Jul 8, 2025
e82643e
Address PR formatting comments for precip diagnostic model
raulantmor Jul 9, 2025
481602b
Merge pull request #23 from AtmosSci-DLESM/morenor/precip
pzharrington Jul 15, 2025
13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -23,6 +23,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html)
- ERA5 download example updated to use current file format convention and
restricts global statistics computation to the training set
- Support for training custom StormCast models and various other improvements for StormCast
- Updated CorrDiff training code to support multiple patch iterations to amortize
regression cost and usage of `torch.compile`
- Refactored `physicsnemo/models/diffusion/layers.py` to optimize data type
casting workflow, avoiding unnecessary casting under autocast mode
- Refactored Conv2d to enable fusion of conv2d with bias addition
- Refactored GroupNorm, UNetBlock, SongUNet, SongUNetPosEmbd to support usage of
Apex GroupNorm, fusion of activation with GroupNorm, and AMP workflow.
- Updated SongUNetPosEmbd to avoid unnecessary HtoD Memcpy of `pos_embd`
- Updated `from_checkpoint` to accommodate conversion between Apex-optimized
and non-optimized checkpoints
- Refactored CorrDiff NVTX annotation workflow to be configurable
- Refactored `ResidualLoss` to support patch-accumulating training for
amortizing regression costs

### Deprecated

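The CorrDiff patch-iteration entries in the changelog above share one idea: the deterministic regression pass is computed once per image batch, and its cost is amortized by training the diffusion model on several batches of patches cut from those same images. Below is a minimal sketch of that pattern, with hypothetical names (regression_net, diffusion_loss_fn, sample_patches) standing in for the actual ResidualLoss internals:

import torch

def patch_accumulated_step(regression_net, diffusion_net, diffusion_loss_fn,
                           sample_patches, x_lr, y_hr, patch_iterations):
    # One (expensive) regression forward pass per image batch.
    with torch.no_grad():
        y_mean = regression_net(x_lr)
    residual = y_hr - y_mean

    # Amortize that cost over several batches of patches; gradients
    # accumulate across iterations before the optimizer step.
    total_loss = 0.0
    for _ in range(patch_iterations):
        res_patch, cond_patch = sample_patches(residual, x_lr)
        loss = diffusion_loss_fn(diffusion_net, res_patch, cond_patch)
        loss.backward()
        total_loss += loss.detach()
    return total_loss / patch_iterations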
@@ -14,7 +14,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

-name: diffusion
+name: patched_diffusion
# Model type.
hr_mean_conditioning: True
# Recommended to use high-res conditioning for diffusion.
@@ -23,4 +23,4 @@ model_args:
# Per-resolution multipliers for the number of channels.
channel_mult: [1, 2, 2, 2, 2]
# Resolutions at which self-attention layers are applied.
-attention_levels: [28]
+attn_resolutions: [28]
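For reference, `attn_resolutions` is the argument name the diffusion UNet itself expects, so the config key must match it for the setting to take effect. A hedged sketch of where the value ends up, assuming the physicsnemo SongUNet import path and signature (both may vary by version; the channel counts are illustrative, not from this PR):

from physicsnemo.models.diffusion import SongUNet  # import path may differ by version

# Self-attention runs at the feature-map resolutions listed in
# `attn_resolutions`; with a channel_mult of length 5, a 448px image is
# downsampled to 28px at the deepest level, where attention is applied.
net = SongUNet(
    img_resolution=448,
    in_channels=8,
    out_channels=4,
    channel_mult=[1, 2, 2, 2, 2],
    attn_resolutions=[28],
)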
@@ -0,0 +1,127 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

hydra:
  job:
    chdir: true
    name: patched_diffusion_opt
  run:
    dir: ./output/${hydra:job.name}
  searchpath:
    - pkg://conf/base # Do not modify

# Base parameters for dataset, model, training, and validation
defaults:

  - dataset: hrrr_corrdiff_synthetic
  # The dataset type for training.
  # Accepted values:
  #   `gefs_hrrr`: full GEFS-HRRR dataset for continental US.
  #   `hrrr_mini`: smaller HRRR dataset (continental US), for fast experiments.
  #   `cwb`: full CWB dataset for Taiwan.
  #   `custom`: user-defined dataset. Parameters need to be specified below.

  - model: patched_diffusion
  # The model type.
  # Accepted values:
  #   `regression`: a regression UNet for deterministic predictions
  #   `lt_aware_ce_regression`: similar to `regression` but with lead time
  #   conditioning
  #   `diffusion`: a diffusion UNet for residual predictions
  #   `patched_diffusion`: a more memory-efficient diffusion model
  #   `lt_aware_patched_diffusion`: similar to `patched_diffusion` but
  #   with lead time conditioning

  - model_size: normal
  # The model size configuration.
  # Accepted values:
  #   `normal`: normal model size
  #   `mini`: smaller model size for fast experiments

  - training: ${model}
  # The base training parameters. Determined by the model type.


# Dataset parameters. Used for `custom` dataset type.
# Modify or add below parameters that should be passed as argument to the
# user-defined dataset class.
dataset:
  data_path: ./data
  # Path to .nc data file
  stats_path: ./data/stats.json
  # Path to json stats file

# Training parameters
training:
  hp:
    training_duration: 200000000
    # Training duration based on the number of processed samples
    total_batch_size: 512
    # Total batch size
    batch_size_per_gpu: 4

    patch_shape_x: 448
    patch_shape_y: 448
    # Patch size. Patch training is used if these dimensions differ from
    # img_shape_x and img_shape_y.
    patch_num: 16
    # Number of patches from a single sample. Total number of patches is
    # patch_num * total_batch_size.
    max_patch_per_gpu: 9
    # Maximum number of patches a GPU can hold

    lr: 0.0002
    # Learning rate
    grad_clip_threshold: 1e6
    lr_decay: 0.7
    lr_rampup: 1000000

  # Performance
  perf:
    fp_optimizations: amp-bf16
    # Floating point mode, one of ["fp32", "fp16", "amp-fp16", "amp-bf16"]
    # "amp-{fp16,bf16}" activates Automatic Mixed Precision (AMP) with {float16,bfloat16}
    dataloader_workers: 4
    # DataLoader worker processes
    songunet_checkpoint_level: 0 # 0 means no checkpointing
    # Gradient checkpointing level; value is the number of layers to checkpoint
    # optimization_mode: True
    use_apex_gn: True
    torch_compile: True
    profile_mode: False

  io:
    regression_checkpoint_path: /lustre/fsw/portfolios/coreai/users/asui/video-corrdiff-checkpoints/training-state-regression-000513.mdlus
    # Path to load the regression checkpoint
    print_progress_freq: 1000
    # How often to print progress
    save_checkpoint_freq: 500000
    # How often to save the checkpoints, measured in number of processed samples
    validation_freq: 5000
    # How often to record the validation loss, measured in number of processed samples
    validation_steps: 10
    # How many loss evaluations are used to compute the validation loss per checkpoint

# Parameters for wandb logging
wandb:
  mode: offline
  # Configure whether to use wandb: "offline", "online", "disabled"
  results_dir: "./wandb"
  # Directory to store wandb results
  watch_model: false
  # If true, wandb will track model parameters and gradients
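A note on how the patch hyperparameters in this config interact: each sample yields patch_num patches, but a GPU only holds max_patch_per_gpu at once, so a local batch is processed in several passes per optimizer step. A back-of-the-envelope sketch follows (plain arithmetic, not the actual CorrDiff scheduling; the config file name is hypothetical):

from omegaconf import OmegaConf

cfg = OmegaConf.load("config_training_hrrr_patched_opt.yaml")  # hypothetical file name
hp = cfg.training.hp

# Patches contributing to one optimizer step across all ranks.
patches_per_step = hp.total_batch_size * hp.patch_num  # 512 * 16 = 8192

# Patches a single GPU must process for its local batch...
local_patches = hp.batch_size_per_gpu * hp.patch_num   # 4 * 16 = 64

# ...which, at max_patch_per_gpu patches per pass, takes ceil(64 / 9) = 8 passes.
passes_per_step = -(-local_patches // hp.max_patch_per_gpu)

print(patches_per_step, local_patches, passes_per_step)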
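Similarly, the fp_optimizations flag in the perf block maps onto PyTorch autocast roughly as follows; this is a sketch of the convention the comment describes, not the exact CorrDiff code:

import contextlib
import torch

def autocast_context(fp_optimizations: str):
    # "amp-fp16"/"amp-bf16" wrap the forward pass in torch.autocast;
    # "fp32" and "fp16" run the model natively in that precision.
    if fp_optimizations == "amp-fp16":
        return torch.autocast("cuda", dtype=torch.float16)
    if fp_optimizations == "amp-bf16":
        return torch.autocast("cuda", dtype=torch.bfloat16)
    return contextlib.nullcontext()

With amp-fp16 a gradient scaler (torch.cuda.amp.GradScaler) is typically needed as well; bfloat16 has the same exponent range as float32 and usually does not.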