🐛[BUG]: Issue of training diffusion model of StormCast #1097

@Steven04453X

Description

Version

1.0.1

On which installation method(s) does this occur?

Pip

Describe the issue

Dear StormCast Teams,
I am attempting to train StormCast using the HRRR and ERA5 datasets over the central US, following your paper.

With the released training scripts, we trained the UNet model smoothly, and its training loss decreased markedly. However, we have encountered an issue when training the diffusion model: its training loss remains nearly constant over hundreds of epochs.

We used the same training scripts for both the UNet and diffusion models, but before training the diffusion model we modified the following hyperparameters relative to the UNet configuration.

In `$StormCast/config/training/default.yaml`:

```yaml
outdir: 'diffusion_model'
loss: 'edm'
```

In `$StormCast/config/model/stormcast.yaml`:

```yaml
model_name: 'diffusion'
use_regression_net: True
regression_weights: '$StormCast/UNet/checkpoints/StormCastUNet.0.520.mdlus'
```

In `$StormCast/config/diffusion.yaml`:

```yaml
# Diffusion model specific changes
model:
  use_regression_net: True
  regression_weights: '$StormCast/UNet/checkpoints/StormCastUNet.0.520.mdlus'
  previous_step_conditioning: True
  spatial_pos_embed: True
training:
  loss: 'edm'
```
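For context on the behavior we expected: as we understand it, the `'edm'` option corresponds to the weighted denoising loss of Karras et al. (2022). Below is a minimal NumPy sketch of that loss, not StormCast's actual implementation; the hyperparameter values and the `denoise_fn` interface are our assumptions. It illustrates that a denoiser whose output never improves (here, an identity map) yields a loss that stays flat rather than decreasing, which matches what we observe.

```python
import numpy as np

rng = np.random.default_rng(0)

# EDM hyperparameters from Karras et al. (2022); StormCast's defaults may differ.
P_mean, P_std, sigma_data = -1.2, 1.2, 0.5

def edm_loss(denoise_fn, y, rng):
    """One-sample EDM loss: weighted MSE between denoiser output and clean target."""
    sigma = np.exp(P_mean + P_std * rng.standard_normal())         # log-normal noise level
    weight = (sigma**2 + sigma_data**2) / (sigma * sigma_data)**2  # lambda(sigma)
    n = sigma * rng.standard_normal(y.shape)                       # Gaussian noise at level sigma
    d = denoise_fn(y + n, sigma)                                   # denoiser's estimate of y
    return weight * np.mean((d - y) ** 2)

# A toy denoiser that just returns its noisy input, i.e. removes no noise.
identity_denoiser = lambda x, sigma: x

y = rng.standard_normal((8, 8))
losses = [edm_loss(identity_denoiser, y, rng) for _ in range(100)]
print(np.mean(losses))  # stays at the same level no matter how long we "train"
```

In our runs the diffusion model's loss behaves like this toy case, which is why we suspect the denoiser is not actually learning (e.g. the regression net or conditioning is not wired up as intended).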

Could you please let us know whether we have missed any key steps or configurations in setting up the diffusion model training? We are happy to provide any additional information about our training process.

Thank you very much for your time and guidance.

Minimum reproducible example

Relevant log output

Environment details
