Validation fu added to examples/structural_mechanics/crash/train.py #1204

dakhare-creator · 2025-10-31T23:38:21Z

PhysicsNeMo Pull Request

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
The CHANGELOG.md is up to date with these changes.
An issue is linked to this pull request.

Dependencies


Dakhare crash

- A validation function that tracks model performance on the test dataset is added in physicsnemo/examples/structural_mechanics/crash/train.py.
- validate_every_n_epochs and save_ckpt_every_n_epochs are added in conf/training/default.yaml to set how often the validation function is called and how often a checkpoint is saved.
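For readers who want to see how two frequency settings like these typically gate a training loop, here is a minimal sketch. Only the names validate_every_n_epochs and save_ckpt_every_n_epochs come from this PR. The TrainingConfig class, the epochs field, the default values, and the 0-based epoch convention are assumptions made for illustration and may not match train.py.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrainingConfig:
    # The two *_every_n_epochs names come from the PR description.
    # The epochs field and all default values are assumed for this sketch.
    epochs: int = 6
    validate_every_n_epochs: int = 2
    save_ckpt_every_n_epochs: int = 3


def is_due(epoch: int, every_n: int) -> bool:
    """Return True when a 0-based epoch index completes a block of every_n epochs."""
    return every_n > 0 and (epoch + 1) % every_n == 0


def schedule(cfg: TrainingConfig) -> List[Tuple[int, bool, bool]]:
    """List (epoch, run_validation, save_checkpoint) for every epoch."""
    return [
        (
            epoch,
            is_due(epoch, cfg.validate_every_n_epochs),
            is_due(epoch, cfg.save_ckpt_every_n_epochs),
        )
        for epoch in range(cfg.epochs)
    ]


if __name__ == "__main__":
    for epoch, validate, save in schedule(TrainingConfig()):
        print(f"epoch={epoch} validate={validate} save_ckpt={save}")
```

Keeping the two frequencies independent lets validation run more often than checkpointing (or the reverse) without touching the loop body.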

greptile-apps

Greptile Overview

Greptile Summary

This PR adds validation functionality to the structural mechanics crash simulation training example. The main changes include: (1) adding validation dataset creation and distributed sampling in train.py, (2) implementing a validation loop that computes time-step-wise MSE loss and aggregates results across distributed ranks, (3) adding validation configuration parameters to control validation frequency and checkpoint saving, and (4) refactoring the inference code to use a unified sample object interface instead of passing individual graph components separately.
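As an illustration of item (2), the sketch below computes a per-time-step MSE and combines it across ranks. It is not the code from train.py: it uses plain Python lists instead of tensors, and all_reduce_sum is a hypothetical stand-in for a sum all-reduce. The design point is that each rank contributes squared-error sums and element counts, and the division happens only after the reduction, so ranks holding different numbers of samples are weighted correctly.

```python
from typing import List, Sequence, Tuple

# One rollout: a sequence of time steps, each a flat sequence of node values.
Rollout = Sequence[Sequence[float]]


def local_sums(
    pairs: Sequence[Tuple[Rollout, Rollout]], num_steps: int
) -> Tuple[List[float], List[int]]:
    """Accumulate squared-error sums and element counts per time step on one rank."""
    sq_err = [0.0] * num_steps
    count = [0] * num_steps
    for pred, target in pairs:
        for t in range(num_steps):
            for p, y in zip(pred[t], target[t]):
                sq_err[t] += (p - y) ** 2
                count[t] += 1
    return sq_err, count


def all_reduce_sum(per_rank: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical stand-in for a sum all-reduce: element-wise sum over ranks."""
    return [sum(values) for values in zip(*per_rank)]


def timestep_mse(
    per_rank_pairs: Sequence[Sequence[Tuple[Rollout, Rollout]]], num_steps: int
) -> List[float]:
    """Global per-time-step MSE; assumes at least one rank is present."""
    sums, counts = zip(*(local_sums(pairs, num_steps) for pairs in per_rank_pairs))
    total_sq = all_reduce_sum(sums)
    total_n = all_reduce_sum(counts)
    return [s / n if n else float("nan") for s, n in zip(total_sq, total_n)]


if __name__ == "__main__":
    # Rank 0 holds two (prediction, target) pairs, rank 1 holds one.
    rank0 = [
        ([[1.0, 1.0], [2.0, 2.0]], [[0.0, 0.0], [2.0, 2.0]]),
        ([[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [1.0, 1.0]]),
    ]
    rank1 = [([[3.0, 3.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])]
    print(timestep_mse([rank0, rank1], num_steps=2))
```

In this toy input the global MSE is 20/6 at the first step and 2/6 at the second, whereas averaging the two per-rank means would give 4.75 and 0.25.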

The validation implementation follows distributed training best practices by properly handling data sampling, metric aggregation, and logging only on rank 0. The changes integrate cleanly with the existing training pipeline and tensorboard logging infrastructure, providing essential model monitoring capabilities for the crash simulation example.
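The data-sampling and rank-0 logging points can be sketched as follows. This is a pure-Python approximation, not the sampler used in train.py: shard_indices and should_log are hypothetical helper names, and the wrap-around padding imitates what PyTorch's DistributedSampler does when drop_last is False.

```python
import math
from typing import List


def shard_indices(num_samples: int, rank: int, world_size: int) -> List[int]:
    """Return the validation sample indices handled by one rank (no shuffling).

    The index list is padded by wrapping around so that every rank receives
    the same number of samples.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    if num_samples == 0:
        return []
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    # Pad with indices taken from the start until the list divides evenly.
    while len(indices) < total:
        indices += indices[: total - len(indices)]
    # Interleave: rank r takes positions r, r + world_size, r + 2 * world_size, ...
    return indices[rank:total:world_size]


def should_log(rank: int) -> bool:
    """Only rank 0 writes metrics, so each value is logged once per job."""
    return rank == 0


if __name__ == "__main__":
    for rank in range(2):
        print(rank, shard_indices(5, rank, world_size=2), should_log(rank))
```

Padding repeats a few samples when the dataset size is not a multiple of the number of ranks, so aggregated validation metrics can be slightly biased toward the repeated samples.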

PR Description Notes:

The PR description is largely empty with only unchecked checklist items
No standalone description of changes provided
No linked issues or changelog updates mentioned
Missing information about new dependencies or testing coverage

Important Files Changed

Filename	Score	Overview
examples/structural_mechanics/crash/train.py	4/5	Added comprehensive validation functionality with distributed sampling, MSE computation, and tensorboard logging
examples/structural_mechanics/crash/conf/training/default.yaml	5/5	Added validation configuration parameters for sample count, validation frequency, and checkpoint saving
examples/structural_mechanics/crash/inference.py	4/5	Refactored model forward pass to use unified sample object interface instead of separate graph components

3 files reviewed, 3 comments

mnabian · 2025-11-03T23:36:27Z

/blossom-ci

val path added and args name changed

update main

mnabian · 2025-11-10T22:07:24Z

/blossom-ci

mnabian

LGTM! Thanks for addressing the comments!

* Move filesystems and version_check to core
* Fix version check tests
* Reorganize distributed, domain_parallel, and begin nn / utils cleanup.
* Move modules and meta to core. Move registry to core. No tests fixed yet.
* Add missing init files
* Update build system and specify some deps.
* Reorganize tests.
* Update init files
* Clean up neighbor tools.
* Update testing
* Fix compat tests
* Move core model tests to tests/core/
* Add import lint config
* Relocate layers
* Move graphcast utils into model directory
* Relocating util functionalities.
* Further clean up and organize tests.
* utils tests are passing now
* Cleaning up distributed tests
* Patching tests working again in nn
* Fix sdf test
* Fix zenith angle tests
* Some organization of tests. Checkpoints is moved into utils.
* Remove launch.utils and launch.config. Checkpointing is moved to phsyicsnemo.utils, launch.config is just gone. It was empty.
* Most nn tests are passing
* Further cleanup. Getting there!
* Remove constants file
* Add import linting to pre-commit.
* Update crash readme (#1212)
* update license headers- second try
* update readme
* Bump multi-storage-client to v0.33.0 with rust client (#1156)
* Move gnn layers and start to fix several model tests.
* AFNO is now passing.
* Rnn models passing.
* Fix improt
* Healpix tests are working
* Domino and unet working
* Add jaxtyping to requirements.txt for crash sample (#1218)
* update license headers- second try
* Update requirements.txt
* Updating to address some test issues
* Replace 'License' link with 'Dev blog' link (#1215)
Co-authored-by: Corey adams <[email protected]>
* MGN tests passing again
* Most graphcast tests passing again
* Move nd conv layers.
* update fengwu and pangu
* Update sfno and pix2pix test
* update tests for figconvnet, swinrnn, superresnet
* updating more models to pass
* Update distributed tests, now passing.
* Validation fu added to examples/structural_mechanics/crash/train.py (#1204)
* validation added: works for multi-node job.
* rename and rearrange validation function
* validate_every_n_epochs, save_ckpt_every_n_epochs added in config
* corrected bug (args of model) in inference
* args in validation code updated
* val path added and args name changed
* validation split added -> write_vtp=False
* fixed inference bug
* bug fix: write_vtp
* Domain parallel tests now passing.
* Fix active learning imports so tests pass in refactor
* Fix some metric imports
* Remove deploy package
* Remove unused test file
* unmigrate these files ... again?
* Update import linter.
* Add saikrishnanc-nv to github actors (#1225)
* Integrate Curator instructions to the Crash example (#1213)
* Integrate Curator instructions
* Update docs
* Formatting changes
* Adding code of conduct (#1214)
* Adding code of conduct Adopting the code of conduct from the https://www.contributor-covenant.org/
* Update CODE_OF_CONDUCT.MD
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* Create .markdownlintignore
* Revise README for PhysicsNeMo resources and guidance Updated the 'Getting Started' section and added new resources for learning AI Physics.
* Update README.md
---------
Co-authored-by: Mohammad Amin Nabian <[email protected]>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Corey adams <[email protected]>
* Cleaning up diffusion models. Not quite done yet.
* Restore deleted files
* Updating more tests.
* Further updates to tests. Datapipes almost working.
---------
Co-authored-by: Mohammad Amin Nabian <[email protected]>
Co-authored-by: Yongming Ding <[email protected]>
Co-authored-by: ram-cherukuri <[email protected]>
Co-authored-by: Deepak Akhare <[email protected]>
Co-authored-by: Sai Krishnan Chandrasekar <[email protected]>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Fixed minor bug in shape validation in SongUNet (#1230)
Signed-off-by: Charlelie Laurent <[email protected]>
* Add Zarr reader for Crash (#1228)
* Add Zarr reader for Crash
* Update README
* Update validation logic of point data in Zarr reader
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* Update examples/structural_mechanics/crash/zarr_reader.py
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* Add a test for 2D feature arrays
* Update examples/structural_mechanics/crash/zarr_reader.py
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
---------
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* Further updates to tests. Datapipes almost working.
* update import paths
* Starting to clean up dependency tree.
---------
Signed-off-by: Charlelie Laurent <[email protected]>
Co-authored-by: Mohammad Amin Nabian <[email protected]>
Co-authored-by: Yongming Ding <[email protected]>
Co-authored-by: ram-cherukuri <[email protected]>
Co-authored-by: Deepak Akhare <[email protected]>
Co-authored-by: Sai Krishnan Chandrasekar <[email protected]>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Charlelie Laurent <[email protected]>

dakhare-creator added 5 commits October 30, 2025 16:24

validation added: works for multi-node job.

f442134

rename and rearrange validation function

87ad160

validate_every_n_epochs, save_ckpt_every_n_epochs added in config

f833402

corrected bug (args of model) in inference

50aecf3

greptile-apps bot reviewed Oct 31, 2025

examples/structural_mechanics/crash/train.py (outdated, resolved)

examples/structural_mechanics/crash/train.py (outdated, resolved)

examples/structural_mechanics/crash/train.py (resolved)

dakhare-creator added 4 commits October 31, 2025 16:47

Merge pull request #2 from NVIDIA/main

6682e0f

getting updates from NVIDIA/physicsnemo

Merge pull request #3 from dakhare-creator/main

db8f6df

updating crash branch

args in validation code updated

cc2add3

Merge pull request #4 from dakhare-creator/dakhare-crash

68a3131

Dakhare crash