-
Notifications
You must be signed in to change notification settings - Fork 484
Validation fu added to examples/structural_mechanics/crash/train.py #1204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Dakhare crash - Validation function to track model performance on test dataset is added in physicsnemo/examples/structural_mechanics/crash/train.py - validate_every_n_epochs, save_ckpt_every_n_epochs added in config/training/default.yaml to assign frequency for calling validation function and saking checkpoint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR adds validation functionality to the structural mechanics crash simulation training example. The main changes include: (1) adding validation dataset creation and distributed sampling in train.py, (2) implementing a validation loop that computes time-step-wise MSE loss and aggregates results across distributed ranks, (3) adding validation configuration parameters to control validation frequency and checkpoint saving, and (4) refactoring the inference code to use a unified sample object interface instead of passing individual graph components separately.
The validation implementation follows distributed training best practices by properly handling data sampling, metric aggregation, and logging only on rank 0. The changes integrate cleanly with the existing training pipeline and tensorboard logging infrastructure, providing essential model monitoring capabilities for the crash simulation example.
PR Description Notes:
- The PR description is largely empty with only unchecked checklist items
- No standalone description of changes provided
- No linked issues or changelog updates mentioned
- Missing information about new dependencies or testing coverage
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| examples/structural_mechanics/crash/train.py | 4/5 | Added comprehensive validation functionality with distributed sampling, MSE computation, and tensorboard logging |
| examples/structural_mechanics/crash/conf/training/default.yaml | 5/5 | Added validation configuration parameters for sample count, validation frequency, and checkpoint saving |
| examples/structural_mechanics/crash/inference.py | 4/5 | Refactored model forward pass to use unified sample object interface instead of separate graph components |
3 files reviewed, 3 comments
getting updates from NVIDIA/physicsnemo
updating crash branch
Dakhare crash
|
/blossom-ci |
val path added and args name changed
update main
update main
update main
|
/blossom-ci |
mnabian
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for addressing the comments!
* Move filesystems and version_check to core * Fix version check tests * Reorganize distributed, domain_parallel, and begin nn / utils cleanup. * Move modules and meta to core. Move registry to core. No tests fixed yet. * Add missing init files * Update build system and specify some deps. * Reorganize tests. * Update init files * Clean up neighbor tools. * Update testing * Fix compat tests * Move core model tests to tests/core/ * Add import lint config * Relocate layers * Move graphcast utils into model directory * Relocating util functionalities. * Further clean up and organize tests. * utils tests are passing now * Cleaning up distributed tests * Patching tests working again in nn * Fix sdf test * Fix zenith angle tests * Some organization of tests. Checkpoints is moved into utils. * Remove launch.utils and launch.config. Checkpointing is moved to phsyicsnemo.utils, launch.config is just gone. It was empty. * Most nn tests are passing * Further cleanup. Getting there! * Remove constants file * Add import linting to pre-commit. * Update crash readme (#1212) * update license headers- second try * update readme * Bump multi-storage-client to v0.33.0 with rust client (#1156) * Move gnn layers and start to fix several model tests. * AFNO is now passing. * Rnn models passing. * Fix improt * Healpix tests are working * Domino and unet working * Add jaxtyping to requirements.txt for crash sample (#1218) * update license headers- second try * Update requirements.txt * Updating to address some test issues * Replace 'License' link with 'Dev blog' link (#1215) Co-authored-by: Corey adams <[email protected]> * MGN tests passing again * Most graphcast tests passing again * Move nd conv layers. * update fengwu and pangu * Update sfno and pix2pix test * update tests for figconvnet, swinrnn, superresnet * updating more models to pass * Update distributed tests, now passing. * Validation fu added to examples/structural_mechanics/crash/train.py (#1204) * validation added: works for multi-node job. * rename and rearrange validation function * validate_every_n_epochs, save_ckpt_every_n_epochs added in config * corrected bug (args of model) in inference * args in validation code updated * val path added and args name changed * validation split added -> write_vtp=False * fixed inference bug * bug fix: write_vtp * Domain parallel tests now passing. * Fix active learning imports so tests pass in refactor * Fix some metric imports * Remove deploy package * Remove unused test file * unmigrate these files ... again? * Update import linter. * Add saikrishnanc-nv to github actors (#1225) * Integrate Curator instructions to the Crash example (#1213) * Integrate Curator instructions * Update docs * Formatting changes * Adding code of conduct (#1214) * Adding code of conduct Adopting the code of conduct from the https://www.contributor-covenant.org/ * Update CODE_OF_CONDUCT.MD Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * Create .markdownlintignore * Revise README for PhysicsNeMo resources and guidance Updated the 'Getting Started' section and added new resources for learning AI Physics. * Update README.md --------- Co-authored-by: Mohammad Amin Nabian <[email protected]> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Corey adams <[email protected]> * Cleaning up diffusion models. Not quite done yet. * Restore deleted files * Updating more tests. * Further updates to tests. Datapipes almost working. --------- Co-authored-by: Mohammad Amin Nabian <[email protected]> Co-authored-by: Yongming Ding <[email protected]> Co-authored-by: ram-cherukuri <[email protected]> Co-authored-by: Deepak Akhare <[email protected]> Co-authored-by: Sai Krishnan Chandrasekar <[email protected]> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* Move filesystems and version_check to core * Fix version check tests * Reorganize distributed, domain_parallel, and begin nn / utils cleanup. * Move modules and meta to core. Move registry to core. No tests fixed yet. * Add missing init files * Update build system and specify some deps. * Reorganize tests. * Update init files * Clean up neighbor tools. * Update testing * Fix compat tests * Move core model tests to tests/core/ * Add import lint config * Relocate layers * Move graphcast utils into model directory * Relocating util functionalities. * Further clean up and organize tests. * utils tests are passing now * Cleaning up distributed tests * Patching tests working again in nn * Fix sdf test * Fix zenith angle tests * Some organization of tests. Checkpoints is moved into utils. * Remove launch.utils and launch.config. Checkpointing is moved to phsyicsnemo.utils, launch.config is just gone. It was empty. * Most nn tests are passing * Further cleanup. Getting there! * Remove constants file * Add import linting to pre-commit. * Update crash readme (#1212) * update license headers- second try * update readme * Bump multi-storage-client to v0.33.0 with rust client (#1156) * Move gnn layers and start to fix several model tests. * AFNO is now passing. * Rnn models passing. * Fix improt * Healpix tests are working * Domino and unet working * Add jaxtyping to requirements.txt for crash sample (#1218) * update license headers- second try * Update requirements.txt * Updating to address some test issues * Replace 'License' link with 'Dev blog' link (#1215) Co-authored-by: Corey adams <[email protected]> * MGN tests passing again * Most graphcast tests passing again * Move nd conv layers. * update fengwu and pangu * Update sfno and pix2pix test * update tests for figconvnet, swinrnn, superresnet * updating more models to pass * Update distributed tests, now passing. * Validation fu added to examples/structural_mechanics/crash/train.py (#1204) * validation added: works for multi-node job. * rename and rearrange validation function * validate_every_n_epochs, save_ckpt_every_n_epochs added in config * corrected bug (args of model) in inference * args in validation code updated * val path added and args name changed * validation split added -> write_vtp=False * fixed inference bug * bug fix: write_vtp * Domain parallel tests now passing. * Fix active learning imports so tests pass in refactor * Fix some metric imports * Remove deploy package * Remove unused test file * unmigrate these files ... again? * Update import linter. * Add saikrishnanc-nv to github actors (#1225) * Integrate Curator instructions to the Crash example (#1213) * Integrate Curator instructions * Update docs * Formatting changes * Adding code of conduct (#1214) * Adding code of conduct Adopting the code of conduct from the https://www.contributor-covenant.org/ * Update CODE_OF_CONDUCT.MD Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * Create .markdownlintignore * Revise README for PhysicsNeMo resources and guidance Updated the 'Getting Started' section and added new resources for learning AI Physics. * Update README.md --------- Co-authored-by: Mohammad Amin Nabian <[email protected]> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Corey adams <[email protected]> * Cleaning up diffusion models. Not quite done yet. * Restore deleted files * Updating more tests. * Fixed minor bug in shape validation in SongUNet (#1230) Signed-off-by: Charlelie Laurent <[email protected]> * Add Zarr reader for Crash (#1228) * Add Zarr reader for Crash * Update README * Update validation logic of point data in Zarr reader Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * Update examples/structural_mechanics/crash/zarr_reader.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * Add a test for 2D feature arrays * Update examples/structural_mechanics/crash/zarr_reader.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * Further updates to tests. Datapipes almost working. * update import paths * Starting to clean up dependency tree. --------- Signed-off-by: Charlelie Laurent <[email protected]> Co-authored-by: Mohammad Amin Nabian <[email protected]> Co-authored-by: Yongming Ding <[email protected]> Co-authored-by: ram-cherukuri <[email protected]> Co-authored-by: Deepak Akhare <[email protected]> Co-authored-by: Sai Krishnan Chandrasekar <[email protected]> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Charlelie Laurent <[email protected]>
PhysicsNeMo Pull Request
Description
Checklist
Dependencies
Review Process
All PRs are reviewed by the PhysicsNeMo team before merging.
Depending on which files are changed, GitHub may automatically assign a maintainer for review.
We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.
AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.