
Commit 06d6aa3

Merge branch 'NVIDIA:main' into interp-model-example
2 parents 2a6c06a + 470e6fa commit 06d6aa3

41 files changed (+3836, -138 lines)

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -22,6 +22,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
   models. Accessible in `examples/geophysics/diffusion_fwi`.
 - Domain Parallelism: Domain Parallelism is now available for kNN, radius_search,
   and torch.nn.functional.pad.
+- Unified recipe for crash modeling, supporting Transolver and MeshGraphNet,
+  and three transient schemes.
 - Added a check to `stochastic_sampler` that helps handle the `EDMPrecond` model,
   which has a specific `.forward()` signature
```

examples/cfd/external_aerodynamics/domino/README.md

Lines changed: 8 additions & 13 deletions
````diff
@@ -188,13 +188,14 @@ GPUs and perform operations in a numerically consistent way. For more information
 about the techniques of domain parallelism and `ShardTensor`, refer to PhysicsNeMo
 tutorials such as [`ShardTensor`](https://docs.nvidia.com/deeplearning/physicsnemo/physicsnemo-core/api/physicsnemo.distributed.shardtensor.html).
 
-In DoMINO specifically, domain parallelism has been abled in two ways, which
+In DoMINO specifically, domain parallelism has been enabled in two ways, which
 can be used concurrently or separately. First, the input sampled volumetric
 and surface points can be sharded to accommodate higher resolution point sampling.
 Second, the latent space of the model - typically a regularized grid - can be
 sharded to reduce computational complexity of the latent processing. When training
 with sharded models in DoMINO, the primary objective is to enable higher
-resolution inputs and larger latent spaces without sacrificing substantial compute time.
+resolution inputs and larger latent spaces without sacrificing
+substantial compute time.
 
 When configuring DoMINO for sharded training, adjust the following parameters
 from `src/conf/config.yaml`:
@@ -207,19 +208,13 @@ domain_parallelism:
 ```
 
 The `domain_size` represents the number of GPUs used for each batch - setting
-`domain_size: 1` is not advised since that is the standard training regime,
-but with extra overhead. `shard_grid` and `shard_points` will enable domain
+`domain_size: 1` is the standard training regime, and `domain_parallelism`
+will be ignored. `shard_grid` and `shard_points` will enable domain
 parallelism over the latent space and input/output points, respectively.
 
-As one last note regarding domain-parallel training: in the phase of DoMINO
-where the output solutions are calculated, the model can use two different
-techniques (numerically identical) to calculate the output. Due to the
-overhead of potential communication at each operation, it's recommended to
-use the `one-loop` mode with `model.solution_calculation_mode` when doing
-sharded training. This technique launches vectorized kernels with less
-launch overhead at the cost of more memory use. For non-sharded
-training, the `two-loop` setting is more optimal. The difference in `one-loop`
-or `two-loop` is purely computational, not algorithmic.
+Setting `domain_size > 1` without specifying `shard_points=True` or `shard_grid=True`
+will result in a runtime error during configuration - if you do not want to use
+domain parallelism, leave `domain_size=1`.
 
 ### Performance Optimizations
 
````
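For orientation, here is a hypothetical sketch of the `domain_parallelism` block in `src/conf/config.yaml` that this hunk refers to; the field names come from the diff above, while the values and comments are illustrative assumptions:

```yaml
# Illustrative values only; field names follow the README diff above.
domain_parallelism:
  domain_size: 2      # GPUs per batch; leave at 1 to disable domain parallelism
  shard_grid: true    # shard the latent grid across the domain mesh
  shard_points: true  # shard input/output point clouds across the domain mesh
```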
examples/cfd/external_aerodynamics/domino/src/train.py

Lines changed: 30 additions & 13 deletions
```diff
@@ -44,6 +44,9 @@
 
 import torchinfo
 import torch.distributed as dist
+from torch.distributed.fsdp import fully_shard
+from torch.distributed.tensor import distribute_module
+
 from torch.amp import GradScaler, autocast
 from torch.nn.parallel import DistributedDataParallel
 from torch.utils.data import DataLoader
@@ -333,6 +336,13 @@ def main(cfg: DictConfig) -> None:
     # how to set that up, if needed.
     domain_mesh, data_mesh, placements = coordinate_distributed_environment(cfg)
 
+    if data_mesh is not None:
+        data_replica_size = data_mesh.size()
+        data_rank = data_mesh.get_local_rank()
+    else:
+        data_replica_size = dist.world_size
+        data_rank = dist.rank
+
     ################################
     # Initialize NVML
     ################################
@@ -438,8 +448,8 @@ def main(cfg: DictConfig) -> None:
     )
     train_sampler = DistributedSampler(
         train_dataloader,
-        num_replicas=data_mesh.size(),
-        rank=data_mesh.get_local_rank(),
+        num_replicas=data_replica_size,
+        rank=data_rank,
         **cfg.train.sampler,
     )
 
@@ -458,8 +468,8 @@
     )
     val_sampler = DistributedSampler(
         val_dataloader,
-        num_replicas=data_mesh.size(),
-        rank=data_mesh.get_local_rank(),
+        num_replicas=data_replica_size,
+        rank=data_rank,
         **cfg.val.sampler,
     )
 
@@ -478,15 +488,22 @@
     logger.info(f"Model summary:\n{torchinfo.summary(model, verbose=0, depth=2)}\n")
 
     if dist.world_size > 1:
-        model = DistributedDataParallel(
-            model,
-            device_ids=[dist.local_rank],
-            output_device=dist.device,
-            broadcast_buffers=dist.broadcast_buffers,
-            find_unused_parameters=dist.find_unused_parameters,
-            gradient_as_bucket_view=True,
-            static_graph=True,
-        )
+        if domain_mesh is None:
+            model = DistributedDataParallel(
+                model,
+                device_ids=[dist.local_rank],
+                output_device=dist.device,
+                broadcast_buffers=dist.broadcast_buffers,
+                find_unused_parameters=dist.find_unused_parameters,
+                gradient_as_bucket_view=True,
+                static_graph=True,
+            )
+        else:
+            model = distribute_module(
+                model,
+                device_mesh=domain_mesh,
+            )
+            model = fully_shard(model, mesh=data_mesh)
 
     ######################################################
     # Initialize optimizer and gradient scaler
```
examples/cfd/external_aerodynamics/domino/src/utils.py

Lines changed: 64 additions & 36 deletions
```diff
@@ -170,44 +170,72 @@ def coordinate_distributed_environment(cfg: DictConfig):
     # Default to no domain parallelism:
     domain_size = cfg.get("domain_parallelism", {}).get("domain_size", 1)
 
-    # Initialize the device mesh:
-    mesh = dist.initialize_mesh(
-        mesh_shape=(-1, domain_size), mesh_dim_names=("ddp", "domain")
-    )
-    domain_mesh = mesh["domain"]
-    data_mesh = mesh["ddp"]
-
-    if domain_size > 1:
-        # Define the default placements for each tensor that might show up in
-        # the data. Note that we'll define placements for all keys, even if
-        # they aren't actually used.
-
-        # Note that placements are defined for pre-batched data, no batch index!
-
-        grid_like_placement = [
-            Shard(0),
-        ]
-        point_like_placement = [
-            Shard(0),
-        ]
-        replicate_placement = [
-            Replicate(),
-        ]
-        placements = {
-            "stl_coordinates": point_like_placement,
-            "stl_centers": point_like_placement,
-            "stl_faces": point_like_placement,
-            "stl_areas": point_like_placement,
-            "surface_fields": point_like_placement,
-            "volume_mesh_centers": point_like_placement,
-            "volume_fields": point_like_placement,
-            "surface_mesh_centers": point_like_placement,
-            "surface_normals": point_like_placement,
-            "surface_areas": point_like_placement,
-        }
-    else:
+    if dist.world_size == 1:
         domain_mesh = None
+        data_mesh = None
         placements = None
+    else:
+        # Initialize the device mesh:
+        mesh = dist.initialize_mesh(
+            mesh_shape=(-1, domain_size), mesh_dim_names=("ddp", "domain")
+        )
+        domain_mesh = mesh["domain"]
+        data_mesh = mesh["ddp"]
+
+        if domain_size > 1:
+            # Define the default placements for each tensor that might show up in
+            # the data. Note that we'll define placements for all keys, even if
+            # they aren't actually used.
+
+            # Note that placements are defined for pre-batched data, no batch index!
+
+            shard_grid = cfg.get("domain_parallelism", {}).get("shard_grid", False)
+            shard_points = cfg.get("domain_parallelism", {}).get("shard_points", False)
+
+            if not shard_grid and not shard_points:
+                raise ValueError(
+                    "Either shard_grid or shard_points must be True if domain_size > 1"
+                )
+
+            # Not supported with physics loss:
+            if cfg.train.add_physics_loss:
+                raise ValueError(
+                    "Domain parallelism is not supported with physics loss"
+                )
+
+            if shard_grid:
+                grid_like_placement = [
+                    Shard(0),
+                ]
+            else:
+                grid_like_placement = [
+                    Replicate(),
+                ]
+
+            if shard_points:
+                point_like_placement = [
+                    Shard(0),
+                ]
+            else:
+                point_like_placement = [
+                    Replicate(),
+                ]
+
+            placements = {
+                "stl_coordinates": point_like_placement,
+                "stl_centers": point_like_placement,
+                "stl_faces": point_like_placement,
+                "stl_areas": point_like_placement,
+                "surface_fields": point_like_placement,
+                "volume_mesh_centers": point_like_placement,
+                "volume_fields": point_like_placement,
+                "surface_mesh_centers": point_like_placement,
+                "surface_normals": point_like_placement,
+                "surface_areas": point_like_placement,
+            }
+        else:
+            domain_mesh = None
+            placements = None
 
     return domain_mesh, data_mesh, placements
 
```
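To make the role of the returned `placements` concrete, here is a minimal sketch (not the example's own code) of how such a dict could be applied to a batch. The example's pipeline builds on PhysicsNeMo's `ShardTensor` (see the README above); plain PyTorch `distribute_tensor` is used here purely for illustration:

```python
# Illustrative only: wrap each tensor in a batch as a DTensor using the
# placement configured for its key (Shard(0) or Replicate()).
from typing import Optional

from torch.distributed.tensor import distribute_tensor


def shard_batch(batch: dict, domain_mesh, placements: Optional[dict]) -> dict:
    if domain_mesh is None or placements is None:
        return batch  # single-GPU / non-sharded path
    return {
        key: distribute_tensor(value, device_mesh=domain_mesh, placements=placements[key])
        if key in placements
        else value
        for key, value in batch.items()
    }
```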
Lines changed: 154 additions & 0 deletions (new file)
<!-- markdownlint-disable -->
# Machine Learning Surrogates for Automotive Crash Dynamics

## Problem Overview

Automotive crashworthiness assessment is a critical step in vehicle design.
Traditionally, engineers rely on high-fidelity finite element (FE)
simulations (e.g., LS-DYNA) to predict structural deformation and crash responses.
While accurate, these simulations are computationally expensive and
limit the speed of design iterations.

Machine Learning (ML) surrogates provide a promising alternative by learning
mappings directly from simulation data, enabling:

- **Rapid prediction** of deformation histories across thousands of design candidates.
- **Scalability** to large structural models without rerunning costly FE simulations.
- **Flexibility** in experimenting with different model architectures (GNNs, Transformers).

In this example, we demonstrate a unified pipeline for crash dynamics modeling.
The implementation supports both:

- **Mesh-based Graph Neural Networks (MeshGraphNet)** – leverages connectivity from FE meshes.
- **Point-cloud Transformers (Transolver)** – avoids explicit mesh dependency.

## Prerequisites

This example requires:

- Access to LS-DYNA crash datasets (with `d3plot` and `.k` keyword files).
- A GPU-enabled environment with PyTorch.

Install the dependencies:

```bash
pip install -r requirements.txt
```

This installs:

- `lasso-python` (for LS-DYNA file parsing)
- `torch_geometric` and `torch_scatter` (for GNN operations)
## Dataset Preprocessing

Crash simulation data is parsed from LS-DYNA `d3plot` files using the
`d3plot_reader.py` utility.

Key steps:

- Load node coordinates, displacements, element connectivity, and part IDs.
- Parse `.k` keyword files to assign part thickness values.
- Filter out rigid-wall nodes using displacement thresholds.
- Build edges (for graphs) and store per-node features (e.g., thickness).
- Optionally export time-stepped meshes as `.vtp` files for visualization.

Preprocessing runs automatically via the dataset class (`CrashGraphDataset` or
`CrashPointCloudDataset`) when training or inference is launched; a minimal
parsing sketch is shown below.
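As an illustration of the first step only, here is a hypothetical sketch of reading nodal data with `lasso-python`; the example's actual parser is `d3plot_reader.py`, and the file path below is a placeholder:

```python
# Illustrative only: read nodal kinematics and shell connectivity from an
# LS-DYNA d3plot with lasso-python. The path is a placeholder.
from lasso.dyna import ArrayType, D3plot

d3plot = D3plot("path/to/run_001/d3plot")

# (n_nodes, 3) reference coordinates; (n_timesteps, n_nodes, 3) displacements.
coords = d3plot.arrays[ArrayType.node_coordinates]
displacement = d3plot.arrays[ArrayType.node_displacement]

# Shell connectivity and per-element part IDs: the ingredients for building
# graph edges and for assigning part thickness parsed from the .k file.
shell_nodes = d3plot.arrays[ArrayType.element_shell_node_indexes]
shell_parts = d3plot.arrays[ArrayType.element_shell_part_indexes]

print(coords.shape, displacement.shape, shell_nodes.shape, shell_parts.shape)
```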
## Training

Training is managed via Hydra configurations located in `conf/`.
The main script is `train.py`.

### Config Structure

```bash
conf/
├── config.yaml              # master config (sets datapipe, model, training)
├── datapipe/                # dataset configs
│   ├── graph.yaml
│   └── point_cloud.yaml
├── model/                   # model configs
│   ├── mgn_autoregressive_rollout_training.yaml
│   ├── mgn_one_step_rollout.yaml
│   ├── mgn_time_conditional.yaml
│   ├── transolver_autoregressive_rollout_training.yaml
│   ├── transolver_one_step_rollout.yaml
│   └── transolver_time_conditional.yaml
├── training/default.yaml    # training hyperparameters
└── inference/default.yaml   # inference options
```
### Launch Training

Single GPU:

```bash
python train.py
```

Multi-GPU (Distributed Data Parallel):

```bash
torchrun --standalone --nproc_per_node=<NUM_GPUS> train.py
```
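Model and datapipe variants are selected through the config groups shown above. Assuming standard Hydra override syntax (an assumption; the README does not spell this out), a specific combination could be chosen at launch:

```bash
# Hypothetical Hydra overrides; group and option names follow the conf/ tree above.
python train.py model=transolver_time_conditional datapipe=point_cloud
```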
## Inference

Use `inference.py` to evaluate trained models on test crash runs.

```bash
python inference.py
```

Predicted meshes are written as `.vtp` files under
`./predicted_vtps/` and can be opened with ParaView.
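For scripted inspection instead of ParaView, the `.vtp` outputs can also be loaded programmatically. A hypothetical sketch with `pyvista` (not listed in `requirements.txt`, so it would need to be installed separately; the file name is a placeholder):

```python
# Illustrative only: load one predicted .vtp and list its point-data arrays.
import pyvista as pv

mesh = pv.read("./predicted_vtps/run_001/<timestep>.vtp")  # placeholder name
print(mesh.n_points, list(mesh.point_data.keys()))
```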
## Postprocessing and Evaluation

The `postprocessing/` folder provides scripts for quantitative and qualitative evaluation:

- **Relative $L^2$ error** (`compute_l2_error.py`): computes the per-timestep
  relative position error across runs and produces plots and optional CSVs.

Example:

```bash
python postprocessing/compute_l2_error.py \
    --predicted_parent ./predicted_vtps \
    --exact_parent ./exact_vtps \
    --output_plot rel_error.png \
    --output_csv rel_error.csv
```
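The script's exact normalization is not documented here; a common definition for a per-timestep relative $L^2$ position error of this kind is

$$
\varepsilon_t = \frac{\lVert \mathbf{u}_t^{\mathrm{pred}} - \mathbf{u}_t^{\mathrm{true}} \rVert_2}{\lVert \mathbf{u}_t^{\mathrm{true}} \rVert_2},
$$

where $\mathbf{u}_t$ stacks the nodal positions of a run at timestep $t$.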
- **Probe kinematics, driver vs. passenger toe pan** (`compute_probe_kinematics.py`):
  extracts displacement/velocity/acceleration histories at selected probe nodes
  and generates comparison plots (ground truth vs. predicted).

Example:

```bash
python postprocessing/compute_probe_kinematics.py \
    --pred_dir ./predicted_vtps/run_001 \
    --exact_dir ./exact_vtps/run_001 \
    --driver_points "70658-70659,70664" \
    --passenger_points "70676-70679" \
    --dt 0.005 \
    --output_plot probe_kinematics.png
```
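Velocity and acceleration must be recovered from the displacement history sampled at `--dt`. A minimal sketch of that differentiation step, assuming finite differences (the script's internals may differ):

```python
# Illustrative only: finite-difference kinematics for one probe node.
import numpy as np

dt = 0.005  # matches the --dt flag above
disp = np.cumsum(np.random.rand(100, 3) * 1e-3, axis=0)  # placeholder history

vel = np.gradient(disp, dt, axis=0)  # central differences in time
acc = np.gradient(vel, dt, axis=0)

print(vel.shape, acc.shape)
```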
- **Cross-sectional plots** (`plot_cross_section.py`): plots 2D slices of
  predicted vs. ground truth deformations at specified cross-sections.

Example:

```bash
python postprocessing/plot_cross_section.py \
    --pred_dir ./predicted_vtps/run_001 \
    --exact_dir ./exact_vtps/run_001 \
    --output_file cross_section.png
```

`run_post_processing.sh` can automate all evaluation tasks across runs.
## References

- [Automotive Crash Dynamics Modeling Accelerated with Machine Learning](https://arxiv.org/pdf/2510.15201)
