🐛[BUG]: Error with Domain Parallelism (DoMINO) #1223

@kk98kk

Description

Version

DoMINO 25.08 (physicsnemo v1.3.0a0)

On which installation method(s) does this occur?

No response

Describe the issue

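I encounter the following error when attempting to use Domain Parallelism (domain_size: 2, with shard_grid and shard_points enabled) with the DoMINO architecture. Training fails in the first forward pass at epoch 0 on both ranks with an AssertionError: aten.view.default is called with a ShardTensor input but returns a DTensor.
My setup uses Python 3.11.14 and torch 2.9.0 (CUDA 12.3), running in a conda environment.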

I would really appreciate any suggestions or insights on how to resolve this issue.

[2025-11-12 12:23:29,074][ComputeStatistics][INFO] - Scaling factors loaded from: /lustre/calc2/domino_prj/DrivAerML/outputs/volume/volume_test/scaling_factors/scaling_factors.pkl
[2025-11-12 12:23:55,238][Train][INFO] - Config summary:
data:
  bounding_box:
    max:
    - 8.5
    - 2.25
    - 3.0
    min:
    - -3.5
    - -2.25
    - -0.32
  bounding_box_surface:
    max:
    - 5.0
    - 1.4
    - 1.4
    min:
    - -1.5
    - -1.4
    - -0.32
  gpu_output: true
  gpu_preprocessing: true
  input_dir: /lustre/calc2/domino_prj/DrivAerML/data/train
  input_dir_val: /lustre/calc2/domino_prj/DrivAerML/data/val
  max_samples_for_statistics: 200
  normalize_coordinates: true
  sample_in_bbox: true
  sampling: true
  scaling_factors: ${project_dir}/scaling_factors/scaling_factors.pkl
  volume_sample_from_disk: false
data_processor:
  cached_dir: /user/cached/drivaer_aws/drivaer_data_fuldl/
  input_dir: /ata/drivaer_aws/drivaer_data_full/
  kind: drivaer_aws
  num_processors: 12
  output_dir: /user/aws_data_all/
  use_cache: false
domain_parallelism:
  domain_size: 2
  shard_grid: true
  shard_points: true
eval:
  checkpoint_name: DoMINO.0.480.pt
  num_points: 1240000
  refine_stl: false
  save_path: ${project_dir}/preds
  scaling_param_path: ${project_dir}/scaling_factors
  test_path: /lustre/calc2/domino_prj/DrivAerML/data/test
exp_tag: files
model:
  activation: gelu
  aggregation_model:
    activation: ${model.activation}
    base_layer: 512
  combine_volume_surface: false
  encode_parameters: false
  geom_points_sample: 300000
  geometry_encoding_type: both
  geometry_local:
    base_layer: 512
    surface_neighbors_in_radius:
    - 32
    - 128
    surface_radii:
    - 0.05
    - 0.25
    volume_neighbors_in_radius:
    - 64
    - 128
    volume_radii:
    - 0.1
    - 0.25
  geometry_rep:
    geo_conv:
      activation: ${model.activation}
      base_neurons: 32
      base_neurons_in: 1
      base_neurons_out: 1
      fourier_features: false
      num_modes: 5
      surface_hops: 1
      surface_neighbors_in_radius:
      - 8
      - 16
      - 128
      surface_radii:
      - 0.01
      - 0.05
      - 1.0
      volume_hops: 1
      volume_neighbors_in_radius:
      - 32
      - 64
      - 128
      - 256
      volume_radii:
      - 0.1
      - 0.5
      - 1.0
      - 2.5
    geo_processor:
      activation: ${model.activation}
      base_filters: 8
      cross_attention: false
      processor_type: conv
      self_attention: false
      surface_sdf_scaling_factor:
      - 0.01
      - 0.02
      - 0.04
      volume_sdf_scaling_factor:
      - 0.04
  integral_loss_scaling_factor: 100
  interp_res:
  - 128
  - 64
  - 64
  local_point_conv:
    activation: ${model.activation}
  loss_function:
    area_weighing_factor: 10000
    loss_type: mse
  model_type: volume
  nn_basis_functions:
    activation: ${model.activation}
    base_layer: 512
    fourier_features: true
    num_modes: 5
  normalization: min_max_scaling
  num_neighbors_surface: 7
  num_neighbors_volume: 10
  parameter_model:
    activation: ${model.activation}
    base_layer: 512
    fourier_features: false
    num_modes: 5
  position_encoder:
    activation: ${model.activation}
    base_neurons: 512
    fourier_features: true
    num_modes: 5
  return_volume_neighbors: false
  solution_calculation_mode: two-loop
  surf_loss_scaling: 5.0
  surface_points_sample: 8192
  surface_sampling_algorithm: area_weighted
  use_sdf_in_basis_func: true
  use_surface_area: true
  use_surface_normals: true
  vol_loss_scaling: 1.0
  volume_points_sample: 8192
output: /lustre/calc2/domino_prj/DrivAerML/outputs/volume/${project.name}/${exp_tag}
project:
  name: volume_test
project_dir: /lustre/calc2/domino_prj/DrivAerML/outputs/volume/${project.name}/
resume_dir: ${output}/models
train:
  add_physics_loss: false
  amp:
    autocast:
      dtype: torch.float16
    clip_grad: true
    enabled: true
    grad_max_norm: 2.0
    scaler:
      _target_: torch.cuda.amp.GradScaler
      enabled: ${..enabled}
  checkpoint_dir: /user/models/
  checkpoint_interval: 1
  dataloader:
    batch_size: 1
    pin_memory: true
    preload_depth: 1
  epochs: 500
  lr_scheduler:
    T_max: ${train.epochs}
    eta_min: 1.0e-06
    gamma: 0.5
    milestones:
    - 50
    - 200
    - 400
    - 500
    - 600
    - 700
    - 800
    - 900
    name: MultiStepLR
  optimizer:
    lr: 0.001
    name: Adam
    weight_decay: 0.0
  sampler:
    drop_last: false
    shuffle: true
val:
  dataloader:
    batch_size: 1
    pin_memory: true
    preload_depth: 1
  sampler:
    drop_last: false
    shuffle: true
variables:
  global_parameters:
    air_density:
      reference: 1.0
      type: scalar
    inlet_velocity:
      reference:
      - 38.889
      type: vector
  surface:
    solution:
      pMeanTrim: scalar
      wallShearStressMeanTrim: vector
  volume:
    solution:
      UMeanTrim: vector
      nutMeanTrim: scalar
      pMeanTrim: scalar

[2025-11-12 12:23:55,365][Train][INFO] - Model summary:
======================================================================
Layer (type:depth-idx)                        Param #
======================================================================
DoMINO                                        --
├─GeometryRep: 1-1                            --
│    └─GELU: 2-1                              --
│    └─ModuleList: 2-2                        --
│    └─ModuleList: 2-3                        266,032
│    └─ModuleList: 2-4                        48,388
│    └─ModuleList: 2-5                        112
│    └─Sequential: 2-6                        71,783
│    └─Conv3d: 2-7                            28
├─GeometryRep: 1-2                            --
│    └─GELU: 2-8                              --
│    └─ModuleList: 2-9                        --
│    └─ModuleList: 2-10                       199,524
│    └─ModuleList: 2-11                       16,323
│    └─ModuleList: 2-12                       84
│    └─Sequential: 2-13                       74,649
│    └─Conv3d: 2-14                           28
├─ModuleList: 1-3                             --
│    └─FourierMLP: 2-15                       542,720
│    └─FourierMLP: 2-16                       542,720
│    └─FourierMLP: 2-17                       542,720
│    └─FourierMLP: 2-18                       542,720
│    └─FourierMLP: 2-19                       542,720
├─GELU: 1-4                                   --
├─FourierMLP: 1-5                             --
│    └─Mlp: 2-20                              570,880
├─MultiGeometryEncoding: 1-6                  --
│    └─ModuleList: 2-21                       410,784
├─MultiGeometryEncoding: 1-7                  --
│    └─ModuleList: 2-22                       591,040
├─ModuleList: 1-8                             --
│    └─AggregationModel: 2-23                 1,411,585
│    └─AggregationModel: 2-24                 1,411,585
│    └─AggregationModel: 2-25                 1,411,585
│    └─AggregationModel: 2-26                 1,411,585
│    └─AggregationModel: 2-27                 1,411,585
├─SolutionCalculatorVolume: 1-9               9,771,525
│    └─ModuleList: 2-28                       (recursive)
│    └─ModuleList: 2-29                       (recursive)
======================================================================
Total params: 21,792,705
Trainable params: 21,792,705
Non-trainable params: 0
======================================================================

[2025-11-12 12:23:55,728][checkpoint][ERROR] - Could not find valid model file /lustre/calc2/domino_prj/DrivAerML/outputs/volume/volume_test/files/models/FSDPDoMINO.0.0.pt, skipping load
[2025-11-12 12:23:55,728][checkpoint][ERROR] - Could not find valid model file /lustre/calc2/domino_prj/DrivAerML/outputs/volume/volume_test/files/models/FSDPDoMINO.0.0.pt, skipping load
[2025-11-12 12:23:55,729][checkpoint][WARNING] - Could not find valid checkpoint file, skipping load
[2025-11-12 12:23:55,729][checkpoint][WARNING] - Could not find valid checkpoint file, skipping load
[2025-11-12 12:23:55,729][Train][INFO] - Device cuda:0, epoch 0:
Error executing job with overrides: []
Error executing job with overrides: []
[rank0]: Traceback (most recent call last):
[rank0]:   File "/lustre/calc2/domino_prj/domino_copy/src/train.py", line 714, in <module>
[rank0]:     main()
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/hydra/main.py", line 94, in decorated_main
[rank0]:     _run_hydra(
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[rank0]:     _run_app(
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 457, in _run_app
[rank0]:     run_and_report(
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
[rank0]:     raise ex
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
[rank0]:     return func()
[rank0]:            ^^^^^^
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
[rank0]:     lambda: hydra.run(
[rank0]:             ^^^^^^^^^^
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/hydra/_internal/hydra.py", line 132, in run
[rank0]:     _ = ret.return_value
[rank0]:         ^^^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/hydra/core/utils.py", line 260, in return_value
[rank0]:     raise self._return_value
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/hydra/core/utils.py", line 186, in run_job
[rank0]:     ret.return_value = task_function(task_cfg)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/domino_prj/domino_copy/src/train.py", line 618, in main
[rank0]:     avg_loss = train_epoch(
[rank0]:                ^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/domino_prj/domino_copy/src/train.py", line 223, in train_epoch
[rank0]:     prediction_vol, prediction_surf = model(sampled_batched)
[rank0]:                                       ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1881, in _call_impl
[rank0]:     return inner()
[rank0]:            ^^^^^^^
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1829, in inner
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/gitclone/physicsnemo_28_10/physicsnemo/models/domino/model.py", line 508, in forward
[rank0]:     encoding_g_vol = self.geo_rep_volume(geo_centers_vol, p_grid, sdf_grid)
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/gitclone/physicsnemo_28_10/physicsnemo/models/domino/geometry_rep.py", line 456, in forward
[rank0]:     mapping, k_short = self.bq_warp[j](x, p_grid)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/gitclone/physicsnemo_28_10/physicsnemo/models/layers/ball_query.py", line 95, in forward
[rank0]:     p_grid = rearrange(p_grid, "b nx ny nz c -> b (nx ny nz) c")
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/einops/einops.py", line 600, in rearrange
[rank0]:     return reduce(tensor, pattern, reduction="rearrange", **axes_lengths)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/einops/einops.py", line 532, in reduce
[rank0]:     return _apply_recipe(
[rank0]:            ^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/einops/einops.py", line 251, in _apply_recipe
[rank0]:     tensor = backend.reshape(tensor, final_shapes)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/einops/_backends.py", line 93, in reshape
[rank0]:     return x.reshape(shape)
[rank0]:            ^^^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/gitclone/physicsnemo_28_10/physicsnemo/distributed/shard_tensor.py", line 403, in __torch_function__
[rank0]:     return super().__torch_function__(func, types, args, kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/gitclone/physicsnemo_28_10/physicsnemo/distributed/shard_tensor.py", line 433, in __torch_dispatch__
[rank0]:     dispatch_res = DTensor._op_dispatcher.dispatch(func, args, kwargs or {})
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/distributed/tensor/_dispatch.py", line 329, in dispatch
[rank0]:     return return_and_correct_aliasing(op_call, args, kwargs, ret)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/utils/_python_dispatch.py", line 729, in return_and_correct_aliasing
[rank0]:     _correct_storage_aliasing(
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/utils/_python_dispatch.py", line 579, in _correct_storage_aliasing
[rank0]:     alias_non_inplace_storage(args[arg_idx], outs[return_idx])
[rank0]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/utils/_python_dispatch.py", line 551, in alias_non_inplace_storage
[rank0]:     assert type(arg) == type(
[rank0]:            ^^^^^^^^^^^^^^^^^^
[rank0]: AssertionError: Called aten.view.default with input of type <class 'physicsnemo.distributed.shard_tensor.ShardTensor'>
[rank0]: and output of type <class 'torch.distributed.tensor.DTensor'>. But expected types to match.
[rank1]: Traceback (most recent call last):
[rank1]:   File "/lustre/calc2/domino_prj/domino_copy/src/train.py", line 714, in <module>
[rank1]:     main()
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/hydra/main.py", line 94, in decorated_main
[rank1]:     _run_hydra(
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[rank1]:     _run_app(
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 457, in _run_app
[rank1]:     run_and_report(
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
[rank1]:     raise ex
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
[rank1]:     return func()
[rank1]:            ^^^^^^
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
[rank1]:     lambda: hydra.run(
[rank1]:             ^^^^^^^^^^
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/hydra/_internal/hydra.py", line 132, in run
[rank1]:     _ = ret.return_value
[rank1]:         ^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/hydra/core/utils.py", line 260, in return_value
[rank1]:     raise self._return_value
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/hydra/core/utils.py", line 186, in run_job
[rank1]:     ret.return_value = task_function(task_cfg)
[rank1]:                        ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/domino_prj/domino_copy/src/train.py", line 618, in main
[rank1]:     avg_loss = train_epoch(
[rank1]:                ^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/domino_prj/domino_copy/src/train.py", line 223, in train_epoch
[rank1]:     prediction_vol, prediction_surf = model(sampled_batched)
[rank1]:                                       ^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1881, in _call_impl
[rank1]:     return inner()
[rank1]:            ^^^^^^^
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1829, in inner
[rank1]:     result = forward_call(*args, **kwargs)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/gitclone/physicsnemo_28_10/physicsnemo/models/domino/model.py", line 508, in forward
[rank1]:     encoding_g_vol = self.geo_rep_volume(geo_centers_vol, p_grid, sdf_grid)
[rank1]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/gitclone/physicsnemo_28_10/physicsnemo/models/domino/geometry_rep.py", line 456, in forward
[rank1]:     mapping, k_short = self.bq_warp[j](x, p_grid)
[rank1]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/gitclone/physicsnemo_28_10/physicsnemo/models/layers/ball_query.py", line 95, in forward
[rank1]:     p_grid = rearrange(p_grid, "b nx ny nz c -> b (nx ny nz) c")
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/einops/einops.py", line 600, in rearrange
[rank1]:     return reduce(tensor, pattern, reduction="rearrange", **axes_lengths)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/einops/einops.py", line 532, in reduce
[rank1]:     return _apply_recipe(
[rank1]:            ^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/einops/einops.py", line 251, in _apply_recipe
[rank1]:     tensor = backend.reshape(tensor, final_shapes)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/einops/_backends.py", line 93, in reshape
[rank1]:     return x.reshape(shape)
[rank1]:            ^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/gitclone/physicsnemo_28_10/physicsnemo/distributed/shard_tensor.py", line 403, in __torch_function__
[rank1]:     return super().__torch_function__(func, types, args, kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/gitclone/physicsnemo_28_10/physicsnemo/distributed/shard_tensor.py", line 433, in __torch_dispatch__
[rank1]:     dispatch_res = DTensor._op_dispatcher.dispatch(func, args, kwargs or {})
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/distributed/tensor/_dispatch.py", line 329, in dispatch
[rank1]:     return return_and_correct_aliasing(op_call, args, kwargs, ret)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/utils/_python_dispatch.py", line 729, in return_and_correct_aliasing
[rank1]:     _correct_storage_aliasing(
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/utils/_python_dispatch.py", line 579, in _correct_storage_aliasing
[rank1]:     alias_non_inplace_storage(args[arg_idx], outs[return_idx])
[rank1]:   File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/utils/_python_dispatch.py", line 551, in alias_non_inplace_storage
[rank1]:     assert type(arg) == type(
[rank1]:            ^^^^^^^^^^^^^^^^^^
[rank1]: AssertionError: Called aten.view.default with input of type <class 'physicsnemo.distributed.shard_tensor.ShardTensor'>
[rank1]: and output of type <class 'torch.distributed.tensor.DTensor'>. But expected types to match.
W1112 12:24:07.112000 23548 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 23581 closing signal SIGTERM
E1112 12:24:07.476000 23548 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 23580) of binary: /lustre/calc2/miniforge3/envs/domino2/bin/python3.11
Traceback (most recent call last):
  File "/lustre/calc2/miniforge3/envs/domino2/bin/torchrun", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/distributed/run.py", line 936, in main
    run(args)
  File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/distributed/run.py", line 927, in run
    elastic_launch(
  File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/calc2/miniforge3/envs/domino2/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/lustre/calc2/domino_prj/domino_copy/src/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-11-12_12:24:07
  host      : ##############
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 23580)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
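
For anyone triaging the log above, the two interleaved per-rank tracebacks can be condensed to the final exception message of each rank. The following is a minimal sketch, not part of PhysicsNeMo or torchrun: the helper name last_error_per_rank is hypothetical, and the sample text is an abridged copy of the log lines above (file paths shortened). It assumes worker lines carry the "[rankN]:" prefix seen in this log, and that traceback frames are indented while the exception message is not.

```python
import re

# Matches worker lines such as "[rank0]: AssertionError: ...".
# One optional space after the colon is dropped so frame lines keep their indentation.
RANK_LINE = re.compile(r"^\[rank(\d+)\]:[ ]?(.*)$")


def last_error_per_rank(log_text):
    """Return {rank: final exception message} for a multi-rank log (hypothetical helper)."""
    messages = {}
    for line in log_text.splitlines():
        match = RANK_LINE.match(line)
        if match is None:
            continue  # not a per-rank line (launcher output, config dump, ...)
        rank, body = int(match.group(1)), match.group(2)
        if body.startswith("Traceback"):
            # A new traceback starts for this rank: discard any earlier message.
            messages[rank] = []
        elif body and not body[0].isspace():
            # Unindented lines after the frames form the exception message,
            # including continuation lines such as "and output of type ...".
            messages.setdefault(rank, []).append(body)
    return {rank: " ".join(parts) for rank, parts in messages.items()}


# Abridged sample taken from the log in this issue (paths shortened).
sample = """\
[rank0]: Traceback (most recent call last):
[rank0]:   File "train.py", line 714, in <module>
[rank0]:     main()
[rank0]: AssertionError: Called aten.view.default with input of type <class 'physicsnemo.distributed.shard_tensor.ShardTensor'>
[rank0]: and output of type <class 'torch.distributed.tensor.DTensor'>. But expected types to match.
[rank1]: Traceback (most recent call last):
[rank1]:   File "train.py", line 714, in <module>
[rank1]:     main()
[rank1]: AssertionError: Called aten.view.default with input of type <class 'physicsnemo.distributed.shard_tensor.ShardTensor'>
[rank1]: and output of type <class 'torch.distributed.tensor.DTensor'>. But expected types to match.
"""

summary = last_error_per_rank(sample)
for rank in sorted(summary):
    print(f"rank {rank}: {summary[rank]}")
```

Run against the full log, this would show whether every rank stops on the same assertion, which is the case for the abridged sample here.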

Minimum reproducible example

Relevant log output

Environment details

Labels

? - Needs Triage: Need team to review and classify
bug: Something isn't working
