Fix errors when using multiple devices #84

Open

ntohge wants to merge 1 commit into georghess:main from ntohge:ntohge/fix-multi-devices

Conversation

@ntohge ntohge commented Nov 13, 2025

This change fixes errors that occur when --machine.num-devices=N (with N greater than 1) is passed to train.py to use multiple devices (GPUs).
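For context, --machine.num-devices=N makes train.py spawn one training process per GPU via torch.multiprocessing (visible in the traceback below), and each process must keep its model and inputs on its own local device. A minimal generic sketch of that launch pattern, not the project's actual code (the rendezvous address and the worker body are assumptions):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(local_rank: int, world_size: int) -> None:
    # One process per GPU: rank i owns cuda:i and must keep its tensors there.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",  # assumed rendezvous address for this sketch
        rank=local_rank,
        world_size=world_size,
    )
    torch.cuda.set_device(local_rank)
    # ... build the model on cuda:<local_rank>, wrap it in DistributedDataParallel,
    # and run the training loop here.
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(_worker, args=(world_size,), nprocs=world_size)
```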

Below is an example of the error without this change:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/workspace/nerfstudio/scripts/train.py", line 161, in _distributed_worker
    output = main_func(local_rank, world_size, config, global_rank)
  File "/workspace/nerfstudio/scripts/train.py", line 107, in train_loop
    trainer.train()
  File "/workspace/nerfstudio/engine/trainer.py", line 266, in train
    loss, loss_dict, metrics_dict = self.train_iteration(step)
  File "/workspace/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/workspace/nerfstudio/engine/trainer.py", line 502, in train_iteration
    _, loss_dict, metrics_dict = self.pipeline.get_train_loss_dict(step=step)
  File "/workspace/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/workspace/nerfstudio/pipelines/base_pipeline.py", line 324, in get_train_loss_dict
    model_outputs = self._model(ray_bundle)  # train distributed data parallel model if world_size > 1
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/nerfstudio/models/ad_model.py", line 58, in forward
    outputs = super().forward(ray_bundle)
  File "/workspace/nerfstudio/models/base_model.py", line 143, in forward
    return self.get_outputs(ray_bundle)
  File "/workspace/nerfstudio/models/splatad.py", line 1240, in get_outputs
    return self.get_lidar_outputs(sensor)
  File "/workspace/nerfstudio/models/splatad.py", line 1119, in get_lidar_outputs
    (torch.einsum("bij,bj->bi", optimized_lidar_to_world[..., :3, :3], lidar_linear_vel))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
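The failing einsum combines optimized_lidar_to_world and lidar_linear_vel while they sit on different GPUs (cuda:0 vs. cuda:1). A generic sketch of the device-alignment pattern behind this kind of error (tensor names are taken from the traceback; the wrapping function is illustrative only, not the actual diff):

```python
import torch


def rotate_lidar_velocity(optimized_lidar_to_world: torch.Tensor,
                          lidar_linear_vel: torch.Tensor) -> torch.Tensor:
    # Under DistributedDataParallel each rank runs on its own GPU, so a tensor
    # created on the default/global device (e.g. cuda:0) has to be moved to the
    # rank-local device before it is combined with rank-local tensors.
    lidar_linear_vel = lidar_linear_vel.to(optimized_lidar_to_world.device)
    # Same einsum as the failing line: rotate per-lidar linear velocities into
    # the world frame using the 3x3 rotation block of each pose.
    return torch.einsum(
        "bij,bj->bi", optimized_lidar_to_world[..., :3, :3], lidar_linear_vel
    )
```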

ntohge (author) commented Nov 13, 2025

Although this change fixes the errors when using 2 GPUs, and training succeeds with both GPUs fully utilized according to nvtop, training time does not improve but rather gets worse.

I tried some workarounds, such as increasing gradient_accumulation_steps, but no luck yet.

Does anyone have ideas for improving this, or is it simply difficult to accelerate this project with multiple GPUs?
