Fix errors when using multiple devices #84

Open

ntohge wants to merge 1 commit into georghess:main from ntohge:ntohge/fix-multi-devices

Conversation

@ntohge ntohge commented Nov 13, 2025

This change fixes errors that occur when --machine.num-devices=N (with N greater than 1) is passed to train.py to use multiple devices (GPUs).
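For context, --machine.num-devices=N makes train.py spawn one training process per GPU via torch.multiprocessing (visible in the traceback below), and each process must keep its model and inputs on its own local device. A minimal generic sketch of that launch pattern, not the project's actual code (the rendezvous address and the worker body are assumptions):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(local_rank: int, world_size: int) -> None:
    # One process per GPU: rank i owns cuda:i and must keep its tensors there.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",  # assumed rendezvous address for this sketch
        rank=local_rank,
        world_size=world_size,
    )
    torch.cuda.set_device(local_rank)
    # ... build the model on cuda:<local_rank>, wrap it in DistributedDataParallel,
    # and run the training loop here.
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(_worker, args=(world_size,), nprocs=world_size)
```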

Below is an example of the error without this change:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/workspace/nerfstudio/scripts/train.py", line 161, in _distributed_worker
    output = main_func(local_rank, world_size, config, global_rank)
  File "/workspace/nerfstudio/scripts/train.py", line 107, in train_loop
    trainer.train()
  File "/workspace/nerfstudio/engine/trainer.py", line 266, in train
    loss, loss_dict, metrics_dict = self.train_iteration(step)
  File "/workspace/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/workspace/nerfstudio/engine/trainer.py", line 502, in train_iteration
    _, loss_dict, metrics_dict = self.pipeline.get_train_loss_dict(step=step)
  File "/workspace/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/workspace/nerfstudio/pipelines/base_pipeline.py", line 324, in get_train_loss_dict
    model_outputs = self._model(ray_bundle)  # train distributed data parallel model if world_size > 1
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/nerfstudio/models/ad_model.py", line 58, in forward
    outputs = super().forward(ray_bundle)
  File "/workspace/nerfstudio/models/base_model.py", line 143, in forward
    return self.get_outputs(ray_bundle)
  File "/workspace/nerfstudio/models/splatad.py", line 1240, in get_outputs
    return self.get_lidar_outputs(sensor)
  File "/workspace/nerfstudio/models/splatad.py", line 1119, in get_lidar_outputs
    (torch.einsum("bij,bj->bi", optimized_lidar_to_world[..., :3, :3], lidar_linear_vel))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
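The failing einsum combines optimized_lidar_to_world and lidar_linear_vel while they sit on different GPUs (cuda:0 vs. cuda:1). A generic sketch of the device-alignment pattern behind this kind of error (tensor names are taken from the traceback; the wrapping function is illustrative only, not the actual diff):

```python
import torch


def rotate_lidar_velocity(optimized_lidar_to_world: torch.Tensor,
                          lidar_linear_vel: torch.Tensor) -> torch.Tensor:
    # Under DistributedDataParallel each rank runs on its own GPU, so a tensor
    # created on the default/global device (e.g. cuda:0) has to be moved to the
    # rank-local device before it is combined with rank-local tensors.
    lidar_linear_vel = lidar_linear_vel.to(optimized_lidar_to_world.device)
    # Same einsum as the failing line: rotate per-lidar linear velocities into
    # the world frame using the 3x3 rotation block of each pose.
    return torch.einsum(
        "bij,bj->bi", optimized_lidar_to_world[..., :3, :3], lidar_linear_vel
    )
```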

ntohge (author) commented Nov 13, 2025

Although this change fixes the errors when using 2 GPUs, and training succeeds with both GPUs fully utilized according to nvtop, training time does not improve but rather gets worse.

I tried some workarounds, such as increasing gradient_accumulation_steps, but no luck yet.

Does anyone have ideas for improving this, or is it simply difficult to accelerate this project with multiple GPUs?
