🐛[BUG]: physicsnemo.distributed.DistributedManager expects srun rather than sbatch #819

@Tristanjmeyers

Version

1.0.0

On which installation method(s) does this occur?

Pip

Describe the issue

When DistributedManager is used in a SLURM script submitted with sbatch, the process fails at the DistributedManager.setup(...) call (line 353) because the SLURM_LAUNCH_NODE_IPADDR environment variable is not set.

This variable is available when running with srun, but not with sbatch. As a result, any script using DistributedManager in a SLURM job submitted via sbatch encounters a KeyError.

A workaround is to retrieve SLURM_LAUNCH_NODE_IPADDR using srun and export it manually in the SLURM script before running any Python code that invokes DistributedManager, e.g.:

export SLURM_LAUNCH_NODE_IPADDR=$(srun printenv | awk -F= '/^SLURM_LAUNCH_NODE_IPADDR/{print $2}')

It would probably be worth checking that SLURM_LAUNCH_NODE_IPADDR exists before calling DistributedManager.setup(...) so that it can fail gracefully, or potentially adding a subprocess command to retrieve it.
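The check and the subprocess fallback suggested above could look roughly like the sketch below. It is not the library's implementation: the helper names `query_srun` and `ensure_launch_node_ipaddr` are hypothetical, and it assumes that `srun` is on the PATH inside the allocation and that a single `srun` task sees the same SLURM_LAUNCH_NODE_IPADDR value that DistributedManager expects.

```python
import os
import subprocess

VAR = "SLURM_LAUNCH_NODE_IPADDR"


def query_srun(timeout=60):
    """Ask one srun task to print the variable (hypothetical helper).

    Assumes srun is available and that this runs inside a SLURM allocation.
    """
    result = subprocess.run(
        ["srun", "--ntasks=1", "printenv", VAR],
        capture_output=True,
        text=True,
        check=True,  # printenv exits non-zero if the variable is unset
        timeout=timeout,
    )
    return result.stdout.strip()


def ensure_launch_node_ipaddr(env=None, query=query_srun):
    """Return the launch node address, setting it in `env` if it was missing.

    Raises RuntimeError with a readable message instead of a bare KeyError.
    """
    env = os.environ if env is None else env
    value = env.get(VAR)
    if value:
        return value  # already set, e.g. when launched through srun
    try:
        value = query()
    except (OSError, subprocess.SubprocessError) as exc:
        raise RuntimeError(
            f"{VAR} is not set and could not be retrieved via srun; "
            "launch with srun or export the variable in the sbatch script."
        ) from exc
    if not value:
        raise RuntimeError(f"{VAR} is not set and srun returned an empty value.")
    env[VAR] = value  # make it visible to later environment lookups
    return value
```

Calling `ensure_launch_node_ipaddr()` before any code that invokes DistributedManager would either populate the variable or stop with an explicit error message. The `query` parameter is injectable only so the fallback can be exercised without a SLURM cluster.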

Minimum reproducible example

Relevant log output

Environment details

Metadata

Labels

bug (Something isn't working), external (Issues/PR filed by people outside the team)
