Description
Version
1.0.0
On which installation method(s) does this occur?
Pip
Describe the issue
When DistributedManager is used in a SLURM script submitted with sbatch, the process fails at the DistributedManager.setup(...) call (line 353). This happens because the SLURM_LAUNCH_NODE_IPADDR environment variable is not set when using sbatch.
This variable is available when running with srun, but not with sbatch. As a result, any script using DistributedManager in a SLURM job submitted via sbatch encounters a KeyError.
A workaround is to retrieve the SLURM_LAUNCH_NODE_IPADDR using srun and export it manually in a SLURM script prior to running any python code that invokes DistributedManager, e.g.:
export SLURM_LAUNCH_NODE_IPADDR=$(srun printenv | awk -F= '/^SLURM_LAUNCH_NODE_IPADDR/{print $2}')
It's probably worth adding a check for the existence of the SLURM_LAUNCH_NODE_IPADDR variable before calling DistributedManager.setup(...) so that it can fail gracefully, or potentially adding a subprocess command to retrieve it.
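A minimal sketch of what such a check could look like, written as a standalone helper and not as the library's actual code. The function name `resolve_launch_node_ip` and its `env` and `run` parameters are hypothetical, introduced only for illustration. It mirrors the srun workaround above and assumes that srun is on the PATH inside the job allocation.

```python
import os
import subprocess


def resolve_launch_node_ip(env=None, run=subprocess.run):
    """Return SLURM_LAUNCH_NODE_IPADDR, asking srun for it if it is unset.

    Hypothetical helper: `env` and `run` are injectable only so the logic
    can be exercised outside a SLURM allocation.
    """
    env = os.environ if env is None else env

    # Under srun the variable is already present, so use it directly.
    addr = env.get("SLURM_LAUNCH_NODE_IPADDR")
    if addr:
        return addr

    # Under sbatch the variable is missing; launch a single srun task
    # that prints it (same idea as the shell workaround).
    try:
        result = run(
            ["srun", "--ntasks=1", "printenv", "SLURM_LAUNCH_NODE_IPADDR"],
            capture_output=True,
            text=True,
            check=True,
            timeout=60,
        )
    except (OSError, subprocess.SubprocessError) as exc:
        # Fail with a readable message instead of a bare KeyError.
        raise RuntimeError(
            "SLURM_LAUNCH_NODE_IPADDR is not set and could not be "
            "retrieved via srun; export it manually in the job script."
        ) from exc

    lines = result.stdout.strip().splitlines()
    if not lines:
        raise RuntimeError("srun returned no value for SLURM_LAUNCH_NODE_IPADDR.")
    return lines[0].strip()
```

A script could then run `os.environ["SLURM_LAUNCH_NODE_IPADDR"] = resolve_launch_node_ip()` before any code that invokes DistributedManager, which would make the manual export in the job script unnecessary.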