Description
Version
1.0.0
On which installation method(s) does this occur?
Pip
Describe the issue
When running DistributedManager within a SLURM script using sbatch, the process fails at the DistributedManager.setup(...) call (line 353). This is due to the absence of the SLURM_LAUNCH_NODE_IPADDR environment variable when using sbatch.
This variable is available when running with srun, but not with sbatch. As a result, any script using DistributedManager in a SLURM job submitted via sbatch encounters a KeyError.
A workaround is to retrieve SLURM_LAUNCH_NODE_IPADDR using srun and export it manually in the SLURM script before running any Python code that invokes DistributedManager, e.g.:
export SLURM_LAUNCH_NODE_IPADDR=$(srun printenv | awk -F= '/^SLURM_LAUNCH_NODE_IPADDR/{print $2}')
It's probably worth adding a check for the existence of the SLURM_LAUNCH_NODE_IPADDR variable before calling DistributedManager.setup(...) so that it can fail gracefully, or potentially running a subprocess command to retrieve it.
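A rough sketch of what such a check could look like is below. The helper name resolve_launch_node_addr and its env/run parameters are hypothetical and not part of the library. The sketch assumes that srun is on PATH inside the batch job and that a single-task srun step exposes the variable, as in the workaround above.

```python
import os
import subprocess

VAR = "SLURM_LAUNCH_NODE_IPADDR"


def resolve_launch_node_addr(env=None, run=subprocess.run):
    """Return the launch node address, asking srun for it if it is unset.

    Hypothetical helper for illustration only. `env` and `run` are
    injectable so the logic can be exercised outside a SLURM job.
    """
    env = os.environ if env is None else env

    # Under srun the variable is already present, so use it directly.
    addr = env.get(VAR)
    if addr:
        return addr

    # Under sbatch it is missing: launch a one-task srun step and read
    # the value from that step's environment (assumption: srun is available).
    try:
        result = run(
            ["srun", "--ntasks=1", "printenv", VAR],
            capture_output=True,
            text=True,
            check=True,
            timeout=60,
        )
    except (OSError, subprocess.SubprocessError) as exc:
        raise RuntimeError(
            f"{VAR} is not set and could not be retrieved via srun"
        ) from exc

    values = result.stdout.split()
    if not values:
        raise RuntimeError(f"srun returned no value for {VAR}")
    return values[0]
```

One possible use would be to set os.environ["SLURM_LAUNCH_NODE_IPADDR"] = resolve_launch_node_addr() before any code that invokes DistributedManager, so that a missing variable produces a descriptive RuntimeError instead of a bare KeyError.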