Creating an error when a SLURM variable isn't found, usually because … #869

Closed
9 changes: 8 additions & 1 deletion physicsnemo/distributed/manager.py
@@ -356,7 +356,14 @@ def initialize_slurm(port):
rank = int(os.environ.get("SLURM_PROCID"))
world_size = int(os.environ.get("SLURM_NPROCS"))
local_rank = int(os.environ.get("SLURM_LOCALID"))
addr = os.environ.get("SLURM_LAUNCH_NODE_IPADDR")
try:
addr = os.environ.get("SLURM_LAUNCH_NODE_IPADDR")
except TypeError:
Comment on lines +360 to +361
Collaborator

@coreyjadams coreyjadams May 12, 2025

Sorry @Tristanjmeyers - I looked at this more closely and we need to adjust this. os.environ.get returns None for a missing variable and raises no error, so the except TypeError branch never runs. I think you can either:

  • Remove the try/except TypeError and check the result instead: if addr is None: raise EnvErr...; or
  • Change os.environ.get('SLURM_LAUNCH_NODE_IPADDR') to os.environ['SLURM_LAUNCH_NODE_IPADDR'], and TypeError to KeyError.
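
For illustration, here is a minimal sketch of both options. The helper names get_addr_checked and get_addr_indexed and the env parameter are hypothetical; they exist only so the lookup can be tried outside a SLURM job and are not part of physicsnemo. The error message is copied from the diff.

```python
import os

_MSG = (
    'SLURM variable "SLURM_LAUNCH_NODE_IPADDR" was not detected in the '
    'environment. Maybe you need to run with "srun"?'
)


def get_addr_checked(env=os.environ):
    """Option 1 (hypothetical helper): keep .get() and test for None."""
    addr = env.get("SLURM_LAUNCH_NODE_IPADDR")
    if addr is None:
        # .get() never raises for a missing key, so check explicitly.
        raise EnvironmentError(_MSG)
    return addr


def get_addr_indexed(env=os.environ):
    """Option 2 (hypothetical helper): index the mapping, catch KeyError."""
    try:
        return env["SLURM_LAUNCH_NODE_IPADDR"]
    except KeyError as exc:
        # Indexing a missing key raises KeyError, not TypeError.
        raise EnvironmentError(_MSG) from exc
```

Either form raises the same EnvironmentError (an alias of OSError in Python 3) when the variable is absent. The second also keeps the original KeyError attached as the exception's cause.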

raise EnvironmentError(
'SLURM variable "SLURM_LAUNCH_NODE_IPADDR" was not detected in the environment. Maybe you need to run with "srun"?'
)



DistributedManager.setup(
rank=rank,