Conversation

@JiangJiaWei1103 (Contributor) commented May 4, 2025

Tracking issue

#5088

Changes

This PR brings in support for a new cloud/infra: Slurm.

Slurm is unlike most other clouds we have added, where we typically get our own VMs with isolated network, disk, etc. As a result, the implementation is more complex than adding a generic cloud.

Core abstractions

SlurmCommandRunner

...

SlurmCodeGen

...

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@JiangJiaWei1103 changed the title from "Add slurm cluster" to "[slurm] Slurm support" on May 4, 2025
Comment on lines 257 to 261
ssh_client = SSHClient()
ssh_client.set_missing_host_key_policy(AutoAddPolicy())
ssh_client.connect(ssh_config_dict['hostname'],
port=ssh_config_dict['port'],
username=ssh_config_dict['user'],

Collaborator

Should we use our own command_runner.py instead?

Contributor Author

Great suggestion! I’ve rewritten a simplified version using SSHCommandRunner. Here’s the output from the latest changes:

[Screenshot: 2025-05-05 at 9:50:03 PM]

The failed case is shown as follows:

[Screenshot: 2025-05-05 at 9:51:55 PM]
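
For reference, a minimal sketch of the SSHCommandRunner-based approach, assuming the runner lives in sky/utils/command_runner.py and its constructor takes a (hostname, port) node tuple, the SSH user, and a private key path; the exact signature and the 'private_key' key in ssh_config_dict are assumptions, not code from this PR:

# Illustrative sketch only: swap the raw paramiko SSHClient for SkyPilot's
# own SSH runner. Constructor arguments are assumptions; adjust to the
# actual SSHCommandRunner signature.
from sky.utils import command_runner

runner = command_runner.SSHCommandRunner(
    node=(ssh_config_dict['hostname'], ssh_config_dict['port']),
    ssh_user=ssh_config_dict['user'],
    ssh_private_key=ssh_config_dict['private_key'],  # assumed key name
)
# With require_outputs=True, run() returns (returncode, stdout, stderr).
rc, stdout, stderr = runner.run('sinfo --noheader', require_outputs=True)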

@Michaelvll (Collaborator) left a comment

Sorry for the delay @JiangJiaWei1103! Thanks for updating the PR. I just left a few comments.

Comment on lines 3788 to 3798
is_slurm = str(valid_resource.cloud).lower() == 'slurm'
if is_slurm:
    cmd = task_copy.run
    if isinstance(cmd, list):
        cmd = ' '.join(cmd)
    runner = handle.get_command_runners()[0]

    provisioned_job_id = handle.cached_cluster_info.head_instance_id
    rc, stdout, stderr = runner.run(f'srun --jobid={provisioned_job_id} {cmd}', require_outputs=True)

    return job_id

Collaborator

To make it cleaner, do you think we can subclass the command runner as SlurmCommandRunner and put this in its run(), so that we don't have to construct the command every time we run something on the cluster?

@Michaelvll (Collaborator) commented Sep 8, 2025

How are we going to handle multi-node jobs here? Should we manually call srun to distribute the commands to multiple nodes?

Just saw that we disabled the multi-node support for now, that sounds good to me!

Collaborator

> To make it cleaner, do you think we can subclass the command runner as SlurmCommandRunner and put this in its run(), so that we don't have to construct the command every time we run something on the cluster?

Done! And yes, regarding multi-node: although I haven't tested it, the idea in our implementation is that if you have N nodes, you only need to call srun once and let Slurm distribute the work to the N nodes. This logic lives in SlurmCodeGen.
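
For reference, a rough sketch of the resulting shape; the class name matches the PR's core abstraction, but the constructor, method signatures, and srun flags here are illustrative assumptions, not the actual implementation:

# Illustrative sketch only: a SlurmCommandRunner that wraps every command in
# srun against the already-provisioned Slurm job, so callers never build the
# srun invocation themselves. A single srun call with --nodes=N lets Slurm
# fan the command out to the N nodes (one task per node).
import shlex

from sky.utils import command_runner


class SlurmCommandRunner(command_runner.SSHCommandRunner):
    """Runs commands inside an existing Slurm allocation via srun."""

    def __init__(self, node, ssh_user, ssh_private_key, slurm_job_id, **kwargs):
        super().__init__(node, ssh_user, ssh_private_key, **kwargs)
        self._slurm_job_id = slurm_job_id

    def run(self, cmd, num_nodes=1, **kwargs):
        # Commands may be given as a list of arguments, as in the diff above.
        if isinstance(cmd, list):
            cmd = ' '.join(cmd)
        srun_cmd = (f'srun --jobid={self._slurm_job_id} '
                    f'--nodes={num_nodes} --ntasks-per-node=1 '
                    f'bash -c {shlex.quote(cmd)}')
        # Delegate to the SSH runner, which executes on the login/head node.
        return super().run(srun_cmd, **kwargs)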

@@ -0,0 +1,452 @@
"""Slurm instance provisioning."""

Collaborator

This file needs to be updated?

  # already healthy, i.e. the head node has expected number of nodes
  # connected to the ray cluster.
- if cluster_info.num_instances > 1 and not ray_cluster_healthy:
+ if cluster_info.num_instances > 1 and not ray_cluster_healthy and not_slurm:

Collaborator

Curious: if we are not starting a Ray cluster within the Slurm job, how are we handling job submission to the cluster? :)

Comment on lines 16 to 26
ssh:
  hostname: {{ssh_hostname}}
  port: {{ssh_port}}
  user: {{ssh_user}}
  private_key: {{slurm_private_key}}

auth:
  ssh_user: ubuntu
  # TODO(jwj): Modify this tmp workaround.
  # ssh_private_key: {{ssh_private_key}}
  ssh_private_key: {{slurm_private_key}}

Collaborator

What's the difference between the ssh block in the provider section and the one in auth?

Collaborator

I think we should be able to consolidate it into auth; it seems like the rest of the clouds are set up that way. Let me do that.

@kevinmingtarja (Collaborator) commented Nov 25, 2025

/smoke-test --no-resource-heavy 🟢 (failing on same 2 tests as master)

@kevinmingtarja (Collaborator)

/smoke-test --slurm

@kevinmingtarja (Collaborator)

/smoke-test --slurm --controller-cloud aws

@kevinmingtarja (Collaborator) commented Nov 25, 2025

/smoke-test --slurm --jobs-consolidation (94/114 passing)

@kevinmingtarja (Collaborator) commented Nov 25, 2025

/smoke-test --slurm --jobs-consolidation (98/110)

@kevinmingtarja (Collaborator) commented Nov 26, 2025

/smoke-test --slurm --jobs-consolidation (101/110)

@kevinmingtarja (Collaborator)

/smoke-test --slurm --jobs-consolidation

@kevinmingtarja (Collaborator)

/smoke-test --slurm --jobs-consolidation

Comment on lines +400 to +431
# Add prolog to signal allocation and wait for setup to finish.
# We also need to source ~/.bashrc again to make it as if the
# run section is run in a new shell, after setup is finished.
run_script = (
    f"touch {{alloc_signal_file}} && "
    f"while [ ! -f {{setup_done_signal_file}} ]; do sleep 0.05; done && "
    f"rm -f {{setup_done_signal_file}} && "
    "source ~/.bashrc && "
    + script
)
# Start exclusive srun in a thread to reserve allocation (similar to ray.get(pg.ready()))
gpu_arg = f'--gpus-per-node={num_gpus}' if {num_gpus} > 0 else ''

def run_thread_func():
    # This blocks until Slurm allocates resources (--exclusive)
    # --mem=0 to match RayCodeGen's behavior where we don't explicitly request memory.
    srun_cmd = f'srun --quiet --unbuffered --jobid={self._slurm_job_id} --nodes={num_nodes} --cpus-per-task={task_cpu_demand} --mem=0 --ntasks-per-node=1 {{gpu_arg}} --exclusive bash -c {{shlex.quote(run_script)}}'
    result = run_bash_command_with_log_and_return_pid(
        srun_cmd,
        log_path,
        env_vars=sky_env_vars_dict,
        stream_logs=True,
        with_ray=False,
        streaming_prefix=f'{{colorama.Fore.CYAN}}({task_name}, pid={{{{pid}}}}){{colorama.Style.RESET_ALL}} ',
    )
    return result

run_thread_result = {{'result': None}}

def run_thread_wrapper():
    run_thread_result['result'] = run_thread_func()

run_thread = threading.Thread(target=run_thread_wrapper)
run_thread.start()

Collaborator

This feels a bit hacky, but I couldn't think of another way to mimic what we do with Ray, where we create a placement group and call ray.get(pg.ready()) (which blocks until the requested resources are acquired) before we start running the setup and run sections of the task.
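
For comparison, the Ray-side pattern being mimicked is roughly the following (standard Ray placement-group API, shown only for illustration; not code from this PR):

# Reserve the task's resources up front and block until they are actually
# available, before launching the setup/run sections into that reservation.
import ray
from ray.util.placement_group import placement_group

ray.init(address='auto')
num_nodes = 2  # example value
# One resource bundle per node of the task.
pg = placement_group([{'CPU': 1}] * num_nodes, strategy='STRICT_SPREAD')
ray.get(pg.ready())  # Blocks until the requested resources are acquired.
# ...tasks for the setup and run sections are then scheduled into `pg`...

The srun --exclusive call plays the same role here: it blocks until Slurm grants the allocation, and the signal files coordinate the handoff between the setup and run phases.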

@kevinmingtarja (Collaborator)

/smoke-test --slurm --jobs-consolidation
