Conversation

@JiangJiaWei1103 (Contributor) commented May 4, 2025

Tracking issue

#5088

Changes

This PR brings in support for a new cloud/infra: Slurm.

Slurm is unlike most other clouds we have added, where we typically get our own VMs with isolated network, disk, etc. As a result, the implementation is more complex than adding a generic cloud.

Core abstractions

SlurmCommandRunner

...

SlurmCodeGen

...

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@JiangJiaWei1103 changed the title from "Add slurm cluster" to "[slurm] Slurm support" on May 4, 2025
Comment on lines 257 to 261
ssh_client = SSHClient()
ssh_client.set_missing_host_key_policy(AutoAddPolicy())
ssh_client.connect(ssh_config_dict['hostname'],
port=ssh_config_dict['port'],
username=ssh_config_dict['user'],

Collaborator

Should we use our own command_runner.py instead?

Contributor Author

Great suggestion! I’ve rewritten a simplified version using SSHCommandRunner. Here’s the output from the latest changes:

[Screenshot: 2025-05-05 at 9:50:03 PM]

The failed case is shown as follows:

[Screenshot: 2025-05-05 at 9:51:55 PM]
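
For reference, a minimal sketch of the SSHCommandRunner-based approach, assuming the runner lives in sky/utils/command_runner.py and its constructor takes a (hostname, port) node tuple, the SSH user, and a private key path; the exact signature and the 'private_key' key in ssh_config_dict are assumptions, not code from this PR:

# Illustrative sketch only: swap the raw paramiko SSHClient for SkyPilot's
# own SSH runner. Constructor arguments are assumptions; adjust to the
# actual SSHCommandRunner signature.
from sky.utils import command_runner

runner = command_runner.SSHCommandRunner(
    node=(ssh_config_dict['hostname'], ssh_config_dict['port']),
    ssh_user=ssh_config_dict['user'],
    ssh_private_key=ssh_config_dict['private_key'],  # assumed key name
)
# With require_outputs=True, run() returns (returncode, stdout, stderr).
rc, stdout, stderr = runner.run('sinfo --noheader', require_outputs=True)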

@Michaelvll (Collaborator) left a comment

Sorry for the delay @JiangJiaWei1103! Thanks for updating the PR. I just left a few comments.

Comment on lines 3788 to 3798
is_slurm = str(valid_resource.cloud).lower() == 'slurm'
if is_slurm:
    cmd = task_copy.run
    if isinstance(cmd, list):
        cmd = ' '.join(cmd)
    runner = handle.get_command_runners()[0]

    provisioned_job_id = handle.cached_cluster_info.head_instance_id
    rc, stdout, stderr = runner.run(f'srun --jobid={provisioned_job_id} {cmd}', require_outputs=True)

    return job_id

Collaborator

To make it cleaner, do you think we can subclass the command runner as SlurmCommandRunner and put this in its run(), so that we don't have to construct the command every time we run something on the cluster?

@Michaelvll (Collaborator) commented Sep 8, 2025

How are we going to handle multi-node jobs here? Should we manually call srun to distribute the commands to multiple nodes?

Just saw that we disabled the multi-node support for now, that sounds good to me!

Collaborator

> To make it cleaner, do you think we can subclass the command runner as SlurmCommandRunner and put this in its run(), so that we don't have to construct the command every time we run something on the cluster?

Done! And yes, regarding multi-node: although I haven't tested it, the idea in our implementation is that if you have N nodes, you only need to call srun once and let Slurm distribute the work to the N nodes. This logic lives in SlurmCodeGen.
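
For reference, a rough sketch of the resulting shape; the class name matches the PR's core abstraction, but the constructor, method signatures, and srun flags here are illustrative assumptions, not the actual implementation:

# Illustrative sketch only: a SlurmCommandRunner that wraps every command in
# srun against the already-provisioned Slurm job, so callers never build the
# srun invocation themselves. A single srun call with --nodes=N lets Slurm
# fan the command out to the N nodes (one task per node).
import shlex

from sky.utils import command_runner


class SlurmCommandRunner(command_runner.SSHCommandRunner):
    """Runs commands inside an existing Slurm allocation via srun."""

    def __init__(self, node, ssh_user, ssh_private_key, slurm_job_id, **kwargs):
        super().__init__(node, ssh_user, ssh_private_key, **kwargs)
        self._slurm_job_id = slurm_job_id

    def run(self, cmd, num_nodes=1, **kwargs):
        # Commands may be given as a list of arguments, as in the diff above.
        if isinstance(cmd, list):
            cmd = ' '.join(cmd)
        srun_cmd = (f'srun --jobid={self._slurm_job_id} '
                    f'--nodes={num_nodes} --ntasks-per-node=1 '
                    f'bash -c {shlex.quote(cmd)}')
        # Delegate to the SSH runner, which executes on the login/head node.
        return super().run(srun_cmd, **kwargs)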

@@ -0,0 +1,452 @@
"""Slurm instance provisioning."""

Collaborator

This file needs to be updated?

  # already healthy, i.e. the head node has expected number of nodes
  # connected to the ray cluster.
- if cluster_info.num_instances > 1 and not ray_cluster_healthy:
+ if cluster_info.num_instances > 1 and not ray_cluster_healthy and not_slurm:

Collaborator

Curious: if we are not starting a Ray cluster within the Slurm job, how are we handling job submission to the cluster? :)

Comment on lines 16 to 26
ssh:
  hostname: {{ssh_hostname}}
  port: {{ssh_port}}
  user: {{ssh_user}}
  private_key: {{slurm_private_key}}

auth:
  ssh_user: ubuntu
  # TODO(jwj): Modify this tmp workaround.
  # ssh_private_key: {{ssh_private_key}}
  ssh_private_key: {{slurm_private_key}}

Collaborator

What's the difference between the ssh block in the provider section and the one in auth?

Collaborator

I think we should be able to consolidate it into auth; it seems like the rest of the clouds are set up that way. Let me do that.

@kevinmingtarja (Collaborator) commented Nov 25, 2025

/smoke-test --no-resource-heavy 🟢 (failing on same 2 tests as master)

@kevinmingtarja (Collaborator)

/smoke-test --slurm

@kevinmingtarja (Collaborator)

/smoke-test --slurm --controller-cloud aws

@kevinmingtarja (Collaborator) commented Nov 25, 2025

/smoke-test --slurm --jobs-consolidation (94/114 passing)

@kevinmingtarja (Collaborator) commented Nov 25, 2025

/smoke-test --slurm --jobs-consolidation (98/110)

@kevinmingtarja (Collaborator) commented Nov 26, 2025

/smoke-test --slurm --jobs-consolidation (101/110)

@kevinmingtarja (Collaborator)

/smoke-test --slurm --jobs-consolidation

@kevinmingtarja (Collaborator)

/smoke-test --slurm --jobs-consolidation

Comment on lines +400 to +431
# Add prolog to signal allocation and wait for setup to finish.
# We also need to source ~/.bashrc again to make it as if the
# run section is run in a new shell, after setup is finished.
run_script = (
    f"touch {{alloc_signal_file}} && "
    f"while [ ! -f {{setup_done_signal_file}} ]; do sleep 0.05; done && "
    f"rm -f {{setup_done_signal_file}} && "
    "source ~/.bashrc && "
    + script
)
# Start exclusive srun in a thread to reserve allocation (similar to ray.get(pg.ready()))
gpu_arg = f'--gpus-per-node={num_gpus}' if {num_gpus} > 0 else ''

def run_thread_func():
    # This blocks until Slurm allocates resources (--exclusive)
    # --mem=0 to match RayCodeGen's behavior where we don't explicitly request memory.
    srun_cmd = f'srun --quiet --unbuffered --jobid={self._slurm_job_id} --nodes={num_nodes} --cpus-per-task={task_cpu_demand} --mem=0 --ntasks-per-node=1 {{gpu_arg}} --exclusive bash -c {{shlex.quote(run_script)}}'
    result = run_bash_command_with_log_and_return_pid(
        srun_cmd,
        log_path,
        env_vars=sky_env_vars_dict,
        stream_logs=True,
        with_ray=False,
        streaming_prefix=f'{{colorama.Fore.CYAN}}({task_name}, pid={{{{pid}}}}){{colorama.Style.RESET_ALL}} ',
    )
    return result

run_thread_result = {{'result': None}}

def run_thread_wrapper():
    run_thread_result['result'] = run_thread_func()

run_thread = threading.Thread(target=run_thread_wrapper)
run_thread.start()

Collaborator

This feels a bit hacky, but I couldn't think of another way to mimic what we do with Ray, where we create a placement group and call ray.get(pg.ready()) (which blocks until the requested resources are acquired) before we start running the setup and run sections of the task.
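
For comparison, the Ray-side pattern being mimicked is roughly the following (standard Ray placement-group API, shown only for illustration; not code from this PR):

# Reserve the task's resources up front and block until they are actually
# available, before launching the setup/run sections into that reservation.
import ray
from ray.util.placement_group import placement_group

ray.init(address='auto')
num_nodes = 2  # example value
# One resource bundle per node of the task.
pg = placement_group([{'CPU': 1}] * num_nodes, strategy='STRICT_SPREAD')
ray.get(pg.ready())  # Blocks until the requested resources are acquired.
# ...tasks for the setup and run sections are then scheduled into `pg`...

The srun --exclusive call plays the same role here: it blocks until Slurm grants the allocation, and the signal files coordinate the handoff between the setup and run phases.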

@kevinmingtarja (Collaborator)

/smoke-test --slurm --jobs-consolidation
