[slurm] Slurm support #5491
Conversation
sky/clouds/slurm.py (Outdated)
```python
ssh_client = SSHClient()
ssh_client.set_missing_host_key_policy(AutoAddPolicy())
ssh_client.connect(ssh_config_dict['hostname'],
                   port=ssh_config_dict['port'],
                   username=ssh_config_dict['user'],
```
Should we use our own command_runner.py instead?
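For illustration, a rough sketch of what that could look like, assuming SkyPilot's SSHCommandRunner accepts the node/user/key arguments shown below (the constructor signature and the private_key field name are assumptions, not confirmed against the current command_runner.py):

```python
# Hypothetical sketch: use SkyPilot's own SSH runner instead of a raw
# paramiko SSHClient. Constructor arguments here are assumptions and
# should be checked against sky/utils/command_runner.py.
from sky.utils import command_runner

runner = command_runner.SSHCommandRunner(
    node=(ssh_config_dict['hostname'], ssh_config_dict['port']),
    ssh_user=ssh_config_dict['user'],
    ssh_private_key=ssh_config_dict['private_key'],  # assumed field name
)
# Illustrative command; require_outputs=True also returns stdout/stderr.
rc, stdout, stderr = runner.run('sinfo', require_outputs=True)
```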
Force-pushed from 1c3019f to 718fba8
Michaelvll left a comment:
Sorry for the delay @JiangJiaWei1103! Thanks for updating the PR. I just left a few comments.
sky/backends/cloud_vm_ray_backend.py (Outdated)
```python
is_slurm = str(valid_resource.cloud).lower() == 'slurm'
if is_slurm:
    cmd = task_copy.run
    if isinstance(cmd, list):
        cmd = ' '.join(cmd)
    runner = handle.get_command_runners()[0]
    # ...
    provisioned_job_id = handle.cached_cluster_info.head_instance_id
    rc, stdout, stderr = runner.run(
        f'srun --jobid={provisioned_job_id} {cmd}', require_outputs=True)
    # ...
    return job_id
```
To make it cleaner, do you think we can subclass the command runner into SlurmCommandRunner, so that SlurmCommandRunner.run builds the command for us and we don't have to construct it every time we run something on the cluster?
How are we going to handle multi-node jobs here? Should we manually call srun to help distribute the commands to multiple nodes?
Just saw that we disabled multi-node support for now; that sounds good to me!
Done! And yes, regarding multi-node: although I haven't tested it, the idea is that if you have N nodes, you only need to call srun once and let Slurm distribute the command to the N nodes. This logic lives in SlurmCodeGen.
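For readers following the thread, here is a minimal sketch of that direction, assuming the runner subclasses the existing SSH runner and only needs to prepend an srun prefix (names and constructor arguments are illustrative, not the PR's actual SlurmCommandRunner):

```python
# Minimal sketch, not the PR's actual implementation: wrap every command
# in `srun --jobid=...` so call sites keep using the generic run() API.
import shlex

from sky.utils import command_runner


class SlurmCommandRunner(command_runner.SSHCommandRunner):
    """Runs commands inside an existing Slurm allocation via srun."""

    def __init__(self, *args, slurm_job_id: str, **kwargs):
        super().__init__(*args, **kwargs)
        self._slurm_job_id = slurm_job_id

    def run(self, cmd, **kwargs):
        # Simplification: assumes `cmd` is a string (the real runner also
        # accepts a list of arguments).
        wrapped = (f'srun --jobid={self._slurm_job_id} '
                   f'bash -c {shlex.quote(cmd)}')
        return super().run(wrapped, **kwargs)
```

With something like this, the backend snippet above could call runner.run(cmd, require_outputs=True) directly instead of building the srun prefix at each call site.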
```
@@ -0,0 +1,452 @@
"""Slurm instance provisioning."""
```
This file needs to be updated?
sky/provision/provisioner.py (Outdated)
```diff
  # already healthy, i.e. the head node has expected number of nodes
  # connected to the ray cluster.
- if cluster_info.num_instances > 1 and not ray_cluster_healthy:
+ if cluster_info.num_instances > 1 and not ray_cluster_healthy and not_slurm:
```
Curious: if we are not starting a Ray cluster within the Slurm job, how are we handling cluster job submission? :)
```yaml
ssh:
  hostname: {{ssh_hostname}}
  port: {{ssh_port}}
  user: {{ssh_user}}
  private_key: {{slurm_private_key}}

auth:
  ssh_user: ubuntu
  # TODO(jwj): Modify this tmp workaround.
  # ssh_private_key: {{ssh_private_key}}
  ssh_private_key: {{slurm_private_key}}
```
What's the difference between the ssh in the provider and the one in auth?
I think we should be able to consolidate it into auth; it looks like the rest of the clouds do it that way. Let me do that.
/smoke-test --no-resource-heavy 🟢 (failing on the same 2 tests as master)
/smoke-test --slurm
/smoke-test --slurm --controller-cloud aws
/smoke-test --slurm --jobs-consolidation (94/114 passing)
/smoke-test --slurm --jobs-consolidation (98/110)
/smoke-test --slurm --jobs-consolidation (101/110)
/smoke-test --slurm --jobs-consolidation
/smoke-test --slurm --jobs-consolidation
```python
# Add prolog to signal allocation and wait for setup to finish.
# We also need to source ~/.bashrc again to make it as if the
# run section is run in a new shell, after setup is finished.
run_script = (
    f"touch {{alloc_signal_file}} && "
    f"while [ ! -f {{setup_done_signal_file}} ]; do sleep 0.05; done && "
    f"rm -f {{setup_done_signal_file}} && "
    "source ~/.bashrc && "
    + script
)
# Start exclusive srun in a thread to reserve allocation (similar to ray.get(pg.ready())).
gpu_arg = f'--gpus-per-node={num_gpus}' if {num_gpus} > 0 else ''
def run_thread_func():
    # This blocks until Slurm allocates resources (--exclusive).
    # --mem=0 to match RayCodeGen's behavior where we don't explicitly request memory.
    srun_cmd = f'srun --quiet --unbuffered --jobid={self._slurm_job_id} --nodes={num_nodes} --cpus-per-task={task_cpu_demand} --mem=0 --ntasks-per-node=1 {{gpu_arg}} --exclusive bash -c {{shlex.quote(run_script)}}'
    result = run_bash_command_with_log_and_return_pid(
        srun_cmd,
        log_path,
        env_vars=sky_env_vars_dict,
        stream_logs=True,
        with_ray=False,
        streaming_prefix=f'{{colorama.Fore.CYAN}}({task_name}, pid={{{{pid}}}}){{colorama.Style.RESET_ALL}} ',
    )
    return result
run_thread_result = {{'result': None}}
def run_thread_wrapper():
    run_thread_result['result'] = run_thread_func()
run_thread = threading.Thread(target=run_thread_wrapper)
run_thread.start()
```
This feels a bit hacky, but I couldn't think of another way to mimic what we do with Ray, where we create a placement group and call ray.get(pg.ready()) (which blocks until the requested resources are acquired) before running the task's setup and run sections.
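For comparison, the Ray-side pattern being mimicked looks roughly like this (a generic Ray sketch, not SkyPilot's actual RayCodeGen output):

```python
# Generic Ray sketch: reserve resources with a placement group and block
# until the reservation is granted before starting any task work.
import ray
from ray.util.placement_group import placement_group

ray.init(address='auto')

# One bundle per node; STRICT_SPREAD places bundles on distinct nodes.
pg = placement_group([{'CPU': 1}] * 2, strategy='STRICT_SPREAD')
ray.get(pg.ready())  # Blocks until the requested resources are acquired.

# Only after this point would the task's setup/run sections start; the
# exclusive `srun` launched in a background thread plays the same
# blocking role on the Slurm side.
```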
/smoke-test --slurm --jobs-consolidation
Tracking issue
#5088
Changes
This PR brings in support for a new cloud/infra: Slurm.
Slurm is unlike most other clouds we have added, where we typically get our own VM with its own isolated network, disk, etc. As a result, the implementation is more complex than adding a generic cloud.
Core abstractions
SlurmCommandRunner
...
SlurmCodeGen
...
Tested (run the relevant ones):
- `bash format.sh`
- `/smoke-test` (CI) or `pytest tests/test_smoke.py` (local)
- `/smoke-test -k test_name` (CI) or `pytest tests/test_smoke.py::test_name` (local)
- `/quicktest-core` (CI) or `pytest tests/smoke_tests/test_backward_compat.py` (local)