Changes from all commits
115 commits
55e64f5
feat: Add cloud skeleton for the Slurm cluster
JiangJiaWei1103 May 3, 2025
d654c2f
feat: Bypass checks to mock compute accessibility
JiangJiaWei1103 May 3, 2025
4683d6c
feat: Check SSH credentials for a single-node cluster
JiangJiaWei1103 May 4, 2025
4798cd7
refactor: Use SkyPilot's SSHCommandRunner for checking credentials
JiangJiaWei1103 May 5, 2025
2bf3cc9
feat: Force re-authentication
JiangJiaWei1103 May 5, 2025
986420c
docs: Fix example usage
JiangJiaWei1103 May 5, 2025
0aeccd7
feat: Support default instance type with a Slurm virtual instance
JiangJiaWei1103 May 5, 2025
7df6641
feat: Generate basic deploy rsc vars
JiangJiaWei1103 May 6, 2025
b05c8be
feat: Prepare for naive provisioning logic
JiangJiaWei1103 May 7, 2025
13c04f9
feat: Try naive provision by sbatch a long-running job
JiangJiaWei1103 May 8, 2025
96f1a7f
feat: Prepare post provisioning runtime setup
JiangJiaWei1103 May 10, 2025
718fba8
feat: Make simple setup command run
JiangJiaWei1103 May 14, 2025
80196ed
feat: Add SlurmCommandRunner placeholder
JiangJiaWei1103 Aug 7, 2025
8a56f0b
feat: Send Slurm-supported user cmds over SSH connections
JiangJiaWei1103 Aug 9, 2025
9f39b96
feat: Add slurm into supported clouds
JiangJiaWei1103 Aug 9, 2025
8fb7524
feat: Support filtering running Slurm virtual instances
JiangJiaWei1103 Aug 9, 2025
5763eba
fix: Fix --mem-per-cpu option typo
JiangJiaWei1103 Aug 10, 2025
31e7265
Run a naive cmd in the provisioned Slurm virtual instance
JiangJiaWei1103 Aug 10, 2025
79441b9
fix: Avoid killing tasks and make the provisioning job pending
JiangJiaWei1103 Aug 11, 2025
cfc8b50
Merge branch 'up-master' into add-slurm-cluster
JiangJiaWei1103 Sep 19, 2025
645dd3b
feat: Use SlurmCommandRunner with sinfo to check credentials
JiangJiaWei1103 Sep 19, 2025
095f8ea
refactor: Consider multi-node provisioning for future dev
JiangJiaWei1103 Sep 19, 2025
07007e6
feat: Tear down the cluster
JiangJiaWei1103 Sep 19, 2025
d8f09c4
refactor: Disable stop instance function
JiangJiaWei1103 Sep 19, 2025
396cd82
Merge branch 'up-master' into add-slurm-cluster
JiangJiaWei1103 Oct 1, 2025
91bf832
fix: Add missing parameters
JiangJiaWei1103 Oct 1, 2025
82a65ab
feat: Support GPU-accelerated workloads
JiangJiaWei1103 Oct 22, 2025
0905e12
feat: Parse Slurm virtual instance resources from config
JiangJiaWei1103 Oct 24, 2025
c8c67a9
feat: Support getting a SlurmCommandRunner for the head virtual instance
JiangJiaWei1103 Oct 24, 2025
02ffb36
feat: Support gpu type
JiangJiaWei1103 Oct 24, 2025
531e092
feat: Add boilerplate for dynamically discovering launchable resources
JiangJiaWei1103 Nov 7, 2025
9aea61d
feat: Dynamically fitting the Slurm instance by client-defined slurmc…
JiangJiaWei1103 Nov 7, 2025
0a90e24
refactor: Clean up Slurm task execution logic
JiangJiaWei1103 Nov 9, 2025
6d51fd6
refactor: Clean up redundant interfaces of Slurm catalog
JiangJiaWei1103 Nov 9, 2025
bdafe4f
refactor: Reuse slurm utilities
JiangJiaWei1103 Nov 9, 2025
a546a1a
refactor: Explicitly skip Ray cluster setup for the Slurm cluster
JiangJiaWei1103 Nov 9, 2025
95539ff
refactor: Recover setup commands
JiangJiaWei1103 Nov 9, 2025
bf6a2db
Add support for ssh proxy command for slurm head
Michaelvll Nov 12, 2025
b533b5b
Fix query method
Michaelvll Nov 14, 2025
367c1aa
Merge branch 'master' into add-slurm-cluster
kevinmingtarja Nov 18, 2025
65f76d4
make single launch work end to end
kevinmingtarja Nov 19, 2025
8539e42
Update SLURM instance type output in optimizer table to `-`
romilbhardwaj Nov 19, 2025
4b85534
fix yaml jinja template, add slurm to dependencies.py
kevinmingtarja Nov 19, 2025
388f5f7
share skypilot-runtime between different clusters
kevinmingtarja Nov 19, 2025
ce40acd
multislurm support
romilbhardwaj Nov 19, 2025
b8813fc
set stream_logs=False on slurm client
kevinmingtarja Nov 19, 2025
6a5657d
create custom ssh config for slurm clusters
kevinmingtarja Nov 20, 2025
228de99
rm some stray logging
kevinmingtarja Nov 20, 2025
1da72c9
create .hushlogin to suppress sudo warning message
kevinmingtarja Nov 20, 2025
f497bb8
Add show-gpus support (#8020)
Michaelvll Nov 20, 2025
c1c1a15
rename context -> cluster for slurm in infra page
kevinmingtarja Nov 20, 2025
39554cd
uppercase slurm gpu names
kevinmingtarja Nov 20, 2025
82dd60d
make sky show-gpus fetch in parallel
kevinmingtarja Nov 20, 2025
d73f65e
add ray-style log prefix for slurm task executor
kevinmingtarja Nov 20, 2025
0153f8c
[SLURM] Fix race condition when waiting for node allocation to jobs (…
romilbhardwaj Nov 20, 2025
55e41c3
also wait after job is submitted, add completing state to status_map
kevinmingtarja Nov 20, 2025
48690d6
dynamically pass in SKYPILOT_NODE_IPS and SKYPILOT_NODE_RANK
kevinmingtarja Nov 20, 2025
216c12c
move skypilot-runtime to /tmp for slurm
kevinmingtarja Nov 20, 2025
ffdd7ef
combine node info and partitions into one call
kevinmingtarja Nov 20, 2025
c1fe8cf
remove --output and --error from virtual instance sbatch
kevinmingtarja Nov 20, 2025
f93533f
update slurm codegen unit test snapshot
kevinmingtarja Nov 20, 2025
f2b8ad8
fix check_instance_fits, set --unbuffered in srun
kevinmingtarja Nov 21, 2025
c6da0de
cleanup slurm client
kevinmingtarja Nov 21, 2025
e5e3279
minor opt
kevinmingtarja Nov 21, 2025
f6b14ae
robustify launch path
kevinmingtarja Nov 21, 2025
6e3bb13
simplify launch
kevinmingtarja Nov 21, 2025
bb3f576
cleanup sky dirs when terminating instance
kevinmingtarja Nov 21, 2025
5deaf6d
clean up cluster state management
kevinmingtarja Nov 21, 2025
f44f418
clean up codegen abstraction
kevinmingtarja Nov 21, 2025
4e77229
more codegen abstraction cleanup
kevinmingtarja Nov 22, 2025
63d5512
small fix
kevinmingtarja Nov 22, 2025
d3973a8
Revert "make sky show-gpus fetch in parallel"
kevinmingtarja Nov 22, 2025
08595a2
various small cleanups
kevinmingtarja Nov 22, 2025
2cd56fd
bump API_VERSION to 24
kevinmingtarja Nov 22, 2025
fe3375f
rm dead code
kevinmingtarja Nov 22, 2025
1ce80d4
hide slurm under SKYPILOT_ENABLE_SLURM feature flag
kevinmingtarja Nov 22, 2025
e16f5ba
fmt
kevinmingtarja Nov 24, 2025
9434c56
move sqlite dbs from ~/.sky to /tmp too, make test_minimal smoke test…
kevinmingtarja Nov 24, 2025
4669c52
Merge branch 'master' into add-slurm-cluster
kevinmingtarja Nov 24, 2025
22e1f41
disable grpc for now if cloud is slurm
kevinmingtarja Nov 24, 2025
ad51993
add slurm to buildkite generate_pipeline.py
kevinmingtarja Nov 24, 2025
556950f
fmt, fix dir cleanup during termination
kevinmingtarja Nov 24, 2025
03a9232
revert
kevinmingtarja Nov 24, 2025
cea1bf9
add list_accelerators to slurm_catalog
kevinmingtarja Nov 24, 2025
99ae468
handle FileNotFoundError in get_all_slurm_cluster_names
kevinmingtarja Nov 24, 2025
b65722f
set SKYPILOT_ENABLE_SLURM=1 for smoke tests
kevinmingtarja Nov 24, 2025
6d507af
fix slurm env var config in smoke test
kevinmingtarja Nov 24, 2025
5009611
impl instance_type_exists for slurm
kevinmingtarja Nov 24, 2025
ed0b3df
fix slurm validate_region_zone
kevinmingtarja Nov 24, 2025
4899345
clear request_body_env_vars cache in override_sky_config smoke test u…
kevinmingtarja Nov 24, 2025
69ed050
extract setup_sky_dirs_commands to its own step for other clouds, fix…
kevinmingtarja Nov 25, 2025
41fcf55
refactor attempt_skylet for slurm
kevinmingtarja Nov 25, 2025
66d1d58
fmt
kevinmingtarja Nov 25, 2025
3937303
fix ray_port.json
kevinmingtarja Nov 25, 2025
b17ffba
fix resources_dict in codegen
kevinmingtarja Nov 25, 2025
32dddb7
skip some smoke tests for slurm
kevinmingtarja Nov 25, 2025
5c21edc
fix autodown in slurm, skip more tests
kevinmingtarja Nov 25, 2025
7da75a0
add more items to _CLOUD_UNSUPPORTED_FEATURES
kevinmingtarja Nov 25, 2025
7d2e3fd
skip serve tests that require opening ports
kevinmingtarja Nov 25, 2025
a858590
get available gpus dynamically
kevinmingtarja Nov 25, 2025
f050898
skip docker/image_id tests
kevinmingtarja Nov 25, 2025
a6c2abe
skip and fix more tests
kevinmingtarja Nov 25, 2025
e0e592f
small cleanup
kevinmingtarja Nov 25, 2025
d970169
fix terminate_instances for autodown
kevinmingtarja Nov 25, 2025
63ec716
tidy up sbatch provision script
kevinmingtarja Nov 26, 2025
a301776
more smoke test fixes
kevinmingtarja Nov 26, 2025
7cd2637
remove feature flag
kevinmingtarja Nov 26, 2025
907f92e
fix slurm show-gpus test
kevinmingtarja Nov 26, 2025
414d96d
improve SlurmCodeGen such that each step gets exclusive allocation f…
kevinmingtarja Nov 27, 2025
808dee1
fix check_instance_fits for multi-gpu node, add smoke test for job qu…
kevinmingtarja Nov 27, 2025
6cc793c
Merge branch 'master' into add-slurm-cluster
kevinmingtarja Nov 27, 2025
f9089ed
round up cpus
kevinmingtarja Nov 27, 2025
729a938
update snapshot
kevinmingtarja Nov 27, 2025
359d23b
int
kevinmingtarja Nov 27, 2025
fbf8f52
fix None task name, source bashrc again in run script
kevinmingtarja Nov 27, 2025
1 change: 1 addition & 0 deletions .buildkite/generate_pipeline.py
@@ -63,6 +63,7 @@
'nebius': QUEUE_GENERIC_CLOUD,
'lambda': QUEUE_GENERIC_CLOUD,
'runpod': QUEUE_GENERIC_CLOUD,
'slurm': QUEUE_GENERIC_CLOUD,
'kubernetes': QUEUE_KIND
}

2 changes: 2 additions & 0 deletions sky/__init__.py
@@ -140,6 +140,7 @@ def set_proxy_env_var(proxy_var: str, urllib_var: Optional[str]):
GCP = clouds.GCP
Lambda = clouds.Lambda
SCP = clouds.SCP
Slurm = clouds.Slurm
Kubernetes = clouds.Kubernetes
K8s = Kubernetes
OCI = clouds.OCI
@@ -170,6 +171,7 @@ def set_proxy_env_var(proxy_var: str, urllib_var: Optional[str]):
'RunPod',
'Vast',
'SCP',
'Slurm',
'Vsphere',
'Fluidstack',
'Nebius',
365 changes: 365 additions & 0 deletions sky/adaptors/slurm.py
@@ -0,0 +1,365 @@
"""Slurm adaptor for SkyPilot."""

import logging
import re
import time
from typing import Dict, List, Optional, Tuple

from sky.utils import command_runner
from sky.utils import subprocess_utils
from sky.utils import timeline

logger = logging.getLogger(__name__)


class SlurmClient:
"""Client for Slurm control plane operations."""

def __init__(
self,
ssh_host: str,
ssh_port: int,
ssh_user: str,
ssh_key: Optional[str],
ssh_proxy_command: Optional[str] = None,
):
"""Initialize SlurmClient.

Args:
ssh_host: Hostname of the Slurm controller.
ssh_port: SSH port on the controller.
ssh_user: SSH username.
ssh_key: Path to SSH private key, or None for keyless SSH.
ssh_proxy_command: Optional SSH proxy command.
"""
self.ssh_host = ssh_host
self.ssh_port = ssh_port
self.ssh_user = ssh_user
self.ssh_key = ssh_key
self.ssh_proxy_command = ssh_proxy_command

# Internal runner for executing Slurm CLI commands
# on the controller node.
self._runner = command_runner.SSHCommandRunner(
(ssh_host, ssh_port),
ssh_user,
ssh_key,
ssh_proxy_command=ssh_proxy_command,
)

def query_jobs(
self,
job_name: Optional[str] = None,
state_filters: Optional[List[str]] = None,
) -> List[str]:
"""Query Slurm jobs by state and optional name.

Args:
job_name: Optional job name to filter by.
state_filters: List of job states to filter by
(e.g., ['running', 'pending']). If None, returns all jobs.

Returns:
List of job IDs matching the filters.
"""
cmd = 'squeue --me -h -o "%i"'
if state_filters is not None:
state_filters_str = ','.join(state_filters)
cmd += f' --states {state_filters_str}'
if job_name is not None:
cmd += f' --name {job_name}'

rc, stdout, stderr = self._runner.run(cmd,
require_outputs=True,
stream_logs=False)
subprocess_utils.handle_returncode(rc,
cmd,
'Failed to query Slurm jobs.',
stderr=stderr)

job_ids = stdout.strip().splitlines()
return job_ids

def cancel_jobs_by_name(self,
job_name: str,
signal: Optional[str] = None) -> None:
"""Cancel Slurm job(s) by name.

Args:
job_name: Name of the job(s) to cancel.
signal: Optional signal to send to the job(s).
"""
cmd = f'scancel --name {job_name}'
if signal is not None:
cmd += f' --signal {signal}'
rc, stdout, stderr = self._runner.run(cmd,
require_outputs=True,
stream_logs=False)
subprocess_utils.handle_returncode(rc,
cmd,
f'Failed to cancel job {job_name}.',
stderr=stderr)
logger.debug(f'Successfully cancelled job {job_name}: {stdout}')

def info(self) -> str:
"""Get Slurm cluster information.

This is useful for checking if the cluster is accessible and
retrieving node information.

Returns:
The stdout output from sinfo.
"""
cmd = 'sinfo'
rc, stdout, stderr = self._runner.run(cmd,
require_outputs=True,
stream_logs=False)
subprocess_utils.handle_returncode(
rc, cmd, 'Failed to get Slurm cluster information.', stderr=stderr)
return stdout

def info_nodes(self) -> List[str]:
"""Get Slurm node information.

Returns node names, states, GRES (generic resources like GPUs),
and partition.

Returns:
A list of node info lines.
"""
cmd = 'sinfo -h --Node -o "%N %t %G %P"'
rc, stdout, stderr = self._runner.run(cmd,
require_outputs=True,
stream_logs=False)
subprocess_utils.handle_returncode(
rc, cmd, 'Failed to get Slurm node information.', stderr=stderr)
return stdout.splitlines()

def node_details(self, node_name: str) -> Dict[str, str]:
"""Get detailed Slurm node information.

Returns:
A dictionary of node attributes.
"""

def _parse_scontrol_node_output(output: str) -> Dict[str, str]:
"""Parses the key=value output of 'scontrol show node'."""
node_info = {}
# Split on whitespace; values that themselves contain spaces are
# not handled. This is simplified; scontrol output can be complex.
parts = output.split()
for part in parts:
if '=' in part:
key, value = part.split('=', 1)
# Simple quote removal, might need refinement
value = value.strip('\'"')
node_info[key] = value
return node_info

cmd = f'scontrol show node {node_name}'
rc, node_details, stderr = self._runner.run(cmd,
require_outputs=True,
stream_logs=False)
subprocess_utils.handle_returncode(
rc,
cmd,
f'Failed to get detailed node information for {node_name}.',
stderr=stderr)
node_info = _parse_scontrol_node_output(node_details)
return node_info

def info_partitions(self) -> List[str]:
"""Get Slurm node-to-partition information.

Returns:
A list of node and partition info lines.
"""
cmd = 'sinfo -h --Node -o "%N %P"'
rc, stdout, stderr = self._runner.run(cmd,
require_outputs=True,
stream_logs=False)
subprocess_utils.handle_returncode(
rc,
cmd,
'Failed to get Slurm partition information.',
stderr=stderr)
return stdout.splitlines()

def get_node_jobs(self, node_name: str) -> List[str]:
"""Get the list of jobs for a given node name.

Returns:
A list of job names for the current user on the node.
"""
cmd = f'squeue --me -h --nodelist {node_name} -o "%b"'
rc, stdout, stderr = self._runner.run(cmd,
require_outputs=True,
stream_logs=False)
subprocess_utils.handle_returncode(
rc, cmd, f'Failed to get jobs for node {node_name}.', stderr=stderr)
return stdout.splitlines()

def get_job_state(self, job_id: str) -> Optional[str]:
"""Get the state of a Slurm job.

Args:
job_id: The Slurm job ID.

Returns:
The job state (e.g., 'PENDING', 'RUNNING', 'COMPLETED', etc.),
or None if the job is not found.
"""
# Use --only-job-state since we only need the job state.
# This reduces the work required by slurmctld.
cmd = f'squeue -h --only-job-state --jobs {job_id} -o "%T"'
rc, stdout, stderr = self._runner.run(cmd,
require_outputs=True,
stream_logs=False)
if rc != 0:
# Job may not exist
logger.debug(f'Failed to get job state for job {job_id}: {stderr}')
return None

state = stdout.strip()
return state if state else None

@timeline.event
def wait_for_job_nodes(self, job_id: str, timeout: int = 300) -> None:
"""Wait for a Slurm job to have nodes allocated.

Args:
job_id: The Slurm job ID.
timeout: Maximum time to wait in seconds (default: 300).
"""
start_time = time.time()
last_state = None

while time.time() - start_time < timeout:
state = self.get_job_state(job_id)

if state != last_state:
logger.debug(f'Job {job_id} state: {state}')
last_state = state

if state is None:
raise RuntimeError(f'Job {job_id} not found. It may have been '
'cancelled or failed.')

if state in ('COMPLETED', 'CANCELLED', 'FAILED', 'TIMEOUT'):
raise RuntimeError(
f'Job {job_id} terminated with state {state} '
'before nodes were allocated.')
# TODO(kevin): Log reason for pending.

# Check if nodes are allocated by trying to get node list
cmd = f'squeue -h --jobs {job_id} -o "%N"'
rc, stdout, stderr = self._runner.run(cmd,
require_outputs=True,
stream_logs=False)

if rc == 0 and stdout.strip():
# Nodes are allocated
logger.debug(
f'Job {job_id} has nodes allocated: {stdout.strip()}')
return
elif rc != 0:
logger.debug(f'Failed to get nodes for job {job_id}: {stderr}')

# Wait before checking again
time.sleep(2)

raise TimeoutError(f'Job {job_id} did not get nodes allocated within '
f'{timeout} seconds. Last state: {last_state}')

@timeline.event
def get_job_nodes(self,
job_id: str,
wait: bool = True) -> Tuple[List[str], List[str]]:
"""Get the list of nodes and their IPs for a given job ID.

The ordering is guaranteed to be stable for the lifetime of the job.

Args:
job_id: The Slurm job ID.
wait: If True, wait for nodes to be allocated before returning.

Returns:
A tuple of (nodes, node_ips) where nodes is a list of node names
and node_ips is a list of corresponding IP addresses.
"""
# Wait for nodes to be allocated if requested
if wait:
self.wait_for_job_nodes(job_id)

cmd = (
f'squeue -h --jobs {job_id} -o "%N" | tr \',\' \'\\n\' | '
f'while read node; do '
# TODO(kevin): Use json output for more robust parsing.
f'ip=$(scontrol show node=$node | grep NodeAddr= | '
f'awk -F= \'{{print $2}}\' | awk \'{{print $1}}\'); '
f'echo "$node $ip"; '
f'done')
rc, stdout, stderr = self._runner.run(cmd,
require_outputs=True,
stream_logs=False)
subprocess_utils.handle_returncode(
rc, cmd, f'Failed to get nodes for job {job_id}.', stderr=stderr)
logger.debug(f'Successfully got nodes for job {job_id}: {stdout}')

node_info = {}
for line in stdout.strip().splitlines():
line = line.strip()
if line:
parts = line.split()
if len(parts) >= 2:
node_name = parts[0]
node_ip = parts[1]
node_info[node_name] = node_ip

nodes = list(node_info.keys())
node_ips = [node_info[node] for node in nodes]
if not nodes:
raise RuntimeError(
f'No nodes found for job {job_id}. '
f'The job may have terminated or the output was empty.')
assert (len(nodes) == len(node_ips)
), f'Number of nodes and IPs do not match: {nodes} != {node_ips}'

return nodes, node_ips

def submit_job(
self,
partition: str,
job_name: str,
script_path: str,
) -> str:
"""Submit a Slurm job script.

Args:
partition: Slurm partition to submit to.
job_name: Name to give the job.
script_path: Remote path where the script will be stored.

Returns:
The job ID of the submitted job.
"""
cmd = f'sbatch --partition={partition} {script_path}'
rc, stdout, stderr = self._runner.run(cmd,
require_outputs=True,
stream_logs=False)
subprocess_utils.handle_returncode(rc,
cmd,
'Failed to submit Slurm job.',
stderr=f'{stdout}\n{stderr}')

# Parse job ID from sbatch output (format: "Submitted batch job 12345")
job_id_match = re.search(r'Submitted batch job (\d+)', stdout)
if not job_id_match:
raise RuntimeError(
f'Failed to parse job ID from sbatch output: {stdout}')

job_id = job_id_match.group(1).strip()
logger.debug(f'Successfully submitted Slurm job {job_id} with name '
f'{job_name}: {stdout}')

return job_id
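
For illustration only (not part of this diff): a minimal sketch of how the SlurmClient added above might be exercised end to end. The controller hostname, SSH user, key path, partition, job name, and script path below are placeholders, and the batch script is assumed to have already been copied to the controller.

from sky.adaptors import slurm

# Placeholder connection details; not taken from the PR.
client = slurm.SlurmClient(
    ssh_host='slurm-head.example.com',
    ssh_port=22,
    ssh_user='ubuntu',
    ssh_key='~/.ssh/id_rsa',  # or None for keyless SSH
)

# Sanity-check that the controller is reachable.
print(client.info())

# Submit a pre-uploaded batch script and wait for node allocation.
job_id = client.submit_job(partition='gpu',
                           job_name='sky-demo',
                           script_path='/tmp/sky-demo.sbatch')
nodes, node_ips = client.get_job_nodes(job_id, wait=True)
print(f'Job {job_id} allocated: {list(zip(nodes, node_ips))}')

# Poll the state, then clean up by job name.
print(client.get_job_state(job_id))
client.cancel_jobs_by_name('sky-demo')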