Description
When running a SkyPilot managed job (sky job launch) against an SSH cluster using the SkyPilot API Server, the job fails during prechecks with an error suggesting the required SSH support is missing.
However, running the same workload by first provisioning the cluster (sky launch) and then executing the task (sky exec) works successfully.
Steps to Reproduce
- API Server Configuration (~/.sky/config.yaml excerpt): The user is part of the tuam-site workspace, where SSH is not disabled.
  tuam-site:
    private: true
    allowed_users:
      - [email protected]
      - [email protected]
    ssh:
      disabled: false  # Explicitly enabled
    gcp:
      disabled: true
    aws:
      disabled: true
    kubernetes:
      allowed_contexts:
        - dev_europe-west4_gke-hpc-cp2625-uat
- Successful Direct Execution (Expected Behavior):
  - Launch: sky launch -c SSH-TEST --infra ssh/* --cpus 0.1+ --memory 0.1+
  - Execute: sky exec SSH-TEST hello-world.yaml
  - Result: Succeeds.
- Failing Managed Job Execution via API Server (Actual Behavior):
  - Command: sky job launch -c testing-ssh hello-world.yaml
  - Result: Fails with status FAILED_PRECHECKS.
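As a quick sanity check on the workspace excerpt in the first step, the following minimal Python sketch lists which infra entries the tuam-site block leaves enabled. It assumes that an entry counts as disabled only when it carries `disabled: true`; that rule is an assumption for illustration, not SkyPilot's actual resolution logic. The function name `enabled_infra` and the placeholder e-mail addresses are hypothetical.

```python
def enabled_infra(workspace_config):
    """Return the sorted infra keys that are not marked `disabled: true`.

    Assumption: only dict-valued entries describe infra, and an entry is
    disabled only when it explicitly sets `disabled: true`.
    """
    enabled = []
    for key, value in workspace_config.items():
        if isinstance(value, dict) and not value.get("disabled", False):
            enabled.append(key)
    return sorted(enabled)


# Mirror of the tuam-site excerpt; the e-mail addresses are placeholders.
tuam_site = {
    "private": True,
    "allowed_users": ["user-a@example.com", "user-b@example.com"],
    "ssh": {"disabled": False},
    "gcp": {"disabled": True},
    "aws": {"disabled": True},
    "kubernetes": {"allowed_contexts": ["dev_europe-west4_gke-hpc-cp2625-uat"]},
}

if __name__ == "__main__":
    print(enabled_infra(tuam_site))  # ['kubernetes', 'ssh']
```

Under that assumption SSH is enabled in the workspace, so the precheck failure does not appear to come from this part of the configuration.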
Job file
resources:
  # Use only kubernetes resources
  infra: ssh/*
  cpus: 0.1+
  memory: 0.1+
run: |
  echo "Hello, SkyPilot!"
Error Log
ERROR:root:exec: process returned 127. /usr/local/bin/sky-kube-exec-wrapper: line 3: exec: ssh-tunnel.sh: not found
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read:None, redirect:None, status:None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x77bf39162d40>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/default/pods?limit=1
... (Connection Refused Warnings) ...
E 11-19 15:53:18 recovery_strategy.py:515] Failure happened before provisioning. Failover reasons: sky.exceptions.ResourcesUnavailableError: Task 'hello-on-ssh' requires SSH Node Pools which is not enabled. To enable access, change the task cloud requirement or run: sky check ssh
E 11-19 15:53:18 controller.py:732] Provision prechecks failed for task 0
E 11-19 15:53:18 controller.py:736] [sky.exceptions.ResourcesUnavailableError] Task 'hello-on-ssh' requires SSH Node Pools which is not enabled. To enable access, change the task cloud requirement or run: sky check ssh
I 11-19 15:53:18 state.py:2267] [sky.exceptions.ResourcesUnavailableError] Task 'hello-on-ssh' requires SSH Node Pools which is not enabled. To enable access, change the task cloud requirement or run: sky check ssh
... (Cleanup logs) ...
✓ Job finished (status: ManagedJobStatus.FAILED_PRECHECKS).
command terminated with exit code 100
Environment
- SkyPilot Version: 1.0.0.dev20251118 (running inside a Kubernetes Pod for the API Server)
- Infrastructure: Mixture of GKE and SSH-based servers.
- Observations: The error /usr/local/bin/sky-kube-exec-wrapper: line 3: exec: ssh-tunnel.sh: not found strongly suggests that the container image used for managed job execution (sky job launch) via the API server is missing the necessary ssh-tunnel.sh script or other SSH provisioning tools.
- Tested on all workspaces, including default.
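The "process returned 127" in the error log matches the status a POSIX shell reports when the target of exec cannot be found, which fits the missing-script hypothesis. The sketch below is one way to check this from inside the container that runs the managed job; it is not part of SkyPilot. The helper names `missing_executables` and `exit_code_for_missing_exec` are hypothetical, and the only name taken from the log is ssh-tunnel.sh.

```python
import shutil
import subprocess


def missing_executables(names):
    """Return the subset of `names` that cannot be resolved on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


def exit_code_for_missing_exec(command="surely-not-a-real-command-xyz"):
    """Run `exec <command>` in a POSIX shell and return the shell's exit status."""
    result = subprocess.run(
        ["sh", "-c", f"exec {command}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode


if __name__ == "__main__":
    # ssh-tunnel.sh is the name from the error log. Run this inside the
    # container that executes the managed job to see whether it is on PATH.
    print("missing:", missing_executables(["ssh-tunnel.sh"]))
    # A shell that cannot find the exec target exits with status 127.
    print("exit status for a missing exec target:", exit_code_for_missing_exec())
```

If ssh-tunnel.sh shows up as missing in the managed-job environment but not in the environment used by sky launch, that would support the observation above.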
Appendix
sky check ssh
Checking credentials to enable infra for SkyPilot (Workspaces: default, digit-it, apn-lab, tuam-site).
Checking enabled infra for workspace: 'default'
SSH: enabled [compute]
SSH Node Pools:
└── ian-cluster: enabled.
🎉 Enabled infra for workspace: 'default' 🎉
SSH [compute]
SSH Node Pools:
└── ian-cluster
Checking enabled infra for workspace: 'tuam-site'
SSH: enabled [compute]
SSH Node Pools:
└── ian-cluster: enabled.
🎉 Enabled infra for workspace: 'tuam-site' 🎉
SSH [compute]
SSH Node Pools:
└── ian-cluster
Note: The following clouds were disabled because they were not included in allowed_clouds in ~/.sky/config.yaml or disabled for this workspace 'tuam-site': GCP, AWS
To enable a cloud, follow the hints above and rerun: sky check
If any problems remain, refer to detailed docs at: https://docs.skypilot.co/en/latest/getting-started/installation.html