[Bug]: sky job launch fails with FAILED_PRECHECKS on SSH infra when API server is used, but sky exec succeeds. #8017

@sledress

Description

When running a SkyPilot managed job (sky job launch) against an SSH cluster using the SkyPilot API Server, the job fails during prechecks with an error suggesting the required SSH support is missing.

However, running the same workload by first provisioning the cluster (sky launch) and then executing the task (sky exec) works successfully.

Steps to Reproduce

  1. API Server Configuration (~/.sky/config.yaml excerpt):

    The user is part of the tuam-site workspace where SSH is not disabled.

    tuam-site:
      private: true
      allowed_users:
        - [email protected]
        - [email protected]
      ssh:
        disabled: false # Explicitly enabled
      gcp:
        disabled: true
      aws:
        disabled: true
      kubernetes:
        allowed_contexts:
          - dev_europe-west4_gke-hpc-cp2625-uat
  2. Successful Direct Execution (Expected Behavior):

    • Launch: sky launch -c SSH-TEST --infra ssh/* --cpus 0.1+ --memory 0.1+
    • Execute: sky exec SSH-TEST hello-world.yaml
    • Result: Succeeds.
  3. Failing Managed Job Execution via API Server (Actual Behavior):

    • Command: sky job launch -c testing-ssh hello-world.yaml
    • Result: Fails with status FAILED_PRECHECKS.
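The two paths above can be compared side by side with a small driver script. This is only a sketch: the command lists are copied verbatim from the steps in this report, `run_all` is a hypothetical helper name, and it assumes `hello-world.yaml` is in the current directory. The `sky` commands may still prompt for confirmation interactively.

```python
import shutil
import subprocess
import sys

# Commands copied verbatim from the reproduction steps in this report.
DIRECT = [
    ["sky", "launch", "-c", "SSH-TEST", "--infra", "ssh/*",
     "--cpus", "0.1+", "--memory", "0.1+"],
    ["sky", "exec", "SSH-TEST", "hello-world.yaml"],
]
MANAGED = [
    ["sky", "job", "launch", "-c", "testing-ssh", "hello-world.yaml"],
]


def run_all(commands):
    """Run commands in order; return the first non-zero exit code, else 0."""
    for cmd in commands:
        code = subprocess.run(cmd).returncode
        if code != 0:
            return code
    return 0


if __name__ == "__main__":
    if shutil.which("sky") is None:
        # Nothing to compare without the CLI; report and do nothing else.
        print("sky CLI not found on PATH", file=sys.stderr)
    else:
        print("direct path exit code:", run_all(DIRECT))
        print("managed job exit code:", run_all(MANAGED))
```

In the situation described here, the direct path would be expected to print 0 and the managed job path a non-zero code.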

Job file

resources:
  # Use only SSH Node Pool resources
  infra: ssh/*
  cpus: 0.1+
  memory: 0.1+

run: |
  echo "Hello, SkyPilot!"

Error Log

ERROR:root:exec: process returned 127. /usr/local/bin/sky-kube-exec-wrapper: line 3: exec: ssh-tunnel.sh: not found
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x77bf39162d40>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/default/pods?limit=1
... (Connection Refused Warnings) ...
E 11-19 15:53:18 recovery_strategy.py:515] Failure happened before provisioning. Failover reasons: sky.exceptions.ResourcesUnavailableError: Task 'hello-on-ssh' requires SSH Node Pools which is not enabled. To enable access, change the task cloud requirement or run: sky check ssh
E 11-19 15:53:18 controller.py:732] Provision prechecks failed for task 0
E 11-19 15:53:18 controller.py:736] [sky.exceptions.ResourcesUnavailableError] Task 'hello-on-ssh' requires SSH Node Pools which is not enabled. To enable access, change the task cloud requirement or run: sky check ssh
I 11-19 15:53:18 state.py:2267] [sky.exceptions.ResourcesUnavailableError] Task 'hello-on-ssh' requires SSH Node Pools which is not enabled. To enable access, change the task cloud requirement or run: sky check ssh
... (Cleanup logs) ...
✓ Job finished (status: ManagedJobStatus.FAILED_PRECHECKS).
command terminated with exit code 100
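For triage, the log above can be reduced to its distinct failure signals: the missing command reported by the shell and the SkyPilot exception class. The following is a minimal sketch, not part of SkyPilot. `summarize` and both regular expressions are hypothetical and are matched only against the line formats visible in this log.

```python
import re

# Matches SkyPilot exception names as they appear in the log,
# e.g. "sky.exceptions.ResourcesUnavailableError".
EXC_RE = re.compile(r"sky\.exceptions\.(\w+)")
# Matches a shell "not found" failure and captures the missing command.
NOT_FOUND_RE = re.compile(r"exec: (\S+): not found")


def summarize(log_text):
    """Return (sorted exception names, sorted missing commands) found in the log."""
    exceptions, missing = set(), set()
    for line in log_text.splitlines():
        exceptions.update(EXC_RE.findall(line))
        missing.update(NOT_FOUND_RE.findall(line))
    return sorted(exceptions), sorted(missing)
```

Feeding it the saved controller log should surface `ssh-tunnel.sh` as the missing command alongside `ResourcesUnavailableError`, which makes it easier to see that the precheck failure follows the failed tunnel setup.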

Environment

  • SkyPilot Version: 1.0.0.dev20251118 (Running inside a Kubernetes Pod for the API Server)
  • Infrastructure: Mixture of GKE and SSH-based servers.
  • Observations: The error /usr/local/bin/sky-kube-exec-wrapper: line 3: exec: ssh-tunnel.sh: not found strongly suggests that the container image used for managed job execution (sky job launch) via the API server is missing the necessary ssh-tunnel.sh script or other SSH provisioning tools.
  • Reproduced in all workspaces, including default.
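One way to test the missing-script hypothesis is to check, from inside the container that runs managed jobs, whether the executables named in the error log resolve at all. This is a sketch under assumptions: `find_missing_tools` is a hypothetical helper, the two tool names are taken from the log line above, and how to get a shell in the relevant container is not covered here.

```python
import os
import shutil


def find_missing_tools(names, path=None):
    """Return the entries of `names` that cannot be resolved as executables.

    Bare names are looked up on PATH (or on `path` if given); names that
    contain a directory component are checked at that location directly.
    """
    return [name for name in names if shutil.which(name, path=path) is None]


if __name__ == "__main__":
    # Names taken from the error log in this report.
    tools = ["ssh-tunnel.sh", "/usr/local/bin/sky-kube-exec-wrapper"]
    print("PATH =", os.environ.get("PATH", ""))
    for tool in find_missing_tools(tools):
        print(f"missing: {tool}")
```

If `ssh-tunnel.sh` is reported missing in the managed jobs environment but not in the environment where `sky launch` and `sky exec` succeed, that would support the observation above.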

Appendix

sky check ssh
Checking credentials to enable infra for SkyPilot (Workspaces: default, digit-it, apn-lab, tuam-site).

Checking enabled infra for workspace: 'default'
  SSH: enabled [compute]
    SSH Node Pools:
    └── ian-cluster: enabled.

🎉 Enabled infra for workspace: 'default' 🎉
  SSH [compute]
    SSH Node Pools:
    └── ian-cluster

Checking enabled infra for workspace: 'tuam-site'
  SSH: enabled [compute]
    SSH Node Pools:
    └── ian-cluster: enabled.

🎉 Enabled infra for workspace: 'tuam-site' 🎉
  SSH [compute]
    SSH Node Pools:
    └── ian-cluster
Note: The following clouds were disabled because they were not included in allowed_clouds in ~/.sky/config.yaml or disabled for this workspace 'tuam-site': GCP, AWS     

To enable a cloud, follow the hints above and rerun: sky check
If any problems remain, refer to detailed docs at: https://docs.skypilot.co/en/latest/getting-started/installation.html
