SkyPilot v0.10.5

@Michaelvll released this 06 Nov 21:31 · 4e2cfdc

SkyPilot v0.10.5: Major Managed Jobs Efficiency Improvements, UX and SDK Usability Enhancements, and API Server Robustness Fixes

This release focuses on production stability, performance optimization, and fixing critical bugs that affected reliability in multi-user and Kubernetes environments. Key highlights include major performance improvements for managed jobs and the dashboard, resolution of race conditions and resource leaks, and expanded cloud/accelerator support.

This release includes 400+ merged pull requests, covering high-priority critical fixes as well as further bug fixes, performance enhancements, new features, and documentation updates.


Highlights

Managed Job Efficiency and Robustness Improvement

[Image: jobs-controller]
# config.yaml
jobs:
  controller:
    consolidation_mode: true  # Currently defaults to False.

[Image: job-launch-fast]

  • Managed jobs robustness improvement (#7769, #7716).

Performance and Robustness Improvement at a Large Scale

API Server Robustness Improvement

  • Significantly reduce memory consumption and avoid OOM for API server (#7240).

Python SDK Usability Improvements

  • New Python SDK examples to easily scale out your jobs on any infra (#7335).
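import sky
from sky import jobs as managed_jobs

# Submit 100 managed jobs, each requesting 8 A100 GPUs.
for i in range(100):
    resource = sky.Resources(accelerators='A100:8')
    # workdir='.' uses the current directory as the job's working directory.
    task = sky.Task(resources=resource,
                    workdir='.',
                    run='python batch_inference.py')
    managed_jobs.launch(task, name=f'hello-{i}')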
  • Admin policy robustness and better usability (#7827).

Integration

  • CoreWeave managed Kubernetes and storage support (#7756, #6519, #7759).
  • AMD GPU support and GPU metrics (#6944).

What's New

Additional Performance Improvements

  • Migrated managed jobs and SkyServe to a gRPC architecture, reducing P95 latency by up to 86% and CPU usage by 20-30% (#7245, #6647, #6702, #7184).
  • Optimized Kubernetes pod processing with streaming JSON, providing 2x speedup and 50% memory reduction for large clusters (#7469).
  • Optimized database queries for sky status and API endpoints, improving mean response time by 10-35% (#7689, #7690, #7665, #7705, #7708).
  • Replaced standard JSON with high-performance orjson for API serialization, improving tail latency for large payloads (#7734).

Critical Stability & Security Fixes

  • Fixed critical race conditions in Docker operations ("container name already in use") (#7030), API server start/stop (#7534), and asynchronous job cancellation, preventing cluster leaks (#7511).
  • Resolved critical Kubernetes resource leaks, including semaphore file leaks (/dev/shm exhaustion) (#7678) and fusermount-server leaks (#7398).
  • Fixed system hangs caused by SSH threads (#7202), K8s SSH proxy blocking (#7537), and BrokenProcessPool crashes when canceling sky logs (#7607).
  • Addressed a critical OOM bug in the server caused by inefficient GPU name canonicalization in Kubernetes (#7080).
  • Fixed security/privacy bugs where pending jobs were visible across users (#7581) and environment variables leaked between managed jobs (#7459).
  • Prevented accidental cancellation of the wrong request by requiring an exact ID match when a prefix is ambiguous (#7730).
  • Replaced insecure curl | sh pattern for fluent-bit installation with official package repositories (#7126).
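
The exact-match rule for ambiguous prefixes (#7730) can be pictured with a small sketch. The `resolve_request_id` helper, the exception class, and the sample IDs are hypothetical and only illustrate the idea; this is not SkyPilot's actual implementation. A unique prefix resolves to its request, an exact ID always resolves, and a prefix shared by several requests is rejected rather than guessed.

```python
from typing import Iterable


class AmbiguousPrefixError(ValueError):
    """Raised when a prefix matches more than one request ID."""


def resolve_request_id(id_or_prefix: str, known_ids: Iterable[str]) -> str:
    """Resolve a user-supplied ID or prefix to exactly one request ID.

    Hypothetical helper, for illustration only.
    """
    ids = list(known_ids)
    # An exact match always wins, even if it is also a prefix of other IDs.
    if id_or_prefix in ids:
        return id_or_prefix
    matches = [rid for rid in ids if rid.startswith(id_or_prefix)]
    if not matches:
        raise KeyError(f'No request matches {id_or_prefix!r}')
    if len(matches) > 1:
        # Refuse to pick one: the caller must supply the exact ID.
        raise AmbiguousPrefixError(
            f'{id_or_prefix!r} matches {len(matches)} requests; '
            'use the full request ID.')
    return matches[0]
```

With IDs `abc123`, `abc456`, and `def789`, the prefix `def` resolves to `def789`, while `abc` raises `AmbiguousPrefixError` instead of cancelling an arbitrary match.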

UX Improvement

  • Improved CLI error messages for Kubernetes resource constraints (#6814) and multi-node setup failures (#7001).
  • Enhanced sky down output to show a clear summary when multiple clusters fail or succeed (#7225, #7635).
  • Added a CLI spinner for short-running requests (like sky status) to show activity when the server is under load (#7631).
  • Added sky ui as a shorter alias for sky dashboard and sky volume as an alias for sky volumes (#7565, #7746).
  • SDK stream_and_get can now retrieve the latest request without an ID (#6965).
  • Improved provision log hint to show the user-friendly sky logs --provision <cluster> command (#7682).
  • Added SKYPILOT_NUM_JOBS environment variable for pools to complement SKYPILOT_JOB_RANK (#7542).
  • Dynamically scale worker pools using sky jobs apply --workers <n> (#7236).
  • Launch multiple local kind clusters with custom names via sky local up --name (#7244, #7394).
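
One way a pool job might use `SKYPILOT_NUM_JOBS` together with `SKYPILOT_JOB_RANK` (#7542) is to pick its own slice of a shared work list. The sketch below assumes the rank is zero-based and smaller than the job count, which the release notes do not state; the `shard_for_this_job` helper and the fallback defaults are hypothetical.

```python
import os
from typing import List, Sequence


def shard_for_this_job(items: Sequence[str]) -> List[str]:
    """Return the items this job should process.

    Assumes SKYPILOT_JOB_RANK is zero-based and less than SKYPILOT_NUM_JOBS.
    The defaults let the script also run outside a pool as a single job.
    """
    rank = int(os.environ.get('SKYPILOT_JOB_RANK', '0'))
    num_jobs = int(os.environ.get('SKYPILOT_NUM_JOBS', '1'))
    if not 0 <= rank < num_jobs:
        raise ValueError(f'rank {rank} is outside [0, {num_jobs})')
    # Strided split: job r takes items r, r + num_jobs, r + 2 * num_jobs, ...
    return list(items[rank::num_jobs])


if __name__ == '__main__':
    inputs = [f'file-{i}.jsonl' for i in range(10)]
    print(shard_for_this_job(inputs))
```

A strided split keeps shard sizes within one item of each other without any coordination between jobs.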

Dashboard Improvements

  • Lazy-loads cluster history, speeding up the cluster page by 20x for users with many clusters (#7215).
  • Added an upgrade banner that automatically displays when the API server is undergoing maintenance (#7722).
  • Enabled automatic login to the dashboard when basic authentication credentials are in the URL (#7701).
  • Dashboard now pauses data refresh when the browser tab is hidden to save resources (#7709).
  • Added workspace-aware filtering to the infrastructure page (#7129).
  • Enhanced filtering (by GPU, infra, role) and deduplication for the users page (#7704, #7723, #7758, #7771, #7777, #7784, #7783).
  • Added total accelerator count to the users page (#7754) and "Pool Version" to the pools page (#7409).
  • Added monitoring for SSH connection latency and counts (#7687, #7806).
  • Fixed various dashboard UI bugs, race conditions, and styling issues (#7248, #7322, #7721, #7753, #7795, #7811).

Storage & File Operations

  • Fixed sky storage delete CLI command which failed to delete cloud buckets (#7279).
  • Fixed R2 (Cloudflare) credential mounting in Kubernetes Helm deployments (#7125).
  • Managed jobs now fail fast on storage/bucket errors instead of getting stuck in PENDING (#7519).
  • Fixed Azure storage mounting failures on Ubuntu 18.04 (#7531) and shim overwrite bugs (#7818).
  • Fixed a bug preventing the syncing of empty directories (#7438).
  • Fixed fusermount failures for storage names containing "-u" (#7424).
  • Fixed rclone installation failures on systems without yum (#7740).
  • Reduced upload chunk size to 100MB to fix 413 errors with Cloudflare proxies (#7391).
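
To illustrate the chunk-size cap (#7391), here is a minimal sketch of splitting an upload into pieces that each stay at or below a fixed size, so no single request body exceeds a proxy's limit. The `chunk_ranges` helper is hypothetical, and reading "100MB" as 100 * 1024 * 1024 bytes is an assumption; SkyPilot's own upload code may differ.

```python
from typing import Iterator, Tuple

# Assumption: "100MB" is interpreted as 100 MiB here.
CHUNK_SIZE_BYTES = 100 * 1024 * 1024


def chunk_ranges(total_size: int,
                 chunk_size: int = CHUNK_SIZE_BYTES
                ) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that cover total_size bytes in order."""
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    offset = 0
    while offset < total_size:
        # The last chunk may be shorter than chunk_size.
        length = min(chunk_size, total_size - offset)
        yield offset, length
        offset += length
```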

Enhanced Cloud Platform Support

New Cloud Provider Support

  • Added Seeweb as a new, fully integrated cloud provider (#6884, #7191).
  • Added support for AWS ARM instances (e.g., Graviton) with automatic architecture detection and custom ARM images (#7104, #7105).
  • Enabled dynamic Docker login authentication for private registries, including AWS ECR (#4871).

Kubernetes Improvements

  • Added support for allowed_contexts: all to simplify setup in controlled environments (#7196).
  • Added robust pod preemption detection (e.g., from Kueue) and automatic pod recovery on re-launch (#7726, #7553).
  • Improved reliability by adding apt mirror failover (#7036), fixing UV environment activation (#7061, #7230, #7259), and handling MOTD in containers (#7174).
  • Fixed several bugs related to .failed file generation to improve setup debugging (#7207, #7208, #7218, #7219).
  • Fixed SSH key race conditions in multi-user environments by embedding keys in pod specs (#7454).
  • Fixed SSH connection issues for in-cluster Sky clusters (#7733).
  • Fixed orphaned service resources on cluster termination (#7634).
  • Now correctly uses K8s resource requests (not limits) for worker calculations (#7053).
  • Default CPU/memory now scales proportionally with multi-GPU requests (#7145).
  • Added support for raw and canonicalized accelerator names (#6978).

AWS

  • Added support for per-workspace AWS profiles, enabling multi-account setups (#7781).
  • Fixed Docker SSH connections for private VMs that require a jump-host proxy (#7415).
  • Fixed AWS SSM auto-defaulting to respect explicit use_ssm: false configuration (#7387).
  • Removed duplicate p4de entries from the AWS catalog (#7108).
  • Added SSM Session Manager Plugin to Docker images (#7346).

Other Clouds

  • Fixed multi-node cluster launches on Lambda Cloud (#7097) and improved its Docker support (#7179).
  • Fixed GCP crashes caused by malformed API project metadata (#7088).
  • Fixed distributed training on Nebius by adding hostname entries to /etc/hosts (#7773).
  • Fixed RunPod volume listing (#7386) and credential checks (#7320).
  • Fixed VAST provisioner bug for instances without names (#7141).
  • Fixed Cudo credential check bug (#7449).

API Server Improvements

  • Added a dedicated thread pool for server requests to prevent high-concurrency hangs, especially from log streaming (#7599, #7600).
  • Fixed a critical resource leak in AsyncFileLock that occurred when async operations were cancelled (#7627, #7664).
  • Added support for server-log in Helm deployments for better observability (#7226).
  • Fixed an issue where OAuth2 authentication would hang when used with GCE ingress and other load balancers (#7713).
  • Fixed a bug where SSH keys were lost on API server rolling updates (#7650).
  • Fixed a race condition in High Availability (HA) mode for managed jobs (#7720).
  • Fixed a connection already closed error when using PostgreSQL (#7584).
  • Fixed a bug where sky jobs logs would fail with ClusterNotUpError due to stale cache (#7585).
  • Moved volume validation server-side, removing K8s dependencies for clients (#7370, #7384).
  • Added SSH latency measurement for K8s proxy (#7538).
  • Ensures server dependencies are installed when using cloud-specific extras like skypilot[aws] (#7646).
  • Numerous database performance optimizations for faster API responses and reduced load (#7076, #7220, #7246, #7261, #7354, #7389, #7392, #7439, #7470, #7475, #7478, #7480, #7526, #7527, #7555, #7557, #7569, #7576, #7647, #7648, #7653, #7658, #7665, #7670, #7676, #7697, #7702, #7741, #7742, #7780).
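
The idea behind a dedicated thread pool for server requests (#7599, #7600) can be sketched with the standard library: long-lived blocking calls such as log streaming run in one executor, and short requests run in another, so a saturated streaming pool cannot starve them. The pool names, sizes, and stand-in functions below are hypothetical and do not reflect the API server's actual code.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple

# Hypothetical pools: one for log streaming, one reserved for short requests.
LOG_POOL = ThreadPoolExecutor(max_workers=2, thread_name_prefix='logs')
REQUEST_POOL = ThreadPoolExecutor(max_workers=2, thread_name_prefix='req')


def stream_logs(seconds: float) -> str:
    """Stand-in for a long-lived, blocking log-streaming call."""
    time.sleep(seconds)
    return threading.current_thread().name


def handle_request() -> str:
    """Stand-in for a short request such as a status query."""
    return threading.current_thread().name


async def main() -> Tuple[str, float]:
    loop = asyncio.get_running_loop()
    # Queue more streaming calls than LOG_POOL has workers to saturate it.
    streams = [
        loop.run_in_executor(LOG_POOL, stream_logs, 0.2) for _ in range(4)
    ]
    start = time.monotonic()
    # The short request uses its own pool, so it does not queue behind logs.
    worker = await loop.run_in_executor(REQUEST_POOL, handle_request)
    waited = time.monotonic() - start
    await asyncio.gather(*streams)
    return worker, waited


if __name__ == '__main__':
    worker, waited = asyncio.run(main())
    print(f'request served by {worker} after {waited:.3f}s')
```

If both kinds of work shared a single two-worker pool, the short request would have to wait for a streaming call to finish before it could start.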

Documentation & Examples

Testing & CI/CD & Development

Developer Experience


Installation

Install or upgrade to v0.10.5 using pip:

pip install -U "skypilot[all]==0.10.5"

Or with uv:

uv pip install -U "skypilot[all]==0.10.5"

For API server deployments using Helm:

NAMESPACE=skypilot
RELEASE_NAME=skypilot
VERSION=0.10.5

helm repo update skypilot
helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot \
  --set apiService.image=berkeleyskypilot/skypilot:$VERSION \
  --version $VERSION --devel --reuse-values

Contributors

Thank you to all contributors who made this release possible! 🎉

New Contributors

All Contributors

@bobokvsky, @coopslarhette, @massaindustries, @mluogh, @eric-czech, @pokgak, @Elden123, @jmalukaite, @SamuelMarks, @brianstrauch, @lynnliu030, @EricBryann, @Michaelvll, @romilbhardwaj, @concretevitamin, @andylizf, @alex000kim, @cg505, @kevinmingtarja, @zpoint, @aylei, @rohansonecha, @SeungjinYang, @DanielZhangQD, and the entire SkyPilot community.

Special thanks to the community for the bug reports, feature requests, and pull requests that helped make this release more robust and feature-rich. This release includes 400+ merged PRs from dozens of contributors! Thanks also to @alex000kim for helping with these release notes!


Full Changelog

For a complete list of all changes: v0.10.3...v0.10.5