SkyPilot v0.10.5: Major Managed Jobs Efficiency Improvement, UX and SDK Usability Enhancement, and API Server Robustness Fixes
This release focuses on production stability, performance optimization, and fixing critical bugs that affected reliability in multi-user and Kubernetes environments. Key highlights include major performance improvements for managed jobs and the dashboard, resolution of race conditions and resource leaks, and expanded cloud/accelerator support.
This release includes 400+ merged pull requests with high-priority critical fixes and additional improvements spanning bug fixes, performance enhancements, new features, and comprehensive documentation updates.
Highlights
Managed Job Efficiency and Robustness Improvement
- 18x more jobs with the same jobs controller size – 2000+ parallel jobs on an 8-CPU controller (#7051, #7371, #7379, #7408, #7432, #7473, #7488, #7487, #7494, #7519, #7585, #7595).
- [Beta] Avoid separate job controller with the new consolidation_mode (#7127, #7396, #7459, #7122, #7498, #7601, #7560, #7619, #7717, #7720).
  - Consistent credentials across API server and jobs controller
  - 6x faster job submission, sky jobs launch
# config.yaml
jobs:
  controller:
    consolidation_mode: true # Currently defaults to False.
Performance and Robustness Improvement at a Large Scale
- 20x cluster status speedup at large scale (#7215, #7220, #7224, #7246, #7389, #7453, #7282, #7261, #7076, #7676).
- 6x jobs query and dashboard speedup (#7463, #7458, #7780, #7798, #7453).
API Server Robustness Improvement
- Significantly reduce memory consumption and avoid OOM for API server (#7240).
Python SDK Usability Improvements
- New Python SDK examples to easily scale out your jobs on any infra (#7335).
import sky
from sky import jobs as managed_jobs
for i in range(100):
    resource = sky.Resources(accelerators='A100:8')
    task = sky.Task(resources=resource,
                    workdir='.',
                    run='python batch_inference.py')
    managed_jobs.launch(task, name=f'hello-{i}')
- Admin policy robustness and better usability (#7827).
Integration
- CoreWeave managed Kubernetes and storage support (#7756, #6519, #7759).
- AMD GPU support and GPU metrics (#6944).
What's New
Additional Performance Improvements
- Migrated managed jobs and SkyServe to a gRPC architecture, reducing P95 latency by up to 86% and CPU usage by 20-30% (#7245, #6647, #6702, #7184).
- Optimized Kubernetes pod processing with streaming JSON, providing 2x speedup and 50% memory reduction for large clusters (#7469).
- Optimized database queries for sky status and API endpoints, improving mean response time by 10-35% (#7689, #7690, #7665, #7705, #7708).
- Replaced standard JSON with high-performance orjson for API serialization, improving tail latency for large payloads (#7734).
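As a rough illustration of the serialization swap above, the sketch below wraps orjson.dumps with a fallback to the standard library. The helper name dumps_bytes is hypothetical and is not taken from SkyPilot's code; the one behavioral difference a caller must handle is that orjson returns bytes rather than str.

```python
import json

try:
    import orjson  # third-party package; optional in this sketch
except ImportError:
    orjson = None


def dumps_bytes(payload):
    """Serialize payload to UTF-8 JSON bytes (hypothetical helper)."""
    if orjson is not None:
        # orjson.dumps returns bytes directly.
        return orjson.dumps(payload)
    # Fallback: json.dumps returns str, so encode it to match.
    return json.dumps(payload, separators=(',', ':')).encode('utf-8')


body = dumps_bytes({'clusters': [{'name': 'dev', 'nodes': 2}]})
```

Returning bytes in both branches keeps the calling code identical whether or not orjson is installed.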
Critical Stability & Security Fixes
- Fixed critical race conditions in Docker operations ("container name already in use") (#7030), API server start/stop (#7534), and asynchronous job cancellation, preventing cluster leaks (#7511).
- Resolved critical Kubernetes resource leaks, including semaphore file leaks (/dev/shm exhaustion) (#7678) and fusermount-server leaks (#7398).
- Fixed system hangs caused by SSH threads (#7202), K8s SSH proxy blocking (#7537), and BrokenProcessPool crashes when canceling sky logs (#7607).
- Addressed a critical OOM bug in the server caused by inefficient GPU name canonicalization in Kubernetes (#7080).
- Fixed security/privacy bugs where pending jobs were visible across users (#7581) and environment variables leaked between managed jobs (#7459).
- Prevented accidental cancellation of the wrong request by requiring exact ID matches when prefixes are ambiguous (#7730).
- Replaced insecure curl | sh pattern for fluent-bit installation with official package repositories (#7126).
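The ambiguous-prefix rule for request cancellation can be pictured with a minimal sketch. The function resolve_request_id and the sample IDs are hypothetical; the assumption here is that a unique prefix is still accepted, while a prefix matching several requests is rejected in favor of an exact ID.

```python
def resolve_request_id(query, known_ids):
    """Return the single request ID selected by query (hypothetical helper)."""
    # An exact match always wins, even if it is also a prefix of other IDs.
    if query in known_ids:
        return query
    matches = [rid for rid in known_ids if rid.startswith(query)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise KeyError(f'No request matches {query!r}')
    # Several candidates: refuse to guess and ask for the full ID.
    raise ValueError(
        f'Prefix {query!r} is ambiguous ({len(matches)} matches); '
        'use the full request ID')


ids = ['abc123', 'abc456', 'def789']
unique = resolve_request_id('def', ids)
exact = resolve_request_id('abc123', ids)
```

Raising on ambiguity, rather than picking the first match, is what keeps a short prefix from cancelling an unrelated request.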
UX Improvement
- Improved CLI error messages for Kubernetes resource constraints (#6814) and multi-node setup failures (#7001).
- Enhanced sky down output to show a clear summary when multiple clusters fail or succeed (#7225, #7635).
- Added a CLI spinner for short-running requests (like sky status) to show activity when the server is under load (#7631).
- Added sky ui as a shorter alias for sky dashboard and sky volume as an alias for sky volumes (#7565, #7746).
- SDK stream_and_get can now retrieve the latest request without an ID (#6965).
- Improved provision log hint to show the user-friendly sky logs --provision <cluster> command (#7682).
- Added SKYPILOT_NUM_JOBS environment variable for pools to complement SKYPILOT_JOB_RANK (#7542).
- Dynamically scale worker pools using sky jobs apply --workers <n> (#7236).
- Launch multiple local kind clusters with custom names via sky local up --name (#7244, #7394).
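One way a pool job might use SKYPILOT_NUM_JOBS together with SKYPILOT_JOB_RANK is to pick its own share of a work list. This is a sketch under stated assumptions: the rank is taken to be 0-indexed, SKYPILOT_NUM_JOBS is taken to be the total number of jobs, and shard_for_this_job is a hypothetical helper, not part of the SDK.

```python
import os


def shard_for_this_job(items):
    """Return the slice of items this job should process (hypothetical helper)."""
    # Assumption: rank is 0-indexed and SKYPILOT_NUM_JOBS is the total job count.
    # The defaults make the function a no-op when run outside a pool.
    rank = int(os.environ.get('SKYPILOT_JOB_RANK', '0'))
    num_jobs = int(os.environ.get('SKYPILOT_NUM_JOBS', '1'))
    # Strided slicing gives each rank a disjoint, roughly equal share.
    return items[rank::num_jobs]


my_inputs = shard_for_this_job([f'input-{i}.jsonl' for i in range(10)])
```

Strided slicing needs no coordination between jobs: every job computes its own share from the two variables alone.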
Dashboard Improvements
- Lazy-loads cluster history, speeding up the cluster page by 20x for users with many clusters (#7215).
- Added an upgrade banner that automatically displays when the API server is undergoing maintenance (#7722).
- Enabled automatic login to the dashboard when basic authentication credentials are in the URL (#7701).
- Dashboard now pauses data refresh when the browser tab is hidden to save resources (#7709).
- Added workspace-aware filtering to the infrastructure page (#7129).
- Enhanced filtering (by GPU, infra, role) and deduplication for the users page (#7704, #7723, #7758, #7771, #7777, #7784, #7783).
- Added total accelerator count to the users page (#7754) and "Pool Version" to the pools page (#7409).
- Added monitoring for SSH connection latency and counts (#7687, #7806).
- Fixed various dashboard UI bugs, race conditions, and styling issues (#7248, #7322, #7721, #7753, #7795, #7811).
Storage & File Operations
- Fixed sky storage delete CLI command which failed to delete cloud buckets (#7279).
- Fixed R2 (Cloudflare) credential mounting in Kubernetes Helm deployments (#7125).
- Managed jobs now fail fast on storage/bucket errors instead of getting stuck in PENDING (#7519).
- Fixed Azure storage mounting failures on Ubuntu 18.04 (#7531) and shim overwrite bugs (#7818).
- Fixed a bug preventing the syncing of empty directories (#7438).
- Fixed fusermount failures for storage names containing "-u" (#7424).
- Fixed rclone installation failures on systems without yum (#7740).
- Reduced upload chunk size to 100MB to fix 413 errors with Cloudflare proxies (#7391).
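To picture why a smaller chunk size avoids HTTP 413 responses, the sketch below splits a stream into bounded pieces so that no single request body exceeds the limit. It is not SkyPilot's upload code: iter_chunks is a hypothetical helper, and reading "100MB" as 100 * 1024 * 1024 bytes is an assumption.

```python
import io

# Assumption: "100MB" is read here as 100 MiB.
CHUNK_SIZE = 100 * 1024 * 1024


def iter_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield successive pieces of stream, each at most chunk_size bytes."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# A tiny chunk size keeps the demonstration readable.
pieces = list(iter_chunks(io.BytesIO(b'abcdefg'), chunk_size=3))
```

Each yielded piece would be sent as its own request, so the proxy only ever sees bodies of at most chunk_size bytes.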
Enhanced Cloud Platform Support
New Cloud Provider Support
- Added Seeweb as a new, fully integrated cloud provider (#6884, #7191).
- Added support for AWS ARM instances (e.g., Graviton) with automatic architecture detection and custom ARM images (#7104, #7105).
- Enabled dynamic Docker login authentication for private registries, including AWS ECR (#4871).
Kubernetes Improvements
- Added support for allowed_contexts: all to simplify setup in controlled environments (#7196).
- Added robust pod preemption detection (e.g., from Kueue) and automatic pod recovery on re-launch (#7726, #7553).
- Improved reliability by adding apt mirror failover (#7036), fixing UV environment activation (#7061, #7230, #7259), and handling MOTD in containers (#7174).
- Fixed several bugs related to .failed file generation to improve setup debugging (#7207, #7208, #7218, #7219).
- Fixed SSH key race conditions in multi-user environments by embedding keys in pod specs (#7454).
- Fixed SSH connection issues for in-cluster Sky clusters (#7733).
- Fixed orphaned service resources on cluster termination (#7634).
- Now correctly uses K8s resource requests (not limits) for worker calculations (#7053).
- Default CPU/memory now scales proportionally with multi-GPU requests (#7145).
- Added support for raw and canonicalized accelerator names (#6978).
AWS
- Added support for per-workspace AWS profiles, enabling multi-account setups (#7781).
- Fixed Docker SSH connections for private VMs that require a jump-host proxy (#7415).
- Fixed AWS SSM auto-defaulting to respect explicit use_ssm: false configuration (#7387).
- Removed duplicate p4de entries from the AWS catalog (#7108).
- Added SSM Session Manager Plugin to Docker images (#7346).
Other Clouds
- Fixed multi-node cluster launches on Lambda Cloud (#7097) and improved its Docker support (#7179).
- Fixed GCP crashes caused by malformed API project metadata (#7088).
- Fixed distributed training on Nebius by adding hostname entries to /etc/hosts (#7773).
- Fixed RunPod volume listing (#7386) and credential checks (#7320).
- Fixed VAST provisioner bug for instances without names (#7141).
- Fixed Cudo credential check bug (#7449).
API Server Improvements
- Added a dedicated thread pool for server requests to prevent high-concurrency hangs, especially from log streaming (#7599, #7600).
- Fixed a critical resource leak in AsyncFileLock that occurred when async operations were cancelled (#7627, #7664).
- Added support for server-log in Helm deployments for better observability (#7226).
- Fixed an issue where OAuth2 authentication would hang when used with GCE ingress and other load balancers (#7713).
- Fixed a bug where SSH keys were lost on API server rolling updates (#7650).
- Fixed a race condition in High Availability (HA) mode for managed jobs (#7720).
- Fixed a "connection already closed" error when using PostgreSQL (#7584).
- Fixed a bug where sky jobs logs would fail with ClusterNotUpError due to stale cache (#7585).
- Moved volume validation server-side, removing K8s dependencies for clients (#7370, #7384).
- Added SSH latency measurement for K8s proxy (#7538).
- Ensures server dependencies are installed when using cloud-specific extras like skypilot[aws] (#7646).
- Numerous database performance optimizations for faster API responses and reduced load (#7076, #7220, #7246, #7261, #7354, #7389, #7392, #7439, #7470, #7475, #7478, #7480, #7526, #7527, #7555, #7557, #7569, #7576, #7647, #7648, #7653, #7658, #7665, #7670, #7676, #7697, #7702, #7741, #7742, #7780).
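The class of bug behind the AsyncFileLock leak, a resource that is never released when an awaiting task is cancelled, can be illustrated generically. The sketch below is not the actual fix: tracked_lock and the open_handles counter are hypothetical stand-ins, and the point is only that cleanup placed in a finally block also runs when asyncio.CancelledError propagates.

```python
import asyncio
import contextlib

open_handles = 0  # stands in for open lock-file descriptors


@contextlib.asynccontextmanager
async def tracked_lock(lock):
    """Hold lock and count it as an open handle (hypothetical helper)."""
    global open_handles
    await lock.acquire()
    open_handles += 1
    try:
        yield
    finally:
        # Runs on normal exit and when the holding task is cancelled.
        open_handles -= 1
        lock.release()


async def main():
    lock = asyncio.Lock()

    async def worker():
        async with tracked_lock(lock):
            await asyncio.sleep(60)  # cancelled long before this finishes

    task = asyncio.create_task(worker())
    await asyncio.sleep(0)  # let the worker start and take the lock
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return lock.locked(), open_handles


result = asyncio.run(main())
```

After the cancelled task finishes, the lock is free and the handle count is back to zero, which is the property a leak-free lock needs under cancellation.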
Documentation & Examples
- Added new LLM training examples for NVIDIA NeMo RL (#7247), VERL (#7580), nanochat (#7614), and SkyRL (#7633).
- Added a new example for using the Spyder IDE with SkyPilot (#7437).
- Added SDK examples for distributed PyTorch training (#7393) and the PyTorch quickstart (#7335).
- Published documentation for setting up High Availability (HA) controllers (#5875).
- Added documentation for API compatibility guarantees (#6738) and multi-Kubernetes context configuration (#6717).
- Added guides for using multiple GCP projects with Workspaces (#7775) and CoreWeave with InfiniBand (#7756).
- Improved documentation for private Git repo authentication (#7222), job queue workdirs (#7309), and Cloudflare Zero Trust setup (#7264).
- Reorganized CoreWeave documentation to a first-class provider (#7759).
- Numerous other fixes and clarifications (#7067, #7073, #7086, #7094, #7132, #7134, #7147, #7173, #7253, #7314, #7359, #7361, #7365, #7385, #7388, #7400, #7414, #7418, #7420, #7426, #7427, #7448, #7541, #7547, #7589, #7603, #7612, #7616, #7630, #7644, #7668, #7671, #7804).
Testing & CI/CD & Development
- Added a concurrent workload load testing script for benchmarking the API server (#6972).
- Implemented new smoke tests for vLLM on pools (#7197), Helm upgrades (#7223), pool job queueing/recovery (#7216), and pool job cancellation (#7340, #7728).
- Added database scale testing infrastructure for clusters and managed jobs (#7435, #7455, #7496).
- Added test coverage for memory usage on large file uploads (#7413) and K8s MOUNT_CACHED (#7421).
- Improved CI reliability by fixing flaky tests, enhancing dependency testing, and tuning vulnerability scanning (#6991, #7047, #7049, #7064, #7103, #7111, #7156, #7183, #7186, #7189, #7195, #7227, #7232, #7233, #7235, #7250, #7253, #7272, #7300, #7327, #7329, #7358, #7367, #7372, #7374, #7382, #7383, #7401, #7402, #7425, #7434, #7452, #7465, #7467, #7486, #7497, #7507, #7508, #7510, #7512, #7514, #7520, #7524, #7533, #7539, #7540, #7544, #7554, #7598, #7609, #7618, #7620, #7624, #7651, #7681, #7693, #7718, #7725, #7728, #7729, #7732, #7744, #7751, #7776, #7793, #7794, #7797, #7805, #7808, #7813, #7817, #7825, #7827, #7835, #7836).
Developer Experience
- SDK now supports using git repositories as workdirs programmatically (#7137).
- Replaced ambiguous Dict return types with strongly-typed Pydantic models for SDKs (#6833, #6847, #7404).
- Fixed a bug where async SDK calls were silently dropping parameters (#7102).
- Removed the obsolete CommandGen feature, simplifying the Task API (#7801).
- Upgraded local Ray version requirement to >= 2.6.1 (#7073).
- Numerous internal refactorings to improve code quality, maintainability, and remove technical debt (#7121, #7138, #7310, #7360, #7366, #7412, #7456, #7457, #7506, #7588, #7595, #7629, #7647, #7697, #7706, #7746, #7752, #7783, #7806, #7823).
Installation
Install or upgrade to v0.10.5 using pip:
pip install -U "skypilot[all]==0.10.5"
Or with uv:
uv pip install -U "skypilot[all]==0.10.5"
For API server deployments using Helm:
NAMESPACE=skypilot
RELEASE_NAME=skypilot
VERSION=0.10.5
helm repo update skypilot
helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot \
  --set apiService.image=berkeleyskypilot/skypilot:$VERSION \
  --version $VERSION --devel --reuse-values
Contributors
Thank you to all contributors who made this release possible! 🎉
New Contributors
- @bobokvsky made their first contribution in #7141
- @coopslarhette made their first contribution in #7125
- @massaindustries made their first contribution in #6884
- @mluogh made their first contribution in #7108
- @eric-czech made their first contribution in #7097
- @pokgak made their first contribution in #5923
- @Elden123 made their first contribution in #6685
- @jmalukaite made their first contribution in #7300
- @SamuelMarks made their first contribution in #7603
- @brianstrauch made their first contribution in #7679
- @lynnliu030 made their first contribution in #7633
- @EricBryann made their first contribution in #7740
All Contributors
@bobokvsky, @coopslarhette, @massaindustries, @mluogh, @eric-czech, @pokgak, @Elden123, @jmalukaite, @SamuelMarks, @brianstrauch, @lynnliu030, @EricBryann, @Michaelvll, @romilbhardwaj, @concretevitamin, @andylizf, @alex000kim, @cg505, @kevinmingtarja, @zpoint, @aylei, @rohansonecha, @SeungjinYang, @DanielZhangQD, and the entire SkyPilot community.
Special thanks to the community for bug reports, feature requests, and pull requests that helped make this release more robust and feature-rich. This release includes 400+ merged PRs from dozens of contributors! Thanks also to @alex000kim for helping with these release notes!
Full Changelog
For a complete list of all changes: v0.10.3...v0.10.5






