Releases: GoogleCloudPlatform/cluster-toolkit
Address bug in updated NVIDIA package causing Slurm job failures
Release v1.51.0
Highlights:
- gpu-health-epilog checks added for A3 High/Mega Slurm blueprints
- New GKE TPU v6e example
What's Changed
Key New Features 🎉
- add GKE version_prefix support as a configurable parameter by @ighosh98 in #4060
- Better integrate, optimize, and control the gIB NCCL RDMA plugin installer by @ndebuhr in #4069
Breaking Changes 🚨
- Bring parity of functionality to both A3U and A4 by @samskillman in #4023
Module Improvements 🔨
- Make additional disk params optional by @annuay-google in #4056
- Managed Lustre Data Import (Hydration) by @cdunbar13 in #4045
- Add developer key option for slurm by @cdunbar13 in #4094
Improvements 🛠
- add gpu-health-epilog check to the A3 High and A3 Mega Slurm blueprints by @RachaelSTamakloe in #4048
- Update helm manifest for nvidia dra driver by @parulbajaj01 in #4067
- Add TPU V6 Trillium example by @SwarnaBharathiMantena in #4051
- GKE DWS Flex Start consumption option examples by @SwarnaBharathiMantena in #3968
- Add copying prolog and epilog scripts to compute nodes by @gkcalat in #4071
Bug fixes 🐞
- Explicitly define labels at google_compute_instance_from_template by @mr0re1 in #4055
- Remove instance image override at the top of cae.yaml example by @abbas1902 in #4066
- split kubectl installation into two parts to reduce race conditions by @ighosh98 in #4081
- Fix gke a3 mega integration test by @ighosh98 in #4093
- Update default Accelerator Image in A3 Ultra and A4 Slurm blueprints by @tpdownes in #4089
Full Changelog: v1.50...v1.51.0
Release v1.50.0
Highlights
- New blueprints for Managed Lustre attached to VMs and to Slurm clusters (including opt-in solution for A3 Ultra and A4 Slurm blueprints)
- Breaking change: RoCE (RDMA) networks no longer support firewall rules. Older blueprints will fail with a validation warning; the solution is to remove the firewall rules following the examples in 312d7fb.
What's Changed
Key New Features 🎉
- Move from auth/munge to auth/slurm by @harshthakkar01 in #3955
- deprecate kueue v0.11.1 and use v0.11.4 by @ighosh98 in #4026
New Modules 🧱
- Cluster Toolkit - new module for creating Artifact Registries by @scott-nag in #3639
Module Improvements 🔨
- Enable specification of system node pool zones in the GKE Cluster module by @ndebuhr in #3976
- Add support for optional GCS bucket module config by @mohitchaurasia91 in #3990
- fix(htcondor): explicitly set region for cm and ap addresses to match subnetwork region by @rbekhtaoui in #3991
- Remove the broken auto_delete_disk system, and replace it with a working snapshot-based alternative, in the NFS Server module by @ndebuhr in #3887
- add non-queue flex-start support in gke by @chengcongdu in #3995
- Extend slurm_conf_tpl to support raw content by @gkcalat in #4010
- Adding GKE support for Managed Lustre by @cdunbar13 in #4022
Improvements 🛠
- Add a simple XPK blueprint example by @ndebuhr in #3980
- Add unique name for resource policy by @parulbajaj01 in #4002
- update kueue configurations and reservations in a3 mega, ultra and A4 by @ighosh98 in #4017
- Remove k8s service account var from gke-a3U blueprint by @parulbajaj01 in #4024
- disable unattended upgrades in a3u and a4h slurm solutions by @RachaelSTamakloe in #4006
- Add improved MIG lifecycle management for flex by @abbas1902 in #4015
- Add dws flex and spot provisioning options to the A4 example by @abbas1902 in #3945
Deprecations 💤
- Removal of the omnia install module and related content by @cdunbar13 in #4021
Version Updates ⏫
- Update a3ultra to 570 and cuda 12-8 by @samskillman in #3859
Bug fixes 🐞
- Address shelve permissions by @casassg in #3951
- Fixing the kernel upgrade flag for slurm image creation by @cdunbar13 in #4005
- Missed setting that breaks integration test by @cdunbar13 in #4029
- Fix placement_max_distance in slurm partitions by @cdunbar13 in #4030
- A3 Ultra Slurm: workaround temporary driver packaging issue by @tpdownes in #4059
New Contributors
- @casassg made their first contribution in #3951
- @gkcalat made their first contribution in #4019
- @rick154 made their first contribution in #4046
Full Changelog: v1.49.1...v1.50
v1.49.1: Minor Documentation Patch
New Contributors
- @WyattGorman made their first contribution in #3941
Full Changelog: v1.49.0...v1.49.1
v1.49.0
Highlights
- TPU support in the GKE nodepool module, with an example blueprint
- Support for Managed Lustre in pre-existing-network-storage module; Managed Lustre provisioning will be supported in a future Toolkit release
What's Changed
Key New Features 🎉
- add nvidia imex support by @ighosh98 in #3885
- TPU support with GKE nodepool module and TPU v4 2x2x2 example blueprint by @SwarnaBharathiMantena in #3817
- helm_install module implemented by @ighosh98 in #3933
- integrate support for multi-arch compliant jobset v0.8.1 by @ighosh98 in #3934
- Update vm-instance to support additional persistent disks by @tpdownes in #3935
- add support for workload policy by @ighosh98 in #3938
Breaking Changes 🚨
- Make login nodes deployable independently of "controller" by @mr0re1 in #3958
  NOTE: Attempting to re-deploy a pre-existing Slurm cluster with the new gcluster version will cause login nodes to be destroyed.
- The DWS Flex implementation changes with this release. If you would like to continue using the legacy implementation, we've added the use_bulk_insert option to our dws_flex nodeset settings. For more on DWS Flex support in Slurm, visit: https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/docs/slurm-dws-flex.md
New Modules 🧱
- Adding Managed Lustre to Cluster Toolkit by @cdunbar13 in #3950
Module Improvements 🔨
- [GKE] Add support to enable DNS based endpoint config by @mohitchaurasia91 in #3884
- Update filestore timeout config based on high capacity tier by @mohitchaurasia91 in #3900
- split terraform bundles for gpu operator by @ighosh98 in #3911
- split crd manifest out by @ighosh98 in #3913
- Add support for filestore instance description by @mohitchaurasia91 in #3953
- Fix Packer documentation for minimum necessary IAM roles by @tpdownes in #3960
- Fix workload_policy variable definition and usage. by @mohitchaurasia91 in #3963
- Add Managed Lustre to pre-existing-filestore module by @cdunbar13 in #3937
Improvements 🛠
- Update GKE version prefix for A3 Mega to v1.32.2 by @ighosh98 in #3874
- Add disk size vars for A4 by @parulbajaj01 in #3872
- Add comment description for variables in a4 blueprint by @parulbajaj01 in #3880
- Add kueue configuration support to a3 mega by @ighosh98 in #3860
- Update dra driver module by @ighosh98 in #3894
- Add MIG based DWS Flex support by @abbas1902 in #3903
- A4 Slurm: enable sudo in Slurm jobs for users with OS Admin Login role by @tpdownes in #3961
- Add sudo via OS Login to all A3 Slurm solutions by @RachaelSTamakloe in #3966
Bug fixes 🐞
- Fix filestore instance location var for REGIONAL tier by @mohitchaurasia91 in #3871
- add GCS updates to A3 Ultra by @ighosh98 in #3883
- GPU Operator Integration Redesign by @ighosh98 in #3892
- Add resource quota for gpu operator by @ighosh98 in #3895
- Update imex and nvidia DRA Driver configurations by @ighosh98 in #3902
- Update nccl installer for a4 by @ighosh98 in #3906
- Fix syntax errors for resource policy by @ighosh98 in #3954
- Revert network profile URI by @cdunbar13 in #3962
Full Changelog: v1.48.1...v1.49.0
v1.48.1: AI/ML documentation update
Release v1.48.1 updates our documentation to better guide users to Google Cloud AI Hypercomputer solutions implemented by the Cluster Toolkit for GKE and for Slurm.
Commits were merged in the following pull requests:
Full Changelog: v1.48.0...v1.48.1
Release v1.48.0
Highlights
- The GKE nodepool module of Toolkit has been updated to support multiple nodepools. (PR#3826)
- Automatic Prolog/Epilog Slurm GPU Health Checks
- Kueue v0.11.1 manifest support
What's Changed
Key New Features 🎉
- Add a4-high-vm blueprint by @samskillman in #3751
- Cloud DNS config addition to GKE Cluster module by @SwarnaBharathiMantena in #3752
- Update Slurm image reference to new family (6-9) by @abbas1902 in #3740
- Adding Automatic Prolog/Epilog Slurm GPU Health Checks by @RachaelSTamakloe in #3781
- Created GPU Operator Manifest by @ighosh98 in #3814
- add kueue v0.11.1 manifest by @ighosh98 in #3833
- Support resource manager tags on instance template and attached disks by @annuay-google in #3829
- introduce feature to enable k8s beta apis by @ighosh98 in #3840
- Add support for Kueue 0.11.1 by @mwysokin in #3830
- Integrate gpu operator in kubectl by @ighosh98 in #3838
Module Improvements 🔨
- add support for enablePrivateNode at nodepool level by @chengcongdu in #3794
- Add nodeset name as a label to all nodeset instance templates by @annuay-google in #3787
- Fix network names backward compatibility for A3 Mega and A3 High by @sharabiani in #3811
- Support multiple nodepools creation in gke nodepool module by @SwarnaBharathiMantena in #3826
- Enable higher performance self-managed NFS server configurations by @ndebuhr in #3807
- add support for resource quota in gpu-operator namespace by @ighosh98 in #3855
Improvements 🛠
- A4 GKE integration test by @annuay-google in #3718
- A3U Slurm: enable nvidia-persistenced daemon by @tpdownes in #3698
- Remove experimental tag from GKE blueprints in readme by @parulbajaj01 in #3724
- NCCL integration tests by @annuay-google in #3697
- Add NeMo and HPL Slurm GCS System Benchmarks with Ramble by @samskillman in #3726
- GCS update to GKE A4 High blueprint by @SwarnaBharathiMantena in #3749
- Update Kueue documentation by @ighosh98 in #3786
- Add comment descriptions for a3U vars by @parulbajaj01 in #3783
- Improve job template naming by @ighosh98 in #3816
- Update GPU Operator manifest definition by @ighosh98 in #3820
- Unify gke a3 ultra blueprints by @ighosh98 in #3835
- Advanced network configuration support on notebook instance community module by @caetano-colin in #3671
- updating defaults for slurm chs prolog by @RachaelSTamakloe in #3843
- Add disk size vars in deployment file for a3U by @parulbajaj01 in #3812
- Update A3U slurm threads configuration by @ighosh98 in #3853
Deprecations 💤
- Reduce number of startup scripts. by @mr0re1 in #3770
- Add omnia deprecation warning and update A3U and A4 blueprints threads configurations by @ighosh98 in #3837
Bug fixes 🐞
- Fix issue 3748 (Error with stateful_ips iteration in MIG) by @rbekhtaoui in #3765
- Update urls to point to toolkit main by @ighosh98 in #3793
- Rollback name injection change in job template by @ighosh98 in #3821
- Add Rocky 9 compatibility for NFS by @samskillman in #3813
- fix gke a3-ultra blueprint by @ighosh98 in #3845
- Force retry use of gce service account by @samskillman in #3854
New Contributors
- @yuryu made their first contribution in #3728
- @DavidToneian-Google made their first contribution in #3766
- @rbekhtaoui made their first contribution in #3765
- @Shuang-cnt made their first contribution in #3799
- @sheepx86 made their first contribution in #3832
- @caetano-colin made their first contribution in #3671
- @ndebuhr made their first contribution in #3807
- @mwysokin made their first contribution in #3830
Full Changelog: v1.47.0...v1.48.0
Release v1.47.0
Highlights
- Toolkit adds support for the A4 machine family, including blueprints and documentation for GKE and Slurm
- DWS Flex support introduced for GKE
- Support for persistent Slurm controller state
What's Changed
Key New Features 🎉
- Support DWS Flex on GKE by @SwarnaBharathiMantena in #3636
- GCSFuse cache enabled a3-mega blueprint by @koallison in #3460
- Add controller save state disk by @alyssa-sm in #3661
- Toolkit GKE now supports the A4 machine family. Blueprints and documentation are now available. by @SwarnaBharathiMantena and @annuay-google in #3703 #3704 #3702 #3718 #3705 #3656 #3657 #3719
- Add A4 slurm blueprints by @harshthakkar01 in #3709
Module Improvements 🔨
- Support Kueue 0.10.1 by @annuay-google in #3620
- enable queued provisioning on gke nodepool by @SwarnaBharathiMantena in #3582
- update GPU and disk definitions for A4 by @SwarnaBharathiMantena in #3657
Improvements 🛠
- Remove explicitly stated mtu in gke-a3-ultragpu as default mtu has sa… by @parulbajaj01 in #3619
- Add cuda-toolkit to a3ultra-jbvms blueprint by @RachaelSTamakloe in #3615
- Add missing indexes and a3U documentation in readme for gke blueprints by @parulbajaj01 in #3599
- Add variable for k8s service account and remove hardcoded value by @parulbajaj01 in #3634
- Set defaults for GPU driver, disk type and Jobset version for A3U blueprints by @annuay-google in #3679
- Standardize naming prefixes for kubernetes network objects by @parulbajaj01 in #3644
- Remove autoscaling max nodes from A3H and A3M tests by @parulbajaj01 in #3696
- A4 GKE integration test by @annuay-google in #3718
Deprecations 💤
- Deprecate gke topology scheduler by @annuay-google in #3678
Version Updates ⏫
- Update NeMo framework examples to 24.12 by @akiki-liang0 in #3616
- Pin to latest TPG v6.20.0 minor release by @abbas1902 in #3669
Bug fixes 🐞
- Fixes HPL benchmark test due to WARMUP_END_PROG environment variable. by @samskillman in #3631
- Increase google and google-beta provider versions for GKE cluster by @annuay-google in #3635
- Fix guest accelerator (broken for GKE) by @annuay-google in #3656
- Enable NVIDIA DCGM in A3 Ultra Slurm blueprint by @tpdownes in #3673
- Fix htcondor config by @lemaitre-aneo in #3664
- Fix ordering of local SSD mounting and docker by @tpdownes in #3682
Full Changelog: v1.46.1...v1.47.0
v1.46.1: Fix cloud rdma ofi tunables always being set
What's Changed
Bug fixes 🐞
- Switch ofi startup script to not run automatically by @abbas1902 in #3659
Full Changelog: v1.46.0...v1.46.1
Release v1.46.0
Highlights:
- Kueue becomes the officially supported workload scheduler for A3U.
- New blueprints added for A3U (GKE/GCS) as well as H4d VMs and Slurm examples.
- SlurmGCP module enhanced with advanced machine features/plugins deprecated.
What's Changed
Key New Features 🎉
- Promote Kueue as the only workload scheduling solution for A3U and adopt the same in NCCL tests by @annuay-google in #3534
- Add a3u-gke-gcs blueprint by @samskillman in #3454
- Add H4d vm blueprint by @abbas1902 in #3578
- Add H4d Slurm cluster example by @abbas1902 in #3586
Module Improvements 🔨
- SlurmGCP. Deprecate enable_smt, add advanced_machine_features by @mr0re1 in #3525
- Adding a note to use of Private Service Access module by @RachaelSTamakloe in #3527
- Remove unnecessary constraint on static+dynamic with placement by @mr0re1 in #3574
- Adding var.provisioning_model to vm-instance module by @RachaelSTamakloe in #3588
Improvements 🛠
- Fixing validation block error in Filestore module by @RachaelSTamakloe in #3528
- Updating location for gke-managed-hyperdisk test by @RachaelSTamakloe in #3536
- Updating A3 Ultra JBVM blueprint README documentation by @RachaelSTamakloe in #3552
- Untangle exclusive and enable_placement, update examples. by @mr0re1 in #3362
- Added support for clean target to remove gcluster binary and the ghpc… by @nadig-google in #3577
- Improve GKE service account posture by aligning with GKE best practices by @parulbajaj01 in #3571
- Adding ray operator addon support for GKE cluster creation by @raushan2016 in #3584
- Add integration test for a3ultra-vm.yaml by @RachaelSTamakloe in #3579
- Replace Spack w/ Enroot/pyxis for NCCL tests by @samskillman in #3589
- Fixed naming for deployment name in test to fit in size limit by @parulbajaj01 in #3591
- Upgrade default kueue version to v0.10.0 by @ighosh98 in #3581
Version Updates ⏫
- Update Minimum terraform version in Makefile from 1.2 to 1.5.7 by @RachaelSTamakloe in #3526
- update minimum terraform version to support check block by @SwarnaBharathiMantena in #3565
- Update NCCL plugin to v1.0.3 in A3U by @akiki-liang0 in #3594
New Contributors
- @guspan-tanadi made their first contribution in #3562
- @nadig-google made their first contribution in #3577
- @raushan2016 made their first contribution in #3584
Full Changelog: v1.45.1...v1.46.0