Releases: GoogleCloudPlatform/cluster-toolkit
v1.72.0
What's Changed
Key New Features 🎉
- Integrating Managed Lustre in TPU v6e by @shubpal07 in #4814
- Support sycomp storage by @gqiu-sycomp-com in #4798
Breaking Changes 🚨
- Enable Private Nodes by default in GKE Node Pool by @kadupoornima in #4682
Module Improvements 🔨
- Add tpu_topology as an output value for workload_policy by @agrawalkhushi18 in #4813
Improvements 🛠
- Refactor a4xhigh-slurm-blueprint.yaml by moving epilog and prolog to slurm-gcp by @Neelabh94 in #4733
- Update nccl-rdma manifest in gke a4x by @parulbajaj01 in #4817
- Adding integration test for GKE A4X by @vikramvs-gg in #4828
Bug fixes 🐞
- Fix default mount paths in Slurm controller README.md by @nikosavola in #4779
New Contributors
- @sudheer-quad made their first contribution in #4791
- @gqiu-sycomp-com made their first contribution in #4798
Full Changelog: v1.71.0...v1.72.0
v1.71.0
What's Changed
Module Improvements 🔨
- Adding validations for naming resources by @vikramvs-gg in #4788
Improvements 🛠
- Add Managed Lustre support in gke-a4x blueprint by @parulbajaj01 in #4793
Full Changelog: v1.70.0...v1.71.0
v1.70.0
What's Changed
Breaking Changes 🚨
- Removing support for maintenance_interval for reservations created by TAMs by @LAVEEN in #4748
- Migration of JobSet from static manifests to Helm chart and upgrading version to 0.10.1 by @shubpal07 in #4765
Module Improvements 🔨
- Add automated TPU support and GCS integration in TPU v6 blueprint by @shubpal07 in #4755
Improvements 🛠
- H4d blueprint refactored by @rachit-google in #4740
Full Changelog: v1.69.0...v1.70.0
v1.69.0
What's Changed
Key New Features 🎉
- Add NUMA-aware scheduling in GKE clusters (enabled for G4) by @kadupoornima in #4760
- Add daily PR integration tests for G4 machines by @kadupoornima in #4761
Improvements 🛠
- Adding GKE sample for running nvidia-bug-report by @raushan2016 in #4741
- PSA update by @okrause in #4744
New Contributors
- @aslam-quad made their first contribution in #4742
Full Changelog: v1.68.0...v1.69.0
v1.68.0
What's Changed
Key New Features 🎉
- Downloading libnccl2 and libnccl-dev for a3u and a4h by @rachit-google in #4680
Breaking Changes 🚨
- Allow setting use_job_duration with non-exclusive partitions by @arpit974 in #4696
- Add multi-network support in TPU v6e by @agrawalkhushi18 in #4723
- Update vpc and cloud_router versions in VPC network module by @kadupoornima in #4732
Module Improvements 🔨
- Refactoring in gke persistent module by @vikramvs-gg in #4618
- Migrate Kueue installation to use Helm chart by @shubpal07 in #4542
Improvements 🛠
- Update nvidia DRA driver version to v25.3.0 by @parulbajaj01 in #4670
- Updated A3-mega and A4-high Slurm blueprints to adopt the nvidia add-repository script by @rachit-google in #4667
- Update H4D blueprint: disable automatic updates, provide image info, and delete duplicate filestore by @Neelabh94 in #4644
- Add Managed Lustre support in gke-a4 by @parulbajaj01 in #4654
- Add Managed Lustre support in gke a3 ultra by @parulbajaj01 in #4700
- Adds an irdma health check to h4d nodes by @samskillman in #4704
- Enable Spot VM Provisioning For H4D by @LAVEEN in #4735
- Add slurm-gke blueprint by @ACW101 in #4607
Bug fixes 🐞
- Remove superfluous addition of chs logs to cloud ops config by @abbas1902 in #4679
- Adding "datacenter-gpu-manager-4-dev" as an additional installation in A* YAML files by @Neelabh94 in #4623
- Minor bug fix on MFT version comparison by @ljqg in #4689
- Fix inconsistent plan on Slurm cluster reconfigure by @wiktorn in #4538
- Update process to filter out starting comments in a source yaml file by @SwarnaBharathiMantena in #4707
- Fix gke build failures by @annuay-google in #4708
- Update machine-learning/a3-ultragpu-8g/nemo-framework to fix segmentation fault error by @SwarnaBharathiMantena in #4725
New Contributors
- @mufaqam-gcl made their first contribution in #4688
- @wtempel made their first contribution in #4705
- @nikosavola made their first contribution in #4720
- @ACW101 made their first contribution in #4607
Full Changelog: v1.67.0...v1.68.0
Release v1.67.0
What's Changed
Module Improvements 🔨
- Added nvidia-repositories script by @rachit-google in #4553
Improvements 🛠
- Install NCCL/gIB .deb and .rpm packages for A3U and A4 by @rachit-google in #4543
- Updating example to use JAX AI images by @pulasthi in #4575
- Enabling Spot VM For A3 Mega/High by @LAVEEN in #4634
Full Changelog: v1.66.0...v1.67.0
Release v1.66.0
What's Changed
Key New Features 🎉
- H4D: enable gcsfuse and set cluster availability type to ZONAL by @kadupoornima in #4608
- Add G4 GKE base blueprints by @kadupoornima in #4560
Module Improvements 🔨
- Slinky upgraded to v0.3.1 by @sharabiani in #4548
- Update Managed Lustre GKE blueprint by @parulbajaj01 in #4603
Improvements 🛠
- Making separate integration test for nccl test in gke a3 ultra by @shubpal07 in #4622
- Upgrade to Slurm 25.05 by @LAVEEN in #4606
- Hotfix: H4D Blueprint provisioning model option update by @abbas1902 in #4640
New Contributors
- @saara-tyagi27 made their first contribution in #4619
Full Changelog: v1.65.0...v1.66.0
v1.65.1: Hotfix: H4D Blueprint provisioning model options update
What's Changed
Improvements 🛠
- Hotfix: H4D Blueprint provisioning model options update by @abbas1902 in #4640
Full Changelog: v1.65.0...v1.65.1
v1.65.0
What's Changed
Improvements 🛠
- Surface Managed Lustre support in a4x by @RachaelSTamakloe in #4576
- Expand A* gpu network wait solution by @RachaelSTamakloe in #4584
- Restart slurmctld.service before scontrol reconfigure by @RachaelSTamakloe in #4609
- Support use of other shared file locations for NCCL Tests by @RachaelSTamakloe in #4615
- Add sudo to systemctl restart by @RachaelSTamakloe in #4626
Deprecations 💤
- Deprecate Debian blueprints from a3 mega gpu by @rachit-google in #4537
Bug fixes 🐞
- Power down non-responding node if there is no instance attached by @abbas1902 in #4627
Full Changelog: v1.64.0...v1.65.0
Release v1.64.0
What's Changed
Key New Features 🎉
- GKE Managed Lustre integration by @vikramvs-gg in #4572
Breaking Changes 🚨
- Updated the storage for A3 Ultra to basic SSD by @rachit-google in #4516
Version Updates ⏫
- Revert gke-node-pool module to using google-beta provider by @kadupoornima in #4577
Full Changelog: v1.63.0...v1.64.0