Releases: GoogleCloudPlatform/cluster-toolkit

Address bug in updated NVIDIA package causing Slurm job failures

20 May 17:04
8b7aae6

What's Changed

Bug fixes 🐞

  • Block broken release of nvidia-container-toolkit by @tpdownes in #4152

Full Changelog: v1.51.0...v1.51.1

Release v1.51.0

13 May 00:04
349211a

Highlights

  • gpu-health-epilog checks added for A3 High/Mega Slurm blueprints
  • New GKE TPU v6e example

What's Changed

Key New Features 🎉

  • add GKE version_prefix support as a configurable parameter by @ighosh98 in #4060
  • Better integrate, optimize, and control the gIB NCCL RDMA plugin installer by @ndebuhr in #4069

Breaking Changes 🚨

Module Improvements 🔨

Improvements 🛠

Bug fixes 🐞

  • Explicitly define labels at google_compute_instance_from_template by @mr0re1 in #4055
  • Remove instance image override at the top of cae.yaml example by @abbas1902 in #4066
  • split kubectl installation into two parts to reduce race conditions by @ighosh98 in #4081
  • Fix gke a3 mega integration test by @ighosh98 in #4093
  • Update default Accelerator Image in A3 Ultra and A4 Slurm blueprints by @tpdownes in #4089

Full Changelog: v1.50...v1.51.0

Release v1.50.0

05 May 18:52
c0b8bed

Highlights

  • New blueprints for Managed Lustre attached to VMs and to Slurm clusters (including opt-in solution for A3 Ultra and A4 Slurm blueprints)
  • Breaking change: RoCE (RDMA) networks no longer support firewall rules. Older blueprints will fail with a validation warning; the solution is to remove the firewall rules following the examples in 312d7fb.

What's Changed

Key New Features 🎉

New Modules 🧱

  • Cluster Toolkit - new module for creating Artifact Registries by @scott-nag in #3639

Module Improvements 🔨

  • Enable specification of system node pool zones in the GKE Cluster module by @ndebuhr in #3976
  • Add support for optional GCS bucket module config by @mohitchaurasia91 in #3990
  • fix(htcondor): explicitly set region for cm and ap addresses to match subnetwork region by @rbekhtaoui in #3991
  • Remove the broken auto_delete_disk system, and replace it with a working snapshot-based alternative, in the NFS Server module by @ndebuhr in #3887
  • add non-queue flex-start support in gke by @chengcongdu in #3995
  • Extend slurm_conf_tpl to support raw content by @gkcalat in #4010
  • Adding GKE support for Managed Lustre by @cdunbar13 in #4022

Improvements 🛠

Deprecations 💤

  • Removal of the omnia install module and related content by @cdunbar13 in #4021

Version Updates ⏫

Bug fixes 🐞

New Contributors

Full Changelog: v1.49.1...v1.50

v1.49.1: Minor Documentation Patch

24 Apr 21:29
396360e

New Contributors

Full Changelog: v1.49.0...v1.49.1

v1.49.0

24 Apr 18:32
3d1f01c

Highlights

  • TPU support in the GKE nodepool module, with an example blueprint
  • Support for Managed Lustre in pre-existing-network-storage module; Managed Lustre provisioning will be supported in a future Toolkit release

What's Changed

Key New Features 🎉

Breaking Changes 🚨

  • Make login nodes deployable independently of "controller" by @mr0re1 in #3958
    NOTE: Attempting to re-deploy a pre-existing Slurm cluster with the new gcluster version will cause login nodes to be destroyed.
  • The DWS Flex implementation changes with this release. To continue using the legacy implementation, use the use_bulk_insert option that we've added to the dws_flex nodeset settings. For more on DWS Flex support in Slurm, visit: https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/docs/slurm-dws-flex.md

New Modules 🧱

Module Improvements 🔨

Improvements 🛠

Version Updates ⏫

  • Bump golang versions (1.22, 1.23) -> (1.23, 1.24) by @mr0re1 in #3923

Bug fixes 🐞

Full Changelog: v1.48.1...v1.49.0

v1.48.1: AI/ML documentation update

11 Apr 16:40
5254651

Release v1.48.1 updates our documentation to better guide users to Google Cloud AI Hypercomputer solutions implemented by the Cluster Toolkit for GKE and for Slurm.

Commits were merged in the following pull requests:

Full Changelog: v1.48.0...v1.48.1

Release v1.48.0

01 Apr 00:46
f6bb9cf

Highlights

  • The GKE nodepool module of Toolkit has been updated to support multiple nodepools. (PR#3826)
  • Automatic Prolog/Epilog Slurm GPU Health Checks
  • Kueue v0.11.1 manifest support

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Deprecations 💤

  • Reduce number of startup scripts. by @mr0re1 in #3770
  • Add omnia deprecation warning and update A3U and A4 blueprints threads configurations by @ighosh98 in #3837

Version Updates ⏫

Bug fixes 🐞

New Contributors

Full Changelog: v1.47.0...v1.48.0

Release v1.47.0

27 Feb 01:00
44a0c43

Highlights

  • Toolkit adds support for the A4 machine family, including blueprints and documentation for GKE and Slurm

  • DWS Flex support introduced for GKE.

  • Support for persistent Slurm controller state

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Deprecations 💤

Version Updates ⏫

Bug fixes 🐞

Full Changelog: v1.46.1...v1.47.0

v1.46.1: Fix cloud rdma ofi tunables always being set

11 Feb 01:59
c491a4a

What's Changed

Bug fixes 🐞

  • Switch ofi startup script to not run automatically by @abbas1902 in #3659

Full Changelog: v1.46.0...v1.46.1

Release v1.46.0

07 Feb 00:49
bb1ddad

Highlights

  • Kueue becomes the officially supported workload scheduler for A3U.
  • New blueprints added for A3U (GKE/GCS) as well as H4d VMs and Slurm examples.
  • SlurmGCP module enhanced with advanced machine features/plugins deprecated.

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Deprecations 💤

Version Updates ⏫

New Contributors

Full Changelog: v1.45.1...v1.46.0