52 changes: 49 additions & 3 deletions examples/README.md
@@ -16,6 +16,8 @@ md_toc github examples/README.md | sed -e "s/\s-\s/ * /"
* [Instructions](#instructions)
* [(Optional) Setting up a remote terraform state](#optional-setting-up-a-remote-terraform-state)
* [Blueprint Descriptions](#blueprint-descriptions)
* [c4a-vm.yaml](#c4a-vmyaml-) ![core-badge]
* [hpc-slurm-c4a.yaml](#hpc-slurm-c4ayaml-) ![core-badge]
* [hpc-slurm.yaml](#hpc-slurmyaml-) ![core-badge]
* [hpc-enterprise-slurm.yaml](#hpc-enterprise-slurmyaml-) ![core-badge]
* [hpc-slurm-static.yaml](#hpc-slurm-staticyaml-) ![core-badge]
@@ -29,7 +31,7 @@ md_toc github examples/README.md | sed -e "s/\s-\s/ * /"
* [serverless-batch-mpi.yaml](#serverless-batch-mpiyaml-) ![core-badge]
* [pfs-lustre.yaml](#pfs-lustreyaml-) ![core-badge] ![deprecated-badge]
* [pfs-managed-lustre-vms.yaml](#pfs-managed-lustre-vmsyaml-) ![core-badge]
* [gke-managed-lustre.yaml](#gke-managed-lustreyaml-) ![core-badge]
* [ps-slurm.yaml](#ps-slurmyaml--) ![core-badge] ![experimental-badge]
* [cae-slurm.yaml](#cae-slurmyaml-) ![core-badge]
* [hpc-build-slurm-image.yaml](#hpc-build-slurm-imageyaml--) ![community-badge] ![experimental-badge]
@@ -159,6 +161,50 @@ Toolkit team, partners, etc.) and are labeled with the community badge
Blueprints that are still in development and less stable are also labeled with
the experimental badge (![experimental-badge]).

### [c4a-vm.yaml] ![core-badge]

The [c4a-vm.yaml] blueprint creates a small, two-node cluster of C4A (ARM64) VMs. It includes a Filestore instance mounted at `/home` and uses Hyperdisk Balanced boot disks.

To deploy this blueprint:

```bash
gcluster deploy examples/c4a-vm.yaml \
  -v project_id=<YOUR-PROJECT-ID> \
  -v deployment_name=c4a-vm
```
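
Once the deployment finishes, you can confirm that Filestore is mounted on the VMs. The commands below are a minimal sketch: the instance name filter and zone are assumptions based on the blueprint defaults, so verify the actual names with `gcloud compute instances list` first.

```bash
# List the deployed instances; the name filter assumes the default deployment_name (c4a-vm)
gcloud compute instances list --project <YOUR-PROJECT-ID> --filter="name~'^c4a-vm'"

# Run a quick check on one VM to confirm the Filestore /home mount (instance name is a placeholder)
gcloud compute ssh <INSTANCE-NAME> --zone us-central1-a --project <YOUR-PROJECT-ID> --command "df -h /home"
```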

#### Quota Requirements for c4a-vm.yaml

For this example the following is needed in the selected region:

* Cloud Filestore API: Basic HDD (Standard) capacity (GB): **1,024 GB**
* Compute Engine API: Hyperdisk Balanced (GB): **~100 GB** (50 GB/node)
* Compute Engine API: C4A CPUs: **144** (2 nodes * 72 vCPUs)
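
Regional quota can be inspected ahead of time with `gcloud`. The snippet below is only a sketch; the exact quota metric name for C4A CPUs is an assumption and may differ by project.

```bash
# Dump the regional Compute Engine quotas and search for C4A-related entries
gcloud compute regions describe us-central1 --project <YOUR-PROJECT-ID> --format="json(quotas)" | grep -i -B1 -A2 c4a
```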

[c4a-vm.yaml]: ./c4a-vm.yaml

### [hpc-slurm-c4a.yaml] ![core-badge]

This blueprint creates a small Slurm cluster using C4A (ARM64) instances. It provisions `c4a-standard-4` instances for the controller and login nodes, and a dynamic partition of up to two `c4a-highcpu-72` compute nodes. All nodes use Hyperdisk Balanced boot disks, and a 2.5 TiB Basic SSD Filestore instance is mounted at `/home`. This blueprint is a good starting point for running ARM-based workloads on a Slurm cluster.

To deploy this blueprint:

```bash
gcluster deploy examples/hpc-slurm-c4a.yaml \
  -v project_id=<YOUR-PROJECT-ID> \
  -v deployment_name=slurm-c4a
```
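
After the deployment completes, a quick smoke test can be run from the login node. This is only a sketch, assuming you have SSH'd to the login node (it has a public IP in this blueprint) and are using the default `c4a` partition; the first run will pause while the two dynamic nodes are provisioned.

```bash
# Run a trivial job across both dynamic compute nodes in the c4a partition
srun -N 2 -p c4a hostname

# Watch node and job state while the nodes spin up
sinfo -p c4a
squeue
```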

#### Quota Requirements for hpc-slurm-c4a.yaml

For this example the following is needed in the selected region:

* Cloud Filestore API: Basic SSD capacity (GB): **2,560 GB**
* Compute Engine API: Hyperdisk Balanced (GB): **~250 GB** (50 GB/node for 2 compute nodes + 50 GB for the controller boot disk + 50 GB for the controller state disk + 50 GB for the login node)
* Compute Engine API: C4A CPUs: **152** (4 for controller + 4 for login + 2 nodes * 72 vCPUs)

[hpc-slurm-c4a.yaml]: ./hpc-slurm-c4a.yaml

### [hpc-slurm.yaml] ![core-badge]

Creates a basic auto-scaling Slurm cluster with mostly default settings. The
@@ -786,7 +832,7 @@ providing a high-performance file system for demanding workloads.
volumes:
- name: lustre-volume
  persistentVolumeClaim:
    claimName: $(vars.lustre_instance_id)-pvc # Matches the PVC name
```

Note: This is just an example job using the busybox image.
@@ -1172,7 +1218,7 @@ to the cluster using `kubectl` and will run on the specified node pool.
1. The output of the `./gcluster deploy` on CLI includes a `kubectl create` command to create the job.

```sh
kubectl create -f <job-yaml-path>
```

This command creates a job that uses the busybox image and prints `Hello World`. The result can be viewed by looking at the pod logs.
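
For example, something along the following lines can be used to read those logs; the job name is a placeholder taken from the generated manifest, not a value defined by the blueprint.

```sh
# Find the pod created by the job and print its output (job name is a placeholder)
kubectl get pods -l job-name=<JOB-NAME>
kubectl logs job/<JOB-NAME>
```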
72 changes: 72 additions & 0 deletions examples/c4a-vm.yaml
@@ -0,0 +1,72 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---

blueprint_name: c4a-vm

vars:
  project_id: ## Set GCP Project ID Here ##
  deployment_name: c4a-vm
  region: us-central1
  zone: us-central1-a
  hostname_prefix: $(vars.deployment_name)
  base_network_name: $(vars.deployment_name)
  instance_image:
    family: rocky-linux-8-optimized-gcp-arm64
    project: rocky-linux-cloud

# Documentation for each of the modules used below can be found at
# https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/modules/README.md

deployment_groups:
- group: primary
  modules:

  # Source is an embedded module, denoted by "modules/*" without ./, ../, /
  # as a prefix. To refer to a local module, prefix with ./, ../ or /

  - id: cluster-net-0
    source: modules/network/vpc
    settings:
      network_name: $(vars.base_network_name)-net

  - id: homefs
    source: modules/file-system/filestore
    use: [cluster-net-0]
    settings:
      local_mount: /home
    outputs:
    - network_storage

  - id: c4a_startup
    source: modules/scripts/startup-script
    settings:
      configure_ssh_host_patterns:
      - $(vars.hostname_prefix)-*

  - id: c4a-vms
    source: modules/compute/vm-instance
    use: [c4a_startup, homefs, cluster-net-0]
    settings:
      machine_type: c4a-standard-72
      instance_count: 2
      disk_type: hyperdisk-balanced
      bandwidth_tier: tier_1_enabled

  - id: wait-for-vms
    source: community/modules/scripts/wait-for-startup
    settings:
      instance_names: $(c4a-vms.name)
      timeout: 7200
101 changes: 101 additions & 0 deletions examples/hpc-slurm-c4a.yaml
@@ -0,0 +1,101 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
blueprint_name: slurm-c4a
vars:
  project_id: ## Set GCP Project ID Here ##
  deployment_name: slurm-c4a
  region: us-central1
  zone: us-central1-a
  instance_image:
    family: slurm-gcp-6-9-ubuntu-2204-lts-arm64
    project: schedmd-slurm-public
  disk_type: hyperdisk-balanced

# Documentation for each of the modules used below can be found at
# https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/modules/README.md
deployment_groups:
- group: primary
  modules:

  # Source is an embedded module, denoted by "modules/*" without ./, ../, /
  # as a prefix. To refer to a local module, prefix with ./, ../ or /

  - id: c4a-slurm-net-0
    source: modules/network/vpc

  - id: homefs
    source: modules/file-system/filestore
    use: [c4a-slurm-net-0]
    settings:
      filestore_tier: BASIC_SSD
      size_gb: 2560
      filestore_share_name: homeshare
      local_mount: /home

  - id: c4a_startup
    source: modules/scripts/startup-script
    settings:
      local_ssd_filesystem:
        fs_type: ext4
        mountpoint: /mnt/lssd
        permissions: "1777"

  - id: c4a_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [c4a_startup, c4a-slurm-net-0]
    settings:
      # Unattended upgrades are disabled to prevent automatic daily updates which may lead to potential instability.
      # For more details, see https://cloud.google.com/compute/docs/instances/create-hpc-vm#disable_automatic_updates
      allow_automatic_updates: false
      bandwidth_tier: tier_1_enabled
      machine_type: c4a-highcpu-72
      node_count_static: 0
      node_count_dynamic_max: 2
      enable_placement: true

  - id: c4a_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [c4a_nodeset]
    settings:
      exclusive: false
      partition_name: c4a
      is_default: true
      partition_conf:
        ResumeTimeout: 900
        SuspendTimeout: 600

  - id: login_startup
    source: modules/scripts/startup-script
    settings:
      configure_ssh_host_patterns:
      - $(vars.deployment_name)-*

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
    use: [c4a-slurm-net-0]
    settings:
      machine_type: c4a-standard-4
      enable_login_public_ips: true

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use: [c4a-slurm-net-0, c4a_partition, slurm_login, homefs]
    settings:
      machine_type: c4a-standard-4
      controller_state_disk:
        type: hyperdisk-balanced
        size: 50
      enable_controller_public_ips: false
      login_startup_script: $(login_startup.startup_script)