Changes from all commits
31 commits
5862744
Merge pull request #4747 from arpit974/a3ultrs-vm-imageUpdate
arpit974 Oct 9, 2025
880a6c4
Revert "Merge pull request #4750 from arpit974/reverting-google-beta-…
arpit974 Oct 10, 2025
10ed63b
bump TF and blueprint version and improve protocol validation
arpit974 Oct 10, 2025
8382614
Merge pull request #4747 from arpit974/a3ultrs-vm-imageUpdate
arpit974 Oct 9, 2025
bd88190
Revert "Merge pull request #4750 from arpit974/reverting-google-beta-…
arpit974 Oct 10, 2025
a017834
updating vpc version in gpu-rdma-vpc module.
arpit974 Oct 10, 2025
d2b7e28
add netapp-storage-pool module
okrause Aug 26, 2025
9889355
add netapp-volume module
okrause Aug 27, 2025
6031528
adds example blueprint documentation and doc updates
okrause Aug 29, 2025
2ad24f8
change google provider version dependency
okrause Sep 3, 2025
0ede295
Review process changes #1
okrause Oct 1, 2025
7337e20
Adding Provisioning method for A3 Mega and A3 High
LAVEEN Sep 9, 2025
e58df97
Review process changes 2
okrause Oct 7, 2025
811d215
reset a3-highgpu-8g/README.md to develop version
okrause Oct 7, 2025
c32804a
add NetApp Volumes integration test
okrause Oct 9, 2025
6b773b7
Change zone var to not use variable substitution
okrause Oct 23, 2025
e2e53a0
change region for tests to us-central1
okrause Oct 24, 2025
2978ab2
change network name for tests
okrause Oct 27, 2025
3d7403b
change test to use deployment name for network
okrause Oct 27, 2025
3a1954a
bump TF and blueprint version and improve protocol validation
okrause Oct 27, 2025
c9daee9
Removes non-GCNV files accidentially made it into this commit
okrause Oct 27, 2025
bab6f55
add comments for choice of TF version
okrause Oct 27, 2025
a5e8b12
add comments for choice of TF version
okrause Oct 27, 2025
e3a7aee
add eda-hybrid-cloud blueprint
okrause Oct 17, 2025
b71e0a5
add eda-all-on-cloud blueprint and README
okrause Oct 17, 2025
66dd75b
updating README.md
okrause Oct 23, 2025
083d000
updates README.md file
okrause Oct 28, 2025
dd2b0fd
replaces compute in eda-all-on-cloud blueprint with H4D
okrause Oct 29, 2025
82c0ff9
updates eda-hybrid-cloud blueprint with H4D compute support
okrause Oct 30, 2025
5dace5b
adds blueprint decription to examples/README.md
okrause Oct 30, 2025
ee2982e
removes eda-hybrid-cloud.yaml stale blueprint
okrause Oct 30, 2025
3 changes: 3 additions & 0 deletions docs/network_storage.md
@@ -5,6 +5,7 @@ storage.

The Toolkit contains modules that will **provision**:

- [Google Cloud NetApp Volumes (GCP managed enterprise NFS and SMB)][netapp-volumes]
- [Filestore (GCP managed NFS)][filestore]
- [DDN EXAScaler lustre][ddn-exascaler] (Deprecated, removal on July 1, 2025)
- [Managed Lustre][managed-lustre]
@@ -106,6 +107,7 @@ nfs-server | via USE | via USE | via USE | via STARTUP | via USE | via USE
cloud-storage-bucket (GCS)| via USE | via USE | via USE | via STARTUP | via USE | via USE
DDN EXAScaler lustre | via USE | via USE | via USE | Needs Testing | via USE | via USE
Managed Lustre | via USE | Needs Testing | via USE | Needs Testing | Needs Testing | Needs Testing
netapp-volume | Needs Testing | Needs Testing | via USE | Needs Testing | Needs Testing | Needs Testing
|  |   |   |   |   |  
filestore (pre-existing) | via USE | via USE | via USE | via STARTUP | via USE | via USE
nfs-server (pre-existing) | via USE | via USE | via USE | via STARTUP | via USE | via USE
@@ -129,3 +131,4 @@ GCS FUSE (pre-existing) | via USE | via USE | via USE | via STARTUP | via USE |
[ddn-exascaler]: ../community/modules/file-system/DDN-EXAScaler/README.md
[managed-lustre]: ../modules/file-system/managed-lustre/README.md
[nfs-server]: ../community/modules/file-system/nfs-server/README.md
[netapp-volumes]: ../modules/file-system/netapp-volume/README.md
74 changes: 74 additions & 0 deletions examples/README.md
@@ -63,6 +63,7 @@ md_toc github examples/README.md | sed -e "s/\s-\s/ * /"
* [xpk-n2-filestore](#xpk-n2-filestore--) ![community-badge] ![experimental-badge]
* [gke-h4d](#gke-h4d-) ![core-badge]
* [gke-g4](#gke-g4-) ![core-badge]
* [netapp-volumes.yaml](#netapp-volumesyaml--) ![community-badge]
* [Blueprint Schema](#blueprint-schema)
* [Writing an HPC Blueprint](#writing-an-hpc-blueprint)
* [Blueprint Boilerplate](#blueprint-boilerplate)
@@ -1631,6 +1632,79 @@ This blueprint uses GKE to provision a Kubernetes cluster and a G4 node pool, al

[gke-g4]: ../examples/gke-g4

### [netapp-volumes.yaml] ![core-badge]

This blueprint demonstrates how to provision NFS volumes as shared file systems for compute VMs, using Google Cloud NetApp Volumes. It can be used as an alternative to Filestore in blueprints.

NetApp Volumes is a first-party Google service that provides NFS and/or SMB shared file systems to VMs. It offers advanced data management capabilities and highly scalable capacity and performance.

NetApp Volumes provides:

* robust support for NFSv3, NFSv4.x and SMB 2.1 and 3.x
* a [rich feature set][service-levels]
* scalable [performance](https://cloud.google.com/netapp/volumes/docs/performance/performance-benchmarks)
* FlexCache: caching of ONTAP-based volumes, providing high-throughput, low-latency read access to on-premises data for compute clusters
* [Auto-tiering](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/manage-auto-tiering) of unused data to optimize cost

Support for NetApp Volumes is split into two modules.

* **netapp-storage-pool** provisions a [storage pool](https://cloud.google.com/netapp/volumes/docs/configure-and-use/storage-pools/overview). Storage pools are pre-provisioned storage capacity containers which host volumes. A pool also defines fundamental properties of all the volumes within, like the region, the attached network, the [service level][service-levels], CMEK encryption, Active Directory and LDAP settings.
* **netapp-volume** provisions a [volume](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/overview) inside an existing storage pool. A volume is a file-system which is shared using NFS or SMB. It provides advanced data management capabilities.

You can provision multiple volumes in a pool. For the Standard, Premium and Extreme service levels, the throughput capability depends on volume size and service level: every GiB of provisioned volume space adds 16, 64, or 128 KiBps of throughput, respectively.
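
As a rough illustration, the two modules might be wired together in a blueprint as sketched below. This is a minimal sketch, not the modules' documented interface: the setting names (`service_level`, `capacity_gib`, `protocols`, `local_mount`), the `netapp-storage-pool` source path, and the `network1` module reference are assumptions; check the module READMEs for the actual inputs.

```yaml
  # Hypothetical sketch: one Premium pool hosting a single NFS volume.
  # Setting names are assumptions; check the module READMEs for real inputs.
  - id: netapp_pool
    source: modules/file-system/netapp-storage-pool   # source path assumed
    use: [network1]                                    # your network module
    settings:
      service_level: PREMIUM   # per the note above, ~64 KiBps per provisioned GiB
      capacity_gib: 2048       # pool capacity shared by all volumes it hosts

  - id: homefs
    source: modules/file-system/netapp-volume
    use: [netapp_pool]
    settings:
      capacity_gib: 1024       # volume throughput scales with this size
      protocols: [NFSV3]
      local_mount: /home       # mount point for VMs that `use` this module
```

Compute modules that `use` a volume module like `homefs` would then mount the volume at its `local_mount` path.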

#### Steps to deploy the blueprint

To provision the blueprint, run:

```shell
./gcluster create examples/netapp-volumes.yaml --vars "project_id=${GOOGLE_CLOUD_PROJECT}"
./gcluster deploy netapp-volumes
```

After the blueprint is deployed, you can log in to the VM it created:

```shell
gcloud compute ssh --zone "us-central1-a" "netapp-volumes-0" --project ${GOOGLE_CLOUD_PROJECT} --tunnel-through-iap
```

A NetApp Volumes volume is provisioned and mounted to `/home` on all the provisioned VMs. A home directory for your user is created automatically:

```shell
pwd
df -h -t nfs
```

#### Clean Up

To destroy all resources created by this blueprint, run the following command:

```sh
./gcluster destroy netapp-volumes
```

[netapp-storage-pool]: ../netapp-storage-pool/README.md
[service-levels]: https://cloud.google.com/netapp/volumes/docs/discover/service-levels
[auto-tiering]: https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/manage-auto-tiering
[netapp-volumes.yaml]: ../examples/netapp-volumes.yaml

### [eda-all-on-cloud] ![core-badge]

Creates a basic auto-scaling Slurm cluster intended for EDA use cases. The blueprint also creates two new VPC networks: one frontend network that connects the VMs, SLURM, and storage, and one for fast RDMA networking between the H4D nodes. Four [Google Cloud NetApp Volumes](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/overview) volumes are mounted to `/home`, `/tools`, `/libraries` and `/scratch`. There is an `h4d` partition that uses the compute-optimized `h4d-highmem-192-lssd` machine type.

The deployment instructions can be found in the [README](/examples/eda/README.md).

[eda-all-on-cloud]: ../examples/eda/eda-all-on-cloud.yaml

### [eda-hybrid-cloud] ![core-badge]

Creates a basic auto-scaling Slurm cluster intended for EDA use cases. The blueprint connects to an existing frontend network, which connects the VMs, SLURM, and storage, and creates a new RDMA network for low-latency communication between the compute nodes. There is an `h4d` partition that uses the compute-optimized `h4d-highmem-192-lssd` machine type.

Four pre-existing NFS volumes are mounted to `/home`, `/tools`, `/libraries` and `/scratch`. Using [FlexCache](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/cache-ontap-volumes/overview) volumes allows you to bring on-premises data to Google Cloud compute without manually copying it. This enables "burst to the cloud" use cases.

The deployment instructions can be found in the [README](/examples/eda/README.md).

[eda-hybrid-cloud]: ../examples/eda/eda-hybrid-cloud.yaml

## Blueprint Schema

Similar documentation can be found on
Binary file added examples/eda/ClusterToolkit-EDA-AllCloud.png
Binary file added examples/eda/ClusterToolkit-EDA-Hybrid.png
249 changes: 249 additions & 0 deletions examples/eda/README.md
@@ -0,0 +1,249 @@
# Electronic Design Automation (EDA) Reference Architecture

The Electronic Design Automation (EDA) blueprints in
this folder capture a reference architecture in which the right cloud components
are assembled to optimally cater to the requirements of EDA workloads.

For file I/O, Google Cloud NetApp Volumes provides NFS storage services. It
scales from small to high capacity and performance and provides fan-out
caching of on-premises ONTAP systems into Google Cloud to enable hybrid-cloud
architectures. The scheduling of the workloads is done by a workload
manager.

## Architecture

The EDA blueprints are intended to be a starting point for more tailored
explorations of EDA.

These blueprints feature a general setup suited for EDA applications on
Google Cloud, including:

- Google Compute Engine partitions
- Google Cloud NetApp Volumes NFS-based shared storage
- SLURM workload scheduler

Two example blueprints are provided.

### Blueprint [eda-all-on-cloud](eda-all-on-cloud.yaml)

This blueprint assumes that all compute and data resides in the cloud.

![EDA all-cloud architecture](./ClusterToolkit-EDA-AllCloud.png)

The setup deployment group (see [deployment stages](#deployment-stages)) provisions a new network and multiple NetApp Volumes volumes to store your data. Adjust the volume sizes to suit your requirements before deployment. If your volumes are larger than 15 TiB, creating them as [large volumes](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/overview#large-capacity-volumes) adds performance benefits.
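
As a sizing illustration only, a scratch volume in the setup group might be declared roughly as follows; the module source and the setting names (`capacity_gib`, `local_mount`) are assumptions, and the large-volume option may be exposed under a different name (or not at all), so verify against the netapp-volume module README.

```yaml
  # Hypothetical sizing sketch; setting names are assumptions.
  - id: scratchfs
    source: modules/file-system/netapp-volume
    use: [netapp_pool]      # hypothetical storage pool module id
    settings:
      capacity_gib: 20480   # ~20 TiB; above 15 TiB, consider creating it as a large volume
      local_mount: /scratch
```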

The cluster deployment group deploys a managed instance group which is managed by SLURM.

When scaling down the deployment, make sure to destroy only the *cluster* deployment group. If you destroy the *setup* group too, all the volumes will be deleted and you will lose your data.

### Blueprint [eda-hybrid-cloud](./eda-hybrid-cloud.yaml)

This blueprint assumes you are using a pre-existing Google Cloud VPC with pre-existing NFS shares on NetApp Volumes, managed outside of Cluster Toolkit.

![EDA hybrid-cloud architecture](./ClusterToolkit-EDA-Hybrid.png)

The setup deployment group (see [deployment stages](#deployment-stages)) connects to an existing network and mounts multiple NetApp Volumes volumes. This blueprint assumes you have pre-existing volumes for "tools", "libraries", "home" and "scratch". Before deployment, update the `server_ip` and `remote_mount` parameters of the respective volumes in the blueprint declarations to reflect the actual IP and export path of your existing volumes. Using existing volumes also avoids the danger of them being deleted accidentally when you destroy the setup deployment group.
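
For example, one of the four volume declarations might look like the sketch below; the module source is an assumption, and the IP and export path are placeholders to be replaced with the values of your existing volume.

```yaml
  # Placeholder values only; substitute the real mount IP and export path.
  - id: toolsfs
    source: modules/file-system/pre-existing-network-storage   # module assumed
    settings:
      server_ip: 10.0.0.10       # hypothetical NetApp Volumes mount IP
      remote_mount: /tools_vol   # hypothetical export path of the volume
      local_mount: /tools
      fs_type: nfs
```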

The volumes used can be regular NetApp Volumes [volumes](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/overview), [large volumes](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/overview#large-capacity-volumes) or [FlexCache volumes](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/cache-ontap-volumes/overview).

FlexCache offers the following features which enable bursting on-premises workloads into Google Cloud to use its powerful compute options:

- Read-writable sparse volume
- Block-level, “pull only” paradigm
- 100% consistent, coherent, current
- write-around
- LAN-like latencies after first read
- Fan-out: use multiple caches to scale out workloads

It can accelerate metadata- or throughput-heavy read workloads considerably.

FlexCache and large volumes offer six IP addresses per volume, all of which provide access to the same data. Currently, Cluster Toolkit only uses one of these IPs. Support for using all six IPs is planned for a later release. To spread your compute nodes over all IPs today, you can use Cloud DNS to create a DNS record with all six IPs and specify that DNS name instead of individual IPs in the blueprint. Cloud DNS will return one of the six IPs in a round-robin fashion on lookups.

The cluster deployment group deploys a managed instance group which is managed by SLURM.

## Getting Started

To explore the reference architecture, follow these steps:

Before you start, make sure your prerequisites and dependencies are set up:
[Set up Cluster Toolkit](https://cloud.google.com/cluster-toolkit/docs/setup/configure-environment).

To deploy the EDA reference blueprints, follow the
[Deployment Instructions](#deployment-instructions).

### Deployment Stages

This blueprint has the following deployment groups:

- `setup`: Setup backbone infrastructure such as networking and file systems
- `software_installation`(_optional_): This deployment group is a stub for
custom software installation on the network storage before the cluster is brought up
- `cluster`: Deploys an auto-scaling cluster

Having multiple deployment groups decouples the life cycle of some
infrastructure. For example a) you can tear down the cluster while leaving the
storage intact and b) you can build software before you deploy your cluster.
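
In blueprint terms, the three groups appear roughly as in the following skeleton; this is only a sketch of the structure, with module lists elided, not the exact contents of the EDA blueprints.

```yaml
deployment_groups:
- group: setup                   # network and shared NFS storage
  modules: []                    # module list elided in this sketch
- group: software_installation   # optional; commented out by default
  modules: []
- group: cluster                 # Slurm cluster and compute partitions
  modules: []
```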

## Deployment Instructions

> [!WARNING]
> Installing this blueprint uses the following billable components of Google
> Cloud:
>
> - Compute Engine
> - NetApp Volumes
>
> To avoid continued billing after use, closely follow the
> [teardown instructions](#teardown-instructions). To generate a cost estimate based on
> your projected usage, use the [pricing calculator](https://cloud.google.com/products/calculator).

> [!WARNING]
> Before attempting to execute the following instructions, it is important to
> consider your project's quota. The blueprints create an
> autoscaling cluster that, when fully scaled up, can deploy many powerful VMs.
>
> This is merely an example for an instance of this reference architecture.
> Node counts can easily be adjusted in the blueprint.

1. Clone the repo

```bash
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
cd cluster-toolkit
```

1. Build the Cluster Toolkit

```bash
make
```

1. Change parameters in your blueprint file to reflect your requirements. Examples are VPC names for existing networks, H4D instance group node limits, or export paths of existing NFS volumes.

1. Generate the deployment folder after replacing `<blueprint>` with the name of the blueprint (`eda-all-on-cloud` or `eda-hybrid-cloud`) and `<project>` with the project ID.

```bash
./gcluster create examples/eda/<blueprint>.yaml -w --vars project_id=<project>
```

1. Deploy the `setup` group

Call the following gcluster command to deploy the blueprint.

```bash
./gcluster deploy <blueprint>
```

The next `gcluster` prompt will ask you to **display**, **apply**, **stop**, or
**continue** without applying the `setup` group. Select 'apply'.

This group will create a network and file systems to be used by the cluster.

> [!WARNING]
> This gcluster command will run through 2 deployment groups (3 if you populate
> & activate the `software_installation` stage) and prompt you to apply each one.
> If the command is cancelled or exited by accident before finishing, it can
> be rerun to continue deploying the blueprint.

1. Deploy the `software_installation` group (_optional_).

> [!NOTE]
> Installation processes differ between applications. Some come as a
> precompiled binary with all dependencies included, others may need to
> be built from source, while others can be deployed through package
> managers such as spack. This deployment group is intended to be used
> if the software installation process requires a substantial amount of time (e.g.
> compilation from source). By building the software in a separate
> deployment group, this process can be done before the cluster is
> up, minimizing costs.

> [!NOTE]
> By default, this deployment group is disabled in the reference design. See
> [Software Installation Patterns](#software-installation-patterns) for more information.

If this deployment group is used (it needs to be uncommented in the blueprint first),
you can return to the gcluster command, which will ask you to **display**, **apply**,
**stop**, or **continue** without applying the `software_installation` group.
Select 'apply'.

1. Deploy the `cluster` group

The next `gcluster` prompt will ask you to **display**, **apply**, **stop**, or
**continue** without applying the `cluster` group. Select 'apply'.

This deployment group contains the Slurm cluster and compute partitions.

## Teardown Instructions

> [!NOTE]
> If you created a new project for testing of the EDA solution, the easiest way to
> eliminate billing is to delete the project.

When you would like to tear down the deployment, each stage must be destroyed.
Since the `software_installation` and `cluster` stages depend on the network deployed
in the `setup` stage, they must be destroyed first. You can use the following
commands to destroy the deployment in this reverse order. You will be prompted
to confirm the deletion of each stage.

```bash
./gcluster destroy <blueprint>
```

> [!WARNING]
> If you do not destroy all three deployment groups then there may be continued
> associated costs.

## Software Installation Patterns

This section is intended to illustrate how software can be installed in the context
of the EDA reference solution.

Depending on the software you want to use, different installation paths may be required.

- **Installation with binary**
Commercial off-the-shelf applications typically come with precompiled binaries which
are provided by the ISV. If you do not share them using the toolsfs or libraryfs shares,
you can install the software using the following method.

In general, you need to bring the binaries to your EDA cluster, for which it is
useful to use a Google Cloud Storage bucket, which is accessible from any machine using the
`gsutil` command and which can be mounted in the cluster.

As this installation process only needs to be done once and may take a while,
we recommend doing this installation in a separate deployment group before you bring up the cluster.
The `software_installation` stage is meant to accommodate this. You can, for example, bring up
a dedicated VM

``` {.yaml}
  # Dedicated VM for manual software installation; `network1` and `appsfs`
  # refer to the network and shared file-system modules defined elsewhere
  # in the blueprint.
  - id: sw-installer-vm
    source: modules/compute/vm-instance
    use: [network1, appsfs]
    settings:
      name_prefix: sw-installer
      add_deployment_name_before_prefix: true
      threads_per_core: 2
      machine_type: c2-standard-16
```

where you can follow the installation steps manually, or automate the process using
the toolkit's [startup-script](../../modules/scripts/startup-scripts/README.md) module,
for example as sketched below.
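
The following is a minimal sketch of such an automated install. It assumes the startup-script module's runner schema (`type`, `destination`, `content`); the module source path, the bucket, and the installer names are placeholders rather than the blueprint's actual contents.

```yaml
  # Hypothetical sketch: fetch a vendor installer from a GCS bucket and run it.
  # Module source and runner fields are assumptions; see the startup-script README.
  - id: sw-install-script
    source: modules/scripts/startup-script
    settings:
      runners:
      - type: shell
        destination: install-eda-tool.sh
        content: |
          #!/bin/bash
          set -e
          gsutil cp gs://<your-bucket>/eda-tool-installer.run /tmp/   # placeholder bucket
          bash /tmp/eda-tool-installer.run --prefix /tools            # placeholder installer
```

The `sw-installer-vm` above could then `use` this module so that the script runs when the VM boots.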

Once that is completed, the software will persist on the shared NFS volume for as long as you
do not destroy the `setup` stage.

- **Installation from source/with package manager**
For open source software, you may want to compile the software from scratch or use a
package manager such as spack for the installation. This process typically takes
a non-negligible amount of time (~hours). We therefore strongly suggest using
the `software_installation` stage for this purpose.

Please see the [HCLS Blueprint](../../docs/videos/healthcare-and-life-sciences/README.md) example
for how the `software_installation` stage can use the spack package manager
to install all dependencies for a particular version of the software, including compiling
the software or its dependencies from source.

Please also see the [OpenFOAM](../../docs/tutorials/openfoam/spack-openfoam.md) example
for how this can be used to install the OpenFOAM software.
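
As a rough sketch of what such a spack-based `software_installation` group could contain (the module names and settings follow the pattern of the linked examples but are assumptions to verify against those READMEs):

```yaml
  # Hypothetical spack-based software_installation group.
  # Module names and settings are assumptions; see the HCLS and OpenFOAM examples.
  - id: spack-setup
    source: community/modules/scripts/spack-setup
    settings:
      install_dir: /tools/spack    # install onto the shared NFS volume
  - id: spack-build
    source: community/modules/scripts/spack-execute
    use: [spack-setup]
    settings:
      commands: |
        spack install gcc@13.2.0   # example package; replace with your software
```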

Once that is completed, the software will persist on the shared NFS volume for as long as you
do not destroy the `setup` stage.