diff --git a/docs/network_storage.md b/docs/network_storage.md
index 5693b6126e..b09cc70e2c 100644
--- a/docs/network_storage.md
+++ b/docs/network_storage.md
@@ -5,6 +5,7 @@ storage.
 
 The Toolkit contains modules that will **provision**:
 
+- [Google Cloud NetApp Volumes (GCP managed enterprise NFS and SMB)][netapp-volumes]
 - [Filestore (GCP managed NFS)][filestore]
 - [DDN EXAScaler lustre][ddn-exascaler] (Deprecated, removal on July 1, 2025)
 - [Managed Lustre][managed-lustre]
@@ -106,6 +107,7 @@ nfs-server | via USE | via USE | via USE | via STARTUP | via USE | via USE
 cloud-storage-bucket (GCS)| via USE | via USE | via USE | via STARTUP | via USE | via USE
 DDN EXAScaler lustre | via USE | via USE | via USE | Needs Testing | via USE | via USE
 Managed Lustre | via USE | Needs Testing | via USE | Needs Testing | Needs Testing | Needs Testing
+netapp-volume | Needs Testing | Needs Testing | via USE | Needs Testing | Needs Testing | Needs Testing
 |  |   |   |   |   |  
 filestore (pre-existing) | via USE | via USE | via USE | via STARTUP | via USE | via USE
 nfs-server (pre-existing) | via USE | via USE | via USE | via STARTUP | via USE | via USE
@@ -129,3 +131,4 @@ GCS FUSE (pre-existing) | via USE | via USE | via USE | via STARTUP | via USE |
 [ddn-exascaler]: ../community/modules/file-system/DDN-EXAScaler/README.md
 [managed-lustre]: ../modules/file-system/managed-lustre/README.md
 [nfs-server]: ../community/modules/file-system/nfs-server/README.md
+[netapp-volumes]: ../modules/file-system/netapp-volume/README.md
diff --git a/examples/README.md b/examples/README.md
index 08866c9e72..44d9c70608 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -63,6 +63,7 @@ md_toc github examples/README.md | sed -e "s/\s-\s/ * /"
    * [xpk-n2-filestore](#xpk-n2-filestore--) ![community-badge] ![experimental-badge]
    * [gke-h4d](#gke-h4d-) ![core-badge]
    * [gke-g4](#gke-g4-) ![core-badge]
+   * [netapp-volumes.yaml](#netapp-volumesyaml-) ![core-badge]
 * [Blueprint Schema](#blueprint-schema)
 * [Writing an HPC Blueprint](#writing-an-hpc-blueprint)
 * [Blueprint Boilerplate](#blueprint-boilerplate)
@@ -1631,6 +1632,79 @@ This blueprint uses GKE to provision a Kubernetes cluster and a G4 node pool, al
 
 [gke-g4]: ../examples/gke-g4
 
+### [netapp-volumes.yaml] ![core-badge]
+
+This blueprint demonstrates how to provision NFS volumes as shared filesystems for compute VMs, using Google Cloud NetApp Volumes. It can be used as an alternative to Filestore in blueprints.
+
+NetApp Volumes is a first-party Google service that provides NFS and/or SMB shared file-systems to VMs. It offers advanced data management capabilities and highly scalable capacity and performance.
+
+NetApp Volumes provides:
+
+* robust support for NFSv3, NFSv4.x and SMB 2.1 and 3.x
+* a [rich feature set][service-levels]
+* scalable [performance](https://cloud.google.com/netapp/volumes/docs/performance/performance-benchmarks)
+* FlexCache: caching of ONTAP-based volumes, giving compute clusters high-throughput, low-latency read access to on-premises data
+* [Auto-tiering](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/manage-auto-tiering) of unused data to optimize cost
+
+Support for NetApp Volumes is split into two modules.
+
+* **netapp-storage-pool** provisions a [storage pool](https://cloud.google.com/netapp/volumes/docs/configure-and-use/storage-pools/overview). Storage pools are pre-provisioned storage capacity containers which host volumes. A pool also defines fundamental properties of all the volumes within, like the region, the attached network, the [service level][service-levels], CMEK encryption, Active Directory and LDAP settings.
+* **netapp-volume** provisions a [volume](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/overview) inside an existing storage pool. A volume is a file-system which is shared using NFS or SMB. It provides advanced data management capabilities.
+
+You can provision multiple volumes in a pool. For the Standard, Premium and Extreme service levels, the throughput capability depends on volume size and service level: every GiB of provisioned volume space adds 16/64/128 KiBps of throughput capability. For example, a 10 TiB (10,240 GiB) volume at the Extreme service level can sustain roughly 10,240 × 128 KiBps, i.e. about 1,280 MiBps.
+
+#### Steps to deploy the blueprint
+
+To provision the blueprint, run:
+
+```shell
+./gcluster create examples/netapp-volumes.yaml --vars "project_id=${GOOGLE_CLOUD_PROJECT}"
+./gcluster deploy netapp-volumes
+```
+
+After the blueprint is deployed, you can log in to the VM it created:
+
+```shell
+gcloud compute ssh --zone "us-central1-a" "netapp-volumes-0" --project ${GOOGLE_CLOUD_PROJECT} --tunnel-through-iap
+```
+
+A NetApp Volumes volume is provisioned and mounted to /home in all the provisioned VMs. A home directory for your user is created automatically:
+
+```shell
+pwd
+df -h -t nfs
+```
+
+#### Clean Up
+
+To destroy all resources associated with this blueprint, run the following command:
+
+```shell
+./gcluster destroy netapp-volumes
+```
+
+[netapp-storage-pool]: ../modules/file-system/netapp-storage-pool/README.md
+[service-levels]: https://cloud.google.com/netapp/volumes/docs/discover/service-levels
+[auto-tiering]: https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/manage-auto-tiering
+[netapp-volumes.yaml]: ../examples/netapp-volumes.yaml
+
+### [eda-all-on-cloud] ![core-badge]
+
+Creates a basic auto-scaling Slurm cluster intended for EDA use cases. The blueprint also creates two new VPC networks: a frontend network which connects VMs, Slurm and storage, and a second network for fast RDMA networking between the H4D nodes. It also creates four [Google Cloud NetApp Volumes](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/overview) mounted to `/home`, `/tools`, `/library` and `/scratch`. There is an `h4d` partition that uses the compute-optimized `h4d-highmem-192-lssd` machine type.
+
+The deployment instructions can be found in the [README](/examples/eda/README.md).
+
+[eda-all-on-cloud]: ../examples/eda/eda-all-on-cloud.yaml
+
+### [eda-hybrid-cloud] ![core-badge]
+
+Creates a basic auto-scaling Slurm cluster intended for EDA use cases. The blueprint connects to an existing frontend network which connects VMs, Slurm and storage, and creates a new RDMA network for low-latency communication between the compute nodes. There is an `h4d` partition that uses the compute-optimized `h4d-highmem-192-lssd` machine type.
+
+Four pre-existing NFS volumes are mounted to `/home`, `/tools`, `/library` and `/scratch`. Using [FlexCache](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/cache-ontap-volumes/overview) volumes allows you to bring on-premises data to Google Cloud compute without manually copying it. This enables "burst to the cloud" use cases.
+
+The deployment instructions can be found in the [README](/examples/eda/README.md).
+
+[eda-hybrid-cloud]: ../examples/eda/eda-hybrid-cloud.yaml
+
 ## Blueprint Schema
 
 Similar documentation can be found on
diff --git a/examples/eda/ClusterToolkit-EDA-AllCloud.png b/examples/eda/ClusterToolkit-EDA-AllCloud.png
new file mode 100644
index 0000000000..f850e5b8b7
Binary files /dev/null and b/examples/eda/ClusterToolkit-EDA-AllCloud.png differ
diff --git a/examples/eda/ClusterToolkit-EDA-Hybrid.png b/examples/eda/ClusterToolkit-EDA-Hybrid.png
new file mode 100644
index 0000000000..f6c3c4131e
Binary files /dev/null and b/examples/eda/ClusterToolkit-EDA-Hybrid.png differ
diff --git a/examples/eda/README.md b/examples/eda/README.md
new file mode 100644
index 0000000000..9c8fc80419
--- /dev/null
+++ b/examples/eda/README.md
@@ -0,0 +1,249 @@
+# Electronic Design Automation (EDA) Reference Architecture
+
+The Electronic Design Automation (EDA) blueprints in
+this folder capture a reference architecture in which the right cloud components
+are assembled to cater optimally to the requirements of EDA workloads.
+
+For file IO, Google Cloud NetApp Volumes NFS storage services are available.
+The service scales from small to high capacity and high performance and provides fan-out
+caching of on-premises ONTAP systems into Google Cloud to enable hybrid cloud
+architectures. The scheduling of the workloads is done by a workload
+manager.
+
+## Architecture
+
+The EDA blueprints are intended to be a starting point for more tailored
+explorations of EDA.
+
+The blueprints feature a general setup suited for EDA applications on
+Google Cloud, including:
+
+- Google Compute Engine partitions
+- Google Cloud NetApp Volumes NFS-based shared storage
+- Slurm workload scheduler
+
+Two example blueprints are provided.
+
+### Blueprint [eda-all-on-cloud](eda-all-on-cloud.yaml)
+
+This blueprint assumes that all compute and data reside in the cloud.
+
+![EDA all-cloud architecture](./ClusterToolkit-EDA-AllCloud.png)
+
+In the setup deployment group (see [deployment stages](#deployment-stages)) it provisions a new network and multiple NetApp Volumes volumes to store your data. Adjust the volume sizes to suit your requirements before deployment. If your volumes are larger than 15 TiB, creating them as [large volumes](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/overview#large-capacity-volumes) adds performance benefits.
+
+The cluster deployment group deploys the compute instances, which are managed by Slurm.
+
+When scaling down the deployment, make sure to only destroy the *cluster* deployment group. If you destroy the *setup* group too, all the volumes will be deleted and you will lose your data.
+
+### Blueprint [eda-hybrid-cloud](./eda-hybrid-cloud.yaml)
+
+This blueprint assumes you are using a pre-existing Google VPC with pre-existing NFS shares on NetApp Volumes, managed outside of Cluster Toolkit.
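+Each pre-existing share is declared with the `pre-existing-network-storage` module, as in this minimal sketch (the IP address and export path are placeholders to replace with your volume's values):
+
+```yaml
+  - id: toolsfs
+    source: modules/file-system/pre-existing-network-storage
+    settings:
+      server_ip: 10.0.0.2    # placeholder: IP address of your NetApp Volumes export
+      remote_mount: /toolsfs # placeholder: export path of the volume
+      local_mount: /tools
+      fs_type: nfs
+```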
+
+![EDA hybrid-cloud architecture](./ClusterToolkit-EDA-Hybrid.png)
+
+The setup deployment group (see [deployment stages](#deployment-stages)) connects to an existing network and mounts multiple NetApp Volumes volumes. This blueprint assumes you have pre-existing volumes for "tools", "libraries", "home" and "scratch". Before deployment, update the `server_ip` and `remote_mount` parameters of the respective volumes in the blueprint declarations to reflect the actual IP and export path of your existing volumes. Using existing volumes also avoids the danger of the volumes being deleted accidentally when you destroy the setup deployment group.
+
+The volumes used can be regular NetApp Volumes [volumes](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/overview), [large volumes](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/overview#large-capacity-volumes) or [FlexCache volumes](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/cache-ontap-volumes/overview).
+
+FlexCache offers the following features which enable bursting on-premises workloads into Google Cloud to use its powerful compute options:
+
+- Read-writable sparse volume
+- Block-level, “pull only” paradigm
+- 100% consistent, coherent, current
+- Write-around
+- LAN-like latencies after first read
+- Fan-out: use multiple caches to scale out the workload
+
+It can accelerate metadata- or throughput-heavy read workloads considerably.
+
+FlexCache and large volumes offer six IP addresses per volume, all of which provide access to the same data. Currently Cluster Toolkit only uses one of these IPs; support for using all six IPs is planned for a later release. To spread your compute nodes over all IPs today, you can use Cloud DNS to create a DNS record with all six IPs and specify that DNS name instead of an individual IP in the blueprint. Cloud DNS will return one of the six IPs in a round-robin fashion on lookups.
+
+The cluster deployment group deploys the compute instances, which are managed by Slurm.
+
+## Getting Started
+
+To explore the reference architecture, you should follow these steps:
+
+Before you start, make sure your prerequisites and dependencies are set up:
+[Set up Cluster Toolkit](https://cloud.google.com/cluster-toolkit/docs/setup/configure-environment).
+
+For deploying the EDA reference blueprints, follow the
+[Deployment Instructions](#deployment-instructions).
+
+### Deployment Stages
+
+This blueprint has the following deployment groups:
+
+- `setup`: Sets up backbone infrastructure such as networking and file systems
+- `software_installation` (_optional_): This deployment group is a stub for
+  custom software installation on the network storage before the cluster is brought up
+- `cluster`: Deploys an auto-scaling cluster
+
+Having multiple deployment groups decouples the life cycle of some
+infrastructure. For example, (a) you can tear down the cluster while leaving the
+storage intact and (b) you can build software before you deploy your cluster.
+
+## Deployment Instructions
+
+> [!WARNING]
+> Installing this blueprint uses the following billable components of Google
+> Cloud:
+>
+> - Compute Engine
+> - NetApp Volumes
+>
+> To avoid continued billing after use, closely follow the
+> [teardown instructions](#teardown-instructions). To generate a cost estimate based on
+> your projected usage, use the [pricing calculator](https://cloud.google.com/products/calculator).
+>
+> [!WARNING]
+> Before attempting to execute the following instructions, it is important to
+> consider your project's quota. The blueprints create an
+> autoscaling cluster that, when fully scaled up, can deploy many powerful VMs.
+>
+> This is merely an example for an instance of this reference architecture.
+> Node counts can easily be adjusted in the blueprint.
+
+1. Clone the repo
+
+   ```bash
+   git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
+   cd cluster-toolkit
+   ```
+
+1. Build the Cluster Toolkit
+
+   ```bash
+   make
+   ```
+
+1. Change parameters in your blueprint file to reflect your requirements. Examples are VPC names for existing networks, H4D nodeset node limits or export paths of existing NFS volumes.
+
+1. Generate the deployment folder after replacing `<blueprint-name>` with the name of the blueprint (`eda-all-on-cloud` or `eda-hybrid-cloud`) and `<project-id>` with the project id.
+
+   ```bash
+   ./gcluster create examples/eda/<blueprint-name>.yaml -w --vars project_id=<project-id>
+   ```
+
+1. Deploy the `setup` group
+
+   Call the following gcluster command to deploy the blueprint.
+
+   ```bash
+   ./gcluster deploy <deployment-name>
+   ```
+
+   The next `gcluster` prompt will ask you to **display**, **apply**, **stop**, or
+   **continue** without applying the `setup` group. Select 'apply'.
+
+   This group will create a network and file systems to be used by the cluster.
+
+   > [!WARNING]
+   > This gcluster command will run through 2 deployment groups (3 if you populate
+   > & activate the `software_installation` stage) and prompt you to apply each one.
+   > If the command is cancelled or exited by accident before finishing, it can
+   > be rerun to continue deploying the blueprint.
+
+1. Deploy the `software_installation` group (_optional_).
+
+   > [!NOTE]
+   > Installation processes differ between applications. Some come as a
+   > precompiled binary with all dependencies included, others may need to
+   > be built from source, while others can be deployed through package
+   > managers such as Spack. This deployment group is intended to be used
+   > if the software installation process requires a substantial amount of time (e.g.
+   > compilation from source). By building the software in a separate
+   > deployment group, this process can be done before the cluster is
+   > up, minimizing costs.
+   >
+   > [!NOTE]
+   > By default, this deployment group is disabled in the reference design. See
+   > [Software Installation Patterns](#software-installation-patterns) for more information.
+
+   If this deployment group is used (it needs to be uncommented in the blueprint first),
+   you can return to the gcluster command which will ask you to **display**, **apply**,
+   **stop**, or **continue** without applying the `software_installation` group.
+   Select 'apply'.
+
+1. Deploy the `cluster` group
+
+   The next `gcluster` prompt will ask you to **display**, **apply**, **stop**, or
+   **continue** without applying the `cluster` group. Select 'apply'.
+
+   This deployment group contains the Slurm cluster and compute partitions.
+
+## Teardown Instructions
+
+> [!NOTE]
+> If you created a new project for testing of the EDA solution, the easiest way to
+> eliminate billing is to delete the project.
+
+When you would like to tear down the deployment, each stage must be destroyed.
+Since the `software_installation` and `cluster` stages depend on the network deployed
+in the `setup` stage, they must be destroyed first. You can use the following
+command to destroy the deployment in this reverse order. You will be prompted
+to confirm the deletion of each stage.
+
+```bash
+./gcluster destroy <deployment-name>
+```
+
+> [!WARNING]
+> If you do not destroy all three deployment groups then there may be continued
+> associated costs.
+
+## Software Installation Patterns
+
+This section is intended to illustrate how software can be installed in the context
+of the EDA reference solution.
+
+Depending on the software you want to use, different installation paths may be required.
+
+- **Installation with binary**
+  Commercial-off-the-shelf applications typically come with precompiled binaries which
+  are provided by the ISV. If you do not share them using the toolsfs or libraryfs shares,
+  you can install software using the following method.
+
+  In general, you need to bring the binaries to your EDA cluster. A Google Cloud Storage
+  bucket is useful for this, as it is accessible from any machine using the
+  gsutil command and can be mounted in the cluster.
+
+  As this installation process only needs to be done once and may take some time,
+  we recommend doing this installation in a separate deployment group before you bring up the cluster.
+  The `software_installation` stage is meant to accommodate this. You can, for example, bring up
+  a dedicated VM
+
+  ``` {.yaml}
+  - id: sw-installer-vm
+    source: modules/compute/vm-instance
+    use: [frontend-network, toolsfs]
+    settings:
+      name_prefix: sw-installer
+      add_deployment_name_before_prefix: true
+      threads_per_core: 2
+      machine_type: c2-standard-16
+  ```
+
+  on which you can follow the installation steps manually. Alternatively, the process
+  can be automated using the toolkit's
+  [startup-script](../../modules/scripts/startup-script/README.md) module.
+
+  Once that is completed, the software will persist on the NFS share for as long as you
+  do not destroy the `setup` stage.
+
+- **Installation from source/with package manager**
+  For open source software, you may want to compile the software from scratch or use a
+  package manager such as Spack for the installation. This process typically takes
+  a non-negligible amount of time (~hours). We therefore strongly suggest using
+  the `software_installation` stage for this purpose.
+
+  Please see the [HCLS Blueprint](../../docs/videos/healthcare-and-life-sciences/README.md) example
+  for how the `software_installation` stage can use the Spack package manager
+  to install all dependencies for a particular version of the software, including compiling
+  the software or its dependencies from source.
+
+  Please also see the [OpenFOAM](../../docs/tutorials/openfoam/spack-openfoam.md) example
+  for how this can be used to install the OpenFOAM software.
+
+  Once that is completed, the software will persist on the NFS share for as long as you
+  do not destroy the `setup` stage.
diff --git a/examples/eda/eda-all-on-cloud.yaml b/examples/eda/eda-all-on-cloud.yaml
new file mode 100644
index 0000000000..f0b576bc05
--- /dev/null
+++ b/examples/eda/eda-all-on-cloud.yaml
@@ -0,0 +1,238 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and +# limitations under the License. + +--- + +blueprint_name: eda-all-on-cloud + +vars: + project_id: ## Set GCP Project ID Here ## + deployment_name: eda-all-on-cloud + region: us-central1 + zone: us-central1-a + rdma_net_range: 192.168.128.0/18 + +# Documentation for each of the modules used below can be found at +# https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/modules/README.md + +deployment_groups: + +# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +# +# +# Deployment Group: Setup +# +# Sets up VPC network, persistent NFS shares +# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +- group: setup + modules: + # Frontend network for GCE, NetApp Volumes and other services + - id: frontend-network + source: modules/network/vpc + + # Backend RDMA network for GCE instances with RDMA capabilities + - id: backend-network + source: modules/network/vpc + settings: + network_name: $(vars.deployment_name)-rdma-net-0 + mtu: 8896 + network_profile: https://www.googleapis.com/compute/beta/projects/$(vars.project_id)/global/networkProfiles/$(vars.zone)-vpc-falcon + network_routing_mode: REGIONAL + enable_cloud_router: false + enable_cloud_nat: false + enable_internal_traffic: false + subnetworks: + - subnet_name: $(vars.deployment_name)-rdma-sub-0 + subnet_region: $(vars.region) + subnet_ip: $(vars.rdma_net_range) + region: $(vars.region) + + # PSA is required for Google Cloud NetApp Volumes. + # Private Service Access (PSA) requires the compute.networkAdmin role which is + # included in the Owner role, but not Editor. + # https://cloud.google.com/vpc/docs/configure-private-services-access#permissions + - id: private_service_access + source: community/modules/network/private-service-access + use: [frontend-network] + settings: + prefix_length: 24 + service_name: "netapp.servicenetworking.goog" + deletion_policy: "ABANDON" + + # NetApp Storage Pool. All NetApp Volumes will be created in this pool. 
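+  # Sizing note: the 4096 GiB pool below is fully consumed by the four
+  # 1024 GiB volumes in this group. Grow the pool before growing any volume;
+  # volume throughput scales with volume size and service level.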
+  - id: netapp_pool
+    source: modules/file-system/netapp-storage-pool
+    use: [frontend-network, private_service_access]
+    settings:
+      pool_name: "eda-pool"
+      network_id: $(frontend-network.network_id)
+      capacity_gib: 4096
+      service_level: "EXTREME"
+      region: $(vars.region)
+      # allow_auto_tiering: true
+
+  # NFS volume for shared tools and utilities
+  - id: toolsfs
+    source: modules/file-system/netapp-volume
+    use: [netapp_pool]
+    settings:
+      region: $(vars.region)
+      volume_name: "toolsfs"
+      capacity_gib: 1024 # Adjust size as needed
+      large_capacity: false
+      local_mount: "/tools"
+      protocols: ["NFSV3"]
+      unix_permissions: "0777"
+      # Mount options are optimized for aggressive caching, assuming rare changes on the volume
+      mount_options: "nocto,actimeo=600,hard,rsize=262144,wsize=262144,vers=3,tcp,mountproto=tcp"
+
+  # NFS volume for shared libraries
+  - id: libraryfs
+    source: modules/file-system/netapp-volume
+    use: [netapp_pool]
+    settings:
+      region: $(vars.region)
+      volume_name: "libraryfs"
+      capacity_gib: 1024 # Adjust size as needed
+      large_capacity: false
+      local_mount: "/library"
+      protocols: ["NFSV3"]
+      unix_permissions: "0777"
+      # Mount options are optimized for aggressive caching, assuming rare changes on the volume
+      mount_options: "nocto,actimeo=600,hard,rsize=262144,wsize=262144,vers=3,tcp,mountproto=tcp"
+
+  # NFS volume for home directories
+  - id: homefs
+    source: modules/file-system/netapp-volume
+    use: [netapp_pool]
+    settings:
+      region: $(vars.region)
+      volume_name: "homefs"
+      capacity_gib: 1024 # Adjust size as needed
+      large_capacity: false
+      local_mount: "/home"
+      protocols: ["NFSV3"]
+      unix_permissions: "0777"
+
+  # NFS volume for scratch space
+  - id: scratchfs
+    source: modules/file-system/netapp-volume
+    use: [netapp_pool]
+    settings:
+      region: $(vars.region)
+      volume_name: "scratchfs"
+      capacity_gib: 1024 # Adjust size as needed
+      large_capacity: false
+      local_mount: "/scratch"
+      protocols: ["NFSV3"]
+      unix_permissions: "0777"
+
+# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+#
+#
+# Deployment Group: Software Installation
+#
+# This deployment group is a stub for installing software before
+# bringing up the actual cluster.
+# See the README.md for useful software deployment patterns.
+#
+# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+# - group: software_installation
+#   modules:
+
+# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+#
+#
+# Deployment Group: Cluster
+#
+# Provisions the actual EDA cluster with compute partitions and
+# connects to the previously set up NFS shares.
+# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +- group: cluster + modules: + - id: h4d_startup + source: modules/scripts/startup-script + settings: + set_ofi_cloud_rdma_tunables: true + local_ssd_filesystem: + fs_type: ext4 + mountpoint: /mnt/lssd + permissions: "1777" + + - id: h4d_nodeset + source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset + use: + - h4d_startup + - frontend-network + - homefs + - toolsfs + - libraryfs + - scratchfs + settings: + bandwidth_tier: gvnic_enabled + machine_type: h4d-highmem-192-lssd + node_count_static: 1 # Adjust as needed + node_count_dynamic_max: 0 # Adjust as needed + enable_placement: false + disk_type: hyperdisk-balanced + on_host_maintenance: TERMINATE + additional_networks: + $(concat( + [{ + network=null, + subnetwork=backend-network.subnetwork_self_link, + subnetwork_project=vars.project_id, + nic_type="IRDMA", + queue_count=null, + network_ip=null, + stack_type=null, + access_config=null, + ipv6_access_config=[], + alias_ip_range=[] + }] + )) + + - id: h4d_partition + source: community/modules/compute/schedmd-slurm-gcp-v6-partition + use: + - h4d_nodeset + settings: + exclusive: false + partition_name: h4d + is_default: true + partition_conf: + ResumeTimeout: 900 + SuspendTimeout: 600 + + - id: slurm_login + source: community/modules/scheduler/schedmd-slurm-gcp-v6-login + use: [frontend-network] + settings: + machine_type: n2-standard-4 + enable_login_public_ips: true + + - id: slurm_controller + source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller + use: + - frontend-network + - h4d_partition + - slurm_login + - homefs + - toolsfs + - libraryfs + - scratchfs + settings: + enable_controller_public_ips: true + cloud_parameters: + slurmd_timeout: 900 diff --git a/examples/eda/eda-hybrid-cloud.yaml b/examples/eda/eda-hybrid-cloud.yaml new file mode 100644 index 0000000000..0733a9e577 --- /dev/null +++ b/examples/eda/eda-hybrid-cloud.yaml @@ -0,0 +1,235 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +--- + +blueprint_name: eda-hybrid-cloud + +vars: + project_id: ## Set GCP Project ID Here ## + deployment_name: eda-hybrid-cloud + region: us-central1 + zone: us-central1-a + network: default + rdma_net_range: 192.168.128.0/18 + +# Documentation for each of the modules used below can be found at +# https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/modules/README.md + +deployment_groups: + +# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +# +# +# Deployment Group: Setup +# +# Sets up VPC network, persistent NFS shares +# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +- group: setup + modules: + # Frontend network for GCE, NetApp Volumes and other services. Make sure it has internet access. 
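+  # Note: this pre-existing VPC must already be peered with the NetApp Volumes
+  # service via Private Service Access (PSA), and the pre-existing volumes
+  # mounted below must be reachable from this network.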
+  - id: frontend-network
+    source: modules/network/pre-existing-vpc
+    settings:
+      project_id: $(vars.project_id)
+      region: $(vars.region)
+      network_name: $(vars.network)
+
+  - id: firewall-rule-frontend
+    source: modules/network/firewall-rules
+    use:
+    - frontend-network
+    settings:
+      ingress_rules:
+      - name: $(vars.deployment_name)-allow-internal-traffic
+        description: Allow internal traffic
+        destination_ranges:
+        - $(frontend-network.subnetwork_address)
+        source_ranges:
+        - $(frontend-network.subnetwork_address)
+        allow:
+        - protocol: tcp
+          ports:
+          - 0-65535
+        - protocol: udp
+          ports:
+          - 0-65535
+        - protocol: icmp
+      - name: $(vars.deployment_name)-allow-iap-ssh
+        description: Allow IAP-tunneled SSH connections
+        destination_ranges:
+        - $(frontend-network.subnetwork_address)
+        source_ranges:
+        - 35.235.240.0/20
+        allow:
+        - protocol: tcp
+          ports:
+          - 22
+
+  # Backend RDMA network for GCE instances with RDMA capabilities
+  - id: backend-network
+    source: modules/network/vpc
+    settings:
+      network_name: $(vars.deployment_name)-rdma-net-0
+      mtu: 8896
+      network_profile: https://www.googleapis.com/compute/beta/projects/$(vars.project_id)/global/networkProfiles/$(vars.zone)-vpc-falcon
+      network_routing_mode: REGIONAL
+      enable_cloud_router: false
+      enable_cloud_nat: false
+      enable_internal_traffic: false
+      subnetworks:
+      - subnet_name: $(vars.deployment_name)-rdma-sub-0
+        subnet_region: $(vars.region)
+        subnet_ip: $(vars.rdma_net_range)
+      region: $(vars.region)
+
+# Connect existing Google Cloud NetApp Volumes
+# Replace server_ip, remote_mount, and local_mount values as needed for toolsfs, libraryfs, homefs, scratchfs
+# Make sure the root inode of each volume has appropriate permissions for intended users, otherwise Slurm jobs may fail
+  - id: toolsfs
+    source: modules/file-system/pre-existing-network-storage
+    settings:
+      server_ip: # Set IP address of the NFS server here
+      remote_mount: # Set exported path of NFS share here
+      local_mount: /tools
+      fs_type: nfs
+      # Mount options are optimized for aggressive caching, assuming rare changes on the volume
+      mount_options: "nocto,actimeo=600,hard,rsize=262144,wsize=262144,vers=3,tcp,mountproto=tcp"
+
+  - id: libraryfs
+    source: modules/file-system/pre-existing-network-storage
+    settings:
+      server_ip: # Set IP address of the NFS server here
+      remote_mount: # Set exported path of NFS share here
+      local_mount: /library
+      fs_type: nfs
+      # Mount options are optimized for aggressive caching, assuming rare changes on the volume
+      mount_options: "nocto,actimeo=600,hard,rsize=262144,wsize=262144,vers=3,tcp,mountproto=tcp"
+
+  - id: homefs
+    source: modules/file-system/pre-existing-network-storage
+    settings:
+      server_ip: # Set IP address of the NFS server here
+      remote_mount: # Set exported path of NFS share here
+      local_mount: /home
+      fs_type: nfs
+      mount_options: "hard,rsize=262144,wsize=262144,vers=3,tcp,mountproto=tcp"
+
+  - id: scratchfs
+    source: modules/file-system/pre-existing-network-storage
+    settings:
+      server_ip: # Set IP address of the NFS server here
+      remote_mount: # Set exported path of NFS share here
+      local_mount: /scratch
+      fs_type: nfs
+      mount_options: "hard,rsize=262144,wsize=262144,vers=3,tcp,mountproto=tcp"
+
+# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+#
+#
+# Deployment Group: Software Installation
+#
+# This deployment group is a stub for installing software before
+# bringing up the actual cluster.
+# See the README.md for useful software deployment patterns.
+#
+# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+# - group: software_installation
+#   modules:

+# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+#
+#
+# Deployment Group: Cluster
+#
+# Provisions the actual EDA cluster with compute partitions and
+# connects to the previously set up NFS shares.
+# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+- group: cluster
+  modules:
+  - id: h4d_startup
+    source: modules/scripts/startup-script
+    settings:
+      set_ofi_cloud_rdma_tunables: true
+      local_ssd_filesystem:
+        fs_type: ext4
+        mountpoint: /mnt/lssd
+        permissions: "1777"
+
+  - id: h4d_nodeset
+    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
+    use:
+    - h4d_startup
+    - frontend-network
+    - homefs
+    - toolsfs
+    - libraryfs
+    - scratchfs
+    settings:
+      bandwidth_tier: gvnic_enabled
+      machine_type: h4d-highmem-192-lssd
+      node_count_static: 1 # Adjust as needed
+      node_count_dynamic_max: 0 # Adjust as needed
+      enable_placement: false
+      disk_type: hyperdisk-balanced
+      on_host_maintenance: TERMINATE
+      additional_networks:
+        $(concat(
+          [{
+            network=null,
+            subnetwork=backend-network.subnetwork_self_link,
+            subnetwork_project=vars.project_id,
+            nic_type="IRDMA",
+            queue_count=null,
+            network_ip=null,
+            stack_type=null,
+            access_config=null,
+            ipv6_access_config=[],
+            alias_ip_range=[]
+          }]
+        ))
+
+  - id: h4d_partition
+    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
+    use:
+    - h4d_nodeset
+    settings:
+      exclusive: false
+      partition_name: h4d
+      is_default: true
+      partition_conf:
+        ResumeTimeout: 900
+        SuspendTimeout: 600
+
+  - id: slurm_login
+    source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
+    use: [frontend-network]
+    settings:
+      machine_type: n2-standard-4
+      enable_login_public_ips: true
+
+  - id: slurm_controller
+    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
+    use:
+    - frontend-network
+    - h4d_partition
+    - slurm_login
+    - homefs
+    - toolsfs
+    - libraryfs
+    - scratchfs
+    settings:
+      enable_controller_public_ips: true
+      cloud_parameters:
+        slurmd_timeout: 900
diff --git a/examples/netapp-volumes.yaml b/examples/netapp-volumes.yaml
new file mode 100644
index 0000000000..c2c5d85382
--- /dev/null
+++ b/examples/netapp-volumes.yaml
@@ -0,0 +1,110 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+---
+
+# This blueprint shows how to provision shared file systems with Google Cloud NetApp Volumes.
+# It creates a NetApp storage pool and a volume for use by VM instances.
+# It can be used to build compute clusters on top of it and as a drop-in replacement
+# for Filestore in existing blueprints.
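+#
+# Note: NetApp Volumes requires Private Service Access (PSA) on the VPC, which
+# the private_service_access module below sets up, and a storage pool of at
+# least 2048 GiB (the minimum enforced by the netapp-storage-pool module).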
+
+blueprint_name: netapp-volumes
+
+vars:
+  project_id:  ## Set GCP Project ID Here ##
+  deployment_name: netapp-volumes
+  region: us-central1
+  zone: us-central1-a
+  pool_service_level: "EXTREME" # Options: "STANDARD", "PREMIUM", "EXTREME"
+
+# Documentation for each of the modules used below can be found at
+# https://github.com/GoogleCloudPlatform/hpc-toolkit
+
+deployment_groups:
+- group: primary
+  modules:
+  # Source is an embedded module, denoted by "modules/*" without ./, ../, /
+  # as a prefix. To refer to a local module, prefix with ./, ../ or /
+  - id: network
+    source: modules/network/vpc
+
+  # Private Service Access (PSA) requires the compute.networkAdmin role which is
+  # included in the Owner role, but not Editor.
+  # https://cloud.google.com/vpc/docs/configure-private-services-access#permissions
+  - id: private_service_access
+    source: community/modules/network/private-service-access
+    use: [network]
+    settings:
+      prefix_length: 24
+      service_name: "netapp.servicenetworking.goog"
+      deletion_policy: "ABANDON"
+
+  - id: netapp_pool
+    source: modules/file-system/netapp-storage-pool
+    use: [network, private_service_access]
+    settings:
+      pool_name: "netapp-pool"
+      network_id: $(network.network_id)
+      capacity_gib: 2048
+      service_level: $(vars.pool_service_level)
+      region: $(vars.region)
+      # allow_auto_tiering: true
+
+  - id: homefs
+    source: modules/file-system/netapp-volume
+    use: [netapp_pool] # Create this pool using the netapp-storage-pool module
+    settings:
+      region: $(vars.region)
+      volume_name: "homefs"
+      capacity_gib: 2048 # Size up to available capacity in the pool
+      large_capacity: false
+      local_mount: "/home" # Mount point at client when client uses USE directive
+      # mount_options: "..." # Use custom mount_options for special use cases. Defaults are sane.
+      protocols: ["NFSV3"] # List of protocols: ["NFSV3"], ["NFSV4"] or ["NFSV3", "NFSV4"]
+      unix_permissions: "0777" # Specify default permissions for root inode owned by root:root
+      # If no export policy is specified, a permissive default policy will be applied, which is:
+      #   allowed_clients = "10.0.0.0/8,172.16.0.0/12,192.168.0.0/16" # RFC1918
+      #   has_root_access = true # no_root_squash enabled
+      #   access_type = "READ_WRITE"
+      # export_policy_rules:
+      #   - allowed_clients: "10.10.20.8,10.10.20.9"
+      #     has_root_access: true # no_root_squash enabled
+      #     access_type: "READ_WRITE"
+      #     nfsv3: true
+      #     nfsv4: false
+      #   - allowed_clients: "10.0.0.0/8"
+      #     has_root_access: false # no_root_squash disabled
+      #     access_type: "READ_WRITE"
+      #     nfsv3: true
+      #     nfsv4: false
+      # tiering_policy: # Enable auto-tiering. Requires an auto-tiering-enabled storage pool
+      #   tier_action: "ENABLED"
+      #   cooling_threshold_days: 31 # tier data blocks which have not been touched for 31 days
+      # description: "Shared volume for EDA job"
+      # labels:
+      #   department: eda
+
+  # Example VMs which use homefs
+  - id: gcnv_ubuntu_instances
+    source: modules/compute/vm-instance
+    use: [network, homefs]
+    settings:
+      instance_count: 1
+      machine_type: n2-standard-2
+
+  - id: wait-for-vms
+    source: community/modules/scripts/wait-for-startup
+    settings:
+      instance_names: $(gcnv_ubuntu_instances.name)
+      timeout: 7200
diff --git a/modules/file-system/netapp-storage-pool/README.md b/modules/file-system/netapp-storage-pool/README.md
new file mode 100644
index 0000000000..d654458c9c
--- /dev/null
+++ b/modules/file-system/netapp-storage-pool/README.md
@@ -0,0 +1,195 @@
+## Description
+
+This module creates a [Google Cloud NetApp Volumes](https://cloud.google.com/netapp/volumes/docs/discover/overview)
+storage pool.
+
+NetApp Volumes is a first-party Google service that provides NFS and/or SMB shared file-systems to VMs. It offers advanced data management capabilities and highly scalable capacity and performance.
+NetApp Volumes provides:
+
+- robust support for NFSv3, NFSv4.x and SMB 2.1 and 3.x
+- a [rich feature set][service-levels]
+- scalable [performance](https://cloud.google.com/netapp/volumes/docs/performance/performance-benchmarks)
+- FlexCache: caching of ONTAP-based volumes, giving compute clusters high-throughput, low-latency read access to on-premises data
+- [Auto-tiering](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/manage-auto-tiering) of unused data to optimize cost
+
+Support for NetApp Volumes is split into two modules.
+
+- **netapp-storage-pool** provisions a [storage pool](https://cloud.google.com/netapp/volumes/docs/configure-and-use/storage-pools/overview). Storage pools are pre-provisioned storage capacity containers which host volumes. A pool also defines fundamental properties of all the volumes within, like the region, the attached network, the [service level][service-levels], CMEK encryption, Active Directory and LDAP settings.
+- **netapp-volume** provisions a [volume](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/overview) inside an existing storage pool. A volume is a file-system container which is shared using NFS or SMB. It provides advanced data management capabilities.
+
+For more information on this and other network storage options in the Cluster
+Toolkit, see the extended [Network Storage documentation](../../../docs/network_storage.md).
+
+### NetApp storage pool service levels
+
+The netapp-storage-pool module currently supports the following NetApp Volumes [service levels][service-levels]:
+
+- Standard: 16 KiBps throughput per provisioned GiB of volume capacity.
+- Premium: 64 KiBps throughput per provisioned GiB of volume capacity. Optional [auto-tiering].
+- Extreme: 128 KiBps throughput per provisioned GiB of volume capacity. Optional [auto-tiering].
+
+Check the [service level matrix][service-levels] for additional information on capability differences between service levels. The Flex service level is currently not supported, but you can connect to existing Flex volumes using the [pre-existing-network-storage module][pre-existing].
+
+### On-boarding NetApp Volumes
+
+NetApp Volumes uses [Private Service Access](https://cloud.google.com/vpc/docs/private-services-access) (PSA) to connect volumes to your network.
+Before you create a storage pool, make sure to [connect NetApp Volumes to your network](https://cloud.google.com/netapp/volumes/docs/get-started/configure-access/networking).
+
+Example of creating a storage pool using a new network:
+
+```yaml
+deployment_groups:
+- group: primary
+  modules:
+  - id: network
+    source: modules/network/vpc
+    settings:
+      region: $(vars.region)
+
+  - id: private_service_access
+    source: community/modules/network/private-service-access
+    use: [network]
+    settings:
+      prefix_length: 24
+      service_name: "netapp.servicenetworking.goog"
+      deletion_policy: "ABANDON"
+
+  - id: netapp_pool
+    source: modules/file-system/netapp-storage-pool
+    use: [network, private_service_access]
+    settings:
+      pool_name: "eda-pool"
+      network_id: $(network.network_id)
+      capacity_gib: 20000
+      service_level: "EXTREME"
+      region: $(vars.region)
+```
+
+Example of creating a storage pool using an existing network which was already PSA-peered with NetApp Volumes:
+
+```yaml
+deployment_groups:
+- group: primary
+  modules:
+  - id: network
+    source: modules/network/pre-existing-vpc
+    settings:
+      project_id: $(vars.project_id)
+      region: $(vars.region)
+      network_name: $(vars.network)
+
+  - id: netapp_pool
+    source: modules/file-system/netapp-storage-pool
+    use: [network]
+    settings:
+      pool_name: "eda-pool"
+      network_id: $(network.network_id)
+      capacity_gib: 20000
+      service_level: "EXTREME"
+      region: $(vars.region)
+```
+
+### Storage pool example
+
+The following example shows all available parameters in use:
+
+```yaml
+  - id: netapp_pool
+    source: modules/file-system/netapp-storage-pool
+    use: [network, private_service_access]
+    settings:
+      pool_name: "mypool"
+      region: "us-west4"
+      capacity_gib: 2048
+      service_level: "EXTREME"
+      active_directory_policy: "projects/myproject/locations/us-west4/activeDirectories/my-ad"
+      cmek_policy: "projects/myproject/locations/us-west4/kmsConfigs/my-cmek-policy"
+      ldap_enabled: false
+      allow_auto_tiering: false
+      description: "Demo storage pool"
+      labels:
+        owner: bob
+```
+
+### NetApp Volumes quota
+
+Your project must have unused quota for NetApp Volumes in the region you will
+provision the storage pool. This can be found by browsing to the [Quota tab within IAM & Admin](https://console.cloud.google.com/iam-admin/quotas) in the Cloud Console.
+Please note that there are separate quota limits for the Standard and Premium/Extreme service levels.
+
+See also NetApp Volumes [default quotas](https://cloud.google.com/netapp/volumes/docs/quotas#netapp-volumes-default-quotas).
+
+[service-levels]: https://cloud.google.com/netapp/volumes/docs/discover/service-levels
+[auto-tiering]: https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/manage-auto-tiering
+[pre-existing]: ../pre-existing-network-storage/README.md
+[matrix]: ../../../docs/network_storage.md#compatibility-matrix
+
+## License
+
+Copyright 2025 Google LLC
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+
+## Requirements
+
+| Name | Version |
+|------|---------|
+| [terraform](#requirement\_terraform) | >= 1.9.0 |
+| [google](#requirement\_google) | >= 6.45.0 |
+| [random](#requirement\_random) | ~> 3.0 |
+
+## Providers
+
+| Name | Version |
+|------|---------|
+| [google](#provider\_google) | >= 6.45.0 |
+| [random](#provider\_random) | ~> 3.0 |
+
+## Modules
+
+No modules.
+
+## Resources
+
+| Name | Type |
+|------|------|
+| [google_netapp_storage_pool.netapp_storage_pool](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/netapp_storage_pool) | resource |
+| [random_id.resource_name_suffix](https://registry.terraform.io/providers/hashicorp/random/latest/docs/resources/id) | resource |
+| [google_compute_network_peering.private_peering](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_network_peering) | data source |
+
+## Inputs
+
+| Name | Description | Type | Default | Required |
+|------|-------------|------|---------|:--------:|
+| [active\_directory\_policy](#input\_active\_directory\_policy) | The ID of the Active Directory policy to apply to the storage pool in the format: `projects/<project>/locations/<location>/activeDirectoryPolicies/<policy>` | `string` | `null` | no |
+| [allow\_auto\_tiering](#input\_allow\_auto\_tiering) | Whether to allow automatic tiering for the storage pool. | `bool` | `false` | no |
+| [capacity\_gib](#input\_capacity\_gib) | The capacity of the storage pool in GiB. | `number` | `2048` | no |
+| [cmek\_policy](#input\_cmek\_policy) | The ID of the Customer Managed Encryption Key (CMEK) policy to apply to the storage pool in the format: `projects/<project>/locations/<location>/kmsConfigs/<config>` | `string` | `null` | no |
+| [deployment\_name](#input\_deployment\_name) | Name of the deployment, used as name of the NetApp storage pool if no name is specified. | `string` | n/a | yes |
+| [description](#input\_description) | A description of the NetApp storage pool. | `string` | `""` | no |
+| [labels](#input\_labels) | Labels to add to the NetApp storage pool. Key-value pairs. | `map(string)` | n/a | yes |
+| [ldap\_enabled](#input\_ldap\_enabled) | Whether to enable LDAP for the storage pool. | `bool` | `false` | no |
+| [network\_id](#input\_network\_id) | The ID of the GCE VPC network to which the NetApp storage pool is connected given in the format: `projects/<project_id>/global/networks/<network_name>` | `string` | n/a | yes |
+| [network\_self\_link](#input\_network\_self\_link) | Network self-link the pool will be on, required for checking private service access | `string` | n/a | yes |
+| [pool\_name](#input\_pool\_name) | The name of the storage pool. Leave empty to use a generated name based on the deployment name. | `string` | `null` | no |
+| [private\_vpc\_connection\_peering](#input\_private\_vpc\_connection\_peering) | The name of the private VPC connection peering. | `string` | `"sn-netapp-prod"` | no |
+| [project\_id](#input\_project\_id) | ID of project in which the NetApp storage pool will be created. | `string` | n/a | yes |
+| [region](#input\_region) | Location for NetApp storage pool. | `string` | n/a | yes |
+| [service\_level](#input\_service\_level) | The service level of the storage pool. | `string` | `"PREMIUM"` | no |
+
+## Outputs
+
+| Name | Description |
+|------|-------------|
+| [capacity\_gb](#output\_capacity\_gb) | Storage pool capacity in GiB. |
+| [netapp\_storage\_pool\_id](#output\_netapp\_storage\_pool\_id) | An identifier for the resource with format `projects/{{project}}/locations/{{location}}/storagePools/{{name}}` |
+
diff --git a/modules/file-system/netapp-storage-pool/main.tf b/modules/file-system/netapp-storage-pool/main.tf
new file mode 100644
index 0000000000..b9d63c11c3
--- /dev/null
+++ b/modules/file-system/netapp-storage-pool/main.tf
@@ -0,0 +1,56 @@
+/**
+ * Copyright 2025 Google LLC
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+locals {
+  # This label allows for billing report tracking based on module.
+  labels = merge(var.labels, { ghpc_module = "netapp-storage-pool", ghpc_role = "file-system" })
+}
+
+resource "random_id" "resource_name_suffix" {
+  byte_length = 4
+}
+
+data "google_compute_network_peering" "private_peering" {
+  name    = var.private_vpc_connection_peering
+  network = var.network_self_link
+}
+
+resource "google_netapp_storage_pool" "netapp_storage_pool" {
+  project = var.project_id
+
+  name          = var.pool_name != null ? var.pool_name : "${var.deployment_name}-${random_id.resource_name_suffix.hex}"
+  location      = var.region
+  network       = var.network_id
+  service_level = var.service_level
+  capacity_gib  = var.capacity_gib
+
+  active_directory   = var.active_directory_policy
+  kms_config         = var.cmek_policy
+  ldap_enabled       = var.ldap_enabled
+  allow_auto_tiering = var.allow_auto_tiering
+
+  description = var.description
+  labels      = local.labels
+
+  depends_on = [data.google_compute_network_peering.private_peering]
+
+  lifecycle {
+    precondition {
+      condition     = data.google_compute_network_peering.private_peering.state == "ACTIVE"
+      error_message = "The network for the storage pool must have private service access."
+    }
+  }
+}
diff --git a/modules/file-system/netapp-storage-pool/metadata.yaml b/modules/file-system/netapp-storage-pool/metadata.yaml
new file mode 100644
index 0000000000..7a5291f9d5
--- /dev/null
+++ b/modules/file-system/netapp-storage-pool/metadata.yaml
@@ -0,0 +1,20 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+---
+
+spec:
+  requirements:
+    services:
+    - netapp.googleapis.com
+    - servicenetworking.googleapis.com
diff --git a/modules/file-system/netapp-storage-pool/outputs.tf b/modules/file-system/netapp-storage-pool/outputs.tf
new file mode 100644
index 0000000000..91379631c6
--- /dev/null
+++ b/modules/file-system/netapp-storage-pool/outputs.tf
@@ -0,0 +1,23 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+output "netapp_storage_pool_id" {
+  description = "An identifier for the resource with format `projects/{{project}}/locations/{{location}}/storagePools/{{name}}`"
+  value       = google_netapp_storage_pool.netapp_storage_pool.id
+}
+
+output "capacity_gb" {
+  description = "Storage pool capacity in GiB."
+  value       = google_netapp_storage_pool.netapp_storage_pool.capacity_gib
+}
diff --git a/modules/file-system/netapp-storage-pool/variables.tf b/modules/file-system/netapp-storage-pool/variables.tf
new file mode 100644
index 0000000000..6e8469c5a7
--- /dev/null
+++ b/modules/file-system/netapp-storage-pool/variables.tf
@@ -0,0 +1,133 @@
+/**
+ * Copyright 2025 Google LLC
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+variable "project_id" {
+  description = "ID of project in which the NetApp storage pool will be created."
+  type        = string
+}
+
+variable "deployment_name" {
+  description = "Name of the deployment, used as name of the NetApp storage pool if no name is specified."
+  type        = string
+}
+
+variable "region" {
+  description = "Location for NetApp storage pool."
+  type        = string
+}
+
+variable "network_id" {
+  description = <<-EOT
+    The ID of the GCE VPC network to which the NetApp storage pool is connected given in the format:
+    `projects/<project_id>/global/networks/<network_name>`
+  EOT
+  type        = string
+  validation {
+    condition     = length(split("/", var.network_id)) == 5
+    error_message = "The network id must be provided in the following format: projects/<project_id>/global/networks/<network_name>."
+  }
+}
+
+variable "network_self_link" {
+  description = "Network self-link the pool will be on, required for checking private service access"
+  type        = string
+  nullable    = false
+}
+
+variable "private_vpc_connection_peering" {
+  description = "The name of the private VPC connection peering."
+  type        = string
+  default     = "sn-netapp-prod"
+}
+
+variable "pool_name" {
+  description = "The name of the storage pool. Leave empty to use a generated name based on the deployment name."
+  type        = string
+  default     = null
+}
+
+variable "service_level" {
+  description = "The service level of the storage pool."
+  type        = string
+  default     = "PREMIUM"
+  validation {
+    condition     = contains(["STANDARD", "PREMIUM", "EXTREME"], var.service_level)
+    error_message = "Allowed values for service_level are 'STANDARD', 'PREMIUM', or 'EXTREME'."
+  }
+}
+
+variable "capacity_gib" {
+  description = "The capacity of the storage pool in GiB."
+  type        = number
+  default     = 2048
+  validation {
+    condition     = var.capacity_gib >= 2048
+    error_message = "The minimum capacity for the storage pool is 2048 GiB."
+  }
+}
+
+variable "active_directory_policy" {
+  description = <<-EOT
+    The ID of the Active Directory policy to apply to the storage pool in the format:
+    `projects/<project>/locations/<location>/activeDirectoryPolicies/<policy>`
+  EOT
+  type        = string
+  default     = null
+  validation {
+    condition     = var.active_directory_policy == null ? true : length(split("/", var.active_directory_policy)) == 6
+    error_message = "The active directory policy must be provided in the following format: projects/<project>/locations/<location>/activeDirectoryPolicies/<policy>."
+  }
+}
+
+variable "cmek_policy" {
+  description = <<-EOT
+    The ID of the Customer Managed Encryption Key (CMEK) policy to apply to the storage pool in the format:
+    `projects/<project>/locations/<location>/kmsConfigs/<config>`
+  EOT
+  type        = string
+  default     = null
+  validation {
+    condition     = var.cmek_policy == null ? true : length(split("/", var.cmek_policy)) == 6
+    error_message = "The CMEK policy must be provided in the following format: projects/<project>/locations/<location>/kmsConfigs/<config>."
+  }
+}
+
+variable "ldap_enabled" {
+  description = "Whether to enable LDAP for the storage pool."
+  type        = bool
+  default     = false
+}
+
+variable "allow_auto_tiering" {
+  description = "Whether to allow automatic tiering for the storage pool."
+  type        = bool
+  default     = false
+}
+
+variable "description" {
+  description = "A description of the NetApp storage pool."
+  type        = string
+  default     = ""
+  validation {
+    condition     = length(var.description) <= 2048
+    error_message = "NetApp storage pool description must be 2048 characters or fewer."
+  }
+}
+
+variable "labels" {
+  description = "Labels to add to the NetApp storage pool. Key-value pairs."
+  type        = map(string)
+}
diff --git a/modules/file-system/netapp-storage-pool/versions.tf b/modules/file-system/netapp-storage-pool/versions.tf
new file mode 100644
index 0000000000..938509d334
--- /dev/null
+++ b/modules/file-system/netapp-storage-pool/versions.tf
@@ -0,0 +1,39 @@
+/**
+ * Copyright 2025 Google LLC
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+*/
+
+terraform {
+  required_providers {
+    google = {
+      source  = "hashicorp/google"
+      version = ">= 6.45.0"
+    }
+    random = {
+      source  = "hashicorp/random"
+      version = "~> 3.0"
+    }
+  }
+
+  provider_meta "google" {
+    module_name = "blueprints/terraform/hpc-toolkit:netapp-storage-pool/v1.70.0"
+  }
+  provider_meta "google-beta" {
+    module_name = "blueprints/terraform/hpc-toolkit:netapp-storage-pool/v1.70.0"
+  }
+
+  # Require Terraform version 1.9.0 or higher, since that added support
+  # for variable conditions. See https://support.hashicorp.com/hc/en-us/articles/43291233547027-Error-Invalid-reference-in-variable-validation-in-Terraform-versions-prior-to-1-9
+  required_version = ">= 1.9.0"
+}
diff --git a/modules/file-system/netapp-volume/README.md b/modules/file-system/netapp-volume/README.md
new file mode 100644
index 0000000000..478fd74ab2
--- /dev/null
+++ b/modules/file-system/netapp-volume/README.md
@@ -0,0 +1,201 @@
+## Description
+
+This module creates a [Google Cloud NetApp Volumes](https://cloud.google.com/netapp/volumes/docs/discover/overview)
+volume.
+
+NetApp Volumes is a first-party Google service that provides NFS and/or SMB shared file-systems to VMs. It offers advanced data management capabilities and highly scalable capacity and performance.
+NetApp Volumes provides:
+
+- robust support for NFSv3, NFSv4.x and SMB 2.1 and 3.x
+- a [rich feature set][service-levels]
+- scalable [performance](https://cloud.google.com/netapp/volumes/docs/performance/performance-benchmarks)
+- FlexCache: caching of ONTAP-based volumes, giving compute clusters high-throughput, low-latency read access to on-premises data
+- [Auto-tiering](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/manage-auto-tiering) of unused data to optimize cost
+
+Support for NetApp Volumes is split into two modules (a sketch of the pool side follows this list).
+
+- **netapp-storage-pool** provisions a [storage pool](https://cloud.google.com/netapp/volumes/docs/configure-and-use/storage-pools/overview). Storage pools are pre-provisioned storage capacity containers which host volumes. A pool also defines fundamental properties of all the volumes within, like the region, the attached network, the [service level][service-levels], CMEK encryption, Active Directory and LDAP settings.
+- **netapp-volume** provisions a [volume](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/overview) inside an existing storage pool. A volume is a file-system container which is shared using NFS or SMB. It provides advanced data management capabilities.
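+
+A hedged sketch of provisioning the pool side in a blueprint (the module IDs, pool name, and capacity are illustrative, and the `network` module is assumed to provide private service access); volume-side examples appear further below:
+
+```yaml
+  - id: netapp_pool
+    source: modules/file-system/netapp-storage-pool
+    use: [network]
+    settings:
+      pool_name: "eda-pool"
+      service_level: "PREMIUM" # STANDARD, PREMIUM, or EXTREME
+      capacity_gib: 4096       # capacity consumed by the volumes created in this pool
+```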
+
+For more information on this and other network storage options in the Cluster
+Toolkit, see the extended [Network Storage documentation](../../../docs/network_storage.md).
+
+## Deletion protection
+The netapp-volume module currently doesn't implement volume deletion protection. If you create a volume with Cluster Toolkit by using this module, Cluster Toolkit will also delete it when you run `gcluster destroy`, and all data in the volume will be lost. If you want to retain the volume instead, it is advised to [use existing volumes not created by Cluster Toolkit](#using-existing-volumes-not-created-by-cluster-toolkit).
+
+## Volumes overview
+Volumes are filesystem containers which can be shared using NFS or SMB filesharing protocols. Volumes *live* inside [storage pools](https://cloud.google.com/netapp/volumes/docs/configure-and-use/storage-pools/overview), which can be provisioned using the [netapp-storage-pool] module. Volumes inherit fundamental settings from the pool. They *consume* capacity provided by the pool. You can create one or multiple volumes *inside* a pool.
+
+[netapp-storage-pool]: ../netapp-storage-pool/README.md
+[service-levels]: https://cloud.google.com/netapp/volumes/docs/discover/service-levels
+[auto-tiering]: https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/manage-auto-tiering
+[pre-existing]: ../pre-existing-network-storage/README.md
+[matrix]: ../../../docs/network_storage.md#compatibility-matrix
+
+## Volume examples
+The following examples show the use of netapp-volume. They build on top of a storage pool which can be provisioned using the [netapp-storage-pool][netapp-storage-pool] module.
+
+### Example with minimal parameters
+
+```yaml
+  - id: home_volume
+    source: modules/file-system/netapp-volume
+    use: [netapp_pool] # Create this pool using the netapp-storage-pool module
+    settings:
+      volume_name: "eda-home"
+      capacity_gib: 1024       # Size up to available capacity in the pool
+      local_mount: "/eda-home" # Mount point at client when client uses USE directive
+      protocols: ["NFSV3"]
+      region: $(vars.region)
+      # Default export policy exports to "10.0.0.0/8,172.16.0.0/12,192.168.0.0/16" and no_root_squash
+```
+
+### Example with all parameters
+
+```yaml
+  - id: shared_volume
+    source: modules/file-system/netapp-volume
+    use: [netapp_pool] # Create this pool using the netapp-storage-pool module
+    settings:
+      volume_name: "eda-shared"
+      capacity_gib: 25000 # Size up to available capacity in the pool
+      large_capacity: true
+      local_mount: "/shared" # Mount point at client when client uses USE directive
+      mount_options: "rw"    # Allows customizing mount options for special workloads
+      protocols: ["NFSV3","NFSV4"] # List of protocols: ["NFSV3"], ["NFSV4"] or ["NFSV3", "NFSV4"]
+      region: $(vars.region)
+      unix_permissions: "0777" # Specify default permissions for the root inode, owned by root:root
+      # If no export policy is specified, a permissive default policy will be applied, which is:
+      #   allowed_clients = "10.0.0.0/8,172.16.0.0/12,192.168.0.0/16" # RFC1918
+      #   has_root_access = true # no_root_squash enabled
+      #   access_type = "READ_WRITE"
+      export_policy_rules:
+      - allowed_clients: "10.10.20.8,10.10.20.9"
+        has_root_access: true # no_root_squash enabled
+        access_type: "READ_WRITE"
+        nfsv3: false # allow only NFSv4 for these hosts
+        nfsv4: true
+      - allowed_clients: "10.0.0.0/8"
+        has_root_access: false # no_root_squash disabled
+        access_type: "READ_WRITE"
+        nfsv3: true # allow only NFSv3 for these hosts
+        nfsv4: false
+      tiering_policy: # Enable auto-tiering. Requires auto-tiering enabled storage pool
+        tier_action: "ENABLED"
+        cooling_threshold_days: 31 # tier data blocks which have not been touched for 31 days
+
+      description: "Shared volume for EDA job"
+      labels:
+        owner: bob
+```
+
+## Protocol support
+Since Cluster Toolkit is currently built to provision Linux-based compute clusters, this module supports NFSv3 and NFSv4.1 only. SMB is not supported.
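+
+For reference, a mount performed through the USE directive is equivalent to a manual NFS mount like the following sketch (the server IP `10.0.0.4`, export path `/eda-shared`, and mount point are placeholders):
+
+```shell
+sudo mkdir -p /shared
+# NFSv3, using this module's default mount options
+sudo mount -t nfs -o rw,hard,rsize=65536,wsize=65536,tcp 10.0.0.4:/eda-shared /shared
+# NFSv4.1, for volumes created with protocols: ["NFSV4"]
+sudo mount -t nfs -o vers=4.1,rw,hard 10.0.0.4:/eda-shared /shared
+```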
+
+## Large volumes
+Volumes larger than 15 TiB can be created as [Large Volumes](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/overview#large-capacity-volumes). Such volumes can grow up to 3 PiB and can scale read performance up to 29 GiBps. A large volume is served over six IP addresses, which are exported via the `server_ips` output. When connecting a large volume to a client using the USE directive, Cluster Toolkit currently uses only the first IP. This will be improved in the future.
+
+This feature is generally available, but requires allow-listing. To request allow-listing, see [Large Volumes](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/overview#large-capacity-volumes).
+
+## Auto-tiering support
+For auto-tiering-enabled storage pools, you can enable auto-tiering on the volume. For more information, see [manage auto-tiering](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/manage-auto-tiering).
+
+## Using existing volumes not created by Cluster Toolkit
+NetApp Volumes volumes are regular NFS exports. You can use the [pre-existing-network-storage] module to integrate them into Cluster Toolkit.
+
+Example code:
+
+```yaml
+- id: homefs
+  source: modules/file-system/pre-existing-network-storage
+  settings:
+    server_ip: ## Set server IP here ##
+    remote_mount: nfsshare
+    local_mount: /home
+    fs_type: nfs
+```
+
+This creates a resource in Cluster Toolkit which references the specified NFS export, which will be mounted at `/home` by clients which USE it.
+
+Note that the `server_ip` must be known before deployment and this module does not allow
+specifying a list of IPs for large volumes.
+
+[pre-existing-network-storage]: ../pre-existing-network-storage/README.md
+
+## FlexCache support
+NetApp FlexCache technology accelerates data access, reduces WAN latency and lowers WAN bandwidth costs for read-intensive workloads, especially where clients need to access the same data repeatedly. When you create a FlexCache volume, you create a remote cache of an already existing (origin) volume that contains only the actively accessed data (hot data) of the origin volume.
+
+The FlexCache support in Google Cloud NetApp Volumes allows you to provision a cache volume in your Google network to improve performance for hybrid cloud environments. A FlexCache volume can help you transition workloads to the hybrid cloud by caching data from an on-premises data center to the cloud.
+
+Deploying FlexCache volumes requires manual steps on the ONTAP origin side, which are not automated. Therefore, this module currently does not support deploying FlexCache volumes. Deploy them manually and use the [pre-existing-network-storage](#using-existing-volumes-not-created-by-cluster-toolkit) module instead.
+
+## License
+
+Copyright 2025 Google LLC
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+
+## Requirements
+
+| Name | Version |
+|------|---------|
+| [terraform](#requirement\_terraform) | >= 1.9.0 |
+| [google](#requirement\_google) | >= 6.45.0 |
+
+## Providers
+
+| Name | Version |
+|------|---------|
+| [google](#provider\_google) | >= 6.45.0 |
+
+## Modules
+
+No modules.
+
+## Resources
+
+| Name | Type |
+|------|------|
+| [google_netapp_volume.netapp_volume](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/netapp_volume) | resource |
+
+## Inputs
+
+| Name | Description | Type | Default | Required |
+|------|-------------|------|---------|:--------:|
+| [capacity\_gib](#input\_capacity\_gib) | The capacity of the volume in GiB. | `number` | `1024` | no |
+| [description](#input\_description) | A description of the NetApp volume. | `string` | `""` | no |
+| [export\_policy\_rules](#input\_export\_policy\_rules) | Define NFS export policy. | <pre>list(object({<br>    allowed_clients = optional(string)<br>    has_root_access = optional(bool, false)<br>    access_type     = optional(string, "READ_WRITE")<br>    nfsv3           = optional(bool)<br>    nfsv4           = optional(bool)<br>  }))</pre> | <pre>[<br>  {<br>    "access_type": "READ_WRITE",<br>    "allowed_clients": "10.0.0.0/8,172.16.0.0/12,192.168.0.0/16",<br>    "has_root_access": true<br>  }<br>]</pre> | no |
+| [labels](#input\_labels) | Labels to add to the NetApp volume. Key-value pairs. | `map(string)` | n/a | yes |
+| [large\_capacity](#input\_large\_capacity) | If true, the volume will be created with large capacity.<br>Large capacity volumes have 6 IP addresses and a minimal size of 15 TiB. | `bool` | `false` | no |
+| [local\_mount](#input\_local\_mount) | Mountpoint for this volume. Note: If set to the same as the `name`, it will trigger a known Slurm bug ([troubleshooting](../../../docs/slurm-troubleshooting.md)). | `string` | `"/shared"` | no |
+| [mount\_options](#input\_mount\_options) | NFS mount options to mount file system. | `string` | `"rw,hard,rsize=65536,wsize=65536,tcp"` | no |
+| [netapp\_storage\_pool\_id](#input\_netapp\_storage\_pool\_id) | The ID of the NetApp storage pool to use for the volume, in the format: `projects/<project_id>/locations/<location>/storagePools/<pool_name>`. | `string` | n/a | yes |
+| [project\_id](#input\_project\_id) | ID of project in which the NetApp volume will be created. | `string` | n/a | yes |
+| [protocols](#input\_protocols) | The protocols that the volume supports. Currently, only NFSv3 and NFSv4 are supported. | `list(string)` | <pre>[<br>  "NFSV3"<br>]</pre> | no |
+| [region](#input\_region) | Location for the NetApp volume. | `string` | n/a | yes |
+| [tiering\_policy](#input\_tiering\_policy) | Define the tiering policy for the NetApp volume. | <pre>object({<br>    tier_action            = optional(string)<br>    cooling_threshold_days = optional(number)<br>  })</pre> | `null` | no |
+| [unix\_permissions](#input\_unix\_permissions) | UNIX permissions for the root inode of the volume. | `string` | `"0777"` | no |
+| [volume\_name](#input\_volume\_name) | The name of the volume. Leave empty to use a generated name based on the deployment name. | `string` | `null` | no |
+
+## Outputs
+
+| Name | Description |
+|------|-------------|
+| [capacity\_gb](#output\_capacity\_gb) | Volume capacity in GiB. |
+| [install\_nfs\_client](#output\_install\_nfs\_client) | Script for installing NFS client |
+| [install\_nfs\_client\_runner](#output\_install\_nfs\_client\_runner) | Runner to install NFS client using the startup-script module |
+| [mount\_runner](#output\_mount\_runner) | Runner to mount the file-system using the startup-script module.<br>- id: example-startup-script<br>  source: modules/scripts/startup-script<br>  settings:<br>    runners:<br>    - $(your-fs-id.mount\_runner)<br>... |
+| [netapp\_volume\_id](#output\_netapp\_volume\_id) | An identifier for the resource with format `projects/{{project}}/locations/{{location}}/volumes/{{name}}` |
+| [network\_storage](#output\_network\_storage) | Describes a NetApp Volumes volume. |
+| [server\_ips](#output\_server\_ips) | List of IP addresses of the volume. |
+
diff --git a/modules/file-system/netapp-volume/main.tf b/modules/file-system/netapp-volume/main.tf
new file mode 100644
index 0000000000..d8345bf347
--- /dev/null
+++ b/modules/file-system/netapp-volume/main.tf
@@ -0,0 +1,92 @@
+/**
+ * Copyright 2025 Google LLC
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+locals {
+  # This label allows for billing report tracking based on module.
+  labels = merge(var.labels, { ghpc_module = "netapp-volume", ghpc_role = "file-system" })
+}
+
+# resource "random_id" "resource_name_suffix" {
+#   byte_length = 4
+# }
+
+locals {
+  full_path    = split(":", google_netapp_volume.netapp_volume.mount_options[0].export_full)
+  server_ip    = local.full_path[0]
+  remote_mount = local.full_path[1]
+  # Large volumes will have 6 IPs
+  server_ips    = [for ip in google_netapp_volume.netapp_volume.mount_options[*].export_full : split(":", ip)[0]]
+  fs_type       = "nfs"
+  mount_options = var.mount_options
+
+  install_nfs_client_runner = {
+    "type"        = "shell"
+    "source"      = "${path.module}/scripts/install-nfs-client.sh"
+    "destination" = "install-nfs${replace(var.local_mount, "/", "_")}.sh"
+  }
+  mount_runner = {
+    "type"        = "shell"
+    "source"      = "${path.module}/scripts/mount.sh"
+    "args"        = "\"${join(",", local.server_ips)}\" \"${local.remote_mount}\" \"${var.local_mount}\" \"${local.fs_type}\" \"${local.mount_options}\""
+    "destination" = "mount${replace(var.local_mount, "/", "_")}.sh"
+  }
+
+  split_pool_id = split("/", var.netapp_storage_pool_id)
+  pool_name     = local.split_pool_id[5]
+}
+
+resource "google_netapp_volume" "netapp_volume" {
+  project = var.project_id
+
+  name               = var.volume_name
+  share_name         = var.volume_name
+  location           = var.region
+  protocols          = var.protocols
+  capacity_gib       = var.capacity_gib
+  large_capacity     = var.large_capacity
+  multiple_endpoints = var.large_capacity == true ? true : null
+  storage_pool       = local.pool_name
+  unix_permissions   = var.unix_permissions
+
+  dynamic "tiering_policy" {
+    for_each = var.tiering_policy == null ? [] : [0]
+    content {
+      cooling_threshold_days = lookup(var.tiering_policy, "cooling_threshold_days", null)
+      tier_action            = lookup(var.tiering_policy, "tier_action", null)
+    }
+  }
+
+  description = var.description
+  labels      = local.labels
+
+  dynamic "export_policy" {
+    for_each = var.export_policy_rules == null ? [] : [0]
+    content {
+      dynamic "rules" {
+        for_each = var.export_policy_rules
+        content {
+          access_type     = rules.value.access_type
+          allowed_clients = rules.value.allowed_clients
+          has_root_access = rules.value.has_root_access
+          nfsv3           = rules.value.nfsv3 == null ? contains([for p in var.protocols : lower(p)], "nfsv3") : rules.value.nfsv3
+          nfsv4           = rules.value.nfsv4 == null ? contains([for p in var.protocols : lower(p)], "nfsv4") : rules.value.nfsv4
+        }
+      }
+    }
+  }
+
+  depends_on = [var.netapp_storage_pool_id]
+}
diff --git a/modules/file-system/netapp-volume/metadata.yaml b/modules/file-system/netapp-volume/metadata.yaml
new file mode 100644
index 0000000000..e4a7aaaa14
--- /dev/null
+++ b/modules/file-system/netapp-volume/metadata.yaml
@@ -0,0 +1,19 @@
+# Copyright 2025 "Google LLC"
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+---
+
+spec:
+  requirements:
+    services:
+    - netapp.googleapis.com
diff --git a/modules/file-system/netapp-volume/outputs.tf b/modules/file-system/netapp-volume/outputs.tf
new file mode 100644
index 0000000000..641eae007a
--- /dev/null
+++ b/modules/file-system/netapp-volume/outputs.tf
@@ -0,0 +1,66 @@
+# Copyright 2025 "Google LLC"
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+output "network_storage" {
+  description = "Describes a NetApp Volumes volume."
+  value = {
+    server_ip             = local.server_ip
+    remote_mount          = local.remote_mount
+    local_mount           = var.local_mount
+    fs_type               = local.fs_type
+    mount_options         = local.mount_options
+    client_install_runner = local.install_nfs_client_runner
+    mount_runner          = local.mount_runner
+  }
+}
+
+output "install_nfs_client" {
+  description = "Script for installing NFS client"
+  value       = file("${path.module}/scripts/install-nfs-client.sh")
+}
+
+output "install_nfs_client_runner" {
+  description = "Runner to install NFS client using the startup-script module"
+  value       = local.install_nfs_client_runner
+}
+
+output "mount_runner" {
+  description = <<-EOT
+    Runner to mount the file-system using the startup-script module.
+    - id: example-startup-script
+      source: modules/scripts/startup-script
+      settings:
+        runners:
+        - $(your-fs-id.mount_runner)
+    ...
+  EOT
+  value       = local.mount_runner
+}
+
+output "netapp_volume_id" {
+  description = "An identifier for the resource with format `projects/{{project}}/locations/{{location}}/volumes/{{name}}`"
+  value       = google_netapp_volume.netapp_volume.id
+}
+
+output "capacity_gb" {
+  description = "Volume capacity in GiB."
+  value       = google_netapp_volume.netapp_volume.capacity_gib
+}
+
+output "server_ips" {
+  description = "List of IP addresses of the volume."
+  value       = local.server_ips
+}
diff --git a/modules/file-system/netapp-volume/scripts/install-nfs-client.sh b/modules/file-system/netapp-volume/scripts/install-nfs-client.sh
new file mode 100644
index 0000000000..008c8fd0a1
--- /dev/null
+++ b/modules/file-system/netapp-volume/scripts/install-nfs-client.sh
@@ -0,0 +1,37 @@
+#!/bin/sh
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+if [ ! "$(which mount.nfs)" ]; then
+	if [ -f /etc/centos-release ] || [ -f /etc/redhat-release ] ||
+		[ -f /etc/oracle-release ] || [ -f /etc/system-release ]; then
+		major_version=$(rpm -E "%{rhel}")
+		enable_repo=""
+		if [ "${major_version}" -eq "7" ]; then
+			enable_repo="base,epel"
+		elif [ "${major_version}" -eq "8" ] || [ "${major_version}" -eq "9" ]; then
+			enable_repo="baseos"
+		else
+			echo "Unsupported version of centos/RHEL/Rocky"
+			exit 1
+		fi
+		yum install --disablerepo="*" --enablerepo=${enable_repo} -y nfs-utils
+	elif [ -f /etc/debian_version ] || grep -qi ubuntu /etc/lsb-release || grep -qi ubuntu /etc/os-release; then
+		apt-get update --allow-releaseinfo-change-origin --allow-releaseinfo-change-label
+		apt-get -y install nfs-common
+	else
+		echo 'Unsupported distribution'
+		exit 1
+	fi
+fi
diff --git a/modules/file-system/netapp-volume/scripts/mount.sh b/modules/file-system/netapp-volume/scripts/mount.sh
new file mode 100644
index 0000000000..8253d40a24
--- /dev/null
+++ b/modules/file-system/netapp-volume/scripts/mount.sh
@@ -0,0 +1,66 @@
+#!/bin/bash
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+set -e
+SERVER_IPS=$1
+REMOTE_MOUNT=$2
+LOCAL_MOUNT=$3
+FS_TYPE=$4
+MOUNT_OPTIONS=$5
+
+# Accept a list of comma-separated IPs and randomly pick one to enable load balancing.
+# In recent changes Cluster Toolkit doesn't seem to use this file anymore,
+# which makes all mounts use the first IP in the list. This needs to be investigated in the future.
+IFS="," read -r -a arrIPS <<<"${SERVER_IPS}"
+rand1=$(od -vAn -t d -N1 </dev/urandom | tr -d ' ')
+index=$((rand1 % ${#arrIPS[@]}))
+SERVER_IP="${arrIPS[$index]}"
+FS_SPEC="${SERVER_IP}:${REMOTE_MOUNT}"
+
+[ -z "${MOUNT_OPTIONS}" ] && POPULATED_MOUNT_OPTIONS="defaults" || POPULATED_MOUNT_OPTIONS="${MOUNT_OPTIONS}"
+
+# Check whether the exact entry is already in fstab, and whether the
+# local mount point is already used by any fstab entry
+grep -q "${FS_SPEC} ${LOCAL_MOUNT} ${FS_TYPE}" /etc/fstab && EXACT_IN_FSTAB=true || EXACT_IN_FSTAB=false
+grep -q " ${LOCAL_MOUNT} " /etc/fstab && SAME_LOCAL_IN_FSTAB=true || SAME_LOCAL_IN_FSTAB=false
+
+# Check whether the exact filesystem is already mounted
+findmnt --source "${FS_SPEC}" --target "${LOCAL_MOUNT}" >/dev/null && EXACT_MOUNTED=true || EXACT_MOUNTED=false
+
+# Do nothing and succeed if the exact entry is already in fstab and mounted
+if [ "$EXACT_IN_FSTAB" = true ] && [ "${EXACT_MOUNTED}" = true ]; then
+	echo "Skipping mounting source: ${FS_SPEC}, already mounted to target:${LOCAL_MOUNT}"
+	exit 0
+fi
+
+# Fail if a previous fstab entry is using the same local mount
+if [ "$SAME_LOCAL_IN_FSTAB" = true ] && [ "${EXACT_IN_FSTAB}" = false ]; then
+	echo "Mounting failed as local mount: ${LOCAL_MOUNT} was already in use in fstab"
+	exit 1
+fi
+
+# Add to fstab if entry is not already there
+if [ "${EXACT_IN_FSTAB}" = false ]; then
+	echo "Adding ${FS_SPEC} -> ${LOCAL_MOUNT} to /etc/fstab"
+	echo "${FS_SPEC} ${LOCAL_MOUNT} ${FS_TYPE} ${POPULATED_MOUNT_OPTIONS} 0 0" >>/etc/fstab
+fi
+
+# Mount from fstab
+echo "Mounting --target ${LOCAL_MOUNT} from fstab"
+mkdir -p "${LOCAL_MOUNT}"
+mount --target "${LOCAL_MOUNT}"
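+
+# Illustrative manual invocation of this script (all values are placeholders);
+# the mount_runner in main.tf passes its arguments in exactly this order:
+#   ./mount.sh "10.0.0.4,10.0.0.5" "/eda-shared" "/shared" "nfs" "rw,hard,rsize=65536,wsize=65536,tcp"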
diff --git a/modules/file-system/netapp-volume/variables.tf b/modules/file-system/netapp-volume/variables.tf
new file mode 100644
index 0000000000..4ca688d011
--- /dev/null
+++ b/modules/file-system/netapp-volume/variables.tf
@@ -0,0 +1,145 @@
+/**
+ * Copyright 2025 Google LLC
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+variable "project_id" {
+  description = "ID of project in which the NetApp volume will be created."
+  type        = string
+}
+
+variable "netapp_storage_pool_id" {
+  description = "The ID of the NetApp storage pool to use for the volume, in the format: `projects/<project_id>/locations/<location>/storagePools/<pool_name>`."
+  type        = string
+  validation {
+    condition     = length(split("/", var.netapp_storage_pool_id)) == 6
+    error_message = "The storage pool id must be provided in the following format: projects/<project_id>/locations/<location>/storagePools/<pool_name>."
+  }
+}
+
+variable "region" {
+  description = "Location for the NetApp volume."
+  type        = string
+}
+
+variable "volume_name" {
+  description = "The name of the volume. Leave empty to use a generated name based on the deployment name."
+  type        = string
+  default     = null
+}
+
+variable "capacity_gib" {
+  description = "The capacity of the volume in GiB."
+  type        = number
+  default     = 1024
+  validation {
+    condition     = var.capacity_gib >= 100
+    error_message = "The minimum capacity for the volume is 100 GiB."
+  }
+}
+
+variable "protocols" {
+  description = "The protocols that the volume supports. Currently, only NFSv3 and NFSv4 are supported."
+  type        = list(string)
+  default     = ["NFSV3"]
+  validation {
+    condition     = alltrue([for p in var.protocols : contains(["NFSV3", "NFSV4"], p)])
+    error_message = "Allowed values for protocols are 'NFSV3' or 'NFSV4'."
+  }
+}
+
+variable "description" {
+  description = "A description of the NetApp volume."
+  type        = string
+  default     = ""
+  validation {
+    condition     = length(var.description) <= 2048
+    error_message = "NetApp volume description must be 2048 characters or fewer."
+  }
+}
+
+variable "labels" {
+  description = "Labels to add to the NetApp volume. Key-value pairs."
+  type        = map(string)
+}
+
+variable "local_mount" {
+  description = "Mountpoint for this volume. Note: If set to the same as the `name`, it will trigger a known Slurm bug ([troubleshooting](../../../docs/slurm-troubleshooting.md))."
+  type        = string
+  default     = "/shared"
+}
+
+variable "mount_options" {
+  description = "NFS mount options to mount file system."
+  type        = string
+  default     = "rw,hard,rsize=65536,wsize=65536,tcp"
+}
+
+variable "large_capacity" {
+  description = <<-EOT
+    If true, the volume will be created with large capacity.
+    Large capacity volumes have 6 IP addresses and a minimal size of 15 TiB.
+  EOT
+  type        = bool
+  default     = false
+  validation {
+    condition     = var.large_capacity == false ? true : var.capacity_gib >= 15360
+    error_message = "The minimum capacity for a large volume is 15360 GiB."
+  }
+}
+
+variable "unix_permissions" {
+  description = "UNIX permissions for the root inode of the volume."
+  type        = string
+  default     = "0777"
+  validation {
+    condition     = length(var.unix_permissions) <= 4
+    error_message = "UNIX permissions must be a 4-digit octal number."
+  }
+}
+
+variable "tiering_policy" {
+  description = "Define the tiering policy for the NetApp volume."
+  type = object({
+    tier_action            = optional(string)
+    cooling_threshold_days = optional(number)
+  })
+  default = null
+  validation {
+    condition     = var.tiering_policy == null ? true : contains(["ENABLED", "PAUSED"], var.tiering_policy.tier_action)
+    error_message = "Allowed values for tier_action are 'ENABLED' or 'PAUSED'."
+  }
+}
+
+variable "export_policy_rules" {
+  description = "Define NFS export policy."
+  type = list(object({
+    allowed_clients = optional(string)
+    has_root_access = optional(bool, false)
+    access_type     = optional(string, "READ_WRITE")
+    nfsv3           = optional(bool)
+    nfsv4           = optional(bool)
+  }))
+  # Permissive default if the user does not specify export_policy_rules. Allow all RFC1918 CIDRs with no_root_squash
+  default = [{
+    allowed_clients = "10.0.0.0/8,172.16.0.0/12,192.168.0.0/16",
+    has_root_access = true,
+    access_type     = "READ_WRITE",
+  }]
+  nullable = true
+  validation {
+    condition     = var.export_policy_rules == null ? true : alltrue([for p in var.export_policy_rules : contains(["READ_ONLY", "READ_WRITE", "NONE"], p.access_type)])
+    error_message = "Allowed values for access_type are 'READ_ONLY', 'READ_WRITE', or 'NONE'."
+  }
+}
diff --git a/modules/file-system/netapp-volume/versions.tf b/modules/file-system/netapp-volume/versions.tf
new file mode 100644
index 0000000000..fd7520e65d
--- /dev/null
+++ b/modules/file-system/netapp-volume/versions.tf
@@ -0,0 +1,34 @@
+/**
+ * Copyright 2025 Google LLC
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+*/ + +terraform { + required_providers { + google = { + source = "hashicorp/google" + version = ">= 6.45.0" + } + } + provider_meta "google" { + module_name = "blueprints/terraform/hpc-toolkit:netapp-volume/v1.70.0" + } + provider_meta "google-beta" { + module_name = "blueprints/terraform/hpc-toolkit:netapp-volume/v1.70.0" + } + + # Require Terraform version 1.9.0 or higher, since that added support + # for variable conditions. See https://support.hashicorp.com/hc/en-us/articles/43291233547027-Error-Invalid-reference-in-variable-validation-in-Terraform-versions-prior-to-1-9 + required_version = ">= 1.9.0" +} diff --git a/tools/cloud-build/daily-tests/builds/netapp-volumes.yaml b/tools/cloud-build/daily-tests/builds/netapp-volumes.yaml new file mode 100644 index 0000000000..77fdd28d51 --- /dev/null +++ b/tools/cloud-build/daily-tests/builds/netapp-volumes.yaml @@ -0,0 +1,43 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +--- +tags: +- m.vpc +- m.private-service-access +- m.netapp-storage-pool +- m.netapp-volume +- m.vm-instance +- m.wait-for-startup +- vm + +timeout: 14400s # 4hr +steps: +- id: ansible-vm + name: us-central1-docker.pkg.dev/$PROJECT_ID/hpc-toolkit-repo/test-runner + entrypoint: /bin/bash + env: + - "ANSIBLE_HOST_KEY_CHECKING=false" + - "ANSIBLE_CONFIG=/workspace/tools/cloud-build/ansible.cfg" + args: + - -c + - | + set -x -e + cd /workspace && make + BUILD_ID_FULL=$BUILD_ID + BUILD_ID_SHORT=$${BUILD_ID_FULL:0:6} + + ansible-playbook tools/cloud-build/daily-tests/ansible_playbooks/base-integration-test.yml \ + --user=sa_106486320838376751393 --extra-vars="project=${PROJECT_ID} build=$${BUILD_ID_SHORT}" \ + --extra-vars="@tools/cloud-build/daily-tests/tests/netapp-volumes.yml" diff --git a/tools/cloud-build/daily-tests/tests/netapp-volumes.yml b/tools/cloud-build/daily-tests/tests/netapp-volumes.yml new file mode 100644 index 0000000000..575d337920 --- /dev/null +++ b/tools/cloud-build/daily-tests/tests/netapp-volumes.yml @@ -0,0 +1,28 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +--- + +test_name: "netapp-volumes" +deployment_name: "netapp-volumes-{{ build }}" +workspace: /workspace +blueprint_yaml: "{{ workspace }}/examples/netapp-volumes.yaml" +region: us-central1 +zone: us-central1-a +network: "{{ test_name }}-net" +remote_node: "{{ deployment_name }}-0" +post_deploy_tests: +- test-validation/test-mounts.yml +custom_vars: + mounts: + - /home