@@ -7,30 +7,26 @@ a3-highgpu-8g compute nodes running NVIDIA H100 GPUs.

> [!IMPORTANT]
> Before beginning, submit a request to your Google Cloud representative for
- > access to the Deep Learning VM Image for a3-highgpu-8g. It is currently
- > available only by Private Preview request. This image contains patches that
- > significantly enhance the network performance of workloads that span multiple
- > a3-highgpu-8g VMs. You will use the image ID in the steps shown below.
+ > access credentials to install the linux-gcp-tcpx kernel for a3-highgpu-8g.
+ > This kernel contains patches that significantly enhance the network
+ > performance of workloads that span multiple a3-highgpu-8g VMs.

- ## Upgrading from the v5 "legacy" solution to v6
- There is no direct path for upgrading the Slurm-GCP v5 solution in-place to v6.
- The recommended path requires temporarily bringing down your v5 cluster and
- replacing it with the v6 solution described in this document.
+ ## Upgrading from the "legacy" solution
+ There is no direct path for upgrading the a3-highgpu-8g legacy solution.
+ The recommended path requires temporarily bringing down your cluster and
+ replacing it with the solution described in this document.

- > [!NOTE]
- > The `ml-slurm-a3-0-base.yaml` blueprint is identical for the "legacy" v5 and
- > v6 solutions. If you are upgrading from v5 to v6, do not destroy the v5 base
- > blueprint or re-deploy the v6 base blueprint. Simply copy the Filestore IP
- > address as instructed below.
+ We recommend using `gcluster destroy` to destroy the deployments provisioned by the legacy blueprints (see the sketch at the end of this section):

- We recommend using `gcluster destroy` to destroy the deployments provisioned by the
- v5 legacy blueprints:
+ - ![deprecated-badge] [Legacy v5 image building blueprint](v5-legacy/ml-slurm-a3-1-image-v5-legacy.yaml)
+ - ![deprecated-badge] [Legacy v5 cluster provisioning blueprint](v5-legacy/ml-slurm-a3-2-cluster-v5-legacy.yaml)
+ - ![deprecated-badge] [Legacy base provisioning blueprint](/examples/machine-learning/a3-highgpu-8g/ml-slurm-a3-0-base.yaml)
+ - ![deprecated-badge] [Legacy image provisioning blueprint](/examples/machine-learning/a3-highgpu-8g/ml-slurm-a3-1-image.yaml)
+ - ![deprecated-badge] [Legacy cluster provisioning blueprint](/examples/machine-learning/a3-highgpu-8g/ml-slurm-a3-2-cluster.yaml)

- - [Legacy v5 image building blueprint](v5-legacy/ml-slurm-a3-1-image-v5-legacy.yaml)
- - [Legacy v5 cluster provisioning blueprint](v5-legacy/ml-slurm-a3-2-cluster-v5-legacy.yaml)
+ Then follow the instructions below.

- Then follow the instructions below while skipping the re-deployment of the base
- blueprint.

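For reference, a minimal teardown sketch, assuming the legacy deployments were created with `gcluster deploy` and that the folder names below are placeholders for the deployment folders you actually used:

```shell
# Tear down legacy deployments in reverse order of creation:
# cluster first, then image, then base. Folder names are placeholders.
./gcluster destroy legacy-a3-cluster --auto-approve
./gcluster destroy legacy-a3-image --auto-approve
./gcluster destroy legacy-a3-base --auto-approve
```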
## Required setup

@@ -47,41 +43,21 @@ gcluster --version

## Top-Level Design of Solution

- The solution is split into 3 Cluster Toolkit blueprints:
+ The blueprint is split into 3 deployment groups:

- 1. Provision 1 system network and 1 Filestore instance for mounting `/home`
+ 1. Group 1 provisions the system network, the GPU network, and a Filestore instance for mounting `/home`
across the cluster.
- 2. Build a custom image installing Slurm in an Ubuntu 20.04 image. The image
+ 2. Group 2 builds a custom image with Slurm installed on Ubuntu 22.04. The image
runs a kernel patched with performance enhancements for the a3-highgpu-8g VM.
- 3. Provision 4 GPU networks and a Slurm cluster using the custom image.
+ 3. Group 3 provisions the Slurm cluster and a3-highgpu-8g nodes using the custom image.

- The 1st and 2nd blueprints should be provisioned once and rarely need further
- modification. This approach separates the lifecycle of a Filestore instance from
- the lifecycle of the cluster, allowing the cluster to be deleted while retaining
- access to data and home directories. The 3rd cluster blueprint may be more
- frequently updated and re-provisioned as discussed below.

## First time considerations

> [!IMPORTANT]
> These steps do not need to be repeated when a cluster is re-provisioned. They
> are initial setup steps in a project.

- Replace the values for `PROJECT_ID`, `REGION`, and `ZONE` with the project,
- region, and zone in which you have an a3-highgpu-8g allocation. The value for
- `BUCKET` must be unique and will be used to create a new bucket. After replacing
- the values, execute them so that they automatically populate parameters in the
- commands shown below. Note that each a3-highgpu-8g VM (`N_VMS`) contains 8 NVIDIA
- H100 GPUs.
-
- ```shell
- export PROJECT_ID=customer-project-id
- export BUCKET=customer-bucket
- export REGION=customer-region
- export ZONE=customer-zone
- export N_VMS=32
- ```
-
### Saving Terraform state
Create a bucket with versioning enabled to store Terraform state:

@@ -92,7 +68,7 @@ gcloud storage buckets create gs://${BUCKET} --project=${PROJECT_ID} \
gcloud storage buckets update gs://${BUCKET} --versioning
```
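
To sanity-check the bucket before continuing, a command such as the following (reusing the same `${BUCKET}` and `${PROJECT_ID}` placeholders used in the commands above) prints the bucket metadata, including whether versioning is enabled:

```shell
# Inspect the bucket; versioning should be reported as enabled.
gcloud storage buckets describe gs://${BUCKET} --project=${PROJECT_ID}
```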

- Modify all 3 blueprints to configure the new bucket to serve as a Terraform
+ Modify the blueprints to configure the new bucket to serve as a Terraform
remote backend:

```yaml
@@ -102,55 +78,51 @@ terraform_backend_defaults:
  bucket: customer-bucket # modify to bucket created above
```

- ### Set default values
+ ### Deployment variables

- Modify the the deployment variables `project_id`, `region`, `zone`, in the
- `vars` block of all 3 blueprints:
+ Set the values in `a3high-slurm-deployment.yaml` for your deployment:

```yaml
+   deployment_name: unique-name
  project_id: customer-project
  region: customer-region
  zone: customer-zone
```

- ### Set kernel-patched OS image
+ **Set kernel-patched OS image**

- Obtain values for `source_image_project_id` and `source_image` from your Google
- Cloud representative. Set them at approximately lines 33 and 34 of
- `ml-slurm-a3-1-image.yaml`.
+ Obtain values for `tcpx_kernel_login`, `tcpx_kernel_password`, and `keyserver_ubuntu_key` from your Google Cloud representative. Set them in the deployment file.

```yaml
- source_image_project_id: source-image-project-id # use value supplied by Google Cloud staff
- source_image: source-image-name # use value supplied by Google Cloud staff
+ tcpx_kernel_login: # use value supplied by Google Cloud staff
+ tcpx_kernel_password: # use value supplied by Google Cloud staff
+ keyserver_ubuntu_key: # use value supplied by Google Cloud staff
```

- ### Reservation created by Google
+ **Reservation created by Google**

> [!IMPORTANT]
> If you have ***not*** received a VM reservation from Google Cloud staff, then
> skip this step and proceed to [manual reservation creation](#manual-creation-of-reservation).

- Set the deployment variable `a3_reservation_name` at approximately line 38 of
- `ml-slurm-a3-2-cluster.yaml` to the reservation name provided by Google.
+ Set the deployment variable `a3_reservation_name` to the reservation name provided by Google.

```yaml
  # a3_reservation_name must be specified; if Google staff have provided you
  # with a reservation name, use it. Otherwise supply user-created reservation.
  a3_reservation_name: reservation-name-provided-by-google
```

- ### Manual creation of reservation
+ **Manual creation of reservation**

> [!IMPORTANT]
- > If you received a VM reservation from Google Cloud staff, then skip this step
- > after confirming that you followed the instructions in [reservation created by
- > Google](#reservation-created-by-google).
+ > If you received a VM reservation from Google Cloud staff, then skip this step.

We recommend creating a reservation to ensure reliable access to re-create VMs
if you need to redeploy or otherwise maintain your cluster.

```shell
- gcloud compute reservations create a3-reservation-0 \
+ gcloud compute reservations create ${A3_RESERVATION_NAME} \
    --project=${PROJECT_ID} \
    --machine-type=a3-highgpu-8g \
    --vm-count=${N_VMS} \
@@ -160,24 +132,21 @@ gcloud compute reservations create a3-reservation-0 \
```
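
If you want to confirm the reservation was created as expected, a quick check such as the following (reusing the same placeholders used when creating the reservation) shows its machine type, VM count, and status:

```shell
# Describe the reservation created above; the zone must match the one used at creation.
gcloud compute reservations describe ${A3_RESERVATION_NAME} \
    --zone=${ZONE} \
    --project=${PROJECT_ID}
```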

This reservation must be specified when creating VMs with matching parameters
- (e.g. a3-highgpu-8g VM in configured zone). If you executed the command above
- without modification, you may leave `a3_reservation_name` at their default values in
- `ml-slurm-a3-2-cluster.yaml`. Otherwise, ensure that the reservation name in the
- blueprint matches the name of the user-created reservation.
+ (e.g. an a3-highgpu-8g VM in the configured zone). Ensure that the reservation name in the
+ deployment file matches the name of the user-created reservation.
167137
```yaml
  # a3_reservation_name must be specified; if Google staff have provided you
  # with a reservation name, use it. Otherwise supply user-created reservation.
-   a3_reservation_name: a3-reservation-0
+   a3_reservation_name:
```
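
If you are unsure which name to enter here, listing the reservations in the project is one way to find it (again treating `${PROJECT_ID}` as a placeholder):

```shell
# List reservations in the project; copy the relevant name into a3_reservation_name.
gcloud compute reservations list --project=${PROJECT_ID}
```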

- ### Using Spot VM or DWS Flex
+ **Using Spot VM or DWS Flex**

> [!IMPORTANT]
> Select one of the provisioning models: Spot VM, DWS Flex, or reservation.

- In order to make use of DWS Flex Start mode with SlurmGCP, you must use the `dws_flex` variable in the `schedmd-slurm-gcp-v6-nodeset` module.
- For setting this variable , set the `a3_dws_flex_enabled` variable as shown below
+ In order to make use of DWS Flex Start mode with SlurmGCP, set the `a3_dws_flex_enabled` variable as shown below:

```yaml
vars:
@@ -187,9 +156,7 @@ For setting this variable , set the `a3_dws_flex_enabled` variable as shown belo

To learn more about DWS Flex-Start, visit https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/docs/slurm-dws-flex.md

- Similarly ,to make use of Spot VMs,
- In order to make use of Spot VMs with Slurm, you must use the `enable_spot_vm` variable in the `schedmd-slurm-gcp-v6-nodeset` module.
- For setting this variable , set the `a3_enable_spot_vm` variable as shown below
+ Similarly, to make use of Spot VMs with Slurm, set the `a3_enable_spot_vm` variable as shown below:

```yaml
vars:
@@ -199,10 +166,10 @@ For setting this variable , set the `a3_enable_spot_vm` variable as shown below

To learn more about Spot VMs, visit: https://cloud.google.com/compute/docs/instances/spot

- ### Set cluster size
+ **Set cluster size**

- At approximately line 37 of `ml-slurm-a3-2-cluster.yaml`, set the static cluster
- size. Recall that there are 8 NVIDIA H100 GPUs per a3-highgpu-8g VM.
+ Set the static cluster
+ size using the `a3_static_cluster_size` variable. Recall that there are 8 NVIDIA H100 GPUs per a3-highgpu-8g VM.

```yaml
  a3_static_cluster_size: 32
@@ -211,66 +178,12 @@ size. Recall that there are 8 NVIDIA H100 GPUs per a3-highgpu-8g VM.
## Cluster creation

> [!NOTE]
- > The `ml-slurm-a3-0-base.yaml` blueprint is identical for the "legacy" v5 and
- > v6 solutions. If you are upgrading from v5 to v6, do not destroy the v5 base
- > blueprint or re-deploy the v6 base blueprint. Simply copy the Filestore IP
- > address as instructed below.
-
- The blueprint `ml-slurm-a3-0-base.yaml` will create 1 system network and a
- Filestore `/home` filesystem. Run the standard Toolkit workflow at the command
- line (approx. 5 minutes):
-
- ```shell
- gcluster deploy ml-slurm-a3-0-base.yaml --auto-approve
- ```
-
- Several values will be output to the screen. The output will be similar to:
-
- ```hcl
- network_name_sysnet = "sys-net"
- network_storage_homefs = {
-   "client_install_runner" = {
-     "destination" = "install-nfs_home.sh"
-     "source" = "modules/embedded/modules/file-system/filestore/scripts/install-nfs-client.sh"
-     "type" = "shell"
-   }
-   "fs_type" = "nfs"
-   "local_mount" = "/home"
-   "mount_options" = "defaults,_netdev"
-   "mount_runner" = {
-     "args" = "\"10.224.153.226\" \"/nfsshare\" \"/home\" \"nfs\" \"defaults,_netdev\""
-     "destination" = "mount_home.sh"
-     "source" = "modules/embedded/modules/file-system/filestore/scripts/mount.sh"
-     "type" = "shell"
-   }
-   "remote_mount" = "/nfsshare"
-   "server_ip" = "10.224.153.226"
- }
- subnetwork_name_sysnet = "sys-subnet"
- ```
-
- Build the custom image using ml-slurm-a3-1-image.yaml and the same workflow
- as above. Run at the command line:
-
- ```shell
- gcluster deploy ml-slurm-a3-1-image.yaml --auto-approve
- ```
-
- The image will take approximately 30 minutes to build.
-
- > [!IMPORTANT]
- > You must modify `ml-slurm-a3-2-cluster.yaml` to update the IP address of the
- > Filestore instance for `/home`. Your IP address will differ from that shown
- > below and must match the output from deploying the base blueprint above:
- >
- > ```yaml
- > server_ip_homefs: 10.224.153.226
- > ```
+ > This blueprint is not compatible with the legacy a3-highgpu-8g blueprints. We recommend bringing down the earlier cluster and redeploying it using the steps below.

- Provision the cluster blueprint (approximately 5-10 minutes):
+ Provision the cluster blueprint (approximately 40 minutes):

```shell
- gcluster deploy ml-slurm-a3-2-cluster.yaml --auto-approve
+ ./gcluster deploy -d a3high-slurm-deployment.yaml a3high-slurm-blueprint.yaml --auto-approve
```
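
After the deployment completes, a quick smoke test is to SSH to the Slurm login node and confirm that the nodes are visible to Slurm and are running the TCPX-patched kernel. The instance name filter below is an illustrative assumption; your login node name and kernel version string will differ:

```shell
# Find the login node created by the deployment (name pattern is an assumption).
gcloud compute instances list --project=${PROJECT_ID} --filter="name~login"

# SSH to it; replace LOGIN_NODE_NAME with the name printed above.
gcloud compute ssh LOGIN_NODE_NAME --zone=${ZONE} --project=${PROJECT_ID}

# Once logged in, check partitions/node states and the kernel on a compute node:
sinfo
srun -N 1 uname -r   # should report the patched (TCPX) kernel variant
```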

## Receive Data Path Manager (RxDM)
@@ -291,8 +204,7 @@ The Prolog will perform the following actions:
- Mounts `/var/lib/nvidia/lib64` into `/usr/lib/nvidia/lib64` of the container
- Mount `/opt/tcpdirect_benchmark/` from the host into the container so that a
  textproto file defining the mapping from GPU to NIC is available. This file
-   is present in the Deep Learning VM (DLVM) images that contain TCPDirect
-   patches.
+   is present in the image used in this solution.
- Mount `/run/tcpx-${SLURM_JOB_ID}` from the container into the host. This is
  set to the environment variable `${UDS_PATH}` in the script. This directory
  contains Unix socket files that implement a TCPx interface available to the
@@ -328,7 +240,7 @@ The example workload below demonstrates the pattern recommended in Activating
the Receive Data Path Manager during jobs while running the standard nccl-tests
benchmark. It assumes the availability of a GPU/NIC topology file at
`/opt/tcpdirect_benchmark/gpu_rxq_configuration.textproto`. This file is built
- into the DLVM images used by this solution, but may need to be provided if
+ into the image used by this solution, but may need to be provided if
using an alternative image.

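Before running the benchmark, you can confirm that the topology file is present on the compute nodes. This sketch assumes the a3-highgpu-8g nodes are in the default Slurm partition (add `-p <partition>` otherwise):

```shell
# Verify the GPU/NIC topology file exists on a compute node.
srun -N 1 ls -l /opt/tcpdirect_benchmark/gpu_rxq_configuration.textproto
```
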
### Clone the Cluster Toolkit repository containing the NCCL benchmark
@@ -359,3 +271,4 @@ sbatch run-nccl-tests.sh
[consume]: https://cloud.google.com/compute/docs/instances/reservations-consume#consuming_instances_from_any_matching_reservation
[tkdeps]: https://cloud.google.com/cluster-toolkit/docs/setup/install-dependencies
[tkinstall]: https://github.com/GoogleCloudPlatform/cluster-toolkit/#quickstart
+ [deprecated-badge]: https://img.shields.io/badge/-deprecated-%23fea2a2?style=plastic