
Commit 7a30b2c

a3high single blueprint to use the tcpx patched kernel
1 parent 8c7c3a9 commit 7a30b2c

File tree

7 files changed: +807 −718 lines changed

examples/machine-learning/a3-highgpu-8g/README.md

Lines changed: 48 additions & 135 deletions
@@ -7,30 +7,26 @@ a3-highgpu-8g compute nodes running NVIDIA H100 GPUs.
 
 > [!IMPORTANT]
 > Before beginning, submit a request to your Google Cloud representative for
-> access to the Deep Learning VM Image for a3-highgpu-8g. It is currently
-> available only by Private Preview request. This image contains patches that
-> significantly enhance the network performance of workloads that span multiple
-> a3-highgpu-8g VMs. You will use the image ID in the steps shown below.
+> access credentials to install the linux-gcp-tcpx kernel for a3-highgpu-8g.
+> This kernel contains patches that significantly enhance the network
+> performance of workloads that span multiple
+> a3-highgpu-8g VMs.
 
-## Upgrading from the v5 "legacy" solution to v6
-There is no direct path for upgrading the Slurm-GCP v5 solution in-place to v6.
-The recommended path requires temporarily bringing down your v5 cluster and
-replacing it with the v6 solution described in this document.
+## Upgrading from the "legacy" solution
+There is no direct path for upgrading the a3-highgpu-8g legacy solution in place.
+The recommended path requires temporarily bringing down your cluster and
+replacing it with the solution described in this document.
 
-> [!NOTE]
-> The `ml-slurm-a3-0-base.yaml` blueprint is identical for the "legacy" v5 and
-> v6 solutions. If you are upgrading from v5 to v6, do not destroy the v5 base
-> blueprint or re-deploy the v6 base blueprint. Simply copy the Filestore IP
-> address as instructed below.
+We recommend using `gcluster destroy` to destroy the deployments provisioned by the legacy blueprints:
 
-We recommend using `gcluster destroy` to destroy the deployments provisioned by the
-v5 legacy blueprints:
+- ![deprecated-badge] [Legacy v5 image building blueprint](v5-legacy/ml-slurm-a3-1-image-v5-legacy.yaml)
+- ![deprecated-badge] [Legacy v5 cluster provisioning blueprint](v5-legacy/ml-slurm-a3-2-cluster-v5-legacy.yaml)
+- ![deprecated-badge] [Legacy base provisioning blueprint](/examples/machine-learning/a3-highgpu-8g/ml-slurm-a3-0-base.yaml)
+- ![deprecated-badge] [Legacy image provisioning blueprint](/examples/machine-learning/a3-highgpu-8g/ml-slurm-a3-1-image.yaml)
+- ![deprecated-badge] [Legacy cluster provisioning blueprint](/examples/machine-learning/a3-highgpu-8g/ml-slurm-a3-2-cluster.yaml)
 
-- [Legacy v5 image building blueprint](v5-legacy/ml-slurm-a3-1-image-v5-legacy.yaml)
-- [Legacy v5 cluster provisioning blueprint](v5-legacy/ml-slurm-a3-2-cluster-v5-legacy.yaml)
+Then follow the instructions below.
 
-Then follow the instructions below while skipping the re-deployment of the base
-blueprint.
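The `gcluster destroy` recommendation above operates on the deployment folders created when the legacy blueprints were deployed. A minimal sketch; the folder names are assumptions based on the legacy blueprints' default deployment names, so substitute the folders that actually exist in your working directory:

```shell
# Folder names below are assumptions; destroy the cluster first, then image, then base.
./gcluster destroy ml-slurm-a3-2-cluster --auto-approve
./gcluster destroy ml-slurm-a3-1-image --auto-approve
./gcluster destroy ml-slurm-a3-0-base --auto-approve
```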

 ## Required setup

@@ -47,41 +43,21 @@ gcluster --version
 
 ## Top-Level Design of Solution
 
-The solution is split into 3 Cluster Toolkit blueprints:
+The blueprint is split into 3 deployment groups:
 
-1. Provision 1 system network and 1 Filestore instance for mounting `/home`
+1. Group 1 provisions the system network, the GPU network, and a Filestore instance for mounting `/home`
    across the cluster.
-2. Build a custom image installing Slurm in an Ubuntu 20.04 image. The image
+2. Group 2 builds a custom image installing Slurm on an Ubuntu 22.04 image. The image
    runs a kernel patched with performance enhancements for the a3-highgpu-8g VM.
-3. Provision 4 GPU networks and a Slurm cluster using the custom image.
+3. Group 3 provisions the Slurm cluster and a3-highgpu-8g nodes using the custom image.
 
-The 1st and 2nd blueprints should be provisioned once and rarely need further
-modification. This approach separates the lifecycle of a Filestore instance from
-the lifecycle of the cluster, allowing the cluster to be deleted while retaining
-access to data and home directories. The 3rd cluster blueprint may be more
-frequently updated and re-provisioned as discussed below.
 
 ## First time considerations
 
 > [!IMPORTANT]
 > These steps do not need to be repeated when a cluster is re-provisioned. They
 > are initial setup steps in a project.
 
-Replace the values for `PROJECT_ID`, `REGION`, and `ZONE` with the project,
-region, and zone in which you have an a3-highgpu-8g allocation. The value for
-`BUCKET` must be unique and will be used to create a new bucket. After replacing
-the values, execute them so that they automatically populate parameters in the
-commands shown below. Note that each a3-highgpu-8g VM (`N_VMS`) contains 8 NVIDIA
-H100 GPUs.
-
-```shell
-export PROJECT_ID=customer-project-id
-export BUCKET=customer-bucket
-export REGION=customer-region
-export ZONE=customer-zone
-export N_VMS=32
-```
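Although the export block above is removed, the `gcloud` and `gcluster` commands later in this README still reference `${PROJECT_ID}`, `${BUCKET}`, `${ZONE}`, `${N_VMS}`, and (in the updated reservation command) `${A3_RESERVATION_NAME}`. A minimal sketch of the shell environment those commands assume; every value is a placeholder:

```shell
export PROJECT_ID=customer-project-id
export BUCKET=customer-bucket                 # must be globally unique
export REGION=customer-region
export ZONE=customer-zone
export N_VMS=32                               # each a3-highgpu-8g VM has 8 NVIDIA H100 GPUs
export A3_RESERVATION_NAME=a3-reservation-0   # match the reservation you create or receive
```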
-
 ### Saving Terraform state
 Create a bucket with versioning enabled to store Terraform state:

@@ -92,7 +68,7 @@ gcloud storage buckets create gs://${BUCKET} --project=${PROJECT_ID} \
 gcloud storage buckets update gs://${BUCKET} --versioning
 ```
 
-Modify all 3 blueprints to configure the new bucket to serve as a Terraform
+Modify the blueprints to configure the new bucket to serve as a Terraform
 remote backend:
 
 ```yaml
@@ -102,55 +78,51 @@ terraform_backend_defaults:
 bucket: customer-bucket # modify to bucket created above
 ```
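Only the `bucket:` line of the backend block appears above. For orientation, a sketch of the full `terraform_backend_defaults` block as it typically appears in Cluster Toolkit blueprints; the `type` and nesting are assumed from the Toolkit's standard GCS backend layout:

```yaml
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: customer-bucket  # modify to the bucket created above
```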
 
-### Set default values
+### Deployment variables
 
-Modify the the deployment variables `project_id`, `region`, `zone`, in the
-`vars` block of all 3 blueprints:
+Set the values in `a3high-slurm-deployment.yaml` for your deployment:
 
 ```yaml
+deployment_name: unique-name
 project_id: customer-project
 region: customer-region
 zone: customer-zone
 ```
 
-### Set kernel-patched OS image
+**Set kernel-patched OS image**
 
-Obtain values for `source_image_project_id` and `source_image` from your Google
-Cloud representative. Set them at approximately lines 33 and 34 of
-`ml-slurm-a3-1-image.yaml`.
+Obtain values for `tcpx_kernel_login`, `tcpx_kernel_password`, and `keyserver_ubuntu_key` from your Google Cloud representative. Set them in the deployment file.
 
 ```yaml
-source_image_project_id: source-image-project-id # use value supplied by Google Cloud staff
-source_image: source-image-name # use value supplied by Google Cloud staff
+tcpx_kernel_login: # use value supplied by Google Cloud staff
+tcpx_kernel_password: # use value supplied by Google Cloud staff
+keyserver_ubuntu_key: # use value supplied by Google Cloud staff
 ```
 
-### Reservation created by Google
+**Reservation created by Google**
 
 > [!IMPORTANT]
 > If you have ***not*** received a VM reservation from Google Cloud staff, then
 > skip this step and proceed to [manual reservation creation](#manual-creation-of-reservation).
 
-Set the deployment variable `a3_reservation_name` at approximately line 38 of
-`ml-slurm-a3-2-cluster.yaml` to the reservation name provided by Google.
+Set the deployment variable `a3_reservation_name` to the reservation name provided by Google.
 
 ```yaml
 # a3_reservation_name must be specified; if Google staff have provided you
 # with a reservation name, use it. Otherwise supply user-created reservation.
 a3_reservation_name: reservation-name-provided-by-google
 ```
 
-### Manual creation of reservation
+**Manual creation of reservation**
 
 > [!IMPORTANT]
-> If you received a VM reservation from Google Cloud staff, then skip this step
-> after confirming that you followed the instructions in [reservation created by
-> Google](#reservation-created-by-google).
+> If you received a VM reservation from Google Cloud staff, then skip this step.
 
 We recommend creating a reservation to ensure reliable access to re-create VMs
 if you need to redeploy or otherwise maintain your cluster.
 
 ```shell
-gcloud compute reservations create a3-reservation-0 \
+gcloud compute reservations create ${A3_RESERVATION_NAME} \
     --project=${PROJECT_ID} \
     --machine-type=a3-highgpu-8g \
     --vm-count=${N_VMS} \
@@ -160,24 +132,21 @@ gcloud compute reservations create a3-reservation-0 \
 ```
 
 This reservation must be specified when creating VMs with matching parameters
-(e.g. a3-highgpu-8g VM in configured zone). If you executed the command above
-without modification, you may leave `a3_reservation_name` at their default values in
-`ml-slurm-a3-2-cluster.yaml`. Otherwise, ensure that the reservation name in the
-blueprint matches the name of the user-created reservation.
+(e.g. an a3-highgpu-8g VM in the configured zone). Ensure that the reservation name in the
+deployment file matches the name of the user-created reservation.
 
 ```yaml
 # a3_reservation_name must be specified; if Google staff have provided you
 # with a reservation name, use it. Otherwise supply user-created reservation.
-a3_reservation_name: a3-reservation-0
+a3_reservation_name:
 ```
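Whether the reservation was created by Google or manually, it can be worth confirming that it exists in the expected zone with the expected machine type before deploying. A minimal sketch using the shell variables from the commands above:

```shell
# Inspect the reservation; verify machine type a3-highgpu-8g and the VM count.
gcloud compute reservations describe ${A3_RESERVATION_NAME} \
    --project=${PROJECT_ID} \
    --zone=${ZONE}
```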
 
-### Using Spot VM or DWS Flex
+**Using Spot VM or DWS Flex**
 
 > [!IMPORTANT]
 > Select one of the provisioning models: either Spot VM, DWS Flex, or reservation.
 
-In order to make use of DWS Flex Start mode with SlurmGCP, you must use the `dws_flex` variable in the `schedmd-slurm-gcp-v6-nodeset` module.
-For setting this variable , set the `a3_dws_flex_enabled` variable as shown below
+To use DWS Flex-Start mode with Slurm-GCP, set the `a3_dws_flex_enabled` variable as shown below:
 
 ```yaml
 vars:
@@ -187,9 +156,7 @@ For setting this variable , set the `a3_dws_flex_enabled` variable as shown belo
 
 To learn more about DWS Flex-Start, visit https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/docs/slurm-dws-flex.md
 
-Similarly ,to make use of Spot VMs,
-In order to make use of Spot VMs with Slurm, you must use the `enable_spot_vm` variable in the `schedmd-slurm-gcp-v6-nodeset` module.
-For setting this variable , set the `a3_enable_spot_vm` variable as shown below
+Similarly, to make use of Spot VMs with Slurm, set the `a3_enable_spot_vm` variable as shown below:
 
 ```yaml
 vars:
@@ -199,10 +166,10 @@ For setting this variable , set the `a3_enable_spot_vm` variable as shown below
 
 To learn more about Spot VMs, visit: https://cloud.google.com/compute/docs/instances/spot
 
-### Set cluster size
+**Set cluster size**
 
-At approximately line 37 of `ml-slurm-a3-2-cluster.yaml`, set the static cluster
-size. Recall that there are 8 NVIDIA H100 GPUs per a3-highgpu-8g VM.
+Set the static cluster size using the `a3_static_cluster_size` variable. Recall
+that there are 8 NVIDIA H100 GPUs per a3-highgpu-8g VM.
 
 ```yaml
 a3_static_cluster_size: 32
@@ -211,66 +178,12 @@ size. Recall that there are 8 NVIDIA H100 GPUs per a3-highgpu-8g VM.
 ## Cluster creation
 
 > [!NOTE]
-> The `ml-slurm-a3-0-base.yaml` blueprint is identical for the "legacy" v5 and
-> v6 solutions. If you are upgrading from v5 to v6, do not destroy the v5 base
-> blueprint or re-deploy the v6 base blueprint. Simply copy the Filestore IP
-> address as instructed below.
-
-The blueprint `ml-slurm-a3-0-base.yaml` will create 1 system network and a
-Filestore `/home` filesystem. Run the standard Toolkit workflow at the command
-line (approx. 5 minutes):
-
-```shell
-gcluster deploy ml-slurm-a3-0-base.yaml --auto-approve
-```
-
-Several values will be output to the screen. The output will be similar to:
-
-```hcl
-network_name_sysnet = "sys-net"
-network_storage_homefs = {
-  "client_install_runner" = {
-    "destination" = "install-nfs_home.sh"
-    "source" = "modules/embedded/modules/file-system/filestore/scripts/install-nfs-client.sh"
-    "type" = "shell"
-  }
-  "fs_type" = "nfs"
-  "local_mount" = "/home"
-  "mount_options" = "defaults,_netdev"
-  "mount_runner" = {
-    "args" = "\"10.224.153.226\" \"/nfsshare\" \"/home\" \"nfs\" \"defaults,_netdev\""
-    "destination" = "mount_home.sh"
-    "source" = "modules/embedded/modules/file-system/filestore/scripts/mount.sh"
-    "type" = "shell"
-  }
-  "remote_mount" = "/nfsshare"
-  "server_ip" = "10.224.153.226"
-}
-subnetwork_name_sysnet = "sys-subnet"
-```
-
-Build the custom image using ml-slurm-a3-1-image.yaml and the same workflow
-as above. Run at the command line:
-
-```shell
-gcluster deploy ml-slurm-a3-1-image.yaml --auto-approve
-```
-
-The image will take approximately 30 minutes to build.
-
-> [!IMPORTANT]
-> You must modify `ml-slurm-a3-2-cluster.yaml` to update the IP address of the
-> Filestore instance for `/home`. Your IP address will differ from that shown
-> below and must match the output from deploying the base blueprint above:
->
-> ```yaml
-> server_ip_homefs: 10.224.153.226
-> ```
+> This blueprint is not compatible with the legacy a3-highgpu-8g blueprints. We recommend bringing down the earlier cluster and redeploying using the steps below.
 
-Provision the cluster blueprint (approximately 5-10 minutes):
+Provision the cluster blueprint (approximately 40 minutes):
 
 ```shell
-gcluster deploy ml-slurm-a3-2-cluster.yaml --auto-approve
+./gcluster deploy -d a3high-slurm-deployment.yaml a3high-slurm-blueprint.yaml --auto-approve
 ```
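Once the deploy command above completes, jobs are submitted from the Slurm login node. A hedged sketch of finding and reaching it; the `name ~ login` filter is an assumption based on typical Slurm-GCP instance naming, so confirm the actual instance name first:

```shell
# List instances and identify the login node (name pattern is an assumption).
gcloud compute instances list --project=${PROJECT_ID} --filter="name ~ login"

# Replace LOGIN_NODE with the instance name found above.
gcloud compute ssh LOGIN_NODE --project=${PROJECT_ID} --zone=${ZONE}
```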
 
 ## Receive Data Path Manager (RxDM)
@@ -291,8 +204,7 @@ The Prolog will perform the following actions:
 - Mounts `/var/lib/nvidia/lib64` into `/usr/lib/nvidia/lib64` of the container
 - Mount `/opt/tcpdirect_benchmark/` from the host into the container so that a
   textproto file defining the mapping from GPU to NIC is available. This file
-  is present in the Deep Learning VM (DLVM) images that contain TCPDirect
-  patches.
+  is present in the image used by this solution.
 - Mount `/run/tcpx-${SLURM_JOB_ID}` from the container into the host. This is
   set to the environment variable `${UDS_PATH}` in the script. This directory
   contains Unix socket files that implement a TCPx interface available to the
@@ -328,7 +240,7 @@ The example workload below demonstrates the pattern recommended in Activating
 the Receive Data Path Manager during jobs while running the standard nccl-tests
 benchmark. It assumes the availability of a GPU/NIC topology file at
 `/opt/tcpdirect_benchmark/gpu_rxq_configuration.textproto`. This file is built
-into the DLVM images used by this solution, but may need to be provided if
+into the image used by this solution, but may need to be provided if
 using an alternative image.
 
 ### Clone the Cluster Toolkit repository containing the NCCL benchmark
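A minimal sketch of the clone step that the heading above introduces; the benchmark script's exact location inside the repository is not shown in this diff, so the `find` step locates `run-nccl-tests.sh` rather than assuming a path:

```shell
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
# Locate the benchmark script referenced later in this README.
find cluster-toolkit -name run-nccl-tests.sh
```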
@@ -359,3 +271,4 @@ sbatch run-nccl-tests.sh
 [consume]: https://cloud.google.com/compute/docs/instances/reservations-consume#consuming_instances_from_any_matching_reservation
 [tkdeps]: https://cloud.google.com/cluster-toolkit/docs/setup/install-dependencies
 [tkinstall]: https://github.com/GoogleCloudPlatform/cluster-toolkit/#quickstart
+[deprecated-badge]: https://img.shields.io/badge/-deprecated-%23fea2a2?style=plastic
