3 changes: 3 additions & 0 deletions examples/gke-h4d/README.md
@@ -58,6 +58,9 @@ This blueprint uses GKE to provision a Kubernetes cluster and a H4D node pool, a
## Run a test using the MPI Operator
The MPI Operator is installed on the cluster during the deployment. To run a test using the MPI Operator on the GKE H4D cluster, refer to https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/tree/main/hpc/mpi.

## iRDMA Health Check
The [irdma-health-check](./irdma-health-check/) folder contains a [README](./irdma-health-check/README.md) that explains how to create a mutating webhook that injects an init container to check the health of the iRDMA network on H4D nodes.

## Clean Up
To destroy all resources associated with creating the GKE cluster, run the following command:

131 changes: 131 additions & 0 deletions examples/gke-h4d/irdma-health-check/README.md
@@ -0,0 +1,131 @@
# iRDMA Health Check Mutating Webhook for GKE

This document provides instructions for setting up a mutating webhook in a GKE cluster to automatically run an iRDMA health check on H4D nodes.

## Overview

The solution consists of the following components:

1. **Health Check Container**: A container image based on Rocky Linux 8 that contains a script to check the iRDMA device status.
1. **Mutating Webhook**: A Go application that runs in the cluster and, when a pod is created with the `nodeSelector` `node.kubernetes.io/instance-type: h4d-highmem-192-lssd`, injects the health check container into it as an init container.
1. **Kubernetes Manifests**: A set of YAML files to deploy the webhook and its dependencies.

## Prerequisites

- A GKE cluster with H4D nodes.
- `gcloud`, `docker`, and `kubectl` CLIs installed and configured.
- `cert-manager` installed in your cluster. If you don't have it, install it by running:
```plaintext
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.11.0/cert-manager.yaml
```
**Note**: After applying the manifest, wait for the `cert-manager` pods to be in the `Running` state before proceeding. You can check the status with the following command:
```plaintext
kubectl get pods -n cert-manager
```
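
Instead of polling `kubectl get pods` by hand, the wait can be scripted. The following is a minimal sketch, assuming the default `cert-manager` namespace used by the manifest above; the function name `wait_for_cert_manager` and the 180-second default timeout are illustrative choices, not part of this example.

```sh
# Hypothetical helper: block until every Deployment in the cert-manager
# namespace reports the Available condition, or until the timeout expires.
# The optional first argument overrides the (arbitrary) 180s default.
wait_for_cert_manager() {
  kubectl wait --namespace cert-manager \
    --for=condition=Available deployment --all \
    --timeout="${1:-180s}"
}
```

Calling `wait_for_cert_manager` right after the `kubectl apply` returns a non-zero status if the deployments are not available in time, so it can gate the remaining steps in a script.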

## Setup and Deployment

**Note**: Before you begin, please change into the directory containing the solution files:
```plaintext
cd examples/gke-h4d/irdma-health-check
```

### 1. Customization
Before proceeding, customize the following values in the `build-and-push.sh` and `build-and-push-webhook.sh` files:

* **`PROJECT_ID`**: Your Google Cloud project ID.
* **`IMAGE_NAME`**: The name for the Docker image (e.g., `irdma-health-check`, `irdma-webhook-server`).
* **`IMAGE_TAG`**: The version tag for your Docker images (e.g., `v1.0.0`).
* **`REGION`**: The Google Cloud region where your Artifact Registry is located (e.g., `us-central1`).
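
Both build scripts combine these values into a single Artifact Registry image URI. The sketch below shows that composition in isolation; the helper name `image_uri` is hypothetical, and the repository name `h4d` is the one hard-coded in the scripts.

```sh
# Hypothetical helper: compose an Artifact Registry image URI.
# Arguments: region, project ID, repository, image name, image tag.
image_uri() {
  printf '%s-docker.pkg.dev/%s/%s/%s:%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example with the placeholder values used in this README.
image_uri us-central1 MY-GCP-PROJECT h4d irdma-health-check v1.0.0
# prints: us-central1-docker.pkg.dev/MY-GCP-PROJECT/h4d/irdma-health-check:v1.0.0
```

The same URI has to appear in `webhook/main.go` and in the webhook deployment manifest, so changing any of these values means updating those files as well.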


### 2. Build and Push the Health Check Init Container Image

The health check script and its Dockerfile are provided.

1. **Run the build script**:
```sh
./build-and-push.sh
```
This builds the Docker image (`us-central1-docker.pkg.dev/MY-GCP-PROJECT/h4d/irdma-health-check:v1.0.0`) and pushes it to the `h4d` Artifact Registry repository in your project.

### 3. Build and Push the Webhook Server Image

The webhook server Go application and its Dockerfile are provided in the `webhook/` directory.

1. **Verify Health Check Init Container Image**: Before building, ensure the `imageURI` field in `webhook/main.go` matches the URI of the `irdma-health-check` image you pushed (e.g., `us-central1-docker.pkg.dev/MY-GCP-PROJECT/h4d/irdma-health-check:v1.0.0`).

1. **Run the build script**:
```sh
./build-and-push-webhook.sh
```
This builds the Docker image (`us-central1-docker.pkg.dev/MY-GCP-PROJECT/h4d/irdma-webhook-server:v1.0.0`) and pushes it to the `h4d` Artifact Registry repository in your project.

### 4. Deploy the Webhook

The `manifests` directory contains all the necessary Kubernetes resources.

1. **Verify Webhook Deployment Image**: Before applying, ensure the `image` field in `manifests/04-webhook-deployment.yaml` matches the URI of the `irdma-webhook-server` image you pushed (e.g., `us-central1-docker.pkg.dev/MY-GCP-PROJECT/h4d/irdma-webhook-server:v1.0.0`).

1. **Apply the manifests**:
```sh
kubectl apply -f manifests/
```
This will create:
- A namespace `irdma-health-check`.
- A self-signed issuer and a certificate for the webhook using `cert-manager`.
- The webhook deployment and service.
- The `MutatingWebhookConfiguration` that tells the Kubernetes API server to forward pod creation requests to the webhook.
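
To confirm that the resources listed above exist after `kubectl apply`, they can be listed in one pass. This is a rough sketch; the function name `show_webhook_resources` is hypothetical, and it assumes the cert-manager CRDs are installed so that the `issuer` and `certificate` resource types resolve.

```sh
# Hypothetical helper: list the resources created by the manifests.
# The optional first argument overrides the namespace.
show_webhook_resources() {
  ns="${1:-irdma-health-check}"
  # Namespaced resources: issuer, certificate, webhook deployment and service.
  kubectl get issuer,certificate,deployment,service --namespace "$ns"
  # The MutatingWebhookConfiguration is cluster-scoped.
  kubectl get mutatingwebhookconfigurations
}
```

If the certificate is not yet `Ready`, the webhook pod may not start, because it depends on the TLS secret that cert-manager generates.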

### 5. Test the Webhook

A sample pod manifest `test-pod-trigger.yaml` is provided to test the webhook.

1. **Deploy the test pod**:
```plaintext
kubectl apply -f test-pod-trigger.yaml
```
This pod has the `nodeSelector` `node.kubernetes.io/instance-type: h4d-highmem-192-lssd`, so the webhook will act on it. It also has tolerations to ensure it gets scheduled on an H4D node.

1. **Verify the injection**:
Check the pod's definition to see if the `irdma-health-check` init container was injected by the webhook:
```plaintext
kubectl get pod my-h4d-app-irdma-check-rocky8 -o yaml
```
You should see the `irdma-health-check` container in the `spec.initContainers` section, along with the `securityContext` and `resources` injected by the webhook.

1. **Check the logs**:
If the init container runs, you can check its logs:
```plaintext
kubectl logs my-h4d-app-irdma-check-rocky8 -c irdma-health-check
```
If the health check fails, the pod will not start, and you can see the error by describing the pod:
```plaintext
kubectl describe pod my-h4d-app-irdma-check-rocky8
```
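
The injection check from the "Verify the injection" step can also be automated rather than read from the full YAML. The sketch below is one possible approach; the function name `has_init_container` is hypothetical, and it relies only on the standard `kubectl get -o jsonpath` output.

```sh
# Hypothetical helper: exit 0 if the pod spec lists an init container
# with the given name, non-zero otherwise.
# Arguments: pod name, init container name.
has_init_container() {
  pod="$1"
  name="$2"
  # jsonpath prints the init container names separated by spaces;
  # put each on its own line so grep -x can match a whole name.
  kubectl get pod "$pod" -o jsonpath='{.spec.initContainers[*].name}' \
    | tr ' ' '\n' | grep -qx "$name"
}
```

For the test pod in this example, `has_init_container my-h4d-app-irdma-check-rocky8 irdma-health-check` should succeed once the webhook has mutated the pod.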

## How It Works

1. **Webhook Trigger**: On *pod creation requests*, if a pod includes the `nodeSelector` `node.kubernetes.io/instance-type: h4d-highmem-192-lssd`, the Kubernetes API server, as configured by the `MutatingWebhookConfiguration`, sends an admission review request to the `irdma-webhook` service. The `MutatingWebhookConfiguration` specifies that the webhook should intercept `CREATE` operations on `pods` resources.
2. **Webhook Logic**: The webhook server receives the request, checks the pod for the `nodeSelector`, and, if it is present, generates a JSON patch that injects the `irdma-health-check` init container into the pod's `spec`.
3. **Init Container Execution**: The injected init container runs before any main application containers. It executes the `irdma-health-check.sh` script.
4. **Health Check Outcome**: The script checks the RDMA link state and then runs an `ib_send_bw` loopback test.
- If either check fails (the link is not `ACTIVE`, or the `ib_send_bw` client exits with an error), the script attempts a single recovery by bringing the `eth1` interface down and back up, then re-runs the failed check.
- If the re-check also fails, the script exits with a non-zero status code, causing the init container to fail.
5. **Pod Startup Impact**: While the init container keeps failing, Kubernetes does not start the main application containers, which indicates a problem with the node's iRDMA setup.
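
The check, recover, and re-check flow described in steps 4 and 5 can be summarized as a small generic function. This is a simplified sketch of the pattern, not the script's actual code; `check_with_recovery` is a hypothetical name, and the two arguments are the names of shell functions supplied by the caller.

```sh
# Hypothetical helper: run a check; if it fails, run the recovery step
# once and run the check again.
# Arguments: name of the check function, name of the recovery function.
# Returns 0 if the check passes on either attempt, 1 otherwise.
check_with_recovery() {
  check_cmd="$1"
  recover_cmd="$2"
  if "$check_cmd"; then
    return 0
  fi
  "$recover_cmd"
  if "$check_cmd"; then
    return 0
  fi
  return 1
}
```

With the functions defined in `irdma-health-check.sh`, a call such as `check_with_recovery check_rdma_link try_recover_rdma || exit 1` would express the same behavior as the script's link check: one recovery attempt, then a failing exit status that stops the pod from starting.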

**Important Note on Namespace Selectors**: The `MutatingWebhookConfiguration` is configured to *not* run on pods in the `irdma-health-check` and `cert-manager` namespaces. This is critical to prevent a circular dependency where the webhook tries to mutate its own pods or the `cert-manager` pods, which would cause the system to become unstable.

## Cleanup

To remove all the resources created by this example, run:
```plaintext
kubectl delete -f test-pod-trigger.yaml
kubectl delete -f manifests/
```

## Troubleshooting

1. **Error**: `denied: Permission "artifactregistry.repositories.uploadArtifacts" denied on resource "projects/hpc-topolkit-dev/locations/us-central1/repositories/h4d" (or it may not exist)`

This usually means Docker is not authenticated to Artifact Registry. Run `gcloud auth configure-docker us-central1-docker.pkg.dev` and retry the push.
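
The same error also appears when the target repository does not exist, as the message itself hints. The following sketch checks for the repository, creates it if needed, and then configures Docker authentication. The function name `ensure_docker_repo` is hypothetical, and the sketch assumes your account has permission to create Artifact Registry repositories in the project.

```sh
# Hypothetical helper: make sure a Docker-format Artifact Registry
# repository exists, then register gcloud as a Docker credential helper.
# Arguments: repository name, location, project ID.
ensure_docker_repo() {
  repo="$1"
  location="$2"
  project="$3"
  # describe exits non-zero when the repository is missing or not visible.
  if ! gcloud artifacts repositories describe "$repo" \
      --location="$location" --project="$project" >/dev/null 2>&1; then
    gcloud artifacts repositories create "$repo" \
      --repository-format=docker \
      --location="$location" --project="$project"
  fi
  gcloud auth configure-docker "${location}-docker.pkg.dev" --quiet
}
```

For the defaults in the build scripts, this would be called as `ensure_docker_repo h4d us-central1 MY-GCP-PROJECT` before running `./build-and-push.sh`.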
40 changes: 40 additions & 0 deletions examples/gke-h4d/irdma-health-check/build-and-push-webhook.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
#!/bin/bash
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -e
cd "$(dirname "$0")"

# EDIT THIS: Your Google Cloud project ID
PROJECT_ID=MY-GCP-PROJECT
# EDIT THIS: The name of the image to build
IMAGE_NAME="irdma-webhook-server"
# EDIT THIS: The image tag
IMAGE_TAG="v1.0.0"

# EDIT THIS: The Google Cloud region of your Artifact Registry repository
REGION="us-central1"

IMAGE_URI="${REGION}-docker.pkg.dev/${PROJECT_ID}/h4d/${IMAGE_NAME}:${IMAGE_TAG}"

echo "Building and pushing webhook image: ${IMAGE_URI}"

cd webhook
docker build --no-cache -t "${IMAGE_NAME}:${IMAGE_TAG}" -f Dockerfile .
docker tag "${IMAGE_NAME}:${IMAGE_TAG}" "${IMAGE_URI}"
cd ..
docker push "${IMAGE_URI}"

echo "Webhook image pushed successfully."
35 changes: 35 additions & 0 deletions examples/gke-h4d/irdma-health-check/build-and-push.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
#!/bin/bash
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -e
cd "$(dirname "$0")"

# EDIT THIS: Your Google Cloud project ID
PROJECT_ID=MY-GCP-PROJECT
# EDIT THIS: The name of the image to build
IMAGE_NAME="irdma-health-check"
# EDIT THIS: The image tag
IMAGE_TAG="v1.0.0"

# EDIT THIS: The Google Cloud region of your Artifact Registry repository
REGION="us-central1"

IMAGE_URI="${REGION}-docker.pkg.dev/${PROJECT_ID}/h4d/${IMAGE_NAME}:${IMAGE_TAG}"

echo "Building and pushing image: ${IMAGE_URI}"

docker build -t "${IMAGE_URI}" .
docker push "${IMAGE_URI}"

echo "Image pushed successfully."
138 changes: 138 additions & 0 deletions examples/gke-h4d/irdma-health-check/irdma-health-check.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,138 @@
#!/bin/bash
# Copyright 2025 Google LLC
#
# RDMA Health Check for GKE Init Container on H4D Nodes
#
# Description:
# This script performs RDMA health checks before the main container starts.
# It's intended to run as an initContainer in a GKE Pod scheduled on H4D nodes.
#
# Functionality:
# 1. Checks if the RDMA link state is ACTIVE.
# 2. If the link is active, it performs a loopback bandwidth test using ib_send_bw,
# binding to the IP of the 'eth1' interface.
# 3. If either test fails, it attempts a single recovery by bringing the 'eth1'
# network interface down and then up.
# 4. It then re-runs the failed test.
# 5. If a test fails a second time, the script will exit with a non-zero status code (1),
# causing the Pod to fail and preventing the main application from starting.

PATH=${PATH}:/usr/sbin:/usr/local/bin

# --- Configuration ---
# The RDMA device and network interface names as exposed in the Pod.
# Based on common GKE multi-network setups for RDMA.
RDMA_DEVICE="irdma0/1" # Corresponds to the device/port
# The interface for network "rdma-0" is exposed as eth1 inside the Pod.
NET_IFACE="eth1"

# Number of loopback tests to run.
LOOPBACK_ITERATIONS=1

# Set to a non-zero value (e.g., 1) to only log the recovery action instead of running it.
DRY_RUN=0

# --- Script Functions ---
# Log a message to stderr, visible in init container logs.
log() {
  echo "$(date): $1" >&2
}

# Check if the RDMA link is active. Returns 0 if active, 1 otherwise.
check_rdma_link() {
  log "Checking RDMA link state for $RDMA_DEVICE..."
  if rdma link show "$RDMA_DEVICE" | grep -q "state ACTIVE"; then
    log "RDMA link is ACTIVE."
    return 0
  else
    log "RDMA link is not ACTIVE."
    return 1
  fi
}

# Run the ib_send_bw loopback test, binding to the RDMA interface IP.
# Returns 0 if all tests pass, 1 otherwise.
run_loopback_test() {
  log "Running loopback test for $NET_IFACE ($RDMA_DEVICE)..."
  local success_count=0

  # Determine the IPv4 address of the RDMA interface within the pod.
  # head -n1 keeps a single address if the interface has more than one.
  RDMA_IP=$(ip -4 -o addr show dev "$NET_IFACE" | awk '{print $4}' | cut -d/ -f1 | head -n1)
  if [[ -z "$RDMA_IP" ]]; then
    log "ERROR: Could not determine IP address for interface $NET_IFACE. Skipping loopback test."
    return 1
  fi
  log "Discovered RDMA IP for $NET_IFACE: $RDMA_IP"

  for ((i=1; i<=LOOPBACK_ITERATIONS; i++)); do
    log "Running ib_send_bw iteration $i..."
    # Start the server in the background, binding to the RDMA IP.
    /usr/bin/ib_send_bw -F -n 5 -q 10 -s 8388608 --mr_per_qp --bind_source_ip="$RDMA_IP" &
    local server_pid=$!
    sleep 1 # Wait for the server to be ready

    # Run the client, connecting to the server on the RDMA IP.
    if /usr/bin/ib_send_bw -F -n 5 -q 10 -s 8388608 --mr_per_qp --bind_source_ip="$RDMA_IP" "$RDMA_IP"; then
      success_count=$((success_count + 1))
    else
      log "ib_send_bw client failed in iteration $i."
    fi
    # Clean up the server process.
    kill "$server_pid" 2>/dev/null || true
    wait "$server_pid" 2>/dev/null
  done

  log "Loopback test result: $success_count/$LOOPBACK_ITERATIONS successful."
  if [ "$success_count" -eq "$LOOPBACK_ITERATIONS" ]; then
    log "Loopback test PASSED."
    return 0
  else
    log "Loopback test FAILED."
    return 1
  fi
}

# Attempt to recover the network interface by bouncing it.
try_recover_rdma() {
  log "Attempting to recover interface $NET_IFACE..."
  if [[ ${DRY_RUN} == 0 ]]; then
    ifconfig "$NET_IFACE" down
    sleep 2
    ifconfig "$NET_IFACE" up
    sleep 5 # Allow time for the interface to initialize
  else
    log "DRY_RUN: Would have run ifconfig down/up on $NET_IFACE."
  fi
  log "Recovery attempt finished."
}


# --- Main Logic ---
log "Starting RDMA health check init container."

# 1. First, check the RDMA link state.
if ! check_rdma_link; then
  log "RDMA link check failed. Attempting recovery..."
  try_recover_rdma
  if ! check_rdma_link; then
    log "ERROR: RDMA link is not ACTIVE after recovery attempt. Failing pod."
    exit 1
  fi
  log "RDMA link check passed after recovery."
fi

# 2. If the link is good, perform the loopback test.
if ! run_loopback_test; then
  log "RDMA loopback test failed. Attempting recovery..."
  try_recover_rdma
  if ! run_loopback_test; then
    log "ERROR: RDMA loopback test failed after recovery attempt. Failing pod."
    exit 1
  fi
  log "RDMA loopback test passed after recovery."
fi

log "RDMA health checks passed. Init container exiting successfully."
exit 0
@@ -0,0 +1,4 @@
apiVersion: v1
kind: Namespace
metadata:
  name: irdma-health-check
@@ -0,0 +1,12 @@
# cert-manager is a popular open-source tool for managing TLS certificates in Kubernetes.
# It can automatically provision certificates from various sources, such as Let's Encrypt,
# or self-signed issuers, and keep them up-to-date.
#
# For this webhook, cert-manager will create a self-signed certificate and automatically
# inject the CA bundle into the MutatingWebhookConfiguration.
#
# Please install cert-manager by following the official instructions:
# https://cert-manager.io/docs/installation/
#
# For example, using kubectl:
# kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.11.0/cert-manager.yaml
@@ -0,0 +1,7 @@
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-issuer
  namespace: irdma-health-check
spec:
  selfSigned: {}