GoogleCloudPlatform · SwarnaBharathiMantena · Nov 9, 2025
@@ -58,6 +58,9 @@ This blueprint uses GKE to provision a Kubernetes cluster and a H4D node pool, a
 ## Run a test using the MPI Operator
 The MPI Operator is installed on the cluster during the deployment. To run a test using the MPI Operator on the GKE H4D cluster, refer to https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/tree/main/hpc/mpi.
 
+## iRDMA Health Check
+The [irdma-health-check](./irdma-health-check/) folder contains a [README](./irdma-health-check/README.md) with steps for creating a mutating webhook that injects an init container to check the health of the iRDMA network on H4D nodes.
+
 ## Clean Up
 To destroy all resources associated with creating the GKE cluster, run the following command:
 

@@ -0,0 +1,131 @@
+# iRDMA Health Check Mutating Webhook for GKE
+
+This document provides instructions for setting up a mutating webhook in a GKE cluster to automatically run an iRDMA health check on H4D nodes.
+
+## Overview
+
+The solution consists of the following components:
+
+1. **Health Check Container**: A container image based on Rocky Linux 8 that contains a script to check the iRDMA device status.
+1. **Mutating Webhook**: A Go application that runs in the cluster and, *on pod creation*, injects the health check container as an init container into pods that have the `nodeSelector` `node.kubernetes.io/instance-type: h4d-highmem-192-lssd`.
+1. **Kubernetes Manifests**: A set of YAML files to deploy the webhook and its dependencies.
+
+## Prerequisites
+
+- A GKE cluster with H4D nodes.
+- `gcloud`, `docker`, and `kubectl` CLIs installed and configured.
+- `cert-manager` installed in your cluster. If you don't have it, install it by running:
+  ```plaintext
+  kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.11.0/cert-manager.yaml
+  ```
+  **Note**: After applying the manifest, wait for the `cert-manager` pods to be in the `Running` state before proceeding. You can check the status with the following command:
+  ```plaintext
+  kubectl get pods -n cert-manager
+  ```
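Instead of polling `kubectl get pods` by hand, the wait can be scripted. The sketch below wraps `kubectl wait` in a helper; the function name `wait_for_cert_manager` and the 180-second timeout are assumptions for illustration, not part of this PR.

```shell
# Block until every cert-manager Deployment reports the Available condition.
# The 180s timeout is an assumed value; tune it for your cluster.
wait_for_cert_manager() {
  kubectl wait --namespace cert-manager \
    --for=condition=Available deployment --all \
    --timeout=180s
}
```

Call `wait_for_cert_manager` right after applying the cert-manager manifest; it returns non-zero if the deployments are not available before the timeout.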
+
+## Setup and Deployment
+
+**Note**: Before you begin, please change into the directory containing the solution files:
+```plaintext
+cd examples/gke-h4d/irdma-health-check
+```
+
+### 1. Customization
+Before proceeding, customize the following values in the `build-and-push.sh` and `build-and-push-webhook.sh` files:
+
+*   **`PROJECT_ID`**: Your Google Cloud project ID.
+*   **`IMAGE_NAME`**: The name for the Docker image (e.g., `irdma-health-check`, `irdma-webhook-server`).
+*   **`IMAGE_TAG`**: The version tag for your Docker images (e.g., `v1.0.0`).
+*   **`REGION`**: The Google Cloud region where your Artifact Registry is located (e.g., `us-central1`).
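For reference, this is how the build scripts combine these four values into the full image URI. The values shown are the script defaults, and `h4d` is the repository name hard-coded in both scripts.

```shell
# Values mirror the defaults in the build scripts; replace them with your own.
PROJECT_ID="MY-GCP-PROJECT"
IMAGE_NAME="irdma-health-check"
IMAGE_TAG="v1.0.0"
REGION="us-central1"

# "h4d" is the Artifact Registry repository name used by the scripts.
IMAGE_URI="${REGION}-docker.pkg.dev/${PROJECT_ID}/h4d/${IMAGE_NAME}:${IMAGE_TAG}"
echo "${IMAGE_URI}"
```

The resulting URI is the one that must later appear in `webhook/main.go` (health check image) or in `manifests/04-webhook-deployment.yaml` (webhook image).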
+
+
+### 2. Build and Push the Health Check Init Container Image
+
+The health check script and its Dockerfile are provided.
+
+1.  **Run the build script**:
+    ```sh
+    ./build-and-push.sh
+    ```
+    This will build the Docker image (`us-central1-docker.pkg.dev/MY-GCP-PROJECT/h4d/irdma-health-check:v1.0.0`) and push it to the `h4d` Artifact Registry repository in your project.
+
+### 3. Build and Push the Webhook Server Image
+
+The webhook server Go application and its Dockerfile are provided in the `webhook/` directory.
+
+1.  **Verify Health Check Init Container Image**: Before building, ensure the `imageURI` field in `webhook/main.go` matches the URI of the `irdma-health-check` image you pushed (e.g., `us-central1-docker.pkg.dev/MY-GCP-PROJECT/h4d/irdma-health-check:v1.0.0`).
+
+1.  **Run the build script**:
+    ```sh
+    ./build-and-push-webhook.sh
+    ```
+    This will build the Docker image (`us-central1-docker.pkg.dev/MY-GCP-PROJECT/h4d/irdma-webhook-server:v1.0.0`) and push it to the `h4d` Artifact Registry repository in your project.
+
+### 4. Deploy the Webhook
+
+The `manifests` directory contains all the necessary Kubernetes resources.
+
+1.  **Verify Webhook Deployment Image**: Before applying, ensure the `image` field in `manifests/04-webhook-deployment.yaml` matches the URI of the `irdma-webhook-server` image you pushed (e.g., `us-central1-docker.pkg.dev/MY-GCP-PROJECT/h4d/irdma-webhook-server:v1.0.0`).
+
+1.  **Apply the manifests**:
+    ```sh
+    kubectl apply -f manifests/
+    ```
+    This will create:
+    - A namespace `irdma-health-check`.
+    - A self-signed issuer and a certificate for the webhook using `cert-manager`.
+    - The webhook deployment and service.
+    - The `MutatingWebhookConfiguration` that tells the Kubernetes API server to forward pod creation requests to the webhook.
+
+### 5. Test the Webhook
+
+A sample pod manifest `test-pod-trigger.yaml` is provided to test the webhook.
+
+1.  **Deploy the test pod**:
+    ```plaintext
+    kubectl apply -f test-pod-trigger.yaml
+    ```
+    This pod has the `nodeSelector` `node.kubernetes.io/instance-type: h4d-highmem-192-lssd`, so the webhook will act on it. It also has tolerations to ensure it gets scheduled on an H4D node.
+
+1.  **Verify the injection**:
+    Check the pod's definition to see if the `irdma-health-check` init container was injected by the webhook:
+    ```plaintext
+    kubectl get pod my-h4d-app-irdma-check-rocky8 -o yaml
+    ```
+    You should see the `irdma-health-check` container in the `spec.initContainers` section, along with the `securityContext` and `resources` injected by the webhook.
+
+1.  **Check the logs**:
+    If the init container runs, you can check its logs:
+    ```plaintext
+    kubectl logs my-h4d-app-irdma-check-rocky8 -c irdma-health-check
+    ```
+    If the health check fails, the pod will not start, and you can see the error by describing the pod:
+    ```plaintext
+    kubectl describe pod my-h4d-app-irdma-check-rocky8
+    ```
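The injection check above can also be scripted rather than read off the full YAML. This is a minimal sketch using `kubectl`'s `jsonpath` output; the helper names `list_init_containers` and `has_health_check` are hypothetical.

```shell
POD="my-h4d-app-irdma-check-rocky8"

# Print the names of all init containers in the pod spec, space separated.
list_init_containers() {
  kubectl get pod "$POD" -o jsonpath='{.spec.initContainers[*].name}'
}

# Succeed only if an init container named exactly irdma-health-check is present.
has_health_check() {
  list_init_containers | tr ' ' '\n' | grep -qx 'irdma-health-check'
}
```

For example, `has_health_check && echo "injected"` prints `injected` only when the webhook has mutated the pod.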
+
+## How It Works
+
+1.  **Webhook Trigger**: On *pod creation requests*, if a pod includes the `nodeSelector` `node.kubernetes.io/instance-type: h4d-highmem-192-lssd`, the Kubernetes API server, as configured by the `MutatingWebhookConfiguration`, sends an admission review request to the `irdma-webhook` service. The `MutatingWebhookConfiguration` specifies that the webhook should intercept `CREATE` operations on `pods` resources.
+2.  **Webhook Logic**: The webhook server receives the request, validates the `nodeSelector`, and if present, generates a JSON patch to inject the `irdma-health-check` init container into the pod's `spec`.
+3.  **Init Container Execution**: The injected init container runs before any main application containers. It executes the `irdma-health-check.sh` script.
+4.  **Health Check Outcome**: The script checks that the RDMA link state is `ACTIVE` and then runs an `ib_send_bw` loopback bandwidth test.
+    - If either check fails, the script attempts to recover the interface by bringing it down and back up, then re-runs the failed check.
+    - If the check fails again after recovery, the script exits with a non-zero status code, causing the init container to fail.
+5.  **Pod Startup Impact**: If the init container fails persistently, Kubernetes will not start the main application containers, indicating a problem with the node's iRDMA setup.
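To make step 2 concrete, here is a rough sketch of the kind of JSON patch (RFC 6902) a webhook could return for a pod that has no init containers yet. It is not taken from `webhook/main.go`: the helper name `print_example_patch` and the container fields are illustrative assumptions, and the real patch also carries the `securityContext` and `resources` mentioned in the test section.

```shell
# Assumed image URI; it must match the health check image pushed in step 2.
IMAGE_URI="us-central1-docker.pkg.dev/MY-GCP-PROJECT/h4d/irdma-health-check:v1.0.0"

# Emit a JSON patch that adds an initContainers array to a pod spec that has none.
# The container fields shown here are illustrative only.
print_example_patch() {
  cat <<EOF
[
  {
    "op": "add",
    "path": "/spec/initContainers",
    "value": [
      {
        "name": "irdma-health-check",
        "image": "${IMAGE_URI}"
      }
    ]
  }
]
EOF
}
```

If the pod already defines init containers, a patch would typically target an array index such as `/spec/initContainers/0` instead of replacing the whole array. In the admission response, the patch is sent base64-encoded with `patchType: JSONPatch`.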
+
+**Important Note on Namespace Selectors**: The `MutatingWebhookConfiguration` is configured to *not* run on pods in the `irdma-health-check` and `cert-manager` namespaces. This is critical to prevent a circular dependency where the webhook tries to mutate its own pods or the `cert-manager` pods, which would cause the system to become unstable.
+
+## Cleanup
+
+To remove all the resources created by this example, run:
+```plaintext
+kubectl delete -f test-pod-trigger.yaml
+kubectl delete -f manifests/
+```
+
+## Troubleshooting
+
+1. **Error**: `denied: Permission "artifactregistry.repositories.uploadArtifacts" denied on resource "projects/hpc-topolkit-dev/locations/us-central1/repositories/h4d" (or it may not exist)`
+
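+   **Fix**: Configure Docker to authenticate to Artifact Registry by running `gcloud auth configure-docker us-central1-docker.pkg.dev`, then re-run the build script.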
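Because the error message also says the repository "may not exist", it can help to confirm that the `h4d` repository is present before pushing. The sketch below is an assumption about one way to do that with `gcloud artifacts repositories`; the helper name `ensure_repository` is hypothetical, and the defaults match the build scripts.

```shell
# Check that the Artifact Registry repository exists; create it if it does not.
# Defaults ("h4d", "us-central1") match the values used by the build scripts.
ensure_repository() {
  repo="${1:-h4d}"
  region="${2:-us-central1}"
  if ! gcloud artifacts repositories describe "$repo" --location="$region" >/dev/null 2>&1; then
    gcloud artifacts repositories create "$repo" \
      --repository-format=docker --location="$region"
  fi
}
```

If the repository exists and Docker is configured but the push is still denied, the account may lack write access to the repository (for example the `roles/artifactregistry.writer` role).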
@@ -0,0 +1,40 @@
+#!/bin/bash
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+set -e
+cd "$(dirname "$0")"
+
+# EDIT THIS: Your Google Cloud project ID
+PROJECT_ID=MY-GCP-PROJECT
+# EDIT THIS: The name of the image to build
+IMAGE_NAME="irdma-webhook-server"
+# EDIT THIS: The image tag
+IMAGE_TAG="v1.0.0"
+
+# EDIT THIS: The region of your Artifact Registry repository
+REGION="us-central1"
+
+IMAGE_URI="${REGION}-docker.pkg.dev/${PROJECT_ID}/h4d/${IMAGE_NAME}:${IMAGE_TAG}"
+
+echo "Building and pushing webhook image: ${IMAGE_URI}"
+
+cd webhook
+docker build --no-cache -t "${IMAGE_NAME}:${IMAGE_TAG}" -f Dockerfile .
+docker tag "${IMAGE_NAME}:${IMAGE_TAG}" "${IMAGE_URI}"
+cd ..
+docker push "${IMAGE_URI}"
+
+echo "Webhook image pushed successfully."
@@ -0,0 +1,35 @@
+#!/bin/bash
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+set -e
+cd "$(dirname "$0")"
+
+# EDIT THIS: Your Google Cloud project ID
+PROJECT_ID=MY-GCP-PROJECT
+# EDIT THIS: The name of the image to build
+IMAGE_NAME="irdma-health-check"
+# EDIT THIS: The image tag
+IMAGE_TAG="v1.0.0"
+
+# EDIT THIS: The region of your Artifact Registry repository
+REGION="us-central1"
+
+IMAGE_URI="${REGION}-docker.pkg.dev/${PROJECT_ID}/h4d/${IMAGE_NAME}:${IMAGE_TAG}"
+
+echo "Building and pushing image: ${IMAGE_URI}"
+
+docker build -t "${IMAGE_URI}" .
+docker push "${IMAGE_URI}"
+
+echo "Image pushed successfully."
@@ -0,0 +1,138 @@
+#!/bin/bash
+# Copyright 2025 Google LLC
+#
+# RDMA Health Check for GKE Init Container on H4D Nodes
+#
+# Description:
+# This script performs RDMA health checks before the main container starts.
+# It's intended to run as an initContainer in a GKE Pod scheduled on H4D nodes.
+#
+# Functionality:
+# 1. Checks if the RDMA link state is ACTIVE.
+# 2. If the link is active, it performs a loopback bandwidth test using ib_send_bw,
+#    binding to the IP of the 'eth1' interface.
+# 3. If either test fails, it attempts a single recovery by bringing the 'eth1'
+#    network interface down and then up.
+# 4. It then re-runs the failed test.
+# 5. If a test fails a second time, the script will exit with a non-zero status code (1),
+#    causing the Pod to fail and preventing the main application from starting.
+
+PATH=${PATH}:/usr/sbin:/usr/local/bin
+
+# --- Configuration ---
+# The RDMA device and network interface names as exposed in the Pod.
+# Based on common GKE multi-network setups for RDMA.
+RDMA_DEVICE="irdma0/1" # Corresponds to the device/port
+# The network interface for network "rdma-0" is eth1 inside the Pod.
+NET_IFACE="eth1"
+
+# Number of loopback tests to run.
+LOOPBACK_ITERATIONS=1
+
+# Set DRY_RUN to a non-zero value (e.g., 1) to only log the recovery actions
+# instead of running them.
+DRY_RUN=0
+
+# --- Script Functions ---
+# Log a message to stderr, visible in init container logs.
+log() {
+  echo "$(date): $1" >&2
+}
+
+# Check if the RDMA link is active. Returns 0 if active, 1 otherwise.
+check_rdma_link() {
+  log "Checking RDMA link state for $RDMA_DEVICE..."
+  if rdma link show "$RDMA_DEVICE" | grep -q "state ACTIVE"; then
+    log "RDMA link is ACTIVE."
+    return 0
+  else
+    log "RDMA link is not ACTIVE."
+    return 1
+  fi
+}
+
+# Run the ib_send_bw loopback test, binding to the RDMA interface IP.
+# Returns 0 if all tests pass, 1 otherwise.
+run_loopback_test() {
+  log "Running loopback test for $NET_IFACE ($RDMA_DEVICE)..."
+  local success_count=0
+
+  # Determine the IP address of the eth1 interface within the pod.
+  RDMA_IP=$(ip -4 -o addr show dev "$NET_IFACE" | awk '{print $4}' | cut -d/ -f1)
+  if [[ -z "$RDMA_IP" ]]; then
+    log "ERROR: Could not determine IP address for interface $NET_IFACE. Skipping loopback test."
+    return 1
+  fi
+  log "Discovered RDMA IP for $NET_IFACE: $RDMA_IP"
+
+  for ((i=1; i<=LOOPBACK_ITERATIONS; i++)); do
+    log "Running ib_send_bw iteration $i..."
+    # Start the server in the background, binding to the RDMA IP
+    # Using the full path: /usr/bin/ib_send_bw
+    /usr/bin/ib_send_bw -F -n 5 -q 10 -s 8388608 --mr_per_qp --bind_source_ip="$RDMA_IP" &
+    local server_pid=$!
+    sleep 1 # Wait for the server to be ready
+
+    # Run the client, connecting to the server on the RDMA IP
+    # Using the full path: /usr/bin/ib_send_bw
+    if /usr/bin/ib_send_bw -F -n 5 -q 10 -s 8388608 --mr_per_qp --bind_source_ip="$RDMA_IP" "$RDMA_IP"; then
+      ((success_count++))
+    else
+      log "ib_send_bw client failed in iteration $i."
+    fi
+    # Clean up the server process
+    kill $server_pid 2>/dev/null || true
+    wait $server_pid 2>/dev/null
+  done
+
+  log "Loopback test result: $success_count/$LOOPBACK_ITERATIONS successful."
+  if [ "$success_count" -eq "$LOOPBACK_ITERATIONS" ]; then
+    log "Loopback test PASSED."
+    return 0
+  else
+    log "Loopback test FAILED."
+    return 1
+  fi
+}
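The IP discovery in `run_loopback_test` can be exercised without an RDMA interface by feeding it a sample line. This sketch assumes the one-line format printed by `ip -4 -o addr show`; the address is made up, `extract_ip` is a hypothetical helper, and the trailing `head -n 1` is an addition (not in the script) that keeps only the first address if the interface has several.

```shell
# Sample line in the one-line format of `ip -4 -o addr show dev eth1`.
# The address 10.0.0.5/24 is made up for illustration.
SAMPLE='3: eth1    inet 10.0.0.5/24 brd 10.0.0.255 scope global eth1\       valid_lft forever preferred_lft forever'

# Same pipeline as the script: take field 4 (address/prefix), drop the prefix length.
# head -n 1 is an addition that keeps only the first address.
extract_ip() {
  awk '{print $4}' | cut -d/ -f1 | head -n 1
}

RDMA_IP=$(printf '%s\n' "$SAMPLE" | extract_ip)
echo "$RDMA_IP"
```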
+
+# Attempt to recover the network interface by bouncing it.
+try_recover_rdma() {
+  log "Attempting to recover interface $NET_IFACE..."
+  if [[ ${DRY_RUN} == 0 ]] ; then
+    ifconfig "$NET_IFACE" down
+    sleep 2
+    ifconfig "$NET_IFACE" up
+    sleep 5 # Allow time for the interface to initialize
+  else
+    log "DRY_RUN: Would have run ifconfig down/up on $NET_IFACE."
+  fi
+  log "Recovery attempt finished."
+}
+
+
+# --- Main Logic ---
+log "Starting RDMA health check init container."
+
+# 1. First, check the RDMA link state.
+if ! check_rdma_link; then
+  log "RDMA link check failed. Attempting recovery..."
+  try_recover_rdma
+  if ! check_rdma_link; then
+    log "ERROR: RDMA link is not ACTIVE after recovery attempt. Failing pod."
+    exit 1
+  fi
+  log "RDMA link check passed after recovery."
+fi
+
+# 2. If the link is good, perform the loopback test.
+if ! run_loopback_test; then
+  log "RDMA loopback test failed. Attempting recovery..."
+  try_recover_rdma
+  if ! run_loopback_test; then
+    log "ERROR: RDMA loopback test failed after recovery attempt. Failing pod."
+    exit 1
+  fi
+  log "RDMA loopback test passed after recovery."
+fi
+
+log "RDMA health checks passed. Init container exiting successfully."
+exit 0
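The link-state check can be tried on a machine without RDMA hardware by stubbing the `rdma` command. This is a minimal sketch: the stub's output line is an assumed sample of what `rdma link show` prints, and `FAKE_STATE` is a hypothetical variable used only to drive the stub.

```shell
RDMA_DEVICE="irdma0/1"

# Stub standing in for the real `rdma` CLI; the output line is an assumed sample.
rdma() {
  echo "link ${3} state ${FAKE_STATE:-ACTIVE} physical_state LINK_UP netdev eth1"
}

# Same test the script applies: succeed only when the link reports state ACTIVE.
check_rdma_link() {
  rdma link show "$RDMA_DEVICE" | grep -q "state ACTIVE"
}

check_rdma_link && echo "link ok"
FAKE_STATE=DOWN
check_rdma_link || echo "link down"
```

The pattern `state ACTIVE` does not match `physical_state LINK_UP`, so a link reported as `state DOWN` with a live physical layer still fails the check.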
@@ -0,0 +1,4 @@
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: irdma-health-check
@@ -0,0 +1,12 @@
+# cert-manager is a popular open-source tool for managing TLS certificates in Kubernetes.
+# It can automatically provision certificates from various sources, such as Let's Encrypt,
+# or self-signed issuers, and keep them up-to-date.
+#
+# For this webhook, cert-manager will create a self-signed certificate and automatically
+# inject the CA bundle into the MutatingWebhookConfiguration.
+#
+# Please install cert-manager by following the official instructions:
+# https://cert-manager.io/docs/installation/
+#
+# For example, using kubectl:
+# kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.11.0/cert-manager.yaml
@@ -0,0 +1,7 @@
+apiVersion: cert-manager.io/v1
+kind: Issuer
+metadata:
+  name: selfsigned-issuer
+  namespace: irdma-health-check
+spec:
+  selfSigned: {}
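After this manifest is applied, one way to confirm that cert-manager has accepted the issuer is to wait for its `Ready` condition. This is a sketch; the helper name `wait_for_issuer` and the 60-second timeout are assumptions.

```shell
# Block until the self-signed Issuer reports the Ready condition.
# The 60s timeout is an assumed value.
wait_for_issuer() {
  kubectl wait --namespace irdma-health-check \
    --for=condition=Ready issuer/selfsigned-issuer \
    --timeout=60s
}
```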