Merged
90 commits
aaa379b
chore: implement image copy to gcp using oidc
mchmarny Nov 4, 2025
36d1078
chore: refactor to env vars
mchmarny Nov 4, 2025
4a1b42c
chore: setup env vars
mchmarny Nov 4, 2025
3e1b34d
chore: debug variables
mchmarny Nov 4, 2025
572238b
chore: move vars to the job
mchmarny Nov 4, 2025
2584c8d
chore: inline variables
mchmarny Nov 4, 2025
ccd50ca
chore: move to branch
mchmarny Nov 4, 2025
785e705
fix: use if-then syntax instead of || for crane check
mchmarny Nov 4, 2025
ae1d7b8
fix: replace }
mchmarny Nov 4, 2025
1d1433f
chore: added cluster bringup
mchmarny Nov 4, 2025
0de0cf2
fix: remove deprecated logging/monitoring flags from cluster creation
mchmarny Nov 4, 2025
59c9b8a
chore: run delete always
mchmarny Nov 4, 2025
5ae0468
chore: add default network check
mchmarny Nov 4, 2025
cf82026
chore: move script to uat
mchmarny Nov 4, 2025
0c66246
chore: unique cluster name per test
mchmarny Nov 4, 2025
4e724a5
chore: add auth plugin
mchmarny Nov 4, 2025
40822f6
chore: add gpu node pool
mchmarny Nov 4, 2025
2d091e5
Merge branch 'main' into feature/oidc-gcp
mchmarny Nov 4, 2025
9aacbea
debug: add GitHub context output to troubleshoot OIDC auth
mchmarny Nov 4, 2025
dffb7b5
chore: remove debug info
mchmarny Nov 4, 2025
ce31be2
chore: remove spaces
mchmarny Nov 5, 2025
895372e
chore: add app install and tests
mchmarny Nov 5, 2025
bd338a1
chore: repaint project
mchmarny Nov 5, 2025
006d28b
chore: update oidc info
mchmarny Nov 5, 2025
5be1e13
chore: bind current user to cluster admin role
mchmarny Nov 5, 2025
421bed9
chore: add container admin role
mchmarny Nov 5, 2025
03f7651
chore: set csp env var
mchmarny Nov 5, 2025
e3e776e
chore: add missing values files
mchmarny Nov 5, 2025
1e444ac
chore: added gpu operator values
mchmarny Nov 5, 2025
04b4f07
chore: removing delete, skipping driver install
mchmarny Nov 5, 2025
56cc680
chore: set nvs version
mchmarny Nov 5, 2025
d6bdc2f
chore: add GPU node
mchmarny Nov 5, 2025
795455f
chore: switch to 1 node per zone
mchmarny Nov 5, 2025
4cc6187
chore: validate node pool command
mchmarny Nov 5, 2025
deeead6
chore: fix node pool creation
mchmarny Nov 5, 2025
83a935f
chore: remove unused variable
mchmarny Nov 5, 2025
c224307
chore: add pool as it's own step in cluster bringup
mchmarny Nov 5, 2025
fcfa31a
chore: reconcile identity
mchmarny Nov 5, 2025
e71a6d2
chore: remove sample config files
mchmarny Nov 6, 2025
a2bd526
chore: default sa
mchmarny Nov 6, 2025
5abdbb6
chore: cluster create
mchmarny Nov 6, 2025
9c96ee9
chore: gke uat bring up
mchmarny Nov 6, 2025
00e48cc
chore: add opportunistic maint
mchmarny Nov 13, 2025
a34ef81
chore: update gcloud cli
mchmarny Nov 13, 2025
cae819d
chore: default CLI
mchmarny Nov 13, 2025
c88fd88
chore: refactor to tf
mchmarny Nov 13, 2025
4ea8f42
chore: add plugin, test outputs
mchmarny Nov 13, 2025
ab3631a
chore: cleanup CI
mchmarny Nov 13, 2025
756474d
chore: add dcgm
mchmarny Nov 13, 2025
29c2121
chore: debug gcp install
mchmarny Nov 14, 2025
c774fae
chore: install driver by default
mchmarny Nov 14, 2025
c1664b7
chore: add gpu ds and quota prior to installing operator chart
mchmarny Nov 14, 2025
0ceb624
chore: fix test to account for labels
mchmarny Nov 14, 2025
6c1d124
chore: set to ubuntu
mchmarny Nov 14, 2025
8e53d61
chore: disabled on pool, enable in values
mchmarny Nov 14, 2025
7adec73
chore: simplify gpu op values
mchmarny Nov 14, 2025
c64bcb6
chore: disable secure boot
mchmarny Nov 14, 2025
8c5168b
chore; gpu operator values
mchmarny Nov 14, 2025
98afbda
fix: tests and values
lalitadithya Nov 14, 2025
55ecda1
fix: retry node event check
lalitadithya Nov 15, 2025
b12232c
fix: tests
lalitadithya Nov 15, 2025
0dac7e6
fix: tests
lalitadithya Nov 15, 2025
ad010ac
chore: trigger ci
lalitadithya Nov 15, 2025
6482ff2
chore: bump timeout
lalitadithya Nov 15, 2025
801afe6
fix: test
lalitadithya Nov 15, 2025
b9f3c95
fix: test
lalitadithya Nov 15, 2025
c03e7bd
fix: test
lalitadithya Nov 15, 2025
59fc941
fix: rerun
lalitadithya Nov 15, 2025
b074b87
fix: values
lalitadithya Nov 15, 2025
935cb87
Merge remote-tracking branch 'origin/main' into feature/oidc-gcp
lalitadithya Nov 15, 2025
dc447ce
fix: use latest
lalitadithya Nov 15, 2025
46913a4
fix: values
lalitadithya Nov 15, 2025
b10cca5
fix: test
lalitadithya Nov 15, 2025
2d6905d
chore: trigger ci
lalitadithya Nov 16, 2025
ba4047d
chore: set janitor value via env vars
mchmarny Nov 17, 2025
ca87986
chore: set janitor value via env vars
mchmarny Nov 17, 2025
f29a2d7
Merge branch 'main' into feature/oidc-gcp
mchmarny Nov 17, 2025
f00598d
chore: resolve pr feedback
mchmarny Nov 17, 2025
059db93
chore: Merge remote-tracking branch 'refs/remotes/origin/feature/oidc…
mchmarny Nov 17, 2025
7a2004a
chore: resolve pr feedback
mchmarny Nov 17, 2025
8fb4235
Merge remote-tracking branch 'refs/remotes/origin/feature/oidc-gcp' i…
mchmarny Nov 17, 2025
2d3547c
chore: clean up env vars
mchmarny Nov 17, 2025
eeedf93
chore: add missing headers
mchmarny Nov 17, 2025
7f91ede
chore: handle branch tags
mchmarny Nov 17, 2025
5bdf0ce
chore: fix bash condition
mchmarny Nov 17, 2025
efa8b47
chore: add debugging
mchmarny Nov 17, 2025
5a37acd
chore: update chart project value
mchmarny Nov 17, 2025
5812a80
chore: clean nvs values
mchmarny Nov 17, 2025
74d0250
chore: update comment
mchmarny Nov 17, 2025
c9ba4b6
chore: add missing service account file
mchmarny Nov 17, 2025
163 changes: 163 additions & 0 deletions .github/workflows/integration-gcp.yml
@@ -0,0 +1,163 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name: Integration Tests - GCP

on:
  workflow_dispatch: {} # allow manual runs for testing
  schedule:
    - cron: '30 14 * * *' # daily at 14:30 UTC, runs on default branch only
  push:
    branches:
      - main
      - feature/oidc-gcp

permissions:
  contents: read
  actions: read
  id-token: write

jobs:
  integration-test-gcp:
    runs-on: ubuntu-latest
    timeout-minutes: 60
    env:
      CSP: "gcp"
      PREFIX: "nvs"
      PROJECT_ID: "nv-dgxck8s-20250306"
      IDENTITY_PROVIDER: "projects/1015254933832/locations/global/workloadIdentityPools/github-pool/providers/github-provider"
      SERVICE_ACCOUNT: "github-actions-user"
      # Terraform Vars
      TF_VAR_deployment_id: "d${{ github.run_id }}"
      TF_VAR_project_id: "nv-dgxck8s-20250306"
      TF_VAR_region: "europe-west4"
      TF_VAR_zone: "europe-west4-b"
      TF_VAR_system_node_type: "e2-standard-4"
      TF_VAR_system_node_count: "3"
      TF_VAR_gpu_node_pool_name: "gpu-pool"
      TF_VAR_gpu_machine_type: "a3-megagpu-8g"
      TF_VAR_gpu_node_count: "1"
      TF_VAR_gpu_reservation_project: "nv-dgxcloudprodgsc-20240206"
      TF_VAR_gpu_reservation_name: "gsc-a3-megagpu-8g-shared-res-2"
      TF_VAR_gpu_driver_version: "INSTALLATION_DISABLED"
      TF_VAR_resource_labels: '{"environment":"test","team":"nvsentinel","managed_by":"terraform"}'
      # Debug
      SKIP_DELETE: "false" # skip cluster deletion
      TEST_TAG: "main-33c1d03"

    steps:
      # Checkout
      - name: Checkout
        uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0

      # Terraform
      - name: Terraform
        uses: hashicorp/setup-terraform@b9cd54a3c349d3f38e8881555d616ced269862dd # v3.1.2
        with:
          terraform_version: "1.13.5"

      # Auth
      - name: Get AuthN Token
        id: auth
        uses: google-github-actions/auth@7c6bc770dae815cd3e89ee6cdf493a5fab2cc093 # v3
        with:
          token_format: access_token
          workload_identity_provider: ${{ env.IDENTITY_PROVIDER }}
          service_account: "${{ env.SERVICE_ACCOUNT }}@${{ env.PROJECT_ID }}.iam.gserviceaccount.com"

      # Gcloud
      - name: Setup gcloud CLI
        uses: google-github-actions/setup-gcloud@aa5489c8933f4cc7a4f7d45035b3b1440c9c10db # v3.0.1

      # Cluster
      - name: Create Cluster
        id: cluster
        shell: bash
        continue-on-error: true
        run: |
          set -euo pipefail
          cd tests/uat/gcp/cluster
          terraform init
          terraform apply -auto-approve

      # Connect
      - name: Connect to Cluster
        id: client
        if: steps.cluster.outcome == 'success'
        shell: bash
        run: |
          set -euo pipefail
          echo "Installing GKE auth plugin..."
          gcloud components install gke-gcloud-auth-plugin --quiet --project ${{ env.TF_VAR_project_id }}
          echo "Getting cluster credentials..."
          gcloud container clusters get-credentials "${{ env.PREFIX }}-${{ env.TF_VAR_deployment_id }}" \
            --zone ${{ env.TF_VAR_zone }} --project ${{ env.TF_VAR_project_id }}

      # Image Tag
      - name: Compute ref name with short SHA
        id: ref-name
        run: |
          if [[ "${{ github.ref_type }}" == "tag" ]]; then
            SAFE_REF="${{ github.ref_name }}"
          elif [[ "${{ github.ref_name }}" == "main" ]]; then
            SHORT_SHA=$(echo "${{ github.sha }}" | cut -c1-7)
            SAFE_REF="${{ github.ref_name }}-${SHORT_SHA}"
          else
            SAFE_REF="${{ env.TEST_TAG }}"
          fi
          # Sanitize ref name: replace slashes with hyphens for Docker tag compatibility
          SAFE_REF=$(echo "$SAFE_REF" | sed 's/\//-/g')
          echo "value=$SAFE_REF" >> $GITHUB_OUTPUT

      # Apps
      - name: Install NVS
        id: apps
        if: steps.client.outcome == 'success'
        shell: bash
        env:
          GCP_PROJECT_ID: "${{ env.PROJECT_ID }}"
          GCP_ZONE: "${{ env.TF_VAR_zone }}"
          GCP_SERVICE_ACCOUNT: "${{ env.SERVICE_ACCOUNT }}"
          NVSENTINEL_VERSION: "${{ steps.ref-name.outputs.value }}"
        run: |
          set -euxo pipefail
          tests/uat/install-apps.sh

      # Test
      - name: Run UAT Tests
        id: tests
        if: steps.apps.outcome == 'success'
        shell: bash
        run: |
          set -euxo pipefail
          tests/uat/tests.sh

      # Teardown
      - name: Destroy Cluster
        if: always() && steps.cluster.outcome != 'skipped' && env.SKIP_DELETE != 'true'
        shell: bash
        run: |
          set -euxo pipefail
          cd tests/uat/gcp/cluster
          terraform destroy -auto-approve

      # Summary
      - name: Test Summary
        if: always()
        run: |
          echo "## Test Results" >> $GITHUB_STEP_SUMMARY
          echo "- Cluster: ${{ steps.cluster.outcome }}" >> $GITHUB_STEP_SUMMARY
          echo "- Connection: ${{ steps.client.outcome }}" >> $GITHUB_STEP_SUMMARY
          echo "- Apps: ${{ steps.apps.outcome }}" >> $GITHUB_STEP_SUMMARY
          echo "- Tests: ${{ steps.tests.outcome }}" >> $GITHUB_STEP_SUMMARY
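
The tag selection in the `Compute ref name with short SHA` step can be exercised locally before pushing. The sketch below restates that branching as a bash function; the name `compute_safe_ref` and its positional arguments are illustrative assumptions and do not appear in the workflow.

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the workflow's tag logic; not part of the PR.
# Args: ref_type ref_name sha fallback_tag
compute_safe_ref() {
  local ref_type="$1" ref_name="$2" sha="$3" fallback="$4" safe_ref
  if [[ "$ref_type" == "tag" ]]; then
    # Tags are used as-is.
    safe_ref="$ref_name"
  elif [[ "$ref_name" == "main" ]]; then
    # main gets the first seven characters of the commit SHA appended.
    safe_ref="${ref_name}-${sha:0:7}"
  else
    # Any other branch falls back to a fixed tag (TEST_TAG in the workflow).
    safe_ref="$fallback"
  fi
  # Replace slashes with hyphens so the value is usable as an image tag.
  printf '%s\n' "${safe_ref//\//-}"
}

compute_safe_ref branch main 33c1d03aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa main-33c1d03
compute_safe_ref branch feature/oidc-gcp 0123456789abcdef main-33c1d03
compute_safe_ref tag release/v1.0 0123456789abcdef main-33c1d03
```

Passing the GitHub context values as arguments, rather than interpolating `${{ ... }}` into the script text, also keeps branch and tag names out of the shell source.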
@@ -17,5 +17,4 @@ kind: ServiceAccount
metadata:
name: {{ include "kubernetes-object-monitor.fullname" . }}
labels:
{{- include "kubernetes-object-monitor.labels" . | nindent 4 }}

{{- include "kubernetes-object-monitor.labels" . | nindent 4 }}
@@ -92,4 +92,4 @@ spec:
tolerations:
{{- toYaml . | nindent 8 }}
{{- end }}

runtimeClassName: nvidia
Empty file removed tests/uat/gcp/.gitkeep
Empty file.
54 changes: 54 additions & 0 deletions tests/uat/gcp/cert-manager-values.yaml
@@ -0,0 +1,54 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

installCRDs: true

# Optimize for faster startup in Kind/test environments
global:
  leaderElection:
    # Reduce leader election timeout for faster startup
    leaseDuration: 30s
    renewDeadline: 20s
    retryPeriod: 5s

# Reduce resource requests for Kind (local testing)
resources:
  requests:
    cpu: 10m
    memory: 32Mi

webhook:
  # Reduce webhook resource requirements
  resources:
    requests:
      cpu: 10m
      memory: 32Mi
  # Faster readiness checks
  readinessProbe:
    initialDelaySeconds: 3
    periodSeconds: 3

cainjector:
  # Reduce cainjector resource requirements
  resources:
    requests:
      cpu: 10m
      memory: 32Mi

startupapicheck:
  # Reduce startup check resource requirements
  resources:
    requests:
      cpu: 10m
      memory: 32Mi
94 changes: 94 additions & 0 deletions tests/uat/gcp/cluster/README.md
@@ -0,0 +1,94 @@
# GKE Cluster Terraform Configuration

This Terraform configuration creates a GKE cluster with GPU nodes for NVSentinel testing.

- Single zone (zonal cluster)
- GPU nodes consume capacity from a specific reservation (reservation affinity)
- Service account `gke-cluster-kubernetes@PROJECT_ID.iam.gserviceaccount.com` must exist
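
One way to check the service account requirement before running Terraform is sketched below. The helper names `sa_email` and `check_sa` are hypothetical; the only external call is `gcloud iam service-accounts describe`, and the check is skipped with a distinct return code when `gcloud` is not installed.

```bash
#!/usr/bin/env bash
# Sketch only: sa_email and check_sa are illustrative helpers, not part of the PR.

# sa_email PROJECT_ID: print the service account email this configuration expects.
sa_email() {
  printf 'gke-cluster-kubernetes@%s.iam.gserviceaccount.com\n' "$1"
}

# check_sa PROJECT_ID: return 0 if the account exists, non-zero otherwise.
check_sa() {
  local project="$1" email
  email="$(sa_email "$project")"
  if ! command -v gcloud >/dev/null 2>&1; then
    echo "gcloud not found; cannot verify ${email}" >&2
    return 2
  fi
  gcloud iam service-accounts describe "$email" --project "$project" >/dev/null
}

# Example: check_sa nv-dgxck8s-20250306 && echo "service account present"
```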

## Prerequisites

- [Terraform](https://www.terraform.io/downloads.html) `>= 1.9.5`
- [gcloud CLI](https://cloud.google.com/sdk/docs/install) configured with appropriate credentials
- GCP project with necessary APIs enabled:
  - Kubernetes Engine API
  - Compute Engine API
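
The prerequisites above can be checked with a short script. This is a minimal sketch: `version_ge` is a hypothetical helper that relies on GNU `sort -V`, and the `gcloud services enable` line is left commented out because it changes project state.

```bash
#!/usr/bin/env bash
# version_ge A B: succeeds when version A >= version B (needs GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

if command -v terraform >/dev/null 2>&1; then
  # The first line of `terraform version` looks like "Terraform v1.13.5".
  tf_version="$(terraform version | head -n1 | sed 's/^Terraform v//')"
  version_ge "$tf_version" "1.9.5" || echo "Terraform ${tf_version} is older than 1.9.5" >&2
fi

# Enable the two required APIs (PROJECT_ID is assumed to be set by the caller):
# gcloud services enable container.googleapis.com compute.googleapis.com --project "$PROJECT_ID"
```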

## Known Issues

⚠️ **Reservation Maintenance Interval Mismatch - RESOLVED**

**Previous Issue:** gSC (Google Supercomputer) reservations require instances to have `maintenanceInterval=PERIODIC`, but GKE created instances with `maintenanceInterval=MAINTENANCE_INTERVAL_UNSPECIFIED` by default.

**Solution:** The GPU node pool configuration now includes a `host_maintenance_policy` block with `maintenance_interval = "PERIODIC"`, which resolves this issue.

```hcl
host_maintenance_policy {
maintenance_interval = "PERIODIC"
}
```

## Usage

1. **Initialize Terraform:**
```bash
terraform init
```

2. **Configure variables (optional):**
```bash
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your values
```

3. **Preview changes:**
```bash
terraform plan
```

4. **Create the cluster:**
```bash
terraform apply
```

5. **Get kubeconfig:**
```bash
gcloud container clusters get-credentials nvs-d2 --zone europe-west4-b --project nv-dgxck8s-20250306
```

Or use the output command:
```bash
terraform output -raw kubeconfig_command | bash
```

6. **Destroy the cluster:**
```bash
terraform destroy
```
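
The create, test, and destroy steps above can be chained so that teardown always runs, similar to the `always()` condition on the CI destroy step. The following is a sketch under stated assumptions: `run` and `lifecycle` are hypothetical helpers, `-auto-approve` is used as in the workflow, and `DRY_RUN=1` only prints the commands.

```bash
#!/usr/bin/env bash
# Sketch only: `run` and `lifecycle` are hypothetical helpers, not part of the PR.

# run CMD...: execute the command, or just print it when DRY_RUN=1.
run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# lifecycle TEST_CMD...: init, apply, run the given test command, then destroy.
# The body runs in a subshell so the EXIT trap fires when the subshell ends,
# whether the earlier steps succeeded or failed.
lifecycle() (
  set -euo pipefail
  trap 'run terraform destroy -auto-approve' EXIT
  run terraform init
  run terraform apply -auto-approve
  run "$@"
)

# Example, from tests/uat/gcp/cluster: DRY_RUN=1 lifecycle ../../tests.sh
```

Running it with `DRY_RUN=1` first is a cheap way to confirm the order of operations before any resources are created.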

## Configuration Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `deployment_id` | Deployment identifier for cluster naming | `d2` |
| `project_id` | GCP project ID | `nv-dgxck8s-20250306` |
| `zone` | GCP zone for the cluster | `europe-west4-b` |
| `system_node_type` | Machine type for system nodes | `e2-standard-4` |
| `system_node_count` | Number of system nodes | `3` |
| `gpu_node_pool_name` | Name of the GPU node pool | `gpu-pool` |
| `gpu_machine_type` | Machine type for GPU nodes | `a3-megagpu-8g` |
| `gpu_node_count` | Number of GPU nodes | `1` |
| `gpu_reservation_project` | Project containing GPU reservation | `nv-dgxcloudprodgsc-20240206` |
| `gpu_reservation_name` | Name of GPU reservation | `gsc-a3-megagpu-8g-shared-res-2` |
| `gpu_driver_version` | GPU driver installation mode | `INSTALLATION_DISABLED` |
| `resource_labels` | Labels to apply to resources | `{}` |
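
Besides `terraform.tfvars`, any of these variables can be overridden through `TF_VAR_<name>` environment variables, which is how the CI workflow sets them. A minimal sketch follows; `RUN_ID` and the `cluster_name` helper are hypothetical, and the `PREFIX-deployment_id` naming is inferred from the workflow's `get-credentials` call rather than from the Terraform code.

```bash
#!/usr/bin/env bash
# Override selected defaults through TF_VAR_* environment variables.
export TF_VAR_deployment_id="d${RUN_ID:-2}"   # RUN_ID is a hypothetical variable
export TF_VAR_zone="europe-west4-b"
export TF_VAR_gpu_node_count="1"
# Map variables are passed as a JSON (or HCL) literal.
export TF_VAR_resource_labels='{"environment":"test","team":"nvsentinel"}'

# cluster_name PREFIX DEPLOYMENT_ID: name pattern assumed from the workflow.
cluster_name() {
  printf '%s-%s\n' "$1" "$2"
}

cluster_name nvs "$TF_VAR_deployment_id"
```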


## Outputs

- `cluster_name`: Name of the created cluster
- `cluster_location`: Zone where cluster is deployed
- `cluster_endpoint`: API endpoint (sensitive)
- `cluster_ca_certificate`: CA certificate (sensitive)
- `gpu_node_pool_name`: Name of GPU node pool
- `kubeconfig_command`: Command to configure kubectl
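
The outputs can also be consumed individually with `terraform output -raw`. The sketch below assembles the same `get-credentials` command from `cluster_name` and `cluster_location`; both function names are hypothetical, and `from_outputs` assumes a prior `terraform apply` in the current directory.

```bash
#!/usr/bin/env bash
# credentials_cmd NAME ZONE PROJECT: print the get-credentials command.
credentials_cmd() {
  printf 'gcloud container clusters get-credentials %s --zone %s --project %s\n' "$1" "$2" "$3"
}

# from_outputs PROJECT: read name and location from Terraform state.
from_outputs() {
  local name location
  name="$(terraform output -raw cluster_name)" || return 1
  location="$(terraform output -raw cluster_location)" || return 1
  credentials_cmd "$name" "$location" "$1"
}

# With the default values this prints the command shown in the Usage section.
credentials_cmd nvs-d2 europe-west4-b nv-dgxck8s-20250306
```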