Merged
Changes from 76 commits
90 commits
aaa379b
chore: implement image copy to gcp using oidc
mchmarny Nov 4, 2025
36d1078
chore: refactor to env vars
mchmarny Nov 4, 2025
4a1b42c
chore: setup env vars
mchmarny Nov 4, 2025
3e1b34d
chore: debug variables
mchmarny Nov 4, 2025
572238b
chore: move vars to the job
mchmarny Nov 4, 2025
2584c8d
chore: inline variables
mchmarny Nov 4, 2025
ccd50ca
chore: move to branch
mchmarny Nov 4, 2025
785e705
fix: use if-then syntax instead of || for crane check
mchmarny Nov 4, 2025
ae1d7b8
fix: replace }
mchmarny Nov 4, 2025
1d1433f
chore: added cluster bringup
mchmarny Nov 4, 2025
0de0cf2
fix: remove deprecated logging/monitoring flags from cluster creation
mchmarny Nov 4, 2025
59c9b8a
chore: run delete always
mchmarny Nov 4, 2025
5ae0468
chore: add default network check
mchmarny Nov 4, 2025
cf82026
chore: move script to uat
mchmarny Nov 4, 2025
0c66246
chore: unique cluster name per test
mchmarny Nov 4, 2025
4e724a5
chore: add auth plugin
mchmarny Nov 4, 2025
40822f6
chore: add gpu node pool
mchmarny Nov 4, 2025
2d091e5
Merge branch 'main' into feature/oidc-gcp
mchmarny Nov 4, 2025
9aacbea
debug: add GitHub context output to troubleshoot OIDC auth
mchmarny Nov 4, 2025
dffb7b5
chore: remove debug info
mchmarny Nov 4, 2025
ce31be2
chore: remove spaces
mchmarny Nov 5, 2025
895372e
chore: add app install and tests
mchmarny Nov 5, 2025
bd338a1
chore: repaint project
mchmarny Nov 5, 2025
006d28b
chore: update oidc info
mchmarny Nov 5, 2025
5be1e13
chore: bind current user to cluster admin role
mchmarny Nov 5, 2025
421bed9
chore: add container admin role
mchmarny Nov 5, 2025
03f7651
chore: set csp env var
mchmarny Nov 5, 2025
e3e776e
chore: add missing values files
mchmarny Nov 5, 2025
1e444ac
chore: added gpu operator values
mchmarny Nov 5, 2025
04b4f07
chore: removing delete, skipping driver install
mchmarny Nov 5, 2025
56cc680
chore: set nvs version
mchmarny Nov 5, 2025
d6bdc2f
chore: add GPU node
mchmarny Nov 5, 2025
795455f
chore: switch to 1 node per zone
mchmarny Nov 5, 2025
4cc6187
chore: validate node pool command
mchmarny Nov 5, 2025
deeead6
chore: fix node pool creation
mchmarny Nov 5, 2025
83a935f
chore: remove unused variable
mchmarny Nov 5, 2025
c224307
chore: add pool as it's own step in cluster bringup
mchmarny Nov 5, 2025
fcfa31a
chore: reconcile identity
mchmarny Nov 5, 2025
e71a6d2
chore: remove sample config files
mchmarny Nov 6, 2025
a2bd526
chore: default sa
mchmarny Nov 6, 2025
5abdbb6
chore: cluster create
mchmarny Nov 6, 2025
9c96ee9
chore: gke uat bring up
mchmarny Nov 6, 2025
00e48cc
chore: add opportunistic maint
mchmarny Nov 13, 2025
a34ef81
chore: update gcloud cli
mchmarny Nov 13, 2025
cae819d
chore: default CLI
mchmarny Nov 13, 2025
c88fd88
chore: refactor to tf
mchmarny Nov 13, 2025
4ea8f42
chore: add plugin, test outputs
mchmarny Nov 13, 2025
ab3631a
chore: cleanup CI
mchmarny Nov 13, 2025
756474d
chore: add dcgm
mchmarny Nov 13, 2025
29c2121
chore: debug gcp install
mchmarny Nov 14, 2025
c774fae
chore: install driver by default
mchmarny Nov 14, 2025
c1664b7
chore: add gpu ds and quota prior to installing operator chart
mchmarny Nov 14, 2025
0ceb624
chore: fix test to account for labels
mchmarny Nov 14, 2025
6c1d124
chore: set to ubuntu
mchmarny Nov 14, 2025
8e53d61
chore: disabled on pool, enable in values
mchmarny Nov 14, 2025
7adec73
chore: simplify gpu op values
mchmarny Nov 14, 2025
c64bcb6
chore: disable secure boot
mchmarny Nov 14, 2025
8c5168b
chore; gpu operator values
mchmarny Nov 14, 2025
98afbda
fix: tests and values
lalitadithya Nov 14, 2025
55ecda1
fix: retry node event check
lalitadithya Nov 15, 2025
b12232c
fix: tests
lalitadithya Nov 15, 2025
0dac7e6
fix: tests
lalitadithya Nov 15, 2025
ad010ac
chore: trigger ci
lalitadithya Nov 15, 2025
6482ff2
chore: bump timeout
lalitadithya Nov 15, 2025
801afe6
fix: test
lalitadithya Nov 15, 2025
b9f3c95
fix: test
lalitadithya Nov 15, 2025
c03e7bd
fix: test
lalitadithya Nov 15, 2025
59fc941
fix: rerun
lalitadithya Nov 15, 2025
b074b87
fix: values
lalitadithya Nov 15, 2025
935cb87
Merge remote-tracking branch 'origin/main' into feature/oidc-gcp
lalitadithya Nov 15, 2025
dc447ce
fix: use latest
lalitadithya Nov 15, 2025
46913a4
fix: values
lalitadithya Nov 15, 2025
b10cca5
fix: test
lalitadithya Nov 15, 2025
2d6905d
chore: trigger ci
lalitadithya Nov 16, 2025
ba4047d
chore: set janitor value via env vars
mchmarny Nov 17, 2025
ca87986
chore: set janitor value via env vars
mchmarny Nov 17, 2025
f29a2d7
Merge branch 'main' into feature/oidc-gcp
mchmarny Nov 17, 2025
f00598d
chore: resolve pr feedback
mchmarny Nov 17, 2025
059db93
chore: Merge remote-tracking branch 'refs/remotes/origin/feature/oidc…
mchmarny Nov 17, 2025
7a2004a
chore: resolve pr feedback
mchmarny Nov 17, 2025
8fb4235
Merge remote-tracking branch 'refs/remotes/origin/feature/oidc-gcp' i…
mchmarny Nov 17, 2025
2d3547c
chore: clean up env vars
mchmarny Nov 17, 2025
eeedf93
chore: add missing headers
mchmarny Nov 17, 2025
7f91ede
chore: handle branch tags
mchmarny Nov 17, 2025
5bdf0ce
chore: fix bash condition
mchmarny Nov 17, 2025
efa8b47
chore: add debugging
mchmarny Nov 17, 2025
5a37acd
chore: update chart project value
mchmarny Nov 17, 2025
5812a80
chore: clean nvs values
mchmarny Nov 17, 2025
74d0250
chore: update comment
mchmarny Nov 17, 2025
c9ba4b6
chore: add missing service account file
mchmarny Nov 17, 2025
137 changes: 137 additions & 0 deletions .github/workflows/integration-gcp.yml
@@ -0,0 +1,137 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name: Integration Tests - GCP

on:
  workflow_dispatch: {} # allow manual runs for testing
  push:
    branches:
      - main
      - feature/oidc-gcp

permissions:
  contents: read
  actions: read
  id-token: write

jobs:
  integration-test-gcp:
    runs-on: ubuntu-latest
    timeout-minutes: 60
    env:
      NVSENTINEL_VERSION: main-19d00f2
      IDENTITY_PROVIDER: "projects/1015254933832/locations/global/workloadIdentityPools/github-pool/providers/github-provider"
      SERVICE_ACCOUNT: "[email protected]"
      CSP: "gcp"
      PREFIX: "nvs"
      SKIP_DELETE: "true" # for debugging, skip cluster deletion
      TF_VAR_deployment_id: "d${{ github.run_id }}"
      TF_VAR_project_id: "nv-dgxck8s-20250306"
      TF_VAR_region: "europe-west4"
      TF_VAR_zone: "europe-west4-b"
      TF_VAR_system_node_type: "e2-standard-4"
      TF_VAR_system_node_count: "3"
      TF_VAR_gpu_node_pool_name: "gpu-pool"
      TF_VAR_gpu_machine_type: "a3-megagpu-8g"
      TF_VAR_gpu_node_count: "1"
      TF_VAR_gpu_reservation_project: "nv-dgxcloudprodgsc-20240206"
      TF_VAR_gpu_reservation_name: "gsc-a3-megagpu-8g-shared-res-2"
      TF_VAR_gpu_driver_version: "INSTALLATION_DISABLED"
      TF_VAR_resource_labels: '{"environment":"test","team":"nvsentinel","managed_by":"terraform"}'

    steps:
      # Checkout
      - name: Checkout
        uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0

      # Terraform
      - name: Terraform
        uses: hashicorp/setup-terraform@b9cd54a3c349d3f38e8881555d616ced269862dd # v3.1.2
        with:
          terraform_version: "1.13.5"

      # Auth
      - name: Get AuthN Token
        id: auth
        uses: google-github-actions/auth@7c6bc770dae815cd3e89ee6cdf493a5fab2cc093 # v3
        with:
          token_format: access_token
          workload_identity_provider: ${{ env.IDENTITY_PROVIDER }}
          service_account: ${{ env.SERVICE_ACCOUNT }}

      # Gcloud
      - name: Setup gcloud CLI
        uses: google-github-actions/setup-gcloud@aa5489c8933f4cc7a4f7d45035b3b1440c9c10db # v3.0.1

      # Cluster
      - name: Create Cluster
        id: cluster
        shell: bash
        continue-on-error: true
        run: |
          set -euo pipefail
          cd tests/uat/gcp/cluster
          terraform init
          terraform apply -auto-approve

      # Connect
      - name: Connect to Cluster
        id: client
        if: steps.cluster.outcome == 'success'
        shell: bash
        run: |
          set -euo pipefail
          echo "Installing GKE auth plugin..."
          gcloud components install gke-gcloud-auth-plugin --quiet --project ${{ env.TF_VAR_project_id }}
          echo "Getting cluster credentials..."
          gcloud container clusters get-credentials "${{ env.PREFIX }}-${{ env.TF_VAR_deployment_id }}" \
            --zone ${{ env.TF_VAR_zone }} --project ${{ env.TF_VAR_project_id }}

      # Apps
      - name: Install NVS
        id: apps
        if: steps.client.outcome == 'success'
        shell: bash
        env:
          GCP_PROJECT_ID: "${{ env.TF_VAR_project_id }}"
          GCP_ZONE: "${{ env.TF_VAR_zone }}"
          GCP_SERVICE_ACCOUNT: "${{ env.SERVICE_ACCOUNT }}"
        run: tests/uat/install-apps.sh

      # Test
      - name: Run UAT Tests
        id: tests
        if: steps.apps.outcome == 'success'
        shell: bash
        run: tests/uat/tests.sh

      # Teardown
      - name: Destroy Cluster
        if: always() && steps.cluster.outcome != 'skipped' && env.SKIP_DELETE != 'true'
        shell: bash
        run: |
          set -euo pipefail
          cd tests/uat/gcp/cluster
          terraform destroy -auto-approve

      # Summary
      - name: Test Summary
        if: always()
        run: |
          echo "## Test Results" >> $GITHUB_STEP_SUMMARY
          echo "- Cluster: ${{ steps.cluster.outcome }}" >> $GITHUB_STEP_SUMMARY
          echo "- Connection: ${{ steps.client.outcome }}" >> $GITHUB_STEP_SUMMARY
          echo "- Apps: ${{ steps.apps.outcome }}" >> $GITHUB_STEP_SUMMARY
          echo "- Tests: ${{ steps.tests.outcome }}" >> $GITHUB_STEP_SUMMARY
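The "Connect to Cluster" step builds the cluster name from `PREFIX` and `TF_VAR_deployment_id`, which is itself `d` followed by the run id. The following is a minimal sketch of that naming, assuming the Terraform module names the cluster `<prefix>-<deployment_id>` as the step implies. `gke_credentials_cmd` is a hypothetical helper that is not part of this PR, and it only prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Sketch only: print (do not run) the get-credentials command that the
# workflow's "Connect to Cluster" step would issue for a given run.
# gke_credentials_cmd is a hypothetical helper, not part of this PR.
gke_credentials_cmd() {
    local prefix=$1 run_id=$2 zone=$3 project=$4
    # The workflow sets TF_VAR_deployment_id to "d<run_id>"
    local deployment_id="d${run_id}"
    printf 'gcloud container clusters get-credentials %s-%s --zone %s --project %s\n' \
        "$prefix" "$deployment_id" "$zone" "$project"
}

# Values from the workflow env block; the run id is made up for illustration
gke_credentials_cmd nvs 1234567890 europe-west4-b nv-dgxck8s-20250306
```

Printing the command first can make it easier to confirm which cluster a given run created before connecting to it by hand.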
@@ -92,4 +92,4 @@ spec:
tolerations:
{{- toYaml . | nindent 8 }}
{{- end }}

runtimeClassName: nvidia
184 changes: 184 additions & 0 deletions scripts/copy-images.sh
@@ -0,0 +1,184 @@
#!/usr/bin/env bash
#
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

set -euo pipefail

# Variables
TARGET_REG_URI="${1:-}"
IMAGE_LIST_FILE="${2:-versions.txt}"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Helper functions
log_info() {
    echo -e "${BLUE}ℹ️ $*${NC}"
}

log_success() {
    echo -e "${GREEN}✅ $*${NC}"
}

log_warning() {
    echo -e "${YELLOW}⚠️ $*${NC}"
}

log_error() {
    echo -e "${RED}❌ $*${NC}"
}

command_exists() {
    command -v "$1" >/dev/null 2>&1
}

# Validate prerequisites
if ! command_exists crane; then
    log_error "crane is not installed. Please install crane to proceed."
    exit 1
fi

# Validate arguments
if [ -z "$TARGET_REG_URI" ]; then
    log_error "Usage: $0 <target-registry-uri> [image-list-file]"
    log_error "Example: $0 us-docker.pkg.dev/my-project/my-repo versions.txt"
    exit 1
fi

if [ ! -f "$IMAGE_LIST_FILE" ]; then
    log_error "Image list file not found: $IMAGE_LIST_FILE"
    exit 1
fi

# Info
log_info "Source image list: $IMAGE_LIST_FILE"
log_info "Target registry URI: $TARGET_REG_URI"
log_info "Reading images from $IMAGE_LIST_FILE..."

# Count total images (excluding empty lines and comments)
# grep -c exits 1 when it counts zero lines; `|| true` keeps `set -e` from aborting
TOTAL_IMAGES=$(grep -Evc '^[[:space:]]*(#|$)' "$IMAGE_LIST_FILE" || true)
log_info "Found $TOTAL_IMAGES images to copy"
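The list format is not shown in this diff, but the parsing code suggests one image URI per line with `#` comments and blank lines ignored. Here is a small sketch of counting entries in such a file, under that assumption. `count_images` is a hypothetical helper and the registry names are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: count entries in an image list, ignoring blank or
# whitespace-only lines and comments (including indented ones).
# count_images is a hypothetical helper, not part of this PR.
count_images() {
    # grep -c exits 1 when the count is zero; `|| true` tolerates that
    grep -Evc '^[[:space:]]*(#|$)' "$1" || true
}

# Build a throwaway list with placeholder image names
list_file=$(mktemp)
printf '%s\n' \
    '# placeholder entries for illustration' \
    'registry.example.com/org/image-a:1.0.0' \
    '' \
    '  # indented comment' \
    'registry.example.com/org/image-b:2.0.0' > "$list_file"

count_images "$list_file"   # prints 2
rm -f "$list_file"
```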

# Counters
SUCCESS_COUNT=0
FAILURE_COUNT=0
SKIPPED_COUNT=0

# Copy single image function
copy_image() {
    local src_image_uri=$1
    local image_num=$2

    log_info "[$image_num/$TOTAL_IMAGES] Processing: $src_image_uri"

    # Extract image name and tag from URI
    # Format: registry/org/image:tag (a tag is required; untagged or
    # digest-pinned references are not handled by these patterns)
    local image_base=$(echo "$src_image_uri" | sed -E 's|^(.*/)([^/]+):(.*)$|\2|')
    local image_tag=$(echo "$src_image_uri" | sed -E 's|^.*:(.*)$|\1|')

    # Build target URI
    local target_uri="$TARGET_REG_URI/$image_base:$image_tag"

    log_info " Source: $src_image_uri"
    log_info " Target: $target_uri"

    # Get source digest
    local src_digest
    if ! src_digest=$(crane digest "$src_image_uri" 2>&1); then
        log_error " Failed to get digest for $src_image_uri: $src_digest"
        return 1
    fi

    log_info " Source digest: $src_digest"

    # Check if image already exists at target with same digest
    local target_digest
    if target_digest=$(crane digest "$target_uri" 2>/dev/null); then
        if [ "$target_digest" = "$src_digest" ]; then
            log_warning " Image already exists at target with same digest, skipping"
            return 2
        else
            log_info " Image exists but digest differs, will overwrite"
        fi
    fi

    # Copy image
    log_info " Copying image..."
    if ! crane copy "$src_image_uri" "$target_uri"; then
        log_error " Failed to copy image"
        return 1
    fi

    # Verify digest after copy
    local new_digest
    if ! new_digest=$(crane digest "$target_uri" 2>&1); then
        log_error " Failed to verify target digest: $new_digest"
        return 1
    fi

    if [ "$new_digest" != "$src_digest" ]; then
        log_error " Digest mismatch! Source: $src_digest, Target: $new_digest"
        return 1
    fi

    log_success " Successfully copied and verified: $target_uri"
    return 0
}
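To illustrate how `copy_image` derives the target URI, the sketch below isolates the two `sed` expressions from the function. `target_uri_for` is a hypothetical wrapper and the source image names are placeholders. Note that only the last path segment survives, so any `org/` prefix in the source is dropped at the target.

```shell
#!/usr/bin/env bash
# Sketch: the same name/tag extraction copy_image uses, isolated so it
# can be tried without crane. target_uri_for is a hypothetical helper.
target_uri_for() {
    local target_reg=$1 src=$2
    local image_base image_tag
    # Last path segment before the tag separator
    image_base=$(echo "$src" | sed -E 's|^(.*/)([^/]+):(.*)$|\2|')
    # Everything after the final colon
    image_tag=$(echo "$src" | sed -E 's|^.*:(.*)$|\1|')
    echo "$target_reg/$image_base:$image_tag"
}

target_uri_for us-docker.pkg.dev/my-project/my-repo registry.example.com/org/image-a:1.0.0
# prints: us-docker.pkg.dev/my-project/my-repo/image-a:1.0.0
```

One consequence of this flattening is that two source images with the same final segment and tag but different orgs would map to the same target.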

# Process each image in the list
IMAGE_NUM=0
# The `|| [[ -n ... ]]` clause also processes a final line that lacks a trailing newline
while IFS= read -r src_image_uri || [[ -n "$src_image_uri" ]]; do
    # Skip blank or whitespace-only lines and comments
    [[ "$src_image_uri" =~ ^[[:space:]]*$ || "$src_image_uri" =~ ^[[:space:]]*# ]] && continue

    IMAGE_NUM=$((IMAGE_NUM + 1))

    if copy_image "$src_image_uri" "$IMAGE_NUM"; then
        SUCCESS_COUNT=$((SUCCESS_COUNT + 1))
    elif [ $? -eq 2 ]; then
        SKIPPED_COUNT=$((SKIPPED_COUNT + 1))
    else
        FAILURE_COUNT=$((FAILURE_COUNT + 1))
        log_warning "Continuing with next image..."
    fi

    echo "" # Blank line between images
done < "$IMAGE_LIST_FILE"
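The loop relies on a three-way return convention from `copy_image`: 0 for copied, 2 for skipped, anything else for failed. It reads `$?` in the `elif` test, which in bash still holds the status of the failed `if` condition. The following sketch shows the pattern in isolation. `fake_copy` and `tally` are hypothetical stand-ins, with `fake_copy` simply returning the code it is given.

```shell
#!/usr/bin/env bash
# Sketch of the tri-state return-code pattern used by the copy loop.
# fake_copy stands in for copy_image: it returns whatever code it is given.
fake_copy() {
    return "$1"
}

tally() {
    local ok=0 skipped=0 failed=0 rc
    for rc in "$@"; do
        if fake_copy "$rc"; then
            ok=$((ok + 1))
        elif [ $? -eq 2 ]; then
            # $? here is still the exit status of fake_copy
            skipped=$((skipped + 1))
        else
            failed=$((failed + 1))
        fi
    done
    echo "ok=$ok skipped=$skipped failed=$failed"
}

tally 0 2 1 0 2
# prints: ok=2 skipped=2 failed=1
```

Calling the function inside an `if` condition also keeps `set -e` from terminating the script on a non-zero return, which is why the loop can continue after a failed image.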

# Summary
echo "=================================================="
log_info "Image Copy Summary"
echo "=================================================="
log_success "Successfully copied: $SUCCESS_COUNT"
log_warning "Skipped (already exist): $SKIPPED_COUNT"
if [ $FAILURE_COUNT -gt 0 ]; then
    log_error "Failed: $FAILURE_COUNT"
else
    log_info "Failed: $FAILURE_COUNT"
fi
log_info "Total processed: $TOTAL_IMAGES"
echo "=================================================="

# Exit with error if any failures
if [ $FAILURE_COUNT -gt 0 ]; then
    exit 1
fi

exit 0
Empty file removed tests/uat/gcp/.gitkeep
Empty file.
54 changes: 54 additions & 0 deletions tests/uat/gcp/cert-manager-values.yaml
@@ -0,0 +1,54 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

installCRDs: true

# Optimize for faster startup in Kind/test environments
global:
  leaderElection:
    # Reduce leader election timeout for faster startup
    leaseDuration: 30s
    renewDeadline: 20s
    retryPeriod: 5s

# Reduce resource requests for Kind (local testing)
resources:
  requests:
    cpu: 10m
    memory: 32Mi

webhook:
  # Reduce webhook resource requirements
  resources:
    requests:
      cpu: 10m
      memory: 32Mi
  # Faster readiness checks
  readinessProbe:
    initialDelaySeconds: 3
    periodSeconds: 3

cainjector:
  # Reduce cainjector resource requirements
  resources:
    requests:
      cpu: 10m
      memory: 32Mi

startupapicheck:
  # Reduce startup check resource requirements
  resources:
    requests:
      cpu: 10m
      memory: 32Mi
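When tuning the leader-election values in this file, it may help to check that they remain mutually consistent. The sketch below assumes the constraints that Kubernetes client-go leader election is understood to enforce: `leaseDuration` greater than `renewDeadline`, and `renewDeadline` greater than 1.2 times `retryPeriod`. `check_leader_election` is a hypothetical helper that takes whole seconds. It is not part of this PR or of cert-manager.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check leader-election timings given in whole seconds.
# check_leader_election is a hypothetical helper; the 1.2 factor is the
# jitter factor assumed from client-go's leader election package.
check_leader_election() {
    local lease=$1 renew=$2 retry=$3
    if [ "$lease" -le "$renew" ]; then
        echo "leaseDuration must be greater than renewDeadline"
        return 1
    fi
    # Compare renew > retry * 1.2 using integer math (both sides x10)
    if [ $((renew * 10)) -le $((retry * 12)) ]; then
        echo "renewDeadline must be greater than 1.2 * retryPeriod"
        return 1
    fi
    echo "ok"
}

# Values from the file above: 30s, 20s, 5s
check_leader_election 30 20 5
# prints: ok
```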