chore: automated UAT test on GCP #355
Merged
90 commits
aaa379b
chore: implement image copy to gcp using oidc
mchmarny 36d1078
chore: refactor to env vars
mchmarny 4a1b42c
chore: setup env vars
mchmarny 3e1b34d
chore: debug variables
mchmarny 572238b
chore: move vars to the job
mchmarny 2584c8d
chore: inline variables
mchmarny ccd50ca
chore: move to branch
mchmarny 785e705
fix: use if-then syntax instead of || for crane check
mchmarny ae1d7b8
fix: replace }
mchmarny 1d1433f
chore: added cluster bringup
mchmarny 0de0cf2
fix: remove deprecated logging/monitoring flags from cluster creation
mchmarny 59c9b8a
chore: run delete always
mchmarny 5ae0468
chore: add default network check
mchmarny cf82026
chore: move script to uat
mchmarny 0c66246
chore: unique cluster name per test
mchmarny 4e724a5
chore: add auth plugin
mchmarny 40822f6
chore: add gpu node pool
mchmarny 2d091e5
Merge branch 'main' into feature/oidc-gcp
mchmarny 9aacbea
debug: add GitHub context output to troubleshoot OIDC auth
mchmarny dffb7b5
chore: remove debug info
mchmarny ce31be2
chore: remove spaces
mchmarny 895372e
chore: add app install and tests
mchmarny bd338a1
chore: repaint project
mchmarny 006d28b
chore: update oidc info
mchmarny 5be1e13
chore: bind current user to cluster admin role
mchmarny 421bed9
chore: add container admin role
mchmarny 03f7651
chore: set csp env var
mchmarny e3e776e
chore: add missing values files
mchmarny 1e444ac
chore: added gpu operator values
mchmarny 04b4f07
chore: removing delete, skipping driver install
mchmarny 56cc680
chore: set nvs version
mchmarny d6bdc2f
chore: add GPU node
mchmarny 795455f
chore: switch to 1 node per zone
mchmarny 4cc6187
chore: validate node pool command
mchmarny deeead6
chore: fix node pool creation
mchmarny 83a935f
chore: remove unused variable
mchmarny c224307
chore: add pool as it's own step in cluster bringup
mchmarny fcfa31a
chore: reconcile identity
mchmarny e71a6d2
chore: remove sample config files
mchmarny a2bd526
chore: default sa
mchmarny 5abdbb6
chore: cluster create
mchmarny 9c96ee9
chore: gke uat bring up
mchmarny 00e48cc
chore: add opportunistic maint
mchmarny a34ef81
chore: update gcloud cli
mchmarny cae819d
chore: default CLI
mchmarny c88fd88
chore: refactor to tf
mchmarny 4ea8f42
chore: add plugin, test outputs
mchmarny ab3631a
chore: cleanup CI
mchmarny 756474d
chore: add dcgm
mchmarny 29c2121
chore: debug gcp install
mchmarny c774fae
chore: install driver by default
mchmarny c1664b7
chore: add gpu ds and quota prior to installing operator chart
mchmarny 0ceb624
chore: fix test to account for labels
mchmarny 6c1d124
chore: set to ubuntu
mchmarny 8e53d61
chore: disabled on pool, enable in values
mchmarny 7adec73
chore: simplify gpu op values
mchmarny c64bcb6
chore: disable secure boot
mchmarny 8c5168b
chore; gpu operator values
mchmarny 98afbda
fix: tests and values
lalitadithya 55ecda1
fix: retry node event check
lalitadithya b12232c
fix: tests
lalitadithya 0dac7e6
fix: tests
lalitadithya ad010ac
chore: trigger ci
lalitadithya 6482ff2
chore: bump timeout
lalitadithya 801afe6
fix: test
lalitadithya b9f3c95
fix: test
lalitadithya c03e7bd
fix: test
lalitadithya 59fc941
fix: rerun
lalitadithya b074b87
fix: values
lalitadithya 935cb87
Merge remote-tracking branch 'origin/main' into feature/oidc-gcp
lalitadithya dc447ce
fix: use latest
lalitadithya 46913a4
fix: values
lalitadithya b10cca5
fix: test
lalitadithya 2d6905d
chore: trigger ci
lalitadithya ba4047d
chore: set janitor value via env vars
mchmarny ca87986
chore: set janitor value via env vars
mchmarny f29a2d7
Merge branch 'main' into feature/oidc-gcp
mchmarny f00598d
chore: resolve pr feedback
mchmarny 059db93
chore: Merge remote-tracking branch 'refs/remotes/origin/feature/oidc…
mchmarny 7a2004a
chore: resolve pr feedback
mchmarny 8fb4235
Merge remote-tracking branch 'refs/remotes/origin/feature/oidc-gcp' i…
mchmarny 2d3547c
chore: clean up env vars
mchmarny eeedf93
chore: add missing headers
mchmarny 7f91ede
chore: handle branch tags
mchmarny 5bdf0ce
chore: fix bash condition
mchmarny efa8b47
chore: add debugging
mchmarny 5a37acd
chore: update chart project value
mchmarny 5812a80
chore: clean nvs values
mchmarny 74d0250
chore: update comment
mchmarny c9ba4b6
chore: add missing service account file
mchmarny
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,163 @@ | ||
| # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| name: Integration Tests - GCP | ||
|
|
||
| on: | ||
| workflow_dispatch: {} # allow manual runs for testing | ||
| schedule: | ||
| - cron: '30 14 * * *' # daily at 14:30 UTC, runs on default branch only | ||
| push: | ||
| branches: | ||
| - main | ||
| - feature/oidc-gcp | ||
|
|
||
| permissions: | ||
| contents: read | ||
| actions: read | ||
| id-token: write | ||
|
|
||
| jobs: | ||
| integration-test-gcp: | ||
| runs-on: ubuntu-latest | ||
| timeout-minutes: 60 | ||
| env: | ||
| CSP: "gcp" | ||
| PREFIX: "nvs" | ||
| PROJECT_ID: "nv-dgxck8s-20250306" | ||
| IDENTITY_PROVIDER: "projects/1015254933832/locations/global/workloadIdentityPools/github-pool/providers/github-provider" | ||
| SERVICE_ACCOUNT: "github-actions-user" | ||
| # Terraform Vars | ||
| TF_VAR_deployment_id: "d${{ github.run_id }}" | ||
| TF_VAR_project_id: "nv-dgxck8s-20250306" | ||
| TF_VAR_region: "europe-west4" | ||
| TF_VAR_zone: "europe-west4-b" | ||
| TF_VAR_system_node_type: "e2-standard-4" | ||
| TF_VAR_system_node_count: "3" | ||
| TF_VAR_gpu_node_pool_name: "gpu-pool" | ||
| TF_VAR_gpu_machine_type: "a3-megagpu-8g" | ||
| TF_VAR_gpu_node_count: "1" | ||
| TF_VAR_gpu_reservation_project: "nv-dgxcloudprodgsc-20240206" | ||
| TF_VAR_gpu_reservation_name: "gsc-a3-megagpu-8g-shared-res-2" | ||
| TF_VAR_gpu_driver_version: "INSTALLATION_DISABLED" | ||
| TF_VAR_resource_labels: '{"environment":"test","team":"nvsentinel","managed_by":"terraform"}' | ||
| # Debug | ||
| SKIP_DELETE: "false" # skip cluster deletion | ||
| TEST_TAG: "main-33c1d03" | ||
|
|
||
| steps: | ||
| # Checkout | ||
| - name: Checkout | ||
| uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0 | ||
|
|
||
| # Terraform | ||
| - name: Terraform | ||
| uses: hashicorp/setup-terraform@b9cd54a3c349d3f38e8881555d616ced269862dd # v3.1.2 | ||
| with: | ||
| terraform_version: "1.13.5" | ||
|
|
||
| # Auth | ||
| - name: Get AuthN Token | ||
| id: auth | ||
| uses: google-github-actions/auth@7c6bc770dae815cd3e89ee6cdf493a5fab2cc093 # v3 | ||
| with: | ||
| token_format: access_token | ||
| workload_identity_provider: ${{ env.IDENTITY_PROVIDER }} | ||
| service_account: "${{ env.SERVICE_ACCOUNT }}@${{ env.PROJECT_ID }}.iam.gserviceaccount.com" | ||
|
|
||
| # Gcloud | ||
| - name: Setup gcloud CLI | ||
| uses: google-github-actions/setup-gcloud@aa5489c8933f4cc7a4f7d45035b3b1440c9c10db # v3.0.1 | ||
|
|
||
| # Cluster | ||
| - name: Create Cluster | ||
| id: cluster | ||
| shell: bash | ||
| continue-on-error: true | ||
| run: | | ||
| set -euo pipefail | ||
| cd tests/uat/gcp/cluster | ||
| terraform init | ||
| terraform apply -auto-approve | ||
|
|
||
| # Connect | ||
| - name: Connect to Cluster | ||
| id: client | ||
| if: steps.cluster.outcome == 'success' | ||
| shell: bash | ||
| run: | | ||
| set -euo pipefail | ||
| echo "Installing GKE auth plugin..." | ||
| gcloud components install gke-gcloud-auth-plugin --quiet --project ${{ env.TF_VAR_project_id }} | ||
| echo "Getting cluster credentials..." | ||
| gcloud container clusters get-credentials "${{ env.PREFIX }}-${{ env.TF_VAR_deployment_id }}" \ | ||
| --zone ${{ env.TF_VAR_zone }} --project ${{ env.TF_VAR_project_id }} | ||
|
|
||
| # Image Tag | ||
| - name: Compute ref name with short SHA | ||
| id: ref-name | ||
| run: | | ||
| if [[ "${{ github.ref_type }}" == "tag" ]]; then | ||
| SAFE_REF="${{ github.ref_name }}" | ||
| elif [[ "${{ github.ref_name }}" == "main" ]]; then | ||
| SHORT_SHA=$(echo "${{ github.sha }}" | cut -c1-7) | ||
| SAFE_REF="${{ github.ref_name }}-${SHORT_SHA}" | ||
| else | ||
| SAFE_REF="${{ env.TEST_TAG }}" | ||
| fi | ||
| # Sanitize ref name: replace slashes with hyphens for Docker tag compatibility | ||
| SAFE_REF=$(echo "$SAFE_REF" | sed 's/\//-/g') | ||
| echo "value=$SAFE_REF" >> $GITHUB_OUTPUT | ||
|
|
||
| # Apps | ||
| - name: Install NVS | ||
| id: apps | ||
| if: steps.client.outcome == 'success' | ||
| shell: bash | ||
| env: | ||
| GCP_PROJECT_ID: "${{ env.PROJECT_ID }}" | ||
| GCP_ZONE: "${{ env.TF_VAR_zone }}" | ||
| GCP_SERVICE_ACCOUNT: "${{ env.SERVICE_ACCOUNT }}" | ||
| NVSENTINEL_VERSION: "${{ steps.ref-name.outputs.value }}" | ||
| run: | | ||
| set -euxo pipefail | ||
| tests/uat/install-apps.sh | ||
|
|
||
| # Test | ||
| - name: Run UAT Tests | ||
| id: tests | ||
| if: steps.apps.outcome == 'success' | ||
| shell: bash | ||
| run: | | ||
| set -euxo pipefail | ||
| tests/uat/tests.sh | ||
|
|
||
| # Teardown | ||
| - name: Destroy Cluster | ||
| if: always() && steps.cluster.outcome != 'skipped' && env.SKIP_DELETE != 'true' | ||
| shell: bash | ||
| run: | | ||
| set -euxo pipefail | ||
| cd tests/uat/gcp/cluster | ||
| terraform destroy -auto-approve | ||
|
|
||
| # Summary | ||
| - name: Test Summary | ||
| if: always() | ||
| run: | | ||
| echo "## Test Results" >> $GITHUB_STEP_SUMMARY | ||
| echo "- Cluster: ${{ steps.cluster.outcome }}" >> $GITHUB_STEP_SUMMARY | ||
| echo "- Connection: ${{ steps.client.outcome }}" >> $GITHUB_STEP_SUMMARY | ||
| echo "- Apps: ${{ steps.apps.outcome }}" >> $GITHUB_STEP_SUMMARY | ||
| echo "- Tests: ${{ steps.tests.outcome }}" >> $GITHUB_STEP_SUMMARY | ||
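The "Compute ref name with short SHA" step above can be sketched as a standalone function. This is a minimal illustration, not code from the PR: the helper name `compute_safe_ref` and its argument order are hypothetical. It takes the ref type, ref name, commit SHA, and fallback tag as arguments, so ref names are never parsed as shell code the way `${{ ... }}` expressions interpolated into a script body can be.

```shell
set -euo pipefail

# compute_safe_ref REF_TYPE REF_NAME SHA FALLBACK_TAG
# Hypothetical helper (not part of the PR) that mirrors the workflow's branching:
#   tag           -> the tag name
#   main branch   -> "main-<first 7 chars of SHA>"
#   anything else -> the fallback tag (TEST_TAG in the workflow)
compute_safe_ref() {
  local ref_type="$1" ref_name="$2" sha="$3" fallback="$4"
  local safe_ref

  if [[ "$ref_type" == "tag" ]]; then
    safe_ref="$ref_name"
  elif [[ "$ref_name" == "main" ]]; then
    safe_ref="${ref_name}-${sha:0:7}"
  else
    safe_ref="$fallback"
  fi

  # Replace every slash with a hyphen so the value is usable as an image tag.
  printf '%s\n' "${safe_ref//\//-}"
}
```

Inside a workflow step, one way to call it would be `compute_safe_ref "$GITHUB_REF_TYPE" "$GITHUB_REF_NAME" "$GITHUB_SHA" "$TEST_TAG"`, which relies on the default GitHub Actions environment variables and the job-level `TEST_TAG` instead of inline expressions.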
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -92,4 +92,4 @@ spec: | ||
| tolerations: | ||
| {{- toYaml . | nindent 8 }} | ||
| {{- end }} | ||
|
|
||
| runtimeClassName: nvidia | ||
Empty file.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,54 @@ | ||
| # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| installCRDs: true | ||
|
|
||
| # Optimize for faster startup in Kind/test environments | ||
| global: | ||
| leaderElection: | ||
| # Reduce leader election timeout for faster startup | ||
| leaseDuration: 30s | ||
| renewDeadline: 20s | ||
| retryPeriod: 5s | ||
|
|
||
| # Reduce resource requests for Kind (local testing) | ||
| resources: | ||
| requests: | ||
| cpu: 10m | ||
| memory: 32Mi | ||
|
|
||
| webhook: | ||
| # Reduce webhook resource requirements | ||
| resources: | ||
| requests: | ||
| cpu: 10m | ||
| memory: 32Mi | ||
| # Faster readiness checks | ||
| readinessProbe: | ||
| initialDelaySeconds: 3 | ||
| periodSeconds: 3 | ||
|
|
||
| cainjector: | ||
| # Reduce cainjector resource requirements | ||
| resources: | ||
| requests: | ||
| cpu: 10m | ||
| memory: 32Mi | ||
|
|
||
| startupapicheck: | ||
| # Reduce startup check resource requirements | ||
| resources: | ||
| requests: | ||
| cpu: 10m | ||
| memory: 32Mi |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,94 @@ | ||
| # GKE Cluster Terraform Configuration | ||
|
|
||
| This Terraform configuration creates a GKE cluster with GPU nodes for NVSentinel testing. | ||
|
|
||
| - Single zone (zonal cluster) | ||
| - GPU nodes use specific reservation affinity | ||
| - Service account `gke-cluster-kubernetes@PROJECT_ID.iam.gserviceaccount.com` must exist | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - [Terraform](https://www.terraform.io/downloads.html) `>= 1.9.5` | ||
| - [gcloud CLI](https://cloud.google.com/sdk/docs/install) configured with appropriate credentials | ||
| - GCP project with necessary APIs enabled: | ||
| - Kubernetes Engine API | ||
| - Compute Engine API | ||
|
|
||
| ## Known Issues | ||
|
|
||
| ⚠️ **Reservation Maintenance Interval Mismatch - RESOLVED** | ||
|
|
||
| **Previous Issue:** gSC (Google Supercomputer) reservations require instances to have `maintenanceInterval=PERIODIC`, but GKE created instances with `maintenanceInterval=MAINTENANCE_INTERVAL_UNSPECIFIED` by default. | ||
|
|
||
| **Solution:** The configuration now includes a `host_maintenance_policy` block in the GPU node pool with `maintenance_interval = "PERIODIC"`, which resolves this issue. | ||
|
|
||
| ```hcl | ||
| host_maintenance_policy { | ||
| maintenance_interval = "PERIODIC" | ||
| } | ||
| ``` | ||
|
|
||
| ## Usage | ||
|
|
||
| 1. **Initialize Terraform:** | ||
| ```bash | ||
| terraform init | ||
| ``` | ||
|
|
||
| 2. **Configure variables (optional):** | ||
| ```bash | ||
| cp terraform.tfvars.example terraform.tfvars | ||
| # Edit terraform.tfvars with your values | ||
| ``` | ||
|
|
||
| 3. **Preview changes:** | ||
| ```bash | ||
| terraform plan | ||
| ``` | ||
|
|
||
| 4. **Create the cluster:** | ||
| ```bash | ||
| terraform apply | ||
| ``` | ||
|
|
||
| 5. **Get kubeconfig:** | ||
| ```bash | ||
| gcloud container clusters get-credentials nvs-d2 --zone europe-west4-b --project nv-dgxck8s-20250306 | ||
| ``` | ||
|
|
||
| Or use the output command: | ||
| ```bash | ||
| terraform output -raw kubeconfig_command | bash | ||
| ``` | ||
|
|
||
| 6. **Destroy the cluster:** | ||
| ```bash | ||
| terraform destroy | ||
| ``` | ||
|
|
||
| ## Configuration Variables | ||
|
|
||
| | Variable | Description | Default | | ||
| |----------|-------------|---------| | ||
| | `deployment_id` | Deployment identifier for cluster naming | `d2` | | ||
| | `project_id` | GCP project ID | `nv-dgxck8s-20250306` | | ||
| | `zone` | GCP zone for the cluster | `europe-west4-b` | | ||
| | `system_node_type` | Machine type for system nodes | `e2-standard-4` | | ||
| | `system_node_count` | Number of system nodes | `3` | | ||
| | `gpu_node_pool_name` | Name of the GPU node pool | `gpu-pool` | | ||
| | `gpu_machine_type` | Machine type for GPU nodes | `a3-megagpu-8g` | | ||
| | `gpu_node_count` | Number of GPU nodes | `1` | | ||
| | `gpu_reservation_project` | Project containing GPU reservation | `nv-dgxcloudprodgsc-20240206` | | ||
| | `gpu_reservation_name` | Name of GPU reservation | `gsc-a3-megagpu-8g-shared-res-2` | | ||
| | `gpu_driver_version` | GPU driver installation mode | `INSTALLATION_DISABLED` | | ||
| | `resource_labels` | Labels to apply to resources | `{}` | | ||
|
|
||
|
|
||
| ## Outputs | ||
|
|
||
| - `cluster_name`: Name of the created cluster | ||
| - `cluster_location`: Zone where cluster is deployed | ||
| - `cluster_endpoint`: API endpoint (sensitive) | ||
| - `cluster_ca_certificate`: CA certificate (sensitive) | ||
| - `gpu_node_pool_name`: Name of GPU node pool | ||
| - `kubeconfig_command`: Command to configure kubectl |
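The README's defaults can also be supplied through the environment, since Terraform reads `TF_VAR_<name>` environment variables as input variables (the workflow in this PR uses the same mechanism). The sketch below shows how the cluster name in the `get-credentials` example lines up with those variables. It assumes the cluster is named `<prefix>-<deployment_id>` with prefix `nvs`, as the workflow's `PREFIX` and the README's `nvs-d2` example suggest; the helper names `cluster_name` and `kubeconfig_command` are hypothetical and only echo the command instead of running it.

```shell
set -euo pipefail

# Hypothetical helper: compose the cluster name from a prefix and deployment ID.
# Assumption: the Terraform configuration names the cluster "<prefix>-<deployment_id>".
cluster_name() {
  local prefix="$1" deployment_id="$2"
  printf '%s-%s\n' "$prefix" "$deployment_id"
}

# Hypothetical helper: print (not execute) the gcloud command that fetches credentials.
kubeconfig_command() {
  local name="$1" zone="$2" project="$3"
  printf 'gcloud container clusters get-credentials %s --zone %s --project %s\n' \
    "$name" "$zone" "$project"
}

# Terraform picks these up as the input variables deployment_id, zone, and project_id.
export TF_VAR_deployment_id="d2"
export TF_VAR_zone="europe-west4-b"
export TF_VAR_project_id="nv-dgxck8s-20250306"

kubeconfig_command "$(cluster_name nvs "$TF_VAR_deployment_id")" \
  "$TF_VAR_zone" "$TF_VAR_project_id"
```

With the default values this prints the same command as step 5 of the Usage section. Where the configuration is already applied, the `kubeconfig_command` Terraform output remains the more direct source.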