Add separate targets for GPU plugin tests + add stress tests #711
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -34,7 +34,7 @@ set -x | |
|
|
||
| # If a previous run leaves e.g. the controller behind in CrashLoopBackOff then | ||
| # the next installation with --wait won't succeed. | ||
| - timeout -v 5 helm uninstall nvidia-dra-driver-gpu-batssuite -n nvidia-dra-driver-gpu | ||
| + timeout -v 15 helm uninstall nvidia-dra-driver-gpu-batssuite -n nvidia-dra-driver-gpu | ||
shivamerla marked this conversation as resolved.
|
||
|
|
||
| # When the CRD has been left behind deleted by a partially performed | ||
| # test then the deletions below cannot succeed. Apply a CRD version that | ||
|
|
@@ -62,15 +62,24 @@ timeout -v 5 kubectl delete pods -l env=batssuite 2> /dev/null | |
| timeout -v 2 kubectl delete resourceclaim batssuite-rc-bad-opaque-config --force 2> /dev/null | ||
| timeout -v 2 kubectl delete -f demo/specs/imex/simple-mig-test 2> /dev/null | ||
|
|
||
| + # Cleanup any GPU stress test pods left behind | ||
| + timeout -v 30 kubectl delete pods -l 'env=batssuite,test=stress-shared' 2> /dev/null | ||
| + timeout -v 5 kubectl delete -f tests/bats/specs/rc-shared-gpu.yaml 2> /dev/null | ||
| + kubectl wait --for=delete pods -l 'env=batssuite,test=stress-shared' \ | ||
| + --timeout=60s \ | ||
| + || echo "wait-for-delete failed" | ||
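
The cleanup steps above follow a best-effort pattern: each command gets a time limit, its stderr is discarded, and a failure does not stop the script. A minimal sketch of that pattern as a reusable helper follows. The function name `best_effort` is hypothetical and not part of this PR, and the stand-in commands replace `kubectl` and `helm` so the sketch runs anywhere.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): run a command with a time limit,
# discard its stderr, and never fail the calling script.
best_effort() {
  local limit="$1"
  shift
  if ! timeout "$limit" "$@" 2> /dev/null; then
    echo "best-effort step failed or timed out: $*"
  fi
  return 0
}

# Stand-in commands instead of kubectl/helm:
best_effort 5 true        # succeeds silently
best_effort 5 false       # prints a note, script continues
best_effort 1 sleep 3     # exceeds the 1s limit, prints a note
```

Whether a shared helper is preferable to the explicit per-line `timeout ... 2> /dev/null` calls is a judgment call; the explicit form keeps each limit visible next to its command.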
|
|
||
| # TODO: maybe more brute-forcing/best-effort: it might make sense to submit all | ||
| # workload in this test suite into a special namespace (not `default`), and to | ||
| # then use `kubectl delete pods -n <testnamespace> --all`. | ||
|
|
||
| # Delete any previous remainder of `clean-state-dirs-all-nodes.sh` invocation. | ||
| kubectl delete pods privpod-rm-plugindirs 2> /dev/null | ||
|
|
||
| - timeout -v 5 helm uninstall nvidia-dra-driver-gpu-batssuite -n nvidia-dra-driver-gpu | ||
| + # Make sure to wait till the chart is completely removed | ||
| + helm uninstall nvidia-dra-driver-gpu-batssuite --wait -n nvidia-dra-driver-gpu | ||
|
Collaborator
👍 That one is interesting. I'll see how it does in practice for me! I've found Helm's
||
|
|
||
| # Double check that the pods are deleted | ||
| kubectl wait \ | ||
| --for=delete pods -A \ | ||
| -l app.kubernetes.io/name=nvidia-dra-driver-gpu \ | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,23 @@ | ||
| # Pod referencing the shared resource claim from rc-shared-gpu.yaml | ||
| # Test will create multiple pods using the spec below and updated INDEX. | ||
| --- | ||
| apiVersion: v1 | ||
| kind: Pod | ||
| metadata: | ||
| name: stress-pod-__INDEX__ | ||
shivamerla marked this conversation as resolved.
|
||
| labels: | ||
| env: batssuite | ||
| test: stress-shared | ||
| spec: | ||
| restartPolicy: Never | ||
| containers: | ||
| - name: ctr | ||
| image: ubuntu:24.04 | ||
| command: ["bash","-lc"] | ||
| args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"] | ||
| resources: | ||
| claims: | ||
| - name: gpu | ||
| resourceClaims: | ||
| - name: gpu | ||
| resourceClaimName: rc-shared-gpu | ||
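
The container command in this spec ends with `trap 'exit 0' TERM; sleep 9999 & wait`. Presumably this lets the container exit as soon as it receives SIGTERM, so that pod deletion does not have to wait out the termination grace period. The sketch below reproduces the idiom locally without Kubernetes. The function name `demo_term_exit` and the shortened `sleep 5` are assumptions made for the demonstration.

```shell
#!/usr/bin/env bash
# Demonstrates the trap/sleep/wait idiom from the pod spec, outside Kubernetes.
# demo_term_exit is a hypothetical name used only for this sketch.
demo_term_exit() {
  # Inner shell: install a TERM handler, background the sleep, block in `wait`.
  # `wait` is interruptible, so the handler runs as soon as TERM arrives.
  bash -c "trap 'exit 0' TERM; sleep 5 >/dev/null 2>&1 & wait" &
  local pid=$!
  sleep 1                 # give the inner shell time to install the trap
  kill -TERM "$pid"
  wait "$pid"
  echo "exit status: $?"
}

demo_term_exit
```

Without the background-and-`wait` arrangement, a foreground `sleep 9999` would delay the handler until the sleep finished, because bash runs traps only after the current foreground command returns.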
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| # Shared GPU resource claim | ||
| apiVersion: resource.k8s.io/v1 | ||
| kind: ResourceClaim | ||
| metadata: | ||
| name: rc-shared-gpu | ||
| labels: | ||
| env: batssuite | ||
| test: stress-shared | ||
| spec: | ||
| devices: | ||
| requests: | ||
| - name: gpu | ||
| exactly: | ||
| deviceClassName: gpu.nvidia.com |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,76 @@ | ||
| # shellcheck disable=SC2148 | ||
| # shellcheck disable=SC2329 | ||
|
|
||
| : "${STRESS_PODS_N:=15}" | ||
| : "${STRESS_LOOPS:=5}" | ||
| : "${STRESS_DELAY:=30}" | ||
shivamerla marked this conversation as resolved.
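
The three settings above use the `: "${VAR:=default}"` idiom, which assigns the default only when the variable is unset or empty. A small sketch of how an override interacts with the defaults follows. The wrapper function `show_stress_settings` is hypothetical and exists only to make the behaviour observable.

```shell
#!/usr/bin/env bash
# `: "${VAR:=default}"` assigns the default only when VAR is unset or empty;
# the leading `:` is a no-op command that merely forces the expansion.
# show_stress_settings is a hypothetical wrapper for this sketch.
show_stress_settings() {
  : "${STRESS_PODS_N:=15}"
  : "${STRESS_LOOPS:=5}"
  : "${STRESS_DELAY:=30}"
  echo "pods=${STRESS_PODS_N} loops=${STRESS_LOOPS} delay=${STRESS_DELAY}"
}

# Subshells keep each call independent of the caller's environment.
( unset STRESS_PODS_N STRESS_LOOPS STRESS_DELAY; show_stress_settings )
( unset STRESS_LOOPS STRESS_DELAY; STRESS_PODS_N=3; show_stress_settings )
```

In practice this means a quicker local run can be requested by setting the variables in the environment before invoking the suite, with no edits to the test file.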
|
||
|
|
||
| setup_file () { | ||
| load 'helpers.sh' | ||
| _common_setup | ||
| local _iargs=("--set" "logVerbosity=6") | ||
| iupgrade_wait "${TEST_CHART_REPO}" "${TEST_CHART_VERSION}" _iargs | ||
| } | ||
|
|
||
| setup() { | ||
| load 'helpers.sh' | ||
| _common_setup | ||
| log_objects | ||
| } | ||
|
|
||
| bats::on_failure() { | ||
| echo -e "\n\nFAILURE HOOK START" | ||
| log_objects | ||
| show_kubelet_plugin_error_logs | ||
| echo -e "FAILURE HOOK END\n\n" | ||
| } | ||
|
|
||
| # Expand pod YAML with indexes | ||
| _generate_pods_manifest() { | ||
| local out="$1" | ||
| local template="tests/bats/specs/pods-shared-gpu.yaml" | ||
| : > "$out" | ||
| for i in $(seq 1 "${STRESS_PODS_N}"); do | ||
| sed "s/__INDEX__/${i}/g" "${template}" >> "$out" | ||
| echo "---" >> "$out" | ||
| done | ||
| } | ||
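
The helper above expands one pod template into a multi-document manifest by substituting `__INDEX__`. A self-contained sketch of the same approach, runnable without a cluster, is shown here. The function name `generate_manifest`, its parameters, and the three-line stand-in template are assumptions for illustration and not the PR's actual spec.

```shell
#!/usr/bin/env bash
# Self-contained sketch: expand a template containing __INDEX__ into one
# multi-document YAML file. The template below is a stand-in, not the PR's spec.
generate_manifest() {
  local template="$1" out="$2" count="$3"
  local i
  : > "$out"                        # truncate or create the output file
  for i in $(seq 1 "$count"); do
    sed "s/__INDEX__/${i}/g" "$template" >> "$out"
    echo "---" >> "$out"            # document separator after each copy
  done
}

workdir="$(mktemp -d)"
printf 'kind: Pod\nmetadata:\n  name: stress-pod-__INDEX__\n' > "$workdir/template.yaml"
generate_manifest "$workdir/template.yaml" "$workdir/pods.yaml" 3
grep 'name:' "$workdir/pods.yaml"
```

Because the PR's template already begins with `---`, appending another separator after each copy yields some empty YAML documents in the output. That appears harmless here, since the PR applies a manifest generated this way with `kubectl apply -f`.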
|
|
||
| @test "Stress: shared ResourceClaim across ${STRESS_PODS_N} pods x ${STRESS_LOOPS} loops" { | ||
| for loop in $(seq 1 "${STRESS_LOOPS}"); do | ||
| echo "=== Loop $loop/${STRESS_LOOPS} ===" | ||
|
|
||
| # Apply ResourceClaim | ||
| kubectl apply -f tests/bats/specs/rc-shared-gpu.yaml | ||
|
|
||
| # Generate and apply pods spec | ||
| manifest="${BATS_TEST_TMPDIR:-/tmp}/pods-shared-${loop}.yaml" | ||
| _generate_pods_manifest "$manifest" | ||
| kubectl apply -f "$manifest" | ||
|
|
||
| # Wait for ResourceClaim allocation | ||
| kubectl wait --for=jsonpath='{.status.allocation}' resourceclaim rc-shared-gpu --timeout=120s | ||
shivamerla marked this conversation as resolved.
|
||
|
|
||
| # Wait for all pods to be Ready | ||
| kubectl wait --for=condition=Ready pods -l 'env=batssuite,test=stress-shared' --timeout=180s | ||
|
|
||
| # Verify pod phases | ||
| phases=$(kubectl get pods -l 'env=batssuite,test=stress-shared' -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}') | ||
| echo "$phases" | ||
| echo "$phases" | awk '$2!="Running"{exit 1}' | ||
|
|
||
| # Spot-check GPU allocation logs | ||
| run kubectl logs stress-pod-1 | ||
| assert_output --partial "UUID: GPU-" | ||
|
|
||
| # Cleanup | ||
| kubectl delete pods -l 'env=batssuite,test=stress-shared' --timeout=90s | ||
| kubectl delete -f tests/bats/specs/rc-shared-gpu.yaml --timeout=90s | ||
| kubectl wait --for=delete pods -l 'env=batssuite,test=stress-shared' --timeout=60s | ||
|
|
||
| if [[ "$loop" -lt "$STRESS_LOOPS" ]]; then | ||
| echo "Sleeping ${STRESS_DELAY}s before next loop..." | ||
| sleep "${STRESS_DELAY}" | ||
| fi | ||
| done | ||
| } | ||
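
The "Verify pod phases" step in the test pipes `<name> <phase>` lines into `awk '$2!="Running"{exit 1}'`. A minimal sketch of how that check behaves on sample input, without a cluster, is given below. The function name `all_running` and the sample pod lines are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Returns 0 only if every "<name> <phase>" line on stdin reports phase Running.
# all_running is a hypothetical wrapper around the awk check used in the test.
all_running() {
  awk '$2 != "Running" { exit 1 }'
}

printf 'stress-pod-1 Running\nstress-pod-2 Running\n' | all_running && echo "all running"
printf 'stress-pod-1 Running\nstress-pod-2 Pending\n' | all_running || echo "not all running"
```

Note that this awk check passes on empty input, so on its own it cannot detect that no pods matched the label selector.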