-
Notifications
You must be signed in to change notification settings - Fork 98
tests: fixes, improved cleanup & stability, better debuggability #637
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
11 commits
Select commit
Hold shift + click to select a range
9377cdb
tests: rm `time`, dynamic -it, --rm, fix exec order, tweak output
jgehrcke eedd65a
tests: bump imex chan inject timeout (success after 70s+)
jgehrcke 6ed0984
tests: suite setup: fail if dra driver appears to be there
jgehrcke 61dc539
tests: temporarily install driver to facilitate cleanup
jgehrcke 56f11cc
Update gitignore for log files and tarballs
jgehrcke 4a625c9
tests: introduce common setup in helpers.sh
jgehrcke 6fce9c2
tests: fix instability waiting for new pods after chart upgrade
jgehrcke ecf27e0
tests: log workload state of the world upon entering test
jgehrcke f9d3322
tests: fix instability: wait for chan inject workload deletion
jgehrcke 403e9f6
tests: fix instability: --wait for helm uninstall
jgehrcke b57a854
tests: remove explicit IMEXDaemonsWithDNSNames=true
jgehrcke File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -7,3 +7,7 @@ | |
| [._]*.sw[a-p] | ||
| coverage.out | ||
| tests-out | ||
| *.log | ||
| *.tar | ||
| *.tgz | ||
|
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,12 +1,12 @@ | ||
| # shellcheck disable=SC2148 | ||
| # shellcheck disable=SC2329 | ||
| setup() { | ||
| load '/bats-libraries/bats-support/load.bash' | ||
| load '/bats-libraries/bats-assert/load.bash' | ||
| load '/bats-libraries/bats-file/load.bash' | ||
| load 'helpers.sh' | ||
| # Executed before entering each test in this file. | ||
| load 'helpers.sh' | ||
| _common_setup | ||
| } | ||
|
|
||
|
|
||
| # Currently, the tests defined in this file deliberately depend on each other | ||
| # and are expected to execute in the order defined. In the future, we want to | ||
| # build test dependency injection (with fixtures), and work towards clean | ||
|
|
@@ -15,10 +15,6 @@ setup() { | |
| # happening. Tools like `etcdctl` will be helpful. | ||
|
|
||
|
|
||
| # Use a name that upon cluster inspection reveals that this | ||
| # Helm chart release was installed/managed by this test suite. | ||
| export TEST_HELM_RELEASE_NAME="nvidia-dra-driver-gpu-batssuite" | ||
|
|
||
| # Note(JP): bats swallows output of setup upon success (regardless of cmdline | ||
| # args such as `--show-output-of-passing-tests`). Ref: | ||
| # https://github.com/bats-core/bats-core/issues/540#issuecomment-1013521656 -- | ||
|
|
@@ -41,45 +37,28 @@ setup_file() { | |
| # Prepare for installing releases from NGC (that merely mutates local | ||
| # filesystem state). | ||
| helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update | ||
|
|
||
| # A helper arg for `iupgrade_wait` w/o additional install args. | ||
| export NOARGS=() | ||
| } | ||
|
|
||
| # Install or upgrade, and wait for pods to be READY. | ||
| # 1st arg: helm chart repo | ||
| # 2nd arg: helm chart version | ||
| # 3rd arg: array with additional args (provide `NOARGS` if none) | ||
| iupgrade_wait() { | ||
| # E.g. `nvidia/nvidia-dra-driver-gpu` or | ||
| # `oci://ghcr.io/nvidia/k8s-dra-driver-gpu` | ||
| local REPO="$1" | ||
|
|
||
| # E.g. `25.3.1` or `25.8.0-dev-f2eaddd6-chart` | ||
| local VERSION="$2" | ||
|
|
||
| # Expect array as third argument. | ||
| local -n ADDITIONAL_INSTALL_ARGS=$3 | ||
|
|
||
| timeout -v 10 helm upgrade --install "${TEST_HELM_RELEASE_NAME}" \ | ||
| "${REPO}" \ | ||
| --version="${VERSION}" \ | ||
| --create-namespace \ | ||
| --namespace nvidia-dra-driver-gpu \ | ||
| --set resources.gpus.enabled=false \ | ||
| --set nvidiaDriverRoot="${TEST_NVIDIA_DRIVER_ROOT}" "${ADDITIONAL_INSTALL_ARGS[@]}" | ||
|
|
||
| kubectl wait --for=condition=READY pods -A -l nvidia-dra-driver-gpu-component=kubelet-plugin --timeout=10s | ||
| kubectl wait --for=condition=READY pods -A -l nvidia-dra-driver-gpu-component=controller --timeout=10s | ||
| # maybe: check version on labels (to confirm that we set labels correctly) | ||
| } | ||
|
|
||
| apply_check_delete_workload_imex_chan_inject() { | ||
| kubectl apply -f demo/specs/imex/channel-injection.yaml | ||
| kubectl wait --for=condition=READY pods imex-channel-injection --timeout=70s | ||
| kubectl wait --for=condition=READY pods imex-channel-injection --timeout=100s | ||
| run kubectl logs imex-channel-injection | ||
| assert_output --partial "channel0" | ||
| kubectl delete -f demo/specs/imex/channel-injection.yaml | ||
| # Check output after attempted deletion. | ||
| assert_output --partial "channel0" | ||
|
|
||
| # Wait for deletion to complete; this is critical before moving on to the next | ||
| # test (as long as we don't wipe state entirely between tests). | ||
| kubectl wait --for=delete pods imex-channel-injection --timeout=10s | ||
| } | ||
|
|
||
| log_objects() { | ||
| # Never fail, but show output in case a test fails, to facilitate debugging. | ||
| # Could this be part of setup()? If setup succeeds and when a test fails: | ||
| # does this show the output of setup? Then we could do this. | ||
| kubectl get resourceclaims || true | ||
| kubectl get computedomain || true | ||
| kubectl get pods -o wide || true | ||
|
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am sure we'll find a better way -- the main question here is if a bats setup primitive can be useful. I will explore more deeply another time. |
||
| } | ||
|
|
||
| # A test that covers local dev tooling, we don't want to | ||
|
|
@@ -100,8 +79,7 @@ apply_check_delete_workload_imex_chan_inject() { | |
| } | ||
|
|
||
| @test "helm-install ${TEST_CHART_REPO}/${TEST_CHART_VERSION}" { | ||
| local _iargs=("--set" "featureGates.IMEXDaemonsWithDNSNames=true") | ||
| iupgrade_wait "${TEST_CHART_REPO}" "${TEST_CHART_VERSION}" _iargs | ||
| iupgrade_wait "${TEST_CHART_REPO}" "${TEST_CHART_VERSION}" NOARGS | ||
| } | ||
|
|
||
| @test "helm list: validate output" { | ||
|
|
@@ -146,10 +124,12 @@ apply_check_delete_workload_imex_chan_inject() { | |
| } | ||
|
|
||
| @test "IMEX channel injection (single)" { | ||
| log_objects | ||
| apply_check_delete_workload_imex_chan_inject | ||
| } | ||
|
|
||
| @test "IMEX channel injection (all)" { | ||
| log_objects | ||
| # Example: with TEST_CHART_VERSION="v25.3.2-12390-chart" | ||
| # the command below returns 0 (true: the tested version is smaller) | ||
| if dpkg --compare-versions "${TEST_CHART_VERSION#v}" lt "25.8.0"; then | ||
|
|
@@ -164,6 +144,8 @@ apply_check_delete_workload_imex_chan_inject() { | |
| } | ||
|
|
||
| @test "NodePrepareResources: catch unknown field in opaque cfg in ResourceClaim" { | ||
| log_objects | ||
|
|
||
| envsubst < tests/bats/specs/rc-opaque-cfg-unknown-field.yaml.tmpl > \ | ||
| "${BATS_TEST_TMPDIR}"/rc-opaque-cfg-unknown-field.yaml | ||
| cd "${BATS_TEST_TMPDIR}" | ||
|
|
@@ -206,6 +188,8 @@ apply_check_delete_workload_imex_chan_inject() { | |
| } | ||
|
|
||
| @test "nickelpie (NCCL send/recv/broadcast, 2 pods, 2 nodes, small payload)" { | ||
| log_objects | ||
|
|
||
| # Do not run in checkout dir (to not pollute that). | ||
| cd "${BATS_TEST_TMPDIR}" | ||
| git clone https://github.com/jgehrcke/jpsnips-nv | ||
|
|
@@ -220,6 +204,8 @@ apply_check_delete_workload_imex_chan_inject() { | |
| } | ||
|
|
||
| @test "nvbandwidth (2 nodes, 2 GPUs each)" { | ||
| log_objects | ||
|
|
||
| kubectl create -f https://github.com/kubeflow/mpi-operator/releases/download/v0.6.0/mpi-operator.yaml || echo "ignore" | ||
| kubectl apply -f demo/specs/imex/nvbandwidth-test-job-1.yaml | ||
| # The canonical k8s job interface works even for MPIJob (the MPIJob has an | ||
|
|
@@ -232,6 +218,8 @@ apply_check_delete_workload_imex_chan_inject() { | |
| } | ||
|
|
||
| @test "downgrade: current-dev -> last-stable" { | ||
| log_objects | ||
|
|
||
| # Stage 1: apply workload, but do not delete. | ||
| kubectl apply -f demo/specs/imex/channel-injection.yaml | ||
| kubectl wait --for=condition=READY pods imex-channel-injection --timeout=60s | ||
|
|
@@ -250,8 +238,10 @@ apply_check_delete_workload_imex_chan_inject() { | |
| } | ||
|
|
||
| @test "upgrade: wipe-state, install-last-stable, upgrade-to-current-dev" { | ||
| log_objects | ||
|
|
||
| # Stage 1: clean slate | ||
| helm uninstall "${TEST_HELM_RELEASE_NAME}" -n nvidia-dra-driver-gpu | ||
| helm uninstall "${TEST_HELM_RELEASE_NAME}" -n nvidia-dra-driver-gpu --wait --timeout=30s | ||
| kubectl wait --for=delete pods -A -l app.kubernetes.io/name=nvidia-dra-driver-gpu --timeout=10s | ||
| bash tests/bats/clean-state-dirs-all-nodes.sh | ||
| kubectl get crd computedomains.resource.nvidia.com | ||
|
|
@@ -271,8 +261,7 @@ apply_check_delete_workload_imex_chan_inject() { | |
| kubectl apply -f "${CRD_UPGRADE_URL}" | ||
|
|
||
| # Stage 5: install target version (as users would do). | ||
| local _iargs=("--set" "featureGates.IMEXDaemonsWithDNSNames=true") | ||
| iupgrade_wait "${TEST_CHART_REPO}" "${TEST_CHART_VERSION}" _iargs | ||
| iupgrade_wait "${TEST_CHART_REPO}" "${TEST_CHART_VERSION}" NOARGS | ||
|
|
||
| # Stage 6: confirm deleting old workload works (critical, see above). | ||
| timeout -v 60 kubectl delete -f demo/specs/imex/channel-injection.yaml | ||
|
|
||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to helpers.sh