tests: fixes, improved cleanup & stability, better debuggability #637
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
kubectl wait --for=condition=READY pods -A -l nvidia-dra-driver-gpu-component=kubelet-plugin --timeout=10s
kubectl wait --for=condition=READY pods -A -l nvidia-dra-driver-gpu-component=controller --timeout=10s
# maybe: check version on labels (to confirm that we set labels correctly)
}
moved to helpers.sh
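For illustration, a helper wrapping the two `kubectl wait` calls above could look roughly like this. This is only a sketch: the function name `wait_for_component_ready` and its arguments are assumptions, not necessarily what ended up in helpers.sh.

```shell
# Hypothetical helper (name and signature are assumptions, not taken from helpers.sh).
# Waits until all pods carrying the given component label report READY.
wait_for_component_ready() {
  local component
  local timeout
  component="$1"
  timeout="${2:-10s}"   # default mirrors the 10s used in the snippet above
  kubectl wait --for=condition=READY pods -A \
    -l "nvidia-dra-driver-gpu-component=${component}" \
    --timeout="${timeout}"
}
```

A test would then call `wait_for_component_ready kubelet-plugin` or `wait_for_component_ready controller 30s`.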
9436a56 to
cd3b93f
Compare
cd3b93f to
7436bb9
Compare
7436bb9 to
b57a854
Compare
# does this show the output of setup? Then we could do this.
kubectl get resourceclaims || true
kubectl get computedomain || true
kubectl get pods -o wide || true
I am sure we'll find a better way -- the main question here is if a bats setup primitive can be useful. I will explore more deeply another time.
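One way the setup idea could be sketched, under assumptions: `log_cluster_state` is a made-up name, and I believe (but have not verified here) that bats only prints output captured during `setup` when the corresponding test fails.

```shell
# Hypothetical diagnostic helper; the name is an assumption.
# Each command is best-effort: a failing 'kubectl get' must not fail the test.
log_cluster_state() {
  kubectl get resourceclaims || true
  kubectl get computedomain || true
  kubectl get pods -o wide || true
}

# bats calls setup() before every test in the file.
setup() {
  log_cluster_state
}
```

The `|| true` guards keep the hook from turning a missing CRD or an unreachable API server into a test failure of its own.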
With the changes on this branch, I just got ten consecutive suite completions. Before that, I got an error around the upgrade/downgrade tests almost every single time (not sure what changed, they used to be more stable, but the race conditions addressed in this PR have always been present; we were just lucky). From experience: there will be many more instabilities to fix (and we will keep introducing new ones). Keeping an eye on stability is of course critical as we head towards CI integration. Report from the last execution, for reference: [...]
mkdir -p tests-out && \
export _RUNDIR=$(mktemp -p tests-out -d -t bats-tests-$(date +%s)-XXXXX) && \
docker run \
[...]
--env TEST_CHART_REPO="deployments/helm/nvidia-dra-driver-gpu" \
--env TEST_CHART_VERSION=25.8.0-dev \
--env TEST_CHART_LASTSTABLE_REPO="oci://ghcr.io/nvidia/k8s-dra-driver-gpu" \
--env TEST_CHART_LASTSTABLE_VERSION="25.3.2-2c250af3-chart" \
--env TEST_CRD_UPGRADE_TARGET_GIT_REF="main" \
--env TEST_NVIDIA_DRIVER_ROOT="/run/nvidia/driver" \
--env TEST_EXPECTED_IMAGE_SPEC_SUBSTRING=v25.8.0-dev \
[...]
+ set +x
+ TMPDIR=/cwd/tests-out/bats-tests-1759413258-bwl9W
+ bats --print-output-on-failure --no-tempdir-cleanup --timing tests/bats/tests.bats
tests.bats
✓ test VERSION_W_COMMIT, VERSION_GHCR_CHART, VERSION [62]
✓ confirm no kubelet plugin pods running [69]
✓ helm-install deployments/helm/nvidia-dra-driver-gpu/25.8.0-dev [5204]
✓ helm list: validate output [165]
✓ get crd computedomains.resource.nvidia.com [66]
✓ wait for kubelet plugin pods READY [276]
✓ wait for controller pod READY [164]
✓ validate CD controller container image spec [66]
✓ IMEX channel injection (single) [6871]
✓ IMEX channel injection (all) [7334]
✓ NodePrepareResources: catch unknown field in opaque cfg in ResourceClaim [3044]
✓ nickelpie (NCCL send/recv/broadcast, 2 pods, 2 nodes, small payload) [11322]
✓ nvbandwidth (2 nodes, 2 GPUs each) [18924]
✓ downgrade: current-dev -> last-stable [25512]
✓ upgrade: wipe-state, install-last-stable, upgrade-to-current-dev [31474]
15 tests, 0 failures in 112 seconds
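A repeated-run check like the ten consecutive completions mentioned above can be scripted along these lines. This is a sketch only: `run_suite_repeatedly` is a hypothetical name, and the command to repeat is whatever launches the suite in a given setup.

```shell
# Hypothetical stability loop: run a command N times and count failures.
# Usage: run_suite_repeatedly <runs> <command> [args...]
run_suite_repeatedly() {
  local runs
  local failures
  local i
  runs="$1"
  shift
  failures=0
  i=1
  while [ "$i" -le "$runs" ]; do
    if ! "$@"; then
      failures=$((failures + 1))
      echo "run $i: FAILED"
    fi
    i=$((i + 1))
  done
  echo "$failures of $runs runs failed"
  # Exit status is non-zero if any run failed.
  [ "$failures" -eq 0 ]
}
```

For example, `run_suite_repeatedly 10 bats --timing tests/bats/tests.bats` would repeat the bats invocation shown in the report ten times.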
Going to land this. I still appreciate critical eyes on this, as always.
Initiative: #613.
This branch is a collection of test suite-related patches that I've accumulated over the past days in various branches.
For example:
time, dynamically use -it, add --rm (towards CI in GHA)
For details, see individual commits.
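As an illustration of the "dynamically use -it" item: a common pattern is to pass `-it` to `docker run` only when a terminal is attached, since CI runners such as GHA typically have none. The sketch below is an assumption about how this can be done, not a copy of the actual commit; `set_docker_tty_flags` and `DOCKER_TTY_FLAGS` are made-up names.

```shell
# Hypothetical helper: sets DOCKER_TTY_FLAGS to "-it" only when both stdin
# and stdout are terminals. It sets a variable instead of echoing, because
# inside a command substitution stdout is a pipe and the check would always fail.
set_docker_tty_flags() {
  DOCKER_TTY_FLAGS=""
  if [ -t 0 ] && [ -t 1 ]; then
    DOCKER_TTY_FLAGS="-it"
  fi
}
```

After calling `set_docker_tty_flags`, the variable is meant to be expanded unquoted (`docker run --rm ${DOCKER_TTY_FLAGS} ...`) so that an empty value contributes no argument.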