Collaborator

@jgehrcke jgehrcke commented Oct 2, 2025

Initiative: #613.

This branch collects test-suite patches that I've accumulated in various branches over the past few days.

For example:

  • Fixed an order-of-execution bug in the main makefile target (this time for real)
  • Various test stability fixes (in particular around workload teardown and Helm chart upgrade)
  • Removed the dependency on time, use -it dynamically, added --rm (towards CI in GHA)
  • More brute force in cleanup
  • More fail-fast criteria
  • Timeout constant tweaks based on observation, to support tail-end scenarios (stability improvements)
  • Better debuggability (more log output emitted when things go wrong)
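
To illustrate the -it / --rm item: a minimal sketch in shell of how the flags could be chosen depending on whether a terminal is attached. This is not the makefile logic from this branch; the names `set_docker_run_flags` and `DOCKER_RUN_FLAGS` are hypothetical.

```shell
# Hypothetical helper (not from this branch): choose `docker run` flags
# depending on whether a terminal is attached.
set_docker_run_flags() {
  # --rm: always remove the container after it exits.
  DOCKER_RUN_FLAGS="--rm"
  # -it: only when both stdin and stdout are terminals. Without a TTY
  # (typical in CI), `docker run -it` errors out with
  # "the input device is not a TTY".
  if [ -t 0 ] && [ -t 1 ]; then
    DOCKER_RUN_FLAGS="$DOCKER_RUN_FLAGS -it"
  fi
}
```

The helper sets a variable instead of printing the flags, because inside a command substitution stdout is a pipe and the TTY check would always come out negative. A caller would run `set_docker_run_flags` first and then pass `$DOCKER_RUN_FLAGS` unquoted to `docker run`, so that the flags are split into separate words.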

For details, see individual commits.


copy-pr-bot bot commented Oct 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

kubectl wait --for=condition=READY pods -A -l nvidia-dra-driver-gpu-component=kubelet-plugin --timeout=10s
kubectl wait --for=condition=READY pods -A -l nvidia-dra-driver-gpu-component=controller --timeout=10s
# maybe: check version on labels (to confirm that we set labels correctly)
}
Collaborator Author

moved to helpers.sh
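
For illustration, a shared helper wrapping the two `kubectl wait` calls could look roughly like this. The actual contents of helpers.sh are not shown in this conversation; the function name `wait_for_component_ready` and its argument order are assumptions.

```shell
# Hypothetical helper; the real helpers.sh may look different.
# $1: value of the nvidia-dra-driver-gpu-component label
# $2: timeout passed to kubectl (defaults to 10s, as in the snippet above)
wait_for_component_ready() {
  kubectl wait --for=condition=READY pods -A \
    -l "nvidia-dra-driver-gpu-component=$1" \
    --timeout="${2:-10s}"
}
```

A test would then call `wait_for_component_ready kubelet-plugin` and `wait_for_component_ready controller`. The function returns the exit code of `kubectl wait`, so a timeout fails the calling test.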

@jgehrcke jgehrcke self-assigned this Oct 2, 2025
@jgehrcke jgehrcke added the ci/testing label (issue/PR related to CI and/or testing) Oct 2, 2025
@jgehrcke jgehrcke moved this from Backlog to In Progress in Planning Board: k8s-dra-driver-gpu Oct 2, 2025
@jgehrcke jgehrcke added this to the v25.8.0 milestone Oct 2, 2025
@jgehrcke jgehrcke force-pushed the jp/test-suite-patches branch from 9436a56 to cd3b93f Compare October 2, 2025 13:36
@jgehrcke jgehrcke changed the title from "tests: fixes, improve cleanup, output, time constants" to "tests: fixes, improved cleanup & stability, better debuggability" Oct 2, 2025
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
@jgehrcke jgehrcke force-pushed the jp/test-suite-patches branch from cd3b93f to 7436bb9 Compare October 2, 2025 13:40
@jgehrcke jgehrcke force-pushed the jp/test-suite-patches branch from 7436bb9 to b57a854 Compare October 2, 2025 13:44
# does this show the output of setup? Then we could do this.
kubectl get resourceclaims || true
kubectl get computedomain || true
kubectl get pods -o wide || true
Collaborator Author

I am sure we'll find a better way. The main question here is whether a bats setup primitive can be useful; I will explore this more deeply another time.
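
As a rough sketch of that idea: bats calls a function named `setup` before each test case, so the state dump could live in a small helper invoked from there. The helper name `log_cluster_state` is hypothetical, and whether the output shows up where it is needed is exactly the open question above.

```shell
# Hypothetical helper: print cluster state for debugging purposes.
# Each command is followed by `|| true` so that a failing diagnostic
# command never fails the test that triggered it.
log_cluster_state() {
  kubectl get resourceclaims || true
  kubectl get computedomain || true
  kubectl get pods -o wide || true
}

# bats runs `setup` before every test case in the file.
setup() {
  log_cluster_state
}
```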

Collaborator Author

jgehrcke commented Oct 2, 2025

With the changes on this branch, I just got ten consecutive suite completions. Before that, I got an error around the upgrade/downgrade tests almost every single time. I am not sure what changed, as these tests used to be more stable, but the race conditions addressed in this PR have always been present; we were just lucky.

From experience: there will be many more instabilities to fix (and we will keep introducing new ones). Keeping an eye on stability is critical, of course, as we're heading towards CI integration.
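
A repeated-run check like the one described above can be scripted. The following is a minimal sketch, not something contained in this branch; `run_repeatedly` is a hypothetical name.

```shell
# Hypothetical helper: run a command N times in a row and stop at the
# first failure. Note: uses the global variables `runs` and `i`.
# $1: number of runs; remaining arguments: the command to execute.
run_repeatedly() {
  runs="$1"
  shift
  i=1
  while [ "$i" -le "$runs" ]; do
    echo "run $i/$runs"
    if ! "$@"; then
      echo "failed on run $i/$runs" >&2
      return 1
    fi
    i=$((i + 1))
  done
}
```

For ten consecutive suite executions, one would pass 10 followed by the command that starts the suite, for example the corresponding make invocation.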

Report from last execution, for reference:

[...]
mkdir -p tests-out && \
export _RUNDIR=$(mktemp -p tests-out -d -t bats-tests-$(date +%s)-XXXXX) && \
docker run \
[...]
        --env TEST_CHART_REPO="deployments/helm/nvidia-dra-driver-gpu" \
        --env TEST_CHART_VERSION=25.8.0-dev \
        --env TEST_CHART_LASTSTABLE_REPO="oci://ghcr.io/nvidia/k8s-dra-driver-gpu" \
        --env TEST_CHART_LASTSTABLE_VERSION="25.3.2-2c250af3-chart" \
        --env TEST_CRD_UPGRADE_TARGET_GIT_REF="main" \
        --env TEST_NVIDIA_DRIVER_ROOT="/run/nvidia/driver" \
        --env TEST_EXPECTED_IMAGE_SPEC_SUBSTRING=v25.8.0-dev \
[...]
+ set +x
+ TMPDIR=/cwd/tests-out/bats-tests-1759413258-bwl9W
+ bats --print-output-on-failure --no-tempdir-cleanup --timing tests/bats/tests.bats
tests.bats
 ✓ test VERSION_W_COMMIT, VERSION_GHCR_CHART, VERSION [62]
 ✓ confirm no kubelet plugin pods running [69]
 ✓ helm-install deployments/helm/nvidia-dra-driver-gpu/25.8.0-dev [5204]
 ✓ helm list: validate output [165]
 ✓ get crd computedomains.resource.nvidia.com [66]
 ✓ wait for kubelet plugin pods READY [276]
 ✓ wait for controller pod READY [164]
 ✓ validate CD controller container image spec [66]
 ✓ IMEX channel injection (single) [6871]
 ✓ IMEX channel injection (all) [7334]
 ✓ NodePrepareResources: catch unknown field in opaque cfg in ResourceClaim [3044]
 ✓ nickelpie (NCCL send/recv/broadcast, 2 pods, 2 nodes, small payload) [11322]
 ✓ nvbandwidth (2 nodes, 2 GPUs each) [18924]
 ✓ downgrade: current-dev -> last-stable [25512]
 ✓ upgrade: wipe-state, install-last-stable, upgrade-to-current-dev [31474]

15 tests, 0 failures in 112 seconds

Collaborator Author

jgehrcke commented Oct 2, 2025

Going to land this. I still appreciate critical eyes on this, as always.

@jgehrcke jgehrcke merged commit afbb033 into NVIDIA:main Oct 2, 2025
7 checks passed
@jgehrcke jgehrcke moved this from In Progress to Closed in Planning Board: k8s-dra-driver-gpu Oct 15, 2025