Conversation

@karthikvetrivel (Member) commented on Oct 14, 2025

This PR is the second step in migrating the GPU Operator e2e tests from bash to Go/Ginkgo.

Overview

This PR introduces a comprehensive set of helper utilities that leverage the GPU operator's generated Go clientsets (api/versioned).

Note: This PR contains only the helper function infrastructure. Actual test implementations using these helpers will be added in a follow-up PR.

Changes

1. Helper Infrastructure (tests/e2e/helpers/)

Created comprehensive helper utilities:

  • operator.go - Helm client for deploying GPU Operator
  • clusterpolicy.go - ClusterPolicy CRUD + component toggles (DCGM, GFD, etc.)
  • nvidiadriver.go - NVIDIADriver CR operations (cluster-scoped)
  • daemonset.go - DaemonSet queries and readiness checks
  • node.go - Node labeling operations
  • workload.go - GPU workload deployment and verification
  • pod.go - Pod operations and namespace management

Structure:

  • Flattened directory: moved operator/helm.go → operator.go, kubernetes/pod.go → pod.go
  • Updated imports in gpu_operator_test.go
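
To give a sense of the intended shape of these helpers, here is a minimal sketch of the kind of readiness wait daemonset.go provides. The function name, signature, and the stand-in polling constant are illustrative, not the final API; the readiness condition mirrors the check discussed in the review below.

package helpers

import (
    "context"
    "time"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
)

// defaultPollingInterval stands in for the shared DefaultPollingInterval constant.
const defaultPollingInterval = 5 * time.Second

// WaitForDaemonSetReady polls until the named DaemonSet reports all scheduled
// pods ready (and at least one pod exists), or the timeout expires.
func WaitForDaemonSetReady(ctx context.Context, client kubernetes.Interface, namespace, name string, timeout time.Duration) error {
    return wait.PollUntilContextTimeout(ctx, defaultPollingInterval, timeout, true,
        func(ctx context.Context) (bool, error) {
            ds, err := client.AppsV1().DaemonSets(namespace).Get(ctx, name, metav1.GetOptions{})
            if apierrors.IsNotFound(err) {
                // The operator may not have created the DaemonSet yet; keep polling.
                return false, nil
            }
            if err != nil {
                return false, err
            }
            // Ready when every scheduled pod reports ready and at least one pod exists.
            return ds.Status.NumberReady == ds.Status.DesiredNumberScheduled &&
                ds.Status.NumberReady > 0, nil
        })
}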

2. Constants (tests/e2e/constants.go)

Added shared constants:

  • DefaultPollingInterval - Standard 5s interval for wait operations
  • UpgradeDoneState - Driver upgrade completion state
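
Roughly, assuming the names above, the constants look like the sketch below. The 5s value comes from the description; the "upgrade-done" string is an inference from the node label discussed later in review, not a confirmed value.

package e2e

import "time"

const (
    // DefaultPollingInterval is the standard 5s interval for wait operations.
    DefaultPollingInterval = 5 * time.Second

    // UpgradeDoneState marks driver upgrade completion; "upgrade-done" is an
    // assumption based on the nvidia.com/gpu-driver-upgrade-state label value.
    UpgradeDoneState = "upgrade-done"
)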

3. Updated Existing Tests

Modified gpu_operator_test.go to use new helper package structure:

  • operator.Client → helpers.OperatorClient
  • k8stest.Client → helpers.PodClient
  • Improved error handling for pod log retrieval
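
For example, a test consuming the new package might look roughly like this; the import path, constructor, and method names on helpers.PodClient are assumptions for illustration only.

package e2e

import (
    "context"

    . "github.com/onsi/ginkgo/v2"
    . "github.com/onsi/gomega"
    "k8s.io/client-go/kubernetes"

    // Hypothetical import path for the flattened helper package.
    "github.com/NVIDIA/gpu-operator/tests/e2e/helpers"
)

// Assumed to be initialized in the suite setup.
var (
    clientset *kubernetes.Clientset
    namespace string
    podName   string
)

var _ = Describe("GPU Operator", func() {
    It("retrieves validator pod logs", func(ctx context.Context) {
        // helpers.PodClient replaces the old k8stest.Client; names are illustrative.
        pods := helpers.NewPodClient(clientset)

        logs, err := pods.GetPodLogs(ctx, namespace, podName)
        // Improved error handling: surface log-retrieval failures instead of ignoring them.
        Expect(err).NotTo(HaveOccurred(), "failed to fetch logs for pod %s/%s", namespace, podName)
        Expect(logs).NotTo(BeEmpty())
    })
})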

Next Steps

  • Step 3: Migrate bash test scenarios to Go using these helpers
  • Step 4: Update CI/CD pipelines to run Go-based e2e tests

@karthikvetrivel force-pushed the refactor/e2e-migration branch 2 times, most recently from 0313d88 to 1b4d24e on October 15, 2025 at 18:31
@rajathagasthya (Contributor) left a comment

LGTM. Thanks for addressing the comments!

@karthikvetrivel (Member, Author) commented

Moving this to draft until I add tests to keep the github.com/NVIDIA/gpu-operator dep (by referencing it in the e2e/tests code).

@karthikvetrivel marked this pull request as draft on October 17, 2025 at 21:09
@karthikvetrivel force-pushed the refactor/e2e-migration branch 4 times, most recently from 99feca6 to 10ad3e0 on October 20, 2025 at 04:16
@karthikvetrivel marked this pull request as ready for review on October 20, 2025 at 12:50
@rajathagasthya (Contributor) commented

@karthikvetrivel Could you update the helm.sh/helm/v3 dependency to address the Dependabot alerts in https://github.com/NVIDIA/gpu-operator/security/dependabot? go get helm.sh/helm/v3@latest should work.

@karthikvetrivel (Member, Author) commented

@rajathagasthya This would require updating the main module's Kubernetes dependencies from v0.33.2 to v0.34.0. Should we still do it?

@rajathagasthya (Contributor) commented

@tariq1890 updated it recently in #1805, so we should be able to contain these changes to just the e2e module.

@karthikvetrivel (Member, Author) commented

@rajathagasthya Fixed and amended. Thanks!

@coderabbitai (bot) commented on Oct 23, 2025

Important: Review skipped. Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@ArangoGutierrez (Collaborator) left a comment

Thanks @rajathagasthya. PR LGTM; I'll leave final approval to @tariq1890 and @cdesiniotis.

@karthikvetrivel (Member, Author) commented

@cdesiniotis @tariq1890 Bumping this PR again for final review when you guys have the chance.

@cdesiniotis (Contributor) left a comment

Thanks @karthikvetrivel! I left a few comments, most are not blockers.

Comment on lines +75 to +76
if daemonSet.Status.NumberReady == daemonSet.Status.DesiredNumberScheduled &&
    daemonSet.Status.NumberReady > 0 {

Just an observation -- we check for daemonset readiness in several places. Most notably:

func isDaemonSetReady(name string, n ClusterPolicyController) gpuv1.State {

func (s *stateSkel) isDaemonSetReady(uds *unstructured.Unstructured, reqLogger logr.Logger) (bool, error) {

It may be in our best interest to align all these implementations at some point and reuse the same helper (if it makes sense). This is obviously out of scope for this PR.

Comment on lines +98 to +100
if nvidiaDriver.Status.State == upgradeDoneState {
    return true, nil
}

I am not sure I am following this. The NVIDIADriver CR status will never enter this state.

By default, the upgrade of a GPU driver daemonset is facilitated by our driver upgrade controller. This is documented here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-upgrades.html. The controller currently uses the nvidia.com/gpu-driver-upgrade-state node label to track the state of the upgrade. The label value will be upgrade-done when the driver upgrade has completed successfully on a particular node.

I am assuming this helper was inspired by

wait_for_driver_upgrade_done() {
    gpu_node_count=$(kubectl get node -l nvidia.com/gpu.present --no-headers | wc -l)
    local current_time=0
    echo "waiting for the gpu driver upgrade to complete"
    while :; do
        local upgraded_count=0
        for node in $(kubectl get nodes -o NAME); do
            upgrade_state=$(kubectl get $node -ojsonpath='{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}')
            if [ "${upgrade_state}" = "upgrade-done" ]; then
                upgraded_count=$((${upgraded_count} + 1))
            fi
        done
        if [[ $upgraded_count -eq $gpu_node_count ]]; then
            echo "gpu driver upgrade completed successfully"
            break;
        else
            echo "gpu driver still in progress. $upgraded_count/$gpu_node_count node(s) upgraded"
        fi
        if [[ "${current_time}" -gt $((60 * 45)) ]]; then
            echo "timeout reached"
            exit 1;
        fi
        echo "current state of driver upgrade"
        kubectl get node -l nvidia.com/gpu.present \
            -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}{"\n"}{end}'
        echo "Sleeping 5 seconds"
        current_time=$((${current_time} + 5))
        sleep 5
    done
}
?
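
A Go equivalent keyed off that node label might look roughly like the sketch below. The helper name and wiring are assumptions; the condition simply mirrors the bash loop.

package helpers

import (
    "context"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
)

// WaitForDriverUpgradeDone polls until every node labeled nvidia.com/gpu.present
// reports nvidia.com/gpu-driver-upgrade-state=upgrade-done, mirroring the bash loop above.
func WaitForDriverUpgradeDone(ctx context.Context, client kubernetes.Interface, interval, timeout time.Duration) error {
    return wait.PollUntilContextTimeout(ctx, interval, timeout, true,
        func(ctx context.Context) (bool, error) {
            nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{
                LabelSelector: "nvidia.com/gpu.present",
            })
            if err != nil {
                return false, err
            }
            if len(nodes.Items) == 0 {
                // No GPU nodes discovered yet; keep waiting.
                return false, nil
            }
            for _, node := range nodes.Items {
                if node.Labels["nvidia.com/gpu-driver-upgrade-state"] != "upgrade-done" {
                    return false, nil
                }
            }
            return true, nil
        })
}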

func (h *WorkloadClient) DeployGPUPod(ctx context.Context, namespace string, podSpec *corev1.Pod) (*corev1.Pod, error) {

nit: Technically the pod passed in does not need to be a GPU workload. Would renaming this function to DeployPod be a better fit?

Comment on lines +115 to +116
if !strings.Contains(logs, "NVIDIA") && !strings.Contains(logs, "GPU") {
    return fmt.Errorf("pod logs do not contain evidence of GPU access")

I am okay with this for now, but we probably want to improve upon this in future iterations. E.g. this method could exec into the main container and invoke nvidia-smi itself.
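
For reference, such an exec-based check could take roughly this shape with client-go's remotecommand package. This is a sketch only; how the WorkloadClient carries its clientset and rest.Config is an assumption.

package helpers

import (
    "bytes"
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/kubernetes/scheme"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/remotecommand"
)

// execNvidiaSMI execs into the given container and runs nvidia-smi, instead of
// grepping pod logs for "NVIDIA"/"GPU". A non-nil error means the check failed.
func execNvidiaSMI(ctx context.Context, client kubernetes.Interface, config *rest.Config, namespace, pod, container string) error {
    req := client.CoreV1().RESTClient().Post().
        Resource("pods").Namespace(namespace).Name(pod).SubResource("exec").
        VersionedParams(&corev1.PodExecOptions{
            Container: container,
            Command:   []string{"nvidia-smi"},
            Stdout:    true,
            Stderr:    true,
        }, scheme.ParameterCodec)

    exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
    if err != nil {
        return fmt.Errorf("creating executor: %w", err)
    }

    var stdout, stderr bytes.Buffer
    if err := exec.StreamWithContext(ctx, remotecommand.StreamOptions{Stdout: &stdout, Stderr: &stderr}); err != nil {
        return fmt.Errorf("nvidia-smi failed: %w (stderr: %s)", err, stderr.String())
    }
    return nil
}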

Containers: []corev1.Container{
    {
        Name:  "gpu-test",
        Image: "nvidia/cuda:12.0.0-base-ubuntu22.04",

nit: let's pull this image from nvcr.io and use a more recent version.
