Conversation

@karthikvetrivel (Member) commented on Oct 14, 2025

This PR is the second step in migrating the GPU Operator e2e tests from bash to Go/Ginkgo.

Overview

This PR introduces a comprehensive set of helper utilities that leverage the GPU operator's generated Go clientsets (api/versioned).

Note: This PR contains only the helper function infrastructure. Actual test implementations using these helpers will be added in a follow-up PR.

Changes

1. Helper Infrastructure (tests/e2e/helpers/)

Created comprehensive helper utilities:

  • operator.go - Helm client for deploying GPU Operator
  • clusterpolicy.go - ClusterPolicy CRUD + component toggles (DCGM, GFD, etc.)
  • nvidiadriver.go - NVIDIADriver CR operations (cluster-scoped)
  • daemonset.go - DaemonSet queries and readiness checks
  • node.go - Node labeling operations
  • workload.go - GPU workload deployment and verification
  • pod.go - Pod operations and namespace management

Structure:

  • Flattened directory: moved operator/helm.go → operator.go, kubernetes/pod.go → pod.go
  • Updated imports in gpu_operator_test.go
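
To give a sense of the intended shape of these helpers, here is a minimal sketch of the kind of readiness wait daemonset.go provides. The function name, signature, and the stand-in polling constant are illustrative, not the final API; the readiness condition mirrors the check discussed in the review below.

package helpers

import (
    "context"
    "time"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
)

// defaultPollingInterval stands in for the shared DefaultPollingInterval constant.
const defaultPollingInterval = 5 * time.Second

// WaitForDaemonSetReady polls until the named DaemonSet reports all scheduled
// pods ready (and at least one pod exists), or the timeout expires.
func WaitForDaemonSetReady(ctx context.Context, client kubernetes.Interface, namespace, name string, timeout time.Duration) error {
    return wait.PollUntilContextTimeout(ctx, defaultPollingInterval, timeout, true,
        func(ctx context.Context) (bool, error) {
            ds, err := client.AppsV1().DaemonSets(namespace).Get(ctx, name, metav1.GetOptions{})
            if apierrors.IsNotFound(err) {
                // The operator may not have created the DaemonSet yet; keep polling.
                return false, nil
            }
            if err != nil {
                return false, err
            }
            // Ready when every scheduled pod reports ready and at least one pod exists.
            return ds.Status.NumberReady == ds.Status.DesiredNumberScheduled &&
                ds.Status.NumberReady > 0, nil
        })
}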

2. Constants (tests/e2e/constants.go)

Added shared constants:

  • DefaultPollingInterval - Standard 5s interval for wait operations
  • UpgradeDoneState - Driver upgrade completion state
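
Roughly, assuming the names above, the constants look like the sketch below. The 5s value comes from the description; the "upgrade-done" string is an inference from the node label discussed later in review, not a confirmed value.

package e2e

import "time"

const (
    // DefaultPollingInterval is the standard 5s interval for wait operations.
    DefaultPollingInterval = 5 * time.Second

    // UpgradeDoneState marks driver upgrade completion; "upgrade-done" is an
    // assumption based on the nvidia.com/gpu-driver-upgrade-state label value.
    UpgradeDoneState = "upgrade-done"
)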

3. Updated Existing Tests

Modified gpu_operator_test.go to use new helper package structure:

  • operator.Client → helpers.OperatorClient
  • k8stest.Client → helpers.PodClient
  • Improved error handling for pod log retrieval
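
For example, a test consuming the new package might look roughly like this; the import path, constructor, and method names on helpers.PodClient are assumptions for illustration only.

package e2e

import (
    "context"

    . "github.com/onsi/ginkgo/v2"
    . "github.com/onsi/gomega"
    "k8s.io/client-go/kubernetes"

    // Hypothetical import path for the flattened helper package.
    "github.com/NVIDIA/gpu-operator/tests/e2e/helpers"
)

// Assumed to be initialized in the suite setup.
var (
    clientset *kubernetes.Clientset
    namespace string
    podName   string
)

var _ = Describe("GPU Operator", func() {
    It("retrieves validator pod logs", func(ctx context.Context) {
        // helpers.PodClient replaces the old k8stest.Client; names are illustrative.
        pods := helpers.NewPodClient(clientset)

        logs, err := pods.GetPodLogs(ctx, namespace, podName)
        // Improved error handling: surface log-retrieval failures instead of ignoring them.
        Expect(err).NotTo(HaveOccurred(), "failed to fetch logs for pod %s/%s", namespace, podName)
        Expect(logs).NotTo(BeEmpty())
    })
})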

Next Steps

  • Step 3: Migrate bash test scenarios to Go using these helpers
  • Step 4: Update CI/CD pipelines to run Go-based e2e tests

@karthikvetrivel force-pushed the refactor/e2e-migration branch 2 times, most recently from 0313d88 to 1b4d24e on October 15, 2025 at 18:31
@rajathagasthya (Contributor) left a comment

LGTM. Thanks for addressing the comments!

@karthikvetrivel (Member, Author) commented

Moving this to draft until I add tests to keep the github.com/NVIDIA/gpu-operator dep (by referencing it in the e2e/tests code).

@karthikvetrivel marked this pull request as draft on October 17, 2025 at 21:09
@karthikvetrivel force-pushed the refactor/e2e-migration branch 4 times, most recently from 99feca6 to 10ad3e0 on October 20, 2025 at 04:16
@karthikvetrivel marked this pull request as ready for review on October 20, 2025 at 12:50
@rajathagasthya (Contributor) commented

@karthikvetrivel Could you update the helm.sh/helm/v3 dependency to address the Dependabot alerts in https://github.com/NVIDIA/gpu-operator/security/dependabot? go get helm.sh/helm/v3@latest should work.

@karthikvetrivel (Member, Author) commented

@rajathagasthya This would require updating the main module's Kubernetes dependencies from v0.33.2 to v0.34.0. Should we still do it?

@rajathagasthya (Contributor) commented

@tariq1890 updated it recently in #1805, so we should be able to contain these changes to just the e2e module.

@karthikvetrivel (Member, Author) commented

@rajathagasthya Fixed and amended. Thanks!

@coderabbitai (bot) commented on Oct 23, 2025

Important: Review skipped. Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@ArangoGutierrez (Collaborator) left a comment

Thanks @rajathagasthya. PR LGTM; I'll leave final approval to @tariq1890 and @cdesiniotis.

@karthikvetrivel (Member, Author) commented

@cdesiniotis @tariq1890 Bumping this PR again for final review when you guys have the chance.

@cdesiniotis (Contributor) left a comment

Thanks @karthikvetrivel! I left a few comments, most are not blockers.

Comment on lines +75 to +76
if daemonSet.Status.NumberReady == daemonSet.Status.DesiredNumberScheduled &&
    daemonSet.Status.NumberReady > 0 {

Just an observation -- we check for daemonset readiness in several places. Most notably:

func isDaemonSetReady(name string, n ClusterPolicyController) gpuv1.State {

func (s *stateSkel) isDaemonSetReady(uds *unstructured.Unstructured, reqLogger logr.Logger) (bool, error) {

It may be in our best interest to align all these implementations at some point and reuse the same helper (if it makes sense). This is obviously out of scope for this PR.

Comment on lines +98 to +100
if nvidiaDriver.Status.State == upgradeDoneState {
    return true, nil
}

I am not sure I am following this. The NVIDIADriver CR status will never enter this state.

By default, the upgrade of a GPU driver daemonset is facilitated by our driver upgrade controller. This is documented here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-upgrades.html. The controller currently uses the nvidia.com/gpu-driver-upgrade-state node label to track the state of the upgrade. The label value will be upgrade-done when the driver upgrade has completed successfully on a particular node.

I am assuming this helper was inspired by

wait_for_driver_upgrade_done() {
    gpu_node_count=$(kubectl get node -l nvidia.com/gpu.present --no-headers | wc -l)
    local current_time=0
    echo "waiting for the gpu driver upgrade to complete"
    while :; do
        local upgraded_count=0
        for node in $(kubectl get nodes -o NAME); do
            upgrade_state=$(kubectl get $node -ojsonpath='{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}')
            if [ "${upgrade_state}" = "upgrade-done" ]; then
                upgraded_count=$((${upgraded_count} + 1))
            fi
        done
        if [[ $upgraded_count -eq $gpu_node_count ]]; then
            echo "gpu driver upgrade completed successfully"
            break;
        else
            echo "gpu driver still in progress. $upgraded_count/$gpu_node_count node(s) upgraded"
        fi
        if [[ "${current_time}" -gt $((60 * 45)) ]]; then
            echo "timeout reached"
            exit 1;
        fi
        echo "current state of driver upgrade"
        kubectl get node -l nvidia.com/gpu.present \
            -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}{"\n"}{end}'
        echo "Sleeping 5 seconds"
        current_time=$((${current_time} + 5))
        sleep 5
    done
}
?
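
A Go equivalent keyed off that node label might look roughly like the sketch below. The helper name and wiring are assumptions; the condition simply mirrors the bash loop.

package helpers

import (
    "context"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
)

// WaitForDriverUpgradeDone polls until every node labeled nvidia.com/gpu.present
// reports nvidia.com/gpu-driver-upgrade-state=upgrade-done, mirroring the bash loop above.
func WaitForDriverUpgradeDone(ctx context.Context, client kubernetes.Interface, interval, timeout time.Duration) error {
    return wait.PollUntilContextTimeout(ctx, interval, timeout, true,
        func(ctx context.Context) (bool, error) {
            nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{
                LabelSelector: "nvidia.com/gpu.present",
            })
            if err != nil {
                return false, err
            }
            if len(nodes.Items) == 0 {
                // No GPU nodes discovered yet; keep waiting.
                return false, nil
            }
            for _, node := range nodes.Items {
                if node.Labels["nvidia.com/gpu-driver-upgrade-state"] != "upgrade-done" {
                    return false, nil
                }
            }
            return true, nil
        })
}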

func (h *WorkloadClient) DeployGPUPod(ctx context.Context, namespace string, podSpec *corev1.Pod) (*corev1.Pod, error) {

nit: Technically the pod passed in does not need to be a GPU workload. Would renaming this function to DeployPod be a better fit?

Comment on lines +115 to +116
if !strings.Contains(logs, "NVIDIA") && !strings.Contains(logs, "GPU") {
    return fmt.Errorf("pod logs do not contain evidence of GPU access")

I am okay with this for now, but we probably want to improve upon this in future iterations. E.g. this method could exec into the main container and invoke nvidia-smi itself.
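
For reference, such an exec-based check could take roughly this shape with client-go's remotecommand package. This is a sketch only; how the WorkloadClient carries its clientset and rest.Config is an assumption.

package helpers

import (
    "bytes"
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/kubernetes/scheme"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/remotecommand"
)

// execNvidiaSMI execs into the given container and runs nvidia-smi, instead of
// grepping pod logs for "NVIDIA"/"GPU". A non-nil error means the check failed.
func execNvidiaSMI(ctx context.Context, client kubernetes.Interface, config *rest.Config, namespace, pod, container string) error {
    req := client.CoreV1().RESTClient().Post().
        Resource("pods").Namespace(namespace).Name(pod).SubResource("exec").
        VersionedParams(&corev1.PodExecOptions{
            Container: container,
            Command:   []string{"nvidia-smi"},
            Stdout:    true,
            Stderr:    true,
        }, scheme.ParameterCodec)

    exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
    if err != nil {
        return fmt.Errorf("creating executor: %w", err)
    }

    var stdout, stderr bytes.Buffer
    if err := exec.StreamWithContext(ctx, remotecommand.StreamOptions{Stdout: &stdout, Stderr: &stderr}); err != nil {
        return fmt.Errorf("nvidia-smi failed: %w (stderr: %s)", err, stderr.String())
    }
    return nil
}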

Containers: []corev1.Container{
    {
        Name:  "gpu-test",
        Image: "nvidia/cuda:12.0.0-base-ubuntu22.04",

nit: let's pull this image from nvcr.io and use a more recent version.
