Refactor e2e test infrastructure with unified helpers leveraging operator clientsets #1788
Conversation
Force-pushed from 0313d88 to 1b4d24e
rajathagasthya left a comment:
LGTM. Thanks for addressing the comments!
Force-pushed from 1b4d24e to 31a8207
Moving this to draft until I add tests to keep the github.com/NVIDIA/gpu-operator dep (by referencing it in the e2e/tests code).
Force-pushed from 99feca6 to 10ad3e0
@karthikvetrivel Could you update …

@rajathagasthya This would require updating the main module's Kubernetes dependencies from v0.33.2 to v0.34.0. Should we still do it?

@tariq1890 updated it recently in #1805, so we should be able to contain these changes to just the e2e module.
Force-pushed from 10ad3e0 to 70b967b
@rajathagasthya Fixed and amended. Thanks!
Force-pushed from 70b967b to 4c82446
ArangoGutierrez left a comment:
Thanks @rajathagasthya, PR LGTM. I'll leave final approval to @tariq1890 and @cdesiniotis.
Commit: …operator clientsets (Signed-off-by: Karthik Vetrivel <[email protected]>)
Force-pushed from 4c82446 to bb2d254
@cdesiniotis @tariq1890 Bumping this PR again for final review when you get the chance.
cdesiniotis left a comment:
Thanks @karthikvetrivel! I left a few comments, most are not blockers.
if daemonSet.Status.NumberReady == daemonSet.Status.DesiredNumberScheduled &&
	daemonSet.Status.NumberReady > 0 {
Just an observation -- we check for daemonset readiness in several places. Most notably:
gpu-operator/controllers/object_controls.go, line 3763 in 1c39fd9:
func isDaemonSetReady(name string, n ClusterPolicyController) gpuv1.State {

gpu-operator/internal/state/state_skel.go, line 416 in 1c39fd9:
func (s *stateSkel) isDaemonSetReady(uds *unstructured.Unstructured, reqLogger logr.Logger) (bool, error) {
It may be in our best interest to align all these implementations at some point and reuse the same helper (if it makes sense). This is obviously out of scope for this PR.
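For illustration, such a shared check could be as small as the sketch below. This is only a hedged sketch and not part of this PR: the package name and function placement are assumptions, and it simply consolidates the status-field comparison shown above.

// Hypothetical shared helper; package name and placement are illustrative only.
package daemonset

import appsv1 "k8s.io/api/apps/v1"

// IsReady reports whether all desired pods are scheduled, ready, and up to date,
// and that the DaemonSet is expected to run at least one pod.
func IsReady(ds *appsv1.DaemonSet) bool {
	return ds.Status.DesiredNumberScheduled > 0 &&
		ds.Status.NumberReady == ds.Status.DesiredNumberScheduled &&
		ds.Status.UpdatedNumberScheduled == ds.Status.DesiredNumberScheduled &&
		ds.Status.NumberUnavailable == 0
}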
if nvidiaDriver.Status.State == upgradeDoneState {
	return true, nil
}
I am not sure I am following this. The NVIDIADriver CR status will never enter this state.
By default, the upgrade of a GPU driver daemonset is facilitated by our driver upgrade controller. This is documented here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-upgrades.html. The controller currently uses the nvidia.com/gpu-driver-upgrade-state node label to track the state of the upgrade. The label value will be upgrade-done when the driver upgrade has completed successfully on a particular node.
I am assuming this helper was inspired by
gpu-operator/tests/scripts/checks.sh
Lines 203 to 236 in 1c39fd9
wait_for_driver_upgrade_done() {
  gpu_node_count=$(kubectl get node -l nvidia.com/gpu.present --no-headers | wc -l)
  local current_time=0
  echo "waiting for the gpu driver upgrade to complete"
  while :; do
    local upgraded_count=0
    for node in $(kubectl get nodes -o NAME); do
      upgrade_state=$(kubectl get $node -ojsonpath='{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}')
      if [ "${upgrade_state}" = "upgrade-done" ]; then
        upgraded_count=$((${upgraded_count} + 1))
      fi
    done
    if [[ $upgraded_count -eq $gpu_node_count ]]; then
      echo "gpu driver upgrade completed successfully"
      break;
    else
      echo "gpu driver still in progress. $upgraded_count/$gpu_node_count node(s) upgraded"
    fi
    if [[ "${current_time}" -gt $((60 * 45)) ]]; then
      echo "timeout reached"
      exit 1;
    fi
    echo "current state of driver upgrade"
    kubectl get node -l nvidia.com/gpu.present \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}{"\n"}{end}'
    echo "Sleeping 5 seconds"
    current_time=$((${current_time} + 5))
    sleep 5
  done
}
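A Go version of that label-based check, as an alternative to reading the NVIDIADriver CR status, might look roughly like the sketch below. It is an assumption-laden sketch (the function name, package, and error handling are illustrative), using only standard client-go calls.

// Sketch: mirror wait_for_driver_upgrade_done by inspecting node labels.
package helpers

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// allNodesUpgradeDone returns true once every GPU node carries
// nvidia.com/gpu-driver-upgrade-state=upgrade-done.
func allNodesUpgradeDone(ctx context.Context, c kubernetes.Interface) (bool, error) {
	nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{
		LabelSelector: "nvidia.com/gpu.present",
	})
	if err != nil {
		return false, err
	}
	for _, node := range nodes.Items {
		if node.Labels["nvidia.com/gpu-driver-upgrade-state"] != "upgrade-done" {
			return false, nil
		}
	}
	return len(nodes.Items) > 0, nil
}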
func (h *WorkloadClient) DeployGPUPod(ctx context.Context, namespace string, podSpec *corev1.Pod) (*corev1.Pod, error) {
nit: Technically the pod passed in does not need to be a GPU workload. Would renaming this function to DeployPod be a better fit?
if !strings.Contains(logs, "NVIDIA") && !strings.Contains(logs, "GPU") {
	return fmt.Errorf("pod logs do not contain evidence of GPU access")
I am okay with this for now, but we probably want to improve upon this in future iterations. E.g. this method could exec into the main container and invoke nvidia-smi itself.
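One possible shape for that future iteration is sketched below, under assumptions: the helper name, the wiring of the rest.Config, and the use of the pod's first container are all illustrative. It execs nvidia-smi in the pod via client-go's remotecommand package and returns the output for inspection.

// Sketch: exec `nvidia-smi` inside the pod instead of grepping logs.
package helpers

import (
	"bytes"
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// execNvidiaSMI runs nvidia-smi in the pod's first container and returns stdout.
func execNvidiaSMI(ctx context.Context, cfg *rest.Config, cs kubernetes.Interface, pod *corev1.Pod) (string, error) {
	req := cs.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace(pod.Namespace).
		Name(pod.Name).
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: pod.Spec.Containers[0].Name,
			Command:   []string{"nvidia-smi"},
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	executor, err := remotecommand.NewSPDYExecutor(cfg, "POST", req.URL())
	if err != nil {
		return "", err
	}
	var stdout, stderr bytes.Buffer
	if err := executor.StreamWithContext(ctx, remotecommand.StreamOptions{
		Stdout: &stdout,
		Stderr: &stderr,
	}); err != nil {
		return "", err
	}
	return stdout.String(), nil
}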
Containers: []corev1.Container{
	{
		Name:  "gpu-test",
		Image: "nvidia/cuda:12.0.0-base-ubuntu22.04",
nit: let's pull this image from nvcr.io and use a more recent version.
This PR is the second step for migrating GPU Operator e2e tests from bash to Go/Ginkgo.
Overview
This PR introduces a comprehensive set of helper utilities that leverage the GPU operator's generated Go clientsets (api/versioned).

Note: This PR contains only the helper function infrastructure. Actual test implementations using these helpers will be added in a follow-up PR.
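For context, constructing and calling that generated clientset looks roughly like the sketch below. This is not taken from the PR; the NvidiaV1() group accessor and the kubeconfig path are assumptions used for illustration.

package main

import (
	"context"
	"fmt"

	nvclient "github.com/NVIDIA/gpu-operator/api/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a rest.Config from a local kubeconfig (path is an example).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	// NewForConfig is the standard constructor produced by client-gen.
	client, err := nvclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// NvidiaV1() is assumed to be the generated accessor for the nvidia.com/v1 group.
	cps, err := client.NvidiaV1().ClusterPolicies().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("found %d ClusterPolicy object(s)\n", len(cps.Items))
}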
Changes
1. Helper Infrastructure (tests/e2e/helpers/)
Created comprehensive helper utilities (a hedged usage sketch follows this list):
- operator.go - Helm client for deploying GPU Operator
- clusterpolicy.go - ClusterPolicy CRUD + component toggles (DCGM, GFD, etc.)
- nvidiadriver.go - NVIDIADriver CR operations (cluster-scoped)
- daemonset.go - DaemonSet queries and readiness checks
- node.go - Node labeling operations
- workload.go - GPU workload deployment and verification
- pod.go - Pod operations and namespace management
Structure:
- operator/helm.go → operator.go, kubernetes/pod.go → pod.go
- gpu_operator_test.go
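The sketch below shows how a test might use these helpers. The constructor and method names (NewDaemonSetClient, IsReady, VerifyGPUAccess) and the suite-level k8sClient, namespace, and gpuPod values are illustrative guesses rather than the actual helper API; only DeployGPUPod appears in the diff.

// Hypothetical Ginkgo test using the helpers; names are illustrative only.
package e2e

import (
	"context"
	"time"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"

	"github.com/NVIDIA/gpu-operator/tests/e2e/helpers"
)

// Assumed to be initialized in the suite's BeforeSuite; declared here only so
// the sketch is self-contained.
var (
	k8sClient kubernetes.Interface
	namespace = "gpu-operator-e2e-test"
	gpuPod    *corev1.Pod
)

var _ = Describe("GPU Operator", func() {
	It("deploys and verifies a GPU workload", func(ctx context.Context) {
		dsClient := helpers.NewDaemonSetClient(k8sClient) // daemonset.go (name assumed)
		wlClient := helpers.NewWorkloadClient(k8sClient)  // workload.go (name assumed)

		// Wait for a component DaemonSet to become ready.
		Eventually(func() (bool, error) {
			return dsClient.IsReady(ctx, namespace, "nvidia-dcgm-exporter")
		}).WithTimeout(10 * time.Minute).WithPolling(5 * time.Second).Should(BeTrue())

		// Deploy a GPU pod and verify it can access the GPU.
		pod, err := wlClient.DeployGPUPod(ctx, namespace, gpuPod)
		Expect(err).NotTo(HaveOccurred())
		Expect(wlClient.VerifyGPUAccess(ctx, pod)).To(Succeed())
	})
})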
2. Constants (tests/e2e/constants.go)
Added shared constants (see the sketch after this list for where they plug in):
- DefaultPollingInterval - Standard 5s interval for wait operations
- UpgradeDoneState - Driver upgrade completion state
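As one example of where DefaultPollingInterval would plug in, a wait helper might look like the hedged sketch below; the helper itself is hypothetical and assumes it lives in the same package as tests/e2e/constants.go.

// Hypothetical wait helper built around DefaultPollingInterval.
package e2e

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDaemonSetReady polls every DefaultPollingInterval (defined in
// constants.go) until the DaemonSet reports all desired pods ready, or the
// timeout expires.
func waitForDaemonSetReady(ctx context.Context, c kubernetes.Interface, ns, name string, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, DefaultPollingInterval, timeout, true,
		func(ctx context.Context) (bool, error) {
			ds, err := c.AppsV1().DaemonSets(ns).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, nil // tolerate transient errors and not-yet-created DaemonSets
			}
			return ds.Status.DesiredNumberScheduled > 0 &&
				ds.Status.NumberReady == ds.Status.DesiredNumberScheduled, nil
		})
}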
3. Updated Existing Tests
Modified gpu_operator_test.go to use the new helper package structure:
- operator.Client → helpers.OperatorClient
- k8stest.Client → helpers.PodClient

Next Steps