-
Notifications
You must be signed in to change notification settings - Fork 437
[no-relnote] Add E2E for libnvidia-container #1118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pull Request Test Coverage Report for Build 16831774091Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
67cc2ec to
903737e
Compare
d0a338e to
8c42c14
Compare
|
Tests pass, PR ready for review @elezar |
8c42c14 to
d905a49
Compare
| docker run -d --name test-nvidia-container-cli \ | ||
| --privileged \ | ||
| --runtime=nvidia \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not scalable to have to MOUNT everything into this container. Note that when we still had some simple integration tests in the toolkit we used
| testing::docker::dind::setup() { |
Can we rather adapt this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, will work on that. we need to make this test more robust and scalable
d905a49 to
9674787
Compare
9674787 to
1a72738
Compare
1a72738 to
f84c038
Compare
tests/e2e/installer.go
Outdated
| # Create a temporary directory | ||
| TEMP_DIR="/tmp/ctk_e2e.$(date +%s)_$RANDOM" | ||
| mkdir -p "$TEMP_DIR" | ||
| : ${IMAGE:={{.Image}}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Why did we swap the ordering of these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a general note to the scripts -- why are we using envvars at all and don't we just use {{.Image}} everywhere? Is there a case where IMAGE is already set to something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
switched back
| var ( | ||
| runner Runner | ||
| testScript = "/tmp/libnvidia-container-cli.sh" | ||
| dockerImage = "ghcr.io/nvidia/container-toolkit:5e8c1411-ubuntu20.04" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we hardcoding this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, does this break once we switch to distroless?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hardcoded image was a mistake; it was an experimental feature introduced by me to iterate.
On distroless, it should work as we are setting the --entrypoint /libnvidia-container-cli.sh , so as long as the distroless supports /usr/bin/env bash scripts, this test should work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now edited to adjust for the distroless image
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On distroless, it should work as we are setting the --entrypoint /libnvidia-container-cli.sh , so as long as the distroless supports /usr/bin/env bash scripts, this test should work
Distroless does not support bash. It includes sh from busybox.
tests/e2e/e2e_test.go
Outdated
| imageName = getRequiredEnvvar[string]("E2E_IMAGE_NAME") | ||
| imageTag = getRequiredEnvvar[string]("E2E_IMAGE_TAG") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we removed the conditional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, regardless of if we want to install or not the toolkit on the host, I want to be able to get these 2 variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? What do we need the image for if we're not installing the toolkit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the test cases added in this PR, we need to pass the image into the runner container, so we can get the artifacts to install the toolkit and the nvidia-container-cli
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we handle the "test the locally installed toolkit" case?
| // script are therefore a good indicator of whether the NVIDIA Container | ||
| // Toolkit is functioning correctly inside the container. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not what we're testing. We're testing the nvidia-container-cli specifically.
| apt-get update -y && apt-get install -y curl gnupg2 | ||
| WORKDIR="$(mktemp -d)" | ||
| ROOTFS="${WORKDIR}/rootfs" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need two directories? What about:
| ROOTFS="${WORKDIR}/rootfs" | |
| ROOTFS="$(mktemp -d)/rootfs" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed, added
| var _ = Describe("nvidia-container-cli", Ordered, ContinueOnFailure, func() { | ||
| var ( | ||
| runner Runner | ||
| testScript = "/tmp/libnvidia-container-cli.sh" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we guaranteed a single script across all tests?
| testScript = "/tmp/libnvidia-container-cli.sh" | ||
| dockerImage = "ghcr.io/nvidia/container-toolkit:5e8c1411-ubuntu20.04" | ||
| containerName = "nvidia-cli-e2e" | ||
| dockerRunCmd string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a variable at this scope required?
b4cd062 to
6ae776e
Compare
6ae776e to
cd56671
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds an end-to-end test for libnvidia-container's nvidia-container-cli tool to catch regressions. The test validates that GPUs detected inside a container match those available on the host system.
- Introduces a new E2E test that sets up a containerized environment with Docker and nvidia-container-cli
- Modifies the installer template to support reusable temporary directories across tests
- Updates environment variable handling to make image name/tag always available
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/e2e/nvidia-container-cli_test.go | New E2E test file implementing the nvidia-container-cli validation test |
| tests/e2e/installer.go | Modified installer template to support persistent temporary directories and fix image variable usage |
| tests/e2e/e2e_test.go | Updated environment variable logic to always require image name/tag regardless of CTK installation flag |
| tests/e2e/Makefile | Added GINKGO_FOCUS parameter support for selective test execution |
Comments suppressed due to low confidence (1)
tests/e2e/nvidia-container-cli_test.go:156
- The error from removing a potentially non-existent container is ignored. Consider checking if the container exists first or handling the expected error case when the container doesn't exist.
_, _, err = runner.Run(fmt.Sprintf("docker rm -f %s", containerName))
cd56671 to
55205e8
Compare
|
@elezar PTAL RFR |
| ) | ||
|
|
||
| const ( | ||
| installDockerTemplate = `docker exec -u root {{.ContainerName}} bash -c ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not have the script be the "template" directly? i.e. not include the docker exec part? In most cases this would mean that the script is used as is and does not even require templating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that, but then the fmt.SprintF("docker exec %s", template.String()) would run into issues, failing, complaining that line 1 didn't have EOL.
SO I decided to have self-contained templates that we can simply execute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I got what I was doing wrong, now is not a template
| IN_NS | ||
| '` | ||
|
|
||
| dockerRunCmdTemplate = `docker run -d --name {{.ContainerName}} --privileged --runtime=nvidia \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also don't think this needs to be a template.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
55205e8 to
1450620
Compare
| ) | ||
|
|
||
| const ( | ||
| installDockerTemplate = ` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this is now a script and not a template.
|
|
||
| It("should report the same GPUs inside the container as on the host", func(ctx context.Context) { | ||
| // Launch the container in detached mode. | ||
| _, _, err := runner.Run(dockerRunCmdTemplate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really at the point where I want to block this, but what would be good would be if we could return a Runner when starting this container. We can then replace runner.Run("docker exec ... ") commands bellow with calls to this runner that JUST accept the script that we want to run. This could be done in a follow up though.
| err = tmpl.Execute(&toolkitInstall, struct { | ||
| ToolkitImage string | ||
| }{ | ||
| ToolkitImage: imageName + ":" + imageTag, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a note: This makes LOCAL tests difficult if one doesn't have the image built. Should we still make imageName and imageTag optional and skip this test if these are not set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated this PR with basic functionality to allow local tests. Here we mount the nvidia-container-cli and related libraries from the host instead of installing them from an image.
| Expect(err).ToNot(HaveOccurred()) | ||
|
|
||
| // Run the test script in the container. | ||
| // Capture but don't fail on errors - we'll check the results via container logs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean don't fail on errors? I think this comment might be out of date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this.
5c258a1 to
8c4dcd7
Compare
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
8c4dcd7 to
718fe70
Compare
This patch adds an E2E test for the nvidia-container-cli that will allow us to catch regressions on libnvidia-container