
Conversation

@ArangoGutierrez
Collaborator

@ArangoGutierrez ArangoGutierrez commented May 30, 2025

This patch adds an E2E test for nvidia-container-cli that will allow us to catch regressions in libnvidia-container.

@ArangoGutierrez ArangoGutierrez requested review from Copilot and elezar May 30, 2025 14:35
@ArangoGutierrez ArangoGutierrez self-assigned this May 30, 2025


@coveralls

coveralls commented May 30, 2025

Pull Request Test Coverage Report for Build 16831774091

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 35.729%

Totals Coverage Status
Change from base Build 16830975215: 0.0%
Covered Lines: 4583
Relevant Lines: 12827

💛 - Coveralls

@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from 67cc2ec to 903737e Compare May 30, 2025 15:32
@ArangoGutierrez ArangoGutierrez requested a review from elezar May 30, 2025 15:32
@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch 5 times, most recently from d0a338e to 8c42c14 Compare June 3, 2025 18:51
@ArangoGutierrez
Collaborator Author

Tests pass, PR ready for review @elezar

@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from 8c42c14 to d905a49 Compare June 4, 2025 08:13
Comment on lines 31 to 33
docker run -d --name test-nvidia-container-cli \
--privileged \
--runtime=nvidia \
Member

It is not scalable to have to mount everything into this container. Note that when we still had some simple integration tests in the toolkit, we used:

testing::docker::dind::setup() {

Can we rather adapt this?
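
For reference, a minimal sketch of what an adapted docker-in-docker setup could look like from the Go E2E suite. The helper name, the docker:dind image, and the Runner.Run signature are assumptions; the original testing::docker::dind::setup() body is not shown in this thread.

package e2e

import "fmt"

// setupDinD is a hypothetical adaptation of the old testing::docker::dind::setup()
// helper: it starts a privileged docker:dind sidecar through the suite's Runner
// (whose Run signature is assumed from the snippets below) instead of bind-mounting
// host paths into the test container.
func setupDinD(runner Runner, name string) error {
	// --privileged is required so that dockerd can start inside the container.
	_, _, err := runner.Run(fmt.Sprintf(
		"docker run -d --name %s --privileged --runtime=nvidia docker:dind", name))
	if err != nil {
		return err
	}
	// Poll until the inner docker daemon answers; the 60s timeout is illustrative.
	_, _, err = runner.Run(fmt.Sprintf(
		"timeout 60 bash -c 'until docker exec %s docker info >/dev/null 2>&1; do sleep 2; done'", name))
	return err
}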

Collaborator Author

Sure, I will work on that. We need to make this test more robust and scalable.

@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from d905a49 to 9674787 Compare June 5, 2025 16:18
@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from 9674787 to 1a72738 Compare July 16, 2025 14:03

This comment was marked as outdated.

@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from 1a72738 to f84c038 Compare July 16, 2025 15:27
# Create a temporary directory
TEMP_DIR="/tmp/ctk_e2e.$(date +%s)_$RANDOM"
mkdir -p "$TEMP_DIR"
: ${IMAGE:={{.Image}}}
Member

Nit: Why did we swap the ordering of these?

Member

As a general note on the scripts: why are we using envvars at all instead of just using {{.Image}} everywhere? Is there a case where IMAGE is already set to something else?

Collaborator Author

Switched back.

var (
runner Runner
testScript = "/tmp/libnvidia-container-cli.sh"
dockerImage = "ghcr.io/nvidia/container-toolkit:5e8c1411-ubuntu20.04"
Member

Why are we hardcoding this?

Member

Also, does this break once we switch to distroless?

Collaborator Author

The hardcoded image was a mistake; I introduced it temporarily so that I could iterate faster.

On distroless it should work, since we are setting --entrypoint /libnvidia-container-cli.sh. As long as the distroless image supports /usr/bin/env bash scripts, this test should work.

Collaborator Author

I have now adjusted this for the distroless image.

Member

> On distroless it should work, since we are setting --entrypoint /libnvidia-container-cli.sh. As long as the distroless image supports /usr/bin/env bash scripts, this test should work.

Distroless does not support bash. It includes sh from busybox.
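
If the script has to run on the distroless image, one option is to invoke it through busybox sh explicitly rather than relying on a bash shebang. A minimal sketch, assuming the script is POSIX-sh compatible and baked into the image at /libnvidia-container-cli.sh; the function name and flags are illustrative, not code from this PR.

package e2e

import "fmt"

// runCLIScript overrides the entrypoint with busybox sh so the test script does
// not depend on bash, which the distroless image does not ship. Hypothetical
// sketch; assumes the script itself avoids bash-only constructs.
func runCLIScript(runner Runner, image string) error {
	_, _, err := runner.Run(fmt.Sprintf(
		"docker run --rm --runtime=nvidia --entrypoint /bin/sh %s /libnvidia-container-cli.sh", image))
	return err
}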

Comment on lines 65 to 66
imageName = getRequiredEnvvar[string]("E2E_IMAGE_NAME")
imageTag = getRequiredEnvvar[string]("E2E_IMAGE_TAG")
Member

Is there a reason we removed the conditional?

Collaborator Author

Yes, regardless of whether we want to install the toolkit on the host, I want to be able to get these two variables.

Member

Why? What do we need the image for if we're not installing the toolkit?

Collaborator Author

For the test cases added in this PR, we need to pass the image into the runner container so that we can get the artifacts needed to install the toolkit and nvidia-container-cli.

Member

How do we handle the "test the locally installed toolkit" case?
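
For context, one way the "locally installed toolkit" case could be handled is to treat the image coordinates as optional and skip the image-based test when they are not set. This is a sketch of that idea only, not code from this PR; the Describe block and Skip wiring are assumptions.

package e2e

import (
	"os"

	. "github.com/onsi/ginkgo/v2"
)

// Hypothetical: skip the image-based nvidia-container-cli test when no toolkit
// image is provided, so a purely local run against an installed toolkit still works.
var _ = Describe("nvidia-container-cli (image-based)", Ordered, func() {
	BeforeAll(func() {
		if os.Getenv("E2E_IMAGE_NAME") == "" || os.Getenv("E2E_IMAGE_TAG") == "" {
			Skip("E2E_IMAGE_NAME / E2E_IMAGE_TAG not set; skipping image-based test")
		}
	})
	// ... test cases as in this PR ...
})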

Comment on lines 32 to 33
// script are therefore a good indicator of whether the NVIDIA Container
// Toolkit is functioning correctly inside the container.
Member

This is not what we're testing. We're testing the nvidia-container-cli specifically.

apt-get update -y && apt-get install -y curl gnupg2
WORKDIR="$(mktemp -d)"
ROOTFS="${WORKDIR}/rootfs"
Member

Why do we need two directories? What about:

Suggested change
- ROOTFS="${WORKDIR}/rootfs"
+ ROOTFS="$(mktemp -d)/rootfs"

Collaborator Author

Agreed; added.

var _ = Describe("nvidia-container-cli", Ordered, ContinueOnFailure, func() {
var (
runner Runner
testScript = "/tmp/libnvidia-container-cli.sh"
Member

Are we guaranteed a single script across all tests?

testScript = "/tmp/libnvidia-container-cli.sh"
dockerImage = "ghcr.io/nvidia/container-toolkit:5e8c1411-ubuntu20.04"
containerName = "nvidia-cli-e2e"
dockerRunCmd string
Member

Is a variable at this scope required?

@ArangoGutierrez ArangoGutierrez requested a review from Copilot July 25, 2025 13:49


@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch 4 times, most recently from b4cd062 to 6ae776e Compare July 29, 2025 09:57


@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from 6ae776e to cd56671 Compare July 29, 2025 13:45
@ArangoGutierrez ArangoGutierrez requested a review from Copilot July 29, 2025 13:46
Copilot AI left a comment

Pull Request Overview

This PR adds an end-to-end test for libnvidia-container's nvidia-container-cli tool to catch regressions. The test validates that GPUs detected inside a container match those available on the host system.

  • Introduces a new E2E test that sets up a containerized environment with Docker and nvidia-container-cli
  • Modifies the installer template to support reusable temporary directories across tests
  • Updates environment variable handling to make image name/tag always available

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
tests/e2e/nvidia-container-cli_test.go: New E2E test file implementing the nvidia-container-cli validation test
tests/e2e/installer.go: Modified installer template to support persistent temporary directories and fix image variable usage
tests/e2e/e2e_test.go: Updated environment variable logic to always require image name/tag regardless of CTK installation flag
tests/e2e/Makefile: Added GINKGO_FOCUS parameter support for selective test execution
Comments suppressed due to low confidence (1)

tests/e2e/nvidia-container-cli_test.go:156

  • The error from removing a potentially non-existent container is ignored. Consider checking if the container exists first or handling the expected error case when the container doesn't exist.
		_, _, err = runner.Run(fmt.Sprintf("docker rm -f %s", containerName))
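
A sketch of what that handling could look like, reusing the Runner call from the flagged line; the exact "No such container" stderr text is an assumption about docker's error message, not behavior verified in this PR.

package e2e

import (
	"fmt"
	"strings"
)

// removeContainerIfExists force-removes the test container but tolerates the
// error docker reports when the container was never created. Hypothetical helper.
func removeContainerIfExists(runner Runner, containerName string) error {
	_, stderr, err := runner.Run(fmt.Sprintf("docker rm -f %s", containerName))
	if err != nil && strings.Contains(stderr, "No such container") {
		return nil
	}
	return err
}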

@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from cd56671 to 55205e8 Compare August 4, 2025 10:29
@ArangoGutierrez
Collaborator Author

@elezar PTAL RFR

)

const (
installDockerTemplate = `docker exec -u root {{.ContainerName}} bash -c '
Member

Why not have the script be the "template" directly? i.e. not include the docker exec part? In most cases this would mean that the script is used as is and does not even require templating.

Collaborator Author

@ArangoGutierrez ArangoGutierrez Aug 4, 2025

I tried that, but then fmt.Sprintf("docker exec %s", template.String()) would run into issues, failing with a complaint that line 1 didn't have an EOL.

So I decided to have self-contained templates that we can simply execute.

Collaborator Author

OK, I see what I was doing wrong; it is no longer a template.
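
For reference, the separation being discussed could look roughly like this: keep the script as a plain constant and add the docker exec wrapper at call time. A sketch only; the script body and quoting approach are illustrative, and a script containing single quotes would instead need to be copied into the container and executed from a file.

package e2e

import "fmt"

// installDockerScript is a plain script rather than a template: it contains no
// {{.ContainerName}} placeholder and can be executed as-is.
const installDockerScript = `
apt-get update -y && apt-get install -y curl gnupg2
# ... remaining installation steps ...
`

// runScriptInContainer wraps the script in the docker exec invocation at call time.
func runScriptInContainer(runner Runner, containerName, script string) error {
	_, _, err := runner.Run(fmt.Sprintf("docker exec -u root %s bash -c '%s'", containerName, script))
	return err
}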

IN_NS
'`

dockerRunCmdTemplate = `docker run -d --name {{.ContainerName}} --privileged --runtime=nvidia \
Member

I also don't think this needs to be a template.

Collaborator Author

Done

@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from 55205e8 to 1450620 Compare August 4, 2025 14:25
@ArangoGutierrez ArangoGutierrez requested a review from elezar August 4, 2025 14:25
)

const (
installDockerTemplate = `
Member

Nit: this is now a script and not a template.


It("should report the same GPUs inside the container as on the host", func(ctx context.Context) {
// Launch the container in detached mode.
_, _, err := runner.Run(dockerRunCmdTemplate)
Member

I'm not at the point where I want to block this, but it would be good if we could return a Runner when starting this container. We could then replace the runner.Run("docker exec ...") commands below with calls to this runner that just accept the script we want to run. This could be done in a follow-up, though.
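
A rough sketch of that follow-up idea; the type name is hypothetical and the Runner method signature is assumed from the calls quoted in this PR.

package e2e

import "fmt"

// containerRunner binds a Runner to a running container so that call sites
// only pass the command they want executed inside it. Hypothetical follow-up.
type containerRunner struct {
	host          Runner
	containerName string
}

// Run executes the given command inside the container via docker exec.
func (r containerRunner) Run(cmd string) (string, string, error) {
	return r.host.Run(fmt.Sprintf("docker exec -u root %s bash -c '%s'", r.containerName, cmd))
}

Starting the container would then return a containerRunner, and the existing runner.Run("docker exec ...") calls could become ctr.Run(script).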

err = tmpl.Execute(&toolkitInstall, struct {
ToolkitImage string
}{
ToolkitImage: imageName + ":" + imageTag,
Member

As a note: This makes LOCAL tests difficult if one doesn't have the image built. Should we still make imageName and imageTag optional and skip this test if these are not set?

Member

I have updated this PR with basic functionality to allow local tests. Here we mount the nvidia-container-cli and related libraries from the host instead of installing them from an image.
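
For readers running locally, the host-mount approach could look roughly like the following docker run flags. This is a sketch only; the exact host paths are assumptions (typical Ubuntu locations), not necessarily the paths used in the final change.

package e2e

// localCLIMounts is a hypothetical set of flags for the "local toolkit" case:
// mount nvidia-container-cli and its library from the host instead of extracting
// them from the toolkit image. Host paths are assumed Ubuntu defaults.
const localCLIMounts = "-v /usr/bin/nvidia-container-cli:/usr/bin/nvidia-container-cli:ro " +
	"-v /usr/lib/x86_64-linux-gnu/libnvidia-container.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1:ro"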

Expect(err).ToNot(HaveOccurred())

// Run the test script in the container.
// Capture but don't fail on errors - we'll check the results via container logs.
Member

What do you mean don't fail on errors? I think this comment might be out of date.

Member

I updated this.

@elezar elezar force-pushed the e2e/nvidia-container-cli branch 2 times, most recently from 5c258a1 to 8c4dcd7 Compare August 8, 2025 12:53
@elezar elezar force-pushed the e2e/nvidia-container-cli branch from 8c4dcd7 to 718fe70 Compare August 8, 2025 13:34
@ArangoGutierrez ArangoGutierrez requested a review from Copilot August 8, 2025 13:42
@elezar elezar merged commit 4507575 into NVIDIA:main Aug 8, 2025
16 checks passed