
Conversation

@ArangoGutierrez
Collaborator

@ArangoGutierrez ArangoGutierrez commented May 30, 2025

This patch adds an E2E test for nvidia-container-cli that will allow us to catch regressions in libnvidia-container.

@ArangoGutierrez ArangoGutierrez requested review from Copilot and elezar May 30, 2025 14:35
@ArangoGutierrez ArangoGutierrez self-assigned this May 30, 2025


@coveralls

coveralls commented May 30, 2025

Pull Request Test Coverage Report for Build 16831774091

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 35.729%

Totals Coverage Status
Change from base Build 16830975215: 0.0%
Covered Lines: 4583
Relevant Lines: 12827

💛 - Coveralls

@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from 67cc2ec to 903737e Compare May 30, 2025 15:32
@ArangoGutierrez ArangoGutierrez requested a review from elezar May 30, 2025 15:32
@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch 5 times, most recently from d0a338e to 8c42c14 Compare June 3, 2025 18:51
@ArangoGutierrez
Collaborator Author

Tests pass, PR ready for review @elezar

@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from 8c42c14 to d905a49 Compare June 4, 2025 08:13
Comment on lines 31 to 33
docker run -d --name test-nvidia-container-cli \
--privileged \
--runtime=nvidia \
Member

It is not scalable to have to mount everything into this container. Note that when we still had some simple integration tests in the toolkit, we used:

testing::docker::dind::setup() {

Can we rather adapt this?
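
For reference, a minimal sketch of what an adapted docker-in-docker setup could look like from the Go E2E suite. The helper name, the docker:dind image, and the Runner.Run signature are assumptions; the original testing::docker::dind::setup() body is not shown in this thread.

package e2e

import "fmt"

// setupDinD is a hypothetical adaptation of the old testing::docker::dind::setup()
// helper: it starts a privileged docker:dind sidecar through the suite's Runner
// (whose Run signature is assumed from the snippets below) instead of bind-mounting
// host paths into the test container.
func setupDinD(runner Runner, name string) error {
	// --privileged is required so that dockerd can start inside the container.
	_, _, err := runner.Run(fmt.Sprintf(
		"docker run -d --name %s --privileged --runtime=nvidia docker:dind", name))
	if err != nil {
		return err
	}
	// Poll until the inner docker daemon answers; the 60s timeout is illustrative.
	_, _, err = runner.Run(fmt.Sprintf(
		"timeout 60 bash -c 'until docker exec %s docker info >/dev/null 2>&1; do sleep 2; done'", name))
	return err
}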

Collaborator Author

Sure, I will work on that. We need to make this test more robust and scalable.

@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from d905a49 to 9674787 Compare June 5, 2025 16:18
@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from 9674787 to 1a72738 Compare July 16, 2025 14:03

This comment was marked as outdated.

@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from 1a72738 to f84c038 Compare July 16, 2025 15:27
# Create a temporary directory
TEMP_DIR="/tmp/ctk_e2e.$(date +%s)_$RANDOM"
mkdir -p "$TEMP_DIR"
: ${IMAGE:={{.Image}}}
Member

Nit: Why did we swap the ordering of these?

Member

As a general note on the scripts: why are we using envvars at all instead of just using {{.Image}} everywhere? Is there a case where IMAGE is already set to something else?

Collaborator Author

Switched back.

var (
runner Runner
testScript = "/tmp/libnvidia-container-cli.sh"
dockerImage = "ghcr.io/nvidia/container-toolkit:5e8c1411-ubuntu20.04"
Member

Why are we hardcoding this?

Member

Also, does this break once we switch to distroless?

Collaborator Author

The hardcoded image was a mistake; I introduced it temporarily so that I could iterate faster.

On distroless it should work, since we are setting --entrypoint /libnvidia-container-cli.sh. As long as the distroless image supports /usr/bin/env bash scripts, this test should work.

Collaborator Author

I have now adjusted this for the distroless image.

Member

> On distroless it should work, since we are setting --entrypoint /libnvidia-container-cli.sh. As long as the distroless image supports /usr/bin/env bash scripts, this test should work.

Distroless does not support bash. It includes sh from busybox.
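
If the script has to run on the distroless image, one option is to invoke it through busybox sh explicitly rather than relying on a bash shebang. A minimal sketch, assuming the script is POSIX-sh compatible and baked into the image at /libnvidia-container-cli.sh; the function name and flags are illustrative, not code from this PR.

package e2e

import "fmt"

// runCLIScript overrides the entrypoint with busybox sh so the test script does
// not depend on bash, which the distroless image does not ship. Hypothetical
// sketch; assumes the script itself avoids bash-only constructs.
func runCLIScript(runner Runner, image string) error {
	_, _, err := runner.Run(fmt.Sprintf(
		"docker run --rm --runtime=nvidia --entrypoint /bin/sh %s /libnvidia-container-cli.sh", image))
	return err
}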

Comment on lines 65 to 66
imageName = getRequiredEnvvar[string]("E2E_IMAGE_NAME")
imageTag = getRequiredEnvvar[string]("E2E_IMAGE_TAG")
Member

Is there a reason we removed the conditional?

Collaborator Author

Yes, regardless of whether we want to install the toolkit on the host, I want to be able to get these two variables.

Member

Why? What do we need the image for if we're not installing the toolkit?

Collaborator Author

For the test cases added in this PR, we need to pass the image into the runner container so that we can get the artifacts needed to install the toolkit and nvidia-container-cli.

Member

How do we handle the "test the locally installed toolkit" case?
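
For context, one way the "locally installed toolkit" case could be handled is to treat the image coordinates as optional and skip the image-based test when they are not set. This is a sketch of that idea only, not code from this PR; the Describe block and Skip wiring are assumptions.

package e2e

import (
	"os"

	. "github.com/onsi/ginkgo/v2"
)

// Hypothetical: skip the image-based nvidia-container-cli test when no toolkit
// image is provided, so a purely local run against an installed toolkit still works.
var _ = Describe("nvidia-container-cli (image-based)", Ordered, func() {
	BeforeAll(func() {
		if os.Getenv("E2E_IMAGE_NAME") == "" || os.Getenv("E2E_IMAGE_TAG") == "" {
			Skip("E2E_IMAGE_NAME / E2E_IMAGE_TAG not set; skipping image-based test")
		}
	})
	// ... test cases as in this PR ...
})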

Comment on lines 32 to 33
// script are therefore a good indicator of whether the NVIDIA Container
// Toolkit is functioning correctly inside the container.
Member

This is not what we're testing. We're testing the nvidia-container-cli specifically.

apt-get update -y && apt-get install -y curl gnupg2
WORKDIR="$(mktemp -d)"
ROOTFS="${WORKDIR}/rootfs"
Member

Why do we need two directories? What about:

Suggested change
- ROOTFS="${WORKDIR}/rootfs"
+ ROOTFS="$(mktemp -d)/rootfs"

Collaborator Author

Agreed; added.

var _ = Describe("nvidia-container-cli", Ordered, ContinueOnFailure, func() {
var (
runner Runner
testScript = "/tmp/libnvidia-container-cli.sh"
Member

Are we guaranteed a single script across all tests?

testScript = "/tmp/libnvidia-container-cli.sh"
dockerImage = "ghcr.io/nvidia/container-toolkit:5e8c1411-ubuntu20.04"
containerName = "nvidia-cli-e2e"
dockerRunCmd string
Member

Is a variable at this scope required?

@ArangoGutierrez ArangoGutierrez requested a review from Copilot July 25, 2025 13:49


@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch 4 times, most recently from b4cd062 to 6ae776e Compare July 29, 2025 09:57


@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from 6ae776e to cd56671 Compare July 29, 2025 13:45
@ArangoGutierrez ArangoGutierrez requested a review from Copilot July 29, 2025 13:46
Copilot AI left a comment

Pull Request Overview

This PR adds an end-to-end test for libnvidia-container's nvidia-container-cli tool to catch regressions. The test validates that GPUs detected inside a container match those available on the host system.

  • Introduces a new E2E test that sets up a containerized environment with Docker and nvidia-container-cli
  • Modifies the installer template to support reusable temporary directories across tests
  • Updates environment variable handling to make image name/tag always available

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
tests/e2e/nvidia-container-cli_test.go: New E2E test file implementing the nvidia-container-cli validation test
tests/e2e/installer.go: Modified installer template to support persistent temporary directories and fix image variable usage
tests/e2e/e2e_test.go: Updated environment variable logic to always require image name/tag regardless of CTK installation flag
tests/e2e/Makefile: Added GINKGO_FOCUS parameter support for selective test execution
Comments suppressed due to low confidence (1)

tests/e2e/nvidia-container-cli_test.go:156

  • The error from removing a potentially non-existent container is ignored. Consider checking if the container exists first or handling the expected error case when the container doesn't exist.
		_, _, err = runner.Run(fmt.Sprintf("docker rm -f %s", containerName))
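
A sketch of what that handling could look like, reusing the Runner call from the flagged line; the exact "No such container" stderr text is an assumption about docker's error message, not behavior verified in this PR.

package e2e

import (
	"fmt"
	"strings"
)

// removeContainerIfExists force-removes the test container but tolerates the
// error docker reports when the container was never created. Hypothetical helper.
func removeContainerIfExists(runner Runner, containerName string) error {
	_, stderr, err := runner.Run(fmt.Sprintf("docker rm -f %s", containerName))
	if err != nil && strings.Contains(stderr, "No such container") {
		return nil
	}
	return err
}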

@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from cd56671 to 55205e8 Compare August 4, 2025 10:29
@ArangoGutierrez
Collaborator Author

@elezar PTAL RFR

)

const (
installDockerTemplate = `docker exec -u root {{.ContainerName}} bash -c '
Member

Why not have the script be the "template" directly? i.e. not include the docker exec part? In most cases this would mean that the script is used as is and does not even require templating.

Collaborator Author

@ArangoGutierrez ArangoGutierrez Aug 4, 2025

I tried that, but then fmt.Sprintf("docker exec %s", template.String()) would run into issues, failing with a complaint that line 1 didn't have an EOL.

So I decided to have self-contained templates that we can simply execute.

Collaborator Author

OK, I see what I was doing wrong; it is no longer a template.
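
For reference, the separation being discussed could look roughly like this: keep the script as a plain constant and add the docker exec wrapper at call time. A sketch only; the script body and quoting approach are illustrative, and a script containing single quotes would instead need to be copied into the container and executed from a file.

package e2e

import "fmt"

// installDockerScript is a plain script rather than a template: it contains no
// {{.ContainerName}} placeholder and can be executed as-is.
const installDockerScript = `
apt-get update -y && apt-get install -y curl gnupg2
# ... remaining installation steps ...
`

// runScriptInContainer wraps the script in the docker exec invocation at call time.
func runScriptInContainer(runner Runner, containerName, script string) error {
	_, _, err := runner.Run(fmt.Sprintf("docker exec -u root %s bash -c '%s'", containerName, script))
	return err
}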

IN_NS
'`

dockerRunCmdTemplate = `docker run -d --name {{.ContainerName}} --privileged --runtime=nvidia \
Member

I also don't think this needs to be a template.

Collaborator Author

Done

@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from 55205e8 to 1450620 Compare August 4, 2025 14:25
@ArangoGutierrez ArangoGutierrez requested a review from elezar August 4, 2025 14:25
)

const (
installDockerTemplate = `
Member

Nit: this is now a script and not a template.


It("should report the same GPUs inside the container as on the host", func(ctx context.Context) {
// Launch the container in detached mode.
_, _, err := runner.Run(dockerRunCmdTemplate)
Member

I'm not at the point where I want to block this, but it would be good if we could return a Runner when starting this container. We could then replace the runner.Run("docker exec ...") commands below with calls to this runner that just accept the script we want to run. This could be done in a follow-up, though.
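
A rough sketch of that follow-up idea; the type name is hypothetical and the Runner method signature is assumed from the calls quoted in this PR.

package e2e

import "fmt"

// containerRunner binds a Runner to a running container so that call sites
// only pass the command they want executed inside it. Hypothetical follow-up.
type containerRunner struct {
	host          Runner
	containerName string
}

// Run executes the given command inside the container via docker exec.
func (r containerRunner) Run(cmd string) (string, string, error) {
	return r.host.Run(fmt.Sprintf("docker exec -u root %s bash -c '%s'", r.containerName, cmd))
}

Starting the container would then return a containerRunner, and the existing runner.Run("docker exec ...") calls could become ctr.Run(script).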

err = tmpl.Execute(&toolkitInstall, struct {
ToolkitImage string
}{
ToolkitImage: imageName + ":" + imageTag,
Member

As a note: This makes LOCAL tests difficult if one doesn't have the image built. Should we still make imageName and imageTag optional and skip this test if these are not set?

Member

I have updated this PR with basic functionality to allow local tests. Here we mount the nvidia-container-cli and related libraries from the host instead of installing them from an image.
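
For readers running locally, the host-mount approach could look roughly like the following docker run flags. This is a sketch only; the exact host paths are assumptions (typical Ubuntu locations), not necessarily the paths used in the final change.

package e2e

// localCLIMounts is a hypothetical set of flags for the "local toolkit" case:
// mount nvidia-container-cli and its library from the host instead of extracting
// them from the toolkit image. Host paths are assumed Ubuntu defaults.
const localCLIMounts = "-v /usr/bin/nvidia-container-cli:/usr/bin/nvidia-container-cli:ro " +
	"-v /usr/lib/x86_64-linux-gnu/libnvidia-container.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1:ro"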

Expect(err).ToNot(HaveOccurred())

// Run the test script in the container.
// Capture but don't fail on errors - we'll check the results via container logs.
Member

What do you mean don't fail on errors? I think this comment might be out of date.

Member

I updated this.

@elezar elezar force-pushed the e2e/nvidia-container-cli branch 2 times, most recently from 5c258a1 to 8c4dcd7 Compare August 8, 2025 12:53
@elezar elezar force-pushed the e2e/nvidia-container-cli branch from 8c4dcd7 to 718fe70 Compare August 8, 2025 13:34
@ArangoGutierrez ArangoGutierrez requested a review from Copilot August 8, 2025 13:42
@elezar elezar merged commit 4507575 into NVIDIA:main Aug 8, 2025
16 checks passed