@elezar
Member

elezar commented Feb 19, 2025

These changes add an nvbandwidth CUDA sample to allow for testing GPU bandwidth between multiple GPUs.

These changes would produce the following images:

  • docker.io/nvidia/cuda-sample:nvbandwidth-cuda12.6.2-ubuntu22.04
  • docker.io/nvidia/cuda-sample:nvbandwidth-cuda12.6.2

@elezar elezar mentioned this pull request Feb 19, 2025
@elezar elezar changed the title from "Bandwidthtest" to "Add nvbandwidth sample" Feb 19, 2025
build-%: DOCKERFILE = $(CURDIR)/deployments/container/Dockerfile.$(DOCKERFILE_SUFFIX)
else
build-%: DOCKERFILE = $(CURDIR)/deployments/container/$(SAMPLE)/Dockerfile.$(DOCKERFILE_SUFFIX)
endif
Collaborator

Could we modify the IMAGE_TAG here? I don't think nvbandwidth-8169f9fa-ubuntu22.04 is a good tag for the nvbandwidth image; maybe we want nvbandwidth-8169f9fa.

Contributor

We need the nvbandwidth and cuda_version tag, actually. These images are version-sensitive.

Collaborator

Yeah, I discussed this with @klueska.

Member Author

We can update the tags to be whatever we want them to be. Please remember that:

  • The VERSION for the images released from this repo is, for example, cuda12.6.2.
  • The tag should be different for each build (e.g. SHA) so that we can test early access bits.
  • The image will be released when tagging the (internal) repo. Currently we tag with cuda<VERSION> since the base images are the main driver for updates.

@guptaNswati
Contributor

  • docker.io/nvidia/cuda-sample:nvbandwidth-cuda12.6.2-ubuntu22.04
  • docker.io/nvidia/cuda-sample:nvbandwidth-cuda12.6.2

This is not a CUDA sample. These are standalone memory benchmarking tests. https://github.com/NVIDIA/nvbandwidth
If we don't want to do nvidia/k8s-sample, then I would propose we do nvidia/nvbandwidth. Note that I added it to k8s-sample so that pushing it to NGC is a bit faster, since we already have the repo.

- vectorAdd
- nbody
- deviceQuery
- nvbandwidth
Contributor

I don't think it belongs here. It is a separate build/Dockerfile. I added another cuda-sample that should go here. See #18.

Collaborator

It does; see the structure of the GitHub action and how Evan is creating a new Make target.

@ArangoGutierrez
Collaborator

  • docker.io/nvidia/cuda-sample:nvbandwidth-cuda12.6.2-ubuntu22.04
  • docker.io/nvidia/cuda-sample:nvbandwidth-cuda12.6.2

This is not a CUDA sample. These are standalone memory benchmarking tests. https://github.com/NVIDIA/nvbandwidth If we don't want to do nvidia/k8s-sample, then I would propose we do nvidia/nvbandwidth. Note that I added it to k8s-sample so that pushing it to NGC is a bit faster, since we already have the repo.

It was a typo on Evan's side. The GH action will produce
ghcr.io/nvidia/k8s-samples:nvbandwidth-cuda12.6.2-ubuntu22.04. On the GitHub registry the image name is the repo name, and we can control the tag.

https://github.com/NVIDIA/k8s-samples/pull/20/files#diff-b4df0a4f0d80f73138c476afbd7aefdac9df339642ddfba323d27c8cbabb92e2R90
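
For readers following the naming discussion, here is a minimal sketch of how the image references mentioned in this thread appear to be composed: a fixed image name (the repo) plus a tag built from the sample name, the VERSION, and an optional distribution suffix. The function name `compose_image_ref` and its parameters are hypothetical and only illustrate the pattern; this is not the repo's actual Makefile or GitHub action logic.

```python
from typing import Optional


def compose_image_ref(image: str, sample: str, version: str,
                      dist: Optional[str] = None) -> str:
    """Build '<image>:<sample>-<version>[-<dist>]'.

    Hypothetical helper: the layout is inferred from the example
    references quoted in this thread, not taken from the build scripts.
    """
    tag = f"{sample}-{version}"
    if dist:
        # The long form carries the distribution, e.g. ubuntu22.04.
        tag = f"{tag}-{dist}"
    return f"{image}:{tag}"


if __name__ == "__main__":
    image = "ghcr.io/nvidia/k8s-samples"  # image name follows the repo name
    # Long form, with the distribution suffix.
    print(compose_image_ref(image, "nvbandwidth", "cuda12.6.2", "ubuntu22.04"))
    # Short form, without the distribution suffix.
    print(compose_image_ref(image, "nvbandwidth", "cuda12.6.2"))
```

Under this reading, swapping the VERSION argument for a commit SHA would give a distinct tag per build, which is the property requested for testing early-access bits.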

@elezar elezar marked this pull request as ready for review February 20, 2025 13:14
@elezar
Member Author

elezar commented Feb 20, 2025

/ok-to-test

@ArangoGutierrez
Collaborator

/ok to test

@elezar elezar force-pushed the bandwidthtest branch 3 times, most recently from b6b0c77 to 8103ba9 Compare February 20, 2025 16:30
@ArangoGutierrez
Collaborator

/ok to test

This change adds an nvbandwidth sample that can be used to test
both single and multi-node GPU interconnectivity.

The multi-arch images are generated with the following image root:

nvcr.io/ghcr.io/nvidia/k8s-samples:nvbandwidth

Signed-off-by: Swati Gupta <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
@elezar
Member Author

elezar commented Feb 21, 2025

Closing in favour of #19

@elezar elezar closed this Feb 21, 2025