switch to cuda base image #315
Conversation
```diff
 containers:
 - name: compute-domain-daemon
-  image: ubuntu:22.04
+  image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
```
Please use the latest image: nvcr.io/nvidia/cuda:12.8.1-base-ubuntu22.04
Out of scope: we may want to templatize the image path here and find a way to keep it up-to-date.
> we may want to templatize the image path here
Absolutely. We have to allow for orgs to take control of their supply chain. I also just mentioned this in #315 (comment).
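As a rough illustration (assuming a Helm-style chart; the values keys below are hypothetical, and the actual mechanism may instead be a controller flag or env var, depending on how the daemonset is generated in this project):

```yaml
# Hypothetical values.yaml keys, for illustration only:
#
# computeDomainDaemon:
#   image:
#     repository: nvcr.io/nvidia/cuda
#     tag: 12.8.1-base-ubuntu22.04
#
# DaemonSet pod template fragment that would consume them:
containers:
- name: compute-domain-daemon
  image: "{{ .Values.computeDomainDaemon.image.repository }}:{{ .Values.computeDomainDaemon.image.tag }}"
```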
elezar left a comment:
One thing to consider is that if we use a CUDA base image we probably want to set NVIDIA_VISIBLE_DEVICES=void in the daemonset to ensure that we're not inadvertently injecting NVIDIA GPUs when we do not need them.
Signed-off-by: Swati Gupta <[email protected]>
a92485a to 88d3316
Good point. But I thought if you don't explicitly ask for …
> If one doesn't explicitly ask for …

We could also do this by ensuring that the CDI spec for the resource claim that we request in the template sets this variable, and it might already be done. Is this something that you could confirm? (What is the value of NVIDIA_VISIBLE_DEVICES?)

Also, to be clear, by setting this in the template I mean something like the following:
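A minimal sketch (not the original snippet) of what explicitly setting the variable on the compute-domain-daemon container could look like:

```yaml
containers:
- name: compute-domain-daemon
  image: nvcr.io/nvidia/cuda:12.8.1-base-ubuntu22.04
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: void   # prevent the NVIDIA runtime from injecting GPUs into this container
```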
@elezar looks like CDI is doing the expected thing. It's set to void in the current container.
Is this from the compute-domain daemon?
Cool. Such a simple patch. Inspiring so many thoughts and questions. So, I have a few remarks on a variety of topics. Please bear with me! :)

My high-level feedback is: I don't understand yet why we believe that this is an obvious improvement.

Describe the problem

In a patch/PR we should always describe the observed problem, even if just anecdotal. Here, the motivation was an organization affected by Docker Hub quotas/rate limiting, resulting in image pull failures. Is that right? Can you add details to the PR description?

Explain the solution

The hope behind consuming nvcr.io instead of Docker Hub is that pipelines will be less affected by such errors. But is this hope well-founded? I would love to read your thoughts.

My thoughts on problem and solution

How this is done elsewhere

We probably need to make this image spec configurable, to allow serious orgs to consume their self-hosted or paid container registry. In every company I have worked at so far, we have at some point moved container images that were used frequently / at a high rate (especially in CI) to a paid container registry close to the infrastructure (often, this is ECR on AWS). The reason was simple: making the outcome more predictable. Sometimes we even paid Docker Hub (a fair solution for many companies). Nothing is free; reliability, especially, is not free.

nvcr vs. Docker Hub

Key question: how does free-tier nvcr behave for non-NVIDIA users? Does it behave "better" than Docker Hub (in the sense of free-tier quotas and availability)? The answer is not obvious to me. If we have a definite answer: great, let's share it -- super important to know :). For NVIDIA-internal use cases (such as CI), NVCR is probably the right choice.

CUDA base image vs. plain Ubuntu

In this patch we do not only change the registry, but also the container image. (Why) are we OK with using the CUDA base image? I think we probably picked the CUDA base image because a plain Ubuntu image wasn't found on nvcr, and the base image seems to get close. Should we get a plain Ubuntu hosted on nvcr instead?

Are we still using Docker Hub?

The base image included in this patch implicitly derives from Docker Hub's Ubuntu. Is that base image layer in practice served by nvcr? Maybe. Not obvious to me.

I guess we can answer some of these questions by just trying things out. But I would honestly love to hear someone else's thoughts on this.
Yes, from the compute-domain daemon pod env.
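For reference, a minimal sketch of how a CDI spec can inject such an environment variable via containerEdits (the vendor/class and device names here are illustrative, not the spec the driver actually generates):

```yaml
cdiVersion: "0.6.0"
kind: example.com/imex-channel      # illustrative vendor/class name
devices:
- name: channel0                    # hypothetical device name
  containerEdits:
    env:
    - NVIDIA_VISIBLE_DEVICES=void   # keeps GPUs from being injected for this claim
```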
I have updated the PR description to add more context. The CUDA base images are not as minimal as a vanilla Ubuntu image, but they are already audited by our open-source compliance team and they also contain the NV license. Plus, they work right out of the box for NVIDIA GPUs (not relevant for this particular case, though): https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.8.1/ubuntu2204/base/Dockerfile?ref_type=heads

It is a lot of effort to mirror or host new images that contain someone else's code. For this reason, I switched our nvbandwidth image to a CUDA-based image as well.

As for the registry, nvcr vs. Docker Hub, I am not sure if one is preferred over the other in terms of speed and reliability. Docker Hub used to be a clear win for quick POCs and tests, but for enterprise customers nvcr.io might be the preferred choice because of paid services and licensing requirements.
@jgehrcke another reason I can think of is consistency. We use CUDA base images in most places.
connecting dots, related: #317
This PR is to make sure all the images are CUDA-based images hosted on nvcr.io, to comply with GPL and NVIDIA open-source software requirement policies. The source code of all the components of the CUDA base images is hosted on NVIDIA mirrors. By simply using the CUDA base images, we save ourselves the effort of meeting those compliance requirements. Also, our images on nvcr.io are scanned and signed, which adds a layer of security as well.
To add a bit more context, this issue was actually surfaced when a user was trying to run the IMEX channel example from the internal docs. The compute-domain daemonset responsible for creating the ResourceClaims and setting up the IMEX channels was based on ubuntu:22.04 from Docker Hub. As shown in the error below, kubelet was running into image pull errors due to the Docker rate limit when not using an account.