Conversation

@guptaNswati
Contributor

@guptaNswati guptaNswati commented Apr 8, 2025

This PR makes sure all the images are CUDA-based images hosted on nvcr.io, to comply with GPL and NVIDIA open-source software requirement policies. The source code for all components of the CUDA-based images is hosted on NVIDIA mirrors, so by simply using the CUDA-based images we avoid having to handle that compliance ourselves. Also, our images on nvcr.io are scanned and signed, which adds a layer of security.

To add a bit more context, this issue actually surfaced when a user was trying to run the IMEX channel example from the internal docs. The compute-domain daemonset responsible for creating the resourceclaims and setting up the IMEX channels was based on ubuntu:22.04 from Docker Hub. As shown in the error below, kubelet was running into image pull errors because of Docker Hub's rate limit on unauthenticated pulls.

Normal   Pulling  26m (x5 over 29m)     kubelet  Pulling image "ubuntu:22.04"
Warning  Failed   26m (x5 over 29m)     kubelet  Error: ErrImagePull
Warning  Failed   26m                   kubelet  Failed to pull image "ubuntu:22.04": failed to pull and unpack image "docker.io/library/ubuntu:22.04": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/library/ubuntu/manifests/sha256:23fdb0648173966ac0b863e5b3d63032e99f44533c5b396e62f29770ca2c5210: 429 Too Many Requests - Server message: toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Normal   BackOff  4m47s (x108 over 29m) kubelet  Back-off pulling image "ubuntu:22.04"
Warning  Failed   4m35s (x109 over 29m) kubelet  Error: ImagePullBackOff

@guptaNswati guptaNswati requested review from elezar and jgehrcke April 8, 2025 19:18
@copy-pr-bot

copy-pr-bot bot commented Apr 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

       containers:
       - name: compute-domain-daemon
-        image: ubuntu:22.04
+        image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
Contributor

Please use the latest image: nvcr.io/nvidia/cuda:12.8.1-base-ubuntu22.04

Out of scope: we may want to templatize the image path here and find a way to keep it up-to-date.
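As a rough sketch of what that could look like (assuming a Helm-style values override; the value name below is hypothetical and the project's actual templating mechanism may differ):

       containers:
       - name: compute-domain-daemon
         # Hypothetical override so orgs can point at their own registry or mirror;
         # falls back to the nvcr.io CUDA base image when nothing is set.
         image: {{ .Values.computeDomainDaemonImage | default "nvcr.io/nvidia/cuda:12.8.1-base-ubuntu22.04" }}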

Collaborator

we may want to templatize the image path here

Absolutely. We have to allow for orgs to take control of their supply chain. I also just mentioned this in #315 (comment).

Member

@elezar elezar left a comment

One thing to consider is that if we use a CUDA base image we probably want to set NVIDIA_VISIBLE_DEVICES=void in the daemonset to ensure that we're not inadvertently injecting NVIDIA GPUs when we do not need them.

Signed-off-by: Swati Gupta <[email protected]>
@guptaNswati
Contributor Author

One thing to consider is that if we use a CUDA base image we probably want to set NVIDIA_VISIBLE_DEVICES=void in the daemonset to ensure that we're not inadvertently injecting NVIDIA GPUs when we do not need them.

Good point. But I thought that if you don't explicitly ask for GPUs via nvidia.com/gpu, this flag will be set automatically in the env. What other image did you have in mind if not the CUDA base? I thought CUDA base images are perfect for OSRB requirements.

@elezar
Member

elezar commented Apr 9, 2025

One thing to consider is that if we use a CUDA base image we probably want to set NVIDIA_VISIBLE_DEVICES=void in the daemonset to ensure that we're not inadvertently injecting NVIDIA GPUs when we do not need them.

Good point. But I thought that if you don't explicitly ask for GPUs via nvidia.com/gpu, this flag will be set automatically in the env. What other image did you have in mind if not the CUDA base? I thought CUDA base images are perfect for OSRB requirements.

If one doesn't explicitly ask for nvidia.com/gpu resources, the variable would not be modified, meaning that if it is set -- as it is in the base image -- it will remain set. If the NVIDIA Container Runtime is set as the default runtime and is configured to allow requesting devices through the environment variable (its default), then all devices will be injected.

We could also do this by ensuring that the CDI spec for the resource claim that we request in the template sets this variable, and it might already be done. Is this something that you could confirm? (What is the value of NVIDIA_VISIBLE_DEVICES in the compute-domain-daemon container?)
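For reference, the shape of such an edit in a CDI spec would be roughly the following (purely an illustrative fragment: the kind, device name, and device-node path below are made up, not what this driver actually generates):

cdiVersion: "0.6.0"
kind: example.com/channel
devices:
- name: channel0
  containerEdits:
    # Made-up device node, for illustration only.
    deviceNodes:
    - path: /dev/example-channel0
    # Pinning the variable here would keep the runtime from injecting all GPUs.
    env:
    - NVIDIA_VISIBLE_DEVICES=void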

Also, to be clear, by setting this in the template I mean something like the following:

diff --git a/templates/compute-domain-daemon.tmpl.yaml b/templates/compute-domain-daemon.tmpl.yaml
index 02d04958..9c761d82 100644
--- a/templates/compute-domain-daemon.tmpl.yaml
+++ b/templates/compute-domain-daemon.tmpl.yaml
@@ -23,6 +23,9 @@ spec:
       containers:
       - name: compute-domain-daemon
         image: nvcr.io/nvidia/cuda:12.8.1-base-ubuntu22.04
+        env:
+        - name: NVIDIA_VISIBLE_DEVICES
+          value: void
         command: [sh, -c]
         args:
         - |-

@guptaNswati guptaNswati closed this Apr 9, 2025
@guptaNswati guptaNswati reopened this Apr 9, 2025
@guptaNswati
Contributor Author

@elezar looks like CDI is doing the expected thing. It's set to void in the current container:

$ kubectl exec -it imex-channel-injection-l2w45-m6pf9 -n nvidia-dra-driver-gpu -- env | grep NVIDIA_VISIBLE_DEVICES
NVIDIA_VISIBLE_DEVICES=void

@elezar
Member

elezar commented Apr 11, 2025

@elezar looks like CDI is doing the expected thing. It's set to void in the current container:

$ kubectl exec -it imex-channel-injection-l2w45-m6pf9 -n nvidia-dra-driver-gpu -- env | grep NVIDIA_VISIBLE_DEVICES
NVIDIA_VISIBLE_DEVICES=void

Is this from the compute-domain daemon?

@jgehrcke
Collaborator

jgehrcke commented Apr 11, 2025

Cool. Such a simple patch. Inspiring so many thoughts and questions. So, I have a few remarks on a variety of topics. Please bear with me! :)

My high-level feedback is: I don't understand yet why we believe that this is an obvious improvement.

Describe the problem

In a patch/PR we should always describe the observed problem, even if just anecdotal.

Here, the motivation was an organization affected by Docker Hub quotas/rate limiting. Resulting in image pull failures. Is that right? Can you add details to the PR description?

Explain the solution

The hope behind consuming nvcr.io instead of Docker Hub is that pipelines will be less affected by such errors. But is this hope well-founded? Would love to read your thoughts.

My thoughts on problem and solution

How this is done elsewhere

We probably need to make this image spec configurable. To allow for serious orgs consuming their self-hosted or paid container registry.

In every company I have worked at so far we have at some point moved container images that were used frequently / at high rate (especially in CI) to a paid container registry close to the infrastructure (often, this is ECR on AWS). The reason was simple: making the outcome more predictable. Sometimes, we even paid Docker Hub (a fair solution for many companies).

Nothing is free. Especially reliability is not free.

nvcr vs Docker Hub

Key question: how does free-tier nvcr behave for non-NVIDIA users? Does it behave 'better' than Docker Hub (in the sense of free-tier quotas and availability)? The answer is not obvious to me. If we have a definite answer: great, let's share it -- super important to know :).

For NVIDIA-internal use cases (such as CI): NVCR is probably the right choice.

CUDA base image vs. plain Ubuntu

In this patch we do not only change the registry, but also the container image. (Why) are we OK with using the CUDA base image? I think we probably picked the CUDA base image because a plain Ubuntu image wasn't found on nvcr, and the base image seems to get close.

Should we get a plain Ubuntu hosted on nvcr instead?

Are we still using Docker Hub?

The base image included in this patch implicitly derives from Docker Hub's Ubuntu:
https://gitlab.com/nvidia/container-images/cuda/blob/master/dist/12.8.1/ubuntu2204/base/Dockerfile#L1

FROM ubuntu:22.04 AS base resolves to using the docker.io registry.

Is that base image layer in practice served by nvcr? Maybe. Not obvious to me.

I guess we can answer some of these questions by just trying things out. But I would honestly love to hear someone else's thoughts on this.

@jgehrcke jgehrcke mentioned this pull request Apr 11, 2025
@guptaNswati
Contributor Author

@elezar looks like CDI is doing the expected thing. It's set to void in the current container:

$ kubectl exec -it imex-channel-injection-l2w45-m6pf9 -n nvidia-dra-driver-gpu -- env | grep NVIDIA_VISIBLE_DEVICES
NVIDIA_VISIBLE_DEVICES=void

Is this from the compute-domain daemon?

Yes, this is from the compute-domain daemon pod's env.

@guptaNswati
Contributor Author

I have updated the PR description to add more context.

The CUDA base images are not as minimal as a vanilla Ubuntu image, but they are already audited by our open-source compliance team and they also contain the NV license. Plus, they work right out of the box for NVIDIA GPUs (not relevant for this particular case though): https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.8.1/ubuntu2204/base/Dockerfile?ref_type=heads

It is a lot of effort to mirror or host new images that contain someone else's code. For this reason, I switched our nvbandwidth image to a CUDA-based image.

As for the registry, nvcr.io vs Docker Hub, I am not sure if one is preferred over the other in terms of speed and reliability. Docker Hub used to be a clear win for quick POCs and tests, but for enterprise customers nvcr.io might be the preferred choice because of paid services and licensing requirements.

@guptaNswati
Contributor Author

@jgehrcke another reason I can think of is consistency. We use CUDA base images in most places.

@jgehrcke
Collaborator

connecting dots, related: #317
