Conversation

@guptaNswati
Contributor

@guptaNswati guptaNswati commented Apr 8, 2025

This PR makes sure all the images are CUDA-based images hosted on nvcr.io, to comply with GPL and NVIDIA open-source software requirement policies. The source code for all components of the CUDA-based images is hosted on NVIDIA mirrors, so by simply using the CUDA-based images we avoid having to handle that compliance ourselves. Also, our images on nvcr.io are scanned and signed, which adds a layer of security.

To add a bit more context, this issue actually surfaced when a user was trying to run the IMEX channel example from the internal docs. The compute-domain daemonset responsible for creating the resourceclaims and setting up the IMEX channels was based on ubuntu:22.04 from Docker Hub. As shown in the error below, kubelet was running into image pull errors because of Docker Hub's rate limit on unauthenticated pulls.

Normal   Pulling  26m (x5 over 29m)     kubelet  Pulling image "ubuntu:22.04"
Warning  Failed   26m (x5 over 29m)     kubelet  Error: ErrImagePull
Warning  Failed   26m                   kubelet  Failed to pull image "ubuntu:22.04": failed to pull and unpack image "docker.io/library/ubuntu:22.04": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/library/ubuntu/manifests/sha256:23fdb0648173966ac0b863e5b3d63032e99f44533c5b396e62f29770ca2c5210: 429 Too Many Requests - Server message: toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Normal   BackOff  4m47s (x108 over 29m) kubelet  Back-off pulling image "ubuntu:22.04"
Warning  Failed   4m35s (x109 over 29m) kubelet  Error: ImagePullBackOff

@guptaNswati guptaNswati requested review from elezar and jgehrcke April 8, 2025 19:18
@copy-pr-bot

copy-pr-bot bot commented Apr 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

       containers:
       - name: compute-domain-daemon
-        image: ubuntu:22.04
+        image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
Contributor

Please use the latest image: nvcr.io/nvidia/cuda:12.8.1-base-ubuntu22.04

Out of scope: we may want to templatize the image path here and find a way to keep it up-to-date.
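As a rough sketch of what that could look like (assuming a Helm-style values override; the value name below is hypothetical and the project's actual templating mechanism may differ):

       containers:
       - name: compute-domain-daemon
         # Hypothetical override so orgs can point at their own registry or mirror;
         # falls back to the nvcr.io CUDA base image when nothing is set.
         image: {{ .Values.computeDomainDaemonImage | default "nvcr.io/nvidia/cuda:12.8.1-base-ubuntu22.04" }}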

Collaborator

we may want to templatize the image path here

Absolutely. We have to allow for orgs to take control of their supply chain. I also just mentioned this in #315 (comment).

Member

@elezar elezar left a comment

One thing to consider is that if we use a CUDA base image we probably want to set NVIDIA_VISIBLE_DEVICES=void in the daemonset to ensure that we're not inadvertently injecting NVIDIA GPUs when we do not need them.

Signed-off-by: Swati Gupta <[email protected]>
@guptaNswati
Contributor Author

One thing to consider is that if we use a CUDA base image we probably want to set NVIDIA_VISIBLE_DEVICES=void in the daemonset to ensure that we're not inadvertently injecting NVIDIA GPUs when we do not need them.

Good point. But I thought that if you don't explicitly ask for GPUs via nvidia.com/gpu, this flag will be set automatically in the env. What other image did you have in mind if not the CUDA base? I thought CUDA base images are perfect for OSRB requirements.

@elezar
Member

elezar commented Apr 9, 2025

One thing to consider is that if we use a CUDA base image we probably want to set NVIDIA_VISIBLE_DEVICES=void in the daemonset to ensure that we're not inadvertently injecting NVIDIA GPUs when we do not need them.

Good point. But I thought that if you don't explicitly ask for GPUs via nvidia.com/gpu, this flag will be set automatically in the env. What other image did you have in mind if not the CUDA base? I thought CUDA base images are perfect for OSRB requirements.

If one doesn't explicitly ask for nvidia.com/gpu resources, the variable would not be modified, meaning that if it is set -- as it is in the base image -- it will remain set. If the NVIDIA Container Runtime is set as the default runtime and is configured to allow requesting devices through the environment variable (its default), then all devices will be injected.

We could also do this by ensuring that the CDI spec for the resource claim that we request in the template sets this variable, and it might already be done. Is this something that you could confirm? (What is the value of NVIDIA_VISIBLE_DEVICES in the compute-domain-daemon container?)
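For reference, the shape of such an edit in a CDI spec would be roughly the following (purely an illustrative fragment: the kind, device name, and device-node path below are made up, not what this driver actually generates):

cdiVersion: "0.6.0"
kind: example.com/channel
devices:
- name: channel0
  containerEdits:
    # Made-up device node, for illustration only.
    deviceNodes:
    - path: /dev/example-channel0
    # Pinning the variable here would keep the runtime from injecting all GPUs.
    env:
    - NVIDIA_VISIBLE_DEVICES=void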

Also, to be clear, by setting this in the template I mean something like the following:

diff --git a/templates/compute-domain-daemon.tmpl.yaml b/templates/compute-domain-daemon.tmpl.yaml
index 02d04958..9c761d82 100644
--- a/templates/compute-domain-daemon.tmpl.yaml
+++ b/templates/compute-domain-daemon.tmpl.yaml
@@ -23,6 +23,9 @@ spec:
       containers:
       - name: compute-domain-daemon
         image: nvcr.io/nvidia/cuda:12.8.1-base-ubuntu22.04
+        env:
+        - name: NVIDIA_VISIBLE_DEVICES
+          value: void
         command: [sh, -c]
         args:
         - |-

@guptaNswati guptaNswati closed this Apr 9, 2025
@guptaNswati guptaNswati reopened this Apr 9, 2025
@guptaNswati
Contributor Author

@elezar looks like CDI is doing the expected thing. It's set to void in the current container:

$ kubectl exec -it imex-channel-injection-l2w45-m6pf9 -n nvidia-dra-driver-gpu -- env | grep NVIDIA_VISIBLE_DEVICES
NVIDIA_VISIBLE_DEVICES=void

@elezar
Member

elezar commented Apr 11, 2025

@elezar looks like CDI is doing the expected thing. It's set to void in the current container:

$ kubectl exec -it imex-channel-injection-l2w45-m6pf9 -n nvidia-dra-driver-gpu -- env | grep NVIDIA_VISIBLE_DEVICES
NVIDIA_VISIBLE_DEVICES=void

Is this from the compute-domain daemon?

@jgehrcke
Collaborator

jgehrcke commented Apr 11, 2025

Cool. Such a simple patch. Inspiring so many thoughts and questions. So, I have a few remarks on a variety of topics. Please bear with me! :)

My high-level feedback is: I don't understand yet why we believe that this is an obvious improvement.

Describe the problem

In a patch/PR we should always describe the observed problem, even if just anecdotal.

Here, the motivation was an organization affected by Docker Hub quotas/rate limiting. Resulting in image pull failures. Is that right? Can you add details to the PR description?

Explain the solution

The hope behind consuming nvcr.io instead of Docker Hub is that pipelines will be less affected by such errors. But is this hope well-founded? Would love to read your thoughts.

My thoughts on problem and solution

How this is done elsewhere

We probably need to make this image spec configurable. To allow for serious orgs consuming their self-hosted or paid container registry.

In every company I have worked at so far we have at some point moved container images that were used frequently / at high rate (especially in CI) to a paid container registry close to the infrastructure (often, this is ECR on AWS). The reason was simple: making the outcome more predictable. Sometimes, we even paid Docker Hub (a fair solution for many companies).

Nothing is free. Especially reliability is not free.

nvcr vs Docker Hub

Key question: how does free-tier nvcr behave for non-NVIDIA users? Does it behave 'better' than Docker Hub (in the sense of free-tier quotas and availability)? The answer is not obvious to me. If we have a definite answer: great, let's share it -- super important to know :).

For NVIDIA-internal use cases (such as CI): NVCR is probably the right choice.

CUDA base image vs. plain Ubuntu

In this patch we do not only change the registry, but also the container image. (Why) are we OK with using the CUDA base image? I think we probably picked the CUDA base image because a plain Ubuntu image wasn't found on nvcr, and the base image seems to get close.

Should we get a plain Ubuntu hosted on nvcr instead?

Are we still using Docker Hub?

The base image included in this patch implicitly derives from Docker Hub's Ubuntu:
https://gitlab.com/nvidia/container-images/cuda/blob/master/dist/12.8.1/ubuntu2204/base/Dockerfile#L1

FROM ubuntu:22.04 AS base resolves to using the docker.io registry.

Is that base image layer in practice served by nvcr? Maybe. Not obvious to me.

I guess we can answer some of these questions by just trying things out. But I would honestly love to hear someone else's thoughts on this.

@jgehrcke jgehrcke mentioned this pull request Apr 11, 2025
@guptaNswati
Contributor Author

@elezar looks like CDI is doing the expected thing. It's set to void in the current container:

$ kubectl exec -it imex-channel-injection-l2w45-m6pf9 -n nvidia-dra-driver-gpu -- env | grep NVIDIA_VISIBLE_DEVICES
NVIDIA_VISIBLE_DEVICES=void

Is this from the compute-domain daemon?

Yes, this is from the compute-domain daemon pod's env.

@guptaNswati
Contributor Author

I have updated the PR description to add more context.

The CUDA base images are not as minimal as a vanilla Ubuntu image, but they are already audited by our open-source compliance team and they also contain the NV license. Plus, they work right out of the box for NVIDIA GPUs (not relevant for this particular case though): https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.8.1/ubuntu2204/base/Dockerfile?ref_type=heads

It is a lot of effort to mirror or host new images that contain someone else's code. For this reason, I switched our nvbandwidth image to a CUDA-based image.

As for the registry, nvcr.io vs Docker Hub, I am not sure if one is preferred over the other in terms of speed and reliability. Docker Hub used to be a clear win for quick POCs and tests, but for enterprise customers nvcr.io might be the preferred choice because of paid services and licensing requirements.

@guptaNswati
Contributor Author

@jgehrcke another reason I can think of is consistency. We use CUDA base images in most places.

@jgehrcke
Collaborator

connecting dots, related: #317
