Skip to content

Conversation

@jgehrcke
Copy link
Collaborator

deployments/container/Dockerfile currently assumes building on AMD64.

For example, on a system where uname -m returns aarch64 we see:

$ make -f deployments/container/Makefile build
...
 => ERROR [build 2/7] RUN apt-get update &&     apt-get install -y wget make git gcc-aarch64-linux-gnu gcc     &&     rm -rf /var/lib/apt/lists/*  0.6s
------
 > [build 2/7] RUN apt-get update &&     apt-get install -y wget make git gcc-aarch64-linux-gnu gcc     &&     rm -rf /var/lib/apt/lists/*:
0.273 exec /bin/sh: exec format error

This patch allows for building on aarch64.

I manually tested this patch by running make -f deployments/container/Makefile build on the two different platforms.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Apr 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

RUN wget -nv -O - https://storage.googleapis.com/golang/go${GOLANG_VERSION}.linux-amd64.tar.gz \
| tar -C /usr/local -xz

# Support building on Linux on aarch64 (arm64) and on x86_64 (amd64).
Copy link
Member

@elezar elezar Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason that we were building a single architecture here is that we explicitly cross-comile the binaries where we build them later. Does the logic to set cc=aarch64-linux-gnu-gcc as a cross compiler there need to be changed in this case? Is gcc just an alias to this compiler when an aarch64 image is used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor point (that is also applicable to the device plugin) is that we could use TARGETARCH here directly instead of running uname -m.

Copy link
Collaborator Author

@jgehrcke jgehrcke Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason that we were building a single architecture here is that we explicitly cross-comile the binaries

I understand. We decoupled build architecture from target architecture, which is good (allowing for cross-compilation). That is, we allowed for the target architecture to be flexible. But we did not allow for the build architecture to be flexible: we were explicitly requesting an AMD64 image via --platform=${BUILDOS}/amd64 regardless of the current (build) architecture.

That is, we could do AMD64 -> AMD64 and AMD64 -> ARM64. But e.g. ARM64 -> ARM64 does not work.

Does the logic to set cc=aarch64-linux-gnu-gcc as a cross compiler there need to be changed in this case?

Seen that, wondered the same -- I tested this patch with AMD64 -> AMD64 and ARM64 -> ARM64.

we could use TARGETARCH here directly instead of running uname -m.

that would again couple build arch to target arch.

Does the logic to set cc=aarch64-linux-gnu-gcc as a cross compiler there need to be changed in this case? Is gcc just an alias to this compiler when an aarch64 image is used.

I thought about it a bit. I don't think this needs changing. Running gcc on aarch64 should by default build for aarch64. In any case, when target arch is arm64 we always run aarch64-linux-gnu-gcc and I think on aarch64 this is basically "normal gcc" :).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be good to be consistent across projects in this case.

certainly would have been good, because then I would not have run into a problem :). I will look at this, and maybe copy/paste that code instead.

Copy link
Member

@elezar elezar Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying things once more. I think I missed some of the subtleties in my first pass.

That said, does the use of TARGETARCH in the make target below need to be reassessed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the use of TARGETARCH in the make target below need to be reassessed?

You mean in

 make CC=${cc} GOARCH=${TARGETARCH} PREFIX=/artifacts cmds

?

Hm. It still looks good to me. GOARCH signifies the target architecture:

GOARCH is the running program's architecture target: one of 386, amd64, arm, s390x, and so on.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the logic here

pushed a commit, used that instead

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is, we could do AMD64 -> AMD64 and AMD64 -> ARM64

this patch probably broke AMD64 -> ARM64 (as observed by @klueska)

@jgehrcke jgehrcke force-pushed the jp/allow-building-on-aarch64 branch from 2755d85 to 9760ffb Compare April 25, 2025 12:32
@jgehrcke jgehrcke force-pushed the jp/allow-building-on-aarch64 branch from fc1a563 to b38146c Compare April 25, 2025 12:34
LABEL org.opencontainers.image.description "NVIDIA DRA Driver for GPUs"
LABEL org.opencontainers.image.source "https://github.com/NVIDIA/k8s-dra-driver-gpu"
LABEL org.opencontainers.image.description="NVIDIA DRA Driver for GPUs"
LABEL org.opencontainers.image.source="https://github.com/NVIDIA/k8s-dra-driver-gpu"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I needed to update this branch I took in this feedback from a different PR (not a fix, merely a style consistency).

@jgehrcke jgehrcke merged commit ea3e26d into NVIDIA:main Apr 25, 2025
7 checks passed
ARG GOLANG_VERSION=1.23.8
# We use an ubuntu20.04 base image to allow for a more efficient multi-arch builds.
FROM --platform=${BUILDOS}/amd64 nvcr.io/nvidia/cuda:12.8.1-base-ubuntu20.04 AS build
FROM nvcr.io/nvidia/cuda:12.8.1-base-ubuntu20.04 AS build
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My assumption when removing --platform here was that the default is the current (build) platform.

That assumption was wrong. Ref docs are: https://docs.docker.com/reference/dockerfile/#from

Quote:

The optional --platform flag can be used to specify the platform of the image in case FROM references a multi-platform image. For example, linux/amd64, linux/arm64, or windows/amd64. By default, the target platform of the build request is used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

3 participants