@Rohan-Bierneni Rohan-Bierneni commented May 27, 2025

Description

For building maxtext Docker images, we are still using self-hosted GitHub runners. A dedicated GKE runner is now available for building Docker containers. This PR migrates the maxtext Docker image builds from the old self-hosted runners to the new GKE runner, which uses Docker BuildX.

Key changes in this PR:

  • Uses Docker BuildX instead of plain docker build commands. Scripts such as build_and_upload_images.sh are therefore no longer needed, since they are incompatible with BuildX. A follow-up PR will clean up the maxtext Dockerfiles and build scripts.
  • XLML image builds now run in parallel instead of sequentially. A single failed image build no longer causes the remaining image builds to fail, and builds finish much sooner: total image build time for the XLML tests is now ~10 min.
  • Caching through Docker BuildX integrated with the GitHub Actions cache. This speeds up image builds to the extent that the layers of the Docker build are unchanged between runs.
  • Fixed a small bug in the notify step of RunTests.yml so that a Buganizer issue is created when a build fails.
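
For readers unfamiliar with the pattern, the parallel, cached builds described above can be sketched roughly as follows. This is a minimal illustration and not the exact workflow in this PR: the job name `build-image` and the `ubuntu-latest` runner label are placeholders (the PR targets a dedicated GKE runner), while the action versions, image names, Dockerfile path, and tag format are taken from the diff in this conversation.

```yaml
# Sketch only: job name and runner label are assumptions, not the PR's values.
jobs:
  build-image:
    runs-on: ubuntu-latest          # placeholder; the PR uses a GKE runner
    strategy:
      fail-fast: false              # one failed build does not cancel the others
      matrix:
        include:                    # each entry becomes its own parallel job
          - image_name: maxtext_jax_stable
            dockerfile: ./maxtext_dependencies.Dockerfile
            build_args: |
              MODE=stable
          - image_name: maxtext_jax_nightly
            dockerfile: ./maxtext_dependencies.Dockerfile
            build_args: |
              MODE=nightly
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Build Docker image
        uses: docker/build-push-action@v6
        with:
          context: .
          file: ${{ matrix.dockerfile }}
          push: false               # build only; set to true to push
          tags: gcr.io/tpu-prod-env-multipod/${{ matrix.image_name }}:tpu-pr-test
          cache-from: type=gha      # read layers from the GitHub Actions cache
          cache-to: type=gha,mode=max   # write all intermediate layers back
          build-args: |
            ${{ matrix.build_args }}
```

Because every matrix entry here writes to the same default cache scope, entries can overwrite each other's cache. The `gha` cache backend accepts a `scope` parameter (for example `type=gha,scope=${{ matrix.image_name }}`) that may be worth considering if cache hit rates turn out to be low. That is a suggestion, not something this PR does.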

If the change fixes a bug or a GitHub issue, please include a link, e.g.:
FIXES: b/412986220
FIXES: b/421909781

Tests

Will update with a screenshot of using GKE runner for image builds in the unit tests

Image builds for Unit Tests: https://screenshot.googleplex.com/BRZidEP3mNkRQUy
Link to Github Action: https://github.com/AI-Hypercomputer/maxtext/actions/runs/15596030910/job/43926535013

Image builds for daily XLML E2E tests: https://screenshot.googleplex.com/7dBHUowk2NVUtsw
Link to Github Action: https://github.com/AI-Hypercomputer/maxtext/actions/runs/15596030910

  • Also verified that the images were successfully pushed to Artifact Registry

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@Rohan-Bierneni Rohan-Bierneni force-pushed the rbierneni-gke-runners-docker-build branch from 64154be to 19c845c Compare May 30, 2025 21:45
@Rohan-Bierneni Rohan-Bierneni self-assigned this Jun 2, 2025
@Rohan-Bierneni Rohan-Bierneni force-pushed the rbierneni-gke-runners-docker-build branch from 5f44df6 to 83e7647 Compare June 6, 2025 18:30
Comment on lines 43 to 115
if: >
github.event_name == 'schedule' ||
github.event_name == 'pull_request' ||
github.event.inputs.target_device == 'all' ||
github.event.inputs.target_device == 'tpu'

runs-on: ["self-hosted", "tpu", "v4-8"]
container: google/cloud-sdk:524.0.0

strategy:
fail-fast: false
matrix:
device-type: ["v4-8"]
runs-on: ["self-hosted", "tpu", "${{ matrix.device-type }}"]
include:
# TPU Image Builds
- image_name: maxtext_jax_stable
dockerfile: ./maxtext_dependencies.Dockerfile
build_args: |
MODE=stable
JAX_VERSION=NONE
LIBTPU_GCS_PATH=NONE
- image_name: maxtext_jax_nightly
dockerfile: ./maxtext_dependencies.Dockerfile
build_args: |
MODE=nightly
JAX_VERSION=NONE
LIBTPU_GCS_PATH=NONE
# TPU Image builds using JAX AI Image
- image_name: maxtext_jax_stable_stack
dockerfile: ./maxtext_jax_ai_image.Dockerfile
base_image: us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:latest
- image_name: maxtext_stable_stack_nightly_jax
dockerfile: ./maxtext_jax_ai_image.Dockerfile
base_image: us-docker.pkg.dev/tpu-prod-env-multipod/jax-stable-stack/tpu/jax_nightly:latest
- image_name: maxtext_stable_stack_candidate
dockerfile: ./maxtext_jax_ai_image.Dockerfile
base_image: us-docker.pkg.dev/tpu-prod-env-multipod/jax-stable-stack/candidate/tpu:latest

steps:
- uses: actions/checkout@v3
- name: Cleanup old docker images
run: docker system prune --all --force
- name: Authenticate gcloud
run: gcloud auth configure-docker us-docker.pkg.dev --quiet
- name: build jax stable image
run : |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_jax_stable MODE=stable DEVICE=tpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_jax_stable
- name: build jax nightly image
run : |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_jax_nightly MODE=nightly DEVICE=tpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_jax_nightly
- name: build jax AI image
run : |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_jax_stable_stack MODE=jax_ai_image DEVICE=tpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_jax_stable_stack BASEIMAGE=us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:latest
- name: build image with JAX AI nightly jax
run: |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_stable_stack_nightly_jax MODE=jax_ai_image DEVICE=tpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_tpu_jax_stable_stack_nightly BASEIMAGE=us-docker.pkg.dev/tpu-prod-env-multipod/jax-stable-stack/tpu/jax_nightly:latest
- name: build image with jax AI release candidate image
run: |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_stable_stack_candidate MODE=jax_ai_image DEVICE=tpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_stable_stack_candidate BASEIMAGE=us-docker.pkg.dev/tpu-prod-env-multipod/jax-stable-stack/candidate/tpu:latest
gpu:
- name: Checkout repository
uses: actions/checkout@v4
- name: Mark git repository as safe
run: git config --global --add safe.directory ${GITHUB_WORKSPACE}
- name: Get short commit hash
id: vars
run: echo "sha_short=$(git rev-parse --short HEAD)" >> $GITHUB_OUTPUT
- name: Authenticate to Google Cloud
run: gcloud auth configure-docker us-docker.pkg.dev,gcr.io -q
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Build and Push Docker Image
uses: docker/build-push-action@v6
with:
push: false
context: .
file: ${{ matrix.dockerfile }}
tags: gcr.io/tpu-prod-env-multipod/${{ matrix.image_name }}:tpu-pr-test
cache-from: type=gha
cache-to: type=gha,mode=max
build-args: |
${{ matrix.build_args }}
JAX_AI_IMAGE_BASEIMAGE=${{ matrix.base_image }}
COMMIT_HASH=${{ steps.vars.outputs.sha_short }}
DEVICE=tpu

build-gpu:

Check warning (Code scanning / CodeQL): Workflow does not contain permissions (severity: Medium)

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
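
One way to address this warning, as a hedged sketch: declare a top-level `permissions` block so that every job in the workflow receives a read-only GITHUB_TOKEN by default. The job body below is a placeholder (the `ubuntu-latest` runner and the single checkout step are assumptions for illustration). Whether the real build jobs need any additional scopes would have to be verified against the steps they run.

```yaml
# Sketch only: the job body is a placeholder, not the workflow from this PR.
permissions:
  contents: read          # workflow-wide default for GITHUB_TOKEN

jobs:
  build-gpu:
    runs-on: ubuntu-latest    # placeholder runner label
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4   # needs only read access to repo contents
```

The same block can instead be placed under an individual job (at the same level as `runs-on`) if only some jobs should be restricted.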
Comment on lines 108 to 172
if: >
github.event_name == 'schedule' ||
github.event_name == 'pull_request' ||
github.event.inputs.target_device == 'all' ||
github.event.inputs.target_device == 'gpu'

runs-on: ["self-hosted", "gpu", "a100-40gb-4"]
container: google/cloud-sdk:524.0.0

strategy:
fail-fast: false
matrix:
device-type: ["a100-40gb-4"]
runs-on: ["self-hosted", "gpu", "${{ matrix.device-type }}"]
# GPU Image Builds using JAX AI Image
include:
- image_name: maxtext_gpu_jax_stable_stack
dockerfile: ./maxtext_jax_ai_image.Dockerfile
base_image: us-central1-docker.pkg.dev/deeplearning-images/jax-ai-image/gpu:latest
- image_name: maxtext_gpu_stable_stack_nightly_jax
dockerfile: ./maxtext_jax_ai_image.Dockerfile
base_image: us-docker.pkg.dev/tpu-prod-env-multipod/jax-stable-stack/gpu/jax_nightly:latest
- image_name: maxtext_stable_stack_candidate_gpu
dockerfile: ./maxtext_jax_ai_image.Dockerfile
base_image: us-docker.pkg.dev/tpu-prod-env-multipod/jax-stable-stack/candidate/gpu:latest

steps:
- uses: actions/checkout@v3
- name: Cleanup old docker images
run: docker system prune --all --force
- name: Authenticate gcloud
run: gcloud auth configure-docker us-docker.pkg.dev --quiet
- name: build jax stable image
run : |
- name: build jax AI image
run : |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_gpu_jax_stable_stack MODE=jax_ai_image DEVICE=gpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_gpu_jax_stable_stack BASEIMAGE=us-central1-docker.pkg.dev/deeplearning-images/jax-ai-image/gpu:latest
- name: build image with JAX AI nightly jax
run: |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_gpu_stable_stack_nightly_jax MODE=jax_ai_image DEVICE=gpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_gpu_jax_stable_stack_nightly BASEIMAGE=us-docker.pkg.dev/tpu-prod-env-multipod/jax-stable-stack/gpu/jax_nightly:latest
- name: build image with jax AI release candidate image
run: |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_stable_stack_candidate_gpu MODE=jax_ai_image DEVICE=gpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_stable_stack_candidate_gpu BASEIMAGE=us-docker.pkg.dev/tpu-prod-env-multipod/jax-stable-stack/candidate/gpu:latest
- name: Checkout repository
uses: actions/checkout@v4
- name: Mark git repository as safe
run: git config --global --add safe.directory ${GITHUB_WORKSPACE}
- name: Get short commit hash
id: vars
run: echo "sha_short=$(git rev-parse --short HEAD)" >> $GITHUB_OUTPUT
- name: Authenticate to Google Cloud
run: gcloud auth configure-docker us-docker.pkg.dev,gcr.io,us-central1-docker.pkg.dev -q
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Build and Push Docker Image
uses: docker/build-push-action@v6
with:
push: false
context: .
file: ${{ matrix.dockerfile }}
tags: gcr.io/tpu-prod-env-multipod/${{ matrix.image_name }}:gpu-pr-test
cache-from: type=gha
cache-to: type=gha,mode=max
build-args: |
${{ matrix.build_args }}
JAX_AI_IMAGE_BASEIMAGE=${{ matrix.base_image }}
COMMIT_HASH=${{ steps.vars.outputs.sha_short }}
DEVICE=gpu

Check warning (Code scanning / CodeQL): Workflow does not contain permissions (severity: Medium)

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
@Rohan-Bierneni Rohan-Bierneni force-pushed the rbierneni-gke-runners-docker-build branch from c1ddfc3 to a876e54 Compare June 11, 2025 22:00
@Rohan-Bierneni Rohan-Bierneni marked this pull request as ready for review June 11, 2025 22:33
@parambole parambole left a comment

LGTM

Daily docker build in GKE runners

Updated container field value to use ubuntu:22.04

Updated checkout action

deleted authentication step

Updated BuildX stage with instructions from bug

Mark repo directory as safe for runner use

Added buildx support for build script

Removed env variable

Testing to see build-push-action command works

Added auth step and fixed commit_hash bug

Fixed auth step

Condensed auth step

Added gpu image auth

Updated Daily XLML Image builds

Fixed docker buildx commands for xlml image builds

Optimized UploadDockerImages with caching and refactored file formatting

Split job into separate tpu and gpu steps

Updated to use the correct runner

Fixed config setup

Reduced RUN commands in Dockerfile for better layer caching

Reverted Dockerfile changes since causing regression

Toggle builds to push to AR since working now

Added latest tag with image push to AR

Added comments for config of gke runners
@Rohan-Bierneni Rohan-Bierneni force-pushed the rbierneni-gke-runners-docker-build branch from aadebc5 to 94a96b5 Compare June 12, 2025 22:14
@copybara-service copybara-service bot merged commit bd3c2f6 into main Jun 12, 2025
18 checks passed
@copybara-service copybara-service bot deleted the rbierneni-gke-runners-docker-build branch June 12, 2025 23:04