Migrating docker build Github Actions for unit tests and daily image builds to GKE runners #1783
Merged
Comment on lines 43 to 115
if: >
github.event_name == 'schedule' ||
github.event_name == 'pull_request' ||
github.event.inputs.target_device == 'all' ||
github.event.inputs.target_device == 'tpu'
runs-on: ["self-hosted", "tpu", "v4-8"]
container: google/cloud-sdk:524.0.0
strategy:
fail-fast: false
matrix:
device-type: ["v4-8"]
runs-on: ["self-hosted", "tpu", "${{ matrix.device-type }}"]
include:
# TPU Image Builds
- image_name: maxtext_jax_stable
dockerfile: ./maxtext_dependencies.Dockerfile
build_args: |
MODE=stable
JAX_VERSION=NONE
LIBTPU_GCS_PATH=NONE
- image_name: maxtext_jax_nightly
dockerfile: ./maxtext_dependencies.Dockerfile
build_args: |
MODE=nightly
JAX_VERSION=NONE
LIBTPU_GCS_PATH=NONE
# TPU Image builds using JAX AI Image
- image_name: maxtext_jax_stable_stack
dockerfile: ./maxtext_jax_ai_image.Dockerfile
base_image: us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:latest
- image_name: maxtext_stable_stack_nightly_jax
dockerfile: ./maxtext_jax_ai_image.Dockerfile
base_image: us-docker.pkg.dev/tpu-prod-env-multipod/jax-stable-stack/tpu/jax_nightly:latest
- image_name: maxtext_stable_stack_candidate
dockerfile: ./maxtext_jax_ai_image.Dockerfile
base_image: us-docker.pkg.dev/tpu-prod-env-multipod/jax-stable-stack/candidate/tpu:latest
steps:
- uses: actions/checkout@v3
- name: Cleanup old docker images
run: docker system prune --all --force
- name: Authenticate gcloud
run: gcloud auth configure-docker us-docker.pkg.dev --quiet
- name: build jax stable image
run : |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_jax_stable MODE=stable DEVICE=tpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_jax_stable
- name: build jax nightly image
run : |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_jax_nightly MODE=nightly DEVICE=tpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_jax_nightly
- name: build jax AI image
run : |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_jax_stable_stack MODE=jax_ai_image DEVICE=tpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_jax_stable_stack BASEIMAGE=us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:latest
- name: build image with JAX AI nightly jax
run: |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_stable_stack_nightly_jax MODE=jax_ai_image DEVICE=tpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_tpu_jax_stable_stack_nightly BASEIMAGE=us-docker.pkg.dev/tpu-prod-env-multipod/jax-stable-stack/tpu/jax_nightly:latest
- name: build image with jax AI release candidate image
run: |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_stable_stack_candidate MODE=jax_ai_image DEVICE=tpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_stable_stack_candidate BASEIMAGE=us-docker.pkg.dev/tpu-prod-env-multipod/jax-stable-stack/candidate/tpu:latest
gpu:
- name: Checkout repository
uses: actions/checkout@v4
- name: Mark git repository as safe
run: git config --global --add safe.directory ${GITHUB_WORKSPACE}
- name: Get short commit hash
id: vars
run: echo "sha_short=$(git rev-parse --short HEAD)" >> $GITHUB_OUTPUT
- name: Authenticate to Google Cloud
run: gcloud auth configure-docker us-docker.pkg.dev,gcr.io -q
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Build and Push Docker Image
uses: docker/build-push-action@v6
with:
push: false
context: .
file: ${{ matrix.dockerfile }}
tags: gcr.io/tpu-prod-env-multipod/${{ matrix.image_name }}:tpu-pr-test
cache-from: type=gha
cache-to: type=gha,mode=max
build-args: |
${{ matrix.build_args }}
JAX_AI_IMAGE_BASEIMAGE=${{ matrix.base_image }}
COMMIT_HASH=${{ steps.vars.outputs.sha_short }}
DEVICE=tpu
build-gpu:
Check warning
Code scanning / CodeQL
Workflow does not contain permissions Medium
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
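One way to address this warning is a workflow-level `permissions` block. The fragment below is a minimal sketch, not the change made in this PR: the workflow name, trigger, and `runs-on` value are placeholders chosen for illustration, and only the `contents: read` setting comes from the CodeQL suggestion above.

```yaml
# Hypothetical fragment: everything except the `permissions` block is a placeholder.
name: Build Images            # assumed workflow name

on:
  pull_request:

# Workflow-level default applied to every job:
# the GITHUB_TOKEN may only read repository contents.
permissions:
  contents: read

jobs:
  build-tpu:
    runs-on: ubuntu-latest    # placeholder; the PR targets a self-hosted GKE runner label
    steps:
      - uses: actions/checkout@v4
```

Because the block sits at the top level, it would cover both `build-tpu` and `build-gpu`; a job that later needs broader access could override it with its own job-level `permissions` key.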
Comment on lines 108 to 172
if: >
github.event_name == 'schedule' ||
github.event_name == 'pull_request' ||
github.event.inputs.target_device == 'all' ||
github.event.inputs.target_device == 'gpu'
runs-on: ["self-hosted", "gpu", "a100-40gb-4"]
container: google/cloud-sdk:524.0.0
strategy:
fail-fast: false
matrix:
device-type: ["a100-40gb-4"]
runs-on: ["self-hosted", "gpu", "${{ matrix.device-type }}"]
# GPU Image Builds using JAX AI Image
include:
- image_name: maxtext_gpu_jax_stable_stack
dockerfile: ./maxtext_jax_ai_image.Dockerfile
base_image: us-central1-docker.pkg.dev/deeplearning-images/jax-ai-image/gpu:latest
- image_name: maxtext_gpu_stable_stack_nightly_jax
dockerfile: ./maxtext_jax_ai_image.Dockerfile
base_image: us-docker.pkg.dev/tpu-prod-env-multipod/jax-stable-stack/gpu/jax_nightly:latest
- image_name: maxtext_stable_stack_candidate_gpu
dockerfile: ./maxtext_jax_ai_image.Dockerfile
base_image: us-docker.pkg.dev/tpu-prod-env-multipod/jax-stable-stack/candidate/gpu:latest
steps:
- uses: actions/checkout@v3
- name: Cleanup old docker images
run: docker system prune --all --force
- name: Authenticate gcloud
run: gcloud auth configure-docker us-docker.pkg.dev --quiet
- name: build jax stable image
run : |
- name: build jax AI image
run : |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_gpu_jax_stable_stack MODE=jax_ai_image DEVICE=gpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_gpu_jax_stable_stack BASEIMAGE=us-central1-docker.pkg.dev/deeplearning-images/jax-ai-image/gpu:latest
- name: build image with JAX AI nightly jax
run: |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_gpu_stable_stack_nightly_jax MODE=jax_ai_image DEVICE=gpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_gpu_jax_stable_stack_nightly BASEIMAGE=us-docker.pkg.dev/tpu-prod-env-multipod/jax-stable-stack/gpu/jax_nightly:latest
- name: build image with jax AI release candidate image
run: |
bash .github/workflows/build_and_upload_images.sh CLOUD_IMAGE_NAME=maxtext_stable_stack_candidate_gpu MODE=jax_ai_image DEVICE=gpu PROJECT=tpu-prod-env-multipod LOCAL_IMAGE_NAME=maxtext_stable_stack_candidate_gpu BASEIMAGE=us-docker.pkg.dev/tpu-prod-env-multipod/jax-stable-stack/candidate/gpu:latest
- name: Checkout repository
uses: actions/checkout@v4
- name: Mark git repository as safe
run: git config --global --add safe.directory ${GITHUB_WORKSPACE}
- name: Get short commit hash
id: vars
run: echo "sha_short=$(git rev-parse --short HEAD)" >> $GITHUB_OUTPUT
- name: Authenticate to Google Cloud
run: gcloud auth configure-docker us-docker.pkg.dev,gcr.io,us-central1-docker.pkg.dev -q
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Build and Push Docker Image
uses: docker/build-push-action@v6
with:
push: false
context: .
file: ${{ matrix.dockerfile }}
tags: gcr.io/tpu-prod-env-multipod/${{ matrix.image_name }}:gpu-pr-test
cache-from: type=gha
cache-to: type=gha,mode=max
build-args: |
${{ matrix.build_args }}
JAX_AI_IMAGE_BASEIMAGE=${{ matrix.base_image }}
COMMIT_HASH=${{ steps.vars.outputs.sha_short }}
DEVICE=gpu
Check warning
Code scanning / CodeQL
Workflow does not contain permissions Medium
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
shralex reviewed Jun 12, 2025
shralex approved these changes Jun 12, 2025
parambole approved these changes Jun 12, 2025
LGTM
Daily docker build in GKE runners
Updated container field value to use ubuntu:22.04
Updated checkout action
deleted authentication step
Updated BuildX stage with instructions from bug
Mark repo directory as safe for runner use
Added buildx support for build script
Removed env variable
Testing to see build-push-action command works
Added auth step and fixed commit_hash bug
Fixed auth step
Condensed auth step
Added gpu image auth
Updated Daily XLML Image builds
Fixed docker buildx commands for xlml image builds
Optimized UploadDockerImages with caching and refactored file formatting
Split job into separate tpu and gpu steps
Updated to use the correct runner
Fixed config setup
Reduced RUN commands in Dockerfile for better layer caching
Reverted Dockerfile changes since causing regression
Toggle builds to push to AR since working now
Added latest tag with image push to AR
Added comments for config of gke runners
Description
The MaxText Docker images are still built on self-hosted GitHub runners. A dedicated GKE runner is now available for building Docker containers. This PR migrates the MaxText Docker image builds from the old self-hosted runners to the new GKE runner, which uses Docker Buildx.
Some new key features to note with this PR:
If the change fixes a bug or a GitHub issue, please include a link, e.g.:
FIXES: b/412986220
FIXES: b/421909781
Tests
Will update with a screenshot of using GKE runner for image builds in the unit tests
Image builds for Unit Tests: https://screenshot.googleplex.com/BRZidEP3mNkRQUy
Link to Github Action: https://github.com/AI-Hypercomputer/maxtext/actions/runs/15596030910/job/43926535013
Image builds for daily XLML E2E tests: https://screenshot.googleplex.com/7dBHUowk2NVUtsw
Link to Github Action: https://github.com/AI-Hypercomputer/maxtext/actions/runs/15596030910
Checklist
Before submitting this PR, please make sure (put X in square brackets):