
Investigation: Root cause of per-image build time doubling (SDK cefaebf → eab666f) #544

@simonrosenberg


Root Cause

SDK PR #2130 (commit fd80128, Mar 3) added an OPENHANDS_BUILD_GIT_SHA ARG before apt-get install in the base-image-minimal Dockerfile stage. Because the ARG's value changes with every SDK commit, BuildKit registry cache layer matching breaks for apt-get and everything after it whenever the SDK SHA changes.

FROM ${BASE_IMAGE} AS base-image-minimal
+ ARG OPENHANDS_BUILD_GIT_SHA=unknown     # ← inserted here by fd80128
+ ENV OPENHANDS_BUILD_GIT_SHA=...
  RUN apt-get install ...                  # ← cache key now includes SHA → miss
  RUN npm install ...                      # ← also busted (depends on apt-get)

Before fd80128: registry cache from prior builds matched → apt-get CACHED (~0s).
After fd80128: SHA in the parent chain changes the layer hash → apt-get rebuilt (~103s/image).

How caching works for benchmark images

Each of the 433 SWT-bench images uses a unique `BASE_IMAGE`, so layers can't be shared between images within a run. The caching mechanism is GHCR registry cache:

  1. Each `docker buildx build` includes `--cache-from type=registry,ref=ghcr.io/openhands/eval-agent-server:buildcache-{slug}`
  2. The cache tag does NOT include the SDK SHA — it's stable across SDK bumps
  3. Prior builds export layers with `--cache-to type=registry,mode=max`
  4. Later builds import those layers if the Dockerfile build graph matches
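The flags in the list above can be sketched as a small command builder. This is only an illustration: `buildx_command` and its parameters are hypothetical names, and reusing the same ref for `--cache-to` as for `--cache-from` is an assumption, since the issue only quotes the import ref.

```python
from typing import List

# Registry prefix taken from the cache ref quoted in the list above.
CACHE_REPO = "ghcr.io/openhands/eval-agent-server"


def buildx_command(slug: str, base_image: str, sdk_sha: str, export_cache: bool) -> List[str]:
    """Return an argv list for one image build (hypothetical helper)."""
    # The cache ref depends only on the image slug, never on the SDK SHA,
    # so it stays the same across SDK bumps.
    cache_ref = f"{CACHE_REPO}:buildcache-{slug}"
    cmd = [
        "docker", "buildx", "build",
        "--build-arg", f"BASE_IMAGE={base_image}",
        "--build-arg", f"OPENHANDS_BUILD_GIT_SHA={sdk_sha}",
        "--cache-from", f"type=registry,ref={cache_ref}",
    ]
    if export_cache:
        # Assumption: layers are exported to the same ref they are imported from.
        cmd += ["--cache-to", f"type=registry,ref={cache_ref},mode=max"]
    cmd.append(".")
    return cmd
```

Calling the helper with two different `sdk_sha` values yields identical cache refs, which is why a stable build graph is the only remaining condition for a cache hit.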

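Because the ARG (and the ENV derived from it) sits before apt-get, its value becomes part of the cache key of every later RUN step. A new SDK SHA therefore changes the build graph from that point on, and layers cached by prior builds no longer match.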
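A toy model may help show why placement matters. It is not BuildKit's actual cache-key algorithm; it only assumes that each layer's key covers the instruction plus its parent's key, which is enough to reproduce the effect. All function names are hypothetical.

```python
import hashlib
from typing import List


def layer_keys(instructions: List[str]) -> List[str]:
    """Chain-hash instructions: each key covers the step and its parent key."""
    keys, parent = [], ""
    for step in instructions:
        parent = hashlib.sha256((parent + "\n" + step).encode()).hexdigest()
        keys.append(parent)
    return keys


def dockerfile(sha: str, arg_first: bool) -> List[str]:
    """Simplified instruction list with the ARG before or after the heavy steps."""
    arg = f"ARG OPENHANDS_BUILD_GIT_SHA={sha}"
    heavy = ["RUN apt-get install ...", "RUN npm install ..."]
    return ["FROM base", arg, *heavy] if arg_first else ["FROM base", *heavy, arg]


def reused(old: List[str], new: List[str]) -> int:
    """Count leading layers whose keys match between two builds."""
    n = 0
    for a, b in zip(old, new):
        if a != b:
            break
        n += 1
    return n


# Same Dockerfile shape, two different SDK SHAs.
before = reused(layer_keys(dockerfile("aaa", True)), layer_keys(dockerfile("bbb", True)))
after = reused(layer_keys(dockerfile("aaa", False)), layer_keys(dockerfile("bbb", False)))
print(before, after)  # 1 3
```

With the ARG first, only the FROM layer survives a SHA change. With the ARG last, FROM, apt-get, and npm install are all reused.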

Fix

SDK PR #2522: move the ARG after the expensive layers.
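To keep the ARG from drifting back above the expensive layers, a small check along these lines could run in CI. This is a sketch, not part of either PR: the function name and the `heavy_pattern` default are assumptions, and it ignores line continuations and stages that inherit from an earlier stage.

```python
import re
from typing import List

ARG_NAME = "OPENHANDS_BUILD_GIT_SHA"


def arg_precedes_heavy_step(dockerfile_text: str, heavy_pattern: str = r"apt-get|npm install") -> List[str]:
    """Return names of stages where the SHA ARG is declared before a heavy RUN."""
    offenders: List[str] = []
    stage, arg_seen = "", False
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        upper = line.upper()
        if upper.startswith("FROM "):
            # A new stage starts: remember its name and reset the ARG flag.
            m = re.search(r"\bAS\s+(\S+)", line, flags=re.IGNORECASE)
            stage = m.group(1) if m else line.split()[1]
            arg_seen = False
        elif upper.startswith("ARG ") and line.split()[1].split("=")[0] == ARG_NAME:
            arg_seen = True
        elif upper.startswith("RUN ") and arg_seen and re.search(heavy_pattern, line):
            # A heavy step appears after the ARG, so its cache key would include the SHA.
            if stage not in offenders:
                offenders.append(stage)
    return offenders
```

An empty result means no stage declares the ARG ahead of an apt-get or npm install step.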

Validation

A/B test on CI (same 4 django images)

| Run | SDK | Dockerfile | cache-mode | apt-get cached? | Buildx p50 |
|---|---|---|---|---|---|
| #23343565318 (seed) | ARG fix branch | ARG after apt-get | max (export) | N/A (cold) | 149s |
| #23343709061 (test) | different SHA | ARG after apt-get | off (import only) | 3/4 CACHED | 154s |
| #23164396524 (baseline) | d129025 | ARG before apt-get | on | 0/414 CACHED | 322s |

Result: 2.1x faster with the fix (154s vs 322s per image).

Build log evidence

Test run (ARG fix, different SDK SHA) — apt-get CACHED from registry:

#12 importing cache manifest from ghcr.io/.../buildcache-...-django-1227...  → SUCCESS
#15 [base-image-minimal 2/4] RUN apt-get install ...
#15 CACHED

Mar 16 baseline (no fix) — apt-get rebuilt despite registry cache import:

#12 importing cache manifest from ghcr.io/.../buildcache-...-django-1356...  → SUCCESS
#16 [base-image-minimal 2/6] RUN apt-get install ...
#16 DONE 121.8s

Same registry cache mechanism, same images. The only difference: ARG placement in the Dockerfile.
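Counts such as "3/4 CACHED" can be derived from logs shaped like the excerpts above. The parser below is a sketch that assumes the plain-progress format shown there (a step header line followed by a `CACHED` or `DONE <seconds>s` line with the same step number); `apt_get_status` is a hypothetical name.

```python
import re
from typing import Dict, Optional

# Step header, e.g. "#15 [base-image-minimal 2/4] RUN apt-get install ..."
STEP_RE = re.compile(r"^#(\d+) \[[^\]]+\] RUN apt-get\b")
# Step result, e.g. "#15 CACHED" or "#16 DONE 121.8s"
RESULT_RE = re.compile(r"^#(\d+) (CACHED|DONE)(?: ([\d.]+)s)?")


def apt_get_status(log_text: str) -> Optional[Dict[str, object]]:
    """Return {'cached': bool, 'seconds': float} for the apt-get step, or None."""
    step_id = None
    for line in log_text.splitlines():
        line = line.strip()
        m = STEP_RE.match(line)
        if m:
            step_id = m.group(1)
            continue
        r = RESULT_RE.match(line)
        if r and step_id is not None and r.group(1) == step_id:
            cached = r.group(2) == "CACHED"
            seconds = float(r.group(3)) if r.group(3) else 0.0
            return {"cached": cached, "seconds": seconds}
    return None
```

Running it over every build log in a run and counting the `cached` values gives the per-run ratio.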

Local Docker experiment

Built the same Dockerfile in two variants, with the ARG before vs. after apt-get, then changed the ARG value and rebuilt:

| Variant | apt-get result | Time |
|---|---|---|
| ARG before apt-get + SHA change | REBUILT | 44s |
| ARG after apt-get + SHA change | CACHED | 0.7s |

Impact

  • apt-get: ~103s mean per image when not cached (from 414 build logs)
  • npm install: ~38s mean per image (also busted by the ARG, depends on apt-get)
  • Combined: ~141s/image = 45% of total build time
  • At 433 images with 4 workers: ~4+ hours wasted on cache misses
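The arithmetic behind these figures can be checked directly. The wasted-time estimate assumes the work divides evenly across the 4 workers; the variable names are illustrative only.

```python
APT_GET_S = 103        # mean apt-get time per image when not cached
NPM_S = 38             # mean npm install time per image
IMAGES = 433
WORKERS = 4
BASELINE_P50_S = 322   # buildx p50 with the ARG before apt-get
FIXED_P50_S = 154      # buildx p50 with the ARG after apt-get

per_image_s = APT_GET_S + NPM_S
# Assumption: perfect parallelism across workers.
wasted_hours = per_image_s * IMAGES / WORKERS / 3600
speedup = BASELINE_P50_S / FIXED_P50_S

print(per_image_s)             # 141
print(f"{wasted_hours:.1f}")   # 4.2
print(f"{speedup:.1f}")        # 2.1
```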

