Root Cause
SDK PR #2130 (commit `fd80128`, Mar 3) added an `OPENHANDS_BUILD_GIT_SHA` ARG (and a matching ENV) before `apt-get install` in the `base-image-minimal` Dockerfile stage. Because the value changes with every SDK commit, BuildKit registry cache layer matching breaks for every later layer whenever the SDK SHA changes.
FROM ${BASE_IMAGE} AS base-image-minimal
+ ARG OPENHANDS_BUILD_GIT_SHA=unknown # ← inserted here by fd80128
+ ENV OPENHANDS_BUILD_GIT_SHA=...
RUN apt-get install ... # ← cache key now includes SHA → miss
RUN npm install ... # ← also busted (depends on apt-get)
Before fd80128: registry cache from prior builds matched → apt-get CACHED (~0s).
After fd80128: SHA in the parent chain changes the layer hash → apt-get rebuilt (~103s/image).
How caching works for benchmark images
Each of the 433 SWT-bench images uses a unique `BASE_IMAGE`, so layers can't be shared between images within a run. The caching mechanism is GHCR registry cache:
- Each `docker buildx build` includes `--cache-from type=registry,ref=ghcr.io/openhands/eval-agent-server:buildcache-{slug}`
- The cache tag does NOT include the SDK SHA — it's stable across SDK bumps
- Prior builds export layers with `--cache-to type=registry,mode=max`
- Later builds import those layers if the Dockerfile build graph matches
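The flags above can be assembled roughly as follows. This is only a sketch: the helper name `buildx_command`, the example slug, and the reuse of the same `ref` for `--cache-to` are assumptions, and the real build script may pass other options.

```python
def buildx_command(slug, context=".", export_cache=True):
    """Build an argv list for `docker buildx build` with registry caching.

    `slug` identifies the benchmark image; the tag deliberately contains
    no SDK SHA, so it stays stable across SDK bumps.
    """
    ref = f"ghcr.io/openhands/eval-agent-server:buildcache-{slug}"
    cmd = ["docker", "buildx", "build",
           "--cache-from", f"type=registry,ref={ref}"]
    if export_cache:
        # Assumption: layers are exported back to the same cache ref.
        cmd += ["--cache-to", f"type=registry,ref={ref},mode=max"]
    cmd.append(context)
    return cmd


# Hypothetical slug, for illustration only.
print(" ".join(buildx_command("django-example")))
```

Passing `export_cache=False` gives the import-only variant used for the test run in the A/B comparison.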
The ARG before apt-get changes the build graph, so cached layers from prior builds no longer match.
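A toy model of why placement matters, assuming each layer's cache key is a hash of its parent's key plus its own instruction. This is a simplification and not BuildKit's actual key derivation; the function names and SHA values are hypothetical.

```python
import hashlib


def layer_keys(instructions):
    """Return one chained key per instruction (simplified cache model)."""
    keys, parent = [], ""
    for instr in instructions:
        parent = hashlib.sha256(f"{parent}\n{instr}".encode()).hexdigest()
        keys.append(parent)
    return keys


def cached_layers(previous, current):
    """Mark which layers of `current` hit the cache exported by `previous`."""
    exported = set(layer_keys(previous))
    return [(instr, key in exported)
            for instr, key in zip(current, layer_keys(current))]


def dockerfile(sha, sha_first):
    """Toy stage with the SHA-dependent ENV before or after the slow layers."""
    env = f"ENV OPENHANDS_BUILD_GIT_SHA={sha}"
    slow = ["RUN apt-get install ...", "RUN npm install ..."]
    return ["FROM base"] + ([env] + slow if sha_first else slow + [env])


for sha_first in (True, False):
    print("SHA first" if sha_first else "SHA last")
    for instr, hit in cached_layers(dockerfile("aaa", sha_first),
                                    dockerfile("bbb", sha_first)):
        print(f"  {'CACHED ' if hit else 'REBUILT'} {instr}")
```

In this model, a changed SHA placed first invalidates every layer after it, while a changed SHA placed last only invalidates itself.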
Fix
SDK PR #2522: move the ARG after the expensive layers.
Validation
A/B test on CI (same 4 django images)
| Run | SDK | Dockerfile | cache-mode | apt-get cached? | Buildx p50 |
|---|---|---|---|---|---|
| #23343565318 (seed) | ARG fix branch | ARG after apt-get | max (export) | N/A (cold) | 149s |
| #23343709061 (test) | different SHA | ARG after apt-get | off (import only) | 3/4 CACHED | 154s |
| #23164396524 (baseline) | d129025 | ARG before apt-get | on | 0/414 CACHED | 322s |
Result: 2.1x faster with the fix (154s vs 322s per image).
Build log evidence
Test run (ARG fix, different SDK SHA) — apt-get CACHED from registry:
#12 importing cache manifest from ghcr.io/.../buildcache-...-django-1227... → SUCCESS
#15 [base-image-minimal 2/4] RUN apt-get install ...
#15 CACHED
Mar 16 baseline (no fix) — apt-get rebuilt despite registry cache import:
#12 importing cache manifest from ghcr.io/.../buildcache-...-django-1356... → SUCCESS
#16 [base-image-minimal 2/6] RUN apt-get install ...
#16 DONE 121.8s
Same registry cache mechanism, same images. The only difference: ARG placement in the Dockerfile.
Local Docker experiment
Built the same Dockerfile twice, once with the ARG before apt-get and once with it after, then changed the ARG value and rebuilt:
| Variant | apt-get result | Time |
|---|---|---|
| ARG before apt-get + SHA change | REBUILT | 44s |
| ARG after apt-get + SHA change | CACHED | 0.7s |
Impact
- apt-get: ~103s mean per image when not cached (from 414 build logs)
- npm install: ~38s mean per image (also busted by the ARG, depends on apt-get)
- Combined: ~141s/image = 45% of total build time
- At 433 images with 4 workers: ~4+ hours wasted on cache misses
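The arithmetic behind these figures, as a quick check. It assumes the four workers split the images evenly and run fully in parallel, which the real pipeline may not achieve.

```python
APT_SECONDS = 103   # mean apt-get time per image when not cached
NPM_SECONDS = 38    # mean npm install time per image
IMAGES = 433
WORKERS = 4

per_image = APT_SECONDS + NPM_SECONDS              # seconds lost per image
wasted_hours = per_image * IMAGES / WORKERS / 3600  # wall-clock hours lost

print(f"{per_image}s per image, about {wasted_hours:.1f} hours per full run")
```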
Related
- Parent tracker: SWT-Bench image building slowness: root cause analysis and fix plan #531
- ARG fix PR: SDK #2522
- Guilty PR: SDK #2130 (commit `fd80128`)
- Secondary factor: SDK #2465 (commit `d129025`) added an npm install step for ACP, which costs ~38s/image unconditionally