Root Cause
SWT-bench image build throughput regressed significantly. Late February runs (SDK cefaebf) built 364-382 images in 5h03m-5h29m (66-76 img/h). Mid-March runs built 314-409 images in 9h20m-9h57m (31-42 img/h). The root cause is broken registry cache due to a Dockerfile ARG ordering mistake.
What happened
SDK PR #2130 (commit fd80128, Mar 3) added OPENHANDS_BUILD_GIT_SHA ARG before the apt-get install and npm install layers in base-image-minimal:
```dockerfile
FROM ${BASE_IMAGE} AS base-image-minimal
+ ARG OPENHANDS_BUILD_GIT_SHA=unknown   # changes every SDK bump
+ ENV OPENHANDS_BUILD_GIT_SHA=...
RUN apt-get install ...                 # cache key now includes SHA → miss
RUN npm install ...                     # also busted (depends on apt-get)
```
Why this matters
Benchmark images use GHCR registry cache (--cache-from type=registry). The cache tags are stable across SDK bumps (buildcache-{target}-{base_image_slug}, no SHA). Prior builds export layers with --cache-to type=registry,mode=max.
When the Dockerfile build graph matches, BuildKit reuses cached layers from GHCR — including the expensive apt-get and npm install. The ARG before these layers changes the ancestor chain hash, breaking layer matching even though the cache tag is found.
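As a concrete sketch of that cache wiring (the registry name, target, and slug values below are placeholder assumptions, not the real CI configuration):

```shell
# Placeholder values -- the real CI derives these per image.
REGISTRY="ghcr.io/example/swt-bench"
TARGET="binary"
BASE_IMAGE_SLUG="ubuntu-22.04"
# Stable cache tag: no SDK SHA in it, so it survives SDK bumps.
CACHE_REF="${REGISTRY}:buildcache-${TARGET}-${BASE_IMAGE_SLUG}"

# `echo` keeps the sketch runnable without Docker; drop it in real CI.
echo docker buildx build \
  --cache-from "type=registry,ref=${CACHE_REF}" \
  --cache-to "type=registry,ref=${CACHE_REF},mode=max" \
  --build-arg OPENHANDS_BUILD_GIT_SHA=fd80128 \
  -t "${REGISTRY}:img-${TARGET}" .
```

The tag resolves fine after an SDK bump; what broke was layer matching inside it, because the misplaced ARG changed the ancestor-chain hash of every layer below it.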
Measured impact (from 414 build logs, run #23164396524)
- apt-get install: 103s mean per image (rebuilt from scratch, 0/414 cached)
- npm install: 38s mean per image
- Combined: 141s/image of unnecessary rebuilds per SDK bump
Secondary factor
SDK PR #2465 (commit d129025, Mar 16) added npm install -g @zed-industries/claude-agent-acp @zed-industries/codex-acp to every benchmark image. This is a new ~38s/image step that didn't exist in the Feb baseline. Not a bug per se, but compounds the cache invalidation problem.
Fix
SDK PR #2522: Move the ARG after the expensive layers. Registry cache layer matching is restored.
Benchmarks PR #547: Set SWT-bench cache-mode default back to max (was changed to off in #541 because cache was broken — it works again with the ARG fix).
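A minimal sketch of the corrected ordering (layer contents are illustrative, not the real Dockerfile):

```dockerfile
FROM ${BASE_IMAGE} AS base-image-minimal

# Expensive layers first: their cache keys no longer depend on the SHA.
RUN apt-get install ...
RUN npm install ...

# SHA-dependent ARG/ENV moved below them; only layers after this point
# are invalidated on an SDK bump.
ARG OPENHANDS_BUILD_GIT_SHA=unknown
ENV OPENHANDS_BUILD_GIT_SHA=${OPENHANDS_BUILD_GIT_SHA}
```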
Validated
Small-scale A/B test (4 django images)
A/B test on CI (details in #544):
| Run | Dockerfile | apt-get cached? | Buildx p50 |
|---|---|---|---|
| Seed | ARG after (fix), cache-mode=max | N/A (cold) | 149s |
| Test | ARG after (fix), different SDK SHA | 3/4 CACHED | 154s |
| Baseline | ARG before (current) | 0/414 CACHED | 322s |
Per-image buildx p50 dropped from 322s to 154s (4 images only — see full-scale validation below).
Full-scale validation (433 images, run #23382357696)
- Result: 9h04m, 47.8 img/h (vs 35.5 img/h pre-fix with cache-mode=max)
- Cache behavior confirmed: `cache_import_miss_count=1` for 432/433 images, `cached_step_count` 12-13 for 94% of images
- Remaining gap: the cache fix restored layer caching, but throughput is still below the Feb baseline (66-76 img/h). Profiling identified 42.2% of per-image wall clock as I/O overhead (image export, cache export, push) unrelated to the ARG fix. See #530 ("SWT-bench image build throughput tracker", the historical source of truth), comment 5, for the full profiling breakdown.
Previous investigation
The original analysis in this issue correctly identified Layer 1 (SDK build path changes) as the dominant problem but attributed it to Python 3.12/boto3 changes. Those were already present in the fast Feb baseline (SDK cefaebf). The actual culprit was narrowed down in #544 to the ARG ordering in commit fd80128.
Layer 2 (registry cache export contention, PR #541) and Layer 3 (BuildKit instability at high concurrency) from the original analysis remain valid but secondary.
Further Optimizations
Beyond the ARG fix, several improvements could reduce build times further:
Make npm install ACP optional for benchmarks
The ACP servers (@zed-industries/claude-agent-acp, @zed-industries/codex-acp) add ~38s/image (measured). Benchmarks that don't use ACPAgent could skip this step via a build arg like INSTALL_ACP=false.
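A hedged sketch of what that could look like (`INSTALL_ACP` is the arg name suggested above; the package list is from the measurement):

```dockerfile
ARG INSTALL_ACP=true
# Skipped entirely when a benchmark passes --build-arg INSTALL_ACP=false,
# saving ~38s/image for runs that never use ACPAgent.
RUN if [ "$INSTALL_ACP" = "true" ]; then \
      npm install -g @zed-industries/claude-agent-acp @zed-industries/codex-acp; \
    fi
```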
Reduce apt-get overhead
- Pre-bake common packages into a shared base layer pushed to GHCR, so `base-image-minimal` inherits them instead of installing per image
- Pin apt package versions to improve cache stability
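A sketch of the pre-baked base layer idea (image names and package list are hypothetical placeholders):

```dockerfile
# Built and pushed once; benchmark builds pull it instead of
# re-running apt-get per image.
FROM ubuntu:22.04 AS common-base
RUN apt-get update && apt-get install -y --no-install-recommends \
      build-essential git curl \
    && rm -rf /var/lib/apt/lists/*

# base-image-minimal would then start from the pushed tag, e.g.:
# FROM ghcr.io/example/common-base:latest AS base-image-minimal
```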
Safe concurrency limits
- Hard-cap `max-workers` at 4 for cold builds (16 workers caused 9-19 BuildKit resets in the A/B test from #531, "SWT-Bench image building slowness: root cause analysis and fix plan")
- Raise the prune threshold from 60% to reduce prune cycles (measured: 2 prune events cost 39 min of dead time in run #23382357696)
Cache seeding on Dockerfile changes
- Any PR modifying the Dockerfile triggers a small (10-20 image) `cache-mode=max` build to pre-populate the registry cache
- Subsequent full builds then import cached layers instead of rebuilding from scratch
Controlled cold-build workflow
Instead of cold-building all 433 images at max parallelism:
- Seed build: `max-workers=2`, `cache-mode=max` — populate registry cache without contention
- Full build: `max-workers=4`, `cache-mode=off` — read seeded cache, don't write
Prevention
- Add a CI check on Dockerfile PRs that verifies apt-get is CACHED after an ARG value change (regression test for cache compatibility)
- Never bundle SDK submodule bumps with Dockerfile changes in the same PR
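The CI check could be a small log grep along these lines (a sketch: it assumes `buildx --progress=plain` output where a cached step's `RUN` line is immediately followed by a `CACHED` line, which may vary by BuildKit version):

```shell
# Returns success iff the apt-get layer in the given buildx log was
# served from cache, e.g.:
#   #7 [base-image-minimal 3/9] RUN apt-get install ...
#   #7 CACHED
apt_layer_cached() {
  grep -A1 'RUN apt-get install' "$1" | grep -q 'CACHED'
}
```

In CI this would run after a second build with a changed `OPENHANDS_BUILD_GIT_SHA` value and fail the job when the layer rebuilds.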
Related Issues
- Investigation: Root cause of per-image build time doubling (SDK cefaebf → eab666f) #544 — Root cause investigation with A/B test validation
- Use cache-mode=off for batch benchmark image builds #540 — Controlled cache-mode experiment (100% registry miss rate confirmed pre-fix)
- ci(swtbench): default cache-mode to off #541 — Set cache-mode=off (workaround, to be reverted by #547)
- Dockerfile ARG ordering causes full rebuild on every SDK commit #542 — ARG ordering fix proposal
- SDK #2522 — The fix
- SDK #2130 — The guilty PR