
SWT-Bench image building slowness: root cause analysis and fix plan #531

@simonrosenberg

Description

Root Cause

SWT-bench image build throughput regressed significantly. Late-February runs (SDK cefaebf) built 364-382 images in 5h03m-5h29m (66-76 img/h); mid-March runs built 314-409 images in 9h20m-9h57m (31-42 img/h). The root cause is a broken registry cache caused by a Dockerfile ARG ordering mistake.

What happened

SDK PR #2130 (commit fd80128, Mar 3) added OPENHANDS_BUILD_GIT_SHA ARG before the apt-get install and npm install layers in base-image-minimal:

```diff
  FROM ${BASE_IMAGE} AS base-image-minimal
+ ARG OPENHANDS_BUILD_GIT_SHA=unknown   # changes every SDK bump
+ ENV OPENHANDS_BUILD_GIT_SHA=...
  RUN apt-get install ...                # cache key now includes SHA → miss
  RUN npm install ...                    # also busted (depends on apt-get)
```

Why this matters

Benchmark images use GHCR registry cache (--cache-from type=registry). The cache tags are stable across SDK bumps (buildcache-{target}-{base_image_slug}, no SHA). Prior builds export layers with --cache-to type=registry,mode=max.

When the Dockerfile build graph matches, BuildKit reuses cached layers from GHCR, including the expensive apt-get and npm install steps. Placing the ARG before those layers changes the ancestor-chain hash, so layer matching fails even though the cache tag itself is found.
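
As a sketch, the cache wiring amounts to something like the following (the registry org and exact tag are placeholders, not the verbatim CI command; only the flag shapes matter):

```shell
# Illustrative buildx invocation: the cache ref is stable across SDK bumps
# (no SHA in the tag), so layer reuse depends only on the build graph.
docker buildx build \
  --target base-image-minimal \
  --cache-from type=registry,ref=ghcr.io/ORG/buildcache-base-image-minimal-BASE_SLUG \
  --cache-to type=registry,ref=ghcr.io/ORG/buildcache-base-image-minimal-BASE_SLUG,mode=max \
  .
```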

Measured impact (from 414 build logs, run #23164396524)

  • apt-get install: 103s mean per image (rebuilt from scratch, 0/414 cached)
  • npm install: 38s mean per image
  • Combined: 141s/image of unnecessary rebuilds per SDK bump
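
As a back-of-envelope check, the per-image waste adds up to roughly 16 machine-hours per SDK bump at the scale of this run:

```shell
# Back-of-envelope: measured 141 s/image of redundant apt-get + npm work,
# multiplied across the 414 images in run #23164396524.
WASTE_PER_IMAGE=141   # seconds (103 apt-get + 38 npm)
IMAGES=414
TOTAL=$((WASTE_PER_IMAGE * IMAGES))
echo "${TOTAL} s total"                 # 58374 s
echo "$((TOTAL / 3600)) machine-hours"  # ~16 machine-hours of redundant work
```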

Secondary factor

SDK PR #2465 (commit d129025, Mar 16) added npm install -g @zed-industries/claude-agent-acp @zed-industries/codex-acp to every benchmark image. This is a new ~38s/image step that didn't exist in the Feb baseline. Not a bug per se, but compounds the cache invalidation problem.

Fix

SDK PR #2522: Move the ARG after the expensive layers. Registry cache layer matching is restored.
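
The corrected ordering looks roughly like this (a sketch of the idea, not the verbatim PR diff):

```dockerfile
FROM ${BASE_IMAGE} AS base-image-minimal

# Expensive, SDK-version-independent layers come first so their cache keys
# stay stable across SDK bumps.
RUN apt-get install ...   # cached across SDK bumps
RUN npm install ...       # cached across SDK bumps

# The SHA-bearing ARG/ENV now only invalidates layers below this point.
ARG OPENHANDS_BUILD_GIT_SHA=unknown
ENV OPENHANDS_BUILD_GIT_SHA=${OPENHANDS_BUILD_GIT_SHA}
```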

Benchmarks PR #547: Set SWT-bench cache-mode default back to max (was changed to off in #541 because cache was broken — it works again with the ARG fix).

Validated

Small-scale A/B test (4 django images; details in #544):

| Run | Dockerfile | apt-get cached? | Buildx p50 |
| --- | --- | --- | --- |
| Seed | ARG after (fix), cache-mode=max | N/A (cold) | 149s |
| Test | ARG after (fix), different SDK SHA | 3/4 CACHED | 154s |
| Baseline | ARG before (current) | 0/414 CACHED | 322s |

Per-image buildx p50 dropped from 322s to 154s (4 images only — see full-scale validation below).

Full-scale validation (433 images, run #23382357696)

  • Result: 9h04m, 47.8 img/h (vs 35.5 img/h pre-fix with cache-mode=max)
  • Cache behavior confirmed: cache_import_miss_count=1 for 432/433 images, cached_step_count 12-13 for 94% of images
  • Remaining gap: the cache fix restored layer caching, but throughput is still below the Feb baseline (66-76 img/h). Profiling identified 42.2% of per-image wall clock as I/O overhead (image export, cache export, push) unrelated to the ARG fix. See #530 (SWT-bench image build throughput tracker, the historical source of truth), comment 5, for the full profiling breakdown.
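
A rough sanity check on that gap, assuming the I/O overhead is fully removable and throughput scales inversely with per-image wall clock:

```shell
# If 42.2% of per-image wall clock is removable I/O overhead, the 47.8 img/h
# post-fix throughput would scale to roughly 47.8 / (1 - 0.422) img/h.
awk 'BEGIN { printf "%.1f img/h\n", 47.8 / (1 - 0.422) }'   # ~82.7 img/h
```

That estimate lands back inside the Feb baseline range, consistent with I/O being the dominant remaining cost.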

Previous investigation

The original analysis in this issue correctly identified Layer 1 (SDK build path changes) as the dominant problem but attributed it to Python 3.12/boto3 changes. Those changes were already present in the fast Feb baseline (SDK cefaebf), so they could not explain the regression. The actual culprit was narrowed down in #544 to the ARG ordering introduced in commit fd80128.

Layer 2 (registry cache export contention, PR #541) and Layer 3 (BuildKit instability at high concurrency) from the original analysis remain valid but secondary.


Further Optimizations

Beyond the ARG fix, several improvements could reduce build times further:

Make npm install ACP optional for benchmarks

The ACP servers (@zed-industries/claude-agent-acp, @zed-industries/codex-acp) add ~38s/image (measured). Benchmarks that don't use ACPAgent could skip this step via a build arg like INSTALL_ACP=false.

Reduce apt-get overhead

  • Pre-bake common packages into a shared base layer pushed to GHCR, so base-image-minimal inherits them instead of installing per image
  • Pin apt package versions to improve cache stability

Safe concurrency limits

Cache seeding on Dockerfile changes

  • Any PR modifying the Dockerfile triggers a small (10-20 image) cache-mode=max build to pre-populate registry cache
  • Subsequent full builds then import cached layers instead of rebuilding from scratch

Controlled cold-build workflow

Instead of cold-building all 433 images at max parallelism:

  1. Seed build: max-workers=2, cache-mode=max — populate registry cache without contention
  2. Full build: max-workers=4, cache-mode=off — read seeded cache, don't write
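
Sketched as CI steps (the build-images entry point and flag spellings are hypothetical; the max-workers/cache-mode values are the ones above):

```shell
# Phase 1: seed — low parallelism avoids registry cache-export contention.
./build-images --max-workers 2 --cache-mode max

# Phase 2: full build — read the seeded cache, skip the expensive export.
./build-images --max-workers 4 --cache-mode off
```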

Prevention

  • Add a CI check on Dockerfile PRs that verifies apt-get is CACHED after an ARG value change (regression test for cache compatibility)
  • Never bundle SDK submodule bumps with Dockerfile changes in the same PR
