
SWT-Bench image building slowness: root cause analysis and fix plan #531

@simonrosenberg

Description

Root Cause

SWT-bench image build throughput regressed significantly. Late-February runs (SDK cefaebf) built 364-382 images in 5h03m-5h29m (66-76 img/h); mid-March runs built 314-409 images in 9h20m-9h57m (31-42 img/h). The root cause is a broken registry cache caused by a Dockerfile ARG ordering mistake.

What happened

SDK PR #2130 (commit fd80128, Mar 3) added OPENHANDS_BUILD_GIT_SHA ARG before the apt-get install and npm install layers in base-image-minimal:

```diff
  FROM ${BASE_IMAGE} AS base-image-minimal
+ ARG OPENHANDS_BUILD_GIT_SHA=unknown   # changes every SDK bump
+ ENV OPENHANDS_BUILD_GIT_SHA=...
  RUN apt-get install ...                # cache key now includes SHA → miss
  RUN npm install ...                    # also busted (depends on apt-get)
```

Why this matters

Benchmark images use GHCR registry cache (--cache-from type=registry). The cache tags are stable across SDK bumps (buildcache-{target}-{base_image_slug}, no SHA). Prior builds export layers with --cache-to type=registry,mode=max.

When the Dockerfile build graph matches, BuildKit reuses cached layers from GHCR, including the expensive apt-get and npm install steps. Placing the ARG before those layers changes the ancestor-chain hash, so layer matching fails even though the cache tag itself is found.
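
As a sketch, the cache wiring amounts to something like the following (the registry org and exact tag are placeholders, not the verbatim CI command; only the flag shapes matter):

```shell
# Illustrative buildx invocation: the cache ref is stable across SDK bumps
# (no SHA in the tag), so layer reuse depends only on the build graph.
docker buildx build \
  --target base-image-minimal \
  --cache-from type=registry,ref=ghcr.io/ORG/buildcache-base-image-minimal-BASE_SLUG \
  --cache-to type=registry,ref=ghcr.io/ORG/buildcache-base-image-minimal-BASE_SLUG,mode=max \
  .
```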

Measured impact (from 414 build logs, run #23164396524)

  • apt-get install: 103s mean per image (rebuilt from scratch, 0/414 cached)
  • npm install: 38s mean per image
  • Combined: 141s/image of unnecessary rebuilds per SDK bump
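
As a back-of-envelope check, the per-image waste adds up to roughly 16 machine-hours per SDK bump at the scale of this run:

```shell
# Back-of-envelope: measured 141 s/image of redundant apt-get + npm work,
# multiplied across the 414 images in run #23164396524.
WASTE_PER_IMAGE=141   # seconds (103 apt-get + 38 npm)
IMAGES=414
TOTAL=$((WASTE_PER_IMAGE * IMAGES))
echo "${TOTAL} s total"                 # 58374 s
echo "$((TOTAL / 3600)) machine-hours"  # ~16 machine-hours of redundant work
```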

Secondary factor

SDK PR #2465 (commit d129025, Mar 16) added npm install -g @zed-industries/claude-agent-acp @zed-industries/codex-acp to every benchmark image. This is a new ~38s/image step that didn't exist in the Feb baseline. Not a bug per se, but compounds the cache invalidation problem.

Fix

SDK PR #2522: Move the ARG after the expensive layers. Registry cache layer matching is restored.
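
The corrected ordering looks roughly like this (a sketch of the idea, not the verbatim PR diff):

```dockerfile
FROM ${BASE_IMAGE} AS base-image-minimal

# Expensive, SDK-version-independent layers come first so their cache keys
# stay stable across SDK bumps.
RUN apt-get install ...   # cached across SDK bumps
RUN npm install ...       # cached across SDK bumps

# The SHA-bearing ARG/ENV now only invalidates layers below this point.
ARG OPENHANDS_BUILD_GIT_SHA=unknown
ENV OPENHANDS_BUILD_GIT_SHA=${OPENHANDS_BUILD_GIT_SHA}
```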

Benchmarks PR #547: Set SWT-bench cache-mode default back to max (was changed to off in #541 because cache was broken — it works again with the ARG fix).

Validated

Small-scale A/B test (4 django images; details in #544):

| Run | Dockerfile | apt-get cached? | Buildx p50 |
| --- | --- | --- | --- |
| Seed | ARG after (fix), cache-mode=max | N/A (cold) | 149s |
| Test | ARG after (fix), different SDK SHA | 3/4 CACHED | 154s |
| Baseline | ARG before (current) | 0/414 CACHED | 322s |

Per-image buildx p50 dropped from 322s to 154s (4 images only — see full-scale validation below).

Full-scale validation (433 images, run #23382357696)

  • Result: 9h04m, 47.8 img/h (vs 35.5 img/h pre-fix with cache-mode=max)
  • Cache behavior confirmed: cache_import_miss_count=1 for 432/433 images, cached_step_count 12-13 for 94% of images
  • Remaining gap: the cache fix restored layer caching, but throughput is still below the Feb baseline (66-76 img/h). Profiling identified 42.2% of per-image wall clock as I/O overhead (image export, cache export, push) unrelated to the ARG fix. See #530 (SWT-bench image build throughput tracker, the historical source of truth), comment 5, for the full profiling breakdown.
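
A rough sanity check on that gap, assuming the I/O overhead is fully removable and throughput scales inversely with per-image wall clock:

```shell
# If 42.2% of per-image wall clock is removable I/O overhead, the 47.8 img/h
# post-fix throughput would scale to roughly 47.8 / (1 - 0.422) img/h.
awk 'BEGIN { printf "%.1f img/h\n", 47.8 / (1 - 0.422) }'   # ~82.7 img/h
```

That estimate lands back inside the Feb baseline range, consistent with I/O being the dominant remaining cost.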

Previous investigation

The original analysis in this issue correctly identified Layer 1 (SDK build path changes) as the dominant problem but attributed it to Python 3.12/boto3 changes. Those changes were already present in the fast Feb baseline (SDK cefaebf), so they could not explain the regression. The actual culprit was narrowed down in #544 to the ARG ordering introduced in commit fd80128.

Layer 2 (registry cache export contention, PR #541) and Layer 3 (BuildKit instability at high concurrency) from the original analysis remain valid but secondary.


Further Optimizations

Beyond the ARG fix, several improvements could reduce build times further:

Make npm install ACP optional for benchmarks

The ACP servers (@zed-industries/claude-agent-acp, @zed-industries/codex-acp) add ~38s/image (measured). Benchmarks that don't use ACPAgent could skip this step via a build arg like INSTALL_ACP=false.

Reduce apt-get overhead

  • Pre-bake common packages into a shared base layer pushed to GHCR, so base-image-minimal inherits them instead of installing per image
  • Pin apt package versions to improve cache stability

Safe concurrency limits

Cache seeding on Dockerfile changes

  • Any PR modifying the Dockerfile triggers a small (10-20 image) cache-mode=max build to pre-populate registry cache
  • Subsequent full builds then import cached layers instead of rebuilding from scratch

Controlled cold-build workflow

Instead of cold-building all 433 images at max parallelism:

  1. Seed build: max-workers=2, cache-mode=max — populate registry cache without contention
  2. Full build: max-workers=4, cache-mode=off — read seeded cache, don't write
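
Sketched as CI steps (the build-images entry point and flag spellings are hypothetical; the max-workers/cache-mode values are the ones above):

```shell
# Phase 1: seed — low parallelism avoids registry cache-export contention.
./build-images --max-workers 2 --cache-mode max

# Phase 2: full build — read the seeded cache, skip the expensive export.
./build-images --max-workers 4 --cache-mode off
```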

Prevention

  • Add a CI check on Dockerfile PRs that verifies apt-get is CACHED after an ARG value change (regression test for cache compatibility)
  • Never bundle SDK submodule bumps with Dockerfile changes in the same PR
