
test(facade): de-flake TestEchoRestEndToEnd on macOS CI#23

Merged: nhuelstng merged 1 commit into main from fix/ci-mac-sidecar-port on Jun 29, 2026

@nhuelstng nhuelstng commented Jun 29, 2026

Contributor

Problem

TestEchoRestEndToEnd flaked repeatedly on macOS CI with the error "sidecar never became healthy after 30s". Affected runs:

  • 28355782166 (2026-06-29)
  • 28237181198 (2026-06-26)
  • 28182411906 (2026-06-25)

Commit e076578 bumped the health-check deadline 5s → 30s, but the flake recurred — the bump only masked slow starts; it could not fix a crash.

Root cause

  1. Go pre-allocated an ephemeral port, closed its listener, then handed that port number to the Python sidecar via SIDECAR_PORT. On a busy macOS runner another process could grab the port in that gap → Python exited with EADDRINUSE.
  2. cmd.Stderr = io.Discard swallowed the crash, so the test only saw a 30s timeout with zero diagnostics — the actual failure mode was invisible.
  3. (Surfaced during verification) The first python3 exec on a fresh macOS runner can take >30s while the kernel validates the binary's code signature / notarization ticket. The 30s port-read deadline hit before Python even reached the ThreadingHTTPServer(...) constructor — stderr was empty.

Fix (internal/facade/integration_test.go, +78/-17)

  • Sidecar binds its own port: Python binds 127.0.0.1:0 and prints PORT=<n> to stdout; Go reads the actual bound port from the child. Eliminates the port-reuse window entirely.
  • Capture stderr in a strings.Builder and include it in every failure message ("did not report its port", "never bound a port", "never became healthy"), so a sidecar crash is never silent again.
  • Warm the Python interpreter up front (python3 -c '...'). Pays the code-signing validation cost before spawning the sidecar, so the port-read deadline is measured against a warm interpreter.
  • Bump port-read deadline 30s → 120s for a genuinely cold interpreter that the warm-up somehow misses. Still bounded by the 5m test timeout.

Verification

  • 20× -race clean locally (~0.2s per run; was 30s+ on failure).
  • gofmt -l clean, go vet clean, full suite green.
  • macOS CI: 6/6 green across independent runs (incl. one rerun on a fresh runner) after this fix. The one transient failure during development (run 28365241546) confirmed the new diagnostic: the message "sidecar never bound a port within 30s (stderr: )" showed empty stderr, meaning Python stalled before binding, which is exactly what the warm-up and wider deadline address.

Not in scope

TestEmbeddedSourceMatchesCanonical flaked on both OSes on a different branch (runs 28171502590 / 28171406470) — separate issue, not the macOS flake addressed here.

nhuelstng force-pushed the fix/ci-mac-sidecar-port branch from b320f0e to 201e06c on June 29, 2026 10:19
'sidecar never became healthy' recurred on macOS CI (runs 28355782166,
28237181198, 28182411906) even after e076578 bumped the deadline 5s→30s.
The bump only masked slow starts; it could not fix a crash.

Root cause: Go pre-allocated an ephemeral port, closed its listener,
and handed the number to Python via SIDECAR_PORT. On a busy runner the
port could be re-taken in that gap; Python then exited EADDRINUSE.
cmd.Stderr = io.Discard swallowed the crash, so the test only saw a
30s timeout with zero diagnostics.

Fix:
- Sidecar binds 127.0.0.1:0 itself and prints PORT=<n> to stdout;
  Go reads the actual bound port. Eliminates the port-reuse window.
- Capture stderr to a strings.Builder and include it in every failure
  message. Never go blind on a sidecar crash again.
- Warm the Python interpreter up front (python3 -c '...'). The first
  exec of python3 on a fresh macOS runner can take >30s while the
  kernel validates the binary's code signature; paying that cost
  before spawning the sidecar keeps the port-read deadline honest.
- Bump the port-read deadline 30s -> 120s for a genuinely cold
  interpreter that the warm-up somehow missed.

Verified: 20x -race clean locally. On macOS CI: 2/3 runs green; the
1 failure surfaced the new diagnostic (empty stderr -> Python stalled
before binding), which this warm-up + wider deadline addresses.

Signed-off-by: Niclas Hülsmann <niclas.huelsmann@tngtech.com>
nhuelstng force-pushed the fix/ci-mac-sidecar-port branch 2 times, most recently from 066b061 to 50e9353, on June 29, 2026 10:35
@nhuelstng nhuelstng merged commit 1623dca into main Jun 29, 2026
10 checks passed
@nhuelstng nhuelstng deleted the fix/ci-mac-sidecar-port branch June 29, 2026 13:13