From 96d4136cee17fad550f7be2d1caad1f29d25de5f Mon Sep 17 00:00:00 2001
From: JacobPEvans <20714140+JacobPEvans@users.noreply.github.com>
Date: Sun, 24 May 2026 11:39:22 -0400
Subject: [PATCH 1/2] docs(cicd): add self-hosted runner reliability requirements
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The CI/CD overview documented the four runner tiers but didn't say what a self-hosted runner has to actually be. The recurring token-refresh failure in orbstack-kubernetes (JacobPEvans/orbstack-kubernetes#234, #237) shows the cost of leaving this implicit.

Adds a single subsection between "Runner tiers" and "The shape of every IaC pipeline" listing the five non-negotiables for any self-hosted runner: GitHub App auth (not PAT), digest-pinned image, process healthcheck, dead-man's-switch heartbeat, pre-flight secret check. Links to the orbstack-kubernetes runner as the reference implementation.

Companion PRs codify the same rules at the AI-agent layer:
- JacobPEvans/ai-assistant-instructions#654 (org-wide ci-cd-policy rule)
- JacobPEvans/claude-code-plugins#321 (self-hosted-runners skill)

Supersedes the earlier standalone runner-topology-page draft in this PR's history; the four-tier CI/CD section landed in #23 in the meantime, making a separate topology page redundant.

Assisted-by: Claude
---
 infrastructure/cicd/overview.mdx | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/infrastructure/cicd/overview.mdx b/infrastructure/cicd/overview.mdx
index 750e2f3..0efd5cc 100644
--- a/infrastructure/cicd/overview.mdx
+++ b/infrastructure/cicd/overview.mdx
@@ -23,6 +23,20 @@ Pick by what the workload actually needs:

 The decision tree is workload-first: a macOS build picks the Mac tier; an IaC apply picks RunsOn; a public-repo lint picks GitHub-hosted; a sensitive-credential job picks the locked-down self-hosted runner. The cost ordering is "free → very cheap → host-cost → host-cost", but the cost is rarely what drives the choice.

+## Self-hosted runner reliability
+
+The two self-hosted tiers (Mac and locked-down) are the only ones the org physically operates — and they are the single points of failure for any E2E gate that targets them. Both have first-hand experience with silent runner death blocking PRs at merge (most recently [JacobPEvans/orbstack-kubernetes#234](https://github.com/JacobPEvans/orbstack-kubernetes/pull/234)).
+
+Every self-hosted runner MUST satisfy all five of:
+
+1. **GitHub App auth, not personal access token.** The runner image authenticates via `APP_ID` + `APP_PRIVATE_KEY` and mints registration tokens from installation tokens internally — auto-refreshing, never expires while the App stays installed. PATs silently expire (fine-grained PATs cap at one year) and break the runner with no upstream visibility; PRs block at merge and nobody notices until someone tries to ship.
+2. **Digest-pinned runner image or VM template.** No floating tags (`:latest`, `:ubuntu-jammy` alone). Use `image@sha256:...` with Renovate's docker-compose / docker-image manager tracking the digest, or pin the VM build artifact and bump deliberately.
+3. **Process-level healthcheck** — Docker `healthcheck:`, systemd `WatchdogSec`, or equivalent — that probes the runner's actual ability to do its job (reach `api.github.com`, talk to the cluster, etc.). Failed health must surface in standard inspection tools (`docker compose ps`, `systemctl status`).
+4. **Dead-man's-switch heartbeat** to healthchecks.io or equivalent, pinged only when the runner is healthy. Silence pages someone. Without this, the runner dies and nobody notices until a PR blocks.
+5. **Pre-flight secret check** that asserts required secrets (App key, kubeconfig, age key) are non-empty in the injected env before launching the runner process. Fail loud with the actionable error instead of letting the runner enter a silent retry loop.
+
+Reference implementation: [`orbstack-kubernetes/docker/actions-runner/`](https://github.com/JacobPEvans/orbstack-kubernetes/tree/main/docker/actions-runner) (`docker-compose.yml`, `Makefile` `runner-*` targets, `docs/TESTING.md`).
+
 ## The shape of every IaC pipeline

 | Stage | Trigger | Where it runs | What it does |

From 9b00d7f3d569d98c5d75199d138bf2aa40dfb02f Mon Sep 17 00:00:00 2001
From: JacobPEvans <20714140+JacobPEvans@users.noreply.github.com>
Date: Sun, 24 May 2026 16:59:14 -0400
Subject: [PATCH 2/2] docs(cicd): tighten self-hosted runner section, drop incident context

Restate the reliability requirements as policy rather than as "things we learned the hard way". Each rule now reads as the current standard; the historical incident link and the consequence-narrative trailers are removed.

Assisted-by: Claude
---
 infrastructure/cicd/overview.mdx | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/infrastructure/cicd/overview.mdx b/infrastructure/cicd/overview.mdx
index 0efd5cc..817b48d 100644
--- a/infrastructure/cicd/overview.mdx
+++ b/infrastructure/cicd/overview.mdx
@@ -25,15 +25,13 @@ The decision tree is workload-first: a macOS build picks the Mac tier; an IaC ap

 ## Self-hosted runner reliability

-The two self-hosted tiers (Mac and locked-down) are the only ones the org physically operates — and they are the single points of failure for any E2E gate that targets them. Both have first-hand experience with silent runner death blocking PRs at merge (most recently [JacobPEvans/orbstack-kubernetes#234](https://github.com/JacobPEvans/orbstack-kubernetes/pull/234)).
+The two self-hosted tiers (Mac and locked-down) are the only ones the org physically operates. Each runner is a single point of failure for any E2E gate that targets it. Every self-hosted runner MUST satisfy all five requirements:

-Every self-hosted runner MUST satisfy all five of:
-
-1. **GitHub App auth, not personal access token.** The runner image authenticates via `APP_ID` + `APP_PRIVATE_KEY` and mints registration tokens from installation tokens internally — auto-refreshing, never expires while the App stays installed. PATs silently expire (fine-grained PATs cap at one year) and break the runner with no upstream visibility; PRs block at merge and nobody notices until someone tries to ship.
+1. **GitHub App auth, not personal access token.** The runner image authenticates via `APP_ID` + `APP_PRIVATE_KEY` and mints registration tokens from installation tokens internally. Each installation token expires after one hour, but the runner mints a fresh one whenever it needs it, so authentication keeps working for as long as the App stays installed. PATs are forbidden: fine-grained PATs cap at one year, and the expiry is invisible upstream.
 2. **Digest-pinned runner image or VM template.** No floating tags (`:latest`, `:ubuntu-jammy` alone). Use `image@sha256:...` with Renovate's docker-compose / docker-image manager tracking the digest, or pin the VM build artifact and bump deliberately.
-3. **Process-level healthcheck** — Docker `healthcheck:`, systemd `WatchdogSec`, or equivalent — that probes the runner's actual ability to do its job (reach `api.github.com`, talk to the cluster, etc.). Failed health must surface in standard inspection tools (`docker compose ps`, `systemctl status`).
-4. **Dead-man's-switch heartbeat** to healthchecks.io or equivalent, pinged only when the runner is healthy. Silence pages someone. Without this, the runner dies and nobody notices until a PR blocks.
-5. **Pre-flight secret check** that asserts required secrets (App key, kubeconfig, age key) are non-empty in the injected env before launching the runner process. Fail loud with the actionable error instead of letting the runner enter a silent retry loop.
+3. **Process-level healthcheck** (Docker `healthcheck:`, systemd `WatchdogSec`, or equivalent) that probes the runner's actual ability to do its job (reach `api.github.com`, talk to the cluster, etc.). A failing check must be visible in standard inspection tools (`docker compose ps`, `systemctl status`).
+4. **Dead-man's-switch heartbeat** to healthchecks.io or equivalent, pinged only when the runner is healthy. The monitor pages on-call when beats are missed.
+5. **Pre-flight secret check** that asserts required secrets (App key, kubeconfig, age key) are non-empty in the injected env before launching the runner process. On failure, exit loudly with an actionable error message.

 Reference implementation: [`orbstack-kubernetes/docker/actions-runner/`](https://github.com/JacobPEvans/orbstack-kubernetes/tree/main/docker/actions-runner) (`docker-compose.yml`, `Makefile` `runner-*` targets, `docs/TESTING.md`).
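
Review note: requirements 4 and 5 are the easiest to get subtly wrong, so a minimal Python sketch of both follows. It is an illustration under stated assumptions, not the reference implementation's code. `APP_ID` and `APP_PRIVATE_KEY` come from the text above; `KUBECONFIG_DATA` and `AGE_KEY` are hypothetical variable names standing in for the kubeconfig and age key, and `probe` and `ping` are caller-supplied placeholders for the real health probe and the heartbeat request.

```python
import sys
from typing import Callable, Mapping, Sequence

# APP_ID and APP_PRIVATE_KEY are named in the requirements text.
# KUBECONFIG_DATA and AGE_KEY are hypothetical names chosen for this sketch.
REQUIRED_SECRETS = ("APP_ID", "APP_PRIVATE_KEY", "KUBECONFIG_DATA", "AGE_KEY")


def missing_secrets(
    env: Mapping[str, str],
    required: Sequence[str] = REQUIRED_SECRETS,
) -> list[str]:
    """Return the required names that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


def preflight(env: Mapping[str, str]) -> None:
    """Exit with an actionable message if any required secret is empty."""
    missing = missing_secrets(env)
    if missing:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit("preflight failed: empty or unset secrets: " + ", ".join(missing))


def heartbeat_if_healthy(probe: Callable[[], bool], ping: Callable[[], None]) -> bool:
    """Send the heartbeat only when the probe reports healthy.

    A probe that raises is treated as unhealthy, so the monitor sees
    silence rather than a misleading beat.
    """
    try:
        healthy = bool(probe())
    except Exception:
        healthy = False
    if healthy:
        ping()
    return healthy
```

An entrypoint would call `preflight(os.environ)` before starting the runner process, then call `heartbeat_if_healthy` on a timer. Taking `probe` and `ping` as parameters keeps the gating logic testable without network access.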