Security & resilience audit fixes (38 items)#2
Merged
Conversation
Stages the in-progress GitHub App login flow (installations listing, JWT signing, interactive/non-interactive wizards) as the starting point before applying the audit fix series.
Range copies the iteration value, so the previous code passed &g (the loop variable's address) to each goroutine. Go 1.22+ scopes loop vars per iteration, masking the bug — but the pattern is fragile. Index the slice directly so each goroutine captures a stable pointer to the underlying GroupConfig.
…er (II.17) The Send method deferred resp.Body.Close on the initial response, but reassigned resp on 429 retry — the deferred close then ran against the new body while the manual close still ran against the old. Factoring the retry into postWithRateLimitRetry leaves Send with a single resp and a single deferred close.
The Groups map was always written as empty and never refreshed at runtime, providing no signal to consumers of daemon.state.json. Remove it so the on-disk shape reflects what the daemon actually knows when state is written.
Discarding the kill error silently hid permission and ESRCH cases where ghr believed it had cleaned up but had not. Log the failure so operators see why an orphan persists.
The same RemoveAll + warn-log pattern was duplicated three times in cleanupStaleRunner. Factor it into a single helper so all stale-dir removals follow the same logging contract.
The POSIX kernels we target (macOS, Linux) keep PIDs in a signed 32-bit range. Tightening the type makes that constraint explicit and keeps JSON output stable across architectures. syscall.Kill still expects int, so the call site casts at the boundary.
Previously only empty labels were rejected. GitHub requires runner labels to start alphanumeric, stay within 64 chars, and use only [A-Za-z0-9._-]. Enforce the pattern at config load with table-driven regression coverage so misconfigurations fail fast instead of being reported by the scale set API at runtime.
…III.10) daemon.pid, daemon.state.json, and ghr.sock were spelled out as string literals across cli/daemon.go, cli/state.go, cli/status.go, cli/purge.go, and api/server.go. Adding or renaming any of them meant hunting for every occurrence and the cleanupStateFiles list could drift silently. Introduce internal/state.Paths so each consumer reads the path through the same helper and All() drives bulk cleanup. Also scope the runners/cache/state .gitignore entries to the repo root so internal/state/ is tracked.
Process.Cleanup wiped the workdir but left groups/<group>/runners/<name>/ behind, accumulating one empty directory per ephemeral runner. LogManager.RemoveRunnerLogs closes the tracked writer for that runner and removes the directory. The scaler now calls it after every successful Cleanup (killRunner, Shutdown, HandleJobCompleted) so the log tree shrinks with the runner set.
The random ID was embedded in Name (groupName-<id>) and never read on its own. Removing the redundant field keeps the runner identity single-sourced and trims the struct surface area.
The exec.Cmd handle is purely an implementation detail of the ProcessManager methods; nothing outside this package ever read or mutated it. Hide it so consumers cannot reach into the process state and so the struct surface stops advertising lifecycle details.
srv.Close cut in-flight requests, and the function could return while the Serve goroutine was still about to write to errCh — a use-after- free on the listener and a lost ErrServerClosed log. Replace Close with Shutdown(ctx) bounded by a 5s timeout and drain errCh before returning. Socket cleanup now lives in a defer so both branches (context cancel and Serve error) follow the same exit path.
http.DefaultClient never times out, so any stalled GitHub response
(slow-loris DNS, hung TLS handshake, ratelimited connection held open)
froze ghr login and auth status indefinitely. The runner binary
manager had the same issue with a bare &http.Client{}.
Introduce a 30s package-level client for internal/auth and a 10m
client (large enough for a 100MB tarball over a slow link) for the
runner BinaryManager.
validatePAT inlined the entire GitHub response body in its error, leaking rate-limit headers and other debug noise back to the user. Reuse the existing truncateBody/drainBody helpers (moved to http.go so both validation and installation flows share the same contract) and cap the excerpt at 500 chars.
Anything that logs a *Credentials would dump the raw PAT verbatim. Implement slog.LogValuer on Credentials and GitHubAppCreds so the default slog formatting routes through MaskedPAT and never leaks the token, while still surfacing method, URL, account, and key path for debugging.
text/template performs no XML escaping, so a label or path containing '<', '>', '&', or '"' would inject arbitrary keys into the generated plist (e.g. silently overriding RunAtLoad). Register an "xml" template function backed by encoding/xml.EscapeText and apply it to every operator-controlled field. Regression test asserts an injection payload becomes inert escaped text.
net.Listen on a unix path inherits the process umask, so the daemon status socket landed at 0755 — any local user could read snapshots, PIDs, and health issues. Chmod the socket immediately after Listen, tearing down the listener and the socket file on failure so callers do not end up with an over-permissive descriptor.
… (I.1) extractTarGz previously called sanitizeTarPath on the symlink target, and if that failed it silently fell back to the *original* attacker- controlled Linkname before calling os.Symlink. A forged tarball could plant a symlink such as runner -> /etc/shadow inside the cache, which copyDir would faithfully replicate and later writes would dereference. Validate Linkname with filepath.IsLocal (Go 1.20+) and reject any absolute or escaping path before creating the symlink. Regression tests cover the absolute, relative-escape, and benign local cases.
If an upgraded ghr is asked to recopy a cache that still holds a malicious symlink (or a hand-edited cache directory), copyDir would faithfully replicate the link into the workdir, exposing the runner to writes that escape the workdir tree. Validate every symlink with filepath.IsLocal during the walk and abort on absolute or escaping targets. Regression tests cover the absolute, escape, and benign relative cases.
…(I.2) downloadAndExtract trusted whatever the network handed back and then executed run.sh as the daemon. Fetch the matching .sha256 file co-published by actions/runner releases, hash the body in-stream with crypto/sha256, and reject the install on mismatch. EnsureBits already RemoveAlls the destination on error, so a tampered payload is wiped before any cmd.Start is attempted.
… (II.2) Two groups asking for the same runner version could race: the second goroutine saw run.sh appear mid-extraction, declared the cache ready, and started a runner against a half-extracted tree. A crashed download left the same incomplete cache hanging around forever. Guard each version with a sync.Mutex (keyed via sync.Map.LoadOrStore) and gate "cached" on a final .complete sentinel written only after a successful download+extract. On startup we now also detect and wipe incomplete directories instead of trusting partial state. Regression tests cover the cached, incomplete-cache, and download-cleanup paths.
pgrep -f m.workdirBase would happily match arbitrary user processes if the configured workdir lived in a noisy directory (e.g. /tmp) — every match was SIGKILLed unconditionally. Two layers of defense: 1. config validation rejects relative workdir_base, the obvious system roots (/, /tmp, /var, ...), and paths shorter than 8 chars so substring matches stay narrow. 2. KillOrphanRunners now confirms each pgrep hit with `ps -p <pid> -o command=` and only SIGKILLs when the command line still references workdir_base/.../run.sh. Config tests cover the new rejection cases.
LoadPrivateKey already refuses overly permissive RSA keys, but the credentials file (which holds the PAT and config URL) was read silently regardless of mode. A stray chmod 644 would never surface to the operator. loadFromFile now stats the file and emits a chmod-600 hint on stderr whenever group/world bits are set. Keeps the read going (no breakage on legacy installs) but makes the misconfiguration impossible to miss.
… (I.8) launchctl load/unload/start/stop are deprecated since macOS 10.10 and emit increasingly noisy warnings on modern releases. Switch the service lifecycle to the bootstrap/bootout/kickstart trio, choosing "system" or "gui/<uid>" as the domain target based on euid so both LaunchDaemons and LaunchAgents flow through the same helper.
The controller and the purge command both passed runnerGroupID=1 unconditionally, meaning any org with non-default runner groups silently misrouted scale set lookups. Surface runner_group_id in the YAML config (defaults to 1, the well-known "Default" group), validate it >= 1, and propagate the value to controller.New + purgeScaleSets.
…he lock (II.6) runChecks held m.mu for the duration of every reporter and notifier call, so a slow Discord webhook (~1s of throttle) or a hanging Uptime Kuma push would stall Status() and the next tick. Snapshot the issues, counts, reporters, and notifier under the lock, then dispatch in a goroutine after releasing it. The thread-safe noopNotifier helper now lets tests wait on dispatched events with a timeout.
…cy (II.8) II.8 noted that runChecks rebuilds m.issues under Lock while Status reads under RLock, so callers always see a coherent snapshot. Lock this invariant in place with a -race driver that interleaves 50 runChecks calls with 50 Status reads and fails if any issue field is torn.
A panic inside controller.Run, health.Run, api.Run, or the log cleaner killed the whole daemon. KeepAlive=true in the plist would restart us, but in-memory state was lost. Wrap each long-running goroutine in safeActor: deferred recover logs the panic with the actor name and stack via slog, then returns a sentinel error so oklog/run can tear the group down cleanly. Tests cover both the panic path and a normal error pass-through.
nextBackoff doubled deterministically, so all groups that failed at the same tick (e.g. during a GitHub outage) retried in lockstep and hammered the API together. Wrap the doubling in a math/rand-driven ±20% jitter that still respects backoffMax. Tests cover the bounds, the cap behaviour, and confirm successive calls vary.
After SIGKILL the previous code did `return <-done`, which would block forever if Wait never returned (zombie, reparented to launchd, etc.). Wrap that final Wait in a 5s timeout so Stop reliably surfaces stuck processes instead of stalling the scaler shutdown loop.
….10) postWithRateLimitRetry only knew about 429; a transient 502/503 caused the notification to drop on the first attempt. Add a single retry path with a 2s default backoff for any 5xx response (test helper shortens it to 10ms). Honors context cancellation while waiting so a daemon shutdown does not have to wait through the backoff.
Bumping the runner version left every previous cache directory on disk forever, which on SSD-constrained hosts ate gigabytes per minor release. After a successful EnsureBits, scan the cache directory for .complete-marked siblings, sort by modTime, and remove anything beyond the last cacheKeepVersions (=3). Incomplete caches (no marker) are left untouched and surface via the existing II.2 cleanup.
During a GitHub outage, every login/auth-status/listener-reconnect call kept hammering the API on every tick, burning rate-limit and serving no purpose. Introduce a small package-level circuit breaker around httpClient.Do that: - counts consecutive network errors and 5xx responses, - opens after 5 consecutive failures, - stays open for 60s, then half-opens to probe with a single request. Validation success / non-5xx response closes the breaker. Tests cover the threshold, success reset, half-open behaviour, and the breakable classification.
…V.7) If the controller or health monitor deadlocks, launchd still sees the process alive and never restarts us. Add a watchdog actor that probes the daemon's own /health endpoint over the unix socket every 30s. After 3 consecutive failures it logs critical and exits with code 2, which lets launchd's KeepAlive policy bring us back fresh. Tests cover the probeOK status-code classification (2xx/3xx/4xx healthy, 5xx unhealthy) and confirm connection errors are reported.
…s (VII.6)
Runner output landed in date-rotated files as raw bytes, so operators
could never correlate a job log line with the daemon event that
provisioned the runner. Wrap the writer returned by RunnerOutputFile
in a tagged JSON emitter: each newline-terminated line becomes one
{time, source, group, runner, line} object. Partial lines are
buffered until the next \n; Close flushes any trailing fragment.
The previous tree omitted controller/, health/, notification/, monitoring/, api/, launchd/, and state/, and labelled internal/github as a SDK adapter that does not exist by that name. Update the layout block to reflect every internal package shipped today and annotate each line with what it owns.
- api/server.go: derive the shutdown context from WithoutCancel(ctx) so contextcheck stops flagging the previous context.Background() switch while preserving the "ctx is already done" semantics. - cli/watchdog_test.go: close the response body when present in the connection-error path (bodyclose). - logging/tagged_writer.go: drop the unused source parameter; the field is always "runner" today.
Align the const block so fmt-check passes; no behaviour change.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Series of focused fixes from the recent
audit.md, covering exactly the items listed in the brief — no scope creep, every item committed independently so the history reads as an audit log.pgrepworkdir guards, I.4 unix socket chmod 0600, I.5 bounded HTTP clients, I.6 credentials-file perms warning, I.7 XML-escape the launchd plist, I.8 migrate tolaunchctl bootstrap/bootout/kickstart, I.9copyDirrejects non-local symlinks, I.10 graceful API shutdown, I.13Credentials.LogValuemasks the PAT..completemarker, II.3 truncate PAT validation bodies, II.4 stable group pointer in the controller fan-out, II.5runner_group_idconfig-driven, II.6 reporter/notifier dispatch outside the health mutex, II.7 API server waits for Serve to return, II.8 race-tested Status/runChecks invariant, II.11 unexportProcess.cmd, II.13 dropRunnerInstance.ID, II.14 drop unuseddaemonState.Groups, II.15 log SIGKILL failures, II.16 extractremoveStaleDirhelper, II.17 fix Discord double-close.internal/state.Pathscentralizes the state filenames, III.14 narrow PID toint32.safeActorwrapper, IV.2 small circuit breaker around GitHub HTTP, IV.5 cache GC keeping the 3 most-recent runner versions, IV.7 liveness watchdog that exits to let launchd respawn, IV.8 ±20%% jitter on the backoff, IV.9 post-SIGKILL Wait timeout, IV.10 single retry on Discord 5xx.audit.mdhas been progressively trimmed of every addressed item — what remains is the still-open work.Stats
go build ./... && go vet ./... && go test -race -count=1 ./....backoffMinnaming, an unused//nolint:unconvertdirective).Test plan
make ci— build + vet + fmt-check +go test -race+govulncheck.groups/<grp>/runners/<r>/<date>.jsonas tagged JSON envelopes.