feat: complete v2 rewrite with scaleset SDK #1

Merged: RedBoardDev merged 53 commits into main from feat-complete-refactor-v2 on May 14, 2026

@RedBoardDev
Owner

Summary

Full rewrite of ghr — a self-hosted GitHub Actions runner controller for macOS — built on the official actions/scaleset Go SDK.

  • Architecture: package-by-feature layout under internal/ with clean dependency injection via oklog/run.Group
  • Core engine: scale set orchestration with long-polling, JIT runner configs, ephemeral process management, and idle timeout with min_runners support
  • CLI: Cobra-based commands — start, stop, restart, run, status, purge, login (interactive wizard), logout
  • Observability: structured slog logging with file rotation, health monitor (runner + disk checks), Uptime Kuma push reporter
  • Notifications: event-driven alerts via Discord webhooks and generic webhooks with severity filtering
  • IPC: Unix socket JSON API for ghr status queries
  • macOS integration: launchd service management (install/uninstall/start/stop)
  • Config: YAML-based configuration with env overlay, validation, and byte-size parsing
  • Auth: credential storage with PAT/GitHub App support and resolution order
  • Tests: comprehensive unit tests across config, auth, logging, controller, runner, notification, health, API, and launchd packages
  • E2E: simple and complete test suites with a 46-job workflow across 10 phases and an automated validation script

Replaces the previous minimal reconciler-based implementation entirely (old-version/ removed).

Test plan

  • go build ./cmd/ghr compiles successfully
  • go test ./... — all unit tests pass
  • go test -race ./... — no data races
  • go vet ./... — no static analysis issues
  • Config loading works with config.example.yaml and env.example
  • CLI commands render help correctly (ghr --help, ghr start --help, etc.)

- Introduced a new skill for project management in ghr v2, detailing the role, workflow, and delegation to specialized agents.
- Included guidelines for discussing features, creating specs, and implementing tasks.

feat: create scaleset-sdk skill for GitHub Actions runner autoscalers

- Added a new skill for building custom GitHub Actions runner autoscalers using the scaleset Go SDK.
- Provided a quick start pattern and detailed the Scaler interface for implementation.

docs: add complete API reference for actions/scaleset

- Created a comprehensive API reference document for the actions/scaleset package, covering constants, core types, client methods, and error handling.

docs: include macOS adaptation guide for scaleset-sdk

- Added a guide for adapting the Docker example to macOS process-based runners, detailing necessary changes in implementation and configuration.

chore: add golangci-lint configuration

- Introduced a configuration file for golangci-lint to enforce coding standards and best practices across the codebase.

docs: create CLAUDE.md for ghr project overview

- Added a project overview document for ghr, outlining architecture, code conventions, commit conventions, key dependencies, and specs.
…authentication specs

- Introduced structured logging specification (04-logging.md) detailing log structure, format, and implementation.
- Added notification service specification (05-notifications.md) with architecture, Discord provider implementation, and event types.
- Created uptime monitoring integration specification (06-uptime-kuma.md) outlining push-based health monitoring and status logic.
- Defined configuration schema and validation rules (07-config.md) for the application, including environment variable handling.
- Established authentication mechanism (08-auth.md) via interactive CLI, supporting Personal Access Tokens and GitHub Apps, with a dedicated credentials file.

Module github.com/RedBoardDev/gh-runners-tool/v2 with Go 1.25.3.
Dependencies: actions/scaleset, cobra, oklog/run, godotenv, yaml.v3.

Group, RunnerInstance, RunnerSnapshot, Event, EventLevel,
GroupHealthStatus, HealthIssue. Pure structs, no logic.

Full config struct hierarchy matching spec 07. Duration and ByteSize
custom types. Root vs non-root default paths. Multi-error validation.
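As a rough illustration of the byte-size parsing and multi-error validation described above, here is a minimal sketch. The names `ByteSize`, `ParseByteSize`, and `GroupConfig`, the binary unit multiples, and the specific validation rules are assumptions for illustration, not the actual config package.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// ByteSize is a hypothetical stand-in for the config package's byte-size type.
type ByteSize uint64

// ParseByteSize parses values such as "512MB" or "10GB".
// Binary multiples (1 KB = 1024 B) are an assumption of this sketch.
func ParseByteSize(s string) (ByteSize, error) {
	units := []struct {
		suffix string
		mult   uint64
	}{
		{"TB", 1 << 40}, {"GB", 1 << 30}, {"MB", 1 << 20}, {"KB", 1 << 10}, {"B", 1},
	}
	in := strings.ToUpper(strings.TrimSpace(s))
	for _, u := range units {
		if !strings.HasSuffix(in, u.suffix) {
			continue
		}
		num := strings.TrimSpace(strings.TrimSuffix(in, u.suffix))
		n, err := strconv.ParseUint(num, 10, 64)
		if err != nil {
			return 0, fmt.Errorf("invalid byte size %q: %w", s, err)
		}
		return ByteSize(n * u.mult), nil
	}
	return 0, fmt.Errorf("invalid byte size %q: missing unit", s)
}

// GroupConfig is a hypothetical slice of the real config hierarchy.
type GroupConfig struct {
	Name       string
	MinRunners int
	MaxRunners int
}

// Validate collects every problem instead of stopping at the first one.
func (g GroupConfig) Validate() error {
	var errs []error
	if g.Name == "" {
		errs = append(errs, errors.New("name is required"))
	}
	if g.MinRunners < 0 {
		errs = append(errs, errors.New("min_runners must be >= 0"))
	}
	if g.MaxRunners < g.MinRunners {
		errs = append(errs, errors.New("max_runners must be >= min_runners"))
	}
	return errors.Join(errs...) // nil when errs is empty
}

func main() {
	size, err := ParseByteSize("10GB")
	fmt.Println(size, err)
	fmt.Println(GroupConfig{MinRunners: 3, MaxRunners: 1}.Validate())
}
```

`errors.Join` keeps each individual error reachable through `errors.Is` and `errors.As`, which is one reason to prefer it over concatenating messages by hand.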
Credentials file with 0600 perms. Load with priority resolution:
flag, env, file. PAT and GitHub App validation. Masked PAT display.
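The flag, env, file resolution order and masked display could look roughly like the sketch below. `ResolveToken` and `MaskToken` are hypothetical names, and showing only the last four characters is an assumption, not necessarily what the auth package does.

```go
package main

import (
	"fmt"
	"strings"
)

// ResolveToken returns the first non-empty token in priority order
// (flag, then env, then file) along with the source it came from.
func ResolveToken(flagVal, envVal, fileVal string) (token, source string) {
	candidates := []struct{ value, source string }{
		{flagVal, "flag"},
		{envVal, "env"},
		{fileVal, "file"},
	}
	for _, c := range candidates {
		if v := strings.TrimSpace(c.value); v != "" {
			return v, c.source
		}
	}
	return "", ""
}

// MaskToken hides everything except the last four characters.
// Tokens of four characters or fewer are masked entirely.
func MaskToken(tok string) string {
	const keep = 4
	if len(tok) <= keep {
		return strings.Repeat("*", len(tok))
	}
	return strings.Repeat("*", len(tok)-keep) + tok[len(tok)-keep:]
}

func main() {
	tok, src := ResolveToken("", "ghp_abcd1234", "ghp_from_file")
	fmt.Println(src, MaskToken(tok))
}
```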
slog-based MultiHandler for fan-out to multiple destinations.
Date-aware file rotation. Hierarchical loggers: daemon, group, runner.
Log retention cleanup.
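A fan-out handler on top of `log/slog` can be sketched as follows. This is an assumed shape for the MultiHandler mentioned above, not the project's code; the buffers stand in for the console and a rotated log file, and `logToBoth` is a demo helper.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"log/slog"
)

// MultiHandler forwards each record to every wrapped handler.
type MultiHandler struct{ handlers []slog.Handler }

func NewMultiHandler(hs ...slog.Handler) *MultiHandler {
	return &MultiHandler{handlers: hs}
}

// Enabled reports true if at least one destination wants this level.
func (m *MultiHandler) Enabled(ctx context.Context, l slog.Level) bool {
	for _, h := range m.handlers {
		if h.Enabled(ctx, l) {
			return true
		}
	}
	return false
}

// Handle delivers the record to each enabled destination and joins any errors.
func (m *MultiHandler) Handle(ctx context.Context, r slog.Record) error {
	var errs []error
	for _, h := range m.handlers {
		if h.Enabled(ctx, r.Level) {
			errs = append(errs, h.Handle(ctx, r.Clone()))
		}
	}
	return errors.Join(errs...)
}

func (m *MultiHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	next := make([]slog.Handler, len(m.handlers))
	for i, h := range m.handlers {
		next[i] = h.WithAttrs(attrs)
	}
	return &MultiHandler{handlers: next}
}

func (m *MultiHandler) WithGroup(name string) slog.Handler {
	next := make([]slog.Handler, len(m.handlers))
	for i, h := range m.handlers {
		next[i] = h.WithGroup(name)
	}
	return &MultiHandler{handlers: next}
}

// logToBoth writes one record through a text and a JSON destination.
func logToBoth(msg string) (text, jsonLine string) {
	var textBuf, jsonBuf bytes.Buffer
	logger := slog.New(NewMultiHandler(
		slog.NewTextHandler(&textBuf, nil),
		slog.NewJSONHandler(&jsonBuf, nil),
	))
	// With() yields a child logger, similar in spirit to the group/runner hierarchy.
	logger.With("group", "ci").Info(msg, "runner", "ci-1")
	return textBuf.String(), jsonBuf.String()
}

func main() {
	text, jsonLine := logToBoth("runner started")
	fmt.Print(text, jsonLine)
}
```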
Root command with global flags (config, token, log-level).
Commands: start, stop, restart, run, status, purge, login, logout, auth.
Run command wires config, auth, and logging packages.

110 table-driven tests with race detector. Config validation (50),
auth load/save/validate (35), logging hierarchy and rotation (25).

BinaryManager: download, cache, version resolution for runner bits.
ProcessManager: prepare workdir, start/stop/kill process, stale cleanup.
Tar path traversal protection. Recursive dir copy.
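Tar path traversal protection usually comes down to checking that each entry resolves inside the extraction directory. The sketch below shows one common way to do that; `SafeJoin` is a hypothetical helper name, and the actual check in the runner package may differ.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// SafeJoin resolves a tar entry name under destDir and rejects any name
// that would end up outside of it (for example "../../etc/passwd").
func SafeJoin(destDir, entryName string) (string, error) {
	target := filepath.Join(destDir, entryName) // Join also cleans the path
	rel, err := filepath.Rel(destDir, target)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("tar entry %q escapes %q", entryName, destDir)
	}
	return target, nil
}

func main() {
	fmt.Println(SafeJoin("/tmp/runner", "bin/Runner.Listener"))
	fmt.Println(SafeJoin("/tmp/runner", "../../etc/passwd"))
}
```

Symlink entries need separate handling (validating the link target as well), which this sketch does not cover.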
Thin wrapper around actions/scaleset. PAT and GitHub App auth.
Scale set create/get/delete, JIT config, session, listener creation.

…ebhook

Provider interface with fan-out service. Event filtering by type,
wildcard, and level. Discord embeds with color mapping and mentions.
Generic HTTP webhook provider.
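Filtering by type, wildcard, and level might be implemented along these lines. The event type strings (`runner.failed` and so on), the `Filter` struct, and the trailing-`*` wildcard syntax are all assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// Level orders event severities from least to most severe.
type Level int

const (
	LevelInfo Level = iota
	LevelWarning
	LevelError
)

// Event is a reduced, hypothetical version of the domain event.
type Event struct {
	Type  string
	Level Level
}

// Filter describes what a single provider wants to receive.
type Filter struct {
	Types    []string // exact types, "prefix.*" wildcards, or "*"
	MinLevel Level
}

// Match reports whether the event passes both the level and the type filter.
func (f Filter) Match(e Event) bool {
	if e.Level < f.MinLevel {
		return false
	}
	for _, pattern := range f.Types {
		if pattern == "*" || pattern == e.Type {
			return true
		}
		if prefix, ok := strings.CutSuffix(pattern, "*"); ok && strings.HasPrefix(e.Type, prefix) {
			return true
		}
	}
	return false
}

func main() {
	f := Filter{Types: []string{"runner.*"}, MinLevel: LevelWarning}
	fmt.Println(f.Match(Event{Type: "runner.failed", Level: LevelError}))
	fmt.Println(f.Match(Event{Type: "runner.started", Level: LevelInfo}))
}
```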
GroupController manages per-group goroutines with retry and backoff.
MacOSScaler implements listener.Scaler for runner lifecycle: scale up
on demand, track idle/busy, stop and cleanup on job completion.

Periodic liveness checks (kill -0), runner timeout detection,
disk space monitoring. Consumer-side interfaces for state and reporting.
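The `kill -0` liveness check has a direct equivalent in Go on Unix-like systems: sending signal 0 performs the existence and permission checks without delivering anything. A minimal sketch, with `IsAlive` as a hypothetical name:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// IsAlive reports whether a process with the given PID exists,
// mirroring `kill -0 <pid>`. Unix-only.
func IsAlive(pid int) bool {
	if pid <= 0 {
		return false // 0 and negative PIDs address process groups, not one process
	}
	err := syscall.Kill(pid, 0)
	// EPERM means the process exists but is owned by another user.
	return err == nil || errors.Is(err, syscall.EPERM)
}

func main() {
	fmt.Println(IsAlive(os.Getpid()))
}
```

A PID can be reused by an unrelated process after the runner exits, so a check like this is best paired with other evidence such as the recorded start time or workdir.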
Push-based health heartbeats for daemon and per-group monitoring.
Implements reporter interface from health package.

HTTP server on {state_dir}/ghr.sock. GET /status returns runner
snapshots and health. GET /health returns health status only.
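Serving HTTP over a Unix socket, and querying it from the CLI side, can be sketched with the standard library alone. The payload shape and the helper names below are hypothetical; only the socket file name and the `/status` path come from the description above.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
	"path/filepath"
)

// statusPayload is a hypothetical response shape, not the real API schema.
type statusPayload struct {
	Healthy bool     `json:"healthy"`
	Runners []string `json:"runners"`
}

func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(statusPayload{Healthy: true, Runners: []string{"ci-1"}})
	})
	return mux
}

// queryStatus performs GET /status over the Unix socket at sockPath.
func queryStatus(sockPath string) (string, error) {
	client := &http.Client{Transport: &http.Transport{
		// Ignore the URL's host and always dial the socket instead.
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, "unix", sockPath)
		},
	}}
	resp, err := client.Get("http://ghr/status")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// roundTrip serves on a temporary socket and queries it once.
func roundTrip() (string, error) {
	dir, err := os.MkdirTemp("", "ghr")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	sock := filepath.Join(dir, "ghr.sock")
	ln, err := net.Listen("unix", sock)
	if err != nil {
		return "", err
	}
	srv := &http.Server{Handler: newMux()}
	go srv.Serve(ln) // returns http.ErrServerClosed after Close
	defer srv.Close()
	return queryStatus(sock)
}

func main() {
	body, err := roundTrip()
	fmt.Print(body)
	fmt.Println(err)
}
```

If the dial fails because the socket is missing or stale, the caller can treat that as "daemon offline", which matches the offline fallback mentioned for the status command.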
Plist generation, launchctl load/unload/start/stop wrappers.
Root (LaunchDaemon) vs user (LaunchAgent) path resolution.
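Plist generation and the root vs user path split could be sketched as below. The label `com.example.ghr`, the `Service` struct, and the chosen keys are illustrative assumptions; `/Library/LaunchDaemons` and `~/Library/LaunchAgents` are the standard launchd locations.

```go
package main

import (
	"bytes"
	"fmt"
	"path/filepath"
	"text/template"
)

// plistTmpl is a minimal launchd job definition. Values are inserted as-is,
// so they must not contain XML special characters in this sketch.
const plistTmpl = `<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>{{.Label}}</string>
  <key>ProgramArguments</key>
  <array>{{range .Args}}
    <string>{{.}}</string>{{end}}
  </array>
  <key>RunAtLoad</key><true/>
  <key>KeepAlive</key><true/>
</dict>
</plist>
`

// Service holds the values substituted into the template (hypothetical type).
type Service struct {
	Label string
	Args  []string
}

// PlistPath picks the system-wide LaunchDaemons directory for root and the
// per-user LaunchAgents directory otherwise.
func PlistPath(label string, isRoot bool, home string) string {
	if isRoot {
		return filepath.Join("/Library/LaunchDaemons", label+".plist")
	}
	return filepath.Join(home, "Library", "LaunchAgents", label+".plist")
}

// RenderPlist executes the template into a string.
func RenderPlist(s Service) (string, error) {
	t, err := template.New("plist").Parse(plistTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, s); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	svc := Service{Label: "com.example.ghr", Args: []string{"/usr/local/bin/ghr", "run"}}
	out, err := RenderPlist(svc)
	fmt.Println(PlistPath(svc.Label, false, "/Users/me"))
	fmt.Print(out)
	fmt.Println(err)
}
```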
Build all components via DI in daemon.go. Run 4 actors: controller,
health monitor, API server, signal handler. Graceful shutdown with
timeout context.

Start/stop via launchd. Status queries Unix socket API with offline
fallback. Purge stops daemon, deletes scale sets, cleans workdirs.

Critical: StartedAt populated, session.Close on shutdown, CleanupStale
at startup, daemon.state.json written, min_disk_space validated, Uptime
Kuma tokens resolved from env vars.

Health: group-level checks (divergence, consecutive failures), idle
timeout, corrective actions (kill zombie/stuck runners), RunnerKiller
interface.

Polish: status tables and --watch mode, purge waits for busy runners,
Discord rate limiting and footer/avatar, daily log cleanup, job duration
logging, scale set label mismatch warning, event type constants, dead
code removed.

Extract CleanupStale and helpers to stay under 200 LOC per file.

Prompts for auth method (PAT or GitHub App), collects credentials,
validates against GitHub API, displays username and scopes, saves
to credentials file. Falls back to flag-based mode when --method set.

Full YAML schema with all fields and sensible defaults.
Environment variables reference for secrets and overrides.

Runner: binary caching, download extraction, prepare/cleanup, stale
cleanup. API: status/health handlers with mocks. Controller: scaler
snapshots, desired count capping, job started/completed events.

…rocessManager

Consumer-side interface in controller/ for testability. Accepts
Prepare, Start, Stop, Cleanup. *runner.ProcessManager satisfies it.

Session close after long-poll timeout is expected behavior. The session
expires server-side during the ~50s poll, making the close return a
400. Not actionable, not user-facing.

Runners self-terminate after completing a job. Stop() now returns nil
for expected exits (ExitError, process already done) instead of
surfacing signal: terminated as a warning.
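One way to treat expected exits as success is to classify the error returned by `Wait`. The sketch below assumes a SIGTERM-based stop on a Unix system; `normalizeExit`, `Stop`, and `startAndStop` are hypothetical names, not the ProcessManager API.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// normalizeExit maps expected termination outcomes to nil:
// a non-zero or signal exit (*exec.ExitError) and an already finished process.
func normalizeExit(err error) error {
	var exitErr *exec.ExitError
	if err == nil || errors.As(err, &exitErr) || errors.Is(err, os.ErrProcessDone) {
		return nil
	}
	return err
}

// Stop asks the process to terminate and waits for it to exit.
func Stop(cmd *exec.Cmd) error {
	err := cmd.Process.Signal(syscall.SIGTERM)
	if err != nil && !errors.Is(err, os.ErrProcessDone) {
		return err // a real failure, such as a permission problem
	}
	// After SIGTERM, Wait typically returns an ExitError ("signal: terminated").
	return normalizeExit(cmd.Wait())
}

// startAndStop launches a command and stops it right away (demo helper).
func startAndStop(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return err
	}
	return Stop(cmd)
}

func main() {
	fmt.Println(startAndStop("sleep", "30"))
}
```

Errors that are not exit-related, such as a binary that cannot be started, still surface to the caller.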
Simple: 1 group, 1 runner, 1 job (smoke test).
Complete: 4 groups, 20 jobs testing scale-up/down, concurrency,
queuing, failure handling, min/max runners, Discord notifications,
Uptime Kuma monitoring, health checks, idle timeout, shutdown.

Parses daemon and group logs to verify: scale set creation, runner
provisioning, concurrency, min_runners pre-provisioning, job counts,
failure detection, duration stats, cleanup state, log structure.
Outputs pass/fail/warn report with exit code.

Remove set -e to handle grep exit code 1 (no match) gracefully.
Use wc -l instead of grep -c for reliable counting.

Phases: startup validation, concurrent burst, queuing pressure, real
workloads (checkout, build, CPU, disk I/O, network, matrix), error
handling (exit codes, bad commands, timeouts, recovery), deploy
pipeline (5 stages), second wave, rapid fire (6 instant), long
running (2 min), cross-group outputs.

Logs enabled, base_url_set, daemon_token_set, and group_tokens count
at startup to diagnose .env loading issues.

ps aux + grep pattern matching was unreliable on macOS. pgrep -f
matches the pattern against the full command line instead.

Idle timeout now skips runners within the min_runners threshold to
prevent kill/reprovision loops. Runners beyond min_runners are killed
on detection instead of just notified.
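The min_runners rule can be pictured as a selection function: only runners in excess of the minimum are eligible for an idle kill. This sketch assumes the minimum is counted against all runners in the group and that the longest-idle runners go first; the `Runner` type and `IdleVictims` are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Runner is a reduced, hypothetical view of a runner's state.
type Runner struct {
	Name    string
	Busy    bool
	IdleFor time.Duration
}

// IdleVictims returns the names of runners to kill for exceeding the idle
// timeout, while always leaving at least minRunners runners in place.
func IdleVictims(runners []Runner, minRunners int, timeout time.Duration) []string {
	spare := len(runners) - minRunners
	if spare <= 0 {
		return nil // everything is within the min_runners threshold
	}
	var idle []Runner
	for _, r := range runners {
		if !r.Busy && r.IdleFor >= timeout {
			idle = append(idle, r)
		}
	}
	// Longest-idle runners are removed first.
	sort.Slice(idle, func(i, j int) bool { return idle[i].IdleFor > idle[j].IdleFor })
	if len(idle) > spare {
		idle = idle[:spare]
	}
	names := make([]string, 0, len(idle))
	for _, r := range idle {
		names = append(names, r.Name)
	}
	return names
}

func main() {
	runners := []Runner{
		{Name: "ci-1", IdleFor: 10 * time.Minute},
		{Name: "ci-2", IdleFor: 15 * time.Minute},
		{Name: "ci-3", Busy: true},
	}
	fmt.Println(IdleVictims(runners, 2, 5*time.Minute))
}
```

Without the `spare` cap, killing an idle runner inside the minimum would immediately trigger re-provisioning, which is the loop the change above avoids.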

KillOrphanRunners uses pgrep to find processes whose workdir matches
our workdir_base, catching orphans left by interrupted shutdowns.

RunnerSnapshot and HealthIssue now have json tags for proper API
serialization. Config loader also loads .env from the config file
directory, fixing launchd mode where cwd differs from config location.

statusHealth.Issues was int but API returns []HealthIssue array.
Removed phantom Service field from statusResponse. Health section
now shows individual issues.
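The type mismatch described above is easy to reproduce: decoding a JSON array into an `int` field fails, while a slice field with matching json tags succeeds. The field names and payload below are assumptions for illustration, not the real API schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HealthIssue carries json tags so the API and the CLI agree on key names.
type HealthIssue struct {
	Severity string `json:"severity"`
	Message  string `json:"message"`
}

// oldStatusHealth models the previous CLI-side type, where Issues was a count.
type oldStatusHealth struct {
	Issues int `json:"issues"`
}

// statusHealth models the corrected type, matching what the API returns.
type statusHealth struct {
	Issues []HealthIssue `json:"issues"`
}

// payload is a hypothetical example of the health section of a status response.
const payload = `{"issues":[{"severity":"warning","message":"disk space low"}]}`

// decodeOld fails because a JSON array cannot be unmarshaled into an int.
func decodeOld() error {
	var h oldStatusHealth
	return json.Unmarshal([]byte(payload), &h)
}

// decodeNew returns the individual issues.
func decodeNew() ([]HealthIssue, error) {
	var h statusHealth
	err := json.Unmarshal([]byte(payload), &h)
	return h.Issues, err
}

func main() {
	fmt.Println(decodeOld())
	issues, err := decodeNew()
	fmt.Println(issues, err)
}
```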
RedBoardDev merged commit b53ae28 into main on May 14, 2026 (8 checks passed)
RedBoardDev deleted the feat-complete-refactor-v2 branch on May 14, 2026 13:49