feat: complete v2 rewrite with scaleset SDK #1
Merged
- Introduced a new skill for project management in ghr v2, detailing the role, workflow, and delegation to specialized agents.
- Included guidelines for discussing features, creating specs, and implementing tasks.
feat: create scaleset-sdk skill for GitHub Actions runner autoscalers
- Added a new skill for building custom GitHub Actions runner autoscalers using the scaleset Go SDK.
- Provided a quick start pattern and detailed the Scaler interface for implementation.
docs: add complete API reference for actions/scaleset
- Created a comprehensive API reference document for the actions/scaleset package, covering constants, core types, client methods, and error handling.
docs: include macOS adaptation guide for scaleset-sdk
- Added a guide for adapting the Docker example to macOS process-based runners, detailing necessary changes in implementation and configuration.
chore: add golangci-lint configuration
- Introduced a configuration file for golangci-lint to enforce coding standards and best practices across the codebase.
docs: create CLAUDE.md for ghr project overview
- Added a project overview document for ghr, outlining architecture, code conventions, commit conventions, key dependencies, and specs.
…authentication specs
- Introduced structured logging specification (04-logging.md) detailing log structure, format, and implementation.
- Added notification service specification (05-notifications.md) with architecture, Discord provider implementation, and event types.
- Created uptime monitoring integration specification (06-uptime-kuma.md) outlining push-based health monitoring and status logic.
- Defined configuration schema and validation rules (07-config.md) for the application, including environment variable handling.
- Established authentication mechanism (08-auth.md) via interactive CLI, supporting Personal Access Tokens and GitHub Apps, with a dedicated credentials file.
Module github.com/RedBoardDev/gh-runners-tool/v2 with Go 1.25.3. Dependencies: actions/scaleset, cobra, oklog/run, godotenv, yaml.v3.
Group, RunnerInstance, RunnerSnapshot, Event, EventLevel, GroupHealthStatus, HealthIssue. Pure structs, no logic.
Full config struct hierarchy matching spec 07. Duration and ByteSize custom types. Root vs non-root default paths. Multi-error validation.
Credentials file with 0600 perms. Load with priority resolution: flag, env, file. PAT and GitHub App validation. Masked PAT display.
slog-based MultiHandler for fan-out to multiple destinations. Date-aware file rotation. Hierarchical loggers: daemon, group, runner. Log retention cleanup.
Root command with global flags (config, token, log-level). Commands: start, stop, restart, run, status, purge, login, logout, auth. Run command wires config, auth, and logging packages.
110 table-driven tests with race detector. Config validation (50), auth load/save/validate (35), logging hierarchy and rotation (25).
BinaryManager: download, cache, version resolution for runner bits. ProcessManager: prepare workdir, start/stop/kill process, stale cleanup. Tar path traversal protection. Recursive dir copy.
Thin wrapper around actions/scaleset. PAT and GitHub App auth. Scale set create/get/delete, JIT config, session, listener creation.
…ebhook Provider interface with fan-out service. Event filtering by type, wildcard, and level. Discord embeds with color mapping and mentions. Generic HTTP webhook provider.
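Filtering by type, wildcard, and level might be sketched like this. The level names, the event type strings, and the use of `path.Match` glob syntax for wildcards are all assumptions; the PR does not show its pattern syntax.

```go
package main

import "path"

// Level is a hypothetical severity scale for events.
type Level int

const (
	LevelInfo Level = iota
	LevelWarning
	LevelError
)

// Event carries a dotted type such as "runner.started" (assumed naming).
type Event struct {
	Type  string
	Level Level
}

// Filter accepts events at or above MinLevel whose type matches any pattern.
// Patterns use path.Match globs, so "*" matches everything and
// "runner.*" matches every runner event.
type Filter struct {
	Patterns []string
	MinLevel Level
}

func (f Filter) Allows(e Event) bool {
	if e.Level < f.MinLevel {
		return false
	}
	for _, p := range f.Patterns {
		if ok, err := path.Match(p, e.Type); err == nil && ok {
			return true
		}
	}
	return false
}
```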
GroupController manages per-group goroutines with retry and backoff. MacOSScaler implements listener.Scaler for runner lifecycle: scale up on demand, track idle/busy, stop and cleanup on job completion.
Periodic liveness checks (kill -0), runner timeout detection, disk space monitoring. Consumer-side interfaces for state and reporting.
Push-based health heartbeats for daemon and per-group monitoring. Implements reporter interface from health package.
HTTP server on {state_dir}/ghr.sock. GET /status returns runner
snapshots and health. GET /health returns health status only.
Plist generation, launchctl load/unload/start/stop wrappers. Root (LaunchDaemon) vs user (LaunchAgent) path resolution.
Build all components via DI in daemon.go. Run 4 actors: controller, health monitor, API server, signal handler. Graceful shutdown with timeout context.
Start/stop via launchd. Status queries Unix socket API with offline fallback. Purge stops daemon, deletes scale sets, cleans workdirs.
Critical: StartedAt populated, session.Close on shutdown, CleanupStale at startup, daemon.state.json written, min_disk_space validated, Uptime Kuma tokens resolved from env vars. Health: group-level checks (divergence, consecutive failures), idle timeout, corrective actions (kill zombie/stuck runners), RunnerKiller interface. Polish: status tables and --watch mode, purge waits for busy runners, Discord rate limiting and footer/avatar, daily log cleanup, job duration logging, scale set label mismatch warning, event type constants, dead code removed.
Extract CleanupStale and helpers to stay under 200 LOC per file.
Prompts for auth method (PAT or GitHub App), collects credentials, validates against GitHub API, displays username and scopes, saves to credentials file. Falls back to flag-based mode when --method set.
Full YAML schema with all fields and sensible defaults. Environment variables reference for secrets and overrides.
Runner: binary caching, download extraction, prepare/cleanup, stale cleanup. API: status/health handlers with mocks. Controller: scaler snapshots, desired count capping, job started/completed events.
…rocessManager Consumer-side interface in controller/ for testability. Accepts Prepare, Start, Stop, Cleanup. *runner.ProcessManager satisfies it.
Session close after long-poll timeout is expected behavior. The session expires server-side during the ~50s poll, making the close return a 400. Not actionable, not user-facing.
Runners self-terminate after completing a job. Stop() now returns nil for expected exits (ExitError, process already done) instead of surfacing signal: terminated as a warning.
Simple: 1 group, 1 runner, 1 job — smoke test. Complete: 4 groups, 20 jobs testing scale-up/down, concurrency, queuing, failure handling, min/max runners, Discord notifications, Uptime Kuma monitoring, health checks, idle timeout, shutdown.
Parses daemon and group logs to verify: scale set creation, runner provisioning, concurrency, min_runners pre-provisioning, job counts, failure detection, duration stats, cleanup state, log structure. Outputs pass/fail/warn report with exit code.
Remove set -e to handle grep exit code 1 (no match) gracefully. Use wc -l instead of grep -c for reliable counting.
Phases: startup validation, concurrent burst, queuing pressure, real workloads (checkout, build, CPU, disk I/O, network, matrix), error handling (exit codes, bad commands, timeouts, recovery), deploy pipeline (5 stages), second wave, rapid fire (6 instant), long running (2 min), cross-group outputs.
Logs enabled, base_url_set, daemon_token_set, and group_tokens count at startup to diagnose .env loading issues.
ps aux + grep pattern matching was unreliable on macOS. pgrep -f provides exact process matching.
Idle timeout now skips runners within the min_runners threshold to prevent kill/reprovision loops. Runners beyond min_runners are killed on detection instead of just notified. KillOrphanRunners uses pgrep to find processes whose workdir matches our workdir_base, catching orphans left by interrupted shutdowns.
RunnerSnapshot and HealthIssue now have json tags for proper API serialization. Config loader also loads .env from the config file directory, fixing launchd mode where cwd differs from config location.
statusHealth.Issues was int but API returns []HealthIssue array. Removed phantom Service field from statusResponse. Health section now shows individual issues.
Summary
Full rewrite of ghr, a self-hosted GitHub Actions runner controller for macOS, built on the official actions/scaleset Go SDK.
- … internal/ with clean dependency injection via oklog/run.
- … Group …
- … start, stop, restart, run, status, purge, login (interactive wizard), logout
- ghr status queries …
Replaces the previous minimal reconciler-based implementation entirely (old-version/ removed).
Test plan
- go build ./cmd/ghr compiles successfully
- go test ./...: all unit tests pass
- go test -race ./...: no data races
- go vet ./...: no static analysis issues
- … config.example.yaml and env.example
- … (ghr --help, ghr start --help, etc.)