diff --git a/.claude/.gitignore b/.claude/.gitignore new file mode 100644 index 0000000..93c0f73 --- /dev/null +++ b/.claude/.gitignore @@ -0,0 +1 @@ +settings.local.json diff --git a/.claude/README.md b/.claude/README.md new file mode 100644 index 0000000..c531d3f --- /dev/null +++ b/.claude/README.md @@ -0,0 +1,46 @@ +# .claude/ — Project Configuration + +## Commands + +| Command | Description | +|---|---| +| `/pm` | Project manager mode — discuss, plan, delegate | +| `/impl ` | Implement a feature from specs | +| `/test [packages]` | Run tests with race detector | +| `/check` | Full CI pipeline (build, vet, fmt, lint, test) | + +## Agents + +| Agent | Role | +|---|---| +| `spec-writer` | Write/update specs from a brief | +| `developer` | Implement Go code from specs | +| `code-reviewer` | Review code quality and idioms | +| `spec-checker` | Verify implementation matches specs | + +## Structure + +``` +.claude/ +├── CLAUDE.md # Project instructions (always loaded) +├── settings.json # Permissions, env vars +├── rules/ # Auto-loaded by file type +│ ├── go-style.md # *.go → naming, errors, concurrency +│ ├── architecture.md # internal/** → packages, interfaces, DI +│ ├── testing.md # *_test.go → table-driven, mocking +│ └── security.md # *.go, *.yaml → secrets, exec safety +├── skills/ # Loaded on demand +│ ├── pm/ # Project manager orchestrator +│ ├── go-expert/ # Go patterns (oklog/run, slog, exec...) 
+│ └── scaleset-sdk/ # actions/scaleset SDK reference +├── agents/ # Specialized subagents +│ ├── developer.md +│ ├── spec-writer.md +│ ├── code-reviewer.md +│ └── spec-checker.md +└── commands/ # Slash commands + ├── pm.md + ├── impl.md + ├── test.md + └── check.md +``` diff --git a/.claude/agents/code-reviewer.md b/.claude/agents/code-reviewer.md new file mode 100644 index 0000000..4c6a0ab --- /dev/null +++ b/.claude/agents/code-reviewer.md @@ -0,0 +1,24 @@ +--- +name: code-reviewer +description: Review Go code for correctness, idioms, error handling, concurrency safety, and alignment with project specs. +model: sonnet +effort: 3 +allowedTools: + - Read + - Grep + - Glob +--- + +# Go Code Reviewer + +You review Go code in the ghr project. Focus on: + +1. **Correctness**: Does the code do what the spec says? Check against `specs/` files. +2. **Error handling**: Every error wrapped with context? No ignored errors? Sentinel errors used correctly? +3. **Concurrency**: Mutex used correctly? No data races? Context propagation complete? Goroutines have shutdown paths? +4. **Interfaces**: Consumer-side only? Minimal (1-3 methods)? No getter interfaces? +5. **Go idioms**: Naming follows Go conventions? No Java patterns? Structs with exported fields? +6. **Security**: No hardcoded secrets? No unsanitized exec input? Permissions checked? +7. **Tests**: Coverage of error paths? Table-driven? Race detector compatible? + +Be specific. Reference line numbers. Suggest concrete fixes, not vague improvements. diff --git a/.claude/agents/developer.md b/.claude/agents/developer.md new file mode 100644 index 0000000..9063186 --- /dev/null +++ b/.claude/agents/developer.md @@ -0,0 +1,64 @@ +--- +name: developer +description: Implement Go code for ghr v2. Receives precise instructions from the PM with spec references, files to create/modify, and expected behavior. Writes production-quality Go code with tests. 
+model: opus +effort: 3 +allowedTools: + - Read + - Write + - Edit + - Bash + - Grep + - Glob +--- + +# Developer + +You are a senior Go developer implementing features for the ghr v2 project. + +## Input + +You receive **implementation instructions** from the PM containing: +- Which spec(s) to follow (read them first) +- Which files to create or modify +- Expected behavior and edge cases +- Dependencies on other packages + +## Process + +1. **Read the spec** — understand exactly what's expected +2. **Read the architecture** — `specs/00-architecture.md` for package placement and patterns +3. **Read existing code** — understand what's already implemented, import conventions +4. **Implement** — write the code, following the spec precisely +5. **Write tests** — alongside the implementation, not after +6. **Verify** — `go build ./cmd/ghr` and `go test -race ./...` +7. **Report** — list what was created/modified and any deviations from spec + +## Code standards + +- Package-by-feature under `internal/` +- Consumer-side interfaces (defined where consumed, unexported, minimal) +- Structs with exported fields (no getter interfaces) +- Error wrapping: `fmt.Errorf("context: %w", err)` +- `context.Context` as first parameter +- Table-driven tests with `t.Run` +- `oklog/run` for top-level actors, internal retry for per-group goroutines +- Secrets via env vars, never hardcoded +- No `any` without justification +- No ignored errors with `_` + +## What you do NOT do + +- You don't decide architecture — that's in the specs +- You don't add features not in the spec — flag them to the PM +- You don't skip tests — every exported function gets tested +- You don't skip error handling — every error is wrapped and returned +- You don't use global state — everything via dependency injection + +## When something is unclear + +If the spec is ambiguous or you find a contradiction: +1. State what's unclear +2. State the two (or more) interpretations +3. State which you'd pick and why +4. 
Implement your pick but flag it in your report diff --git a/.claude/agents/spec-checker.md b/.claude/agents/spec-checker.md new file mode 100644 index 0000000..708f95f --- /dev/null +++ b/.claude/agents/spec-checker.md @@ -0,0 +1,31 @@ +--- +name: spec-checker +description: Verify that implementation matches the project specs. Use when implementing a new feature to ensure nothing is missed. +model: sonnet +effort: 3 +allowedTools: + - Read + - Grep + - Glob +--- + +# Spec Compliance Checker + +You verify that Go code matches the specs in `specs/`. For a given feature: + +1. Read the relevant spec file(s) from `specs/` +2. Read the implementation code +3. Compare point by point: + - Are all specified behaviors implemented? + - Are all edge cases handled as described? + - Do struct fields match the spec? + - Do function signatures match? + - Are config defaults correct? + - Are error messages as specified? +4. Report: + - Implemented correctly + - Missing from implementation + - Deviations from spec (with reasoning if the deviation seems intentional) + - Spec ambiguities discovered during review + +Be thorough. Cross-reference between specs (e.g., spec 01 references spec 08 for auth). diff --git a/.claude/agents/spec-writer.md b/.claude/agents/spec-writer.md new file mode 100644 index 0000000..34a3c83 --- /dev/null +++ b/.claude/agents/spec-writer.md @@ -0,0 +1,66 @@ +--- +name: spec-writer +description: Write detailed technical specs for ghr v2 features. Receives a brief from the PM, reads existing specs for context, and produces a complete spec document. +model: opus +effort: 3 +allowedTools: + - Read + - Write + - Edit + - Grep + - Glob +--- + +# Spec Writer + +You write technical specifications for the ghr v2 project. + +## Input + +You receive a **brief** from the PM that describes what needs to be specified. 
The brief contains: +- What the feature does +- User's requirements and decisions +- Related existing specs to reference +- Any constraints or non-goals + +## Process + +1. **Read existing specs** for context (especially `specs/00-architecture.md`) +2. **Read related specs** mentioned in the brief +3. **Write the spec** following the established format + +## Spec format + +Follow the same structure as existing specs in `specs/`: + +```markdown +# Spec XX — Title + +## Overview +1-2 sentences describing the feature. + +--- + +## [Feature sections] +Detailed description with: +- Go code examples (structs, interfaces, function signatures) +- Config YAML examples +- Flow descriptions (startup, shutdown, error handling) +- Decision rationale (why this approach) + +## Config schema +Relevant YAML fields for this feature. + +## Integration points +How this feature connects to other specs/packages. +``` + +## Rules + +- Be specific — include Go signatures, YAML examples, concrete values +- Reference other specs by number (e.g., "see spec 08-auth.md") +- Use the same terminology as existing specs +- Flag any contradictions with existing specs +- Don't over-specify implementation details that should be left to the developer +- Config secrets via env vars only, never in YAML +- Follow the architecture from spec 00 (package-by-feature, consumer-side interfaces) diff --git a/.claude/commands/check.md b/.claude/commands/check.md new file mode 100644 index 0000000..85dd54d --- /dev/null +++ b/.claude/commands/check.md @@ -0,0 +1,16 @@ +--- +description: Run full CI checks locally (build, vet, fmt, lint, test) +allowed-tools: Bash(go *) Bash(golangci-lint *) +--- + +# Full CI Check + +Run the complete check pipeline: + +1. `go build ./cmd/ghr` — must compile +2. `go vet ./...` — static analysis +3. Check formatting: `gofmt -l .` — must return empty (all formatted) +4. `golangci-lint run` — lint (config: `.golangci.yml`) +5. 
`go test -race ./...` — all tests with race detector + +Report pass/fail for each step. Stop on first failure. diff --git a/.claude/commands/impl.md b/.claude/commands/impl.md new file mode 100644 index 0000000..c3a2426 --- /dev/null +++ b/.claude/commands/impl.md @@ -0,0 +1,17 @@ +--- +description: Implement a feature from a spec. Reads the spec, plans, implements, tests. +--- + +# Implement from Spec + +Implement $ARGUMENTS following this workflow: + +1. **Read the spec**: Find the relevant spec in `specs/` for the requested feature +2. **Read architecture**: Check `specs/00-architecture.md` for package placement and interfaces +3. **Plan**: List the files to create/modify, the structs, interfaces, and functions needed +4. **Implement**: Write the code following the spec precisely +5. **Test**: Write tests alongside the implementation +6. **Verify**: Run `go build ./cmd/ghr` and `go test -race ./...` +7. **Review**: Check against the spec for any missed items + +If the spec is ambiguous or contradicts another spec, flag it before implementing. diff --git a/.claude/commands/pm.md b/.claude/commands/pm.md new file mode 100644 index 0000000..d558193 --- /dev/null +++ b/.claude/commands/pm.md @@ -0,0 +1,21 @@ +--- +description: Start the project manager mode. Discuss features, create specs, plan and delegate implementation. +--- + +# Project Manager Mode + +You are now the **project manager** for ghr v2. Read the PM skill at `.claude/skills/pm/SKILL.md` for your full instructions. + +Before anything else: +1. Read `specs/00-architecture.md` to understand the current architecture +2. Check what exists in `internal/` to know the project state +3. Greet the user and ask what they want to work on + +Task from user: $ARGUMENTS + +If no arguments, ask what they want to work on. 
Options: +- Discuss a feature or idea +- Create a new spec +- Implement a feature from an existing spec +- Review project status +- Something else diff --git a/.claude/commands/test.md b/.claude/commands/test.md new file mode 100644 index 0000000..1c1c227 --- /dev/null +++ b/.claude/commands/test.md @@ -0,0 +1,16 @@ +--- +description: Run tests with race detector and show results +allowed-tools: Bash(go test *) +--- + +# Run Tests + +Run tests for $ARGUMENTS (default: all packages): + +```bash +go test -race -v $ARGUMENTS +``` + +If no arguments: `go test -race ./...` + +After tests complete, summarize: passed/failed/skipped counts and any failures. diff --git a/.claude/rules/architecture.md b/.claude/rules/architecture.md new file mode 100644 index 0000000..dbb5ad9 --- /dev/null +++ b/.claude/rules/architecture.md @@ -0,0 +1,39 @@ +--- +paths: + - "internal/**/*.go" + - "cmd/**/*.go" +--- + +# Architecture Rules + +## Package structure +- Package-by-feature under `internal/`, one level deep. No `domain/`, `app/`, `infra/` layers. +- `internal/model/` contains ONLY shared data structs and enums. No interfaces. No logic. Under 100 LOC. +- Each package owns its feature end-to-end. + +## Interfaces +- Define interfaces where they are CONSUMED, not where they are implemented. +- Consumer-side interfaces are unexported (lowercase) and minimal (1-3 methods). +- Never create a central `ports.go` or `interfaces.go`. +- Never create getter interfaces (`ID() string`, `Name() string`). Use struct fields. + +## Dependencies +- Dependency injection is manual in `cmd/ghr/main.go`. No DI framework. +- The `controller/` package defines what it needs from `github/` via a small interface. +- The `health/` package defines what it needs from `controller/` via a small interface. +- Import direction: `cli` → `controller` → `github`, `runner`, `notification`. Never the reverse. 
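The interface and dependency rules above can be sketched in a single file. This is a minimal illustration, not the project's actual code: `Runner`, `GitHubClient`, `runnerLister`, `Controller`, and `demoCount` are hypothetical names, and the two packages (`github/` as producer, `controller/` as consumer) are collapsed into `package main` so the sketch runs on its own.

```go
package main

import (
	"context"
	"fmt"
)

// Runner stands in for a shared data struct (hypothetical; plain exported
// fields, no getter methods).
type Runner struct {
	ID   int
	Name string
}

// GitHubClient stands in for the concrete type the producer package would
// return. It does not know about any interface.
type GitHubClient struct {
	runners []Runner
}

// ListRunners returns the known runners, honoring context cancellation.
func (c *GitHubClient) ListRunners(ctx context.Context) ([]Runner, error) {
	if err := ctx.Err(); err != nil {
		return nil, fmt.Errorf("list runners: %w", err)
	}
	return c.runners, nil
}

// runnerLister is the consumer-side interface: unexported, one method,
// declared next to the code that uses it.
type runnerLister interface {
	ListRunners(ctx context.Context) ([]Runner, error)
}

// Controller depends only on the small interface it declared itself.
type Controller struct {
	runners runnerLister
}

// NewController receives its dependency explicitly (manual injection).
func NewController(r runnerLister) *Controller {
	return &Controller{runners: r}
}

// RunnerCount wraps any error with context before returning it.
func (c *Controller) RunnerCount(ctx context.Context) (int, error) {
	rs, err := c.runners.ListRunners(ctx)
	if err != nil {
		return 0, fmt.Errorf("count runners: %w", err)
	}
	return len(rs), nil
}

// demoCount does the wiring that a main function would do by hand.
func demoCount() (int, error) {
	gh := &GitHubClient{runners: []Runner{
		{ID: 1, Name: "runner-1"},
		{ID: 2, Name: "runner-2"},
	}}
	return NewController(gh).RunnerCount(context.Background())
}

func main() {
	n, err := demoCount()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("runners:", n)
}
```

Because `GitHubClient` satisfies `runnerLister` implicitly, a test can substitute a hand-written fake with the same single method without touching the producer.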
+ +## Concurrency +- `oklog/run.Group` for the top-level daemon actors (controller, health, API server, signal handler). +- When ONE actor fails, ALL are interrupted — clean deterministic shutdown. +- Per-group goroutines are managed INSIDE the controller with their own retry logic. +- A single group failure does NOT kill other groups. + +## Configuration +- All config values come from the config struct. No global variables. +- Secrets via env vars only, never in YAML. +- Auth credentials via `ghr login` / credentials file, not config. + +## Specs +- Before implementing a feature, read the corresponding spec in `specs/`. +- If the spec is unclear or you need to deviate, flag it rather than guessing. diff --git a/.claude/rules/code-cleanliness.md b/.claude/rules/code-cleanliness.md new file mode 100644 index 0000000..7747f32 --- /dev/null +++ b/.claude/rules/code-cleanliness.md @@ -0,0 +1,22 @@ +# Code Cleanliness + +## Comments +- No comments in code (Exception: explain the why). Code must be self-documenting through clear naming. +- No godoc comments on types, functions, or methods. Names speak for themselves. +- No inline comments, no section separators (--- lines), no TODO markers. +- No commented-out code. +- Exception: required `//go:` directives and `//nolint:` directives. + +## File size +- Source files must stay under 200 LOC (excluding tests). +- If a file grows beyond 200 LOC, split by logical concern into separate files. +- One responsibility per file. Name files after what they contain. + +## Structure +- Use subdirectories when a package has more than 5-6 files with distinct concerns. +- Test files are exempt from the 200 LOC limit but should still be well-organized. +- Group related types/functions in the same file. Don't scatter a concept across files. + +## Naming +- File names describe their content: `handler.go`, `writer.go`, `validate.go`. +- No generic names: `utils.go`, `helpers.go`, `common.go`, `misc.go`. 
diff --git a/.claude/rules/go-style.md b/.claude/rules/go-style.md new file mode 100644 index 0000000..3099f59 --- /dev/null +++ b/.claude/rules/go-style.md @@ -0,0 +1,47 @@ +--- +paths: + - "**/*.go" +--- + +# Go Style & Idioms + +## Naming +- Package names: short, lowercase, singular (`runner` not `runners`, `config` not `configuration`) +- Exported names: PascalCase, meaningful without package prefix (`runner.Process` not `runner.RunnerProcess`) +- Unexported: camelCase +- Acronyms: all caps (`ID`, `HTTP`, `URL`, `API`, `PID`, `JIT`) +- Interface names: verb-er for single-method (`io.Reader`), descriptive for multi-method + +## Error handling +- Always wrap with context: `fmt.Errorf("start runner %s: %w", name, err)` +- Never ignore errors with `_` — handle or log explicitly +- Use sentinel errors (`var ErrNotFound = errors.New(...)`) for expected conditions +- Use `errors.Is` / `errors.As` for checking, never string comparison +- Return early on error (no deep nesting) + +## Functions +- `context.Context` always first parameter +- Return concrete types, accept interfaces +- Keep functions short (< 40 lines guideline) +- Prefer named return values only when it aids godoc clarity + +## Concurrency +- Protect shared state with `sync.Mutex` (not channels for simple state) +- Always use `context.Context` for cancellation +- Never start a goroutine without a way to stop it +- Use `oklog/run` for top-level actor management +- Use `sync.WaitGroup` or `errgroup` for worker pools + +## Testing +- Table-driven with `t.Run` subtests +- Test file in same package (white-box) or `_test` package (black-box) +- Use `testify/assert` or `testify/require` for assertions +- Use `httptest.Server` for HTTP tests +- Test names: `TestFunctionName_Scenario_Expected` +- Race detector: always run with `-race` in CI + +## Packages +- Everything under `internal/` (nothing exported outside module) +- One feature per package, no `utils/` or `helpers/` +- Avoid circular imports — if needed, 
extract shared types to `model/` +- Package-level `var` and `init()` only for simple defaults, never for complex setup diff --git a/.claude/rules/security.md b/.claude/rules/security.md new file mode 100644 index 0000000..22dfd46 --- /dev/null +++ b/.claude/rules/security.md @@ -0,0 +1,19 @@ +--- +paths: + - "**/*.go" + - "**/*.yaml" + - "**/*.json" +--- + +# Security Rules + +- Never hardcode secrets (tokens, keys, passwords). Use env vars or the credentials file. +- Never log secrets. PATs are masked (`ghp_xxxx...xxxx`), JIT configs are never logged. +- JIT configs (`EncodedJITConfig`) are secrets — treat as such until consumed by the runner. +- Credentials file: `0600` permissions. Warn if overly permissive. +- Private key paths: verify `0600` permissions at login time. +- Webhook URLs (Discord, etc.): via env vars only, never in config.yaml. +- Never `exec.Command` with unsanitized user input. +- Never `filepath.Join` with untrusted path components (path traversal). +- TLS: do not skip verification by default. Support custom CAs via config if needed. +- Validate all external input (config values, API responses, env vars). 
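Two of the rules above, token masking and the `0600` permission check, can be sketched as small helpers. This is an illustration under stated assumptions: `maskToken` and `tooPermissive` are hypothetical names, and the mask layout (first 8 characters, `...`, last 4) is only one reading of the `ghp_xxxx...xxxx` example.

```go
package main

import (
	"fmt"
	"io/fs"
)

// maskToken keeps the first 8 and last 4 characters of a token and hides
// the rest. Short tokens are fully masked so nothing useful leaks.
// The exact layout is an assumption based on the ghp_xxxx...xxxx example.
func maskToken(tok string) string {
	if len(tok) <= 12 {
		return "****"
	}
	return tok[:8] + "..." + tok[len(tok)-4:]
}

// tooPermissive reports whether any group or other permission bit is set,
// i.e. whether the mode grants more than owner-only access such as 0600.
func tooPermissive(mode fs.FileMode) bool {
	return mode.Perm()&0o077 != 0
}

func main() {
	// A made-up token used only for illustration.
	fmt.Println(maskToken("ghp_abcdefghijklmnopqrstuvwxyz0123456789"))

	for _, m := range []fs.FileMode{0o600, 0o644} {
		if tooPermissive(m) {
			fmt.Printf("warning: mode %04o is too permissive\n", m.Perm())
		} else {
			fmt.Printf("mode %04o is owner-only\n", m.Perm())
		}
	}
}
```

In practice the mode would come from `os.Stat` on the credentials file; the helper takes an `fs.FileMode` so it can be tested without touching the filesystem.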
diff --git a/.claude/rules/testing.md b/.claude/rules/testing.md new file mode 100644 index 0000000..fabfa2f --- /dev/null +++ b/.claude/rules/testing.md @@ -0,0 +1,43 @@ +--- +paths: + - "**/*_test.go" +--- + +# Testing Rules + +## Structure +- One test file per source file: `foo.go` → `foo_test.go` +- Table-driven tests with `t.Run` for every non-trivial function +- Group related tests in subtests: `TestGroupController/startup`, `TestGroupController/shutdown` + +## Naming +- `TestFunctionName` for basic tests +- `TestFunctionName_Scenario` for specific scenarios +- `TestFunctionName_Scenario_Expected` for full clarity +- Benchmark: `BenchmarkFunctionName` + +## Assertions +- Use `testify/require` for fatal checks (stop test on failure) +- Use `testify/assert` for non-fatal checks (continue test) +- Never use bare `if err != nil { t.Fatal(err) }` when testify is available + +## Mocking +- Consumer-side interfaces make mocking trivial +- Hand-written fakes preferred over generated mocks for simple interfaces +- Use `httptest.Server` for HTTP integration tests +- Use the scaleset SDK's `internal/testserver` pattern for GitHub API mocks + +## Coverage +- Run with `-race` flag always +- Focus on behavior, not coverage percentage +- Test error paths, not just happy paths +- Timeouts in tests: use `context.WithTimeout` or `time.After`, never bare `time.Sleep` + +## What to test +- `model/` — no tests needed (pure data) +- `controller/` — mock github client + runner backend +- `runner/` — test binary download with httptest, process lifecycle with real exec +- `health/` — mock runner state + reporter interfaces +- `notification/` — test providers against httptest.Server +- `config/` — table-driven validation +- `cli/` — thin layer, minimal tests diff --git a/.claude/settings.json b/.claude/settings.json new file mode 100644 index 0000000..dd1467c --- /dev/null +++ b/.claude/settings.json @@ -0,0 +1,45 @@ +{ + "permissions": { + "allow": [ + "Bash(go build *)", + "Bash(go 
test *)", + "Bash(go fmt *)", + "Bash(go vet *)", + "Bash(go mod *)", + "Bash(go run *)", + "Bash(gofmt *)", + "Bash(golangci-lint *)", + "Bash(git status)", + "Bash(git diff *)", + "Bash(git log *)", + "Bash(git branch *)", + "Bash(git show *)", + "Bash(ls *)", + "Bash(find *)", + "Bash(wc *)", + "Bash(head *)", + "Bash(tail *)", + "Bash(cat go.mod)", + "Bash(cat go.sum)", + "Bash(mkdir -p *)", + "Read", + "Edit", + "Write", + "Grep", + "Glob" + ], + "deny": [ + "Bash(rm -rf /)", + "Bash(sudo *)", + "Read(.env)", + "Read(**/.env)", + "Read(**/credentials.json)", + "Edit(.env)", + "Edit(**/credentials.json)" + ] + }, + "env": { + "GOPROXY": "https://proxy.golang.org", + "CGO_ENABLED": "1" + } +} diff --git a/.claude/skills/go-expert/SKILL.md b/.claude/skills/go-expert/SKILL.md new file mode 100644 index 0000000..b26dbae --- /dev/null +++ b/.claude/skills/go-expert/SKILL.md @@ -0,0 +1,74 @@ +--- +name: go-expert +description: Advanced Go patterns and best practices for daemon/service projects. Use when writing Go code involving goroutine lifecycle, context propagation, graceful shutdown, process management, HTTP clients, structured logging (slog), table-driven tests, consumer-side interfaces, or any Go architectural decision. Triggers on Go code, go.mod changes, or Go-related questions. +paths: + - "**/*.go" + - "go.mod" + - "go.sum" +--- + +# Go Expert Patterns + +Advanced Go patterns for daemon/service projects. Read `references/patterns.md` for the full reference when implementing complex patterns. 
+ +## Quick reference — most common patterns + +### Goroutine lifecycle (oklog/run) +```go +var g run.Group +// Add actors: each is an (execute, interrupt) pair +g.Add(func() error { return server.Run(ctx) }, func(error) { cancel() }) +g.Add(func() error { <-ctx.Done(); return nil }, func(error) { cancel() }) +err := g.Run() // blocks until first actor returns, then interrupts all others +``` + +### Graceful shutdown +```go +ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM) +defer stop() +// ... run services with ctx ... +// On signal: ctx is cancelled, services stop, cleanup runs +shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second) +defer cancel() +service.Shutdown(shutdownCtx) +``` + +### Consumer-side interface +```go +// In the CONSUMER package, not the producer: +type store interface { + Get(ctx context.Context, id string) (*Thing, error) + Put(ctx context.Context, thing *Thing) error +} +// The producer returns a concrete struct that implicitly satisfies this. +``` + +### Process management (exec.Cmd) +```go +cmd := exec.CommandContext(ctx, path) +cmd.Dir = workDir +cmd.Env = append(os.Environ(), "KEY=value") +cmd.Stdout = logFile +cmd.Stderr = logFile +if err := cmd.Start(); err != nil { return err } +// Graceful stop: +cmd.Process.Signal(syscall.SIGTERM) +done := make(chan error, 1) +go func() { done <- cmd.Wait() }() +select { +case <-done: // exited +case <-time.After(10 * time.Second): + cmd.Process.Kill() + <-done +} +``` + +### Structured logging (slog) +```go +logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelInfo})) +groupLogger := logger.With("group", groupName) +runnerLogger := groupLogger.With("runner", runnerName) +runnerLogger.Info("job completed", "result", "success", "duration_s", 42) +``` + +For the full pattern library, read `references/patterns.md`. 
diff --git a/.claude/skills/go-expert/references/patterns.md b/.claude/skills/go-expert/references/patterns.md new file mode 100644 index 0000000..b2259e7 --- /dev/null +++ b/.claude/skills/go-expert/references/patterns.md @@ -0,0 +1,550 @@ +# Go Expert — Full Pattern Reference + +## Table of Contents + +1. [Goroutine lifecycle management](#1-goroutine-lifecycle) +2. [Context propagation](#2-context-propagation) +3. [Error handling patterns](#3-error-handling) +4. [HTTP client patterns](#4-http-client) +5. [Process management](#5-process-management) +6. [Structured logging (slog)](#6-structured-logging) +7. [Testing patterns](#7-testing) +8. [Configuration loading](#8-configuration) +9. [Concurrency patterns](#9-concurrency) +10. [File system operations](#10-filesystem) + +--- + +## 1. Goroutine lifecycle + +### oklog/run for daemon actors + +```go +import "github.com/oklog/run" + +var g run.Group + +// Actor: long-running service +{ + ctx, cancel := context.WithCancel(context.Background()) + g.Add( + func() error { return myService.Run(ctx) }, // execute + func(error) { cancel() }, // interrupt + ) +} + +// Actor: signal handler +{ + ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM) + g.Add( + func() error { <-ctx.Done(); return nil }, + func(error) { cancel() }, + ) +} + +// When ANY actor returns, ALL others are interrupted via their interrupt func. 
+if err := g.Run(); err != nil { + log.Fatal(err) +} +``` + +### errgroup for bounded parallel work + +```go +import "golang.org/x/sync/errgroup" + +g, ctx := errgroup.WithContext(ctx) +g.SetLimit(10) // max 10 concurrent + +for _, item := range items { + g.Go(func() error { + return process(ctx, item) + }) +} +if err := g.Wait(); err != nil { + return err +} +``` + +### Worker pool with backpressure + +```go +type Pool struct { + sem chan struct{} + wg sync.WaitGroup +} + +func NewPool(size int) *Pool { + return &Pool{sem: make(chan struct{}, size)} +} + +func (p *Pool) Go(fn func()) { + p.wg.Add(1) + p.sem <- struct{}{} // blocks if pool is full + go func() { + defer p.wg.Done() + defer func() { <-p.sem }() + fn() + }() +} + +func (p *Pool) Wait() { p.wg.Wait() } +``` + +--- + +## 2. Context propagation + +### Always pass context, never store it + +```go +// YES +func (s *Service) Process(ctx context.Context, id string) error { ... } + +// NO — storing context in a struct +type Service struct { + ctx context.Context // don't do this +} +``` + +### context.WithoutCancel for cleanup operations + +```go +// Cleanup must complete even if parent context is cancelled +func (s *Service) Shutdown(ctx context.Context) { + cleanupCtx := context.WithoutCancel(ctx) + // or: cleanupCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second) + s.cleanup(cleanupCtx) +} +``` + +### Timeout per operation + +```go +ctx, cancel := context.WithTimeout(ctx, 15*time.Second) +defer cancel() +resp, err := client.Do(req.WithContext(ctx)) +``` + +--- + +## 3. Error handling + +### Sentinel errors + +```go +var ( + ErrNotFound = errors.New("not found") + ErrConflict = errors.New("conflict") + ErrTimeout = errors.New("timeout") +) + +// Usage: +if errors.Is(err, ErrNotFound) { ... 
} +``` + +### Wrapping with context + +```go +func (s *Service) GetUser(ctx context.Context, id string) (*User, error) { + user, err := s.store.Get(ctx, id) + if err != nil { + return nil, fmt.Errorf("get user %s: %w", id, err) + } + return user, nil +} +``` + +### Custom error types + +```go +type ValidationError struct { + Field string + Message string +} + +func (e *ValidationError) Error() string { + return fmt.Sprintf("validation: %s: %s", e.Field, e.Message) +} + +// Check: +var ve *ValidationError +if errors.As(err, &ve) { + log.Printf("field %s: %s", ve.Field, ve.Message) +} +``` + +--- + +## 4. HTTP client + +### Client with timeout and retry + +```go +client := &http.Client{ + Timeout: 30 * time.Second, + Transport: &http.Transport{ + MaxIdleConns: 100, + MaxIdleConnsPerHost: 10, + IdleConnTimeout: 90 * time.Second, + }, +} +``` + +### Request with context + +```go +req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil) +if err != nil { + return fmt.Errorf("build request: %w", err) +} +req.Header.Set("Authorization", "Bearer "+token) +req.Header.Set("Accept", "application/json") + +resp, err := client.Do(req) +if err != nil { + return fmt.Errorf("request: %w", err) +} +defer resp.Body.Close() + +if resp.StatusCode >= 300 { + body, _ := io.ReadAll(resp.Body) + return fmt.Errorf("HTTP %d: %s", resp.StatusCode, string(body)) +} +``` + +### Exponential backoff with jitter + +```go +func backoff(attempt int, base, max time.Duration) time.Duration { + d := base * time.Duration(1<<attempt) + if d > max { + d = max + } + jitter := time.Duration(rand.Int63n(int64(d / 5))) + return d + jitter - d/10 +} +``` + +--- + +## 5. Process management + +### Start with PID tracking + +```go +cmd := exec.CommandContext(ctx, binPath) +cmd.Dir = workDir +cmd.Env = append(os.Environ(), envVars...) 
+cmd.Stdout = logFile +cmd.Stderr = logFile + +if err := cmd.Start(); err != nil { + return fmt.Errorf("start: %w", err) +} + +// Write PID file +pidPath := filepath.Join(workDir, ".pid") +os.WriteFile(pidPath, []byte(strconv.Itoa(cmd.Process.Pid)), 0o644) +``` + +### Graceful stop (SIGTERM → wait → SIGKILL) + +```go +func stopProcess(cmd *exec.Cmd, timeout time.Duration) error { + if cmd.Process == nil { + return nil + } + if err := cmd.Process.Signal(syscall.SIGTERM); err != nil { + return cmd.Process.Kill() + } + done := make(chan error, 1) + go func() { done <- cmd.Wait() }() + select { + case err := <-done: + return err + case <-time.After(timeout): + return cmd.Process.Kill() + } +} +``` + +### Check PID alive + +```go +func pidAlive(pid int) bool { + if pid <= 0 { + return false + } + err := syscall.Kill(pid, 0) + return err == nil || errors.Is(err, syscall.EPERM) +} +``` + +--- + +## 6. Structured logging + +### slog with JSON handler + +```go +handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{ + Level: slog.LevelInfo, + AddSource: false, +}) +logger := slog.New(handler) +``` + +### Logger hierarchy with context + +```go +daemonLogger := logger.With("component", "daemon") +groupLogger := daemonLogger.With("group", groupName) +runnerLogger := groupLogger.With("runner", runnerName) + +runnerLogger.Info("job completed", + "job_id", jobID, + "result", "success", + "duration_s", elapsed.Seconds(), +) +``` + +### Multi-handler (write to multiple destinations) + +```go +type MultiHandler struct { + handlers []slog.Handler +} + +func (m *MultiHandler) Enabled(ctx context.Context, level slog.Level) bool { + for _, h := range m.handlers { + if h.Enabled(ctx, level) { + return true + } + } + return false +} + +func (m *MultiHandler) Handle(ctx context.Context, r slog.Record) error { + for _, h := range m.handlers { + if h.Enabled(ctx, r.Level) { + _ = h.Handle(ctx, r) + } + } + return nil +} + +func (m *MultiHandler) WithAttrs(attrs []slog.Attr) 
slog.Handler { + handlers := make([]slog.Handler, len(m.handlers)) + for i, h := range m.handlers { + handlers[i] = h.WithAttrs(attrs) + } + return &MultiHandler{handlers: handlers} +} + +func (m *MultiHandler) WithGroup(name string) slog.Handler { + handlers := make([]slog.Handler, len(m.handlers)) + for i, h := range m.handlers { + handlers[i] = h.WithGroup(name) + } + return &MultiHandler{handlers: handlers} +} +``` + +--- + +## 7. Testing + +### Table-driven test + +```go +func TestParseConfig(t *testing.T) { + tests := []struct { + name string + input string + want *Config + wantErr string + }{ + { + name: "valid org scope", + input: "https://github.com/my-org", + want: &Config{Scope: "org", Owner: "my-org"}, + }, + { + name: "empty URL", + input: "", + wantErr: "url is required", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + got, err := ParseConfig(tt.input) + if tt.wantErr != "" { + require.ErrorContains(t, err, tt.wantErr) + return + } + require.NoError(t, err) + assert.Equal(t, tt.want, got) + }) + } +} +``` + +### HTTP test server + +```go +func TestClient_ListRunners(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + assert.Equal(t, "/orgs/my-org/actions/runners", r.URL.Path) + assert.Equal(t, "Bearer test-token", r.Header.Get("Authorization")) + json.NewEncoder(w).Encode(map[string]any{ + "runners": []map[string]any{ + {"id": 1, "name": "runner-1", "status": "online"}, + }, + }) + })) + defer srv.Close() + + client := NewClient(srv.URL, "test-token") + runners, err := client.ListRunners(context.Background()) + require.NoError(t, err) + assert.Len(t, runners, 1) +} +``` + +--- + +## 8. 
Configuration + +### YAML with defaults + +```go +type Config struct { + Level string `yaml:"level"` + Dir string `yaml:"dir"` + MaxSize int `yaml:"max_size"` +} + +func (c *Config) applyDefaults() { + if c.Level == "" { + c.Level = "info" + } + if c.Dir == "" { + if os.Getuid() == 0 { + c.Dir = "/var/log/ghr" + } else { + home, _ := os.UserHomeDir() + c.Dir = filepath.Join(home, ".local", "share", "ghr", "logs") + } + } +} +``` + +### Validation + +```go +func (c *Config) Validate() error { + if len(c.Groups) == 0 { + return fmt.Errorf("at least one group is required") + } + for i, g := range c.Groups { + if g.Name == "" { + return fmt.Errorf("groups[%d].name is required", i) + } + if g.MaxRunners < 1 { + return fmt.Errorf("groups[%d].max_runners must be >= 1", i) + } + if g.MinRunners > g.MaxRunners { + return fmt.Errorf("groups[%d].min_runners (%d) > max_runners (%d)", i, g.MinRunners, g.MaxRunners) + } + } + return nil +} +``` + +--- + +## 9. Concurrency + +### Mutex-protected state + +```go +type RunnerState struct { + mu sync.Mutex + idle map[string]*Process + busy map[string]*Process +} + +func (s *RunnerState) MarkBusy(name string) { + s.mu.Lock() + defer s.mu.Unlock() + proc, ok := s.idle[name] + if !ok { + return // log warning + } + delete(s.idle, name) + s.busy[name] = proc +} + +func (s *RunnerState) Count() int { + s.mu.Lock() + defer s.mu.Unlock() + return len(s.idle) + len(s.busy) +} + +func (s *RunnerState) Snapshot() []RunnerSnapshot { + s.mu.Lock() + defer s.mu.Unlock() + // Return a copy, not the map itself + out := make([]RunnerSnapshot, 0, len(s.idle)+len(s.busy)) + for _, p := range s.idle { out = append(out, p.Snapshot("idle")) } + for _, p := range s.busy { out = append(out, p.Snapshot("busy")) } + return out +} +``` + +--- + +## 10. 
Filesystem + +### Safe directory copy + +```go +func copyDir(src, dst string) error { + return filepath.WalkDir(src, func(path string, d fs.DirEntry, err error) error { + if err != nil { + return err + } + rel, _ := filepath.Rel(src, path) + target := filepath.Join(dst, rel) + + if d.IsDir() { + return os.MkdirAll(target, 0o755) + } + + info, err := d.Info() + if err != nil { + return err + } + return copyFile(path, target, info.Mode()) + }) +} + +func copyFile(src, dst string, perm fs.FileMode) error { + in, err := os.Open(src) + if err != nil { + return err + } + defer in.Close() + + out, err := os.OpenFile(dst, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, perm) + if err != nil { + return err + } + defer out.Close() + + _, err = io.Copy(out, in) + return err +} +``` diff --git a/.claude/skills/pm/SKILL.md b/.claude/skills/pm/SKILL.md new file mode 100644 index 0000000..bec02c3 --- /dev/null +++ b/.claude/skills/pm/SKILL.md @@ -0,0 +1,73 @@ +--- +name: pm +description: Project manager orchestrator for ghr v2. Use this when the user wants to discuss features, plan work, create specs, or implement features with a structured workflow. Acts as a tech lead that delegates to specialized agents (spec-writer, developer, reviewer, tester). Triggers on project planning, feature discussion, spec creation, implementation requests, or when the user says "pm", "project manager", "let's plan", "let's implement", or "new feature". +--- + +# Project Manager — ghr v2 + +You are the **technical project manager** for ghr v2. You orchestrate the project by talking to the user and delegating to specialized agents. + +## Your role + +- You are the single point of contact for the user +- You understand the full project vision (specs in `specs/`) +- You make decisions, prioritize, and delegate +- You never write code yourself — you delegate to agents +- You track progress and report back concisely + +## How you work + +### When the user wants to discuss / brainstorm +Talk directly. Ask questions. 
Challenge ideas. Push back if something is over-engineered or contradicts existing specs. Your goal: converge on a clear decision. + +### When the user wants a new spec +Delegate to the **spec-writer** agent. But first: +1. Have a conversation with the user to understand EXACTLY what they want +2. Ask targeted questions (not open-ended dumps) +3. Reference existing specs that might be impacted +4. Once you have clarity, write a brief (~5 lines) for the spec-writer +5. Spawn the spec-writer agent with the brief +6. Review the output, show it to the user, iterate + +### When the user wants to implement a feature +1. Identify which spec(s) cover this feature +2. Break it down into implementation tasks (ordered by dependency) +3. For each task, spawn the **developer** agent with precise instructions +4. After implementation, spawn the **code-reviewer** agent +5. After review, spawn the **tester** agent if tests are missing +6. Run `/check` to validate the full pipeline +7. Report results to the user + +### When the user asks about project status +Read the specs, check which files exist in `internal/`, report what's done vs what's left. + +## Agents you can delegate to + +| Agent | Use for | +|---|---| +| **spec-writer** | Writing new specs or updating existing ones. Give it a clear brief. | +| **developer** | Writing Go code. Give it the spec reference, files to create/modify, and expected behavior. | +| **code-reviewer** | Reviewing code quality, Go idioms, spec compliance. Give it the files to review. | +| **spec-checker** | Verifying implementation matches specs. Give it the spec + implementation files. 
| + +## Rules + +- **Never guess** — if you're unsure about the user's intent, ask +- **One thing at a time** — don't overload agents with multiple unrelated tasks +- **Show your plan** — before delegating, tell the user what you're about to do +- **Keep context lean** — give agents only what they need, not the whole project history +- **Specs are the source of truth** — always check specs before making decisions +- **Flag contradictions** — if something conflicts with existing specs, surface it immediately + +## Current specs + +Read the relevant ones before any decision: +- `specs/00-architecture.md` — package structure, interfaces, DI +- `specs/01-core-scaleset.md` — scale set engine, scaler, runner manager +- `specs/02-cli-commands.md` — CLI commands (start/stop/run/status/purge/login) +- `specs/03-health-monitor.md` — health checks +- `specs/04-logging.md` — structured logging +- `specs/05-notifications.md` — Discord, webhooks +- `specs/06-uptime-kuma.md` — push monitoring +- `specs/07-config.md` — YAML config schema +- `specs/08-auth.md` — authentication (login wizard) diff --git a/.claude/skills/scaleset-sdk/SKILL.md b/.claude/skills/scaleset-sdk/SKILL.md new file mode 100644 index 0000000..ddab90e --- /dev/null +++ b/.claude/skills/scaleset-sdk/SKILL.md @@ -0,0 +1,137 @@ +--- +name: scaleset-sdk +description: Build custom GitHub Actions runner autoscalers using the official actions/scaleset Go SDK. Use this skill whenever working with GitHub Actions Runner Scale Sets, implementing the Scaler interface, configuring JIT runners, managing scale set sessions, or building self-hosted runner infrastructure (including ghr). Triggers on any code importing "github.com/actions/scaleset", any mention of scale sets, JIT runner config, runner autoscaling, or self-hosted runner management with Go. Also use when debugging scale set authentication, message polling, or runner lifecycle issues. 
+--- + +# GitHub Actions Runner Scale Set SDK + +Complete reference for building custom autoscaling solutions with `github.com/actions/scaleset`. + +## When to use this skill + +- Writing Go code that imports `github.com/actions/scaleset` or `github.com/actions/scaleset/listener` +- Implementing the `listener.Scaler` interface +- Building a custom runner backend (process, VM, container) +- Debugging scale set auth, polling, or runner lifecycle +- Working on ghr (GitHub runner controller for macOS) + +## Quick start pattern + +Every scale set autoscaler follows this skeleton: + +```go +// 1. Create client (PAT or GitHub App) +client, _ := scaleset.NewClientWithPersonalAccessToken(scaleset.NewClientWithPersonalAccessTokenConfig{ + GitHubConfigURL: "https://github.com/my-org", + PersonalAccessToken: token, + SystemInfo: scaleset.SystemInfo{System: "ghr", Version: "1.0"}, +}) + +// 2. Create or get scale set +scaleSet, _ := client.CreateRunnerScaleSet(ctx, &scaleset.RunnerScaleSet{ + Name: "my-runners", // this IS the runs-on: label + RunnerGroupID: 1, // 1 = "default" + Labels: []scaleset.Label{{Type: "System", Name: "my-runners"}}, + RunnerSetting: scaleset.RunnerSetting{DisableUpdate: true}, +}) +defer client.DeleteRunnerScaleSet(context.WithoutCancel(ctx), scaleSet.ID) + +// 3. Open message session +sessionClient, _ := client.MessageSessionClient(ctx, scaleSet.ID, hostname) +defer sessionClient.Close(context.Background()) + +// 4. 
Create listener + run with your Scaler +l, _ := listener.New(sessionClient, listener.Config{ + ScaleSetID: scaleSet.ID, + MaxRunners: 15, +}) +l.Run(ctx, &MyScaler{}) // blocks until ctx cancelled or error +``` + +## The Scaler interface (the only thing you implement) + +```go +type Scaler interface { + HandleDesiredRunnerCount(ctx context.Context, count int) (int, error) + HandleJobStarted(ctx context.Context, jobInfo *scaleset.JobStarted) error + HandleJobCompleted(ctx context.Context, jobInfo *scaleset.JobCompleted) error +} +``` + +### HandleDesiredRunnerCount(ctx, count) (int, error) + +- `count` = `statistics.TotalAssignedJobs` (jobs needing runners RIGHT NOW) +- Called VERY frequently: at init, after every message, after every long-poll timeout (~50s) +- Return the actual runner count you scaled to (used for metrics only) +- Any error terminates `Run()` +- Scaling formula from the reference example: `target = min(maxRunners, minRunners + count)` +- Scale-down is NOT done here — it happens in `HandleJobCompleted` + +### HandleJobStarted(ctx, jobInfo) error + +- Mark the runner as busy (bookkeeping). No scaling action needed. +- `jobInfo.RunnerName` identifies which runner got the job. +- Any error terminates `Run()` + +### HandleJobCompleted(ctx, jobInfo) error + +- THIS is where scale-down happens: destroy the runner process/container/VM + cleanup workdir. +- `jobInfo.RunnerName` identifies which runner to destroy. +- `jobInfo.Result`: `"Succeeded"`, `"Failed"`, or `"Cancelled"` (cancelled = job reassignment, not a real completion) +- Any error terminates `Run()` + +### Processing order within a single message batch + +1. AcquireJobs (automatic, not exposed to Scaler) +2. All HandleJobStarted calls +3. All HandleJobCompleted calls +4. HandleDesiredRunnerCount + +JobCompleted runs BEFORE HandleDesiredRunnerCount. This is why the count naturally decreases after runners are cleaned up. 
+ +## JIT Runner Config (replaces config.sh) + +```go +jit, _ := scalesetClient.GenerateJitRunnerConfig(ctx, + &scaleset.RunnerScaleSetJitRunnerSetting{Name: "runner-abc123"}, + scaleSetID, +) +// jit.EncodedJITConfig is a base64 blob — treat as SECRET until consumed +``` + +The runner binary reads the JIT config from an env var instead of needing `config.sh`: + +```go +cmd := exec.Command("./run.sh") +cmd.Env = append(os.Environ(), "ACTIONS_RUNNER_INPUT_JITCONFIG="+jit.EncodedJITConfig) +cmd.Start() +``` + +No `config.sh` step needed. No registration token management. The JIT config contains everything. + +## Authentication + +Read `references/api-reference.md` section "Authentication" for the full flow. Summary: + +- **GitHub App (recommended)**: `ClientID` + `InstallationID` + `PrivateKey` (PEM). Auto-rotates tokens. +- **PAT**: simpler, broader scope. Pass as `PersonalAccessToken`. +- Token exchange is automatic: PAT/App -> registration token -> admin token. Refresh is transparent (60s before expiry). + +## Key design facts + +1. **Scale set name = workflow label**. `runs-on: my-scale-set` targets the scale set named `my-scale-set`. +2. **Runners are ephemeral by default**. One job, then removed. +3. **Long-polling, not interval polling**. `GetMessage` blocks up to ~50s. React instantly to new jobs. +4. **Message ack is optimistic**. Messages are deleted BEFORE your Scaler processes them. +5. **`handleMessage` uses `context.WithoutCancel`**. Even during shutdown, message processing completes. +6. **Scale set is deleted on daemon shutdown** (`defer DeleteRunnerScaleSet`). Clean state on restart. +7. **Session token refresh is automatic**. 401 -> refresh -> retry (once). Transparent to your code. +8. **Any Scaler error kills the listener loop**. Handle transient errors inside your Scaler. +9. **SetMaxRunners is thread-safe**. Call it anytime to adjust capacity dynamically. +10. **Go 1.25+ required**. 
+ +## Reference docs + +For detailed API signatures, types, error handling, and endpoint maps, read: +- `references/api-reference.md` — Complete SDK reference (types, methods, auth, errors, endpoints) +- `references/macos-adaptation.md` — How to adapt the Docker example to macOS process-based runners diff --git a/.claude/skills/scaleset-sdk/references/api-reference.md b/.claude/skills/scaleset-sdk/references/api-reference.md new file mode 100644 index 0000000..cb24b20 --- /dev/null +++ b/.claude/skills/scaleset-sdk/references/api-reference.md @@ -0,0 +1,515 @@ +# actions/scaleset — Complete API Reference + +## Table of Contents + +1. [Package constants](#1-package-constants) +2. [Core types](#2-core-types) +3. [Job message types](#3-job-message-types) +4. [Client construction](#4-client-construction) +5. [HTTP options](#5-http-options) +6. [Authentication flow](#6-authentication-flow) +7. [Client API methods](#7-client-api-methods) +8. [MessageSessionClient](#8-messagesessionclient) +9. [Listener package](#9-listener-package) +10. [Error handling](#10-error-handling) +11. [Config URL parsing](#11-config-url-parsing) +12. [Full endpoint map](#12-full-endpoint-map) +13. [Statistics fields](#13-statistics-fields) +14. [Long-polling mechanics](#14-long-polling-mechanics) +15. [Concurrency model](#15-concurrency-model) +16. [Known limitations](#16-known-limitations) +17. [Dependencies](#17-dependencies) + +--- + +## 1. Package constants + +```go +const HeaderScaleSetMaxCapacity = "X-ScaleSetMaxCapacity" +const DefaultRunnerGroup = "default" + +type MessageType string +const ( + MessageTypeJobAvailable MessageType = "JobAvailable" + MessageTypeJobAssigned MessageType = "JobAssigned" + MessageTypeJobStarted MessageType = "JobStarted" + MessageTypeJobCompleted MessageType = "JobCompleted" +) + +var ErrInvalidGitHubConfigURL = fmt.Errorf("invalid config URL, should point to an enterprise, org, or repository") +``` + +--- + +## 2. 
Core types + +```go +type RunnerScaleSet struct { + ID int `json:"id,omitempty"` + Name string `json:"name,omitempty"` + RunnerGroupID int `json:"runnerGroupId,omitempty"` + RunnerGroupName string `json:"runnerGroupName,omitempty"` + Labels []Label `json:"labels,omitempty"` + RunnerSetting RunnerSetting `json:"RunnerSetting,omitempty"` + CreatedOn time.Time `json:"createdOn,omitempty"` + RunnerJitConfigURL string `json:"runnerJitConfigUrl,omitempty"` + Statistics *RunnerScaleSetStatistic `json:"statistics,omitempty"` +} + +type Label struct { + Type string `json:"type"` // "System" or empty (defaults to "System") + Name string `json:"name"` +} + +type RunnerSetting struct { + DisableUpdate bool `json:"disableUpdate,omitempty"` +} + +type RunnerGroup struct { + ID int `json:"id"` + Name string `json:"name"` + Size int `json:"size"` + IsDefault bool `json:"isDefaultGroup"` +} + +type RunnerScaleSetSession struct { + SessionID uuid.UUID `json:"sessionId,omitempty"` + OwnerName string `json:"ownerName,omitempty"` + RunnerScaleSet *RunnerScaleSet `json:"runnerScaleSet,omitempty"` + MessageQueueURL string `json:"messageQueueUrl,omitempty"` + MessageQueueAccessToken string `json:"messageQueueAccessToken,omitempty"` + Statistics *RunnerScaleSetStatistic `json:"statistics,omitempty"` +} + +type RunnerScaleSetStatistic struct { + TotalAvailableJobs int `json:"totalAvailableJobs"` + TotalAcquiredJobs int `json:"totalAcquiredJobs"` + TotalAssignedJobs int `json:"totalAssignedJobs"` // THE scaling metric + TotalRunningJobs int `json:"totalRunningJobs"` + TotalRegisteredRunners int `json:"totalRegisteredRunners"` + TotalBusyRunners int `json:"totalBusyRunners"` + TotalIdleRunners int `json:"totalIdleRunners"` +} + +type RunnerScaleSetMessage struct { + MessageID int + Statistics *RunnerScaleSetStatistic + JobAvailableMessages []*JobAvailable + JobAssignedMessages []*JobAssigned + JobStartedMessages []*JobStarted + JobCompletedMessages []*JobCompleted +} + +type 
RunnerScaleSetJitRunnerSetting struct { + Name string `json:"name"` + WorkFolder string `json:"workFolder"` +} + +type RunnerScaleSetJitRunnerConfig struct { + Runner *RunnerReference `json:"runner"` + EncodedJITConfig string `json:"encodedJITConfig"` +} + +type RunnerReference struct { + ID int `json:"id"` + Name string `json:"name"` + RunnerScaleSetID int `json:"runnerScaleSetId"` +} + +type SystemInfo struct { + System string `json:"system"` + Version string `json:"version"` + CommitSHA string `json:"commit_sha"` + ScaleSetID int `json:"scale_set_id"` + Subsystem string `json:"subsystem"` +} + +type GitHubAppAuth struct { + ClientID string + InstallationID int64 + PrivateKey string // PEM-formatted RSA private key +} + +type ProxyFunc func(req *http.Request) (*url.URL, error) +``` + +--- + +## 3. Job message types + +```go +type JobMessageBase struct { + JobMessageType + RunnerRequestID int64 `json:"runnerRequestId"` + RepositoryName string `json:"repositoryName"` + OwnerName string `json:"ownerName"` + JobID string `json:"jobId"` + JobWorkflowRef string `json:"jobWorkflowRef"` + JobDisplayName string `json:"jobDisplayName"` + WorkflowRunID int64 `json:"workflowRunId"` + EventName string `json:"eventName"` + RequestLabels []string `json:"requestLabels"` + QueueTime time.Time `json:"queueTime"` + ScaleSetAssignTime time.Time `json:"scaleSetAssignTime"` + RunnerAssignTime time.Time `json:"runnerAssignTime"` + FinishTime time.Time `json:"finishTime"` +} + +type JobAvailable struct { + AcquireJobURL string `json:"acquireJobUrl"` + JobMessageBase +} + +type JobAssigned struct { + JobMessageBase +} + +type JobStarted struct { + RunnerID int `json:"runnerId"` + RunnerName string `json:"runnerName"` + JobMessageBase +} + +type JobCompleted struct { + Result string `json:"result"` // "Succeeded", "Failed", "Cancelled" + RunnerID int `json:"runnerId"` + RunnerName string `json:"runnerName"` + JobMessageBase +} +``` + +--- + +## 4. 
Client construction + +```go +// GitHub App (recommended) +type ClientWithGitHubAppConfig struct { + GitHubConfigURL string + GitHubAppAuth GitHubAppAuth + SystemInfo SystemInfo +} +func NewClientWithGitHubApp(config ClientWithGitHubAppConfig, options ...HTTPOption) (*Client, error) + +// PAT +type NewClientWithPersonalAccessTokenConfig struct { + GitHubConfigURL string + PersonalAccessToken string + SystemInfo SystemInfo +} +func NewClientWithPersonalAccessToken(config NewClientWithPersonalAccessTokenConfig, options ...HTTPOption) (*Client, error) +``` + +GitHubConfigURL examples: +- Org: `https://github.com/my-org` +- Repo: `https://github.com/my-org/my-repo` +- Enterprise: `https://github.com/enterprises/my-enterprise` +- GHES: `https://ghes.company.com/my-org` + +--- + +## 5. HTTP options + +```go +type HTTPOption func(*httpClientOption) + +func WithRetryMax(retryMax int) HTTPOption // default: 4 +func WithRetryWaitMax(retryWaitMax time.Duration) HTTPOption // default: 30s +func WithTimeout(duration time.Duration) HTTPOption // default: 5min +func WithLogger(logger *slog.Logger) HTTPOption // default: discard +func WithRootCAs(rootCAs *x509.CertPool) HTTPOption // custom CA pool +func WithoutTLSVerify() HTTPOption // skip TLS verification +func WithProxy(proxyFunc ProxyFunc) HTTPOption // custom proxy +func WithRetryableHTTPClint(client *retryablehttp.Client) HTTPOption // NOTE: typo in name is intentional (published API) +``` + +--- + +## 6. Authentication flow + +### GitHub App path (4 steps, all automatic) + +1. **Create JWT**: RS256 signed, iat = now-60s (clock skew), exp = iat+9min, iss = ClientID +2. **Get installation access token**: `POST /app/installations/{id}/access_tokens` with Bearer JWT +3. **Get registration token**: `POST /orgs/{org}/actions/runners/registration-token` (or /repos/ or /enterprises/) with Bearer access_token +4. 
**Get admin connection**: `POST /actions/runner-registration` with `Authorization: RemoteAuth {registration_token}` — returns `ActionsServiceURL` + `AdminToken` (JWT) + +### PAT path (2 steps) + +1. **Get registration token**: same endpoint, with Bearer PAT directly +2. **Get admin connection**: same as step 4 above + +### Token refresh + +`updateTokenIfNeeded()` runs before every Actions Service request. If admin token expires within 60s, full chain re-executes. Expiry parsed from JWT claims (ParseUnverified). + +The admin connection request retries on 401 and 403 (propagation delays). + +--- + +## 7. Client API methods + +All methods are thread-safe (mutex-protected). + +### Scale Set CRUD + +```go +func (c *Client) CreateRunnerScaleSet(ctx, *RunnerScaleSet) (*RunnerScaleSet, error) +// POST /_apis/runtime/runnerscalesets +// Auto-adds label from Name if no labels provided. Errors if both Name and Labels empty. + +func (c *Client) GetRunnerScaleSet(ctx, runnerGroupID int, name string) (*RunnerScaleSet, error) +// GET /_apis/runtime/runnerscalesets?runnerGroupId={id}&name={name} +// Returns nil,nil if count=0. Error if count>1. 
+ +func (c *Client) GetRunnerScaleSetByID(ctx, id int) (*RunnerScaleSet, error) +// GET /_apis/runtime/runnerscalesets/{id} + +func (c *Client) UpdateRunnerScaleSet(ctx, id int, *RunnerScaleSet) (*RunnerScaleSet, error) +// PATCH /_apis/runtime/runnerscalesets/{id} + +func (c *Client) DeleteRunnerScaleSet(ctx, id int) error +// DELETE /_apis/runtime/runnerscalesets/{id} — expects 204 +``` + +### Runner management + +```go +func (c *Client) GetRunner(ctx, runnerID int) (*RunnerReference, error) +func (c *Client) GetRunnerByName(ctx, name string) (*RunnerReference, error) // nil,nil if not found +func (c *Client) RemoveRunner(ctx, runnerID int64) error // expects 204 +``` + +### JIT config + +```go +func (c *Client) GenerateJitRunnerConfig(ctx, *RunnerScaleSetJitRunnerSetting, scaleSetID int) (*RunnerScaleSetJitRunnerConfig, error) +// POST /_apis/runtime/runnerscalesets/{id}/generatejitconfig +``` + +### Runner group + +```go +func (c *Client) GetRunnerGroupByName(ctx, name string) (*RunnerGroup, error) +// Default group has ID=1 (hardcode for "default" to skip this call) +``` + +### Message session + +```go +func (c *Client) MessageSessionClient(ctx, scaleSetID int, owner string, options ...HTTPOption) (*MessageSessionClient, error) +// Creates session immediately (POST). owner = hostname or UUID. +``` + +### Utility + +```go +func (c *Client) SetSystemInfo(info SystemInfo) +func (c *Client) SystemInfo() SystemInfo +func (c *Client) DebugInfo() string // JSON with HasProxy, HasRootCA, SystemInfo +``` + +--- + +## 8. MessageSessionClient + +```go +func (c *MessageSessionClient) GetMessage(ctx, lastMessageID, maxCapacity int) (*RunnerScaleSetMessage, error) +// Long-polls. 200=message, 202=nil,nil (no messages). Auto-refreshes on 401. + +func (c *MessageSessionClient) DeleteMessage(ctx, messageID int) error +// Ack. 204=success. Auto-refreshes on 401. + +func (c *MessageSessionClient) AcquireJobs(ctx, requestIDs []int64) ([]int64, error) +// Claims jobs. 
Returns actually acquired IDs (may be subset). + +func (c *MessageSessionClient) Session() RunnerScaleSetSession +// Returns copy of current session. + +func (c *MessageSessionClient) Close(ctx) error +// Deletes session. Always call (use defer). +``` + +--- + +## 9. Listener package + +```go +import "github.com/actions/scaleset/listener" + +type Config struct { + ScaleSetID int + MaxRunners int + Logger *slog.Logger +} + +func New(client Client, config Config, options ...Option) (*Listener, error) +func (l *Listener) Run(ctx context.Context, scaler Scaler) error +func (l *Listener) SetMaxRunners(count int) // thread-safe, takes effect on next poll + +type Scaler interface { + HandleDesiredRunnerCount(ctx context.Context, count int) (int, error) + HandleJobStarted(ctx context.Context, jobInfo *scaleset.JobStarted) error + HandleJobCompleted(ctx context.Context, jobInfo *scaleset.JobCompleted) error +} + +type Client interface { + GetMessage(ctx context.Context, lastMessageID, maxCapacity int) (*scaleset.RunnerScaleSetMessage, error) + DeleteMessage(ctx context.Context, messageID int) error + AcquireJobs(ctx context.Context, requestIDs []int64) ([]int64, error) + Session() scaleset.RunnerScaleSetSession +} + +type MetricsRecorder interface { + RecordStatistics(statistics *scaleset.RunnerScaleSetStatistic) + RecordJobStarted(msg *scaleset.JobStarted) + RecordJobCompleted(msg *scaleset.JobCompleted) + RecordDesiredRunners(count int) +} + +func WithMetricsRecorder(recorder MetricsRecorder) Option +``` + +### Run() loop internals + +1. Read initial session statistics +2. Call `HandleDesiredRunnerCount(ctx, initialStats.TotalAssignedJobs)` +3. 
Loop: + - `GetMessage(ctx, lastMessageID, maxRunners)` — long-polls ~50s + - If nil: call `HandleDesiredRunnerCount` with cached stats, continue + - If message: ack (DeleteMessage) → AcquireJobs → HandleJobStarted(s) → HandleJobCompleted(s) → HandleDesiredRunnerCount + - Any error from Scaler: return error (terminates Run) + +--- + +## 10. Error handling + +### Sentinel errors + +```go +var RunnerNotFoundError = scalesetError("runner not found") +var RunnerExistsError = scalesetError("runner exists") +var JobStillRunningError = scalesetError("job still running") +var MessageQueueTokenExpiredError = scalesetError("message queue token expired") +``` + +Use `errors.Is(err, scaleset.RunnerNotFoundError)` etc. + +### Exception mapping + +Server returns JSON `{"typeName":"...", "message":"..."}`. Mapped: +- `AgentExistsException` → `RunnerExistsError` +- `AgentNotFoundException` → `RunnerNotFoundError` +- `JobStillRunningException` → `JobStillRunningError` + +### Error metadata + +All HTTP errors include ActivityId and X-GitHub-Request-Id headers in the message. + +--- + +## 11. Config URL parsing + +| URL pattern | Scope | Example | +|---|---|---| +| `github.com/{org}` | Organization | `https://github.com/my-org` | +| `github.com/{org}/{repo}` | Repository | `https://github.com/my-org/my-repo` | +| `github.com/enterprises/{name}` | Enterprise | `https://github.com/enterprises/my-ent` | +| `ghes.example.com/{org}` | Org (GHES) | `https://ghes.corp.com/my-org` | + +API URL routing: +- Hosted (github.com, *.ghe.com): `api.github.com` or `api.{host}` +- GHES: `{host}/api/v3` +- `GITHUB_ACTIONS_FORCE_GHES` env var forces GHES mode + +--- + +## 12. 
Full endpoint map + +| Method | HTTP | Endpoint | Status | +|---|---|---|---| +| CreateRunnerScaleSet | POST | `/_apis/runtime/runnerscalesets` | 200 | +| GetRunnerScaleSet | GET | `/_apis/runtime/runnerscalesets?runnerGroupId=&name=` | 200 | +| GetRunnerScaleSetByID | GET | `/_apis/runtime/runnerscalesets/{id}` | 200 | +| UpdateRunnerScaleSet | PATCH | `/_apis/runtime/runnerscalesets/{id}` | 200 | +| DeleteRunnerScaleSet | DELETE | `/_apis/runtime/runnerscalesets/{id}` | 204 | +| GetRunnerGroupByName | GET | `/_apis/runtime/runnergroups/?groupName=` | 200 | +| GetRunner | GET | `/_apis/distributedtask/pools/0/agents/{id}` | 200 | +| GetRunnerByName | GET | `/_apis/distributedtask/pools/0/agents?agentName=` | 200 | +| RemoveRunner | DELETE | `/_apis/distributedtask/pools/0/agents/{id}` | 204 | +| GenerateJitRunnerConfig | POST | `/_apis/runtime/runnerscalesets/{id}/generatejitconfig` | 200 | +| createMessageSession | POST | `/_apis/runtime/runnerscalesets/{id}/sessions` | 200 | +| deleteMessageSession | DELETE | `/_apis/runtime/runnerscalesets/{id}/sessions/{sessionId}` | 204 | +| refreshMessageSession | PATCH | `/_apis/runtime/runnerscalesets/{id}/sessions/{sessionId}` | 200 | +| AcquireJobs | POST | `/_apis/runtime/runnerscalesets/{id}/acquirejobs` | 200 | +| GetMessage | GET | `{messageQueueURL}?lastMessageId=` | 200/202 | +| DeleteMessage | DELETE | `{messageQueueURL}/{messageId}` | 204 | +| Registration token (org) | POST | `/orgs/{org}/actions/runners/registration-token` | 201 | +| Registration token (repo) | POST | `/repos/{owner}/{repo}/actions/runners/registration-token` | 201 | +| Registration token (ent) | POST | `/enterprises/{ent}/actions/runners/registration-token` | 201 | +| Access token (App) | POST | `/app/installations/{id}/access_tokens` | 201 | +| Admin connection | POST | `/actions/runner-registration` | 2xx | + +--- + +## 13. 
Statistics fields + +```go +type RunnerScaleSetStatistic struct { + TotalAvailableJobs int // jobs waiting to be assigned + TotalAcquiredJobs int // jobs claimed by AcquireJobs + TotalAssignedJobs int // THE metric: jobs that need runners + TotalRunningJobs int // jobs currently executing + TotalRegisteredRunners int // runners registered with GitHub + TotalBusyRunners int // runners currently running a job + TotalIdleRunners int // runners waiting for a job +} +``` + +`TotalAssignedJobs >= TotalRunningJobs`. Use `TotalAssignedJobs` for scaling, NOT individual message counts (messages are capped at 50 per batch). + +--- + +## 14. Long-polling mechanics + +- `GetMessage` uses HTTP long-polling (~50s server-side timeout) +- HTTP 200 = messages available (returned immediately) +- HTTP 202 = no messages (timeout, returns nil,nil) +- `lastMessageId` query param prevents reprocessing +- `X-ScaleSetMaxCapacity` header tells server your capacity +- Messages not ack'd (DeleteMessage) are redelivered +- Job reassignment: jobs can appear as JobAssigned → JobCompleted(Cancelled) up to 3 times with incremental delays + +--- + +## 15. Concurrency model + +- `Client.mu sync.Mutex` — every public method acquires it +- `MessageSessionClient.mu sync.Mutex` — separate mutex, every public method acquires it +- `Listener.maxRunners atomic.Uint32` — SetMaxRunners is lock-free +- When MessageSessionClient needs the parent Client (for token refresh), it explicitly acquires innerClient.mu + +--- + +## 16. Known limitations + +1. **Public Preview** — interfaces may change +2. **Go 1.25+ required** +3. **Message batch cap of 50** — don't count individual messages for scaling +4. **Silent label dropping on GHES < 3.21** without feature flag +5. **Typo in API**: `WithRetryableHTTPClint` (missing 'e') — can't be fixed +6. **HTTP defaults**: retryMax=4, retryWaitMax=30s, timeout=5min +7. **All response bodies read into memory** (BOM-trimmed) +8. 
**`GITHUB_ACTIONS_FORCE_GHES`** env var forces GHES mode (check existence, not value) + +--- + +## 17. Dependencies + +| Package | Version | Role | +|---|---|---| +| golang-jwt/jwt/v4 | v4.5.2 | JWT signing/verification | +| hashicorp/go-retryablehttp | v0.7.8 | HTTP retries | +| google/uuid | v1.6.0 | Session IDs | +| spf13/cobra | v1.10.2 | CLI framework (example) | +| stretchr/testify | v1.11.1 | Testing | diff --git a/.claude/skills/scaleset-sdk/references/macos-adaptation.md b/.claude/skills/scaleset-sdk/references/macos-adaptation.md new file mode 100644 index 0000000..ca43e8f --- /dev/null +++ b/.claude/skills/scaleset-sdk/references/macos-adaptation.md @@ -0,0 +1,200 @@ +# Adapting the Docker Example to macOS Process-Based Runners + +## What stays the same (no changes) + +Everything from the `scaleset` and `listener` packages is backend-agnostic: + +- Client creation (PAT or GitHub App) +- Scale set CRUD +- Message session + listener loop +- `listener.Scaler` interface contract (same 3 methods, same semantics) +- Scaling formula: `target = min(maxRunners, minRunners + count)` +- HandleJobStarted: state transition (idle → busy) +- Signal handling and shutdown flow +- JIT config generation via `GenerateJitRunnerConfig` + +## What must change + +### 1. Replace Docker with exec.Command + +**Docker version:** +```go +c, _ := dockerClient.ContainerCreate(ctx, &container.Config{ + Image: runnerImage, + User: "runner", + Cmd: []string{"/home/runner/run.sh"}, + Env: []string{"ACTIONS_RUNNER_INPUT_JITCONFIG=" + jit.EncodedJITConfig}, +}, nil, nil, nil, name) +dockerClient.ContainerStart(ctx, c.ID, container.StartOptions{}) +``` + +**macOS version:** +```go +cmd := exec.CommandContext(ctx, filepath.Join(workDir, "run.sh")) +cmd.Dir = workDir +cmd.Env = append(os.Environ(), "ACTIONS_RUNNER_INPUT_JITCONFIG="+jit.EncodedJITConfig) +cmd.Stdout = os.Stdout +cmd.Stderr = os.Stderr +cmd.Start() +``` + +### 2. 
Runner state tracking + +**Docker version:** +```go +type runnerState struct { + mu sync.Mutex + idle map[string]string // name → containerID + busy map[string]string // name → containerID +} +``` + +**macOS version:** +```go +type runnerProcess struct { + cmd *exec.Cmd + workDir string + pid int +} + +type runnerState struct { + mu sync.Mutex + idle map[string]*runnerProcess // name → process + busy map[string]*runnerProcess // name → process +} +``` + +### 3. HandleJobCompleted — cleanup + +**Docker version:** +```go +func (s *Scaler) HandleJobCompleted(ctx context.Context, jobInfo *scaleset.JobCompleted) error { + containerID := s.runners.markDone(jobInfo.RunnerName) + return s.dockerClient.ContainerRemove(ctx, containerID, container.RemoveOptions{Force: true}) +} +``` + +**macOS version:** +```go +func (s *Scaler) HandleJobCompleted(ctx context.Context, jobInfo *scaleset.JobCompleted) error { + proc := s.runners.markDone(jobInfo.RunnerName) + if proc.cmd.Process != nil { + _ = proc.cmd.Process.Kill() + _ = proc.cmd.Wait() + } + return os.RemoveAll(proc.workDir) +} +``` + +### 4. Shutdown + +**Docker version:** `ContainerRemove(force: true)` for all containers. + +**macOS version:** +```go +func (s *Scaler) shutdown(ctx context.Context) { + s.runners.mu.Lock() + defer s.runners.mu.Unlock() + for _, proc := range s.runners.idle { + _ = proc.cmd.Process.Kill() + _ = proc.cmd.Wait() + _ = os.RemoveAll(proc.workDir) + } + for _, proc := range s.runners.busy { + _ = proc.cmd.Process.Kill() + _ = proc.cmd.Wait() + _ = os.RemoveAll(proc.workDir) + } + clear(s.runners.idle) + clear(s.runners.busy) +} +``` + +### 5. Runner binary management + +Docker has the runner inside the image.
On macOS you need: + +```go +// Download + extract once at startup +func (m *Manager) ensureRunnerBits(ctx context.Context, version string) (string, error) { + // Resolve "latest" → actual version via GitHub Releases API + // Download https://github.com/actions/runner/releases/download/v{ver}/actions-runner-osx-{arch}-{ver}.tar.gz + // Extract to cacheDir/{version}/ + // Return path to extracted directory +} + +// Copy cached bits to each runner's workdir +func (m *Manager) prepareWorkdir(baseDir, runnerID string) (string, error) { + workDir := filepath.Join(baseDir, runnerID) + os.MkdirAll(workDir, 0o755) + copyDir(cachedRunnerDir, workDir) + return workDir, nil +} +``` + +### 6. Config changes + +Remove: +- `RunnerImage` field + +Add: +- `RunnerVersion` (string: "latest" or pinned like "2.330.0") +- `CacheDir` (path for cached runner binaries) +- `WorkdirBase` (base path for runner workdirs) + +## Complete startRunner for macOS + +```go +func (s *Scaler) startRunner(ctx context.Context) (string, error) { + name := fmt.Sprintf("runner-%s", randHex(4)) + + // 1. Generate JIT config + jit, err := s.scalesetClient.GenerateJitRunnerConfig(ctx, + &scaleset.RunnerScaleSetJitRunnerSetting{Name: name}, + s.scaleSetID, + ) + if err != nil { + return "", fmt.Errorf("generate JIT config: %w", err) + } + + // 2. Prepare workdir (copy cached runner bits) + workDir, err := s.manager.prepareWorkdir(s.workdirBase, name) + if err != nil { + return "", fmt.Errorf("prepare workdir: %w", err) + } + + // 3. Start runner process with JIT config + cmd := exec.CommandContext(ctx, filepath.Join(workDir, "run.sh")) + cmd.Dir = workDir + cmd.Env = append(os.Environ(), "ACTIONS_RUNNER_INPUT_JITCONFIG="+jit.EncodedJITConfig) + cmd.Stdout = os.Stdout + cmd.Stderr = os.Stderr + + if err := cmd.Start(); err != nil { + _ = os.RemoveAll(workDir) + return "", fmt.Errorf("start runner: %w", err) + } + + // 4. 
Track + s.runners.addIdle(name, &runnerProcess{ + cmd: cmd, + workDir: workDir, + pid: cmd.Process.Pid, + }) + + return name, nil +} +``` + +## Architecture comparison + +| Layer | Docker | macOS | +|---|---|---| +| Runner backend | Container | exec.Cmd process | +| Config delivery | JITCONFIG env var | Same env var | +| State tracking | name → containerID | name → *runnerProcess | +| Scale up | ContainerCreate + Start | exec.Command + Start | +| Scale down | ContainerRemove(force) | Kill + Wait + RemoveAll | +| Shutdown | Force remove all | Kill all + cleanup dirs | +| Isolation | Container | Filesystem (workdirs) | +| Runner binary | Inside Docker image | Downloaded + cached | diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml new file mode 100644 index 0000000..fd6bdda --- /dev/null +++ b/.github/ISSUE_TEMPLATE/bug_report.yml @@ -0,0 +1,59 @@ +name: Bug Report +description: Report a bug +labels: [bug] +body: + - type: textarea + id: description + attributes: + label: Description + description: What happened? + validations: + required: true + - type: textarea + id: expected + attributes: + label: Expected behavior + description: What did you expect to happen? + validations: + required: true + - type: textarea + id: reproduce + attributes: + label: Steps to reproduce + description: How can we reproduce this? + validations: + required: true + - type: input + id: version + attributes: + label: ghr version + description: Output of `ghr version` + placeholder: "e.g. 
ghr 1.0.0 (commit: abc1234, built: 2026-01-01T00:00:00Z)" + validations: + required: true + - type: dropdown + id: os + attributes: + label: macOS version + options: + - macOS 15 (Sequoia) + - macOS 14 (Sonoma) + - macOS 13 (Ventura) + - Other + validations: + required: true + - type: dropdown + id: arch + attributes: + label: Architecture + options: + - Apple Silicon (arm64) + - Intel (amd64) + validations: + required: true + - type: textarea + id: logs + attributes: + label: Relevant logs + description: Paste any relevant log output + render: shell diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml new file mode 100644 index 0000000..3ba13e0 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/config.yml @@ -0,0 +1 @@ +blank_issues_enabled: false diff --git a/.github/ISSUE_TEMPLATE/feature_request.yml b/.github/ISSUE_TEMPLATE/feature_request.yml new file mode 100644 index 0000000..fba5a47 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/feature_request.yml @@ -0,0 +1,23 @@ +name: Feature Request +description: Suggest a feature +labels: [enhancement] +body: + - type: textarea + id: problem + attributes: + label: Problem + description: What problem does this solve? + validations: + required: true + - type: textarea + id: solution + attributes: + label: Proposed solution + description: How would you like it to work? + validations: + required: true + - type: textarea + id: alternatives + attributes: + label: Alternatives considered + description: Any other approaches you've thought about? 
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml new file mode 100644 index 0000000..fbc93bc --- /dev/null +++ b/.github/workflows/ci.yml @@ -0,0 +1,69 @@ +name: CI + +on: + push: + branches: [main] + pull_request: + branches: [main] + +concurrency: + group: ci-${{ github.ref }} + cancel-in-progress: true + +jobs: + lint: + name: Lint + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 + - uses: actions/setup-go@d35c59abb061a4a6fb18e82ac0862c26744d6ab5 # v5.5.0 + with: + go-version-file: go.mod + - uses: golangci/golangci-lint-action@4afd733a84b1f43292c63897423277bb7f4313a9 # v8.0.0 + with: + version: latest + + vet: + name: Vet & Format + runs-on: ubuntu-latest + needs: lint + steps: + - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 + - uses: actions/setup-go@d35c59abb061a4a6fb18e82ac0862c26744d6ab5 # v5.5.0 + with: + go-version-file: go.mod + - run: go vet ./... + - run: make fmt-check + + build: + name: Build + runs-on: ubuntu-latest + needs: vet + strategy: + matrix: + goos: [darwin, linux] + goarch: [amd64, arm64] + steps: + - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 + - uses: actions/setup-go@d35c59abb061a4a6fb18e82ac0862c26744d6ab5 # v5.5.0 + with: + go-version-file: go.mod + - run: make build + env: + GOOS: ${{ matrix.goos }} + GOARCH: ${{ matrix.goarch }} + CGO_ENABLED: "0" + + test: + name: Test + runs-on: ${{ matrix.os }} + needs: build + strategy: + matrix: + os: [macos-latest, ubuntu-latest] + steps: + - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 + - uses: actions/setup-go@d35c59abb061a4a6fb18e82ac0862c26744d6ab5 # v5.5.0 + with: + go-version-file: go.mod + - run: make test diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml new file mode 100644 index 0000000..f11f713 --- /dev/null +++ b/.github/workflows/release.yml @@ -0,0 +1,27 @@ +name: Release + +on: + push: + 
tags: + - "v*.*.*" + +permissions: + contents: write + +jobs: + release: + name: Release + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 + with: + fetch-depth: 0 + - uses: actions/setup-go@d35c59abb061a4a6fb18e82ac0862c26744d6ab5 # v5.5.0 + with: + go-version-file: go.mod + - uses: goreleaser/goreleaser-action@9ed2f89a662bf1735a48bc8557fd212fa902bebf # v6.3.0 + with: + version: latest + args: release --clean + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} diff --git a/.gitignore b/.gitignore index 98fe80e..47bf342 100644 --- a/.gitignore +++ b/.gitignore @@ -1,21 +1,43 @@ -.serena -*.md -.DS_Store -.gocache/ -.env -.env.* -config.yaml -*.log -*.out -*.test -coverage.* +# Binary +/ghr +/ghr.exe +/v2/ghr +/v2/ghr.exe + +# Build bin/ dist/ -ghr.exe -**/*.exe -**/*.dll -**/*.so -**/*.dylib + +# Go +*.test +*.out +coverage.* + +# Environment & secrets +.env +.env.* +credentials.json +*.pem + +# OS +.DS_Store +Thumbs.db + +# IDE .vscode/ .idea/ -ghr +*.swp +*.swo +*~ + +# Claude Code +.claude/settings.local.json +.serena + +# Logs (local dev) +*.log + +# Runner workdirs (local dev) +runners/ +cache/ +state/ diff --git a/.golangci.yml b/.golangci.yml new file mode 100644 index 0000000..32a6311 --- /dev/null +++ b/.golangci.yml @@ -0,0 +1,86 @@ +version: "2" + +run: + timeout: 5m + +linters: + enable: + - errcheck + - govet + - ineffassign + - staticcheck + - unused + - gocritic + - misspell + - nolintlint + - prealloc + - revive + - unconvert + - unparam + - errorlint + - bodyclose + - contextcheck + - nilerr + - exhaustive + + settings: + errcheck: + exclude-functions: + - (net/http.ResponseWriter).Write + - (*log/slog.Logger).Info + - (*log/slog.Logger).Debug + - (*log/slog.Logger).Warn + - (*log/slog.Logger).Error + + gocritic: + enabled-tags: + - diagnostic + - style + - performance + + revive: + rules: + - name: blank-imports + - name: context-as-argument + - name: context-keys-type + - name: 
error-return + - name: error-strings + - name: error-naming + - name: if-return + - name: increment-decrement + - name: var-naming + - name: range + - name: receiver-naming + - name: time-naming + - name: unexported-return + - name: indent-error-flow + - name: errorf + - name: empty-block + - name: superfluous-else + - name: unused-parameter + - name: unreachable-code + + exhaustive: + default-signifies-exhaustive: true + + exclusions: + presets: + - std-error-handling + rules: + - path: _test\.go + linters: + - errcheck + - gocritic + - unparam + - revive + - linters: + - gocritic + text: "hugeParam: r is heavy" + path: internal/logging/handler\.go + - linters: + - gocritic + text: "filepathJoin" + path: internal/launchd/ + - linters: + - nilerr + path: internal/cli/auth\.go diff --git a/.goreleaser.yml b/.goreleaser.yml new file mode 100644 index 0000000..62a4b70 --- /dev/null +++ b/.goreleaser.yml @@ -0,0 +1,49 @@ +version: 2 + +builds: + - main: ./cmd/ghr + binary: ghr + env: + - CGO_ENABLED=0 + goos: + - darwin + - linux + goarch: + - amd64 + - arm64 + ldflags: + - -s -w + - -X '{{ .ModulePath }}/internal/cli.version={{ .Version }}' + - -X '{{ .ModulePath }}/internal/cli.commit={{ .ShortCommit }}' + - -X '{{ .ModulePath }}/internal/cli.date={{ .Date }}' + +archives: + - formats: [tar.gz] + name_template: "{{ .ProjectName }}_{{ .Version }}_{{ .Os }}_{{ .Arch }}" + +checksum: + name_template: checksums.txt + +changelog: + sort: asc + filters: + exclude: + - "^docs:" + - "^style:" + - "^chore\\(deps\\):" + groups: + - title: Features + regexp: "^feat" + - title: Bug Fixes + regexp: "^fix" + - title: Refactoring + regexp: "^refactor" + - title: Other + order: 999 + +release: + github: + owner: RedBoardDev + name: gh-runners-tool + prerelease: auto + name_template: "v{{ .Version }}" diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..8a6e6f2 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,74 @@ +# ghr — GitHub Actions Runner Controller for macOS + +## 
Project + +Self-hosted GitHub Actions runner controller built on the official `actions/scaleset` Go SDK. Manages ephemeral runners via JIT configs, scale sets, and long-polling. Targets macOS (Apple Silicon + Intel). + +## Quick Reference + +```bash +go build ./cmd/ghr # build +go test ./... # test all +go test -race ./... # test with race detector +go vet ./... # static analysis +gofmt -w . # format +golangci-lint run # lint (if installed) +``` + +## Architecture + +Package-by-feature under `internal/`. No DDD layers. See `specs/00-architecture.md`. + +``` +cmd/ghr/main.go → wiring, DI, CLI +internal/cli/ → Cobra commands (thin) +internal/auth/ → credentials (login, load, save) +internal/config/ → YAML + env loading +internal/controller/ → scale set orchestration + Scaler +internal/runner/ → binary download + process management +internal/github/ → scaleset SDK adapter +internal/health/ → health monitoring +internal/notification/ → event-driven alerts (Discord, webhooks) +internal/monitoring/ → push-based reporters (Uptime Kuma) +internal/api/ → Unix socket JSON API (IPC for ghr status) +internal/launchd/ → macOS service management +internal/logging/ → slog multi-writer, rotation +internal/model/ → shared structs only (no interfaces, no logic) +``` + +## Code Conventions + +- Go 1.25+ required (for actions/scaleset SDK) +- Interfaces defined where consumed, not where implemented +- Structs with exported fields, not getter interfaces +- Error wrapping: `fmt.Errorf("context: %w", err)` +- `oklog/run` for daemon goroutine lifecycle +- `context.Context` as first param everywhere +- No `any` without justification, no `_` to ignore errors +- Table-driven tests with `t.Run` subtests + +## Commit Convention + +`type(scope): description` — types: feat, fix, docs, refactor, test, chore + +## Key Dependencies + +- `github.com/actions/scaleset` — Scale Set API + listener +- `github.com/spf13/cobra` — CLI +- `github.com/oklog/run` — goroutine lifecycle +- 
`github.com/joho/godotenv` — .env loading +- `gopkg.in/yaml.v3` — config +- `log/slog` (stdlib) — structured logging + +## Specs + +All specs in `specs/`. Read before implementing: +- `00-architecture.md` — package structure, interfaces, DI wiring +- `01-core-scaleset.md` — scale set engine, scaler, runner manager +- `02-cli-commands.md` — start/stop/run/status/purge/login +- `03-health-monitor.md` — health checks, issue detection +- `04-logging.md` — structured logging, rotation, per-runner files +- `05-notifications.md` — Discord, webhook providers +- `06-uptime-kuma.md` — push monitoring +- `07-config.md` — YAML schema, validation, defaults +- `08-auth.md` — login wizard, credentials file, resolution order diff --git a/Makefile b/Makefile new file mode 100644 index 0000000..c30c84d --- /dev/null +++ b/Makefile @@ -0,0 +1,46 @@ +BINARY := ghr +MODULE := github.com/RedBoardDev/gh-runners-tool/v2 +CMD := ./cmd/ghr +VERSION ?= $(shell git describe --tags --always --dirty 2>/dev/null || echo "dev") +COMMIT ?= $(shell git rev-parse --short HEAD 2>/dev/null || echo "none") +DATE ?= $(shell date -u '+%Y-%m-%dT%H:%M:%SZ') +LDFLAGS := -s -w -X '$(MODULE)/internal/cli.version=$(VERSION)' -X '$(MODULE)/internal/cli.commit=$(COMMIT)' -X '$(MODULE)/internal/cli.date=$(DATE)' + +.PHONY: build test lint vet fmt fmt-check vuln clean install snapshot ci help + +build: ## Build the binary + go build -ldflags "$(LDFLAGS)" -o $(BINARY) $(CMD) + +test: ## Run tests with race detector + go test -race -count=1 ./... + +lint: ## Run golangci-lint + golangci-lint run + +vet: ## Run go vet + go vet ./... + +fmt: ## Format code + gofmt -w . + +fmt-check: ## Check formatting (CI) + @test -z "$$(gofmt -l .)" || (echo "Files not formatted:" && gofmt -l . && exit 1) + +vuln: ## Run govulncheck + govulncheck ./... 
+ +clean: ## Remove build artifacts + rm -rf $(BINARY) dist/ + +install: ## Install locally via go install + go install -ldflags "$(LDFLAGS)" $(CMD) + +snapshot: ## Build a snapshot release (no publish) + goreleaser release --snapshot --clean + +ci: lint vet fmt-check build test vuln ## Run all CI checks locally + +help: ## Show this help + @grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-15s\033[0m %s\n", $$1, $$2}' + +.DEFAULT_GOAL := help diff --git a/README.md b/README.md new file mode 100644 index 0000000..4ea81ad --- /dev/null +++ b/README.md @@ -0,0 +1,138 @@ +# ghr - GitHub Actions Runner Controller for macOS + +[![Go](https://img.shields.io/badge/Go-1.25+-00ADD8?logo=go&logoColor=white)](https://go.dev/) +[![GitHub Actions](https://img.shields.io/badge/GitHub%20Actions-Runner%20Controller-2088FF?logo=githubactions&logoColor=white)](https://github.com/features/actions) +[![macOS](https://img.shields.io/badge/macOS-Apple%20Silicon%20%7C%20Intel-000000?logo=apple&logoColor=white)](https://www.apple.com/macos/) + +## Overview + +**ghr** is a self-hosted GitHub Actions runner controller built on the official [`actions/scaleset`](https://github.com/actions/scaleset) Go SDK. It manages ephemeral runners via JIT configs, scale sets, and long-polling - targeting macOS (Apple Silicon and Intel). + +Define runner groups with min/max scaling in a YAML config, and ghr handles binary downloads, runner registration, process lifecycle, health monitoring, and graceful shutdown. It integrates with macOS `launchd` for service management and supports Discord/webhook notifications and Uptime Kuma push monitoring. 
+ +### Key Features + +- **Scale Set orchestration** - Runner groups with configurable min/max scaling via the official GitHub SDK +- **Ephemeral JIT runners** - Provisioned on-demand with just-in-time configs, cleaned up after each job +- **macOS native** - First-class `launchd` integration (`ghr start/stop/restart/status`) +- **YAML configuration** - Single config file with environment variables for secrets +- **Health monitoring** - Detection of stuck runners, resource issues, and connectivity problems +- **Notifications** - Discord and webhook alerts for runner events +- **Uptime Kuma** - Push-based monitoring integration +- **Structured logging** - `slog`-based with file rotation and per-runner log files + +## Getting Started + +### Prerequisites + +- **Go 1.25+** (required by the `actions/scaleset` SDK) +- **macOS** (Apple Silicon or Intel) +- A GitHub organization or repository with self-hosted runner access +- A GitHub PAT or App credentials with runner management permissions + +### Build + +```bash +go build -o ghr ./cmd/ghr +``` + +### Configuration + +Create a `config.yaml`: + +```yaml +github: + url: "https://github.com/my-org" + runner_group: "default" + +runner: + version: "latest" + cache_dir: "/var/lib/ghr/cache" + workdir_base: "/var/lib/ghr/runners" + +groups: + - name: "ci-runners" + max_runners: 10 + min_runners: 2 + labels: ["ci", "macos"] + + - name: "deploy-runners" + max_runners: 2 + labels: ["deploy", "macos"] +``` + +Authentication is handled via `ghr login`. Tokens are never stored in the config file - use environment variables or the credentials store. 
+ +### Usage + +```bash +# Authenticate with GitHub +ghr login + +# Start as a launchd service (daemon) +ghr start --config config.yaml + +# Run in foreground (debug mode) +ghr run --config config.yaml + +# Check status +ghr status + +# Restart after config changes +ghr restart + +# Stop the daemon +ghr stop + +# Emergency reset (kill all runners, clean workdirs) +ghr purge +``` + +### Run Tests + +```bash +go test ./... # all tests +go test -race ./... # with race detector +go vet ./... # static analysis +golangci-lint run # lint (if installed) +``` + +## Repository Structure + +``` +ghr/ +├── cmd/ghr/main.go # Entrypoint +├── internal/ +│ ├── cli/ # Cobra commands +│ ├── auth/ # Credentials management +│ ├── config/ # YAML + env config +│ ├── runner/ # Binary download & process lifecycle +│ ├── github/ # Scale set SDK adapter +│ ├── model/ # Shared data structs +│ └── logging/ # Structured logging +├── go.mod +└── go.sum +``` + +## Key Dependencies + +| Package | Purpose | +|---------|---------| +| [`actions/scaleset`](https://github.com/actions/scaleset) | Official GitHub Scale Set API + listener | +| [`spf13/cobra`](https://github.com/spf13/cobra) | CLI framework | +| [`oklog/run`](https://github.com/oklog/run) | Goroutine lifecycle management | +| [`joho/godotenv`](https://github.com/joho/godotenv) | `.env` file loading | +| `gopkg.in/yaml.v3` | YAML config parsing | +| `log/slog` (stdlib) | Structured logging | + +## Reporting Issues + +[GitHub Issues](https://github.com/RedBoardDev/gh-runners-tool/issues) + +## License + +Proprietary. All rights reserved. + +## Contact + +- GitHub: [@RedBoardDev](https://github.com/RedBoardDev) diff --git a/VERSION.md b/VERSION.md new file mode 100644 index 0000000..892397a --- /dev/null +++ b/VERSION.md @@ -0,0 +1,41 @@ +# Versioning & Releases + +ghr uses git tags as the single source of truth for versioning. No version file to maintain. 
+ +## Creating a release + +```bash +git tag v1.0.0 +git push --tags +``` + +This triggers the release workflow which builds binaries for darwin/linux (amd64 + arm64) and publishes a GitHub Release with archives and checksums. + +## Version format + +Follow [semver](https://semver.org): + +| Tag | When | +|---|---| +| `v1.0.0` | First stable release | +| `v1.1.0` | New feature, backward compatible | +| `v1.0.1` | Bug fix | +| `v2.0.0` | Breaking change | +| `v1.0.0-rc.1` | Pre-release (marked automatically) | + +## How it works + +The version is injected at build time via Go ldflags. The Makefile and GoReleaser both inject `version`, `commit`, and `date` into the binary. Running `ghr version` prints these values. + +When building manually without ldflags (`go build ./cmd/ghr`), the version defaults to `dev`. + +## Fixing a bad tag + +```bash +git tag -d v1.0.0 # delete locally +git push --delete origin v1.0.0 # delete on remote +git tag v1.0.1 # create correct tag +git push --tags +``` + +Delete the corresponding GitHub Release manually if it was already published. 
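A minimal sketch of how the injected values could be declared, assuming they are package-level string variables (in ghr they would sit in `internal/cli`, per the `-X` paths in the Makefile and GoReleaser config). Only the `dev` default is documented here; the `none` and `unknown` defaults and the `versionString` helper are hypothetical, and the output format simply mirrors the placeholder in the bug report template.

```go
package main

import "fmt"

// Build-time values. The linker overwrites them when the binary is built with
// -ldflags "-X '<module>/internal/cli.version=...'" and so on.
// "dev" is the documented default; "none" and "unknown" are assumed placeholders.
var (
	version = "dev"
	commit  = "none"
	date    = "unknown"
)

// versionString formats the three values for display (hypothetical helper).
func versionString() string {
	return fmt.Sprintf("ghr %s (commit: %s, built: %s)", version, commit, date)
}

func main() {
	fmt.Println(versionString())
}
```

The `-X` flag only sets string variables, which is why all three values are plain strings rather than a struct or a parsed timestamp.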
diff --git a/cmd/ghr/main.go b/cmd/ghr/main.go new file mode 100644 index 0000000..731efc3 --- /dev/null +++ b/cmd/ghr/main.go @@ -0,0 +1,13 @@ +package main + +import ( + "os" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/cli" +) + +func main() { + if err := cli.Execute(); err != nil { + os.Exit(1) + } +} diff --git a/config.example.yaml b/config.example.yaml new file mode 100644 index 0000000..3651690 --- /dev/null +++ b/config.example.yaml @@ -0,0 +1,52 @@ +github: + url: "https://github.com/my-org" + runner_group: "default" + +runner: + version: "latest" + cache_dir: "" + workdir_base: "" + +groups: + - name: "runners" + max_runners: 5 + min_runners: 0 + labels: + - "macos" + +health: + enabled: true + check_interval: "30s" + runner_timeout: "2h" + idle_timeout: "0" + divergence_timeout: "5m" + max_consecutive_failures: 5 + failure_cooldown: "1m" + min_disk_space: "1GB" + +logging: + level: "info" + format: "text" + dir: "" + retention_days: 30 + runner_output: true + +notifications: + discord: + enabled: false + events: [] + username: "ghr" + avatar_url: "" + mentions: + error: "" + critical: "" + +monitoring: + uptime_kuma: + enabled: false + degraded_threshold: 0.5 + report_health_as_ping: true + +daemon: + state_dir: "" + shutdown_timeout: "30s" diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md deleted file mode 100644 index 90e0ba5..0000000 --- a/docs/ARCHITECTURE.md +++ /dev/null @@ -1,36 +0,0 @@ -# Architecture Overview - -## Packages -- `cmd/ghr`: entrypoint, wires CLI. -- `internal/cli`: cobra commands (`daemon`, `apply`, `status`), config flag handling, pid file utilities. -- `internal/config`: YAML + `.env` loading/validation, defaults for paths/version. -- `internal/domain`: core domain structs for groups and runner instances. -- `internal/provider/github`: GitHub API client for runner registration tokens. -- `internal/runner`: runner lifecycle (download cache, per-runner copy, configure, launch, cleanup). 
-- `internal/reconciler`: converges desired groups to running runners; watches exits and scales up/down. -- `internal/logging`: basic stdout logger. - -## Data Paths -- Cache: `/var/lib/ghr/cache` (runner archives/extracted bits). -- Workdirs: `/var/lib/ghr/groups//` (per runner, cleaned on exit). -- State (pid): `/var/lib/ghr/state/daemon.pid`. -- Runner pid files: `/.ghr-pid` (used for cleanup on startup). - -## Control Flow -1. `ghr daemon --config config.yaml` loads config, creates GitHub client + runner manager + reconciler. -2. Daemon writes pid file, starts reconcile loop on interval (default 15s). -3. SIGHUP triggers config reload; reconcile loop also reaps finished runners and recreates ephemerals to maintain counts. -4. `ghr apply` validates config and sends SIGHUP to daemon to reload. -5. On startup, daemon calls runner cleanup to kill any stray processes found in configured workdir bases and removes their workdirs to avoid accumulation. - -## Runner Lifecycle -1. Resolve runner version (`latest` via GitHub releases) and download/archive cache if missing. -2. Copy cached bits to a fresh workdir per runner; run `config.sh --unattended --url ... --token ... [--labels] [--ephemeral]`. -3. Start `run.sh`; wait/observe exit; cleanup workdir after exit. -4. Reconciler detects exits and scales replacements for ephemeral groups to keep target counts. - -## Security Notes -- Tokens only via env (`GITHUB_TOKEN`/`GITHUB_PAT`), never in config. -- Cleanup removes workdirs after runner exit; no per-group users to keep complexity low. -- macOS-only target; Linux best-effort later. 
- diff --git a/env.example b/env.example new file mode 100644 index 0000000..805da0f --- /dev/null +++ b/env.example @@ -0,0 +1,13 @@ +# GitHub authentication (alternative to 'ghr login') +# GITHUB_TOKEN=ghp_xxxxxxxxxxxx + +# Discord notifications +# GHR_DISCORD_WEBHOOK_URL=https://discord.com/api/webhooks/xxx/yyy + +# Uptime Kuma monitoring +# GHR_UPTIME_KUMA_URL=https://uptime.example.com +# GHR_UPTIME_KUMA_DAEMON_TOKEN=your-daemon-push-token +# GHR_UPTIME_KUMA_TOKEN_RUNNERS=your-group-push-token + +# Override credentials file path +# GHR_CREDENTIALS_FILE=/path/to/credentials.json diff --git a/go.mod b/go.mod new file mode 100644 index 0000000..cca8538 --- /dev/null +++ b/go.mod @@ -0,0 +1,17 @@ +module github.com/RedBoardDev/gh-runners-tool/v2 + +go 1.25.3 + +require ( + github.com/actions/scaleset v0.4.0 // indirect + github.com/golang-jwt/jwt/v4 v4.5.2 // indirect + github.com/google/uuid v1.6.0 // indirect + github.com/hashicorp/go-cleanhttp v0.5.2 // indirect + github.com/hashicorp/go-retryablehttp v0.7.8 // indirect + github.com/inconshreveable/mousetrap v1.1.0 // indirect + github.com/joho/godotenv v1.5.1 // indirect + github.com/oklog/run v1.2.0 // indirect + github.com/spf13/cobra v1.10.2 // indirect + github.com/spf13/pflag v1.0.10 // indirect + gopkg.in/yaml.v3 v3.0.1 // indirect +) diff --git a/go.sum b/go.sum new file mode 100644 index 0000000..34572d1 --- /dev/null +++ b/go.sum @@ -0,0 +1,27 @@ +github.com/actions/scaleset v0.4.0 h1:691GC2AkHb3ZGjfNvatboYoRS7CLr3+4VcZk/6w9IbM= +github.com/actions/scaleset v0.4.0/go.mod h1:2L2I6rggFWV+zprDet6y7y7Vkm3HPudaup78eSc79Uo= +github.com/cpuguy83/go-md2man/v2 v2.0.6/go.mod h1:oOW0eioCTA6cOiMLiUPZOpcVxMig6NIQQ7OS05n1F4g= +github.com/golang-jwt/jwt/v4 v4.5.2 h1:YtQM7lnr8iZ+j5q71MGKkNw9Mn7AjHM68uc9g5fXeUI= +github.com/golang-jwt/jwt/v4 v4.5.2/go.mod h1:m21LjoU+eqJr34lmDMbreY2eSTRJ1cv77w39/MY0Ch0= +github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0= +github.com/google/uuid v1.6.0/go.mod 
h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo= +github.com/hashicorp/go-cleanhttp v0.5.2 h1:035FKYIWjmULyFRBKPs8TBQoi0x6d9G4xc9neXJWAZQ= +github.com/hashicorp/go-cleanhttp v0.5.2/go.mod h1:kO/YDlP8L1346E6Sodw+PrpBSV4/SoxCXGY6BqNFT48= +github.com/hashicorp/go-retryablehttp v0.7.8 h1:ylXZWnqa7Lhqpk0L1P1LzDtGcCR0rPVUrx/c8Unxc48= +github.com/hashicorp/go-retryablehttp v0.7.8/go.mod h1:rjiScheydd+CxvumBsIrFKlx3iS0jrZ7LvzFGFmuKbw= +github.com/inconshreveable/mousetrap v1.1.0 h1:wN+x4NVGpMsO7ErUn/mUI3vEoE6Jt13X2s0bqwp9tc8= +github.com/inconshreveable/mousetrap v1.1.0/go.mod h1:vpF70FUmC8bwa3OWnCshd2FqLfsEA9PFc4w1p2J65bw= +github.com/joho/godotenv v1.5.1 h1:7eLL/+HRGLY0ldzfGMeQkb7vMd0as4CfYvUVzLqw0N0= +github.com/joho/godotenv v1.5.1/go.mod h1:f4LDr5Voq0i2e/R5DDNOoa2zzDfwtkZa6DnEwAbqwq4= +github.com/oklog/run v1.2.0 h1:O8x3yXwah4A73hJdlrwo/2X6J62gE5qTMusH0dvz60E= +github.com/oklog/run v1.2.0/go.mod h1:mgDbKRSwPhJfesJ4PntqFUbKQRZ50NgmZTSPlFA0YFk= +github.com/russross/blackfriday/v2 v2.1.0/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM= +github.com/spf13/cobra v1.10.2 h1:DMTTonx5m65Ic0GOoRY2c16WCbHxOOw6xxezuLaBpcU= +github.com/spf13/cobra v1.10.2/go.mod h1:7C1pvHqHw5A4vrJfjNwvOdzYu0Gml16OCs2GRiTUUS4= +github.com/spf13/pflag v1.0.9/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg= +github.com/spf13/pflag v1.0.10 h1:4EBh2KAYBwaONj6b2Ye1GiHfwjqyROoF4RwYO+vPwFk= +github.com/spf13/pflag v1.0.10/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg= +go.yaml.in/yaml/v3 v3.0.4/go.mod h1:DhzuOOF2ATzADvBadXxruRBLzYTpT36CKvDb3+aBEFg= +gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= +gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA= +gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= diff --git a/internal/api/handlers.go b/internal/api/handlers.go new file mode 100644 index 0000000..51806c6 --- /dev/null +++ b/internal/api/handlers.go @@ -0,0 +1,67 @@ +package api 
+ +import ( + "encoding/json" + "net/http" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" +) + +type statusResponse struct { + Groups map[string][]model.RunnerSnapshot `json:"groups"` + Health healthResponse `json:"health"` +} + +type healthResponse struct { + LastCheck time.Time `json:"last_check"` + Issues []model.HealthIssue `json:"issues"` +} + +func (s *Server) routes() http.Handler { + mux := http.NewServeMux() + mux.HandleFunc("GET /status", s.handleStatus) + mux.HandleFunc("GET /health", s.handleHealth) + return mux +} + +func (s *Server) handleStatus(w http.ResponseWriter, _ *http.Request) { + snapshots := s.controller.Snapshots() + hs := s.health.Status() + + resp := statusResponse{ + Groups: snapshots, + Health: healthResponse{ + LastCheck: hs.LastCheck, + Issues: hs.Issues, + }, + } + + writeJSON(w, resp) +} + +func (s *Server) handleHealth(w http.ResponseWriter, _ *http.Request) { + hs := s.health.Status() + + resp := healthResponse{ + LastCheck: hs.LastCheck, + Issues: hs.Issues, + } + + writeJSON(w, resp) +} + +func writeJSON(w http.ResponseWriter, v any) { + w.Header().Set("Content-Type", "application/json") + + data, err := json.Marshal(v) + if err != nil { + w.WriteHeader(http.StatusInternalServerError) + return + } + + _, writeErr := w.Write(data) + if writeErr != nil { + return + } +} diff --git a/internal/api/server.go b/internal/api/server.go new file mode 100644 index 0000000..6b7f2fa --- /dev/null +++ b/internal/api/server.go @@ -0,0 +1,97 @@ +package api + +import ( + "context" + "errors" + "fmt" + "log/slog" + "net" + "net/http" + "os" + "path/filepath" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/health" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" +) + +type controllerState interface { + Snapshots() map[string][]model.RunnerSnapshot +} + +type healthState interface { + Status() health.HealthStatus +} + +type Server struct { + socketPath string + controller controllerState + health 
healthState + logger *slog.Logger + listener net.Listener +} + +func NewServer(stateDir string, controller controllerState, healthProvider healthState, logger *slog.Logger) *Server { + return &Server{ + socketPath: filepath.Join(stateDir, "ghr.sock"), + controller: controller, + health: healthProvider, + logger: logger, + } +} + +func (s *Server) Run(ctx context.Context) error { + if err := removeStaleSocket(s.socketPath); err != nil { + return fmt.Errorf("remove stale socket: %w", err) + } + + ln, err := net.Listen("unix", s.socketPath) + if err != nil { + return fmt.Errorf("listen on %s: %w", s.socketPath, err) + } + s.listener = ln + + srv := &http.Server{ + Handler: s.routes(), + } + + errCh := make(chan error, 1) + go func() { + errCh <- srv.Serve(ln) + }() + + select { + case <-ctx.Done(): + shutdownErr := srv.Close() + cleanupErr := os.Remove(s.socketPath) + if shutdownErr != nil { + return fmt.Errorf("shutdown api server: %w", shutdownErr) + } + if cleanupErr != nil && !os.IsNotExist(cleanupErr) { + s.logger.Warn("failed to remove socket file", "path", s.socketPath, "error", cleanupErr) + } + return nil + case err := <-errCh: + cleanupErr := os.Remove(s.socketPath) + if cleanupErr != nil && !os.IsNotExist(cleanupErr) { + s.logger.Warn("failed to remove socket file", "path", s.socketPath, "error", cleanupErr) + } + if errors.Is(err, http.ErrServerClosed) { + return nil + } + return fmt.Errorf("api server: %w", err) + } +} + +func removeStaleSocket(path string) error { + _, err := os.Stat(path) + if os.IsNotExist(err) { + return nil + } + if err != nil { + return fmt.Errorf("stat socket %s: %w", path, err) + } + if err := os.Remove(path); err != nil { + return fmt.Errorf("remove socket %s: %w", path, err) + } + return nil +} diff --git a/internal/api/server_test.go b/internal/api/server_test.go new file mode 100644 index 0000000..b1780d6 --- /dev/null +++ b/internal/api/server_test.go @@ -0,0 +1,193 @@ +package api + +import ( + "encoding/json" + "log/slog" + 
"net/http" + "net/http/httptest" + "os" + "testing" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/health" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" +) + +type mockController struct { + snapshots map[string][]model.RunnerSnapshot +} + +func (m *mockController) Snapshots() map[string][]model.RunnerSnapshot { + return m.snapshots +} + +type mockHealth struct { + status health.HealthStatus +} + +func (m *mockHealth) Status() health.HealthStatus { + return m.status +} + +func testServer(ctrl *mockController, h *mockHealth) *Server { + return &Server{ + controller: ctrl, + health: h, + logger: slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelError + 1})), + } +} + +func TestHandleStatus(t *testing.T) { + now := time.Date(2026, 1, 15, 10, 30, 0, 0, time.UTC) + + ctrl := &mockController{ + snapshots: map[string][]model.RunnerSnapshot{ + "group-a": { + {Name: "group-a-1", Group: "group-a", State: "idle", PID: 1234, StartedAt: now}, + {Name: "group-a-2", Group: "group-a", State: "busy", PID: 5678, StartedAt: now}, + }, + }, + } + h := &mockHealth{ + status: health.HealthStatus{ + LastCheck: now, + Issues: []model.HealthIssue{}, + }, + } + + s := testServer(ctrl, h) + srv := httptest.NewServer(s.routes()) + defer srv.Close() + + resp, err := http.Get(srv.URL + "/status") + if err != nil { + t.Fatalf("GET /status: %v", err) + } + defer resp.Body.Close() + + if resp.StatusCode != http.StatusOK { + t.Fatalf("expected status 200, got %d", resp.StatusCode) + } + + ct := resp.Header.Get("Content-Type") + if ct != "application/json" { + t.Fatalf("expected Content-Type application/json, got %q", ct) + } + + var body statusResponse + if err := json.NewDecoder(resp.Body).Decode(&body); err != nil { + t.Fatalf("decode response: %v", err) + } + + runners, ok := body.Groups["group-a"] + if !ok { + t.Fatal("expected group-a in response") + } + if len(runners) != 2 { + t.Fatalf("expected 2 runners in group-a, got %d", 
len(runners)) + } +} + +func TestHandleHealth(t *testing.T) { + now := time.Date(2026, 1, 15, 10, 30, 0, 0, time.UTC) + + ctrl := &mockController{ + snapshots: map[string][]model.RunnerSnapshot{}, + } + h := &mockHealth{ + status: health.HealthStatus{ + LastCheck: now, + Issues: []model.HealthIssue{ + { + Level: model.LevelWarning, + Type: "health.disk_low", + Message: "disk space below threshold", + DetectedAt: now, + }, + }, + }, + } + + s := testServer(ctrl, h) + srv := httptest.NewServer(s.routes()) + defer srv.Close() + + resp, err := http.Get(srv.URL + "/health") + if err != nil { + t.Fatalf("GET /health: %v", err) + } + defer resp.Body.Close() + + if resp.StatusCode != http.StatusOK { + t.Fatalf("expected status 200, got %d", resp.StatusCode) + } + + var body healthResponse + if err := json.NewDecoder(resp.Body).Decode(&body); err != nil { + t.Fatalf("decode response: %v", err) + } + + if len(body.Issues) != 1 { + t.Fatalf("expected 1 issue, got %d", len(body.Issues)) + } + if body.Issues[0].Type != "health.disk_low" { + t.Fatalf("expected issue type health.disk_low, got %q", body.Issues[0].Type) + } +} + +func TestRoutes_NotFound(t *testing.T) { + ctrl := &mockController{ + snapshots: map[string][]model.RunnerSnapshot{}, + } + h := &mockHealth{ + status: health.HealthStatus{}, + } + + s := testServer(ctrl, h) + srv := httptest.NewServer(s.routes()) + defer srv.Close() + + resp, err := http.Get(srv.URL + "/unknown") + if err != nil { + t.Fatalf("GET /unknown: %v", err) + } + defer resp.Body.Close() + + if resp.StatusCode != http.StatusNotFound { + t.Fatalf("expected status 404, got %d", resp.StatusCode) + } +} + +func TestHandleStatus_EmptyGroups(t *testing.T) { + ctrl := &mockController{ + snapshots: map[string][]model.RunnerSnapshot{}, + } + h := &mockHealth{ + status: health.HealthStatus{ + Issues: []model.HealthIssue{}, + }, + } + + s := testServer(ctrl, h) + srv := httptest.NewServer(s.routes()) + defer srv.Close() + + resp, err := http.Get(srv.URL + 
"/status") + if err != nil { + t.Fatalf("GET /status: %v", err) + } + defer resp.Body.Close() + + if resp.StatusCode != http.StatusOK { + t.Fatalf("expected status 200, got %d", resp.StatusCode) + } + + var body statusResponse + if err := json.NewDecoder(resp.Body).Decode(&body); err != nil { + t.Fatalf("decode response: %v", err) + } + + if len(body.Groups) != 0 { + t.Fatalf("expected 0 groups, got %d", len(body.Groups)) + } +} diff --git a/internal/auth/auth_test.go b/internal/auth/auth_test.go new file mode 100644 index 0000000..dbdaad8 --- /dev/null +++ b/internal/auth/auth_test.go @@ -0,0 +1,591 @@ +package auth + +import ( + "context" + "encoding/json" + "net/http" + "net/http/httptest" + "os" + "path/filepath" + "strings" + "testing" + "time" +) + +func TestFilePath(t *testing.T) { + t.Run("with GHR_CREDENTIALS_FILE env set", func(t *testing.T) { + want := "/custom/path/credentials.json" + t.Setenv("GHR_CREDENTIALS_FILE", want) + + got := FilePath() + if got != want { + t.Errorf("FilePath() = %q, want %q", got, want) + } + }) + + t.Run("without env non-root returns home config path", func(t *testing.T) { + t.Setenv("GHR_CREDENTIALS_FILE", "") + + got := FilePath() + + // We are running tests as a non-root user, so it should use ~/.config/ghr/credentials.json + if os.Getuid() == 0 { + t.Skip("test requires non-root user") + } + + home, err := os.UserHomeDir() + if err != nil { + t.Fatalf("UserHomeDir() error: %v", err) + } + want := filepath.Join(home, ".config", "ghr", "credentials.json") + if got != want { + t.Errorf("FilePath() = %q, want %q", got, want) + } + }) +} + +func TestLoad_TokenFlag(t *testing.T) { + // Point credentials file to a non-existent path to avoid reading real credentials + t.Setenv("GHR_CREDENTIALS_FILE", filepath.Join(t.TempDir(), "nonexistent.json")) + t.Setenv("GITHUB_TOKEN", "") + + creds, source, err := Load(LoadOpts{TokenFlag: "ghp_flagtoken123"}) + if err != nil { + t.Fatalf("Load() error: %v", err) + } + if creds.Method != 
"pat" { + t.Errorf("Method = %q, want %q", creds.Method, "pat") + } + if creds.PAT != "ghp_flagtoken123" { + t.Errorf("PAT = %q, want %q", creds.PAT, "ghp_flagtoken123") + } + if source != "flag (--token)" { + t.Errorf("source = %q, want %q", source, "flag (--token)") + } +} + +func TestLoad_EnvVar(t *testing.T) { + // Point credentials file to a non-existent path + t.Setenv("GHR_CREDENTIALS_FILE", filepath.Join(t.TempDir(), "nonexistent.json")) + t.Setenv("GITHUB_TOKEN", "ghp_envtoken456") + + creds, source, err := Load(LoadOpts{}) + if err != nil { + t.Fatalf("Load() error: %v", err) + } + if creds.Method != "pat" { + t.Errorf("Method = %q, want %q", creds.Method, "pat") + } + if creds.PAT != "ghp_envtoken456" { + t.Errorf("PAT = %q, want %q", creds.PAT, "ghp_envtoken456") + } + if source != "env (GITHUB_TOKEN)" { + t.Errorf("source = %q, want %q", source, "env (GITHUB_TOKEN)") + } +} + +func TestLoad_CredentialsFile(t *testing.T) { + dir := t.TempDir() + credFile := filepath.Join(dir, "credentials.json") + t.Setenv("GHR_CREDENTIALS_FILE", credFile) + t.Setenv("GITHUB_TOKEN", "") + + creds := &Credentials{ + Method: "pat", + GitHubURL: "https://github.com/my-org", + PAT: "ghp_fromfile789", + CreatedAt: time.Date(2025, 1, 15, 10, 0, 0, 0, time.UTC), + } + data, err := json.MarshalIndent(creds, "", " ") + if err != nil { + t.Fatalf("MarshalIndent() error: %v", err) + } + if err := os.WriteFile(credFile, data, 0600); err != nil { + t.Fatalf("WriteFile() error: %v", err) + } + + loaded, source, err := Load(LoadOpts{}) + if err != nil { + t.Fatalf("Load() error: %v", err) + } + if loaded.Method != "pat" { + t.Errorf("Method = %q, want %q", loaded.Method, "pat") + } + if loaded.PAT != "ghp_fromfile789" { + t.Errorf("PAT = %q, want %q", loaded.PAT, "ghp_fromfile789") + } + if loaded.GitHubURL != "https://github.com/my-org" { + t.Errorf("GitHubURL = %q, want %q", loaded.GitHubURL, "https://github.com/my-org") + } + if !strings.Contains(source, "file") { + 
t.Errorf("source = %q, want it to contain %q", source, "file") + } +} + +func TestLoad_Priority(t *testing.T) { + t.Run("TokenFlag wins over GITHUB_TOKEN", func(t *testing.T) { + t.Setenv("GHR_CREDENTIALS_FILE", filepath.Join(t.TempDir(), "nonexistent.json")) + t.Setenv("GITHUB_TOKEN", "ghp_env_should_lose") + + creds, source, err := Load(LoadOpts{TokenFlag: "ghp_flag_should_win"}) + if err != nil { + t.Fatalf("Load() error: %v", err) + } + if creds.PAT != "ghp_flag_should_win" { + t.Errorf("PAT = %q, want %q", creds.PAT, "ghp_flag_should_win") + } + if source != "flag (--token)" { + t.Errorf("source = %q, want %q", source, "flag (--token)") + } + }) + + t.Run("GITHUB_TOKEN wins over credentials file", func(t *testing.T) { + dir := t.TempDir() + credFile := filepath.Join(dir, "credentials.json") + t.Setenv("GHR_CREDENTIALS_FILE", credFile) + t.Setenv("GITHUB_TOKEN", "ghp_env_should_win") + + fileCreds := &Credentials{ + Method: "pat", + PAT: "ghp_file_should_lose", + CreatedAt: time.Now(), + } + data, err := json.MarshalIndent(fileCreds, "", " ") + if err != nil { + t.Fatalf("MarshalIndent() error: %v", err) + } + if err := os.WriteFile(credFile, data, 0600); err != nil { + t.Fatalf("WriteFile() error: %v", err) + } + + creds, source, err := Load(LoadOpts{}) + if err != nil { + t.Fatalf("Load() error: %v", err) + } + if creds.PAT != "ghp_env_should_win" { + t.Errorf("PAT = %q, want %q", creds.PAT, "ghp_env_should_win") + } + if source != "env (GITHUB_TOKEN)" { + t.Errorf("source = %q, want %q", source, "env (GITHUB_TOKEN)") + } + }) +} + +func TestLoad_NotAuthenticated(t *testing.T) { + t.Setenv("GHR_CREDENTIALS_FILE", filepath.Join(t.TempDir(), "nonexistent.json")) + t.Setenv("GITHUB_TOKEN", "") + + _, _, err := Load(LoadOpts{}) + if err == nil { + t.Fatal("Load() expected error, got nil") + } + if !strings.Contains(err.Error(), "not authenticated") { + t.Errorf("error = %q, want it to contain %q", err.Error(), "not authenticated") + } +} + +func 
TestSave_And_Load(t *testing.T) { + dir := t.TempDir() + credFile := filepath.Join(dir, "credentials.json") + t.Setenv("GHR_CREDENTIALS_FILE", credFile) + t.Setenv("GITHUB_TOKEN", "") + + original := &Credentials{ + Method: "pat", + GitHubURL: "https://github.com/test-org", + PAT: "ghp_saveandload123", + CreatedAt: time.Date(2025, 6, 1, 12, 0, 0, 0, time.UTC), + } + + if err := Save(original); err != nil { + t.Fatalf("Save() error: %v", err) + } + + // Verify file permissions are 0600 + info, err := os.Stat(credFile) + if err != nil { + t.Fatalf("Stat() error: %v", err) + } + perm := info.Mode().Perm() + if perm != 0600 { + t.Errorf("file permissions = %o, want %o", perm, 0600) + } + + // Load back and verify + loaded, source, err := Load(LoadOpts{}) + if err != nil { + t.Fatalf("Load() error: %v", err) + } + if loaded.Method != original.Method { + t.Errorf("Method = %q, want %q", loaded.Method, original.Method) + } + if loaded.PAT != original.PAT { + t.Errorf("PAT = %q, want %q", loaded.PAT, original.PAT) + } + if loaded.GitHubURL != original.GitHubURL { + t.Errorf("GitHubURL = %q, want %q", loaded.GitHubURL, original.GitHubURL) + } + if !strings.Contains(source, "file") { + t.Errorf("source = %q, want it to contain %q", source, "file") + } +} + +func TestSave_CreatesDirectory(t *testing.T) { + dir := t.TempDir() + nestedPath := filepath.Join(dir, "nested", "deep", "credentials.json") + t.Setenv("GHR_CREDENTIALS_FILE", nestedPath) + + creds := &Credentials{ + Method: "pat", + PAT: "ghp_nested123", + CreatedAt: time.Now(), + } + + if err := Save(creds); err != nil { + t.Fatalf("Save() error: %v", err) + } + + // Verify parent directory was created with 0700 + parentDir := filepath.Dir(nestedPath) + info, err := os.Stat(parentDir) + if err != nil { + t.Fatalf("Stat(%s) error: %v", parentDir, err) + } + if !info.IsDir() { + t.Errorf("%s is not a directory", parentDir) + } + perm := info.Mode().Perm() + if perm != 0700 { + t.Errorf("directory permissions = %o, want 
%o", perm, 0700) + } +} + +func TestSave_SetsCreatedAt(t *testing.T) { + dir := t.TempDir() + credFile := filepath.Join(dir, "credentials.json") + t.Setenv("GHR_CREDENTIALS_FILE", credFile) + + before := time.Now().Add(-time.Second) + + creds := &Credentials{ + Method: "pat", + PAT: "ghp_timestamp123", + // CreatedAt is zero + } + + if err := Save(creds); err != nil { + t.Fatalf("Save() error: %v", err) + } + + after := time.Now().Add(time.Second) + + if creds.CreatedAt.IsZero() { + t.Fatal("CreatedAt should not be zero after Save()") + } + if creds.CreatedAt.Before(before) { + t.Errorf("CreatedAt = %v, want after %v", creds.CreatedAt, before) + } + if creds.CreatedAt.After(after) { + t.Errorf("CreatedAt = %v, want before %v", creds.CreatedAt, after) + } +} + +func TestRemove(t *testing.T) { + t.Run("save then remove", func(t *testing.T) { + dir := t.TempDir() + credFile := filepath.Join(dir, "credentials.json") + t.Setenv("GHR_CREDENTIALS_FILE", credFile) + + creds := &Credentials{ + Method: "pat", + PAT: "ghp_removeme", + CreatedAt: time.Now(), + } + if err := Save(creds); err != nil { + t.Fatalf("Save() error: %v", err) + } + + // Verify file exists + if _, err := os.Stat(credFile); err != nil { + t.Fatalf("file should exist before Remove(), Stat error: %v", err) + } + + if err := Remove(); err != nil { + t.Fatalf("Remove() error: %v", err) + } + + // Verify file no longer exists + if _, err := os.Stat(credFile); !os.IsNotExist(err) { + t.Errorf("file should not exist after Remove(), Stat error: %v", err) + } + }) + + t.Run("remove when file does not exist", func(t *testing.T) { + t.Setenv("GHR_CREDENTIALS_FILE", filepath.Join(t.TempDir(), "nonexistent.json")) + + if err := Remove(); err != nil { + t.Errorf("Remove() on non-existent file should not error, got: %v", err) + } + }) +} + +func TestMaskedPAT(t *testing.T) { + tests := []struct { + name string + pat string + want string + }{ + { + name: "standard PAT", + pat: "ghp_1234567890abcdef", + want: 
"ghp_...cdef", + }, + { + name: "short token", + pat: "short", + want: "****", + }, + { + name: "empty token", + pat: "", + want: "****", + }, + { + name: "exactly 12 chars", + pat: "exactlytwelv", + want: "exac...welv", + }, + { + name: "11 chars returns mask", + pat: "elevenchar!", + want: "****", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + got := MaskedPAT(tt.pat) + if got != tt.want { + t.Errorf("MaskedPAT(%q) = %q, want %q", tt.pat, got, tt.want) + } + }) + } +} + +func TestValidate_PAT(t *testing.T) { + // validatePAT hardcodes "https://api.github.com/user", so we cannot inject + // a test server URL without modifying the production code. Instead, we test + // validatePAT indirectly via Validate for the success case using httptest + // by temporarily overriding http.DefaultTransport. + t.Run("valid PAT", func(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + // Verify the request is well-formed + if got := r.Header.Get("Authorization"); got != "Bearer ghp_testtoken" { + t.Errorf("Authorization = %q, want %q", got, "Bearer ghp_testtoken") + } + if got := r.Header.Get("Accept"); got != "application/vnd.github+json" { + t.Errorf("Accept = %q, want %q", got, "application/vnd.github+json") + } + w.Header().Set("X-OAuth-Scopes", "admin:org, repo") + w.WriteHeader(http.StatusOK) + resp := githubUserResponse{Login: "testuser"} + if err := json.NewEncoder(w).Encode(resp); err != nil { + t.Errorf("encode response: %v", err) + } + })) + defer srv.Close() + + // Override DefaultTransport to redirect api.github.com to the test server + origTransport := http.DefaultTransport + http.DefaultTransport = &rewriteTransport{ + targetURL: srv.URL, + wrapped: origTransport, + } + defer func() { http.DefaultTransport = origTransport }() + + result, err := Validate(context.Background(), &Credentials{ + Method: "pat", + PAT: "ghp_testtoken", + }) + if err != nil { + t.Fatalf("Validate() 
error: %v", err) + } + if !result.Valid { + t.Error("Valid = false, want true") + } + if result.Username != "testuser" { + t.Errorf("Username = %q, want %q", result.Username, "testuser") + } + if len(result.Scopes) != 2 { + t.Errorf("Scopes length = %d, want 2", len(result.Scopes)) + } else { + if result.Scopes[0] != "admin:org" { + t.Errorf("Scopes[0] = %q, want %q", result.Scopes[0], "admin:org") + } + if result.Scopes[1] != "repo" { + t.Errorf("Scopes[1] = %q, want %q", result.Scopes[1], "repo") + } + } + }) + + t.Run("unauthorized PAT", func(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) { + w.WriteHeader(http.StatusUnauthorized) + _, writeErr := w.Write([]byte(`{"message":"Bad credentials"}`)) + if writeErr != nil { + t.Errorf("write response: %v", writeErr) + } + })) + defer srv.Close() + + origTransport := http.DefaultTransport + http.DefaultTransport = &rewriteTransport{ + targetURL: srv.URL, + wrapped: origTransport, + } + defer func() { http.DefaultTransport = origTransport }() + + _, err := Validate(context.Background(), &Credentials{ + Method: "pat", + PAT: "ghp_badtoken", + }) + if err == nil { + t.Fatal("Validate() expected error for unauthorized PAT, got nil") + } + if !strings.Contains(err.Error(), "401") { + t.Errorf("error = %q, want it to contain %q", err.Error(), "401") + } + }) +} + +func TestValidate_GitHubApp(t *testing.T) { + t.Run("valid private key file", func(t *testing.T) { + dir := t.TempDir() + keyPath := filepath.Join(dir, "test.pem") + if err := os.WriteFile(keyPath, []byte("fake-pem-content"), 0600); err != nil { + t.Fatalf("WriteFile() error: %v", err) + } + + result, err := Validate(context.Background(), &Credentials{ + Method: "github_app", + GitHubApp: &GitHubAppCreds{ + ClientID: "Iv1.abc123", + InstallationID: 12345678, + PrivateKeyPath: keyPath, + }, + }) + if err != nil { + t.Fatalf("Validate() error: %v", err) + } + if !result.Valid { + t.Error("Valid = false, want 
true") + } + }) + + t.Run("non-existent private key file", func(t *testing.T) { + _, err := Validate(context.Background(), &Credentials{ + Method: "github_app", + GitHubApp: &GitHubAppCreds{ + ClientID: "Iv1.abc123", + InstallationID: 12345678, + PrivateKeyPath: "/nonexistent/path/key.pem", + }, + }) + if err == nil { + t.Fatal("Validate() expected error for non-existent key, got nil") + } + if !strings.Contains(err.Error(), "open private key") { + t.Errorf("error = %q, want it to contain %q", err.Error(), "open private key") + } + }) + + t.Run("nil GitHubAppCreds", func(t *testing.T) { + _, err := Validate(context.Background(), &Credentials{ + Method: "github_app", + GitHubApp: nil, + }) + if err == nil { + t.Fatal("Validate() expected error for nil creds, got nil") + } + if !strings.Contains(err.Error(), "credentials are nil") { + t.Errorf("error = %q, want it to contain %q", err.Error(), "credentials are nil") + } + }) +} + +func TestValidate_UnknownMethod(t *testing.T) { + _, err := Validate(context.Background(), &Credentials{ + Method: "unknown", + }) + if err == nil { + t.Fatal("Validate() expected error for unknown method, got nil") + } + if !strings.Contains(err.Error(), "unknown method") { + t.Errorf("error = %q, want it to contain %q", err.Error(), "unknown method") + } +} + +func TestParseScopes(t *testing.T) { + tests := []struct { + name string + header string + want []string + }{ + { + name: "multiple scopes", + header: "admin:org, repo, workflow", + want: []string{"admin:org", "repo", "workflow"}, + }, + { + name: "single scope", + header: "repo", + want: []string{"repo"}, + }, + { + name: "empty header", + header: "", + want: nil, + }, + { + name: "extra whitespace", + header: " admin:org , repo , workflow ", + want: []string{"admin:org", "repo", "workflow"}, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + got := parseScopes(tt.header) + if len(got) != len(tt.want) { + t.Fatalf("parseScopes(%q) length = %d, want %d", 
tt.header, len(got), len(tt.want)) + } + for i := range got { + if got[i] != tt.want[i] { + t.Errorf("parseScopes(%q)[%d] = %q, want %q", tt.header, i, got[i], tt.want[i]) + } + } + }) + } +} + +// rewriteTransport is an http.RoundTripper that redirects requests targeting +// api.github.com to a local httptest server. This allows testing validatePAT +// without modifying the production code. +type rewriteTransport struct { + targetURL string + wrapped http.RoundTripper +} + +func (t *rewriteTransport) RoundTrip(req *http.Request) (*http.Response, error) { + if req.URL.Host == "api.github.com" { + req = req.Clone(req.Context()) + req.URL.Scheme = "http" + parsed, err := http.NewRequest(req.Method, t.targetURL+req.URL.Path, req.Body) + if err != nil { + return nil, err + } + parsed.Header = req.Header + req = parsed + } + return t.wrapped.RoundTrip(req) +} diff --git a/internal/auth/credentials.go b/internal/auth/credentials.go new file mode 100644 index 0000000..fc5997f --- /dev/null +++ b/internal/auth/credentials.go @@ -0,0 +1,32 @@ +package auth + +import "time" + +type Credentials struct { + Method string `json:"method"` + GitHubURL string `json:"github_url"` + PAT string `json:"pat,omitempty"` + GitHubApp *GitHubAppCreds `json:"github_app,omitempty"` + CreatedAt time.Time `json:"created_at"` +} + +type GitHubAppCreds struct { + ClientID string `json:"client_id"` + InstallationID int64 `json:"installation_id"` + PrivateKeyPath string `json:"private_key_path"` +} + +type LoadOpts struct { + TokenFlag string +} + +type ValidationResult struct { + Valid bool + Username string + Scopes []string + OrgName string +} + +type githubUserResponse struct { + Login string `json:"login"` +} diff --git a/internal/auth/load.go b/internal/auth/load.go new file mode 100644 index 0000000..7ee8250 --- /dev/null +++ b/internal/auth/load.go @@ -0,0 +1,39 @@ +package auth + +import ( + "fmt" + "os" +) + +func Load(opts LoadOpts) (*Credentials, string, error) { + if opts.TokenFlag != 
"" { + return &Credentials{ + Method: "pat", + PAT: opts.TokenFlag, + }, "flag (--token)", nil + } + + if token := os.Getenv("GITHUB_TOKEN"); token != "" { + return &Credentials{ + Method: "pat", + PAT: token, + }, "env (GITHUB_TOKEN)", nil + } + + creds, err := loadFromFile() + if err == nil { + return creds, fmt.Sprintf("file (%s)", FilePath()), nil + } + if !os.IsNotExist(err) { + return nil, "", fmt.Errorf("load credentials file: %w", err) + } + + if token := os.Getenv("GITHUB_TOKEN"); token != "" { + return &Credentials{ + Method: "pat", + PAT: token, + }, "env (.env GITHUB_TOKEN)", nil + } + + return nil, "", fmt.Errorf("not authenticated. Run 'ghr login' to set up authentication, or set GITHUB_TOKEN") +} diff --git a/internal/auth/store.go b/internal/auth/store.go new file mode 100644 index 0000000..639fa3d --- /dev/null +++ b/internal/auth/store.go @@ -0,0 +1,65 @@ +package auth + +import ( + "encoding/json" + "fmt" + "os" + "path/filepath" + "time" +) + +func FilePath() string { + if p := os.Getenv("GHR_CREDENTIALS_FILE"); p != "" { + return p + } + if os.Getuid() == 0 { + return "/etc/ghr/credentials.json" + } + home, err := os.UserHomeDir() + if err != nil { + return filepath.Join(".config", "ghr", "credentials.json") + } + return filepath.Join(home, ".config", "ghr", "credentials.json") +} + +func loadFromFile() (*Credentials, error) { + data, err := os.ReadFile(FilePath()) + if err != nil { + return nil, err + } + var creds Credentials + if err := json.Unmarshal(data, &creds); err != nil { + return nil, fmt.Errorf("parse credentials file: %w", err) + } + return &creds, nil +} + +func Save(creds *Credentials) error { + if creds.CreatedAt.IsZero() { + creds.CreatedAt = time.Now() + } + + p := FilePath() + dir := filepath.Dir(p) + if err := os.MkdirAll(dir, 0o700); err != nil { + return fmt.Errorf("create credentials directory %s: %w", dir, err) + } + + data, err := json.MarshalIndent(creds, "", " ") + if err != nil { + return fmt.Errorf("marshal 
credentials: %w", err) + } + + if err := os.WriteFile(p, data, 0o600); err != nil { + return fmt.Errorf("write credentials file %s: %w", p, err) + } + return nil +} + +func Remove() error { + err := os.Remove(FilePath()) + if err != nil && !os.IsNotExist(err) { + return fmt.Errorf("remove credentials file: %w", err) + } + return nil +} diff --git a/internal/auth/validate.go b/internal/auth/validate.go new file mode 100644 index 0000000..89a2f87 --- /dev/null +++ b/internal/auth/validate.go @@ -0,0 +1,101 @@ +package auth + +import ( + "context" + "encoding/json" + "fmt" + "io" + "net/http" + "os" + "strings" +) + +func Validate(ctx context.Context, creds *Credentials) (*ValidationResult, error) { + switch creds.Method { + case "pat": + return validatePAT(ctx, creds.PAT) + case "github_app": + return validateGitHubApp(creds.GitHubApp) + default: + return nil, fmt.Errorf("validate credentials: unknown method %q", creds.Method) + } +} + +func validatePAT(ctx context.Context, pat string) (*ValidationResult, error) { + req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://api.github.com/user", http.NoBody) + if err != nil { + return nil, fmt.Errorf("validate PAT: create request: %w", err) + } + req.Header.Set("Authorization", "Bearer "+pat) + req.Header.Set("Accept", "application/vnd.github+json") + + resp, err := http.DefaultClient.Do(req) + if err != nil { + return nil, fmt.Errorf("validate PAT: request failed: %w", err) + } + defer func() { + _, _ = io.Copy(io.Discard, resp.Body) + resp.Body.Close() + }() + + body, err := io.ReadAll(resp.Body) + if err != nil { + return nil, fmt.Errorf("validate PAT: read response: %w", err) + } + + if resp.StatusCode != http.StatusOK { + return nil, fmt.Errorf("validate PAT: GitHub API returned %d: %s", resp.StatusCode, string(body)) + } + + var user githubUserResponse + if err := json.Unmarshal(body, &user); err != nil { + return nil, fmt.Errorf("validate PAT: parse response: %w", err) + } + + scopes := 
parseScopes(resp.Header.Get("X-OAuth-Scopes")) + + return &ValidationResult{ + Valid: true, + Username: user.Login, + Scopes: scopes, + }, nil +} + +func parseScopes(header string) []string { + if header == "" { + return nil + } + parts := strings.Split(header, ",") + scopes := make([]string, 0, len(parts)) + for _, p := range parts { + s := strings.TrimSpace(p) + if s != "" { + scopes = append(scopes, s) + } + } + return scopes +} + +func validateGitHubApp(app *GitHubAppCreds) (*ValidationResult, error) { + if app == nil { + return nil, fmt.Errorf("validate GitHub App: credentials are nil") + } + f, err := os.Open(app.PrivateKeyPath) + if err != nil { + return nil, fmt.Errorf("validate GitHub App: open private key %s: %w", app.PrivateKeyPath, err) + } + if err := f.Close(); err != nil { + return nil, fmt.Errorf("validate GitHub App: close private key file: %w", err) + } + + return &ValidationResult{ + Valid: true, + }, nil +} + +func MaskedPAT(pat string) string { + if len(pat) < 12 { + return "****" + } + return pat[:4] + "..." 
+ pat[len(pat)-4:] +} diff --git a/internal/cli/auth.go b/internal/cli/auth.go new file mode 100644 index 0000000..a41e1ff --- /dev/null +++ b/internal/cli/auth.go @@ -0,0 +1,61 @@ +package cli + +import ( + "fmt" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/auth" + "github.com/spf13/cobra" +) + +func newAuthCmd() *cobra.Command { + cmd := &cobra.Command{ + Use: "auth", + Short: "Authentication management", + } + + cmd.AddCommand(newAuthStatusCmd()) + return cmd +} + +func newAuthStatusCmd() *cobra.Command { + return &cobra.Command{ + Use: "status", + Short: "Display current authentication state", + RunE: func(cmd *cobra.Command, _ []string) error { + creds, source, loadErr := auth.Load(auth.LoadOpts{TokenFlag: tokenFlag}) + if loadErr != nil { + fmt.Println("Status: not authenticated") + fmt.Println("Run 'ghr login' to authenticate.") + return nil + } + + fmt.Printf("Method: %s\n", creds.Method) + fmt.Printf("Source: %s\n", source) + if creds.GitHubURL != "" { + fmt.Printf("GitHub: %s\n", creds.GitHubURL) + } + if creds.Method == "pat" && creds.PAT != "" { + fmt.Printf("Token: %s\n", auth.MaskedPAT(creds.PAT)) + } + if creds.GitHubApp != nil { + fmt.Printf("Client: %s\n", creds.GitHubApp.ClientID) + fmt.Printf("Install: %d\n", creds.GitHubApp.InstallationID) + fmt.Printf("Key: %s\n", creds.GitHubApp.PrivateKeyPath) + } + + result, err := auth.Validate(cmd.Context(), creds) + if err != nil { + fmt.Printf("Status: validation failed: %v\n", err) + return nil + } + if result.Valid { + fmt.Println("Status: authenticated") + if result.Username != "" { + fmt.Printf("User: @%s\n", result.Username) + } + } + + return nil + }, + } +} diff --git a/internal/cli/daemon.go b/internal/cli/daemon.go new file mode 100644 index 0000000..3170ee5 --- /dev/null +++ b/internal/cli/daemon.go @@ -0,0 +1,188 @@ +package cli + +import ( + "context" + "fmt" + "log/slog" + "os" + "path/filepath" + "strconv" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/api" + 
"github.com/RedBoardDev/gh-runners-tool/v2/internal/auth" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/config" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/controller" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/github" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/health" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/logging" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/monitoring" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/notification" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/runner" +) + +type daemon struct { + ctrl *controller.GroupController + health *health.Monitor + api *api.Server + logMgr *logging.LogManager + cfg *config.Config + logger *slog.Logger +} + +func buildDaemon(cfg *config.Config, creds *auth.Credentials, githubURL string) (*daemon, error) { + logMgr, err := logging.New(logging.LogConfig{ + Level: cfg.Logging.Level, + Format: cfg.Logging.Format, + Dir: cfg.Logging.Dir, + RetentionDays: cfg.Logging.RetentionDays, + RunnerOutput: cfg.Logging.RunnerOutput != nil && *cfg.Logging.RunnerOutput, + }) + if err != nil { + return nil, fmt.Errorf("setup logging: %w", err) + } + + logger, err := logMgr.DaemonLogger() + if err != nil { + logMgr.Close() + return nil, fmt.Errorf("create daemon logger: %w", err) + } + + if err := logMgr.CleanupOldLogs(); err != nil { + logger.Warn("log cleanup failed", "error", err) + } + + ghClient, err := github.NewClient(creds, githubURL) + if err != nil { + logMgr.Close() + return nil, fmt.Errorf("create github client: %w", err) + } + + binaryMgr := runner.NewBinaryManager(cfg.Runner.CacheDir, logger) + processMgr := runner.NewProcessManager(cfg.Runner.WorkdirBase, logger) + + if err := processMgr.CleanupStale(context.Background()); err != nil { + logger.Warn("stale runner cleanup failed", "error", err) + } + processMgr.KillOrphanRunners(context.Background()) + + notifSvc := buildNotificationService(cfg, logger) + reporters := buildReporters(cfg, 
logger) + + ctrl := controller.New( + ghClient, binaryMgr, processMgr, notifSvc, logMgr, + cfg.Groups, controller.ControllerConfig{ + RunnerVersion: cfg.Runner.Version, + RunnerGroupID: 1, + }, logger, + ) + + var minDiskSpace int64 + if cfg.Health.MinDiskSpace != "" { + minDiskSpace, _ = config.ParseByteSize(cfg.Health.MinDiskSpace) + } + + healthMon := health.NewMonitor(health.MonitorConfig{ + Enabled: cfg.Health.Enabled, + CheckInterval: cfg.Health.CheckInterval.Duration, + RunnerTimeout: cfg.Health.RunnerTimeout.Duration, + IdleTimeout: cfg.Health.IdleTimeout.Duration, + DivergenceTimeout: cfg.Health.DivergenceTimeout.Duration, + MaxConsecutiveFailures: cfg.Health.MaxConsecutiveFailures, + FailureCooldown: cfg.Health.FailureCooldown.Duration, + MinDiskSpace: minDiskSpace, + GroupMinRunners: buildGroupMinRunners(cfg), + }, notifSvc, ctrl, reporters, ctrl, logger) + + apiServer := api.NewServer(cfg.Daemon.StateDir, ctrl, healthMon, logger) + + return &daemon{ + ctrl: ctrl, + health: healthMon, + api: apiServer, + logMgr: logMgr, + cfg: cfg, + logger: logger, + }, nil +} + +func buildNotificationService(cfg *config.Config, logger *slog.Logger) *notification.Service { + var providers []notification.Provider + filters := make(map[string]notification.EventFilter) + + if cfg.Notifications.Discord.Enabled && cfg.Notifications.Discord.WebhookURL != "" { + providers = append(providers, notification.NewDiscord(&notification.DiscordConfig{ + WebhookURL: cfg.Notifications.Discord.WebhookURL, + Username: cfg.Notifications.Discord.Username, + AvatarURL: cfg.Notifications.Discord.AvatarURL, + Mentions: notification.DiscordMentions{ + Error: cfg.Notifications.Discord.Mentions.Error, + Critical: cfg.Notifications.Discord.Mentions.Critical, + }, + })) + filters["discord"] = notification.EventFilter{ + Patterns: cfg.Notifications.Discord.Events, + } + } + + return notification.New(providers, filters, logger) +} + +func buildReporters(cfg *config.Config, logger *slog.Logger) 
[]health.Reporter { + var reporters []health.Reporter + + logger.Debug("uptime-kuma config", + "enabled", cfg.Monitoring.UptimeKuma.Enabled, + "base_url_set", cfg.Monitoring.UptimeKuma.BaseURL != "", + "daemon_token_set", cfg.Monitoring.UptimeKuma.DaemonToken != "", + "group_tokens", len(cfg.Monitoring.UptimeKuma.GroupTokens), + ) + if cfg.Monitoring.UptimeKuma.Enabled && cfg.Monitoring.UptimeKuma.BaseURL != "" { + reporters = append(reporters, monitoring.NewUptimeKuma(monitoring.UptimeKumaConfig{ + BaseURL: cfg.Monitoring.UptimeKuma.BaseURL, + DaemonToken: cfg.Monitoring.UptimeKuma.DaemonToken, + GroupTokens: cfg.Monitoring.UptimeKuma.GroupTokens, + DegradedThreshold: cfg.Monitoring.UptimeKuma.DegradedThreshold, + ReportHealthAsPing: cfg.Monitoring.UptimeKuma.ReportHealthAsPing, + }, logger)) + } + + return reporters +} + +func resolveGitHubURL(creds *auth.Credentials, cfg *config.Config) (string, error) { + if creds.GitHubURL != "" { + return creds.GitHubURL, nil + } + if cfg.GitHub.URL != "" { + return cfg.GitHub.URL, nil + } + return "", fmt.Errorf("github URL not configured: set it via 'ghr login' or in config github.url") +} + +func pidFilePath(stateDir string) string { + return filepath.Join(stateDir, "daemon.pid") +} + +func writePIDFile(path string) error { + dir := filepath.Dir(path) + if err := os.MkdirAll(dir, 0o755); err != nil { + return fmt.Errorf("create pid file directory %s: %w", dir, err) + } + pid := strconv.Itoa(os.Getpid()) + if err := os.WriteFile(path, []byte(pid), 0o644); err != nil { + return fmt.Errorf("write pid file %s: %w", path, err) + } + return nil +} + +func removePIDFile(path string) { + _ = os.Remove(path) +} + +func buildGroupMinRunners(cfg *config.Config) map[string]int { + m := make(map[string]int, len(cfg.Groups)) + for _, g := range cfg.Groups { + m[g.Name] = g.MinRunners + } + return m +} diff --git a/internal/cli/login.go b/internal/cli/login.go new file mode 100644 index 0000000..7de8644 --- /dev/null +++ 
b/internal/cli/login.go @@ -0,0 +1,123 @@ +package cli + +import ( + "bufio" + "fmt" + "os" + "strings" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/auth" + "github.com/spf13/cobra" +) + +func newLoginCmd() *cobra.Command { + cmd := &cobra.Command{ + Use: "login", + Short: "Authenticate with GitHub", + Long: "Interactive wizard to configure GitHub authentication. Supports PAT and GitHub App.", + RunE: runLogin, + } + + cmd.Flags().String("method", "", "auth method: pat or app") + cmd.Flags().String("url", "", "GitHub URL (org, repo, or enterprise)") + cmd.Flags().String("client-id", "", "GitHub App client ID") + cmd.Flags().Int64("installation-id", 0, "GitHub App installation ID") + cmd.Flags().String("private-key", "", "path to GitHub App private key (.pem)") + + return cmd +} + +func runLogin(cmd *cobra.Command, _ []string) error { + method, err := cmd.Flags().GetString("method") + if err != nil { + return fmt.Errorf("get method flag: %w", err) + } + + if method == "" { + reader := bufio.NewReader(os.Stdin) + return interactiveLogin(cmd, reader) + } + + return nonInteractiveLogin(cmd, method) +} + +func nonInteractiveLogin(cmd *cobra.Command, method string) error { + url, err := cmd.Flags().GetString("url") + if err != nil { + return fmt.Errorf("get url flag: %w", err) + } + + var creds *auth.Credentials + + switch method { + case "pat": + if tokenFlag == "" { + return fmt.Errorf("--token is required for PAT authentication") + } + if url == "" { + return fmt.Errorf("--url is required") + } + creds = &auth.Credentials{ + Method: "pat", + GitHubURL: url, + PAT: tokenFlag, + } + + case "app": + clientID, flagErr := cmd.Flags().GetString("client-id") + if flagErr != nil { + return fmt.Errorf("get client-id flag: %w", flagErr) + } + installationID, flagErr := cmd.Flags().GetInt64("installation-id") + if flagErr != nil { + return fmt.Errorf("get installation-id flag: %w", flagErr) + } + privateKey, flagErr := cmd.Flags().GetString("private-key") + if flagErr 
!= nil { + return fmt.Errorf("get private-key flag: %w", flagErr) + } + if clientID == "" || installationID == 0 || privateKey == "" || url == "" { + return fmt.Errorf("--client-id, --installation-id, --private-key, and --url are all required for GitHub App authentication") + } + creds = &auth.Credentials{ + Method: "github_app", + GitHubURL: url, + GitHubApp: &auth.GitHubAppCreds{ + ClientID: clientID, + InstallationID: installationID, + PrivateKeyPath: privateKey, + }, + } + + default: + return fmt.Errorf("unknown method %q: must be 'pat' or 'app'", method) + } + + return validateAndSave(cmd, creds) +} + +func validateAndSave(cmd *cobra.Command, creds *auth.Credentials) error { + fmt.Println(" Validating...") + result, err := auth.Validate(cmd.Context(), creds) + if err != nil { + return fmt.Errorf("validation failed: %w", err) + } + + if !result.Valid { + return fmt.Errorf("credentials are not valid") + } + + if err := auth.Save(creds); err != nil { + return fmt.Errorf("save credentials: %w", err) + } + + if creds.Method == "pat" && result.Username != "" { + fmt.Printf("✓ Authenticated as @%s\n", result.Username) + } + if creds.Method == "pat" && len(result.Scopes) > 0 { + fmt.Printf("✓ Scopes: %s\n", strings.Join(result.Scopes, ", ")) + } + fmt.Printf("✓ Credentials saved to %s\n", auth.FilePath()) + + return nil +} diff --git a/internal/cli/login_wizard.go b/internal/cli/login_wizard.go new file mode 100644 index 0000000..27ab4ac --- /dev/null +++ b/internal/cli/login_wizard.go @@ -0,0 +1,118 @@ +package cli + +import ( + "bufio" + "fmt" + "strconv" + "strings" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/auth" + "github.com/spf13/cobra" +) + +func interactiveLogin(cmd *cobra.Command, reader *bufio.Reader) error { + fmt.Println() + fmt.Println("? 
Authentication method") + fmt.Println(" 1) Personal Access Token (PAT)") + fmt.Println(" 2) GitHub App") + fmt.Print("> ") + + choice, err := reader.ReadString('\n') + if err != nil { + return fmt.Errorf("read choice: %w", err) + } + choice = strings.TrimSpace(choice) + + switch choice { + case "1": + return interactivePAT(cmd, reader) + case "2": + return interactiveApp(cmd, reader) + default: + return fmt.Errorf("invalid choice: %q (expected 1 or 2)", choice) + } +} + +func interactivePAT(cmd *cobra.Command, reader *bufio.Reader) error { + fmt.Print("? GitHub PAT: ") + token, err := reader.ReadString('\n') + if err != nil { + return fmt.Errorf("read token: %w", err) + } + token = strings.TrimSpace(token) + if token == "" { + return fmt.Errorf("token cannot be empty") + } + + fmt.Print("? GitHub URL (org or repo): ") + url, err := reader.ReadString('\n') + if err != nil { + return fmt.Errorf("read url: %w", err) + } + url = strings.TrimSpace(url) + if url == "" { + return fmt.Errorf("URL cannot be empty") + } + + creds := &auth.Credentials{ + Method: "pat", + GitHubURL: url, + PAT: token, + } + + return validateAndSave(cmd, creds) +} + +func interactiveApp(cmd *cobra.Command, reader *bufio.Reader) error { + fmt.Print("? GitHub App Client ID: ") + clientID, err := reader.ReadString('\n') + if err != nil { + return fmt.Errorf("read client ID: %w", err) + } + clientID = strings.TrimSpace(clientID) + if clientID == "" { + return fmt.Errorf("client ID cannot be empty") + } + + fmt.Print("? Installation ID: ") + installIDStr, err := reader.ReadString('\n') + if err != nil { + return fmt.Errorf("read installation ID: %w", err) + } + installID, err := strconv.ParseInt(strings.TrimSpace(installIDStr), 10, 64) + if err != nil { + return fmt.Errorf("parse installation ID: %w", err) + } + + fmt.Print("? 
Private key path (.pem): ") + keyPath, err := reader.ReadString('\n') + if err != nil { + return fmt.Errorf("read private key path: %w", err) + } + keyPath = strings.TrimSpace(keyPath) + if keyPath == "" { + return fmt.Errorf("private key path cannot be empty") + } + + fmt.Print("? GitHub URL: ") + url, err := reader.ReadString('\n') + if err != nil { + return fmt.Errorf("read url: %w", err) + } + url = strings.TrimSpace(url) + if url == "" { + return fmt.Errorf("URL cannot be empty") + } + + creds := &auth.Credentials{ + Method: "github_app", + GitHubURL: url, + GitHubApp: &auth.GitHubAppCreds{ + ClientID: clientID, + InstallationID: installID, + PrivateKeyPath: keyPath, + }, + } + + return validateAndSave(cmd, creds) +} diff --git a/internal/cli/logout.go b/internal/cli/logout.go new file mode 100644 index 0000000..bd2eb24 --- /dev/null +++ b/internal/cli/logout.go @@ -0,0 +1,22 @@ +package cli + +import ( + "fmt" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/auth" + "github.com/spf13/cobra" +) + +func newLogoutCmd() *cobra.Command { + return &cobra.Command{ + Use: "logout", + Short: "Remove saved credentials", + RunE: func(_ *cobra.Command, _ []string) error { + if err := auth.Remove(); err != nil { + return err + } + fmt.Println("Credentials removed") + return nil + }, + } +} diff --git a/internal/cli/purge.go b/internal/cli/purge.go new file mode 100644 index 0000000..916d641 --- /dev/null +++ b/internal/cli/purge.go @@ -0,0 +1,181 @@ +package cli + +import ( + "context" + "fmt" + "os" + "path/filepath" + "syscall" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/auth" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/config" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/github" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/launchd" + "github.com/spf13/cobra" +) + +func newPurgeCmd() *cobra.Command { + cmd := &cobra.Command{ + Use: "purge", + Short: "Stop everything, delete scale sets, clean workdirs", + RunE: 
runPurge, + } + + cmd.Flags().Duration("timeout", 5*time.Minute, "max wait for busy runners") + cmd.Flags().Bool("force", false, "don't wait for busy runners") + + return cmd +} + +func runPurge(cmd *cobra.Command, _ []string) error { + if cfgFile == "" { + return fmt.Errorf("--config is required") + } + + timeout, err := cmd.Flags().GetDuration("timeout") + if err != nil { + return fmt.Errorf("get timeout flag: %w", err) + } + + force, err := cmd.Flags().GetBool("force") + if err != nil { + return fmt.Errorf("get force flag: %w", err) + } + + stopDaemonIfRunning() + + cfg, err := config.Load(cfgFile) + if err != nil { + return fmt.Errorf("load config: %w", err) + } + + creds, _, err := auth.Load(auth.LoadOpts{TokenFlag: tokenFlag}) + if err != nil { + return fmt.Errorf("load credentials: %w", err) + } + + githubURL, err := resolveGitHubURL(creds, cfg) + if err != nil { + return err + } + + ghClient, err := github.NewClient(creds, githubURL) + if err != nil { + return fmt.Errorf("create github client: %w", err) + } + + ctx := context.Background() + deletedSets := purgeScaleSets(ctx, ghClient, cfg, force, timeout) + cleanedDirs := cleanupWorkdirs(cfg.Runner.WorkdirBase) + cleanupStateFiles(cfg.Daemon.StateDir) + + fmt.Printf("purge complete: deleted %d scale sets, cleaned %d workdirs\n", deletedSets, cleanedDirs) + return nil +} + +func stopDaemonIfRunning() { + label := launchd.DefaultLabel() + pid, running := launchd.Status(label) + if !running { + return + } + + fmt.Printf("stopping running daemon (pid=%d)...\n", pid) + sigErr := syscall.Kill(pid, syscall.SIGTERM) + if sigErr != nil { + fmt.Printf(" stop warning: %v\n", sigErr) + } else { + waitForExit(pid, 30*time.Second) + } + + uninstallErr := launchd.Uninstall(label) + if uninstallErr != nil { + fmt.Printf(" uninstall warning: %v\n", uninstallErr) + } +} + +func purgeScaleSets(ctx context.Context, ghClient *github.Client, cfg *config.Config, force bool, timeout time.Duration) int { + deletedSets := 0 + for _, 
g := range cfg.Groups { + fmt.Printf("purging scale set %q...\n", g.Name) + ss, getErr := ghClient.GetScaleSet(ctx, 1, g.Name) + if getErr != nil { + fmt.Printf(" scale set %q not found, skipping\n", g.Name) + continue + } + if ss == nil { + continue + } + + if !force { + waitForIdleRunners(ctx, ghClient, ss.ID, g.Name, timeout) + } + + if delErr := ghClient.DeleteScaleSet(ctx, ss.ID); delErr != nil { + fmt.Printf(" failed to delete scale set %q: %v\n", g.Name, delErr) + continue + } + deletedSets++ + fmt.Printf(" deleted scale set %q (id=%d)\n", g.Name, ss.ID) + } + return deletedSets +} + +func waitForIdleRunners(ctx context.Context, ghClient *github.Client, scaleSetID int, name string, timeout time.Duration) { + deadline := time.Now().Add(timeout) + pollInterval := 5 * time.Second + + for time.Now().Before(deadline) { + ss, err := ghClient.GetScaleSetByID(ctx, scaleSetID) + if err != nil { + fmt.Printf(" warning: cannot check scale set %q status: %v\n", name, err) + return + } + + if ss.Statistics == nil || ss.Statistics.TotalBusyRunners == 0 { + return + } + + fmt.Printf(" waiting for %d busy runners in %q...\n", ss.Statistics.TotalBusyRunners, name) + + select { + case <-ctx.Done(): + return + case <-time.After(pollInterval): + } + } + + fmt.Printf(" timeout waiting for idle runners in %q, proceeding with delete\n", name) +} + +func cleanupWorkdirs(workdirBase string) int { + entries, err := os.ReadDir(workdirBase) + if err != nil { + return 0 + } + + count := 0 + for _, e := range entries { + if !e.IsDir() { + continue + } + p := filepath.Join(workdirBase, e.Name()) + if rmErr := os.RemoveAll(p); rmErr != nil { + fmt.Printf(" failed to remove workdir %s: %v\n", p, rmErr) + continue + } + count++ + } + return count +} + +func cleanupStateFiles(stateDir string) { + for _, name := range []string{"daemon.pid", "daemon.state.json", "ghr.sock"} { + p := filepath.Join(stateDir, name) + rmErr := os.Remove(p) + if rmErr != nil && !os.IsNotExist(rmErr) { + fmt.Printf(" 
failed to remove %s: %v\n", p, rmErr) + } + } +} diff --git a/internal/cli/restart.go b/internal/cli/restart.go new file mode 100644 index 0000000..64d61bf --- /dev/null +++ b/internal/cli/restart.go @@ -0,0 +1,46 @@ +package cli + +import ( + "fmt" + "syscall" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/launchd" + "github.com/spf13/cobra" +) + +func newRestartCmd() *cobra.Command { + cmd := &cobra.Command{ + Use: "restart", + Short: "Restart the ghr daemon", + RunE: runRestart, + } + return cmd +} + +func runRestart(cmd *cobra.Command, args []string) error { + if cfgFile == "" { + stateDir := resolveStateDir() + if state, err := readDaemonState(stateDir); err == nil && state.ConfigPath != "" { + cfgFile = state.ConfigPath + } + } + + label := launchd.DefaultLabel() + if launchd.IsRunning(label) { + pid, _ := launchd.Status(label) + fmt.Printf("stopping ghr (pid=%d)...\n", pid) + + if err := syscall.Kill(pid, syscall.SIGTERM); err != nil { + fmt.Printf("stop warning: %v\n", err) + } else { + waitForExit(pid, 30*time.Second) + } + + if err := launchd.Uninstall(label); err != nil { + fmt.Printf("uninstall warning: %v\n", err) + } + } + + return runStart(cmd, args) +} diff --git a/internal/cli/root.go b/internal/cli/root.go new file mode 100644 index 0000000..d9fc581 --- /dev/null +++ b/internal/cli/root.go @@ -0,0 +1,42 @@ +package cli + +import "github.com/spf13/cobra" + +var ( + cfgFile string + tokenFlag string + logLevel string +) + +func newRootCmd() *cobra.Command { + cmd := &cobra.Command{ + Use: "ghr", + Short: "GitHub Actions runner controller for macOS", + Long: "ghr manages ephemeral GitHub Actions runners via scale sets on macOS.", + SilenceUsage: true, + SilenceErrors: true, + } + + cmd.PersistentFlags().StringVar(&cfgFile, "config", "", "path to config file") + cmd.PersistentFlags().StringVar(&tokenFlag, "token", "", "override auth token for this invocation") + cmd.PersistentFlags().StringVar(&logLevel, "log-level", "", "override 
log level (debug/info/warn/error)") + + cmd.AddCommand( + newStartCmd(), + newStopCmd(), + newRestartCmd(), + newRunCmd(), + newStatusCmd(), + newPurgeCmd(), + newLoginCmd(), + newLogoutCmd(), + newAuthCmd(), + newVersionCmd(), + ) + + return cmd +} + +func Execute() error { + return newRootCmd().Execute() +} diff --git a/internal/cli/run.go b/internal/cli/run.go new file mode 100644 index 0000000..9f0526a --- /dev/null +++ b/internal/cli/run.go @@ -0,0 +1,129 @@ +package cli + +import ( + "context" + "fmt" + "os/signal" + "syscall" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/auth" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/config" + "github.com/oklog/run" + "github.com/spf13/cobra" +) + +func newRunCmd() *cobra.Command { + cmd := &cobra.Command{ + Use: "run", + Short: "Run the ghr daemon in foreground", + RunE: runRun, + } + return cmd +} + +func runRun(_ *cobra.Command, _ []string) error { + if cfgFile == "" { + return fmt.Errorf("--config is required") + } + + cfg, err := config.Load(cfgFile) + if err != nil { + return fmt.Errorf("load config: %w", err) + } + + creds, source, err := auth.Load(auth.LoadOpts{TokenFlag: tokenFlag}) + if err != nil { + return err + } + + githubURL, err := resolveGitHubURL(creds, cfg) + if err != nil { + return err + } + + if logLevel != "" { + cfg.Logging.Level = logLevel + } + + d, err := buildDaemon(cfg, creds, githubURL) + if err != nil { + return err + } + defer d.logMgr.Close() + + d.logger.Info("ghr starting", + "config", cfgFile, + "groups", len(cfg.Groups), + "auth_source", source, + "auth_method", creds.Method, + ) + + pidPath := pidFilePath(cfg.Daemon.StateDir) + if err := writePIDFile(pidPath); err != nil { + return fmt.Errorf("write pid file: %w", err) + } + defer removePIDFile(pidPath) + + if err := writeDaemonState(cfg.Daemon.StateDir, cfgFile); err != nil { + return fmt.Errorf("write daemon state: %w", err) + } + defer removeDaemonState(cfg.Daemon.StateDir) + + return runDaemonGroup(d) +} + 
+func runDaemonGroup(d *daemon) error { + var g run.Group + + { + ctx, cancel := context.WithCancel(context.Background()) + g.Add( + func() error { return d.ctrl.Run(ctx) }, + func(error) { cancel() }, + ) + } + + { + ctx, cancel := context.WithCancel(context.Background()) + g.Add( + func() error { return d.health.Run(ctx) }, + func(error) { cancel() }, + ) + } + + { + ctx, cancel := context.WithCancel(context.Background()) + g.Add( + func() error { return d.api.Run(ctx) }, + func(error) { cancel() }, + ) + } + + { + ctx, cancel := context.WithCancel(context.Background()) + g.Add( + func() error { return d.logMgr.StartCleanupScheduler(ctx) }, + func(error) { cancel() }, + ) + } + + { + ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM) + g.Add( + func() error { + <-ctx.Done() + return nil + }, + func(error) { cancel() }, + ) + } + + groupErr := g.Run() + + shutdownCtx, cancel := context.WithTimeout(context.Background(), d.cfg.Daemon.ShutdownTimeout.Duration) + defer cancel() + d.ctrl.Shutdown(shutdownCtx) + + d.logger.Info("ghr stopped") + return groupErr +} diff --git a/internal/cli/start.go b/internal/cli/start.go new file mode 100644 index 0000000..458f07f --- /dev/null +++ b/internal/cli/start.go @@ -0,0 +1,111 @@ +package cli + +import ( + "fmt" + "os" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/config" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/launchd" + "github.com/spf13/cobra" +) + +func newStartCmd() *cobra.Command { + cmd := &cobra.Command{ + Use: "start", + Short: "Start the ghr daemon via launchd", + RunE: runStart, + } + + cmd.Flags().Bool("foreground", false, "run in foreground (same as 'ghr run')") + + return cmd +} + +func runStart(cmd *cobra.Command, args []string) error { + foreground, err := cmd.Flags().GetBool("foreground") + if err == nil && foreground { + return runRun(cmd, args) + } + + if cfgFile == "" { + return fmt.Errorf("--config is required") + } + + cfg, err := 
config.Load(cfgFile) + if err != nil { + return fmt.Errorf("load config: %w", err) + } + + label := launchd.DefaultLabel() + if launchd.IsRunning(label) { + pid, _ := launchd.Status(label) + fmt.Printf("ghr is already running (pid=%d)\n", pid) + return nil + } + + binaryPath, err := os.Executable() + if err != nil { + return fmt.Errorf("resolve binary path: %w", err) + } + + svcCfg := launchd.ServiceConfig{ + Label: label, + BinaryPath: binaryPath, + ConfigPath: cfgFile, + LogDir: cfg.Logging.Dir, + StateDir: cfg.Daemon.StateDir, + } + + if err := launchd.Install(&svcCfg); err != nil { + return fmt.Errorf("install launchd service: %w", err) + } + + pid := waitForPID(cfg.Daemon.StateDir, 5*time.Second) + + serviceType := "LaunchAgent" + if os.Getuid() == 0 { + serviceType = "LaunchDaemon" + } + + if pid > 0 { + fmt.Printf("ghr started (pid=%d)\n", pid) + } else { + fmt.Println("ghr started") + } + fmt.Printf("Service: %s (%s)\n", label, serviceType) + fmt.Printf("Config: %s\n", cfgFile) + fmt.Printf("Groups: %d", len(cfg.Groups)) + if len(cfg.Groups) > 0 { + fmt.Print(" (") + for i, g := range cfg.Groups { + if i > 0 { + fmt.Print(", ") + } + fmt.Print(g.Name) + } + fmt.Print(")") + } + fmt.Println() + fmt.Printf("Logs: %s\n", cfg.Logging.Dir) + + return nil +} + +func waitForPID(stateDir string, timeout time.Duration) int { + pidPath := pidFilePath(stateDir) + deadline := time.Now().Add(timeout) + + for time.Now().Before(deadline) { + data, err := os.ReadFile(pidPath) + if err == nil && len(data) > 0 { + var pid int + if _, scanErr := fmt.Sscanf(string(data), "%d", &pid); scanErr == nil && pid > 0 { + return pid + } + } + time.Sleep(500 * time.Millisecond) + } + + return 0 +} diff --git a/internal/cli/state.go b/internal/cli/state.go new file mode 100644 index 0000000..3dc7d42 --- /dev/null +++ b/internal/cli/state.go @@ -0,0 +1,62 @@ +package cli + +import ( + "encoding/json" + "fmt" + "os" + "path/filepath" + "time" +) + +const stateFileName = "daemon.state.json" 
+ +type daemonState struct { + ConfigPath string `json:"config_path"` + StartedAt time.Time `json:"started_at"` + PID int `json:"pid"` + Groups map[string]int `json:"groups"` +} + +func writeDaemonState(stateDir, configPath string) error { + state := daemonState{ + ConfigPath: configPath, + StartedAt: time.Now(), + PID: os.Getpid(), + Groups: make(map[string]int), + } + + data, err := json.MarshalIndent(state, "", " ") + if err != nil { + return fmt.Errorf("marshal daemon state: %w", err) + } + + dir := stateDir + if err := os.MkdirAll(dir, 0o755); err != nil { + return fmt.Errorf("create state directory %s: %w", dir, err) + } + + path := filepath.Join(dir, stateFileName) + if err := os.WriteFile(path, data, 0o644); err != nil { + return fmt.Errorf("write daemon state %s: %w", path, err) + } + return nil +} + +func readDaemonState(stateDir string) (*daemonState, error) { + path := filepath.Join(stateDir, stateFileName) + data, err := os.ReadFile(path) + if err != nil { + return nil, fmt.Errorf("read daemon state %s: %w", path, err) + } + + var state daemonState + if err := json.Unmarshal(data, &state); err != nil { + return nil, fmt.Errorf("parse daemon state %s: %w", path, err) + } + return &state, nil +} + +func removeDaemonState(stateDir string) { + path := filepath.Join(stateDir, stateFileName) + _ = os.Remove(path) +} diff --git a/internal/cli/status.go b/internal/cli/status.go new file mode 100644 index 0000000..e5af2f5 --- /dev/null +++ b/internal/cli/status.go @@ -0,0 +1,134 @@ +package cli + +import ( + "context" + "fmt" + "io" + "net" + "net/http" + "os" + "path/filepath" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/config" + "github.com/spf13/cobra" +) + +func newStatusCmd() *cobra.Command { + cmd := &cobra.Command{ + Use: "status", + Short: "Show ghr daemon status", + RunE: runStatus, + } + + cmd.Flags().Bool("json", false, "output in JSON format") + cmd.Flags().Bool("watch", false, "live refresh mode") + 
cmd.Flags().Duration("interval", 5*time.Second, "refresh interval for --watch") + + return cmd +} + +func runStatus(cmd *cobra.Command, _ []string) error { + jsonOutput, err := cmd.Flags().GetBool("json") + if err != nil { + return fmt.Errorf("get json flag: %w", err) + } + + watch, err := cmd.Flags().GetBool("watch") + if err != nil { + return fmt.Errorf("get watch flag: %w", err) + } + + interval, err := cmd.Flags().GetDuration("interval") + if err != nil { + return fmt.Errorf("get interval flag: %w", err) + } + + stateDir := resolveStateDir() + socketPath := filepath.Join(stateDir, "ghr.sock") + + if !watch { + return renderOnce(socketPath, stateDir, jsonOutput) + } + + return runWatch(cmd.Context(), socketPath, stateDir, jsonOutput, interval) +} + +func renderOnce(socketPath, stateDir string, jsonOutput bool) error { + resp, socketErr := querySocket(socketPath, "/status") + if socketErr != nil { + return showOfflineStatus(stateDir, jsonOutput) + } + + if jsonOutput { + fmt.Println(string(resp)) + return nil + } + + return displayStatus(resp) +} + +func runWatch(ctx context.Context, socketPath, stateDir string, jsonOutput bool, interval time.Duration) error { + ticker := time.NewTicker(interval) + defer ticker.Stop() + + for { + if !jsonOutput { + fmt.Print("\033[H\033[2J") + } + + renderErr := renderOnce(socketPath, stateDir, jsonOutput) + if renderErr != nil && !jsonOutput { + fmt.Fprintf(os.Stderr, "status error: %v\n", renderErr) + } + + select { + case <-ctx.Done(): + return nil + case <-ticker.C: + } + } +} + +func resolveStateDir() string { + if cfgFile != "" { + cfg, err := config.Load(cfgFile) + if err == nil { + return cfg.Daemon.StateDir + } + } + + if os.Getuid() == 0 { + return "/var/lib/ghr/state" + } + + home, err := os.UserHomeDir() + if err != nil { + return "." 
+ } + return filepath.Join(home, ".local", "state", "ghr") +} + +func querySocket(socketPath, endpoint string) ([]byte, error) { + client := &http.Client{ + Transport: &http.Transport{ + DialContext: func(_ context.Context, _, _ string) (net.Conn, error) { + return net.Dial("unix", socketPath) + }, + }, + Timeout: 5 * time.Second, + } + + resp, err := client.Get("http://unix" + endpoint) + if err != nil { + return nil, fmt.Errorf("connect to daemon socket: %w", err) + } + defer resp.Body.Close() + + body, err := io.ReadAll(resp.Body) + if err != nil { + return nil, fmt.Errorf("read socket response: %w", err) + } + + return body, nil +} diff --git a/internal/cli/status_render.go b/internal/cli/status_render.go new file mode 100644 index 0000000..95da2ff --- /dev/null +++ b/internal/cli/status_render.go @@ -0,0 +1,169 @@ +package cli + +import ( + "encoding/json" + "fmt" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/launchd" +) + +type statusResponse struct { + Groups map[string][]statusRunner `json:"groups"` + Health statusHealth `json:"health"` +} + +type statusRunner struct { + Name string `json:"name"` + State string `json:"state"` + PID int `json:"pid"` + JobName string `json:"job_name"` +} + +type statusHealthIssue struct { + Level string `json:"level"` + Type string `json:"type"` + Group string `json:"group"` + Runner string `json:"runner"` + Message string `json:"message"` +} + +type statusHealth struct { + LastCheck string `json:"last_check"` + Issues []statusHealthIssue `json:"issues"` +} + +func showOfflineStatus(stateDir string, jsonOutput bool) error { + label := launchd.DefaultLabel() + pid, running := launchd.Status(label) + + if jsonOutput { + status := map[string]interface{}{ + "status": "stopped", + "running": running, + "pid": pid, + } + data, err := json.MarshalIndent(status, "", " ") + if err != nil { + return fmt.Errorf("marshal status: %w", err) + } + fmt.Println(string(data)) + return nil + } + + fmt.Println("Service") + if 
running { + fmt.Printf(" Status: running (via launchd, pid=%d)\n", pid) + fmt.Println(" Note: daemon socket not available") + } else { + fmt.Println(" Status: stopped") + } + + if state, readErr := readDaemonState(stateDir); readErr == nil { + fmt.Printf(" Config: %s\n", state.ConfigPath) + fmt.Printf(" Started: %s\n", state.StartedAt.Format(time.RFC3339)) + } + + fmt.Println() + fmt.Println("No active groups or runners.") + fmt.Println("Use 'ghr start' to start the daemon.") + + return nil +} + +func displayStatus(data []byte) error { + var status statusResponse + if err := json.Unmarshal(data, &status); err != nil { + return fmt.Errorf("parse status response: %w", err) + } + + label := launchd.DefaultLabel() + pid, _ := launchd.Status(label) + + renderServiceSection(pid, "") + renderGroupsTable(status.Groups) + renderRunnersTable(status.Groups) + renderHealthSection(status.Health) + + return nil +} + +func renderServiceSection(pid int, configPath string) { + fmt.Println("Service") + fmt.Println(" Status: running") + if pid > 0 { + fmt.Printf(" PID: %d\n", pid) + } + if configPath != "" { + fmt.Printf(" Config: %s\n", configPath) + } + fmt.Println() +} + +func renderGroupsTable(groups map[string][]statusRunner) { + fmt.Println("Groups") + fmt.Printf(" %-20s %5s %7s %5s %8s\n", "Name", "Max", "Running", "Idle", "Health") + fmt.Printf(" %-20s %5s %7s %5s %8s\n", "----", "---", "-------", "----", "------") + + totalRunning := 0 + totalIdle := 0 + + for group, runners := range groups { + running := 0 + idle := 0 + for _, r := range runners { + if r.State == "busy" { + running++ + } else { + idle++ + } + } + totalRunning += running + totalIdle += idle + fmt.Printf(" %-20s %5d %7d %5d %8s\n", group, len(runners), running, idle, "OK") + } + + fmt.Printf(" Total: running=%d idle=%d\n", totalRunning, totalIdle) + fmt.Println() +} + +func renderRunnersTable(groups map[string][]statusRunner) { + hasRunners := false + for _, runners := range groups { + if len(runners) > 0 { + 
hasRunners = true + break + } + } + if !hasRunners { + return + } + + fmt.Println("Runners") + fmt.Printf(" %-30s %-8s %-25s %6s\n", "Runner", "Status", "Job", "PID") + fmt.Printf(" %-30s %-8s %-25s %6s\n", "------", "------", "---", "---") + + for _, runners := range groups { + for _, r := range runners { + job := r.JobName + if job == "" { + job = "-" + } + fmt.Printf(" %-30s %-8s %-25s %6d\n", r.Name, r.State, job, r.PID) + } + } + fmt.Println() +} + +func renderHealthSection(h statusHealth) { + fmt.Println("Health") + if h.LastCheck != "" { + fmt.Printf(" Last check: %s\n", h.LastCheck) + } else { + fmt.Println(" Last check: n/a") + } + fmt.Printf(" Issues: %d\n", len(h.Issues)) + for _, issue := range h.Issues { + fmt.Printf(" [%s] %s: %s\n", issue.Level, issue.Type, issue.Message) + } +} diff --git a/internal/cli/stop.go b/internal/cli/stop.go new file mode 100644 index 0000000..a14e433 --- /dev/null +++ b/internal/cli/stop.go @@ -0,0 +1,78 @@ +package cli + +import ( + "fmt" + "syscall" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/launchd" + "github.com/spf13/cobra" +) + +func newStopCmd() *cobra.Command { + cmd := &cobra.Command{ + Use: "stop", + Short: "Stop the ghr daemon", + RunE: runStop, + } + + cmd.Flags().Duration("timeout", 30*time.Second, "max wait for graceful shutdown") + cmd.Flags().Bool("force", false, "skip SIGTERM, send SIGKILL immediately") + + return cmd +} + +func runStop(cmd *cobra.Command, _ []string) error { + timeout, err := cmd.Flags().GetDuration("timeout") + if err != nil { + return fmt.Errorf("get timeout flag: %w", err) + } + + force, err := cmd.Flags().GetBool("force") + if err != nil { + return fmt.Errorf("get force flag: %w", err) + } + + label := launchd.DefaultLabel() + pid, running := launchd.Status(label) + if !running { + fmt.Println("ghr is not running") + return nil + } + + if force { + if err := syscall.Kill(pid, syscall.SIGKILL); err != nil { + return fmt.Errorf("send SIGKILL to pid %d: %w", pid, 
err) + } + } else { + if err := syscall.Kill(pid, syscall.SIGTERM); err != nil { + return fmt.Errorf("send SIGTERM to pid %d: %w", pid, err) + } + + if !waitForExit(pid, timeout) { + fmt.Println("graceful shutdown timed out, sending SIGKILL") + if err := syscall.Kill(pid, syscall.SIGKILL); err != nil { + return fmt.Errorf("send SIGKILL to pid %d: %w", pid, err) + } + } + } + + uninstallErr := launchd.Uninstall(label) + if uninstallErr != nil { + return fmt.Errorf("uninstall launchd service: %w", uninstallErr) + } + + fmt.Println("ghr stopped") + return nil +} + +func waitForExit(pid int, timeout time.Duration) bool { + deadline := time.Now().Add(timeout) + for time.Now().Before(deadline) { + if err := syscall.Kill(pid, 0); err != nil { + return true + } + time.Sleep(500 * time.Millisecond) + } + return false +} diff --git a/internal/cli/version.go b/internal/cli/version.go new file mode 100644 index 0000000..abb3c91 --- /dev/null +++ b/internal/cli/version.go @@ -0,0 +1,23 @@ +package cli + +import ( + "fmt" + + "github.com/spf13/cobra" +) + +var ( + version = "dev" + commit = "none" + date = "unknown" +) + +func newVersionCmd() *cobra.Command { + return &cobra.Command{ + Use: "version", + Short: "Print version information", + Run: func(_ *cobra.Command, _ []string) { + fmt.Printf("ghr %s (commit: %s, built: %s)\n", version, commit, date) + }, + } +} diff --git a/internal/config/bytesize.go b/internal/config/bytesize.go new file mode 100644 index 0000000..727a116 --- /dev/null +++ b/internal/config/bytesize.go @@ -0,0 +1,61 @@ +package config + +import ( + "fmt" + "strconv" + "strings" +) + +const ( + bytesPerKB int64 = 1000 + bytesPerMB int64 = 1000 * 1000 + bytesPerGB int64 = 1000 * 1000 * 1000 + bytesPerTB int64 = 1000 * 1000 * 1000 * 1000 +) + +func ParseByteSize(s string) (int64, error) { + s = strings.TrimSpace(s) + if s == "" { + return 0, fmt.Errorf("parse byte size: empty string") + } + + upper := strings.ToUpper(s) + + suffixes := []struct { + suffix 
string + multiplier int64 + }{ + {"TB", bytesPerTB}, + {"GB", bytesPerGB}, + {"MB", bytesPerMB}, + {"KB", bytesPerKB}, + {"B", 1}, + } + + for _, entry := range suffixes { + if !strings.HasSuffix(upper, entry.suffix) { + continue + } + numStr := strings.TrimSpace(s[:len(s)-len(entry.suffix)]) + if numStr == "" { + return 0, fmt.Errorf("parse byte size %q: missing numeric value", s) + } + n, err := strconv.ParseFloat(numStr, 64) + if err != nil { + return 0, fmt.Errorf("parse byte size %q: %w", s, err) + } + if n < 0 { + return 0, fmt.Errorf("parse byte size %q: negative value", s) + } + return int64(n * float64(entry.multiplier)), nil + } + + n, err := strconv.ParseInt(s, 10, 64) + if err != nil { + return 0, fmt.Errorf("parse byte size %q: %w", s, err) + } + if n < 0 { + return 0, fmt.Errorf("parse byte size %q: negative value", s) + } + return n, nil +} diff --git a/internal/config/bytesize_test.go b/internal/config/bytesize_test.go new file mode 100644 index 0000000..4abdb78 --- /dev/null +++ b/internal/config/bytesize_test.go @@ -0,0 +1,71 @@ +package config + +import ( + "testing" +) + +func TestParseByteSize(t *testing.T) { + tests := []struct { + name string + input string + want int64 + wantErr bool + }{ + // Raw byte values. + {name: "numeric only", input: "1024", want: 1024}, + {name: "zero", input: "0", want: 0}, + {name: "explicit B suffix", input: "1B", want: 1}, + {name: "large bytes", input: "999999", want: 999999}, + + // KB (1000-based). + {name: "1KB", input: "1KB", want: 1000}, + {name: "lowercase kb", input: "1kb", want: 1000}, + {name: "mixed case Kb", input: "1Kb", want: 1000}, + {name: "500KB", input: "500KB", want: 500_000}, + + // MB. + {name: "1MB", input: "1MB", want: 1_000_000}, + {name: "500MB", input: "500MB", want: 500_000_000}, + + // GB. + {name: "1GB", input: "1GB", want: 1_000_000_000}, + {name: "10GB", input: "10GB", want: 10_000_000_000}, + + // TB. 
+ {name: "1TB", input: "1TB", want: 1_000_000_000_000}, + {name: "2TB", input: "2TB", want: 2_000_000_000_000}, + + // Fractional values (supported for suffixed inputs via ParseFloat). + {name: "1.5GB", input: "1.5GB", want: 1_500_000_000}, + {name: "0.5MB", input: "0.5MB", want: 500_000}, + + // Whitespace handling. + {name: "leading/trailing spaces", input: " 100MB ", want: 100_000_000}, + + // Error cases. + {name: "empty string", input: "", wantErr: true}, + {name: "pure alpha", input: "abc", wantErr: true}, + {name: "negative GB", input: "-1GB", wantErr: true}, + {name: "negative raw", input: "-100", wantErr: true}, + {name: "suffix only KB", input: "KB", wantErr: true}, + {name: "suffix only B", input: "B", wantErr: true}, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + got, err := ParseByteSize(tt.input) + if tt.wantErr { + if err == nil { + t.Errorf("ParseByteSize(%q) = %d, want error", tt.input, got) + } + return + } + if err != nil { + t.Fatalf("ParseByteSize(%q) unexpected error: %v", tt.input, err) + } + if got != tt.want { + t.Errorf("ParseByteSize(%q) = %d, want %d", tt.input, got, tt.want) + } + }) + } +} diff --git a/internal/config/loader.go b/internal/config/loader.go new file mode 100644 index 0000000..1d69109 --- /dev/null +++ b/internal/config/loader.go @@ -0,0 +1,148 @@ +package config + +import ( + "fmt" + "os" + "path/filepath" + "strings" + "time" + + "github.com/joho/godotenv" + "gopkg.in/yaml.v3" +) + +func Load(path string) (*Config, error) { + _ = godotenv.Load() + configDir := filepath.Dir(path) + _ = godotenv.Load(filepath.Join(configDir, ".env")) + + data, err := os.ReadFile(path) + if err != nil { + return nil, fmt.Errorf("read config file %q: %w", path, err) + } + + cfg := &Config{} + if err := yaml.Unmarshal(data, cfg); err != nil { + return nil, fmt.Errorf("parse config file %q: %w", path, err) + } + + applyDefaults(cfg) + resolveEnvVars(cfg) + + if err := validate(cfg); err != nil { + return nil, 
fmt.Errorf("validate config: %w", err) + } + + return cfg, nil +} + +func applyDefaults(cfg *Config) { + isRoot := os.Getuid() == 0 + + var dataDir, logDir, stateDir string + if isRoot { + dataDir = "/var/lib/ghr" + logDir = "/var/log/ghr" + stateDir = "/var/lib/ghr/state" + } else { + home, err := os.UserHomeDir() + if err != nil { + home = "." + } + dataDir = filepath.Join(home, ".local", "share", "ghr") + logDir = filepath.Join(home, ".local", "share", "ghr", "logs") + stateDir = filepath.Join(home, ".local", "state", "ghr") + } + + if cfg.GitHub.RunnerGroup == "" { + cfg.GitHub.RunnerGroup = "default" + } + + if cfg.Runner.Version == "" { + cfg.Runner.Version = "latest" + } + if cfg.Runner.CacheDir == "" { + cfg.Runner.CacheDir = filepath.Join(dataDir, "cache") + } + if cfg.Runner.WorkdirBase == "" { + cfg.Runner.WorkdirBase = filepath.Join(dataDir, "runners") + } + + if isHealthZero(cfg.Health) { + cfg.Health.Enabled = true + } + if cfg.Health.CheckInterval.Duration == 0 { + cfg.Health.CheckInterval = Duration{30 * time.Second} + } + if cfg.Health.RunnerTimeout.Duration == 0 { + cfg.Health.RunnerTimeout = Duration{2 * time.Hour} + } + if cfg.Health.DivergenceTimeout.Duration == 0 { + cfg.Health.DivergenceTimeout = Duration{5 * time.Minute} + } + if cfg.Health.MaxConsecutiveFailures == 0 { + cfg.Health.MaxConsecutiveFailures = 5 + } + if cfg.Health.FailureCooldown.Duration == 0 { + cfg.Health.FailureCooldown = Duration{1 * time.Minute} + } + if cfg.Health.MinDiskSpace == "" { + cfg.Health.MinDiskSpace = "1GB" + } + + if cfg.Logging.Level == "" { + cfg.Logging.Level = "info" + } + if cfg.Logging.Format == "" { + cfg.Logging.Format = "text" + } + if cfg.Logging.Dir == "" { + cfg.Logging.Dir = logDir + } + if cfg.Logging.RetentionDays == 0 { + cfg.Logging.RetentionDays = 30 + } + if cfg.Logging.RunnerOutput == nil { + t := true + cfg.Logging.RunnerOutput = &t + } + + if cfg.Notifications.Discord.Username == "" { + cfg.Notifications.Discord.Username = "ghr" + } + + 
if cfg.Daemon.StateDir == "" { + cfg.Daemon.StateDir = stateDir + } + if cfg.Daemon.ShutdownTimeout.Duration == 0 { + cfg.Daemon.ShutdownTimeout = Duration{30 * time.Second} + } +} + +func resolveEnvVars(cfg *Config) { + if v := os.Getenv("GHR_DISCORD_WEBHOOK_URL"); v != "" { + cfg.Notifications.Discord.WebhookURL = v + } + if v := os.Getenv("GHR_UPTIME_KUMA_URL"); v != "" { + cfg.Monitoring.UptimeKuma.BaseURL = v + } + if v := os.Getenv("GHR_UPTIME_KUMA_DAEMON_TOKEN"); v != "" { + cfg.Monitoring.UptimeKuma.DaemonToken = v + } + resolveUptimeKumaGroupTokens(cfg) +} + +func resolveUptimeKumaGroupTokens(cfg *Config) { + if len(cfg.Groups) == 0 { + return + } + if cfg.Monitoring.UptimeKuma.GroupTokens == nil { + cfg.Monitoring.UptimeKuma.GroupTokens = make(map[string]string, len(cfg.Groups)) + } + for _, g := range cfg.Groups { + envKey := "GHR_UPTIME_KUMA_TOKEN_" + strings.ToUpper(strings.ReplaceAll(g.Name, "-", "_")) + if v := os.Getenv(envKey); v != "" { + cfg.Monitoring.UptimeKuma.GroupTokens[g.Name] = v + } + } +} diff --git a/internal/config/loader_test.go b/internal/config/loader_test.go new file mode 100644 index 0000000..e6f19ee --- /dev/null +++ b/internal/config/loader_test.go @@ -0,0 +1,563 @@ +package config + +import ( + "os" + "path/filepath" + "strings" + "testing" + "time" +) + +// writeConfig writes a YAML string to a temp file and returns its path. +func writeConfig(t *testing.T, yaml string) string { + t.Helper() + dir := t.TempDir() + path := filepath.Join(dir, "config.yaml") + if err := os.WriteFile(path, []byte(yaml), 0644); err != nil { + t.Fatalf("write temp config: %v", err) + } + return path +} + +func TestLoad_MinimalConfig(t *testing.T) { + yaml := ` +groups: + - name: test-group + max_runners: 2 +` + cfg, err := Load(writeConfig(t, yaml)) + if err != nil { + t.Fatalf("Load() unexpected error: %v", err) + } + + // Group values. 
+ if len(cfg.Groups) != 1 { + t.Fatalf("expected 1 group, got %d", len(cfg.Groups)) + } + g := cfg.Groups[0] + if g.Name != "test-group" { + t.Errorf("group name = %q, want %q", g.Name, "test-group") + } + if g.MaxRunners != 2 { + t.Errorf("max_runners = %d, want 2", g.MaxRunners) + } + + // Defaults: runner. + if cfg.Runner.Version != "latest" { + t.Errorf("runner.version = %q, want %q", cfg.Runner.Version, "latest") + } + + // Defaults: github. + if cfg.GitHub.RunnerGroup != "default" { + t.Errorf("github.runner_group = %q, want %q", cfg.GitHub.RunnerGroup, "default") + } + + // Defaults: health. + if !cfg.Health.Enabled { + t.Error("health.enabled = false, want true (default)") + } + if cfg.Health.CheckInterval.Duration != 30*time.Second { + t.Errorf("health.check_interval = %v, want 30s", cfg.Health.CheckInterval.Duration) + } + if cfg.Health.RunnerTimeout.Duration != 2*time.Hour { + t.Errorf("health.runner_timeout = %v, want 2h", cfg.Health.RunnerTimeout.Duration) + } + if cfg.Health.DivergenceTimeout.Duration != 5*time.Minute { + t.Errorf("health.divergence_timeout = %v, want 5m", cfg.Health.DivergenceTimeout.Duration) + } + if cfg.Health.MaxConsecutiveFailures != 5 { + t.Errorf("health.max_consecutive_failures = %d, want 5", cfg.Health.MaxConsecutiveFailures) + } + if cfg.Health.FailureCooldown.Duration != 1*time.Minute { + t.Errorf("health.failure_cooldown = %v, want 1m", cfg.Health.FailureCooldown.Duration) + } + if cfg.Health.MinDiskSpace != "1GB" { + t.Errorf("health.min_disk_space = %q, want %q", cfg.Health.MinDiskSpace, "1GB") + } + + // Defaults: logging. 
+ if cfg.Logging.Level != "info" { + t.Errorf("logging.level = %q, want %q", cfg.Logging.Level, "info") + } + if cfg.Logging.Format != "text" { + t.Errorf("logging.format = %q, want %q", cfg.Logging.Format, "text") + } + if cfg.Logging.RetentionDays != 30 { + t.Errorf("logging.retention_days = %d, want 30", cfg.Logging.RetentionDays) + } + if cfg.Logging.RunnerOutput == nil || !*cfg.Logging.RunnerOutput { + t.Error("logging.runner_output = false/nil, want true (default)") + } + + // Defaults: notifications. + if cfg.Notifications.Discord.Username != "ghr" { + t.Errorf("notifications.discord.username = %q, want %q", cfg.Notifications.Discord.Username, "ghr") + } + + // Defaults: daemon. + if cfg.Daemon.ShutdownTimeout.Duration != 30*time.Second { + t.Errorf("daemon.shutdown_timeout = %v, want 30s", cfg.Daemon.ShutdownTimeout.Duration) + } +} + +func TestLoad_FullConfig(t *testing.T) { + yaml := ` +github: + url: "https://github.example.com" + runner_group: "custom-group" + +runner: + version: "2.320.0" + cache_dir: "/tmp/ghr-cache" + workdir_base: "/tmp/ghr-runners" + +groups: + - name: production + max_runners: 10 + min_runners: 2 + labels: + - self-hosted + - linux + runner_group: "prod-pool" + version: "2.319.0" + - name: staging + max_runners: 5 + min_runners: 0 + labels: + - staging + +health: + enabled: true + check_interval: "1m" + runner_timeout: "3h" + idle_timeout: "30m" + divergence_timeout: "10m" + max_consecutive_failures: 10 + failure_cooldown: "2m" + min_disk_space: "5GB" + +logging: + level: "debug" + format: "json" + dir: "/tmp/ghr-logs" + retention_days: 14 + runner_output: false + +notifications: + discord: + enabled: true + events: + - runner.started + - runner.failed + username: "my-bot" + mentions: + error: "<@&111>" + critical: "<@&222>" + +monitoring: + uptime_kuma: + enabled: true + degraded_threshold: 0.8 + report_health_as_ping: true + +daemon: + state_dir: "/tmp/ghr-state" + shutdown_timeout: "1m" +` + cfg, err := Load(writeConfig(t, 
yaml)) + if err != nil { + t.Fatalf("Load() unexpected error: %v", err) + } + + // GitHub. + if cfg.GitHub.URL != "https://github.example.com" { + t.Errorf("github.url = %q, want %q", cfg.GitHub.URL, "https://github.example.com") + } + if cfg.GitHub.RunnerGroup != "custom-group" { + t.Errorf("github.runner_group = %q, want %q", cfg.GitHub.RunnerGroup, "custom-group") + } + + // Runner. + if cfg.Runner.Version != "2.320.0" { + t.Errorf("runner.version = %q, want %q", cfg.Runner.Version, "2.320.0") + } + if cfg.Runner.CacheDir != "/tmp/ghr-cache" { + t.Errorf("runner.cache_dir = %q, want %q", cfg.Runner.CacheDir, "/tmp/ghr-cache") + } + if cfg.Runner.WorkdirBase != "/tmp/ghr-runners" { + t.Errorf("runner.workdir_base = %q, want %q", cfg.Runner.WorkdirBase, "/tmp/ghr-runners") + } + + // Groups. + if len(cfg.Groups) != 2 { + t.Fatalf("expected 2 groups, got %d", len(cfg.Groups)) + } + prod := cfg.Groups[0] + if prod.Name != "production" { + t.Errorf("groups[0].name = %q, want %q", prod.Name, "production") + } + if prod.MaxRunners != 10 { + t.Errorf("groups[0].max_runners = %d, want 10", prod.MaxRunners) + } + if prod.MinRunners != 2 { + t.Errorf("groups[0].min_runners = %d, want 2", prod.MinRunners) + } + if len(prod.Labels) != 2 || prod.Labels[0] != "self-hosted" || prod.Labels[1] != "linux" { + t.Errorf("groups[0].labels = %v, want [self-hosted linux]", prod.Labels) + } + if prod.RunnerGroup != "prod-pool" { + t.Errorf("groups[0].runner_group = %q, want %q", prod.RunnerGroup, "prod-pool") + } + if prod.Version != "2.319.0" { + t.Errorf("groups[0].version = %q, want %q", prod.Version, "2.319.0") + } + + staging := cfg.Groups[1] + if staging.Name != "staging" { + t.Errorf("groups[1].name = %q, want %q", staging.Name, "staging") + } + if staging.MaxRunners != 5 { + t.Errorf("groups[1].max_runners = %d, want 5", staging.MaxRunners) + } + + // Health. 
+ if !cfg.Health.Enabled { + t.Error("health.enabled = false, want true") + } + if cfg.Health.CheckInterval.Duration != 1*time.Minute { + t.Errorf("health.check_interval = %v, want 1m", cfg.Health.CheckInterval.Duration) + } + if cfg.Health.RunnerTimeout.Duration != 3*time.Hour { + t.Errorf("health.runner_timeout = %v, want 3h", cfg.Health.RunnerTimeout.Duration) + } + if cfg.Health.IdleTimeout.Duration != 30*time.Minute { + t.Errorf("health.idle_timeout = %v, want 30m", cfg.Health.IdleTimeout.Duration) + } + if cfg.Health.DivergenceTimeout.Duration != 10*time.Minute { + t.Errorf("health.divergence_timeout = %v, want 10m", cfg.Health.DivergenceTimeout.Duration) + } + if cfg.Health.MaxConsecutiveFailures != 10 { + t.Errorf("health.max_consecutive_failures = %d, want 10", cfg.Health.MaxConsecutiveFailures) + } + if cfg.Health.FailureCooldown.Duration != 2*time.Minute { + t.Errorf("health.failure_cooldown = %v, want 2m", cfg.Health.FailureCooldown.Duration) + } + if cfg.Health.MinDiskSpace != "5GB" { + t.Errorf("health.min_disk_space = %q, want %q", cfg.Health.MinDiskSpace, "5GB") + } + + // Logging. + if cfg.Logging.Level != "debug" { + t.Errorf("logging.level = %q, want %q", cfg.Logging.Level, "debug") + } + if cfg.Logging.Format != "json" { + t.Errorf("logging.format = %q, want %q", cfg.Logging.Format, "json") + } + if cfg.Logging.Dir != "/tmp/ghr-logs" { + t.Errorf("logging.dir = %q, want %q", cfg.Logging.Dir, "/tmp/ghr-logs") + } + if cfg.Logging.RetentionDays != 14 { + t.Errorf("logging.retention_days = %d, want 14", cfg.Logging.RetentionDays) + } + // With *bool, runner_output: false in YAML is now respected. + if cfg.Logging.RunnerOutput == nil { + t.Error("logging.runner_output = nil, want false") + } else if *cfg.Logging.RunnerOutput { + t.Error("logging.runner_output = true, want false (explicitly set in YAML)") + } + + // Notifications. 
+ if !cfg.Notifications.Discord.Enabled { + t.Error("notifications.discord.enabled = false, want true") + } + if cfg.Notifications.Discord.Username != "my-bot" { + t.Errorf("notifications.discord.username = %q, want %q", cfg.Notifications.Discord.Username, "my-bot") + } + if len(cfg.Notifications.Discord.Events) != 2 { + t.Errorf("notifications.discord.events len = %d, want 2", len(cfg.Notifications.Discord.Events)) + } + if cfg.Notifications.Discord.Mentions.Error != "<@&111>" { + t.Errorf("notifications.discord.mentions.error = %q, want %q", cfg.Notifications.Discord.Mentions.Error, "<@&111>") + } + if cfg.Notifications.Discord.Mentions.Critical != "<@&222>" { + t.Errorf("notifications.discord.mentions.critical = %q, want %q", cfg.Notifications.Discord.Mentions.Critical, "<@&222>") + } + + // Monitoring. + if !cfg.Monitoring.UptimeKuma.Enabled { + t.Error("monitoring.uptime_kuma.enabled = false, want true") + } + if cfg.Monitoring.UptimeKuma.DegradedThreshold != 0.8 { + t.Errorf("monitoring.uptime_kuma.degraded_threshold = %f, want 0.8", cfg.Monitoring.UptimeKuma.DegradedThreshold) + } + if !cfg.Monitoring.UptimeKuma.ReportHealthAsPing { + t.Error("monitoring.uptime_kuma.report_health_as_ping = false, want true") + } + + // Daemon. 
+ if cfg.Daemon.StateDir != "/tmp/ghr-state" { + t.Errorf("daemon.state_dir = %q, want %q", cfg.Daemon.StateDir, "/tmp/ghr-state") + } + if cfg.Daemon.ShutdownTimeout.Duration != 1*time.Minute { + t.Errorf("daemon.shutdown_timeout = %v, want 1m", cfg.Daemon.ShutdownTimeout.Duration) + } +} + +func TestLoad_ValidationErrors(t *testing.T) { + tests := []struct { + name string + yaml string + wantInErr string // substring expected in the error message + }{ + { + name: "no groups", + yaml: `github: {url: "https://github.com"}`, + wantInErr: "at least one group is required", + }, + { + name: "empty group name", + yaml: ` +groups: + - name: "" + max_runners: 1`, + wantInErr: "name is required", + }, + { + name: "duplicate group names", + yaml: ` +groups: + - name: dup + max_runners: 1 + - name: dup + max_runners: 2`, + wantInErr: "duplicate group name", + }, + { + name: "max_runners less than 1", + yaml: ` +groups: + - name: grp + max_runners: 0`, + wantInErr: "max_runners must be >= 1", + }, + { + name: "min_runners negative", + yaml: ` +groups: + - name: grp + max_runners: 5 + min_runners: -1`, + wantInErr: "min_runners must be >= 0", + }, + { + name: "min_runners greater than max_runners", + yaml: ` +groups: + - name: grp + max_runners: 2 + min_runners: 5`, + wantInErr: "min_runners (5) must be <= max_runners (2)", + }, + { + name: "empty label string", + yaml: ` +groups: + - name: grp + max_runners: 1 + labels: + - ""`, + wantInErr: "labels[0] must not be empty", + }, + { + name: "invalid logging level", + yaml: ` +logging: + level: "verbose" +groups: + - name: grp + max_runners: 1`, + wantInErr: "logging.level must be one of", + }, + { + name: "invalid logging format", + yaml: ` +logging: + format: "xml" +groups: + - name: grp + max_runners: 1`, + wantInErr: "logging.format must be one of", + }, + { + name: "check_interval too small", + yaml: ` +health: + check_interval: "2s" +groups: + - name: grp + max_runners: 1`, + wantInErr: "health.check_interval must be >= 
5s", + }, + { + name: "runner_timeout too small", + yaml: ` +health: + runner_timeout: "30s" +groups: + - name: grp + max_runners: 1`, + wantInErr: "health.runner_timeout must be >= 1m", + }, + { + name: "shutdown_timeout too small", + yaml: ` +daemon: + shutdown_timeout: "2s" +groups: + - name: grp + max_runners: 1`, + wantInErr: "daemon.shutdown_timeout must be >= 5s", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + _, err := Load(writeConfig(t, tt.yaml)) + if err == nil { + t.Fatal("Load() expected error, got nil") + } + if !strings.Contains(err.Error(), tt.wantInErr) { + t.Errorf("error = %q, want substring %q", err.Error(), tt.wantInErr) + } + }) + } +} + +func TestLoad_DefaultPaths_NonRoot(t *testing.T) { + // This test runs as a non-root user in development. + if os.Getuid() == 0 { + t.Skip("test requires non-root user") + } + + yaml := ` +groups: + - name: grp + max_runners: 1 +` + cfg, err := Load(writeConfig(t, yaml)) + if err != nil { + t.Fatalf("Load() unexpected error: %v", err) + } + + home, err := os.UserHomeDir() + if err != nil { + t.Fatalf("UserHomeDir() error: %v", err) + } + + expectedDataDir := filepath.Join(home, ".local", "share", "ghr") + + if !strings.HasPrefix(cfg.Runner.CacheDir, expectedDataDir) { + t.Errorf("runner.cache_dir = %q, want prefix %q", cfg.Runner.CacheDir, expectedDataDir) + } + if !strings.HasPrefix(cfg.Runner.WorkdirBase, expectedDataDir) { + t.Errorf("runner.workdir_base = %q, want prefix %q", cfg.Runner.WorkdirBase, expectedDataDir) + } + if !strings.HasPrefix(cfg.Logging.Dir, expectedDataDir) { + t.Errorf("logging.dir = %q, want prefix %q", cfg.Logging.Dir, expectedDataDir) + } + + expectedStateDir := filepath.Join(home, ".local", "state", "ghr") + if !strings.HasPrefix(cfg.Daemon.StateDir, expectedStateDir) { + t.Errorf("daemon.state_dir = %q, want prefix %q", cfg.Daemon.StateDir, expectedStateDir) + } +} + +func TestLoad_FileNotFound(t *testing.T) { + _, err := 
Load("/nonexistent/path/config.yaml") + if err == nil { + t.Fatal("Load() expected error for non-existent file, got nil") + } + if !strings.Contains(err.Error(), "read config file") { + t.Errorf("error = %q, want substring %q", err.Error(), "read config file") + } +} + +func TestLoad_InvalidYAML(t *testing.T) { + invalidYAML := ` +groups: + - name: test + max_runners: [[[invalid +` + _, err := Load(writeConfig(t, invalidYAML)) + if err == nil { + t.Fatal("Load() expected error for invalid YAML, got nil") + } + if !strings.Contains(err.Error(), "parse config file") { + t.Errorf("error = %q, want substring %q", err.Error(), "parse config file") + } +} + +func TestLoad_DurationParsing(t *testing.T) { + yaml := ` +health: + check_interval: "30s" + runner_timeout: "5m" + idle_timeout: "2h" + divergence_timeout: "10m" + failure_cooldown: "90s" + +daemon: + shutdown_timeout: "1m30s" + +groups: + - name: grp + max_runners: 1 +` + cfg, err := Load(writeConfig(t, yaml)) + if err != nil { + t.Fatalf("Load() unexpected error: %v", err) + } + + tests := []struct { + name string + got time.Duration + want time.Duration + }{ + {"check_interval 30s", cfg.Health.CheckInterval.Duration, 30 * time.Second}, + {"runner_timeout 5m", cfg.Health.RunnerTimeout.Duration, 5 * time.Minute}, + {"idle_timeout 2h", cfg.Health.IdleTimeout.Duration, 2 * time.Hour}, + {"divergence_timeout 10m", cfg.Health.DivergenceTimeout.Duration, 10 * time.Minute}, + {"failure_cooldown 90s", cfg.Health.FailureCooldown.Duration, 90 * time.Second}, + {"shutdown_timeout 1m30s", cfg.Daemon.ShutdownTimeout.Duration, time.Minute + 30*time.Second}, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + if tt.got != tt.want { + t.Errorf("duration = %v, want %v", tt.got, tt.want) + } + }) + } +} + +func TestLoad_EnvVarResolution(t *testing.T) { + t.Setenv("GHR_DISCORD_WEBHOOK_URL", "https://discord.com/api/webhooks/test") + t.Setenv("GHR_UPTIME_KUMA_URL", "https://uptime.example.com/api/push/abc123") 
+ + yaml := ` +groups: + - name: grp + max_runners: 1 +` + cfg, err := Load(writeConfig(t, yaml)) + if err != nil { + t.Fatalf("Load() unexpected error: %v", err) + } + + if cfg.Notifications.Discord.WebhookURL != "https://discord.com/api/webhooks/test" { + t.Errorf("discord.webhook_url = %q, want %q", cfg.Notifications.Discord.WebhookURL, "https://discord.com/api/webhooks/test") + } + if cfg.Monitoring.UptimeKuma.BaseURL != "https://uptime.example.com/api/push/abc123" { + t.Errorf("uptime_kuma.base_url = %q, want %q", cfg.Monitoring.UptimeKuma.BaseURL, "https://uptime.example.com/api/push/abc123") + } +} diff --git a/internal/config/types.go b/internal/config/types.go new file mode 100644 index 0000000..5cb4f24 --- /dev/null +++ b/internal/config/types.go @@ -0,0 +1,120 @@ +package config + +import ( + "fmt" + "time" +) + +type Config struct { + GitHub GitHubConfig `yaml:"github"` + Runner RunnerConfig `yaml:"runner"` + Groups []GroupConfig `yaml:"groups"` + Health HealthConfig `yaml:"health"` + Logging LoggingConfig `yaml:"logging"` + Notifications NotificationsConfig `yaml:"notifications"` + Monitoring MonitoringConfig `yaml:"monitoring"` + Daemon DaemonConfig `yaml:"daemon"` +} + +type GitHubConfig struct { + URL string `yaml:"url"` + RunnerGroup string `yaml:"runner_group"` +} + +type RunnerConfig struct { + Version string `yaml:"version"` + CacheDir string `yaml:"cache_dir"` + WorkdirBase string `yaml:"workdir_base"` +} + +type GroupConfig struct { + Name string `yaml:"name"` + MaxRunners int `yaml:"max_runners"` + MinRunners int `yaml:"min_runners"` + Labels []string `yaml:"labels"` + RunnerGroup string `yaml:"runner_group"` + Version string `yaml:"version"` + Health *GroupHealthConfig `yaml:"health,omitempty"` +} + +type GroupHealthConfig struct { + RunnerTimeout Duration `yaml:"runner_timeout"` +} + +type HealthConfig struct { + Enabled bool `yaml:"enabled"` + CheckInterval Duration `yaml:"check_interval"` + RunnerTimeout Duration `yaml:"runner_timeout"` + 
IdleTimeout Duration `yaml:"idle_timeout"` + DivergenceTimeout Duration `yaml:"divergence_timeout"` + MaxConsecutiveFailures int `yaml:"max_consecutive_failures"` + FailureCooldown Duration `yaml:"failure_cooldown"` + MinDiskSpace string `yaml:"min_disk_space"` +} + +type LoggingConfig struct { + Level string `yaml:"level"` + Format string `yaml:"format"` + Dir string `yaml:"dir"` + RetentionDays int `yaml:"retention_days"` + RunnerOutput *bool `yaml:"runner_output"` +} + +type NotificationsConfig struct { + Discord DiscordConfig `yaml:"discord"` +} + +type DiscordConfig struct { + Enabled bool `yaml:"enabled"` + WebhookURL string `yaml:"-"` + Events []string `yaml:"events"` + Username string `yaml:"username"` + AvatarURL string `yaml:"avatar_url"` + Mentions struct { + Error string `yaml:"error"` + Critical string `yaml:"critical"` + } `yaml:"mentions"` +} + +type MonitoringConfig struct { + UptimeKuma UptimeKumaConfig `yaml:"uptime_kuma"` +} + +type UptimeKumaConfig struct { + Enabled bool `yaml:"enabled"` + BaseURL string `yaml:"-"` + DaemonToken string `yaml:"-"` + GroupTokens map[string]string `yaml:"-"` + DegradedThreshold float64 `yaml:"degraded_threshold"` + ReportHealthAsPing bool `yaml:"report_health_as_ping"` +} + +type DaemonConfig struct { + StateDir string `yaml:"state_dir"` + ShutdownTimeout Duration `yaml:"shutdown_timeout"` +} + +type Duration struct { + time.Duration +} + +func (d *Duration) UnmarshalYAML(unmarshal func(interface{}) error) error { + var s string + if err := unmarshal(&s); err != nil { + return fmt.Errorf("unmarshaling duration: %w", err) + } + if s == "" || s == "0" { + d.Duration = 0 + return nil + } + dur, err := time.ParseDuration(s) + if err != nil { + return fmt.Errorf("invalid duration %q: %w", s, err) + } + d.Duration = dur + return nil +} + +func (d Duration) MarshalYAML() (interface{}, error) { + return d.String(), nil +} diff --git a/internal/config/validate.go b/internal/config/validate.go new file mode 100644 index 
0000000..598ca32 --- /dev/null +++ b/internal/config/validate.go @@ -0,0 +1,92 @@ +package config + +import ( + "errors" + "fmt" + "time" +) + +func validate(cfg *Config) error { + var errs []error + + if len(cfg.Groups) == 0 { + errs = append(errs, errors.New("at least one group is required")) + } + + seenNames := make(map[string]bool, len(cfg.Groups)) + + for i, g := range cfg.Groups { + prefix := fmt.Sprintf("groups[%d]", i) + + switch { + case g.Name == "": + errs = append(errs, fmt.Errorf("%s: name is required", prefix)) + case seenNames[g.Name]: + errs = append(errs, fmt.Errorf("%s: duplicate group name %q", prefix, g.Name)) + default: + seenNames[g.Name] = true + } + + if g.MaxRunners < 1 { + errs = append(errs, fmt.Errorf("%s (%s): max_runners must be >= 1", prefix, g.Name)) + } + + if g.MinRunners < 0 { + errs = append(errs, fmt.Errorf("%s (%s): min_runners must be >= 0", prefix, g.Name)) + } + + if g.MinRunners > g.MaxRunners { + errs = append(errs, fmt.Errorf("%s (%s): min_runners (%d) must be <= max_runners (%d)", prefix, g.Name, g.MinRunners, g.MaxRunners)) + } + + for j, label := range g.Labels { + if label == "" { + errs = append(errs, fmt.Errorf("%s (%s): labels[%d] must not be empty", prefix, g.Name, j)) + } + } + } + + if cfg.Health.CheckInterval.Duration > 0 && cfg.Health.CheckInterval.Duration < 5*time.Second { + errs = append(errs, fmt.Errorf("health.check_interval must be >= 5s, got %s", cfg.Health.CheckInterval.Duration)) + } + if cfg.Health.RunnerTimeout.Duration > 0 && cfg.Health.RunnerTimeout.Duration < 1*time.Minute { + errs = append(errs, fmt.Errorf("health.runner_timeout must be >= 1m, got %s", cfg.Health.RunnerTimeout.Duration)) + } + if cfg.Daemon.ShutdownTimeout.Duration > 0 && cfg.Daemon.ShutdownTimeout.Duration < 5*time.Second { + errs = append(errs, fmt.Errorf("daemon.shutdown_timeout must be >= 5s, got %s", cfg.Daemon.ShutdownTimeout.Duration)) + } + + if cfg.Health.MinDiskSpace != "" { + if _, parseErr := 
ParseByteSize(cfg.Health.MinDiskSpace); parseErr != nil { + errs = append(errs, fmt.Errorf("health.min_disk_space: %w", parseErr)) + } + } + + switch cfg.Logging.Level { + case "debug", "info", "warn", "error": + default: + errs = append(errs, fmt.Errorf("logging.level must be one of: debug, info, warn, error; got %q", cfg.Logging.Level)) + } + + switch cfg.Logging.Format { + case "text", "json": + default: + errs = append(errs, fmt.Errorf("logging.format must be one of: text, json; got %q", cfg.Logging.Format)) + } + + if len(errs) > 0 { + return errors.Join(errs...) + } + return nil +} + +func isHealthZero(h HealthConfig) bool { + return !h.Enabled && + h.CheckInterval.Duration == 0 && + h.RunnerTimeout.Duration == 0 && + h.IdleTimeout.Duration == 0 && + h.DivergenceTimeout.Duration == 0 && + h.MaxConsecutiveFailures == 0 && + h.FailureCooldown.Duration == 0 && + h.MinDiskSpace == "" +} diff --git a/internal/controller/controller.go b/internal/controller/controller.go new file mode 100644 index 0000000..100da0d --- /dev/null +++ b/internal/controller/controller.go @@ -0,0 +1,149 @@ +package controller + +import ( + "context" + "fmt" + "log/slog" + "sync" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/config" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/logging" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/runner" + "github.com/actions/scaleset" + "github.com/actions/scaleset/listener" +) + +type scaleSetClient interface { + GetScaleSet(ctx context.Context, runnerGroupID int, name string) (*scaleset.RunnerScaleSet, error) + CreateScaleSet(ctx context.Context, name string, runnerGroupID int, labels []string) (*scaleset.RunnerScaleSet, error) + DeleteScaleSet(ctx context.Context, id int) error + GenerateJITConfig(ctx context.Context, scaleSetID int, runnerName string) (string, error) + OpenSession(ctx context.Context, scaleSetID int, owner string) (*scaleset.MessageSessionClient, 
error) + NewListener(session *scaleset.MessageSessionClient, scaleSetID int, maxRunners int) (*listener.Listener, error) +} + +type notifier interface { + Notify(ctx context.Context, event *model.Event) +} + +type ControllerConfig struct { + RunnerVersion string + RunnerGroupID int +} + +type GroupController struct { + client scaleSetClient + binary *runner.BinaryManager + process *runner.ProcessManager + notifier notifier + logMgr *logging.LogManager + groups []config.GroupConfig + globalCfg ControllerConfig + logger *slog.Logger + + mu sync.Mutex + scalers map[string]*MacOSScaler +} + +func New( + client scaleSetClient, + binary *runner.BinaryManager, + process *runner.ProcessManager, + notifier notifier, + logMgr *logging.LogManager, + groups []config.GroupConfig, + globalCfg ControllerConfig, + logger *slog.Logger, +) *GroupController { + return &GroupController{ + client: client, + binary: binary, + process: process, + notifier: notifier, + logMgr: logMgr, + groups: groups, + globalCfg: globalCfg, + logger: logger, + scalers: make(map[string]*MacOSScaler), + } +} + +func (c *GroupController) Run(ctx context.Context) error { + var wg sync.WaitGroup + errCh := make(chan error, len(c.groups)) + + for _, g := range c.groups { + wg.Add(1) + go func(group *config.GroupConfig) { + defer wg.Done() + if err := c.runGroup(ctx, group); err != nil { + errCh <- err + } + }(&g) + } + + <-ctx.Done() + wg.Wait() + close(errCh) + + for err := range errCh { + if err != nil { + return err + } + } + return nil +} + +func (c *GroupController) Shutdown(ctx context.Context) { + c.mu.Lock() + scalers := make(map[string]*MacOSScaler, len(c.scalers)) + for k, v := range c.scalers { + scalers[k] = v + } + c.mu.Unlock() + + for name, s := range scalers { + c.logger.InfoContext(ctx, "shutting down scaler", "group", name) + s.Shutdown(ctx) + } +} + +func (c *GroupController) Snapshots() map[string][]model.RunnerSnapshot { + c.mu.Lock() + scalers := make(map[string]*MacOSScaler, 
len(c.scalers)) + for k, v := range c.scalers { + scalers[k] = v + } + c.mu.Unlock() + + result := make(map[string][]model.RunnerSnapshot, len(scalers)) + for name, s := range scalers { + result[name] = s.Snapshots() + } + return result +} + +func (c *GroupController) KillRunner(ctx context.Context, group, runnerName string) error { + c.mu.Lock() + s, ok := c.scalers[group] + c.mu.Unlock() + + if !ok { + return fmt.Errorf("kill runner %s: group %q not found", runnerName, group) + } + + return s.killRunner(ctx, runnerName) +} + +func (c *GroupController) registerScaler(name string, s *MacOSScaler) { + c.mu.Lock() + defer c.mu.Unlock() + c.scalers[name] = s +} + +func (c *GroupController) unregisterScaler(name string) { + c.mu.Lock() + defer c.mu.Unlock() + delete(c.scalers, name) +} diff --git a/internal/controller/group.go b/internal/controller/group.go new file mode 100644 index 0000000..4ba23ea --- /dev/null +++ b/internal/controller/group.go @@ -0,0 +1,191 @@ +package controller + +import ( + "context" + "errors" + "fmt" + "log/slog" + "os" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/config" + "github.com/actions/scaleset" +) + +const ( + backoffMin = 2 * time.Second + backoffMax = 30 * time.Second +) + +func (c *GroupController) runGroup(ctx context.Context, group *config.GroupConfig) error { + version := group.Version + if version == "" { + version = c.globalCfg.RunnerVersion + } + + cachedDir, err := c.binary.EnsureBits(ctx, version) + if err != nil { + return fmt.Errorf("ensure runner bits for group %q: %w", group.Name, err) + } + + groupLogger, err := c.logMgr.GroupLogger(group.Name) + if err != nil { + return fmt.Errorf("create group logger for %q: %w", group.Name, err) + } + + labels := deduplicateLabels(group.Name, group.Labels) + + backoff := backoffMin + for { + err := c.runGroupOnce(ctx, group, cachedDir, labels, groupLogger) + if err == nil || errors.Is(err, context.Canceled) { + return nil + } + + groupLogger.ErrorContext(ctx, 
"group listener failed, retrying", + "group", group.Name, + "error", err, + "backoff", backoff, + ) + + select { + case <-ctx.Done(): + return nil + case <-time.After(backoff): + } + + backoff = nextBackoff(backoff) + } +} + +func (c *GroupController) runGroupOnce( + ctx context.Context, + group *config.GroupConfig, + cachedDir string, + labels []string, + groupLogger *slog.Logger, +) error { + ss, err := c.resolveScaleSet(ctx, group.Name, labels) + if err != nil { + return fmt.Errorf("resolve scale set %q: %w", group.Name, err) + } + + hostname, err := os.Hostname() + if err != nil { + hostname = "unknown" + } + + session, err := c.client.OpenSession(ctx, ss.ID, hostname) + if err != nil { + return fmt.Errorf("open session for %q: %w", group.Name, err) + } + defer func() { + closeCtx := context.WithoutCancel(ctx) + if closeErr := session.Close(closeCtx); closeErr != nil { + groupLogger.DebugContext(ctx, "session close", + "group", group.Name, + "error", closeErr, + ) + } + }() + + scaler := NewMacOSScaler( + c.client, c.process, c.logMgr, c.notifier, + ss.ID, group.Name, group.MaxRunners, group.MinRunners, + cachedDir, groupLogger, + ) + c.registerScaler(group.Name, scaler) + + l, err := c.client.NewListener(session, ss.ID, group.MaxRunners) + if err != nil { + c.unregisterScaler(group.Name) + return fmt.Errorf("create listener for %q: %w", group.Name, err) + } + + groupLogger.InfoContext(ctx, "group listener started", + "group", group.Name, + "scale_set_id", ss.ID, + ) + + listenerErr := l.Run(ctx, scaler) + + c.unregisterScaler(group.Name) + + if errors.Is(listenerErr, context.Canceled) { + scaler.Shutdown(ctx) + cleanupCtx := context.WithoutCancel(ctx) + deleteErr := c.client.DeleteScaleSet(cleanupCtx, ss.ID) + if deleteErr != nil { + groupLogger.WarnContext(ctx, "failed to delete scale set on shutdown", + "group", group.Name, + "scale_set_id", ss.ID, + "error", deleteErr, + ) + } + return context.Canceled + } + + return listenerErr +} + +func (c 
*GroupController) resolveScaleSet(ctx context.Context, name string, labels []string) (*resolvedScaleSet, error) { + ss, err := c.client.GetScaleSet(ctx, c.globalCfg.RunnerGroupID, name) + if err == nil && ss != nil { + if labelsChanged(ss.Labels, labels) { + c.logger.WarnContext(ctx, "scale set label mismatch detected, delete and recreate to update", + "group", name, + "scale_set_id", ss.ID, + ) + } + return &resolvedScaleSet{ID: ss.ID, Name: ss.Name}, nil + } + + ss, err = c.client.CreateScaleSet(ctx, name, c.globalCfg.RunnerGroupID, labels) + if err != nil { + return nil, fmt.Errorf("create scale set %q: %w", name, err) + } + return &resolvedScaleSet{ID: ss.ID, Name: ss.Name}, nil +} + +func labelsChanged(existing []scaleset.Label, desired []string) bool { + if len(existing) != len(desired) { + return true + } + have := make(map[string]struct{}, len(existing)) + for _, l := range existing { + have[l.Name] = struct{}{} + } + for _, d := range desired { + if _, ok := have[d]; !ok { + return true + } + } + return false +} + +type resolvedScaleSet struct { + ID int + Name string +} + +func deduplicateLabels(groupName string, extra []string) []string { + seen := make(map[string]struct{}, len(extra)+1) + result := make([]string, 0, len(extra)+1) + + for _, label := range append([]string{groupName}, extra...) 
{ + if _, ok := seen[label]; ok { + continue + } + seen[label] = struct{}{} + result = append(result, label) + } + return result +} + +func nextBackoff(current time.Duration) time.Duration { + next := current * 2 + if next > backoffMax { + return backoffMax + } + return next +} diff --git a/internal/controller/kill_runner_test.go b/internal/controller/kill_runner_test.go new file mode 100644 index 0000000..822979f --- /dev/null +++ b/internal/controller/kill_runner_test.go @@ -0,0 +1,45 @@ +package controller + +import ( + "context" + "log/slog" + "os" + "testing" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/runner" +) + +func testLogger() *slog.Logger { + return slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelError + 1})) +} + +func TestKillRunner_GroupNotFound(t *testing.T) { + c := &GroupController{ + scalers: make(map[string]*MacOSScaler), + logger: testLogger(), + } + + err := c.KillRunner(context.Background(), "missing-group", "r1") + if err == nil { + t.Fatal("expected error for missing group") + } +} + +func TestKillRunner_RunnerNotFound(t *testing.T) { + scaler := &MacOSScaler{ + groupName: "group-a", + idle: make(map[string]*runner.Process), + busy: make(map[string]*runner.Process), + logger: testLogger(), + } + + c := &GroupController{ + scalers: map[string]*MacOSScaler{"group-a": scaler}, + logger: testLogger(), + } + + err := c.KillRunner(context.Background(), "group-a", "r-nonexistent") + if err == nil { + t.Fatal("expected error for missing runner") + } +} diff --git a/internal/controller/scaler.go b/internal/controller/scaler.go new file mode 100644 index 0000000..fe3e45a --- /dev/null +++ b/internal/controller/scaler.go @@ -0,0 +1,187 @@ +package controller + +import ( + "context" + "fmt" + "log/slog" + "sync" + "time" + + "io" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/logging" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" + 
"github.com/RedBoardDev/gh-runners-tool/v2/internal/runner" + "github.com/actions/scaleset" +) + +type runnerStarter interface { + Prepare(ctx context.Context, instance *model.RunnerInstance, cachedDir string) (string, error) + Start(ctx context.Context, instance *model.RunnerInstance, workdir, jitConfig string, logFile io.Writer) (*runner.Process, error) + Stop(ctx context.Context, proc *runner.Process) error + Cleanup(proc *runner.Process) error +} + +type MacOSScaler struct { + client scaleSetClient + process runnerStarter + logMgr *logging.LogManager + notifier notifier + scaleSetID int + groupName string + maxRunners int + minRunners int + cachedDir string + logger *slog.Logger + + mu sync.Mutex + idle map[string]*runner.Process + busy map[string]*runner.Process +} + +func NewMacOSScaler( + client scaleSetClient, + process runnerStarter, + logMgr *logging.LogManager, + notifier notifier, + scaleSetID int, + groupName string, + maxRunners int, + minRunners int, + cachedDir string, + logger *slog.Logger, +) *MacOSScaler { + return &MacOSScaler{ + client: client, + process: process, + logMgr: logMgr, + notifier: notifier, + scaleSetID: scaleSetID, + groupName: groupName, + maxRunners: maxRunners, + minRunners: minRunners, + cachedDir: cachedDir, + logger: logger, + idle: make(map[string]*runner.Process), + busy: make(map[string]*runner.Process), + } +} + +func (s *MacOSScaler) HandleDesiredRunnerCount(ctx context.Context, count int) (int, error) { + s.mu.Lock() + defer s.mu.Unlock() + + target := s.minRunners + count + if target > s.maxRunners { + target = s.maxRunners + } + + current := len(s.idle) + len(s.busy) + for i := 0; i < target-current; i++ { + if err := s.startRunner(ctx); err != nil { + s.logger.ErrorContext(ctx, "failed to start runner", + "group", s.groupName, + "error", err, + ) + } + } + + return len(s.idle) + len(s.busy), nil +} + +func (s *MacOSScaler) HandleJobStarted(ctx context.Context, jobInfo *scaleset.JobStarted) error { + s.mu.Lock() + 
defer s.mu.Unlock() + + proc, ok := s.idle[jobInfo.RunnerName] + if !ok { + s.logger.WarnContext(ctx, "job started for unknown runner", + "runner", jobInfo.RunnerName, + "group", s.groupName, + ) + return nil + } + + delete(s.idle, jobInfo.RunnerName) + s.busy[jobInfo.RunnerName] = proc + + s.logger.InfoContext(ctx, "job started", + "runner", jobInfo.RunnerName, + "group", s.groupName, + "job", jobInfo.JobDisplayName, + ) + + s.notifier.Notify(ctx, &model.Event{ + Type: model.EventRunnerStarted, + Level: model.LevelInfo, + Group: s.groupName, + Runner: jobInfo.RunnerName, + Message: fmt.Sprintf("Job started: %s", jobInfo.JobDisplayName), + Timestamp: time.Now(), + }) + + return nil +} + +func (s *MacOSScaler) HandleJobCompleted(ctx context.Context, jobInfo *scaleset.JobCompleted) error { + s.mu.Lock() + proc := s.busy[jobInfo.RunnerName] + if proc == nil { + proc = s.idle[jobInfo.RunnerName] + } + delete(s.busy, jobInfo.RunnerName) + delete(s.idle, jobInfo.RunnerName) + s.mu.Unlock() + + if proc != nil { + stopErr := s.process.Stop(ctx, proc) + if stopErr != nil { + s.logger.WarnContext(ctx, "failed to stop runner", + "runner", jobInfo.RunnerName, + "error", stopErr, + ) + } + cleanupErr := s.process.Cleanup(proc) + if cleanupErr != nil { + s.logger.WarnContext(ctx, "failed to cleanup runner", + "runner", jobInfo.RunnerName, + "error", cleanupErr, + ) + } + } else { + s.logger.WarnContext(ctx, "job completed for unknown runner", + "runner", jobInfo.RunnerName, + "group", s.groupName, + ) + } + + eventType := model.EventRunnerCompleted + if jobInfo.Result != "succeeded" { + eventType = model.EventRunnerFailed + } + + logArgs := []any{ + "runner", jobInfo.RunnerName, + "group", s.groupName, + "result", jobInfo.Result, + } + if !jobInfo.FinishTime.IsZero() && !jobInfo.RunnerAssignTime.IsZero() { + logArgs = append(logArgs, "duration", jobInfo.FinishTime.Sub(jobInfo.RunnerAssignTime).String()) + } + if !jobInfo.QueueTime.IsZero() && !jobInfo.RunnerAssignTime.IsZero() { 
+ logArgs = append(logArgs, "queue_wait", jobInfo.RunnerAssignTime.Sub(jobInfo.QueueTime).String()) + } + + s.logger.InfoContext(ctx, "job completed", logArgs...) + + s.notifier.Notify(ctx, &model.Event{ + Type: eventType, + Level: model.LevelInfo, + Group: s.groupName, + Runner: jobInfo.RunnerName, + Message: fmt.Sprintf("Job completed: %s", jobInfo.Result), + Timestamp: time.Now(), + }) + + return nil +} diff --git a/internal/controller/scaler_ops.go b/internal/controller/scaler_ops.go new file mode 100644 index 0000000..cee674c --- /dev/null +++ b/internal/controller/scaler_ops.go @@ -0,0 +1,144 @@ +package controller + +import ( + "context" + "crypto/rand" + "encoding/hex" + "fmt" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/runner" +) + +func (s *MacOSScaler) startRunner(ctx context.Context) error { + randBytes := make([]byte, 4) + if _, err := rand.Read(randBytes); err != nil { + return fmt.Errorf("generate runner ID: %w", err) + } + id := hex.EncodeToString(randBytes) + name := fmt.Sprintf("%s-%s", s.groupName, id) + + jitConfig, err := s.client.GenerateJITConfig(ctx, s.scaleSetID, name) + if err != nil { + return fmt.Errorf("generate JIT config for %q: %w", name, err) + } + + instance := model.RunnerInstance{ + ID: id, + Name: name, + Group: s.groupName, + } + + workdir, err := s.process.Prepare(ctx, &instance, s.cachedDir) + if err != nil { + return fmt.Errorf("prepare runner %q: %w", name, err) + } + + logFile, err := s.logMgr.RunnerOutputFile(s.groupName, name) + if err != nil { + return fmt.Errorf("open runner log for %q: %w", name, err) + } + + proc, err := s.process.Start(ctx, &instance, workdir, jitConfig, logFile) + if err != nil { + return fmt.Errorf("start runner %q: %w", name, err) + } + + s.idle[name] = proc + + s.logger.InfoContext(ctx, "runner provisioned", + "runner", name, + "group", s.groupName, + "pid", proc.PID, + ) + + return nil +} + +func (s *MacOSScaler) 
killRunner(ctx context.Context, runnerName string) error { + s.mu.Lock() + proc := s.idle[runnerName] + if proc == nil { + proc = s.busy[runnerName] + } + delete(s.idle, runnerName) + delete(s.busy, runnerName) + s.mu.Unlock() + + if proc == nil { + return fmt.Errorf("runner %q not found in group %q", runnerName, s.groupName) + } + + stopErr := s.process.Stop(ctx, proc) + if stopErr != nil { + s.logger.WarnContext(ctx, "failed to stop runner during kill", + "runner", runnerName, + "error", stopErr, + ) + } + + cleanupErr := s.process.Cleanup(proc) + if cleanupErr != nil { + return fmt.Errorf("cleanup runner %q: %w", runnerName, cleanupErr) + } + + s.logger.InfoContext(ctx, "killed runner", "runner", runnerName, "group", s.groupName) + return nil +} + +func (s *MacOSScaler) Shutdown(ctx context.Context) { + s.mu.Lock() + allProcs := make([]*runner.Process, 0, len(s.idle)+len(s.busy)) + for _, p := range s.idle { + allProcs = append(allProcs, p) + } + for _, p := range s.busy { + allProcs = append(allProcs, p) + } + s.idle = make(map[string]*runner.Process) + s.busy = make(map[string]*runner.Process) + s.mu.Unlock() + + for _, proc := range allProcs { + stopErr := s.process.Stop(ctx, proc) + if stopErr != nil { + s.logger.WarnContext(ctx, "failed to stop runner during shutdown", + "runner", proc.Name, + "error", stopErr, + ) + } + cleanupErr := s.process.Cleanup(proc) + if cleanupErr != nil { + s.logger.WarnContext(ctx, "failed to cleanup runner during shutdown", + "runner", proc.Name, + "error", cleanupErr, + ) + } + } +} + +func (s *MacOSScaler) Snapshots() []model.RunnerSnapshot { + s.mu.Lock() + defer s.mu.Unlock() + + snapshots := make([]model.RunnerSnapshot, 0, len(s.idle)+len(s.busy)) + for name, proc := range s.idle { + snapshots = append(snapshots, model.RunnerSnapshot{ + Name: name, + Group: s.groupName, + State: "idle", + PID: proc.PID, + StartedAt: proc.StartedAt, + }) + } + for name, proc := range s.busy { + snapshots = append(snapshots, 
model.RunnerSnapshot{ + Name: name, + Group: s.groupName, + State: "busy", + PID: proc.PID, + StartedAt: proc.StartedAt, + }) + } + return snapshots +} diff --git a/internal/controller/scaler_test.go b/internal/controller/scaler_test.go new file mode 100644 index 0000000..a725a39 --- /dev/null +++ b/internal/controller/scaler_test.go @@ -0,0 +1,229 @@ +package controller + +import ( + "context" + "testing" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" + "github.com/RedBoardDev/gh-runners-tool/v2/internal/runner" + "github.com/actions/scaleset" +) + +type mockNotifier struct { + events []model.Event +} + +func (m *mockNotifier) Notify(_ context.Context, event *model.Event) { + m.events = append(m.events, *event) +} + +func newTestScaler(opts ...func(*MacOSScaler)) *MacOSScaler { + s := &MacOSScaler{ + groupName: "test-group", + maxRunners: 5, + minRunners: 0, + logger: testLogger(), + notifier: &mockNotifier{}, + idle: make(map[string]*runner.Process), + busy: make(map[string]*runner.Process), + } + for _, opt := range opts { + opt(s) + } + return s +} + +func TestSnapshots(t *testing.T) { + now := time.Date(2026, 1, 15, 12, 0, 0, 0, time.UTC) + + tests := []struct { + name string + idle map[string]*runner.Process + busy map[string]*runner.Process + wantLen int + wantIdle int + wantBusy int + }{ + { + name: "empty maps", + idle: map[string]*runner.Process{}, + busy: map[string]*runner.Process{}, + wantLen: 0, + wantIdle: 0, + wantBusy: 0, + }, + { + name: "one idle one busy", + idle: map[string]*runner.Process{ + "r-idle": {Name: "r-idle", Group: "test-group", PID: 100, StartedAt: now}, + }, + busy: map[string]*runner.Process{ + "r-busy": {Name: "r-busy", Group: "test-group", PID: 200, StartedAt: now}, + }, + wantLen: 2, + wantIdle: 1, + wantBusy: 1, + }, + { + name: "all idle", + idle: map[string]*runner.Process{ + "r-1": {Name: "r-1", Group: "test-group", PID: 100, StartedAt: now}, + "r-2": {Name: "r-2", Group: "test-group", PID: 101, 
StartedAt: now}, + }, + busy: map[string]*runner.Process{}, + wantLen: 2, + wantIdle: 2, + wantBusy: 0, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + s := newTestScaler(func(scaler *MacOSScaler) { + scaler.idle = tt.idle + scaler.busy = tt.busy + }) + + snapshots := s.Snapshots() + if len(snapshots) != tt.wantLen { + t.Fatalf("expected %d snapshots, got %d", tt.wantLen, len(snapshots)) + } + + idleCount := 0 + busyCount := 0 + for _, snap := range snapshots { + switch snap.State { + case "idle": + idleCount++ + case "busy": + busyCount++ + default: + t.Fatalf("unexpected state %q", snap.State) + } + if snap.Group != "test-group" { + t.Fatalf("expected group test-group, got %q", snap.Group) + } + } + if idleCount != tt.wantIdle { + t.Fatalf("expected %d idle, got %d", tt.wantIdle, idleCount) + } + if busyCount != tt.wantBusy { + t.Fatalf("expected %d busy, got %d", tt.wantBusy, busyCount) + } + }) + } +} + +func TestHandleDesiredRunnerCount_Noop(t *testing.T) { + now := time.Date(2026, 1, 15, 12, 0, 0, 0, time.UTC) + + s := newTestScaler(func(scaler *MacOSScaler) { + scaler.minRunners = 0 + scaler.maxRunners = 5 + scaler.idle = map[string]*runner.Process{ + "r-1": {Name: "r-1", Group: "test-group", PID: 100, StartedAt: now}, + "r-2": {Name: "r-2", Group: "test-group", PID: 101, StartedAt: now}, + } + }) + + got, err := s.HandleDesiredRunnerCount(context.Background(), 2) + if err != nil { + t.Fatalf("HandleDesiredRunnerCount: %v", err) + } + + if got != 2 { + t.Fatalf("expected current count 2, got %d", got) + } +} + +func TestHandleDesiredRunnerCount_CappedByMax(t *testing.T) { + now := time.Date(2026, 1, 15, 12, 0, 0, 0, time.UTC) + + s := newTestScaler(func(scaler *MacOSScaler) { + scaler.minRunners = 0 + scaler.maxRunners = 3 + scaler.idle = map[string]*runner.Process{ + "r-1": {Name: "r-1", Group: "test-group", PID: 100, StartedAt: now}, + "r-2": {Name: "r-2", Group: "test-group", PID: 101, StartedAt: now}, + "r-3": {Name: 
"r-3", Group: "test-group", PID: 102, StartedAt: now}, + } + }) + + got, err := s.HandleDesiredRunnerCount(context.Background(), 10) + if err != nil { + t.Fatalf("HandleDesiredRunnerCount: %v", err) + } + + if got != 3 { + t.Fatalf("expected count 3 (capped by max), got %d", got) + } +} + +func TestHandleJobStarted_NotFound(t *testing.T) { + s := newTestScaler() + + err := s.HandleJobStarted(context.Background(), &scaleset.JobStarted{ + RunnerName: "unknown-runner", + }) + if err != nil { + t.Fatalf("expected nil error for unknown runner, got %v", err) + } +} + +func TestHandleJobStarted_MovesToBusy(t *testing.T) { + now := time.Date(2026, 1, 15, 12, 0, 0, 0, time.UTC) + proc := &runner.Process{Name: "r-1", Group: "test-group", PID: 100, StartedAt: now} + + s := newTestScaler(func(scaler *MacOSScaler) { + scaler.idle = map[string]*runner.Process{"r-1": proc} + }) + + err := s.HandleJobStarted(context.Background(), &scaleset.JobStarted{ + RunnerName: "r-1", + }) + if err != nil { + t.Fatalf("HandleJobStarted: %v", err) + } + + if _, ok := s.idle["r-1"]; ok { + t.Fatal("expected runner to be removed from idle") + } + if _, ok := s.busy["r-1"]; !ok { + t.Fatal("expected runner to be in busy") + } +} + +func TestHandleJobCompleted_NotFound(t *testing.T) { + s := newTestScaler() + + err := s.HandleJobCompleted(context.Background(), &scaleset.JobCompleted{ + RunnerName: "unknown-runner", + Result: "succeeded", + }) + if err != nil { + t.Fatalf("expected nil error for unknown runner, got %v", err) + } +} + +func TestHandleJobCompleted_NotifiesEvent(t *testing.T) { + n := &mockNotifier{} + s := newTestScaler(func(scaler *MacOSScaler) { + scaler.notifier = n + }) + + err := s.HandleJobCompleted(context.Background(), &scaleset.JobCompleted{ + RunnerName: "unknown-runner", + Result: "failed", + }) + if err != nil { + t.Fatalf("HandleJobCompleted: %v", err) + } + + if len(n.events) != 1 { + t.Fatalf("expected 1 notification event, got %d", len(n.events)) + } + if 
n.events[0].Type != model.EventRunnerFailed { + t.Fatalf("expected event type %q, got %q", model.EventRunnerFailed, n.events[0].Type) + } +} diff --git a/internal/github/client.go b/internal/github/client.go new file mode 100644 index 0000000..c27346d --- /dev/null +++ b/internal/github/client.go @@ -0,0 +1,138 @@ +package github + +import ( + "context" + "fmt" + "os" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/auth" + "github.com/actions/scaleset" + "github.com/actions/scaleset/listener" +) + +var systemInfo = scaleset.SystemInfo{ + System: "ghr", + Version: "2.0", +} + +type Client struct { + inner *scaleset.Client +} + +func NewClient(creds *auth.Credentials, githubURL string) (*Client, error) { + switch creds.Method { + case "pat": + return newPATClient(creds.PAT, githubURL) + case "github_app": + return newAppClient(creds.GitHubApp, githubURL) + default: + return nil, fmt.Errorf("new github client: unknown auth method %q", creds.Method) + } +} + +func newPATClient(token, githubURL string) (*Client, error) { + inner, err := scaleset.NewClientWithPersonalAccessToken(scaleset.NewClientWithPersonalAccessTokenConfig{ + GitHubConfigURL: githubURL, + PersonalAccessToken: token, + SystemInfo: systemInfo, + }) + if err != nil { + return nil, fmt.Errorf("create PAT client: %w", err) + } + return &Client{inner: inner}, nil +} + +func newAppClient(app *auth.GitHubAppCreds, githubURL string) (*Client, error) { + if app == nil { + return nil, fmt.Errorf("create app client: github_app credentials are nil") + } + + pemBytes, err := os.ReadFile(app.PrivateKeyPath) + if err != nil { + return nil, fmt.Errorf("read private key %s: %w", app.PrivateKeyPath, err) + } + + inner, err := scaleset.NewClientWithGitHubApp(scaleset.ClientWithGitHubAppConfig{ + GitHubConfigURL: githubURL, + GitHubAppAuth: scaleset.GitHubAppAuth{ + ClientID: app.ClientID, + InstallationID: app.InstallationID, + PrivateKey: string(pemBytes), + }, + SystemInfo: systemInfo, + }) + if err != nil { + 
return nil, fmt.Errorf("create app client: %w", err) + } + return &Client{inner: inner}, nil +} + +func (c *Client) CreateScaleSet(ctx context.Context, name string, runnerGroupID int, labels []string) (*scaleset.RunnerScaleSet, error) { + sdkLabels := make([]scaleset.Label, len(labels)) + for i, l := range labels { + sdkLabels[i] = scaleset.Label{Type: "System", Name: l} + } + + ss, err := c.inner.CreateRunnerScaleSet(ctx, &scaleset.RunnerScaleSet{ + Name: name, + RunnerGroupID: runnerGroupID, + Labels: sdkLabels, + RunnerSetting: scaleset.RunnerSetting{DisableUpdate: true}, + }) + if err != nil { + return nil, fmt.Errorf("create scale set %q: %w", name, err) + } + return ss, nil +} + +func (c *Client) GetScaleSet(ctx context.Context, runnerGroupID int, name string) (*scaleset.RunnerScaleSet, error) { + ss, err := c.inner.GetRunnerScaleSet(ctx, runnerGroupID, name) + if err != nil { + return nil, fmt.Errorf("get scale set %q: %w", name, err) + } + return ss, nil +} + +func (c *Client) GetScaleSetByID(ctx context.Context, id int) (*scaleset.RunnerScaleSet, error) { + ss, err := c.inner.GetRunnerScaleSetByID(ctx, id) + if err != nil { + return nil, fmt.Errorf("get scale set by id %d: %w", id, err) + } + return ss, nil +} + +func (c *Client) DeleteScaleSet(ctx context.Context, id int) error { + if err := c.inner.DeleteRunnerScaleSet(ctx, id); err != nil { + return fmt.Errorf("delete scale set %d: %w", id, err) + } + return nil +} + +func (c *Client) GenerateJITConfig(ctx context.Context, scaleSetID int, runnerName string) (string, error) { + jit, err := c.inner.GenerateJitRunnerConfig(ctx, &scaleset.RunnerScaleSetJitRunnerSetting{ + Name: runnerName, + }, scaleSetID) + if err != nil { + return "", fmt.Errorf("generate JIT config for %q: %w", runnerName, err) + } + return jit.EncodedJITConfig, nil +} + +func (c *Client) OpenSession(ctx context.Context, scaleSetID int, owner string) (*scaleset.MessageSessionClient, error) { + session, err := 
c.inner.MessageSessionClient(ctx, scaleSetID, owner) + if err != nil { + return nil, fmt.Errorf("open session for scale set %d: %w", scaleSetID, err) + } + return session, nil +} + +func (c *Client) NewListener(session *scaleset.MessageSessionClient, scaleSetID, maxRunners int) (*listener.Listener, error) { + l, err := listener.New(session, listener.Config{ + ScaleSetID: scaleSetID, + MaxRunners: maxRunners, + }) + if err != nil { + return nil, fmt.Errorf("create listener for scale set %d: %w", scaleSetID, err) + } + return l, nil +} diff --git a/internal/github/client_test.go b/internal/github/client_test.go new file mode 100644 index 0000000..6f39636 --- /dev/null +++ b/internal/github/client_test.go @@ -0,0 +1,143 @@ +package github + +import ( + "os" + "path/filepath" + "testing" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/auth" +) + +func TestNewClient_PAT(t *testing.T) { + creds := &auth.Credentials{ + Method: "pat", + PAT: "ghp_test1234567890", + } + + client, err := NewClient(creds, "https://github.com/test-org") + if err != nil { + t.Fatalf("expected no error, got %v", err) + } + if client == nil { + t.Fatal("expected non-nil client") + } + if client.inner == nil { + t.Fatal("expected non-nil inner client") + } +} + +func TestNewClient_GitHubApp(t *testing.T) { + keyContent := `-----BEGIN RSA PRIVATE KEY----- +MIIEpAIBAAKCAQEA0Z3VS5JJcds3xfn/ygWyF8PbnGy0AHB7MhgHcTz6sE2I2yPB +aNlRtQ8aXEr55FZgMvemuafJoqfiN2OkXvMPMID2KJHnfxJPMSdMoBRk7GkLVOH +OBnG9gVmZ5A6iNFwHGO9BKnL7P7iCfxWJCFxdF0qNGBJjqMJjHb6cDAVJfb0Q5K +xHE6UKJhne1RDmaoW/4Vh+M3OAv8MXPqp0qhBkJYYlTpjRkLjF2MOqMmGKO7UmB +dVjr3HvaGRFnRlq5mzv2JlFjQFPXYiRgDrU/K3Y2MnsQfJGP7TW0j5FFsiZp7vTV +P0WBLZEQy2mvVz9y9X78JiK64ijr4EDRqKi1NQIDAQABAoIBAC5RgZ+hBx7xHNaM +pPgwGMnCd2vHsHwAaXkeAzSdRnLBDqPWJGJmaCF3B/cQHan5IMnVEL2T0KDiWqjh +ax7GiMAPPkCgarSHMC4sPXTR0NHHZxC5bED5z98rIqabSChzmZjDe6FMqpljhdJR +0K/gUVLqCRJjHNdGIFsmi2amEMGdlxEJmH3FvSmhaxAhIfxmSGNNEPzMCQl5mmFM 
+OqoB3BtMdn/qxg9grs08PHshqJdH6QilaRy6KfDEuHpgMZav2RI7sjChTQaI+MUN +FzkaOq1M2C17xjIT3vlQ3WJkQXZrYJP5FGGxgI2RfVROaGE8+BiFKzIGudPJ8NpB +OCSrUmECgYEA7wO6fDL+S6YJdAJ8YTBOfNny/VkECzk2sxhvP5pKGp0tzGAYq3BM +uRjdrR7Cj+cW1gi6DRezMX+r5jMXnBQkmRqyZ6u3r9XSvuEyiGmd+qNWm7iFt6FX +3VdANYsl6xMOPNmAzKm0ZFb0J9J3BHL+F+1adij6YqN+OlTIRLpbpzUCgYEA4G/W +9T1XT/dPIHr7PGBFuJ3vkLNU1ITk2LCPTCkghq9vFf+/F8RQ/eDa9fugVDJnHlMm +qiFUWHfBmoANRrAQKbw8kN6E8Oij1F5Y09mW0fqzlMF1bRUxOJ0SXdyp8RIIYO9n +g5UlD1UqRCsAWxJN7vE1VX/bZb3OIEQ0C+YfKkECgYBHPCA22lpjsJGIbgIEkk9Q +Cm1WlCXBH7SgXBMoJwJfKSIqn4TRJ9RLfMqFLVTJDNIGdIkLUJPR78VR8qJwqifz +LnGPEjMTIZEfHvJlUDI6dEe6n5ENZB9evRQ0MflIsNkGHQ0qzLGLPYGWmJ0TBy8J +aIFZ1GfwBlSPI/4ffNV8bQKBgQCFDMcMJoB+urH7sMFEgH5P3fHEQHjfJNrDaBPM +YCUWa8DTQD9/7HzIepcWKEVr4jSBK2D0B0sFqgHhD0UIc/WW7IQKyKlmEjz7oSR +7YR2FUycBRTxZ6EmGlK5E67z1Q2FHeFJgIq2ip1Rb6VLFy8yAaDPxPQ8YIBNlQdp +S+hkAQKBgQDR4LJibkXz+U/5MhQT+IhEVeEBH5fTkOD6oIOJHd17DMQ5mi+zBPf0 +hB+sQ+zl3lOKJGjTTqdapnJeT8v5JD1TvVCDBii6niUoR6TFB3qxaOjv/VEL1Cf3 +G5FadRKM/l54xfA+mEHxkO/nGxH7fBatEJRE3l6K9MmIq2gOMCF0MQ== +-----END RSA PRIVATE KEY-----` + + tmpDir := t.TempDir() + keyPath := filepath.Join(tmpDir, "test-key.pem") + if err := os.WriteFile(keyPath, []byte(keyContent), 0600); err != nil { + t.Fatalf("write test key: %v", err) + } + + creds := &auth.Credentials{ + Method: "github_app", + GitHubApp: &auth.GitHubAppCreds{ + ClientID: "Iv1.test123", + InstallationID: 12345, + PrivateKeyPath: keyPath, + }, + } + + client, err := NewClient(creds, "https://github.com/test-org") + if err != nil { + t.Fatalf("expected no error, got %v", err) + } + if client == nil { + t.Fatal("expected non-nil client") + } +} + +func TestNewClient_UnknownMethod(t *testing.T) { + creds := &auth.Credentials{ + Method: "oauth", + } + + client, err := NewClient(creds, "https://github.com/test-org") + if err == nil { + t.Fatal("expected error for unknown method") + } + if client != nil { + t.Fatal("expected nil client on error") + } +} + +func TestNewClient_AppNilCreds(t 
*testing.T) { + creds := &auth.Credentials{ + Method: "github_app", + GitHubApp: nil, + } + + client, err := NewClient(creds, "https://github.com/test-org") + if err == nil { + t.Fatal("expected error for nil github_app creds") + } + if client != nil { + t.Fatal("expected nil client on error") + } +} + +func TestNewClient_AppMissingKeyFile(t *testing.T) { + creds := &auth.Credentials{ + Method: "github_app", + GitHubApp: &auth.GitHubAppCreds{ + ClientID: "Iv1.test123", + InstallationID: 12345, + PrivateKeyPath: "/nonexistent/path/key.pem", + }, + } + + client, err := NewClient(creds, "https://github.com/test-org") + if err == nil { + t.Fatal("expected error for missing key file") + } + if client != nil { + t.Fatal("expected nil client on error") + } +} + +func TestNewClient_InvalidGitHubURL(t *testing.T) { + creds := &auth.Credentials{ + Method: "pat", + PAT: "ghp_test1234567890", + } + + client, err := NewClient(creds, "://invalid-url") + if err == nil { + t.Fatal("expected error for invalid URL") + } + if client != nil { + t.Fatal("expected nil client on error") + } +} diff --git a/internal/health/checks.go b/internal/health/checks.go new file mode 100644 index 0000000..3a9a94b --- /dev/null +++ b/internal/health/checks.go @@ -0,0 +1,213 @@ +package health + +import ( + "context" + "fmt" + "syscall" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" +) + +func (m *Monitor) runChecks(ctx context.Context) { + start := time.Now() + + m.mu.Lock() + defer m.mu.Unlock() + + m.issues = m.issues[:0] + + snapshots := m.runners.Snapshots() + totalActual := 0 + totalDesired := 0 + + for group, snaps := range snapshots { + m.checkRunnerLiveness(ctx, group, snaps) + m.checkRunnerTimeouts(ctx, group, snaps) + m.checkIdleTimeouts(ctx, group, snaps) + gs := m.getOrCreateGroup(group) + m.checkGroupDivergence(group, len(snaps), gs) + m.checkConsecutiveFailures(group, gs) + totalActual += len(snaps) + totalDesired += gs.lastDesiredCount + } + + 
m.checkDiskSpace() + m.lastCheck = time.Now() + checkDuration := time.Since(start) + + for _, r := range m.reporters { + r.ReportDaemonHealth(ctx, len(snapshots), totalActual, totalDesired, checkDuration) + } + for group, snaps := range snapshots { + gs := m.getOrCreateGroup(group) + for _, r := range m.reporters { + r.ReportGroupHealth(ctx, group, len(snaps), gs.lastDesiredCount) + } + } + + for _, issue := range m.issues { + m.notifier.Notify(ctx, &model.Event{ + Type: issue.Type, + Level: issue.Level, + Group: issue.Group, + Runner: issue.Runner, + Message: issue.Message, + Timestamp: issue.DetectedAt, + }) + } +} + +func (m *Monitor) checkRunnerLiveness(ctx context.Context, group string, snapshots []model.RunnerSnapshot) { + for _, snap := range snapshots { + if snap.PID <= 0 { + continue + } + if err := syscall.Kill(snap.PID, 0); err != nil { + m.issues = append(m.issues, model.HealthIssue{ + Level: model.LevelError, + Type: model.EventHealthZombieRunner, + Group: group, + Runner: snap.Name, + Message: fmt.Sprintf("runner %s (pid %d) is no longer alive", snap.Name, snap.PID), + DetectedAt: time.Now(), + }) + if m.killer != nil { + if killErr := m.killer.KillRunner(ctx, group, snap.Name); killErr != nil { + m.logger.ErrorContext(ctx, "failed to kill zombie runner", "group", group, "runner", snap.Name, "error", killErr) + } + } + } + } +} + +func (m *Monitor) checkRunnerTimeouts(ctx context.Context, group string, snapshots []model.RunnerSnapshot) { + if m.cfg.RunnerTimeout <= 0 { + return + } + + now := time.Now() + for _, snap := range snapshots { + if snap.State != "busy" { + continue + } + if snap.StartedAt.IsZero() { + continue + } + if now.Sub(snap.StartedAt) <= m.cfg.RunnerTimeout { + continue + } + m.issues = append(m.issues, model.HealthIssue{ + Level: model.LevelWarning, + Type: model.EventHealthRunnerTimeout, + Group: group, + Runner: snap.Name, + Message: fmt.Sprintf("runner %s has been busy for %s (timeout: %s)", snap.Name, 
now.Sub(snap.StartedAt).Round(time.Second), m.cfg.RunnerTimeout), + DetectedAt: now, + }) + if m.killer != nil { + if killErr := m.killer.KillRunner(ctx, group, snap.Name); killErr != nil { + m.logger.ErrorContext(ctx, "failed to kill timed-out runner", "group", group, "runner", snap.Name, "error", killErr) + } + } + } +} + +func (m *Monitor) checkIdleTimeouts(ctx context.Context, group string, snapshots []model.RunnerSnapshot) { + if m.cfg.IdleTimeout <= 0 { + return + } + + minRunners := 0 + if m.cfg.GroupMinRunners != nil { + minRunners = m.cfg.GroupMinRunners[group] + } + + now := time.Now() + var timedOut []model.RunnerSnapshot + for _, snap := range snapshots { + if snap.State != "idle" || snap.StartedAt.IsZero() { + continue + } + if now.Sub(snap.StartedAt) > m.cfg.IdleTimeout { + timedOut = append(timedOut, snap) + } + } + + idleCount := 0 + for _, snap := range snapshots { + if snap.State == "idle" { + idleCount++ + } + } + + killable := idleCount - minRunners + for _, snap := range timedOut { + if killable <= 0 { + break + } + m.issues = append(m.issues, model.HealthIssue{ + Level: model.LevelWarning, + Type: model.EventHealthIdleTimeout, + Group: group, + Runner: snap.Name, + Message: fmt.Sprintf("runner %s has been idle for %s (timeout: %s)", snap.Name, now.Sub(snap.StartedAt).Round(time.Second), m.cfg.IdleTimeout), + DetectedAt: now, + }) + if m.killer != nil { + if killErr := m.killer.KillRunner(ctx, group, snap.Name); killErr != nil { + m.logger.ErrorContext(ctx, "failed to kill idle runner", "group", group, "runner", snap.Name, "error", killErr) + } + } + killable-- + } +} + +func (m *Monitor) checkGroupDivergence(group string, actualCount int, gs *groupState) { + if m.cfg.DivergenceTimeout <= 0 { + return + } + if gs.lastDesiredCount == 0 { + return + } + + if actualCount == gs.lastDesiredCount { + gs.degradedSince = nil + return + } + + now := time.Now() + if gs.degradedSince == nil { + gs.degradedSince = &now + return + } + + if 
now.Sub(*gs.degradedSince) < m.cfg.DivergenceTimeout { + return + } + + m.issues = append(m.issues, model.HealthIssue{ + Level: model.LevelWarning, + Type: model.EventHealthGroupDegraded, + Group: group, + Message: fmt.Sprintf("group %s has %d runners but %d desired for %s", group, actualCount, gs.lastDesiredCount, now.Sub(*gs.degradedSince).Round(time.Second)), + DetectedAt: now, + }) +} + +func (m *Monitor) checkConsecutiveFailures(group string, gs *groupState) { + if m.cfg.MaxConsecutiveFailures <= 0 { + return + } + if gs.consecutiveFailures <= m.cfg.MaxConsecutiveFailures { + return + } + + m.issues = append(m.issues, model.HealthIssue{ + Level: model.LevelCritical, + Type: model.EventHealthGroupFailing, + Group: group, + Message: fmt.Sprintf("group %s has %d consecutive start failures (threshold: %d)", group, gs.consecutiveFailures, m.cfg.MaxConsecutiveFailures), + DetectedAt: time.Now(), + }) +} diff --git a/internal/health/checks_disk.go b/internal/health/checks_disk.go new file mode 100644 index 0000000..68665d2 --- /dev/null +++ b/internal/health/checks_disk.go @@ -0,0 +1,33 @@ +package health + +import ( + "fmt" + "syscall" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" +) + +func (m *Monitor) checkDiskSpace() { + if m.cfg.MinDiskSpace <= 0 { + return + } + + var stat syscall.Statfs_t + if err := syscall.Statfs("/", &stat); err != nil { + m.logger.Warn("failed to check disk space", "error", err) + return + } + + available := int64(stat.Bavail) * int64(stat.Bsize) //nolint:unconvert // Bsize type varies by OS + if available < m.cfg.MinDiskSpace { + m.issues = append(m.issues, model.HealthIssue{ + Level: model.LevelWarning, + Type: model.EventHealthDiskLow, + Group: "", + Runner: "", + Message: fmt.Sprintf("available disk space %d bytes is below minimum %d bytes", available, m.cfg.MinDiskSpace), + DetectedAt: time.Now(), + }) + } +} diff --git a/internal/health/checks_test.go b/internal/health/checks_test.go new file mode 100644 
index 0000000..2f99716 --- /dev/null +++ b/internal/health/checks_test.go @@ -0,0 +1,317 @@ +package health + +import ( + "context" + "fmt" + "log/slog" + "os" + "testing" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" +) + +type noopNotifier struct { + events []model.Event +} + +func (n *noopNotifier) Notify(_ context.Context, event *model.Event) { + n.events = append(n.events, *event) +} + +type fakeRunnerState struct { + snapshots map[string][]model.RunnerSnapshot +} + +func (f *fakeRunnerState) Snapshots() map[string][]model.RunnerSnapshot { + return f.snapshots +} + +type fakeKiller struct { + killed []string + err error +} + +func (f *fakeKiller) KillRunner(_ context.Context, group string, runner string) error { + f.killed = append(f.killed, fmt.Sprintf("%s/%s", group, runner)) + return f.err +} + +func noopLogger() *slog.Logger { + return slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelError + 1})) +} + +func TestCheckIdleTimeouts(t *testing.T) { + tests := []struct { + name string + idleTimeout time.Duration + snapshots []model.RunnerSnapshot + wantIssues int + }{ + { + name: "disabled when timeout is zero", + idleTimeout: 0, + snapshots: []model.RunnerSnapshot{ + {Name: "r1", State: "idle", StartedAt: time.Now().Add(-1 * time.Hour)}, + }, + wantIssues: 0, + }, + { + name: "no issue when under timeout", + idleTimeout: 30 * time.Minute, + snapshots: []model.RunnerSnapshot{ + {Name: "r1", State: "idle", StartedAt: time.Now().Add(-10 * time.Minute)}, + }, + wantIssues: 0, + }, + { + name: "issue when over timeout", + idleTimeout: 30 * time.Minute, + snapshots: []model.RunnerSnapshot{ + {Name: "r1", State: "idle", StartedAt: time.Now().Add(-1 * time.Hour)}, + }, + wantIssues: 1, + }, + { + name: "busy runners are skipped", + idleTimeout: 30 * time.Minute, + snapshots: []model.RunnerSnapshot{ + {Name: "r1", State: "busy", StartedAt: time.Now().Add(-1 * time.Hour)}, + }, + wantIssues: 0, + }, + { + name: 
"zero StartedAt is skipped", + idleTimeout: 30 * time.Minute, + snapshots: []model.RunnerSnapshot{ + {Name: "r1", State: "idle"}, + }, + wantIssues: 0, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + m := newTestMonitor(nil, nil, nil) + m.cfg.IdleTimeout = tt.idleTimeout + m.issues = m.issues[:0] + + m.checkIdleTimeouts(context.Background(), "test-group", tt.snapshots) + + if len(m.issues) != tt.wantIssues { + t.Errorf("expected %d issues, got %d", tt.wantIssues, len(m.issues)) + } + if tt.wantIssues > 0 && m.issues[0].Type != model.EventHealthIdleTimeout { + t.Errorf("expected type %s, got %s", model.EventHealthIdleTimeout, m.issues[0].Type) + } + }) + } +} + +func TestCheckGroupDivergence(t *testing.T) { + tests := []struct { + name string + divergenceTimeout time.Duration + actualCount int + desiredCount int + degradedSince *time.Time + wantIssues int + wantDegraded bool + }{ + { + name: "disabled when timeout is zero", + divergenceTimeout: 0, + actualCount: 1, + desiredCount: 3, + wantIssues: 0, + }, + { + name: "no issue when counts match", + divergenceTimeout: 5 * time.Minute, + actualCount: 3, + desiredCount: 3, + wantIssues: 0, + }, + { + name: "no issue when desired is zero", + divergenceTimeout: 5 * time.Minute, + actualCount: 1, + desiredCount: 0, + wantIssues: 0, + }, + { + name: "first divergence sets degradedSince", + divergenceTimeout: 5 * time.Minute, + actualCount: 1, + desiredCount: 3, + wantIssues: 0, + wantDegraded: true, + }, + { + name: "issue after timeout exceeded", + divergenceTimeout: 5 * time.Minute, + actualCount: 1, + desiredCount: 3, + degradedSince: timePtr(time.Now().Add(-10 * time.Minute)), + wantIssues: 1, + }, + { + name: "no issue before timeout", + divergenceTimeout: 5 * time.Minute, + actualCount: 1, + desiredCount: 3, + degradedSince: timePtr(time.Now().Add(-2 * time.Minute)), + wantIssues: 0, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + m := 
newTestMonitor(nil, nil, nil) + m.cfg.DivergenceTimeout = tt.divergenceTimeout + m.issues = m.issues[:0] + + gs := &groupState{ + lastDesiredCount: tt.desiredCount, + degradedSince: tt.degradedSince, + } + + m.checkGroupDivergence("test-group", tt.actualCount, gs) + + if len(m.issues) != tt.wantIssues { + t.Errorf("expected %d issues, got %d", tt.wantIssues, len(m.issues)) + } + if tt.wantDegraded && gs.degradedSince == nil { + t.Error("expected degradedSince to be set") + } + if tt.wantIssues > 0 && m.issues[0].Type != model.EventHealthGroupDegraded { + t.Errorf("expected type %s, got %s", model.EventHealthGroupDegraded, m.issues[0].Type) + } + }) + } +} + +func TestCheckConsecutiveFailures(t *testing.T) { + tests := []struct { + name string + maxFailures int + failures int + wantIssues int + }{ + { + name: "disabled when max is zero", + maxFailures: 0, + failures: 10, + wantIssues: 0, + }, + { + name: "no issue at threshold", + maxFailures: 5, + failures: 5, + wantIssues: 0, + }, + { + name: "issue above threshold", + maxFailures: 5, + failures: 6, + wantIssues: 1, + }, + { + name: "no issue below threshold", + maxFailures: 5, + failures: 3, + wantIssues: 0, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + m := newTestMonitor(nil, nil, nil) + m.cfg.MaxConsecutiveFailures = tt.maxFailures + m.issues = m.issues[:0] + + gs := &groupState{consecutiveFailures: tt.failures} + m.checkConsecutiveFailures("test-group", gs) + + if len(m.issues) != tt.wantIssues { + t.Errorf("expected %d issues, got %d", tt.wantIssues, len(m.issues)) + } + if tt.wantIssues > 0 { + if m.issues[0].Type != model.EventHealthGroupFailing { + t.Errorf("expected type %s, got %s", model.EventHealthGroupFailing, m.issues[0].Type) + } + if m.issues[0].Level != model.LevelCritical { + t.Errorf("expected level %s, got %s", model.LevelCritical, m.issues[0].Level) + } + } + }) + } +} + +func TestCheckRunnerTimeouts_KillsRunner(t *testing.T) { + killer := &fakeKiller{} + m 
:= NewMonitor( + MonitorConfig{ + Enabled: true, + RunnerTimeout: 1 * time.Hour, + }, + &noopNotifier{}, + nil, + nil, + killer, + noopLogger(), + ) + + snaps := []model.RunnerSnapshot{ + {Name: "r1", State: "busy", PID: 1, StartedAt: time.Now().Add(-2 * time.Hour)}, + } + + m.checkRunnerTimeouts(context.Background(), "group-a", snaps) + + if len(killer.killed) != 1 { + t.Fatalf("expected 1 kill call, got %d", len(killer.killed)) + } + if killer.killed[0] != "group-a/r1" { + t.Errorf("expected kill group-a/r1, got %s", killer.killed[0]) + } +} + +func TestRunChecks_IntegrationWithNotifier(t *testing.T) { + notif := &noopNotifier{} + state := &fakeRunnerState{ + snapshots: map[string][]model.RunnerSnapshot{ + "group-a": { + {Name: "r1", State: "idle", PID: 99999999, StartedAt: time.Now().Add(-2 * time.Hour)}, + }, + }, + } + + m := NewMonitor( + MonitorConfig{ + Enabled: true, + CheckInterval: time.Second, + IdleTimeout: 30 * time.Minute, + }, + notif, + state, + nil, + nil, + noopLogger(), + ) + + m.runChecks(context.Background()) + + foundIdle := false + for _, e := range notif.events { + if e.Type == model.EventHealthIdleTimeout { + foundIdle = true + } + } + if !foundIdle { + t.Error("expected idle timeout event to be notified") + } +} + +func timePtr(t time.Time) *time.Time { + return &t +} diff --git a/internal/health/group_state.go b/internal/health/group_state.go new file mode 100644 index 0000000..fe19267 --- /dev/null +++ b/internal/health/group_state.go @@ -0,0 +1,42 @@ +package health + +import "time" + +type groupState struct { + consecutiveFailures int + degradedSince *time.Time + lastDesiredCount int +} + +func (m *Monitor) getOrCreateGroup(name string) *groupState { + gs, ok := m.groups[name] + if !ok { + gs = &groupState{} + m.groups[name] = gs + } + return gs +} + +func (m *Monitor) UpdateGroupStats(group string, desired int) { + m.mu.Lock() + defer m.mu.Unlock() + + gs := m.getOrCreateGroup(group) + gs.lastDesiredCount = desired +} + +func (m 
*Monitor) RecordStartFailure(group string) { + m.mu.Lock() + defer m.mu.Unlock() + + gs := m.getOrCreateGroup(group) + gs.consecutiveFailures++ +} + +func (m *Monitor) RecordStartSuccess(group string) { + m.mu.Lock() + defer m.mu.Unlock() + + gs := m.getOrCreateGroup(group) + gs.consecutiveFailures = 0 +} diff --git a/internal/health/group_state_test.go b/internal/health/group_state_test.go new file mode 100644 index 0000000..0681c4d --- /dev/null +++ b/internal/health/group_state_test.go @@ -0,0 +1,94 @@ +package health + +import ( + "testing" +) + +func TestUpdateGroupStats(t *testing.T) { + m := newTestMonitor(nil, nil, nil) + + m.UpdateGroupStats("group-a", 3) + + m.mu.RLock() + gs, ok := m.groups["group-a"] + m.mu.RUnlock() + + if !ok { + t.Fatal("expected group-a to exist in groups map") + } + if gs.lastDesiredCount != 3 { + t.Errorf("expected lastDesiredCount=3, got %d", gs.lastDesiredCount) + } +} + +func TestRecordStartFailure(t *testing.T) { + m := newTestMonitor(nil, nil, nil) + + m.RecordStartFailure("group-a") + m.RecordStartFailure("group-a") + m.RecordStartFailure("group-a") + + m.mu.RLock() + gs := m.groups["group-a"] + m.mu.RUnlock() + + if gs.consecutiveFailures != 3 { + t.Errorf("expected 3 consecutive failures, got %d", gs.consecutiveFailures) + } +} + +func TestRecordStartSuccess_ResetsFailures(t *testing.T) { + m := newTestMonitor(nil, nil, nil) + + m.RecordStartFailure("group-a") + m.RecordStartFailure("group-a") + m.RecordStartSuccess("group-a") + + m.mu.RLock() + gs := m.groups["group-a"] + m.mu.RUnlock() + + if gs.consecutiveFailures != 0 { + t.Errorf("expected 0 consecutive failures after success, got %d", gs.consecutiveFailures) + } +} + +func TestGetOrCreateGroup_CreatesIfMissing(t *testing.T) { + m := newTestMonitor(nil, nil, nil) + + m.mu.Lock() + gs := m.getOrCreateGroup("new-group") + m.mu.Unlock() + + if gs == nil { + t.Fatal("expected non-nil groupState") + } + if gs.consecutiveFailures != 0 { + t.Errorf("expected 0 consecutive 
failures for new group, got %d", gs.consecutiveFailures) + } +} + +func TestGetOrCreateGroup_ReturnsExisting(t *testing.T) { + m := newTestMonitor(nil, nil, nil) + + m.RecordStartFailure("group-a") + + m.mu.Lock() + gs := m.getOrCreateGroup("group-a") + m.mu.Unlock() + + if gs.consecutiveFailures != 1 { + t.Errorf("expected 1 consecutive failure for existing group, got %d", gs.consecutiveFailures) + } +} + +func newTestMonitor(runners RunnerStateProvider, killer RunnerKiller, reporters []Reporter) *Monitor { + return NewMonitor( + MonitorConfig{Enabled: true}, + &noopNotifier{}, + runners, + reporters, + killer, + noopLogger(), + ) +} diff --git a/internal/health/monitor.go b/internal/health/monitor.go new file mode 100644 index 0000000..f0e722f --- /dev/null +++ b/internal/health/monitor.go @@ -0,0 +1,103 @@ +package health + +import ( + "context" + "log/slog" + "sync" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" +) + +type RunnerStateProvider interface { + Snapshots() map[string][]model.RunnerSnapshot +} + +type Notifier interface { + Notify(ctx context.Context, event *model.Event) +} + +type Reporter interface { + ReportDaemonHealth(ctx context.Context, groups int, totalActual int, totalDesired int, checkDuration time.Duration) + ReportGroupHealth(ctx context.Context, group string, actual int, desired int) +} + +type RunnerKiller interface { + KillRunner(ctx context.Context, group string, runner string) error +} + +type MonitorConfig struct { + Enabled bool + CheckInterval time.Duration + RunnerTimeout time.Duration + IdleTimeout time.Duration + DivergenceTimeout time.Duration + MaxConsecutiveFailures int + FailureCooldown time.Duration + MinDiskSpace int64 + GroupMinRunners map[string]int +} + +type Monitor struct { + cfg MonitorConfig + logger *slog.Logger + notifier Notifier + runners RunnerStateProvider + reporters []Reporter + killer RunnerKiller + + mu sync.RWMutex + lastCheck time.Time + issues []model.HealthIssue + groups 
map[string]*groupState +} + +func NewMonitor( + cfg MonitorConfig, + notifier Notifier, + runners RunnerStateProvider, + reporters []Reporter, + killer RunnerKiller, + logger *slog.Logger, +) *Monitor { + return &Monitor{ + cfg: cfg, + logger: logger, + notifier: notifier, + runners: runners, + reporters: reporters, + killer: killer, + groups: make(map[string]*groupState), + } +} + +func (m *Monitor) Run(ctx context.Context) error { + if !m.cfg.Enabled { + return nil + } + + ticker := time.NewTicker(m.cfg.CheckInterval) + defer ticker.Stop() + + for { + select { + case <-ctx.Done(): + return nil + case <-ticker.C: + m.runChecks(ctx) + } + } +} + +func (m *Monitor) Status() HealthStatus { + m.mu.RLock() + defer m.mu.RUnlock() + + copied := make([]model.HealthIssue, len(m.issues)) + copy(copied, m.issues) + + return HealthStatus{ + LastCheck: m.lastCheck, + Issues: copied, + } +} diff --git a/internal/health/status.go b/internal/health/status.go new file mode 100644 index 0000000..ad0942b --- /dev/null +++ b/internal/health/status.go @@ -0,0 +1,12 @@ +package health + +import ( + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" +) + +type HealthStatus struct { + LastCheck time.Time + Issues []model.HealthIssue +} diff --git a/internal/launchd/launchctl.go b/internal/launchd/launchctl.go new file mode 100644 index 0000000..0f1d44f --- /dev/null +++ b/internal/launchd/launchctl.go @@ -0,0 +1,38 @@ +package launchd + +import ( + "fmt" + "os/exec" +) + +func launchctlLoad(plistPath string) error { + out, err := exec.Command("launchctl", "load", plistPath).CombinedOutput() + if err != nil { + return fmt.Errorf("launchctl load: %w: %s", err, string(out)) + } + return nil +} + +func launchctlUnload(plistPath string) error { + out, err := exec.Command("launchctl", "unload", plistPath).CombinedOutput() + if err != nil { + return fmt.Errorf("launchctl unload: %w: %s", err, string(out)) + } + return nil +} + +func launchctlStart(label string) error { + out, 
err := exec.Command("launchctl", "start", label).CombinedOutput() + if err != nil { + return fmt.Errorf("launchctl start: %w: %s", err, string(out)) + } + return nil +} + +func launchctlStop(label string) error { + out, err := exec.Command("launchctl", "stop", label).CombinedOutput() + if err != nil { + return fmt.Errorf("launchctl stop: %w: %s", err, string(out)) + } + return nil +} diff --git a/internal/launchd/plist.go b/internal/launchd/plist.go new file mode 100644 index 0000000..84f2b9d --- /dev/null +++ b/internal/launchd/plist.go @@ -0,0 +1,56 @@ +package launchd + +import ( + "bytes" + "fmt" + "text/template" +) + +const plistTemplate = `<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"> +<plist version="1.0"> +<dict> + <key>Label</key> + <string>{{.Label}}</string> + <key>ProgramArguments</key> + <array> + <string>{{.BinaryPath}}</string> + <string>run</string> + <string>--config</string> + <string>{{.ConfigPath}}</string> + </array> + <key>RunAtLoad</key> + <true/> + <key>KeepAlive</key> + <dict> + <key>SuccessfulExit</key> + <false/> + </dict> + <key>StandardOutPath</key> + <string>{{.LogDir}}/daemon.log</string> + <key>StandardErrorPath</key> + <string>{{.LogDir}}/daemon.err</string> + <key>WorkingDirectory</key> + <string>{{.StateDir}}</string> + <key>EnvironmentVariables</key> + <dict> + <key>PATH</key> + <string>/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string> + </dict> +</dict> +</plist> +` + +func generatePlist(cfg *ServiceConfig) ([]byte, error) { + tmpl, err := template.New("plist").Parse(plistTemplate) + if err != nil { + return nil, fmt.Errorf("parse plist template: %w", err) + } + + var buf bytes.Buffer + if err := tmpl.Execute(&buf, cfg); err != nil { + return nil, fmt.Errorf("execute plist template: %w", err) + } + + return buf.Bytes(), nil +} diff --git a/internal/launchd/plist_test.go b/internal/launchd/plist_test.go new file mode 100644 index 0000000..ee8ff4b --- /dev/null +++ b/internal/launchd/plist_test.go @@ -0,0 +1,67 @@ +package launchd + +import ( + "strings" + "testing" +) + +func TestGeneratePlist_ValidConfig(t *testing.T) { + cfg := ServiceConfig{ + Label: "com.ghr.daemon", + BinaryPath: "/usr/local/bin/ghr", + ConfigPath: "/etc/ghr/config.yaml", + LogDir: "/var/log/ghr", + StateDir: "/var/lib/ghr/state", + } + + data, err := generatePlist(&cfg) + if err != nil { + t.Fatalf("generatePlist() error = %v", err) + } + + result := 
string(data) + + checks := []struct { + name string + expected string + }{ + {"xml header", ``}, + {"label", `com.ghr.daemon`}, + {"binary path", `/usr/local/bin/ghr`}, + {"run command", `run`}, + {"config flag", `--config`}, + {"config path", `/etc/ghr/config.yaml`}, + {"stdout path", `/var/log/ghr/daemon.log`}, + {"stderr path", `/var/log/ghr/daemon.err`}, + {"workdir", `/var/lib/ghr/state`}, + {"run at load", ``}, + {"keep alive", `SuccessfulExit`}, + } + + for _, tc := range checks { + t.Run(tc.name, func(t *testing.T) { + if !strings.Contains(result, tc.expected) { + t.Errorf("plist missing %q", tc.expected) + } + }) + } +} + +func TestGeneratePlist_SpecialChars(t *testing.T) { + cfg := ServiceConfig{ + Label: "com.ghr.test", + BinaryPath: "/path/with spaces/ghr", + ConfigPath: "/config/test.yaml", + LogDir: "/tmp/logs", + StateDir: "/tmp/state", + } + + data, err := generatePlist(&cfg) + if err != nil { + t.Fatalf("generatePlist() error = %v", err) + } + + if !strings.Contains(string(data), "/path/with spaces/ghr") { + t.Error("plist should preserve paths with spaces") + } +} diff --git a/internal/launchd/service.go b/internal/launchd/service.go new file mode 100644 index 0000000..0bf494b --- /dev/null +++ b/internal/launchd/service.go @@ -0,0 +1,103 @@ +package launchd + +import ( + "fmt" + "os" + "os/exec" + "path/filepath" + "strconv" + "strings" +) + +type ServiceConfig struct { + Label string + BinaryPath string + ConfigPath string + LogDir string + StateDir string +} + +func DefaultLabel() string { return "com.ghr.daemon" } + +func PlistPath(label string) string { + if os.Getuid() == 0 { + return filepath.Join("/Library", "LaunchDaemons", label+".plist") + } + home, err := os.UserHomeDir() + if err != nil { + home = "." 
+ } + return filepath.Join(home, "Library", "LaunchAgents", label+".plist") +} + +func Install(cfg *ServiceConfig) error { + data, err := generatePlist(cfg) + if err != nil { + return fmt.Errorf("generate plist: %w", err) + } + + plistPath := PlistPath(cfg.Label) + dir := filepath.Dir(plistPath) + if err := os.MkdirAll(dir, 0o755); err != nil { + return fmt.Errorf("create plist directory %s: %w", dir, err) + } + + if err := os.WriteFile(plistPath, data, 0o644); err != nil { + return fmt.Errorf("write plist %s: %w", plistPath, err) + } + + if err := launchctlLoad(plistPath); err != nil { + return fmt.Errorf("launchctl load: %w", err) + } + + if err := launchctlStart(cfg.Label); err != nil { + return fmt.Errorf("launchctl start: %w", err) + } + + return nil +} + +func Uninstall(label string) error { + plistPath := PlistPath(label) + + _ = launchctlStop(label) + _ = launchctlUnload(plistPath) + + if err := os.Remove(plistPath); err != nil && !os.IsNotExist(err) { + return fmt.Errorf("remove plist %s: %w", plistPath, err) + } + + return nil +} + +func IsRunning(label string) bool { + _, running := Status(label) + return running +} + +func Status(label string) (int, bool) { + out, err := exec.Command("launchctl", "list").Output() + if err != nil { + return 0, false + } + + for _, line := range strings.Split(string(out), "\n") { + if !strings.Contains(line, label) { + continue + } + fields := strings.Fields(line) + if len(fields) < 3 { + continue + } + if fields[2] != label { + continue + } + pid, parseErr := strconv.Atoi(fields[0]) + if parseErr != nil || pid <= 0 { + return 0, false + } + return pid, true + } + + return 0, false +} diff --git a/internal/launchd/service_test.go b/internal/launchd/service_test.go new file mode 100644 index 0000000..29882e9 --- /dev/null +++ b/internal/launchd/service_test.go @@ -0,0 +1,56 @@ +package launchd + +import ( + "strings" + "testing" +) + +func TestDefaultLabel(t *testing.T) { + label := DefaultLabel() + if label != 
"com.ghr.daemon" { + t.Errorf("DefaultLabel() = %q, want %q", label, "com.ghr.daemon") + } +} + +func TestPlistPath_NonRoot(t *testing.T) { + path := PlistPath("com.ghr.daemon") + if !strings.HasSuffix(path, "Library/LaunchAgents/com.ghr.daemon.plist") && + !strings.HasSuffix(path, "Library/LaunchDaemons/com.ghr.daemon.plist") { + t.Errorf("PlistPath() = %q, expected LaunchAgents or LaunchDaemons suffix", path) + } +} + +func TestPlistPath_ContainsLabel(t *testing.T) { + tests := []struct { + name string + label string + }{ + {"default label", "com.ghr.daemon"}, + {"custom label", "com.ghr.test"}, + } + + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + path := PlistPath(tc.label) + if !strings.Contains(path, tc.label+".plist") { + t.Errorf("PlistPath(%q) = %q, missing label in path", tc.label, path) + } + }) + } +} + +func TestStatus_NotRunning(t *testing.T) { + pid, running := Status("com.ghr.test.nonexistent.label.12345") + if running { + t.Errorf("Status() running = true for nonexistent label") + } + if pid != 0 { + t.Errorf("Status() pid = %d, want 0", pid) + } +} + +func TestIsRunning_NotRunning(t *testing.T) { + if IsRunning("com.ghr.test.nonexistent.label.12345") { + t.Error("IsRunning() = true for nonexistent label") + } +} diff --git a/internal/logging/handler.go b/internal/logging/handler.go new file mode 100644 index 0000000..7a8c901 --- /dev/null +++ b/internal/logging/handler.go @@ -0,0 +1,52 @@ +package logging + +import ( + "context" + "fmt" + "log/slog" + "os" +) + +type MultiHandler struct { + handlers []slog.Handler +} + +func NewMultiHandler(handlers ...slog.Handler) *MultiHandler { + h := make([]slog.Handler, len(handlers)) + copy(h, handlers) + return &MultiHandler{handlers: h} +} + +func (h *MultiHandler) Enabled(ctx context.Context, level slog.Level) bool { + for _, handler := range h.handlers { + if handler.Enabled(ctx, level) { + return true + } + } + return false +} + +func (h *MultiHandler) Handle(ctx context.Context, r 
slog.Record) error { + for _, handler := range h.handlers { + if err := handler.Handle(ctx, r); err != nil { + fmt.Fprintf(os.Stderr, "logging: handler error: %v\n", err) + } + } + return nil +} + +func (h *MultiHandler) WithAttrs(attrs []slog.Attr) slog.Handler { + cloned := make([]slog.Handler, len(h.handlers)) + for i, handler := range h.handlers { + cloned[i] = handler.WithAttrs(attrs) + } + return &MultiHandler{handlers: cloned} +} + +func (h *MultiHandler) WithGroup(name string) slog.Handler { + cloned := make([]slog.Handler, len(h.handlers)) + for i, handler := range h.handlers { + cloned[i] = handler.WithGroup(name) + } + return &MultiHandler{handlers: cloned} +} diff --git a/internal/logging/level.go b/internal/logging/level.go new file mode 100644 index 0000000..d79c2f2 --- /dev/null +++ b/internal/logging/level.go @@ -0,0 +1,29 @@ +package logging + +import ( + "log/slog" + "strings" +) + +type LogConfig struct { + Level string + Format string + Dir string + RetentionDays int + RunnerOutput bool +} + +func ParseLevel(s string) slog.Level { + switch strings.ToLower(s) { + case "debug": + return slog.LevelDebug + case "info": + return slog.LevelInfo + case "warn": + return slog.LevelWarn + case "error": + return slog.LevelError + default: + return slog.LevelInfo + } +} diff --git a/internal/logging/logger_test.go b/internal/logging/logger_test.go new file mode 100644 index 0000000..900c355 --- /dev/null +++ b/internal/logging/logger_test.go @@ -0,0 +1,602 @@ +package logging + +import ( + "bytes" + "context" + "encoding/json" + "log/slog" + "os" + "path/filepath" + "strings" + "testing" + "time" +) + +// --------------------------------------------------------------------------- +// TestParseLevel +// --------------------------------------------------------------------------- + +func TestParseLevel(t *testing.T) { + tests := []struct { + name string + input string + want slog.Level + }{ + {"debug lowercase", "debug", slog.LevelDebug}, + {"info lowercase", 
"info", slog.LevelInfo}, + {"warn lowercase", "warn", slog.LevelWarn}, + {"error lowercase", "error", slog.LevelError}, + {"DEBUG uppercase", "DEBUG", slog.LevelDebug}, + {"Info mixed case", "Info", slog.LevelInfo}, + {"unknown defaults to info", "unknown", slog.LevelInfo}, + {"empty defaults to info", "", slog.LevelInfo}, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + got := ParseLevel(tt.input) + if got != tt.want { + t.Errorf("ParseLevel(%q) = %v, want %v", tt.input, got, tt.want) + } + }) + } +} + +// --------------------------------------------------------------------------- +// TestNew +// --------------------------------------------------------------------------- + +func TestNew(t *testing.T) { + t.Run("valid config creates dirs", func(t *testing.T) { + dir := t.TempDir() + cfg := LogConfig{ + Level: "info", + Format: "json", + Dir: dir, + } + + mgr, err := New(cfg) + if err != nil { + t.Fatalf("New() error = %v", err) + } + defer mgr.Close() + + daemonDir := filepath.Join(dir, "daemon") + groupsDir := filepath.Join(dir, "groups") + + if info, statErr := os.Stat(daemonDir); statErr != nil || !info.IsDir() { + t.Errorf("daemon directory not created at %s", daemonDir) + } + if info, statErr := os.Stat(groupsDir); statErr != nil || !info.IsDir() { + t.Errorf("groups directory not created at %s", groupsDir) + } + }) + + t.Run("empty Dir returns error", func(t *testing.T) { + cfg := LogConfig{Dir: ""} + _, err := New(cfg) + if err == nil { + t.Fatal("New() with empty Dir should return error") + } + if !strings.Contains(err.Error(), "dir must not be empty") { + t.Errorf("unexpected error message: %v", err) + } + }) +} + +// --------------------------------------------------------------------------- +// TestMultiHandler +// --------------------------------------------------------------------------- + +func TestMultiHandler(t *testing.T) { + t.Run("fans out to all handlers", func(t *testing.T) { + var buf1, buf2 bytes.Buffer + h1 := 
slog.NewJSONHandler(&buf1, &slog.HandlerOptions{Level: slog.LevelDebug}) + h2 := slog.NewJSONHandler(&buf2, &slog.HandlerOptions{Level: slog.LevelDebug}) + + multi := NewMultiHandler(h1, h2) + logger := slog.New(multi) + logger.Info("hello multi") + + for i, buf := range []*bytes.Buffer{&buf1, &buf2} { + content := buf.String() + if content == "" { + t.Errorf("buffer %d is empty, expected log output", i) + continue + } + var entry map[string]interface{} + if err := json.Unmarshal([]byte(strings.TrimSpace(content)), &entry); err != nil { + t.Errorf("buffer %d: failed to parse JSON: %v", i, err) + continue + } + if msg, ok := entry["msg"].(string); !ok || msg != "hello multi" { + t.Errorf("buffer %d: msg = %v, want %q", i, entry["msg"], "hello multi") + } + } + }) + + t.Run("WithAttrs propagates to all handlers", func(t *testing.T) { + var buf1, buf2 bytes.Buffer + h1 := slog.NewJSONHandler(&buf1, &slog.HandlerOptions{Level: slog.LevelDebug}) + h2 := slog.NewJSONHandler(&buf2, &slog.HandlerOptions{Level: slog.LevelDebug}) + + multi := NewMultiHandler(h1, h2) + withAttrs := multi.WithAttrs([]slog.Attr{slog.String("key", "val")}) + logger := slog.New(withAttrs) + logger.Info("with attrs") + + for i, buf := range []*bytes.Buffer{&buf1, &buf2} { + content := buf.String() + var entry map[string]interface{} + if err := json.Unmarshal([]byte(strings.TrimSpace(content)), &entry); err != nil { + t.Errorf("buffer %d: failed to parse JSON: %v", i, err) + continue + } + if v, ok := entry["key"].(string); !ok || v != "val" { + t.Errorf("buffer %d: key = %v, want %q", i, entry["key"], "val") + } + } + }) + + t.Run("Enabled returns true if any handler is enabled", func(t *testing.T) { + // h1 only enabled at Error, h2 enabled at Debug + h1 := slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelError}) + h2 := slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelDebug}) + multi := NewMultiHandler(h1, h2) + + // Debug should be enabled because h2 
accepts it + if !multi.Enabled(context.TODO(), slog.LevelDebug) { + t.Error("Enabled(Debug) = false, want true (h2 accepts Debug)") + } + // Info should be enabled because h2 accepts it + if !multi.Enabled(context.TODO(), slog.LevelInfo) { + t.Error("Enabled(Info) = false, want true") + } + }) + + t.Run("Enabled returns false when no handler is enabled", func(t *testing.T) { + h1 := slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelError}) + h2 := slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelError}) + multi := NewMultiHandler(h1, h2) + + if multi.Enabled(context.TODO(), slog.LevelDebug) { + t.Error("Enabled(Debug) = true, want false (both require Error)") + } + }) +} + +// --------------------------------------------------------------------------- +// helpers +// --------------------------------------------------------------------------- + +// newTestManager creates a LogManager in a temporary directory with debug level. +func newTestManager(t *testing.T) *LogManager { + t.Helper() + dir := t.TempDir() + cfg := LogConfig{ + Level: "debug", + Format: "json", + Dir: dir, + RunnerOutput: true, + } + mgr, err := New(cfg) + if err != nil { + t.Fatalf("newTestManager: %v", err) + } + t.Cleanup(func() { mgr.Close() }) + return mgr +} + +// readJSONLines reads a file and returns each line as a parsed JSON map. 
+func readJSONLines(t *testing.T, path string) []map[string]interface{} { + t.Helper() + data, err := os.ReadFile(path) + if err != nil { + t.Fatalf("readJSONLines: read %s: %v", path, err) + } + var result []map[string]interface{} + for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") { + if line == "" { + continue + } + var entry map[string]interface{} + if err := json.Unmarshal([]byte(line), &entry); err != nil { + t.Fatalf("readJSONLines: parse line %q: %v", line, err) + } + result = append(result, entry) + } + return result +} + +// todayFile returns the log filename for the current (possibly mocked) date. +func todayFile() string { + return nowFunc().Format("2006-01-02") + ".json" +} + +// --------------------------------------------------------------------------- +// TestDaemonLogger +// --------------------------------------------------------------------------- + +func TestDaemonLogger(t *testing.T) { + mgr := newTestManager(t) + + logger, err := mgr.DaemonLogger() + if err != nil { + t.Fatalf("DaemonLogger() error = %v", err) + } + + logger.Info("daemon test message") + + // Flush: close the manager so files are flushed. 
+ if err := mgr.Close(); err != nil { + t.Fatalf("Close() error = %v", err) + } + + logFile := filepath.Join(mgr.rootDir, "daemon", todayFile()) + entries := readJSONLines(t, logFile) + if len(entries) == 0 { + t.Fatal("expected at least one log entry in daemon log") + } + + found := false + for _, e := range entries { + if msg, ok := e["msg"].(string); ok && msg == "daemon test message" { + found = true + if comp, ok := e["component"].(string); !ok || comp != "daemon" { + t.Errorf("component = %v, want %q", e["component"], "daemon") + } + } + } + if !found { + t.Error("did not find 'daemon test message' in daemon log file") + } +} + +// --------------------------------------------------------------------------- +// TestGroupLogger +// --------------------------------------------------------------------------- + +func TestGroupLogger(t *testing.T) { + mgr := newTestManager(t) + + logger, err := mgr.GroupLogger("test-group") + if err != nil { + t.Fatalf("GroupLogger() error = %v", err) + } + + logger.Info("group test message") + + if err := mgr.Close(); err != nil { + t.Fatalf("Close() error = %v", err) + } + + // Check group log file. + groupFile := filepath.Join(mgr.rootDir, "groups", "test-group", todayFile()) + groupEntries := readJSONLines(t, groupFile) + found := false + for _, e := range groupEntries { + if msg, ok := e["msg"].(string); ok && msg == "group test message" { + found = true + if comp, ok := e["component"].(string); !ok || comp != "group" { + t.Errorf("component = %v, want %q", e["component"], "group") + } + if g, ok := e["group"].(string); !ok || g != "test-group" { + t.Errorf("group = %v, want %q", e["group"], "test-group") + } + } + } + if !found { + t.Errorf("did not find 'group test message' in group log file %s", groupFile) + } + + // Check propagation to daemon log. 
+ daemonFile := filepath.Join(mgr.rootDir, "daemon", todayFile()) + daemonEntries := readJSONLines(t, daemonFile) + found = false + for _, e := range daemonEntries { + if msg, ok := e["msg"].(string); ok && msg == "group test message" { + found = true + } + } + if !found { + t.Error("group message did not propagate to daemon log file") + } +} + +// --------------------------------------------------------------------------- +// TestRunnerLogger +// --------------------------------------------------------------------------- + +func TestRunnerLogger(t *testing.T) { + mgr := newTestManager(t) + + logger, err := mgr.RunnerLogger("test-group", "runner-abc") + if err != nil { + t.Fatalf("RunnerLogger() error = %v", err) + } + + logger.Info("runner test message") + + if err := mgr.Close(); err != nil { + t.Fatalf("Close() error = %v", err) + } + + today := todayFile() + + // Verify message in runner log. + runnerFile := filepath.Join(mgr.rootDir, "groups", "test-group", "runners", "runner-abc", today) + runnerEntries := readJSONLines(t, runnerFile) + found := false + for _, e := range runnerEntries { + if msg, ok := e["msg"].(string); ok && msg == "runner test message" { + found = true + if comp, ok := e["component"].(string); !ok || comp != "runner" { + t.Errorf("runner log: component = %v, want %q", e["component"], "runner") + } + if g, ok := e["group"].(string); !ok || g != "test-group" { + t.Errorf("runner log: group = %v, want %q", e["group"], "test-group") + } + if r, ok := e["runner"].(string); !ok || r != "runner-abc" { + t.Errorf("runner log: runner = %v, want %q", e["runner"], "runner-abc") + } + } + } + if !found { + t.Errorf("did not find 'runner test message' in runner log file %s", runnerFile) + } + + // Verify propagation to group log. 
+ groupFile := filepath.Join(mgr.rootDir, "groups", "test-group", today) + groupEntries := readJSONLines(t, groupFile) + found = false + for _, e := range groupEntries { + if msg, ok := e["msg"].(string); ok && msg == "runner test message" { + found = true + } + } + if !found { + t.Error("runner message did not propagate to group log file") + } + + // Verify propagation to daemon log. + daemonFile := filepath.Join(mgr.rootDir, "daemon", today) + daemonEntries := readJSONLines(t, daemonFile) + found = false + for _, e := range daemonEntries { + if msg, ok := e["msg"].(string); ok && msg == "runner test message" { + found = true + } + } + if !found { + t.Error("runner message did not propagate to daemon log file") + } +} + +// --------------------------------------------------------------------------- +// TestDateRotation +// --------------------------------------------------------------------------- + +func TestDateRotation(t *testing.T) { + orig := nowFunc + defer func() { nowFunc = orig }() + + day1 := time.Date(2024, 1, 15, 12, 0, 0, 0, time.UTC) + day2 := time.Date(2024, 1, 16, 12, 0, 0, 0, time.UTC) + + nowFunc = func() time.Time { return day1 } + + dir := t.TempDir() + w, err := newDateAwareWriter(dir) + if err != nil { + t.Fatalf("newDateAwareWriter() error = %v", err) + } + defer w.Close() + + // Write on day 1. + _, err = w.Write([]byte("day1 line\n")) + if err != nil { + t.Fatalf("Write day1: %v", err) + } + + file1 := filepath.Join(dir, "2024-01-15.json") + if _, statErr := os.Stat(file1); statErr != nil { + t.Errorf("expected file %s to exist after day1 write", file1) + } + + // Advance to day 2. + nowFunc = func() time.Time { return day2 } + + _, err = w.Write([]byte("day2 line\n")) + if err != nil { + t.Fatalf("Write day2: %v", err) + } + + file2 := filepath.Join(dir, "2024-01-16.json") + if _, statErr := os.Stat(file2); statErr != nil { + t.Errorf("expected file %s to exist after day2 write", file2) + } + + // Verify contents. 
+ data1, err := os.ReadFile(file1) + if err != nil { + t.Fatalf("ReadFile day1: %v", err) + } + if !strings.Contains(string(data1), "day1 line") { + t.Errorf("day1 file content = %q, want to contain %q", data1, "day1 line") + } + + data2, err := os.ReadFile(file2) + if err != nil { + t.Fatalf("ReadFile day2: %v", err) + } + if !strings.Contains(string(data2), "day2 line") { + t.Errorf("day2 file content = %q, want to contain %q", data2, "day2 line") + } +} + +// --------------------------------------------------------------------------- +// TestRunnerOutputFile +// --------------------------------------------------------------------------- + +func TestRunnerOutputFile(t *testing.T) { + mgr := newTestManager(t) + + wc, err := mgr.RunnerOutputFile("group", "runner") + if err != nil { + t.Fatalf("RunnerOutputFile() error = %v", err) + } + + payload := []byte("some runner output\n") + n, err := wc.Write(payload) + if err != nil { + t.Fatalf("Write() error = %v", err) + } + if n != len(payload) { + t.Errorf("Write() wrote %d bytes, want %d", n, len(payload)) + } + + outFile := filepath.Join(mgr.rootDir, "groups", "group", "runners", "runner", todayFile()) + if _, statErr := os.Stat(outFile); statErr != nil { + t.Errorf("expected output file at %s", outFile) + } + + if err := wc.Close(); err != nil { + t.Errorf("Close() error = %v", err) + } + + data, err := os.ReadFile(outFile) + if err != nil { + t.Fatalf("ReadFile: %v", err) + } + if !strings.Contains(string(data), "some runner output") { + t.Errorf("output file content = %q, want to contain %q", data, "some runner output") + } +} + +// --------------------------------------------------------------------------- +// TestCleanupOldLogs +// --------------------------------------------------------------------------- + +func TestCleanupOldLogs(t *testing.T) { + dir := t.TempDir() + cfg := LogConfig{ + Level: "info", + Format: "json", + Dir: dir, + RetentionDays: 1, + } + mgr, err := New(cfg) + if err != nil { + 
t.Fatalf("New() error = %v", err) + } + defer mgr.Close() + + daemonDir := filepath.Join(dir, "daemon") + + // Create an old log file (modification time 3 days ago). + oldFile := filepath.Join(daemonDir, "2024-01-10.json") + if err := os.WriteFile(oldFile, []byte(`{"msg":"old"}`+"\n"), 0o644); err != nil { + t.Fatalf("WriteFile old: %v", err) + } + oldTime := time.Now().AddDate(0, 0, -3) + if err := os.Chtimes(oldFile, oldTime, oldTime); err != nil { + t.Fatalf("Chtimes old: %v", err) + } + + // Create a fresh log file (modification time is now). + freshFile := filepath.Join(daemonDir, "2024-01-15.json") + if err := os.WriteFile(freshFile, []byte(`{"msg":"fresh"}`+"\n"), 0o644); err != nil { + t.Fatalf("WriteFile fresh: %v", err) + } + + if err := mgr.CleanupOldLogs(); err != nil { + t.Fatalf("CleanupOldLogs() error = %v", err) + } + + // Old file should be deleted. + if _, statErr := os.Stat(oldFile); !os.IsNotExist(statErr) { + t.Errorf("old file %s should have been deleted", oldFile) + } + + // Fresh file should remain. + if _, statErr := os.Stat(freshFile); statErr != nil { + t.Errorf("fresh file %s should still exist: %v", freshFile, statErr) + } +} + +// --------------------------------------------------------------------------- +// TestCleanupOldLogs_Disabled +// --------------------------------------------------------------------------- + +func TestCleanupOldLogs_Disabled(t *testing.T) { + dir := t.TempDir() + cfg := LogConfig{ + Level: "info", + Format: "json", + Dir: dir, + RetentionDays: 0, // disabled + } + mgr, err := New(cfg) + if err != nil { + t.Fatalf("New() error = %v", err) + } + defer mgr.Close() + + daemonDir := filepath.Join(dir, "daemon") + + // Create an old file. 
+ oldFile := filepath.Join(daemonDir, "2020-01-01.json") + if err := os.WriteFile(oldFile, []byte(`{"msg":"ancient"}`+"\n"), 0o644); err != nil { + t.Fatalf("WriteFile: %v", err) + } + oldTime := time.Now().AddDate(-4, 0, 0) + if err := os.Chtimes(oldFile, oldTime, oldTime); err != nil { + t.Fatalf("Chtimes: %v", err) + } + + if err := mgr.CleanupOldLogs(); err != nil { + t.Fatalf("CleanupOldLogs() error = %v", err) + } + + // File should NOT be deleted when RetentionDays=0. + if _, statErr := os.Stat(oldFile); statErr != nil { + t.Errorf("old file %s should NOT have been deleted (RetentionDays=0): %v", oldFile, statErr) + } +} + +// --------------------------------------------------------------------------- +// TestClose +// --------------------------------------------------------------------------- + +func TestClose(t *testing.T) { + mgr := newTestManager(t) + + // Create several loggers to open multiple writers. + if _, err := mgr.DaemonLogger(); err != nil { + t.Fatalf("DaemonLogger: %v", err) + } + if _, err := mgr.GroupLogger("group-a"); err != nil { + t.Fatalf("GroupLogger: %v", err) + } + if _, err := mgr.RunnerLogger("group-a", "runner-1"); err != nil { + t.Fatalf("RunnerLogger: %v", err) + } + + mgr.mu.Lock() + writerCount := len(mgr.writers) + mgr.mu.Unlock() + if writerCount == 0 { + t.Error("expected writers to be tracked before Close()") + } + + if err := mgr.Close(); err != nil { + t.Fatalf("Close() error = %v", err) + } + + mgr.mu.Lock() + writersAfter := mgr.writers + mgr.mu.Unlock() + if writersAfter != nil { + t.Errorf("expected writers to be nil after Close(), got len=%d", len(writersAfter)) + } +} diff --git a/internal/logging/manager.go b/internal/logging/manager.go new file mode 100644 index 0000000..1e1abb9 --- /dev/null +++ b/internal/logging/manager.go @@ -0,0 +1,180 @@ +package logging + +import ( + "context" + "fmt" + "io" + "log/slog" + "os" + "path/filepath" + "strings" + "sync" + "time" +) + +type LogManager struct { + cfg LogConfig + 
rootDir string + level slog.Level + + mu sync.Mutex + writers []*dateAwareWriter +} + +func New(cfg LogConfig) (*LogManager, error) { + if cfg.Dir == "" { + return nil, fmt.Errorf("logging: dir must not be empty") + } + + daemonDir := filepath.Join(cfg.Dir, "daemon") + groupsDir := filepath.Join(cfg.Dir, "groups") + + if err := os.MkdirAll(daemonDir, 0o755); err != nil { + return nil, fmt.Errorf("logging: create daemon dir: %w", err) + } + if err := os.MkdirAll(groupsDir, 0o755); err != nil { + return nil, fmt.Errorf("logging: create groups dir: %w", err) + } + + return &LogManager{ + cfg: cfg, + rootDir: cfg.Dir, + level: ParseLevel(cfg.Level), + }, nil +} + +func (m *LogManager) Close() error { + m.mu.Lock() + defer m.mu.Unlock() + var firstErr error + for _, w := range m.writers { + if err := w.Close(); err != nil && firstErr == nil { + firstErr = err + } + } + m.writers = nil + return firstErr +} + +func (m *LogManager) trackWriter(w *dateAwareWriter) { + m.mu.Lock() + defer m.mu.Unlock() + m.writers = append(m.writers, w) +} + +func (m *LogManager) consoleHandler() slog.Handler { + opts := &slog.HandlerOptions{Level: m.level} + if strings.EqualFold(m.cfg.Format, "json") { + return slog.NewJSONHandler(os.Stderr, opts) + } + return slog.NewTextHandler(os.Stderr, opts) +} + +func (m *LogManager) fileHandler(subdir string) (slog.Handler, error) { + dir := filepath.Join(m.rootDir, subdir) + w, err := newDateAwareWriter(dir) + if err != nil { + return nil, err + } + m.trackWriter(w) + opts := &slog.HandlerOptions{Level: m.level} + return slog.NewJSONHandler(w, opts), nil +} + +func (m *LogManager) DaemonLogger() (*slog.Logger, error) { + daemonFileH, err := m.fileHandler("daemon") + if err != nil { + return nil, fmt.Errorf("logging: daemon file handler: %w", err) + } + multi := NewMultiHandler(daemonFileH, m.consoleHandler()) + return slog.New(multi).With("component", "daemon"), nil +} + +func (m *LogManager) GroupLogger(group string) (*slog.Logger, error) { + 
groupDir := filepath.Join("groups", group) + groupFileH, err := m.fileHandler(groupDir) + if err != nil { + return nil, fmt.Errorf("logging: group file handler for %q: %w", group, err) + } + + daemonFileH, err := m.fileHandler("daemon") + if err != nil { + return nil, fmt.Errorf("logging: daemon file handler (group %q): %w", group, err) + } + + multi := NewMultiHandler(groupFileH, daemonFileH, m.consoleHandler()) + return slog.New(multi).With("component", "group", "group", group), nil +} + +func (m *LogManager) RunnerLogger(group, runner string) (*slog.Logger, error) { + runnerDir := filepath.Join("groups", group, "runners", runner) + runnerFileH, err := m.fileHandler(runnerDir) + if err != nil { + return nil, fmt.Errorf("logging: runner file handler for %q/%q: %w", group, runner, err) + } + + groupDir := filepath.Join("groups", group) + groupFileH, err := m.fileHandler(groupDir) + if err != nil { + return nil, fmt.Errorf("logging: group file handler for runner %q/%q: %w", group, runner, err) + } + + daemonFileH, err := m.fileHandler("daemon") + if err != nil { + return nil, fmt.Errorf("logging: daemon file handler (runner %q/%q): %w", group, runner, err) + } + + multi := NewMultiHandler(runnerFileH, groupFileH, daemonFileH, m.consoleHandler()) + return slog.New(multi).With("component", "runner", "group", group, "runner", runner), nil +} + +func (m *LogManager) RunnerOutputFile(group, runner string) (io.WriteCloser, error) { + dir := filepath.Join(m.rootDir, "groups", group, "runners", runner) + w, err := newDateAwareWriter(dir) + if err != nil { + return nil, fmt.Errorf("logging: runner output file for %q/%q: %w", group, runner, err) + } + m.trackWriter(w) + return w, nil +} + +func (m *LogManager) StartCleanupScheduler(ctx context.Context) error { + ticker := time.NewTicker(24 * time.Hour) + defer ticker.Stop() + + for { + select { + case <-ctx.Done(): + return nil + case <-ticker.C: + if err := m.CleanupOldLogs(); err != nil { + fmt.Fprintf(os.Stderr, "log 
cleanup error: %v\n", err) + } + } + } +} + +func (m *LogManager) CleanupOldLogs() error { + if m.cfg.RetentionDays <= 0 { + return nil + } + cutoff := nowFunc().AddDate(0, 0, -m.cfg.RetentionDays) + + return filepath.Walk(m.rootDir, func(path string, info os.FileInfo, err error) error { + if err != nil { + return fmt.Errorf("logging: walk %s: %w", path, err) + } + if info.IsDir() { + return nil + } + if !strings.HasSuffix(info.Name(), ".json") { + return nil + } + if info.ModTime().Before(cutoff) { + if removeErr := os.Remove(path); removeErr != nil { + return fmt.Errorf("logging: remove old log %s: %w", path, removeErr) + } + } + return nil + }) +} diff --git a/internal/logging/writer.go b/internal/logging/writer.go new file mode 100644 index 0000000..5a14652 --- /dev/null +++ b/internal/logging/writer.go @@ -0,0 +1,65 @@ +package logging + +import ( + "fmt" + "os" + "path/filepath" + "sync" + "time" +) + +var nowFunc = time.Now + +type dateAwareWriter struct { + mu sync.Mutex + dir string + current *os.File + today string +} + +func newDateAwareWriter(dir string) (*dateAwareWriter, error) { + if err := os.MkdirAll(dir, 0o755); err != nil { + return nil, fmt.Errorf("logging: create dir %s: %w", dir, err) + } + w := &dateAwareWriter{dir: dir} + if err := w.rotate(); err != nil { + return nil, err + } + return w, nil +} + +func (w *dateAwareWriter) rotate() error { + today := nowFunc().Format("2006-01-02") + if w.current != nil && w.today == today { + return nil + } + if w.current != nil { + w.current.Close() + } + path := filepath.Join(w.dir, today+".json") + f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644) + if err != nil { + return fmt.Errorf("logging: open %s: %w", path, err) + } + w.current = f + w.today = today + return nil +} + +func (w *dateAwareWriter) Write(p []byte) (int, error) { + w.mu.Lock() + defer w.mu.Unlock() + if err := w.rotate(); err != nil { + return 0, err + } + return w.current.Write(p) +} + +func (w *dateAwareWriter) 
Close() error { + w.mu.Lock() + defer w.mu.Unlock() + if w.current != nil { + return w.current.Close() + } + return nil +} diff --git a/internal/model/event.go b/internal/model/event.go new file mode 100644 index 0000000..5b9258d --- /dev/null +++ b/internal/model/event.go @@ -0,0 +1,47 @@ +package model + +import "time" + +type EventLevel string + +const ( + LevelInfo EventLevel = "info" + LevelWarning EventLevel = "warning" + LevelError EventLevel = "error" + LevelCritical EventLevel = "critical" +) + +const ( + EventDaemonStart = "daemon.start" + EventDaemonStop = "daemon.stop" + EventDaemonCrash = "daemon.crash" + + EventGroupCreated = "group.created" + EventGroupDeleted = "group.deleted" + EventGroupScaleUp = "group.scale_up" + EventGroupScaleDown = "group.scale_down" + + EventRunnerStarted = "runner.started" + EventRunnerCompleted = "runner.completed" + EventRunnerFailed = "runner.failed" + EventRunnerTimeout = "runner.timeout" + + EventHealthZombieRunner = "health.zombie_runner" + EventHealthRunnerTimeout = "health.runner_timeout" + EventHealthGroupDegraded = "health.group_degraded" + EventHealthGroupDisconnected = "health.group_disconnected" + EventHealthGroupFailing = "health.group_failing" + EventHealthDiskLow = "health.disk_low" + EventHealthOrphanKilled = "health.orphan_killed" + EventHealthIdleTimeout = "health.idle_timeout" +) + +type Event struct { + Type string + Level EventLevel + Group string + Runner string + Message string + Details map[string]string + Timestamp time.Time +} diff --git a/internal/model/group.go b/internal/model/group.go new file mode 100644 index 0000000..8e7e366 --- /dev/null +++ b/internal/model/group.go @@ -0,0 +1,29 @@ +package model + +import "time" + +type Group struct { + Name string + MaxRunners int + MinRunners int + Labels []string + RunnerGroup string +} + +type RunnerInstance struct { + ID string + Name string + Group string + WorkDir string + Version string +} + +type RunnerSnapshot struct { + Name string 
`json:"name"` + Group string `json:"group"` + State string `json:"state"` + PID int `json:"pid"` + StartedAt time.Time `json:"started_at"` + JobName string `json:"job_name"` + JobID string `json:"job_id"` +} diff --git a/internal/model/health.go b/internal/model/health.go new file mode 100644 index 0000000..fd10efa --- /dev/null +++ b/internal/model/health.go @@ -0,0 +1,21 @@ +package model + +import "time" + +type GroupHealthStatus struct { + Actual int + Desired int + Max int + Min int + Healthy bool + Issues []HealthIssue +} + +type HealthIssue struct { + Level EventLevel `json:"level"` + Type string `json:"type"` + Group string `json:"group"` + Runner string `json:"runner"` + Message string `json:"message"` + DetectedAt time.Time `json:"detected_at"` +} diff --git a/internal/monitoring/uptimekuma.go b/internal/monitoring/uptimekuma.go new file mode 100644 index 0000000..1803037 --- /dev/null +++ b/internal/monitoring/uptimekuma.go @@ -0,0 +1,118 @@ +package monitoring + +import ( + "context" + "fmt" + "log/slog" + "net/http" + "net/url" + "strings" + "time" +) + +type UptimeKumaConfig struct { + BaseURL string + DaemonToken string + GroupTokens map[string]string + DegradedThreshold float64 + ReportHealthAsPing bool +} + +type UptimeKuma struct { + cfg UptimeKumaConfig + client *http.Client + logger *slog.Logger +} + +func NewUptimeKuma(cfg UptimeKumaConfig, logger *slog.Logger) *UptimeKuma { + return &UptimeKuma{ + cfg: cfg, + client: &http.Client{ + Timeout: 10 * time.Second, + }, + logger: logger, + } +} + +func (u *UptimeKuma) ReportDaemonHealth(ctx context.Context, groups, totalActual, totalDesired int, checkDuration time.Duration) { + if u.cfg.DaemonToken == "" { + return + } + + msg := fmt.Sprintf("groups=%d runners=%d/%d", groups, totalActual, totalDesired) + ping := float64(checkDuration.Milliseconds()) + + pushErr := u.push(ctx, u.cfg.DaemonToken, "up", msg, ping) + if pushErr != nil { + u.logger.Warn("uptime-kuma daemon push failed", "error", pushErr) 
+ } +} + +func (u *UptimeKuma) ReportGroupHealth(ctx context.Context, group string, actual, desired int) { + token, ok := u.cfg.GroupTokens[group] + if !ok || token == "" { + return + } + + status, msg := groupStatus(actual, desired, u.cfg.DegradedThreshold) + ping := -1.0 + if u.cfg.ReportHealthAsPing && desired > 0 { + ping = (float64(actual) / float64(desired)) * 100 + } + + pushErr := u.push(ctx, token, status, msg, ping) + if pushErr != nil { + u.logger.Warn("uptime-kuma group push failed", "group", group, "error", pushErr) + } +} + +func (u *UptimeKuma) push(ctx context.Context, token, status, msg string, ping float64) error { + baseURL := strings.TrimRight(u.cfg.BaseURL, "/") + pushURL := fmt.Sprintf("%s/api/push/%s?status=%s&msg=%s", + baseURL, token, status, url.QueryEscape(truncateMsg(msg, 250))) + + if ping >= 0 { + pushURL += fmt.Sprintf("&ping=%.1f", ping) + } + + req, err := http.NewRequestWithContext(ctx, http.MethodGet, pushURL, http.NoBody) + if err != nil { + return fmt.Errorf("create push request: %w", err) + } + + resp, err := u.client.Do(req) + if err != nil { + return fmt.Errorf("push request: %w", err) + } + defer resp.Body.Close() + + if resp.StatusCode != http.StatusOK { + return fmt.Errorf("push failed: HTTP %d", resp.StatusCode) + } + return nil +} + +func groupStatus(actual, desired int, threshold float64) (status, msg string) { + if desired == 0 { + return "up", "idle (0 desired)" + } + if actual == 0 { + return "down", fmt.Sprintf("0/%d runners (outage)", desired) + } + + ratio := float64(actual) / float64(desired) + if ratio < threshold { + return "down", fmt.Sprintf("%d/%d runners (critical)", actual, desired) + } + if actual < desired { + return "up", fmt.Sprintf("%d/%d runners (degraded)", actual, desired) + } + return "up", fmt.Sprintf("%d/%d runners", actual, desired) +} + +func truncateMsg(s string, maxLen int) string { + if len(s) <= maxLen { + return s + } + return s[:maxLen] +} diff --git a/internal/notification/discord.go 
b/internal/notification/discord.go new file mode 100644 index 0000000..38b157e --- /dev/null +++ b/internal/notification/discord.go @@ -0,0 +1,122 @@ +package notification + +import ( + "bytes" + "context" + "encoding/json" + "fmt" + "io" + "net/http" + "strconv" + "sync" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" +) + +const discordMinInterval = 400 * time.Millisecond + +type DiscordConfig struct { + WebhookURL string + Username string + AvatarURL string + Mentions DiscordMentions +} + +type DiscordMentions struct { + Error string + Critical string +} + +type DiscordProvider struct { + cfg DiscordConfig + client *http.Client + mu sync.Mutex + lastSend time.Time +} + +func NewDiscord(cfg *DiscordConfig) *DiscordProvider { + return &DiscordProvider{ + cfg: *cfg, + client: &http.Client{}, + } +} + +func (d *DiscordProvider) Name() string { return "discord" } + +func (d *DiscordProvider) Send(ctx context.Context, event *model.Event) error { + d.throttle() + + payload := d.buildPayload(event) + + body, err := json.Marshal(payload) + if err != nil { + return fmt.Errorf("marshal discord payload: %w", err) + } + + resp, err := d.doPost(ctx, body) + if err != nil { + return err + } + defer resp.Body.Close() + + if resp.StatusCode == http.StatusTooManyRequests { + retryAfter := parseRetryAfter(resp.Header.Get("Retry-After")) + _, _ = io.Copy(io.Discard, resp.Body) + resp.Body.Close() + + select { + case <-ctx.Done(): + return fmt.Errorf("discord rate limited, context canceled: %w", ctx.Err()) + case <-time.After(retryAfter): + } + + resp, err = d.doPost(ctx, body) + if err != nil { + return err + } + defer resp.Body.Close() + } + + if resp.StatusCode < 200 || resp.StatusCode >= 300 { + return fmt.Errorf("discord webhook returned status %d", resp.StatusCode) + } + + return nil +} + +func (d *DiscordProvider) throttle() { + d.mu.Lock() + defer d.mu.Unlock() + + elapsed := time.Since(d.lastSend) + if elapsed < discordMinInterval { + 
time.Sleep(discordMinInterval - elapsed) + } + d.lastSend = time.Now() +} + +func (d *DiscordProvider) doPost(ctx context.Context, body []byte) (*http.Response, error) { + req, err := http.NewRequestWithContext(ctx, http.MethodPost, d.cfg.WebhookURL, bytes.NewReader(body)) + if err != nil { + return nil, fmt.Errorf("create discord request: %w", err) + } + req.Header.Set("Content-Type", "application/json") + + resp, err := d.client.Do(req) + if err != nil { + return nil, fmt.Errorf("send discord webhook: %w", err) + } + return resp, nil +} + +func parseRetryAfter(value string) time.Duration { + if value == "" { + return time.Second + } + seconds, err := strconv.ParseFloat(value, 64) + if err != nil { + return time.Second + } + return time.Duration(seconds * float64(time.Second)) +} diff --git a/internal/notification/discord_payload.go b/internal/notification/discord_payload.go new file mode 100644 index 0000000..496f5d1 --- /dev/null +++ b/internal/notification/discord_payload.go @@ -0,0 +1,107 @@ +package notification + +import ( + "sort" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" +) + +type discordPayload struct { + Username string `json:"username,omitempty"` + AvatarURL string `json:"avatar_url,omitempty"` + Content string `json:"content,omitempty"` + Embeds []discordEmbed `json:"embeds"` +} + +type discordEmbed struct { + Title string `json:"title"` + Description string `json:"description"` + Color int `json:"color"` + Fields []discordField `json:"fields,omitempty"` + Footer *discordFooter `json:"footer,omitempty"` + Timestamp string `json:"timestamp,omitempty"` +} + +type discordField struct { + Name string `json:"name"` + Value string `json:"value"` + Inline bool `json:"inline"` +} + +type discordFooter struct { + Text string `json:"text"` +} + +func (d *DiscordProvider) buildPayload(event *model.Event) discordPayload { + fields := d.buildFields(event) + embed := discordEmbed{ + Title: event.Type, + Description: event.Message, + Color: 
colorForLevel(event.Level), + Fields: fields, + Footer: &discordFooter{Text: "ghr"}, + Timestamp: event.Timestamp.UTC().Format("2006-01-02T15:04:05Z"), + } + + payload := discordPayload{ + Username: d.cfg.Username, + AvatarURL: d.cfg.AvatarURL, + Embeds: []discordEmbed{embed}, + } + + mention := d.mentionForLevel(event.Level) + if mention != "" { + payload.Content = mention + } + + return payload +} + +func (d *DiscordProvider) buildFields(event *model.Event) []discordField { + var fields []discordField + + if event.Group != "" { + fields = append(fields, discordField{Name: "Group", Value: event.Group, Inline: true}) + } + if event.Runner != "" { + fields = append(fields, discordField{Name: "Runner", Value: event.Runner, Inline: true}) + } + + keys := make([]string, 0, len(event.Details)) + for k := range event.Details { + keys = append(keys, k) + } + sort.Strings(keys) + + for _, k := range keys { + fields = append(fields, discordField{Name: k, Value: event.Details[k], Inline: false}) + } + + return fields +} + +func (d *DiscordProvider) mentionForLevel(level model.EventLevel) string { + switch level { + case model.LevelError: + return d.cfg.Mentions.Error + case model.LevelCritical: + return d.cfg.Mentions.Critical + default: + return "" + } +} + +func colorForLevel(level model.EventLevel) int { + switch level { + case model.LevelInfo: + return 0x3498DB + case model.LevelWarning: + return 0xF39C12 + case model.LevelError: + return 0xE74C3C + case model.LevelCritical: + return 0x992D22 + default: + return 0x3498DB + } +} diff --git a/internal/notification/discord_test.go b/internal/notification/discord_test.go new file mode 100644 index 0000000..4572e6b --- /dev/null +++ b/internal/notification/discord_test.go @@ -0,0 +1,245 @@ +package notification + +import ( + "context" + "encoding/json" + "net/http" + "net/http/httptest" + "testing" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" +) + +func TestDiscordProvider_Name(t *testing.T) { + d 
:= NewDiscord(&DiscordConfig{}) + if d.Name() != "discord" { + t.Errorf("Name() = %q, want %q", d.Name(), "discord") + } +} + +func TestDiscordProvider_Send(t *testing.T) { + baseEvent := model.Event{ + Type: "health.zombie_runner", + Level: model.LevelError, + Group: "backend", + Runner: "runner-abc", + Message: "Zombie runner detected", + Details: map[string]string{"pid": "12345", "action": "killed"}, + Timestamp: time.Date(2025, 1, 15, 14, 30, 0, 0, time.UTC), + } + + t.Run("sends valid payload", func(t *testing.T) { + var received discordPayload + + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if r.Method != http.MethodPost { + t.Errorf("method = %s, want POST", r.Method) + } + if ct := r.Header.Get("Content-Type"); ct != "application/json" { + t.Errorf("Content-Type = %q, want application/json", ct) + } + if err := json.NewDecoder(r.Body).Decode(&received); err != nil { + t.Errorf("decode body: %v", err) + } + w.WriteHeader(http.StatusNoContent) + })) + defer srv.Close() + + d := NewDiscord(&DiscordConfig{ + WebhookURL: srv.URL, + Username: "ghr-test", + Mentions: DiscordMentions{Error: "<@&123>"}, + }) + + err := d.Send(context.Background(), &baseEvent) + if err != nil { + t.Fatalf("Send() error = %v", err) + } + + if received.Username != "ghr-test" { + t.Errorf("username = %q, want %q", received.Username, "ghr-test") + } + if received.Content != "<@&123>" { + t.Errorf("content = %q, want %q", received.Content, "<@&123>") + } + if len(received.Embeds) != 1 { + t.Fatalf("len(embeds) = %d, want 1", len(received.Embeds)) + } + embed := received.Embeds[0] + if embed.Title != "health.zombie_runner" { + t.Errorf("title = %q, want %q", embed.Title, "health.zombie_runner") + } + if embed.Description != "Zombie runner detected" { + t.Errorf("description = %q, want %q", embed.Description, "Zombie runner detected") + } + if embed.Color != 0xE74C3C { + t.Errorf("color = %d, want %d", embed.Color, 0xE74C3C) + } + if 
embed.Timestamp != "2025-01-15T14:30:00Z" { + t.Errorf("timestamp = %q, want %q", embed.Timestamp, "2025-01-15T14:30:00Z") + } + }) + + t.Run("includes group and runner fields", func(t *testing.T) { + var received discordPayload + + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if err := json.NewDecoder(r.Body).Decode(&received); err != nil { + t.Fatalf("decode: %v", err) + } + w.WriteHeader(http.StatusNoContent) + })) + defer srv.Close() + + d := NewDiscord(&DiscordConfig{WebhookURL: srv.URL}) + if err := d.Send(context.Background(), &baseEvent); err != nil { + t.Fatalf("Send() error = %v", err) + } + + fields := received.Embeds[0].Fields + if len(fields) < 2 { + t.Fatalf("got %d fields, want at least 2", len(fields)) + } + if fields[0].Name != "Group" || fields[0].Value != "backend" { + t.Errorf("field[0] = %v, want Group=backend", fields[0]) + } + if fields[1].Name != "Runner" || fields[1].Value != "runner-abc" { + t.Errorf("field[1] = %v, want Runner=runner-abc", fields[1]) + } + }) + + t.Run("no mention for info level", func(t *testing.T) { + var received discordPayload + + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if err := json.NewDecoder(r.Body).Decode(&received); err != nil { + t.Fatalf("decode: %v", err) + } + w.WriteHeader(http.StatusNoContent) + })) + defer srv.Close() + + d := NewDiscord(&DiscordConfig{ + WebhookURL: srv.URL, + Mentions: DiscordMentions{Error: "<@&123>", Critical: "@everyone"}, + }) + + infoEvent := baseEvent + infoEvent.Level = model.LevelInfo + + if err := d.Send(context.Background(), &infoEvent); err != nil { + t.Fatalf("Send() error = %v", err) + } + + if received.Content != "" { + t.Errorf("content = %q, want empty for info level", received.Content) + } + }) + + t.Run("critical mention", func(t *testing.T) { + var received discordPayload + + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if 
err := json.NewDecoder(r.Body).Decode(&received); err != nil { + t.Fatalf("decode: %v", err) + } + w.WriteHeader(http.StatusNoContent) + })) + defer srv.Close() + + d := NewDiscord(&DiscordConfig{ + WebhookURL: srv.URL, + Mentions: DiscordMentions{Critical: "@everyone"}, + }) + + critEvent := baseEvent + critEvent.Level = model.LevelCritical + + if err := d.Send(context.Background(), &critEvent); err != nil { + t.Fatalf("Send() error = %v", err) + } + + if received.Content != "@everyone" { + t.Errorf("content = %q, want @everyone", received.Content) + } + }) + + t.Run("rate limit returns error", func(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.WriteHeader(http.StatusTooManyRequests) + })) + defer srv.Close() + + d := NewDiscord(&DiscordConfig{WebhookURL: srv.URL}) + err := d.Send(context.Background(), &baseEvent) + if err == nil { + t.Fatal("expected error for 429") + } + }) + + t.Run("non-2xx returns error", func(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.WriteHeader(http.StatusInternalServerError) + })) + defer srv.Close() + + d := NewDiscord(&DiscordConfig{WebhookURL: srv.URL}) + err := d.Send(context.Background(), &baseEvent) + if err == nil { + t.Fatal("expected error for 500") + } + }) + + t.Run("empty group and runner omits those fields", func(t *testing.T) { + var received discordPayload + + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if err := json.NewDecoder(r.Body).Decode(&received); err != nil { + t.Fatalf("decode: %v", err) + } + w.WriteHeader(http.StatusNoContent) + })) + defer srv.Close() + + d := NewDiscord(&DiscordConfig{WebhookURL: srv.URL}) + evt := model.Event{ + Type: "daemon.start", + Level: model.LevelInfo, + Message: "started", + Timestamp: time.Now(), + } + + if err := d.Send(context.Background(), &evt); err != nil { + t.Fatalf("Send() error = %v", err) + } + 
+ for _, f := range received.Embeds[0].Fields { + if f.Name == "Group" || f.Name == "Runner" { + t.Errorf("unexpected field %q for empty group/runner", f.Name) + } + } + }) +} + +func TestColorForLevel(t *testing.T) { + tests := []struct { + level model.EventLevel + want int + }{ + {model.LevelInfo, 0x3498DB}, + {model.LevelWarning, 0xF39C12}, + {model.LevelError, 0xE74C3C}, + {model.LevelCritical, 0x992D22}, + {model.EventLevel("unknown"), 0x3498DB}, + } + + for _, tt := range tests { + t.Run(string(tt.level), func(t *testing.T) { + got := colorForLevel(tt.level) + if got != tt.want { + t.Errorf("colorForLevel(%q) = %d, want %d", tt.level, got, tt.want) + } + }) + } +} diff --git a/internal/notification/filter.go b/internal/notification/filter.go new file mode 100644 index 0000000..642781f --- /dev/null +++ b/internal/notification/filter.go @@ -0,0 +1,33 @@ +package notification + +import "strings" + +type EventFilter struct { + Patterns []string +} + +func (f EventFilter) Matches(eventType, level string) bool { + if len(f.Patterns) == 0 { + return true + } + + for _, p := range f.Patterns { + if matchesPattern(p, eventType, level) { + return true + } + } + return false +} + +func matchesPattern(pattern, eventType, level string) bool { + if strings.HasPrefix(pattern, "*:") { + return strings.EqualFold(pattern[2:], level) + } + + if strings.HasSuffix(pattern, ".*") { + prefix := pattern[:len(pattern)-2] + return strings.HasPrefix(eventType, prefix+".") + } + + return pattern == eventType +} diff --git a/internal/notification/filter_test.go b/internal/notification/filter_test.go new file mode 100644 index 0000000..489e89f --- /dev/null +++ b/internal/notification/filter_test.go @@ -0,0 +1,115 @@ +package notification + +import "testing" + +func TestEventFilter_Matches(t *testing.T) { + tests := []struct { + name string + patterns []string + eventType string + level string + want bool + }{ + { + name: "empty patterns matches everything", + patterns: nil, + eventType: 
"daemon.start", + level: "info", + want: true, + }, + { + name: "exact match", + patterns: []string{"daemon.start"}, + eventType: "daemon.start", + level: "info", + want: true, + }, + { + name: "exact match no match", + patterns: []string{"daemon.stop"}, + eventType: "daemon.start", + level: "info", + want: false, + }, + { + name: "wildcard matches prefix", + patterns: []string{"health.*"}, + eventType: "health.zombie_runner", + level: "error", + want: true, + }, + { + name: "wildcard does not match different prefix", + patterns: []string{"health.*"}, + eventType: "daemon.start", + level: "info", + want: false, + }, + { + name: "wildcard does not match partial prefix", + patterns: []string{"health.*"}, + eventType: "healthcheck.run", + level: "info", + want: false, + }, + { + name: "level filter matches", + patterns: []string{"*:error"}, + eventType: "health.zombie_runner", + level: "error", + want: true, + }, + { + name: "level filter does not match different level", + patterns: []string{"*:error"}, + eventType: "daemon.start", + level: "info", + want: false, + }, + { + name: "level filter case insensitive", + patterns: []string{"*:Error"}, + eventType: "anything", + level: "error", + want: true, + }, + { + name: "multiple patterns any match succeeds", + patterns: []string{"daemon.start", "health.*", "*:critical"}, + eventType: "health.disk_low", + level: "warning", + want: true, + }, + { + name: "multiple patterns none match", + patterns: []string{"daemon.start", "runner.failed"}, + eventType: "health.zombie_runner", + level: "error", + want: false, + }, + { + name: "empty patterns list explicit", + patterns: []string{}, + eventType: "daemon.start", + level: "info", + want: true, + }, + { + name: "wildcard matches exact prefix dot event", + patterns: []string{"runner.*"}, + eventType: "runner.started", + level: "info", + want: true, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + f := EventFilter{Patterns: tt.patterns} + got := 
f.Matches(tt.eventType, tt.level) + if got != tt.want { + t.Errorf("Matches(%q, %q) = %v, want %v", tt.eventType, tt.level, got, tt.want) + } + }) + } +} diff --git a/internal/notification/service.go b/internal/notification/service.go new file mode 100644 index 0000000..40e4f33 --- /dev/null +++ b/internal/notification/service.go @@ -0,0 +1,54 @@ +package notification + +import ( + "context" + "log/slog" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" +) + +type Provider interface { + Name() string + Send(ctx context.Context, event *model.Event) error +} + +type Service struct { + logger *slog.Logger + providers []providerEntry +} + +type providerEntry struct { + provider Provider + filter EventFilter +} + +func New(providers []Provider, filters map[string]EventFilter, logger *slog.Logger) *Service { + entries := make([]providerEntry, 0, len(providers)) + for _, p := range providers { + f := filters[p.Name()] + entries = append(entries, providerEntry{ + provider: p, + filter: f, + }) + } + return &Service{ + logger: logger, + providers: entries, + } +} + +func (s *Service) Notify(ctx context.Context, event *model.Event) { + for _, entry := range s.providers { + if !entry.filter.Matches(event.Type, string(event.Level)) { + continue + } + + if err := entry.provider.Send(ctx, event); err != nil { + s.logger.Warn("notification send failed", + "provider", entry.provider.Name(), + "event", event.Type, + "error", err, + ) + } + } +} diff --git a/internal/notification/service_test.go b/internal/notification/service_test.go new file mode 100644 index 0000000..1c86a84 --- /dev/null +++ b/internal/notification/service_test.go @@ -0,0 +1,142 @@ +package notification + +import ( + "context" + "errors" + "log/slog" + "sync" + "testing" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" +) + +type fakeProvider struct { + name string + mu sync.Mutex + events []model.Event + err error +} + +func (f *fakeProvider) Name() string { return f.name } + 
+func (f *fakeProvider) Send(_ context.Context, event *model.Event) error { + f.mu.Lock() + defer f.mu.Unlock() + f.events = append(f.events, *event) + return f.err +} + +func (f *fakeProvider) received() []model.Event { + f.mu.Lock() + defer f.mu.Unlock() + cp := make([]model.Event, len(f.events)) + copy(cp, f.events) + return cp +} + +func TestService_Notify(t *testing.T) { + event := model.Event{ + Type: "daemon.start", + Level: model.LevelInfo, + Message: "started", + Timestamp: time.Now(), + } + + t.Run("sends to all matching providers", func(t *testing.T) { + p1 := &fakeProvider{name: "p1"} + p2 := &fakeProvider{name: "p2"} + + svc := New( + []Provider{p1, p2}, + map[string]EventFilter{}, + slog.Default(), + ) + + svc.Notify(context.Background(), &event) + + if len(p1.received()) != 1 { + t.Errorf("p1 got %d events, want 1", len(p1.received())) + } + if len(p2.received()) != 1 { + t.Errorf("p2 got %d events, want 1", len(p2.received())) + } + }) + + t.Run("filters events per provider", func(t *testing.T) { + p1 := &fakeProvider{name: "p1"} + p2 := &fakeProvider{name: "p2"} + + svc := New( + []Provider{p1, p2}, + map[string]EventFilter{ + "p1": {Patterns: []string{"daemon.*"}}, + "p2": {Patterns: []string{"health.*"}}, + }, + slog.Default(), + ) + + svc.Notify(context.Background(), &event) + + if len(p1.received()) != 1 { + t.Errorf("p1 got %d events, want 1", len(p1.received())) + } + if len(p2.received()) != 0 { + t.Errorf("p2 got %d events, want 0", len(p2.received())) + } + }) + + t.Run("no filter means all events", func(t *testing.T) { + p := &fakeProvider{name: "p1"} + + svc := New( + []Provider{p}, + map[string]EventFilter{}, + slog.Default(), + ) + + svc.Notify(context.Background(), &event) + + if len(p.received()) != 1 { + t.Errorf("got %d events, want 1", len(p.received())) + } + }) + + t.Run("provider error is logged not propagated", func(t *testing.T) { + p := &fakeProvider{name: "failing", err: errors.New("connection refused")} + + svc := New( + 
[]Provider{p}, + map[string]EventFilter{}, + slog.Default(), + ) + + svc.Notify(context.Background(), &event) + + if len(p.received()) != 1 { + t.Errorf("got %d events, want 1", len(p.received())) + } + }) + + t.Run("continues to next provider after error", func(t *testing.T) { + p1 := &fakeProvider{name: "fail", err: errors.New("boom")} + p2 := &fakeProvider{name: "ok"} + + svc := New( + []Provider{p1, p2}, + map[string]EventFilter{}, + slog.Default(), + ) + + svc.Notify(context.Background(), &event) + + if len(p2.received()) != 1 { + t.Errorf("p2 got %d events, want 1", len(p2.received())) + } + }) + + t.Run("no providers does not panic", func(t *testing.T) { + svc := New(nil, nil, slog.Default()) + svc.Notify(context.Background(), &event) + }) +} diff --git a/internal/notification/webhook.go b/internal/notification/webhook.go new file mode 100644 index 0000000..bf7a735 --- /dev/null +++ b/internal/notification/webhook.go @@ -0,0 +1,68 @@ +package notification + +import ( + "bytes" + "context" + "encoding/json" + "fmt" + "net/http" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" +) + +type WebhookConfig struct { + URL string + Method string + Headers map[string]string +} + +type WebhookProvider struct { + cfg WebhookConfig + client *http.Client +} + +func NewWebhook(cfg WebhookConfig) *WebhookProvider { + method := cfg.Method + if method == "" { + method = http.MethodPost + } + return &WebhookProvider{ + cfg: WebhookConfig{ + URL: cfg.URL, + Method: method, + Headers: cfg.Headers, + }, + client: &http.Client{}, + } +} + +func (w *WebhookProvider) Name() string { return "webhook" } + +func (w *WebhookProvider) Send(ctx context.Context, event *model.Event) error { + body, err := json.Marshal(event) + if err != nil { + return fmt.Errorf("marshal webhook payload: %w", err) + } + + req, err := http.NewRequestWithContext(ctx, w.cfg.Method, w.cfg.URL, bytes.NewReader(body)) + if err != nil { + return fmt.Errorf("create webhook request: %w", err) + } + + 
for k, v := range w.cfg.Headers { + req.Header.Set(k, v) + } + req.Header.Set("Content-Type", "application/json") + + resp, err := w.client.Do(req) + if err != nil { + return fmt.Errorf("send webhook: %w", err) + } + defer resp.Body.Close() + + if resp.StatusCode < 200 || resp.StatusCode >= 300 { + return fmt.Errorf("webhook returned status %d", resp.StatusCode) + } + + return nil +} diff --git a/internal/notification/webhook_test.go b/internal/notification/webhook_test.go new file mode 100644 index 0000000..491f8e3 --- /dev/null +++ b/internal/notification/webhook_test.go @@ -0,0 +1,144 @@ +package notification + +import ( + "context" + "encoding/json" + "net/http" + "net/http/httptest" + "testing" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" +) + +func TestWebhookProvider_Name(t *testing.T) { + w := NewWebhook(WebhookConfig{}) + if w.Name() != "webhook" { + t.Errorf("Name() = %q, want %q", w.Name(), "webhook") + } +} + +func TestWebhookProvider_Send(t *testing.T) { + baseEvent := model.Event{ + Type: "runner.started", + Level: model.LevelInfo, + Group: "ci", + Runner: "runner-x1", + Message: "Runner started", + Details: map[string]string{"version": "2.320"}, + Timestamp: time.Date(2025, 3, 10, 8, 0, 0, 0, time.UTC), + } + + t.Run("sends JSON payload with POST", func(t *testing.T) { + var received model.Event + + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if r.Method != http.MethodPost { + t.Errorf("method = %s, want POST", r.Method) + } + if ct := r.Header.Get("Content-Type"); ct != "application/json" { + t.Errorf("Content-Type = %q, want application/json", ct) + } + if err := json.NewDecoder(r.Body).Decode(&received); err != nil { + t.Fatalf("decode body: %v", err) + } + w.WriteHeader(http.StatusOK) + })) + defer srv.Close() + + wp := NewWebhook(WebhookConfig{URL: srv.URL}) + err := wp.Send(context.Background(), &baseEvent) + if err != nil { + t.Fatalf("Send() error = %v", err) + } + + 
if received.Type != "runner.started" { + t.Errorf("type = %q, want %q", received.Type, "runner.started") + } + if received.Message != "Runner started" { + t.Errorf("message = %q, want %q", received.Message, "Runner started") + } + }) + + t.Run("uses configured method", func(t *testing.T) { + var gotMethod string + + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + gotMethod = r.Method + w.WriteHeader(http.StatusOK) + })) + defer srv.Close() + + wp := NewWebhook(WebhookConfig{URL: srv.URL, Method: http.MethodPut}) + if err := wp.Send(context.Background(), &baseEvent); err != nil { + t.Fatalf("Send() error = %v", err) + } + + if gotMethod != http.MethodPut { + t.Errorf("method = %s, want PUT", gotMethod) + } + }) + + t.Run("sets configured headers", func(t *testing.T) { + var gotAuth string + + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + gotAuth = r.Header.Get("Authorization") + w.WriteHeader(http.StatusOK) + })) + defer srv.Close() + + wp := NewWebhook(WebhookConfig{ + URL: srv.URL, + Headers: map[string]string{"Authorization": "Bearer tok123"}, + }) + + if err := wp.Send(context.Background(), &baseEvent); err != nil { + t.Fatalf("Send() error = %v", err) + } + + if gotAuth != "Bearer tok123" { + t.Errorf("Authorization = %q, want %q", gotAuth, "Bearer tok123") + } + }) + + t.Run("non-2xx returns error", func(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.WriteHeader(http.StatusForbidden) + })) + defer srv.Close() + + wp := NewWebhook(WebhookConfig{URL: srv.URL}) + err := wp.Send(context.Background(), &baseEvent) + if err == nil { + t.Fatal("expected error for 403") + } + }) + + t.Run("defaults to POST when method empty", func(t *testing.T) { + var gotMethod string + + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + gotMethod = r.Method + w.WriteHeader(http.StatusOK) + 
})) + defer srv.Close() + + wp := NewWebhook(WebhookConfig{URL: srv.URL, Method: ""}) + if err := wp.Send(context.Background(), &baseEvent); err != nil { + t.Fatalf("Send() error = %v", err) + } + + if gotMethod != http.MethodPost { + t.Errorf("method = %s, want POST", gotMethod) + } + }) + + t.Run("connection error returns wrapped error", func(t *testing.T) { + wp := NewWebhook(WebhookConfig{URL: "http://127.0.0.1:1"}) + err := wp.Send(context.Background(), &baseEvent) + if err == nil { + t.Fatal("expected error for unreachable host") + } + }) +} diff --git a/internal/runner/binary.go b/internal/runner/binary.go new file mode 100644 index 0000000..983b304 --- /dev/null +++ b/internal/runner/binary.go @@ -0,0 +1,102 @@ +package runner + +import ( + "context" + "encoding/json" + "fmt" + "log/slog" + "net/http" + "os" + "path/filepath" + "runtime" + "strings" +) + +type BinaryManager struct { + cacheDir string + logger *slog.Logger + httpClient *http.Client +} + +func NewBinaryManager(cacheDir string, logger *slog.Logger) *BinaryManager { + return &BinaryManager{ + cacheDir: cacheDir, + logger: logger, + httpClient: &http.Client{}, + } +} + +func (m *BinaryManager) EnsureBits(ctx context.Context, version string) (string, error) { + resolved := version + if resolved == "latest" { + v, err := m.resolveLatestVersion(ctx) + if err != nil { + return "", fmt.Errorf("resolve latest runner version: %w", err) + } + resolved = v + m.logger.InfoContext(ctx, "resolved latest runner version", "version", resolved) + } + + destDir := filepath.Join(m.cacheDir, resolved) + runShPath := filepath.Join(destDir, "run.sh") + + if _, err := os.Stat(runShPath); err == nil { + m.logger.DebugContext(ctx, "runner binary cached", "version", resolved, "path", destDir) + return destDir, nil + } + + m.logger.InfoContext(ctx, "downloading runner binary", "version", resolved) + + if err := os.MkdirAll(destDir, 0o755); err != nil { + return "", fmt.Errorf("create cache dir %s: %w", destDir, err) + } 
+ + if err := downloadAndExtract(ctx, m.httpClient, resolved, destDir); err != nil { + rmErr := os.RemoveAll(destDir) + if rmErr != nil { + m.logger.WarnContext(ctx, "failed to clean partial download", "path", destDir, "error", rmErr) + } + return "", fmt.Errorf("download runner %s: %w", resolved, err) + } + + m.logger.InfoContext(ctx, "runner binary ready", "version", resolved, "path", destDir) + return destDir, nil +} + +func (m *BinaryManager) resolveLatestVersion(ctx context.Context) (string, error) { + req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://api.github.com/repos/actions/runner/releases/latest", http.NoBody) + if err != nil { + return "", fmt.Errorf("create request: %w", err) + } + req.Header.Set("Accept", "application/vnd.github+json") + + resp, err := m.httpClient.Do(req) + if err != nil { + return "", fmt.Errorf("fetch latest release: %w", err) + } + defer resp.Body.Close() + + if resp.StatusCode != http.StatusOK { + return "", fmt.Errorf("github releases API returned %d", resp.StatusCode) + } + + var release struct { + TagName string `json:"tag_name"` + } + if err := json.NewDecoder(resp.Body).Decode(&release); err != nil { + return "", fmt.Errorf("decode release response: %w", err) + } + + if release.TagName == "" { + return "", fmt.Errorf("empty tag_name in release response") + } + + return strings.TrimPrefix(release.TagName, "v"), nil +} + +func runnerArch() string { + if runtime.GOARCH == "arm64" { + return "arm64" + } + return "x64" +} diff --git a/internal/runner/binary_test.go b/internal/runner/binary_test.go new file mode 100644 index 0000000..c651851 --- /dev/null +++ b/internal/runner/binary_test.go @@ -0,0 +1,272 @@ +package runner + +import ( + "archive/tar" + "compress/gzip" + "context" + "encoding/json" + "log/slog" + "net/http" + "net/http/httptest" + "os" + "path/filepath" + "runtime" + "testing" +) + +func silentLogger() *slog.Logger { + return slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: 
slog.LevelError + 1})) +} + +func createFakeTarGz(t *testing.T) string { + t.Helper() + + tmpFile := filepath.Join(t.TempDir(), "runner.tar.gz") + f, err := os.Create(tmpFile) + if err != nil { + t.Fatalf("create tar.gz file: %v", err) + } + defer f.Close() + + gw := gzip.NewWriter(f) + tw := tar.NewWriter(gw) + + content := []byte("#!/bin/bash\necho hello\n") + if err := tw.WriteHeader(&tar.Header{ + Name: "run.sh", + Mode: 0o755, + Size: int64(len(content)), + Typeflag: tar.TypeReg, + }); err != nil { + t.Fatalf("write tar header: %v", err) + } + if _, err := tw.Write(content); err != nil { + t.Fatalf("write tar content: %v", err) + } + + subContent := []byte("config data\n") + if err := tw.WriteHeader(&tar.Header{ + Name: "config.sh", + Mode: 0o755, + Size: int64(len(subContent)), + Typeflag: tar.TypeReg, + }); err != nil { + t.Fatalf("write tar header: %v", err) + } + if _, err := tw.Write(subContent); err != nil { + t.Fatalf("write tar content: %v", err) + } + + if err := tw.Close(); err != nil { + t.Fatalf("close tar writer: %v", err) + } + if err := gw.Close(); err != nil { + t.Fatalf("close gzip writer: %v", err) + } + + return tmpFile +} + +func TestEnsureBits_Cached(t *testing.T) { + cacheDir := t.TempDir() + version := "2.320.0" + + versionDir := filepath.Join(cacheDir, version) + if err := os.MkdirAll(versionDir, 0o755); err != nil { + t.Fatalf("create version dir: %v", err) + } + runSh := filepath.Join(versionDir, "run.sh") + if err := os.WriteFile(runSh, []byte("#!/bin/bash\n"), 0o755); err != nil { + t.Fatalf("write run.sh: %v", err) + } + + bm := NewBinaryManager(cacheDir, silentLogger()) + + got, err := bm.EnsureBits(context.Background(), version) + if err != nil { + t.Fatalf("EnsureBits: %v", err) + } + + if got != versionDir { + t.Fatalf("expected path %q, got %q", versionDir, got) + } +} + +func TestEnsureBits_Download(t *testing.T) { + tarGzPath := createFakeTarGz(t) + tarGzData, err := os.ReadFile(tarGzPath) + if err != nil { + t.Fatalf("read 
tar.gz: %v", err) + } + + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.Header().Set("Content-Type", "application/gzip") + if _, writeErr := w.Write(tarGzData); writeErr != nil { + t.Errorf("write response: %v", writeErr) + } + })) + defer srv.Close() + + cacheDir := t.TempDir() + version := "2.320.0" + versionDir := filepath.Join(cacheDir, version) + + if err := os.MkdirAll(versionDir, 0o755); err != nil { + t.Fatalf("create version dir: %v", err) + } + + req, err := http.NewRequestWithContext(context.Background(), http.MethodGet, srv.URL+"/runner.tar.gz", nil) + if err != nil { + t.Fatalf("create request: %v", err) + } + resp, err := srv.Client().Do(req) + if err != nil { + t.Fatalf("do request: %v", err) + } + defer resp.Body.Close() + + if err := extractTarGz(resp.Body, versionDir); err != nil { + t.Fatalf("extract tar.gz: %v", err) + } + + runSh := filepath.Join(versionDir, "run.sh") + if _, statErr := os.Stat(runSh); statErr != nil { + t.Fatalf("run.sh not found after extraction: %v", statErr) + } + + configSh := filepath.Join(versionDir, "config.sh") + if _, statErr := os.Stat(configSh); statErr != nil { + t.Fatalf("config.sh not found after extraction: %v", statErr) + } + + bm := NewBinaryManager(cacheDir, silentLogger()) + got, err := bm.EnsureBits(context.Background(), version) + if err != nil { + t.Fatalf("EnsureBits on extracted dir: %v", err) + } + if got != versionDir { + t.Fatalf("expected path %q, got %q", versionDir, got) + } +} + +func TestRunnerArch(t *testing.T) { + got := runnerArch() + switch runtime.GOARCH { + case "arm64": + if got != "arm64" { + t.Fatalf("expected arm64, got %q", got) + } + default: + if got != "x64" { + t.Fatalf("expected x64, got %q", got) + } + } +} + +func TestResolveLatestVersion(t *testing.T) { + tests := []struct { + name string + response any + statusCode int + wantVer string + wantErr bool + }{ + { + name: "valid release with v prefix", + response: 
map[string]string{"tag_name": "v2.320.0"}, + statusCode: http.StatusOK, + wantVer: "2.320.0", + }, + { + name: "valid release without v prefix", + response: map[string]string{"tag_name": "2.321.0"}, + statusCode: http.StatusOK, + wantVer: "2.321.0", + }, + { + name: "empty tag_name", + response: map[string]string{"tag_name": ""}, + statusCode: http.StatusOK, + wantErr: true, + }, + { + name: "api error", + response: nil, + statusCode: http.StatusInternalServerError, + wantErr: true, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.WriteHeader(tt.statusCode) + if tt.response != nil { + data, jsonErr := json.Marshal(tt.response) + if jsonErr != nil { + t.Fatalf("marshal response: %v", jsonErr) + } + if _, writeErr := w.Write(data); writeErr != nil { + t.Errorf("write response: %v", writeErr) + } + } + })) + defer srv.Close() + + bm := &BinaryManager{ + cacheDir: t.TempDir(), + logger: silentLogger(), + httpClient: srv.Client(), + } + + ctx := context.Background() + req, err := http.NewRequestWithContext(ctx, http.MethodGet, srv.URL, nil) + if err != nil { + t.Fatalf("create request: %v", err) + } + req.Header.Set("Accept", "application/vnd.github+json") + + resp, err := bm.httpClient.Do(req) + if err != nil { + t.Fatalf("do request: %v", err) + } + defer resp.Body.Close() + + if resp.StatusCode != http.StatusOK { + if !tt.wantErr { + t.Fatalf("unexpected status %d", resp.StatusCode) + } + return + } + + var release struct { + TagName string `json:"tag_name"` + } + if decodeErr := json.NewDecoder(resp.Body).Decode(&release); decodeErr != nil { + if !tt.wantErr { + t.Fatalf("decode: %v", decodeErr) + } + return + } + + if release.TagName == "" { + if !tt.wantErr { + t.Fatal("empty tag_name") + } + return + } + + got := release.TagName + if len(got) > 0 && got[0] == 'v' { + got = got[1:] + } + + if tt.wantErr { + t.Fatal("expected error but got 
none") + } + if got != tt.wantVer { + t.Fatalf("expected version %q, got %q", tt.wantVer, got) + } + }) + } +} diff --git a/internal/runner/cleanup.go b/internal/runner/cleanup.go new file mode 100644 index 0000000..8135a5a --- /dev/null +++ b/internal/runner/cleanup.go @@ -0,0 +1,109 @@ +package runner + +import ( + "context" + "fmt" + "os" + "os/exec" + "path/filepath" + "strconv" + "strings" + "syscall" +) + +func (m *ProcessManager) CleanupStale(ctx context.Context) error { + entries, err := os.ReadDir(m.workdirBase) + if err != nil { + if os.IsNotExist(err) { + return nil + } + return fmt.Errorf("read workdir base %s: %w", m.workdirBase, err) + } + + for _, groupEntry := range entries { + if !groupEntry.IsDir() { + continue + } + if err := m.cleanupStaleGroup(ctx, groupEntry.Name()); err != nil { + m.logger.WarnContext(ctx, "failed to cleanup stale group", "group", groupEntry.Name(), "error", err) + } + } + + return nil +} + +func (m *ProcessManager) cleanupStaleGroup(ctx context.Context, group string) error { + groupDir := filepath.Join(m.workdirBase, group) + entries, err := os.ReadDir(groupDir) + if err != nil { + return fmt.Errorf("read group dir %s: %w", groupDir, err) + } + + for _, runnerEntry := range entries { + if !runnerEntry.IsDir() { + continue + } + m.cleanupStaleRunner(ctx, group, runnerEntry.Name()) + } + + return nil +} + +func (m *ProcessManager) cleanupStaleRunner(ctx context.Context, group, runner string) { + runnerDir := filepath.Join(m.workdirBase, group, runner) + pidFile := filepath.Join(runnerDir, ".ghr-pid") + + pidBytes, err := os.ReadFile(pidFile) + if err != nil { + m.logger.DebugContext(ctx, "no PID file found, removing stale workdir", "dir", runnerDir) + removeErr := os.RemoveAll(runnerDir) + if removeErr != nil { + m.logger.WarnContext(ctx, "failed to remove stale workdir", "dir", runnerDir, "error", removeErr) + } + return + } + + pid, err := strconv.Atoi(strings.TrimSpace(string(pidBytes))) + if err != nil { + 
m.logger.WarnContext(ctx, "invalid PID file content, removing workdir", "dir", runnerDir, "error", err) + removeErr := os.RemoveAll(runnerDir) + if removeErr != nil { + m.logger.WarnContext(ctx, "failed to remove stale workdir", "dir", runnerDir, "error", removeErr) + } + return + } + + if isProcessAlive(pid) { + m.logger.WarnContext(ctx, "killing stale runner process", "pid", pid, "runner", runner, "group", group) + killErr := syscall.Kill(pid, syscall.SIGKILL) + if killErr != nil { + m.logger.WarnContext(ctx, "failed to kill stale process", "pid", pid, "error", killErr) + } + } + + removeErr := os.RemoveAll(runnerDir) + if removeErr != nil { + m.logger.WarnContext(ctx, "failed to remove stale workdir", "dir", runnerDir, "error", removeErr) + } else { + m.logger.InfoContext(ctx, "cleaned up stale runner", "runner", runner, "group", group, "pid", pid) + } +} + +func (m *ProcessManager) KillOrphanRunners(ctx context.Context) { + out, err := exec.CommandContext(ctx, "pgrep", "-f", m.workdirBase).Output() + if err != nil { + return + } + for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") { + pid, err := strconv.Atoi(strings.TrimSpace(line)) + if err != nil || pid <= 0 { + continue + } + m.logger.WarnContext(ctx, "killing orphan runner process", "pid", pid) + _ = syscall.Kill(pid, syscall.SIGKILL) + } +} + +func isProcessAlive(pid int) bool { + return syscall.Kill(pid, 0) == nil +} diff --git a/internal/runner/cleanup_test.go b/internal/runner/cleanup_test.go new file mode 100644 index 0000000..b1cf252 --- /dev/null +++ b/internal/runner/cleanup_test.go @@ -0,0 +1,67 @@ +package runner + +import ( + "context" + "os" + "path/filepath" + "testing" +) + +func TestCleanupStale_DeadProcess(t *testing.T) { + workdirBase := t.TempDir() + + groupDir := filepath.Join(workdirBase, "group-a") + runnerDir := filepath.Join(groupDir, "runner-1") + if err := os.MkdirAll(runnerDir, 0o755); err != nil { + t.Fatalf("create runner dir: %v", err) + } + + pidFile := 
filepath.Join(runnerDir, ".ghr-pid") + if err := os.WriteFile(pidFile, []byte("9999999"), 0o644); err != nil { + t.Fatalf("write PID file: %v", err) + } + + pm := NewProcessManager(workdirBase, silentLogger()) + if err := pm.CleanupStale(context.Background()); err != nil { + t.Fatalf("CleanupStale: %v", err) + } + + if _, err := os.Stat(runnerDir); !os.IsNotExist(err) { + t.Fatalf("expected runner dir to be removed, stat returned: %v", err) + } +} + +func TestCleanupStale_EmptyDir(t *testing.T) { + workdirBase := t.TempDir() + + pm := NewProcessManager(workdirBase, silentLogger()) + if err := pm.CleanupStale(context.Background()); err != nil { + t.Fatalf("CleanupStale on empty dir: %v", err) + } +} + +func TestCleanupStale_NonexistentDir(t *testing.T) { + pm := NewProcessManager("/nonexistent/path/that/does/not/exist", silentLogger()) + if err := pm.CleanupStale(context.Background()); err != nil { + t.Fatalf("CleanupStale on nonexistent dir: %v", err) + } +} + +func TestCleanupStale_NoPidFile(t *testing.T) { + workdirBase := t.TempDir() + + groupDir := filepath.Join(workdirBase, "group-b") + runnerDir := filepath.Join(groupDir, "runner-orphan") + if err := os.MkdirAll(runnerDir, 0o755); err != nil { + t.Fatalf("create runner dir: %v", err) + } + + pm := NewProcessManager(workdirBase, silentLogger()) + if err := pm.CleanupStale(context.Background()); err != nil { + t.Fatalf("CleanupStale: %v", err) + } + + if _, err := os.Stat(runnerDir); !os.IsNotExist(err) { + t.Fatalf("expected runner dir without PID file to be removed, stat returned: %v", err) + } +} diff --git a/internal/runner/copy.go b/internal/runner/copy.go new file mode 100644 index 0000000..cef7538 --- /dev/null +++ b/internal/runner/copy.go @@ -0,0 +1,57 @@ +package runner + +import ( + "fmt" + "io" + "os" + "path/filepath" +) + +func copyDir(src, dst string) error { + return filepath.Walk(src, func(path string, info os.FileInfo, err error) error { + if err != nil { + return fmt.Errorf("walk source %s: 
%w", path, err) + } + + relPath, err := filepath.Rel(src, path) + if err != nil { + return fmt.Errorf("compute relative path for %s: %w", path, err) + } + + targetPath := filepath.Join(dst, relPath) + + if info.IsDir() { + return os.MkdirAll(targetPath, info.Mode()) + } + + if info.Mode()&os.ModeSymlink != 0 { + link, err := os.Readlink(path) + if err != nil { + return fmt.Errorf("read symlink %s: %w", path, err) + } + return os.Symlink(link, targetPath) + } + + return copyFile(path, targetPath, info.Mode()) + }) +} + +func copyFile(src, dst string, mode os.FileMode) error { + in, err := os.Open(src) + if err != nil { + return fmt.Errorf("open source %s: %w", src, err) + } + defer in.Close() + + out, err := os.OpenFile(dst, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, mode) + if err != nil { + return fmt.Errorf("create dest %s: %w", dst, err) + } + defer out.Close() + + if _, err := io.Copy(out, in); err != nil { + return fmt.Errorf("copy %s to %s: %w", src, dst, err) + } + + return nil +} diff --git a/internal/runner/download.go b/internal/runner/download.go new file mode 100644 index 0000000..0e66ba9 --- /dev/null +++ b/internal/runner/download.go @@ -0,0 +1,109 @@ +package runner + +import ( + "archive/tar" + "compress/gzip" + "context" + "errors" + "fmt" + "io" + "net/http" + "os" + "path/filepath" + "strings" +) + +const downloadURLTemplate = "https://github.com/actions/runner/releases/download/v%s/actions-runner-osx-%s-%s.tar.gz" + +func downloadAndExtract(ctx context.Context, client *http.Client, version, destDir string) error { + url := fmt.Sprintf(downloadURLTemplate, version, runnerArch(), version) + + req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, http.NoBody) + if err != nil { + return fmt.Errorf("create download request: %w", err) + } + + resp, err := client.Do(req) + if err != nil { + return fmt.Errorf("download tarball: %w", err) + } + defer resp.Body.Close() + + if resp.StatusCode != http.StatusOK { + return fmt.Errorf("download returned 
HTTP %d for %s", resp.StatusCode, url) + } + + return extractTarGz(resp.Body, destDir) +} + +func extractTarGz(r io.Reader, destDir string) error { + gz, err := gzip.NewReader(r) + if err != nil { + return fmt.Errorf("open gzip reader: %w", err) + } + defer gz.Close() + + tr := tar.NewReader(gz) + for { + header, err := tr.Next() + if errors.Is(err, io.EOF) { + break + } + if err != nil { + return fmt.Errorf("read tar entry: %w", err) + } + + target, err := sanitizeTarPath(destDir, header.Name) + if err != nil { + return err + } + + switch header.Typeflag { + case tar.TypeDir: + if err := os.MkdirAll(target, os.FileMode(header.Mode)); err != nil { + return fmt.Errorf("create directory %s: %w", target, err) + } + case tar.TypeReg: + if err := extractFile(tr, target, os.FileMode(header.Mode)); err != nil { + return err + } + case tar.TypeSymlink: + linkTarget, linkErr := sanitizeTarPath(destDir, header.Linkname) + if linkErr != nil { + linkTarget = header.Linkname + } + if err := os.Symlink(linkTarget, target); err != nil { + return fmt.Errorf("create symlink %s: %w", target, err) + } + } + } + + return nil +} + +func extractFile(r io.Reader, path string, mode os.FileMode) error { + dir := filepath.Dir(path) + if err := os.MkdirAll(dir, 0o755); err != nil { + return fmt.Errorf("create parent dir for %s: %w", path, err) + } + + f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, mode) + if err != nil { + return fmt.Errorf("create file %s: %w", path, err) + } + defer f.Close() + + if _, err := io.Copy(f, r); err != nil { + return fmt.Errorf("write file %s: %w", path, err) + } + + return nil +} + +func sanitizeTarPath(destDir, name string) (string, error) { + target := filepath.Join(destDir, filepath.Clean(name)) + if !strings.HasPrefix(target, filepath.Clean(destDir)+string(os.PathSeparator)) && target != filepath.Clean(destDir) { + return "", fmt.Errorf("tar entry %q escapes destination directory", name) + } + return target, nil +} diff --git 
a/internal/runner/process.go b/internal/runner/process.go new file mode 100644 index 0000000..60b6451 --- /dev/null +++ b/internal/runner/process.go @@ -0,0 +1,137 @@ +package runner + +import ( + "context" + "errors" + "fmt" + "io" + "log/slog" + "os" + "os/exec" + "path/filepath" + "strconv" + "syscall" + "time" + + "github.com/RedBoardDev/gh-runners-tool/v2/internal/model" +) + +const stopGracePeriod = 10 * time.Second + +type Process struct { + Name string + Group string + WorkDir string + PID int + StartedAt time.Time + Cmd *exec.Cmd +} + +type ProcessManager struct { + workdirBase string + logger *slog.Logger +} + +func NewProcessManager(workdirBase string, logger *slog.Logger) *ProcessManager { + return &ProcessManager{ + workdirBase: workdirBase, + logger: logger, + } +} + +func (m *ProcessManager) Prepare(ctx context.Context, instance *model.RunnerInstance, cachedDir string) (string, error) { + workdir := filepath.Join(m.workdirBase, instance.Group, instance.Name) + + if err := os.MkdirAll(workdir, 0o755); err != nil { + return "", fmt.Errorf("create workdir %s: %w", workdir, err) + } + + if err := copyDir(cachedDir, workdir); err != nil { + return "", fmt.Errorf("copy runner bits to %s: %w", workdir, err) + } + + m.logger.DebugContext(ctx, "prepared runner workdir", "workdir", workdir, "runner", instance.Name) + return workdir, nil +} + +func (m *ProcessManager) Start(ctx context.Context, instance *model.RunnerInstance, workdir, jitConfig string, logFile io.Writer) (*Process, error) { + runScript := filepath.Join(workdir, "run.sh") + cmd := exec.CommandContext(ctx, runScript) + cmd.Dir = workdir + cmd.Env = append(os.Environ(), "ACTIONS_RUNNER_INPUT_JITCONFIG="+jitConfig) + cmd.Stdout = logFile + cmd.Stderr = logFile + + if err := cmd.Start(); err != nil { + return nil, fmt.Errorf("start runner %s: %w", instance.Name, err) + } + + pidFile := filepath.Join(workdir, ".ghr-pid") + if err := os.WriteFile(pidFile, []byte(strconv.Itoa(cmd.Process.Pid)), 0o644); 
err != nil { + m.logger.WarnContext(ctx, "failed to write PID file", "path", pidFile, "error", err) + } + + m.logger.InfoContext(ctx, "runner started", "runner", instance.Name, "pid", cmd.Process.Pid) + + return &Process{ + Name: instance.Name, + Group: instance.Group, + WorkDir: workdir, + PID: cmd.Process.Pid, + StartedAt: time.Now(), + Cmd: cmd, + }, nil +} + +func (m *ProcessManager) Stop(ctx context.Context, proc *Process) error { + if proc.Cmd == nil || proc.Cmd.Process == nil { + return nil + } + + m.logger.InfoContext(ctx, "stopping runner", "runner", proc.Name, "pid", proc.PID) + + if err := proc.Cmd.Process.Signal(syscall.SIGTERM); err != nil { + if isProcessFinished(err) { + return nil + } + return fmt.Errorf("send SIGTERM to runner %s (pid %d): %w", proc.Name, proc.PID, err) + } + + done := make(chan error, 1) + go func() { + done <- proc.Cmd.Wait() + }() + + select { + case err := <-done: + if isExpectedExit(err) { + return nil + } + return err + case <-time.After(stopGracePeriod): + m.logger.WarnContext(ctx, "runner did not exit after SIGTERM, sending SIGKILL", "runner", proc.Name, "pid", proc.PID) + if err := proc.Cmd.Process.Kill(); err != nil { + return fmt.Errorf("kill runner %s (pid %d): %w", proc.Name, proc.PID, err) + } + return <-done + } +} + +func isProcessFinished(err error) bool { + return errors.Is(err, os.ErrProcessDone) +} + +func isExpectedExit(err error) bool { + if err == nil { + return true + } + var exitErr *exec.ExitError + return errors.As(err, &exitErr) +} + +func (m *ProcessManager) Cleanup(proc *Process) error { + if err := os.RemoveAll(proc.WorkDir); err != nil { + return fmt.Errorf("remove workdir %s: %w", proc.WorkDir, err) + } + return nil +} diff --git a/internal/runner/process_test.go b/internal/runner/process_test.go new file mode 100644 index 0000000..dc457d1 --- /dev/null +++ b/internal/runner/process_test.go @@ -0,0 +1,77 @@ +package runner + +import ( + "context" + "os" + "path/filepath" + "testing" + + 
"github.com/RedBoardDev/gh-runners-tool/v2/internal/model" +) + +func TestPrepare(t *testing.T) { + workdirBase := t.TempDir() + cachedDir := t.TempDir() + + files := map[string]string{ + "run.sh": "#!/bin/bash\necho run\n", + "config.sh": "#!/bin/bash\necho config\n", + } + for name, content := range files { + if err := os.WriteFile(filepath.Join(cachedDir, name), []byte(content), 0o755); err != nil { + t.Fatalf("write %s: %v", name, err) + } + } + + pm := NewProcessManager(workdirBase, silentLogger()) + instance := model.RunnerInstance{ + ID: "abc123", + Name: "test-group-abc123", + Group: "test-group", + } + + workdir, err := pm.Prepare(context.Background(), &instance, cachedDir) + if err != nil { + t.Fatalf("Prepare: %v", err) + } + + expectedDir := filepath.Join(workdirBase, "test-group", "test-group-abc123") + if workdir != expectedDir { + t.Fatalf("expected workdir %q, got %q", expectedDir, workdir) + } + + for name, content := range files { + p := filepath.Join(workdir, name) + data, readErr := os.ReadFile(p) + if readErr != nil { + t.Fatalf("read copied file %s: %v", name, readErr) + } + if string(data) != content { + t.Fatalf("file %s content mismatch: got %q, want %q", name, string(data), content) + } + } +} + +func TestCleanup(t *testing.T) { + workdir := t.TempDir() + sentinel := filepath.Join(workdir, "run.sh") + if err := os.WriteFile(sentinel, []byte("#!/bin/bash\n"), 0o755); err != nil { + t.Fatalf("write sentinel: %v", err) + } + + proc := &Process{ + Name: "test-runner", + Group: "test-group", + WorkDir: workdir, + PID: 99999, + } + + pm := NewProcessManager(filepath.Dir(workdir), silentLogger()) + if err := pm.Cleanup(proc); err != nil { + t.Fatalf("Cleanup: %v", err) + } + + if _, err := os.Stat(workdir); !os.IsNotExist(err) { + t.Fatalf("expected workdir to be removed, stat returned: %v", err) + } +} diff --git a/old-version/config.example.yaml b/old-version/config.example.yaml deleted file mode 100644 index ad673b6..0000000 --- 
a/old-version/config.example.yaml +++ /dev/null @@ -1,20 +0,0 @@ -github: - scope: org # or repo - owner: your-org - # repo: your-repo # required when scope=repo - -defaults: - workdir_base: /var/lib/ghr/groups - cache_dir: /var/lib/ghr/cache - version: latest - -groups: - - name: deploy-api - count: 10 - ephemeral: true - labels: [deploy, macos] - - name: ci-default - count: 5 - ephemeral: false - labels: [ci, macos] - diff --git a/old-version/env.example b/old-version/env.example deleted file mode 100644 index 73723dd..0000000 --- a/old-version/env.example +++ /dev/null @@ -1,2 +0,0 @@ -GITHUB_TOKEN=YOUR_GITHUB_PAT_WITH_RUNNER_PERMS - diff --git a/old-version/go.mod b/old-version/go.mod deleted file mode 100644 index 6bc6cd4..0000000 --- a/old-version/go.mod +++ /dev/null @@ -1,14 +0,0 @@ -module gh-runners-tool - -go 1.24.4 - -require ( - github.com/joho/godotenv v1.5.1 - github.com/spf13/cobra v1.8.0 - gopkg.in/yaml.v3 v3.0.1 -) - -require ( - github.com/inconshreveable/mousetrap v1.1.0 // indirect - github.com/spf13/pflag v1.0.5 // indirect -) diff --git a/old-version/go.sum b/old-version/go.sum deleted file mode 100644 index f2fd08d..0000000 --- a/old-version/go.sum +++ /dev/null @@ -1,14 +0,0 @@ -github.com/cpuguy83/go-md2man/v2 v2.0.3/go.mod h1:tgQtvFlXSQOSOSIRvRPT7W67SCa46tRHOmNcaadrF8o= -github.com/inconshreveable/mousetrap v1.1.0 h1:wN+x4NVGpMsO7ErUn/mUI3vEoE6Jt13X2s0bqwp9tc8= -github.com/inconshreveable/mousetrap v1.1.0/go.mod h1:vpF70FUmC8bwa3OWnCshd2FqLfsEA9PFc4w1p2J65bw= -github.com/joho/godotenv v1.5.1 h1:7eLL/+HRGLY0ldzfGMeQkb7vMd0as4CfYvUVzLqw0N0= -github.com/joho/godotenv v1.5.1/go.mod h1:f4LDr5Voq0i2e/R5DDNOoa2zzDfwtkZa6DnEwAbqwq4= -github.com/russross/blackfriday/v2 v2.1.0/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM= -github.com/spf13/cobra v1.8.0 h1:7aJaZx1B85qltLMc546zn58BxxfZdR/W22ej9CFoEf0= -github.com/spf13/cobra v1.8.0/go.mod h1:WXLWApfZ71AjXPya3WOlMsY9yMs7YeiHhFVlvLyhcho= -github.com/spf13/pflag v1.0.5 
h1:iy+VFUOCP1a+8yFto/drg2CJ5u0yRoB7fZw3DKv/JXA= -github.com/spf13/pflag v1.0.5/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg= -gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405 h1:yhCVgyC4o1eVCa2tZl7eS0r+SDo693bJlVdllGtEeKM= -gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= -gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA= -gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= diff --git a/old-version/internal/cli/apply.go b/old-version/internal/cli/apply.go deleted file mode 100644 index 516392a..0000000 --- a/old-version/internal/cli/apply.go +++ /dev/null @@ -1,38 +0,0 @@ -package cli - -import ( - "fmt" - "os" - "strconv" - "syscall" - - "gh-runners-tool/internal/config" - "github.com/spf13/cobra" -) - -func applyCmd() *cobra.Command { - cmd := &cobra.Command{ - Use: "apply", - Short: "Validate config and signal daemon to reload", - RunE: func(cmd *cobra.Command, args []string) error { - if _, err := config.Load(configPath); err != nil { - return fmt.Errorf("load config %s: %w", configPath, err) - } - - pidBytes, err := os.ReadFile(pidFilePath()) - if err != nil { - return fmt.Errorf("read daemon pid from %s: %w", pidFilePath(), err) - } - pid, err := strconv.Atoi(string(pidBytes)) - if err != nil { - return fmt.Errorf("invalid pid file: %w", err) - } - if err := syscall.Kill(pid, syscall.SIGHUP); err != nil { - return fmt.Errorf("signal daemon: %w", err) - } - cmd.Println("reload signal sent to daemon") - return nil - }, - } - return cmd -} diff --git a/old-version/internal/cli/daemon.go b/old-version/internal/cli/daemon.go deleted file mode 100644 index 854edad..0000000 --- a/old-version/internal/cli/daemon.go +++ /dev/null @@ -1,118 +0,0 @@ -package cli - -import ( - "context" - "fmt" - "log" - "os" - "os/signal" - "strings" - "syscall" - "time" - - "gh-runners-tool/internal/config" - "gh-runners-tool/internal/logging" - 
"gh-runners-tool/internal/provider/github" - "gh-runners-tool/internal/reconciler" - "gh-runners-tool/internal/runner" - "github.com/spf13/cobra" -) - -func daemonCmd() *cobra.Command { - cmd := &cobra.Command{ - Use: "daemon", - Short: "Run the controller daemon", - RunE: runDaemon, - } - return cmd -} - -func runDaemon(cmd *cobra.Command, _ []string) error { - logger := logging.New() - - cfg, err := config.Load(configPath) - if err != nil { - return err - } - token := os.Getenv("GITHUB_TOKEN") - if token == "" { - token = os.Getenv("GITHUB_PAT") - } - if token == "" { - return fmt.Errorf("GITHUB_TOKEN (or GITHUB_PAT) is required in environment") - } - - if err := os.MkdirAll(defaultStateDir(), 0o755); err != nil { - return fmt.Errorf("prepare state dir: %w", err) - } - if err := os.WriteFile(pidFilePath(), []byte(fmt.Sprintf("%d", os.Getpid())), 0o644); err != nil { - return fmt.Errorf("write pid file: %w", err) - } - defer os.Remove(pidFilePath()) - - gh := github.New(token) - rm := runner.New(cfg.Defaults.CacheDir, logger) - - rm.CleanupStale(uniqueWorkdirs(cfg)) - - logger.Printf("github cleanup: startup sweep") - if err := cleanupGitHubRegistrations(context.Background(), gh, cfg, logger); err != nil { - logger.Printf("warning: github cleanup failed: %v", err) - } - - rec := reconciler.New(logger, gh, rm) - rec.SetDesired(cfg) - - ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM) - defer stop() - - go func() { - signals := make(chan os.Signal, 1) - signal.Notify(signals, syscall.SIGHUP) - for range signals { - logger.Printf("reload requested (SIGHUP)") - updated, err := config.Load(configPath) - if err != nil { - logger.Printf("reload failed: %v", err) - continue - } - rec.SetDesired(updated) - } - }() - - err = rec.Run(ctx, interval) - shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second) - defer cancel() - rec.Shutdown(shutdownCtx) - logger.Printf("github cleanup: shutdown sweep") - if err := 
cleanupGitHubRegistrations(shutdownCtx, gh, cfg, logger); err != nil { - logger.Printf("warning: github cleanup (shutdown) failed: %v", err) - } - return err -} - -func cleanupGitHubRegistrations(ctx context.Context, gh *github.Client, cfg *config.Config, logger *log.Logger) error { - runners, err := gh.ListRunners(ctx, cfg.GitHub) - if err != nil { - return err - } - groupPrefixes := make(map[string]struct{}, len(cfg.Groups)) - for _, g := range cfg.Groups { - groupPrefixes[g.Name+"-"] = struct{}{} - } - - deleted := 0 - for _, rn := range runners { - for prefix := range groupPrefixes { - if strings.HasPrefix(rn.Name, prefix) { - if err := gh.DeleteRunner(ctx, cfg.GitHub, rn.ID); err != nil { - return fmt.Errorf("delete runner %s (%d): %w", rn.Name, rn.ID, err) - } - deleted++ - break - } - } - } - logger.Printf("github cleanup: inspected=%d deleted=%d", len(runners), deleted) - return nil -} diff --git a/old-version/internal/cli/purge.go b/old-version/internal/cli/purge.go deleted file mode 100644 index bac7929..0000000 --- a/old-version/internal/cli/purge.go +++ /dev/null @@ -1,87 +0,0 @@ -package cli - -import ( - "context" - "fmt" - "os" - "time" - - "gh-runners-tool/internal/config" - "gh-runners-tool/internal/logging" - "gh-runners-tool/internal/provider/github" - "github.com/spf13/cobra" -) - -func purgeCmd() *cobra.Command { - var timeout time.Duration - var waitInterval time.Duration - - cmd := &cobra.Command{ - Use: "purge", - Short: "Delete all self-hosted runners for the configured scope (waits for busy runners to go idle)", - RunE: func(cmd *cobra.Command, args []string) error { - logger := logging.New() - - cfg, err := config.Load(configPath) - if err != nil { - return err - } - token := os.Getenv("GITHUB_TOKEN") - if token == "" { - token = os.Getenv("GITHUB_PAT") - } - if token == "" { - return fmt.Errorf("GITHUB_TOKEN (or GITHUB_PAT) is required in environment") - } - - gh := github.New(token) - - ctx, cancel := 
context.WithTimeout(context.Background(), timeout) - defer cancel() - - logger.Printf("purge: starting (timeout=%s, interval=%s)", timeout, waitInterval) - for { - runners, err := gh.ListRunners(ctx, cfg.GitHub) - if err != nil { - return fmt.Errorf("list runners: %w", err) - } - if len(runners) == 0 { - logger.Printf("purge: nothing to delete") - return nil - } - - deleted := 0 - busy := 0 - for _, rn := range runners { - if rn.Busy || rn.Status == "busy" { - busy++ - continue - } - if err := gh.DeleteRunner(ctx, cfg.GitHub, rn.ID); err != nil { - return fmt.Errorf("delete runner %s (%d): %w", rn.Name, rn.ID, err) - } - deleted++ - logger.Printf("purge: deleted %s (%d)", rn.Name, rn.ID) - } - - remaining := len(runners) - deleted - if remaining == 0 { - logger.Printf("purge: completed") - return nil - } - logger.Printf("purge: remaining=%d busy=%d, waiting %s", remaining, busy, waitInterval) - - select { - case <-ctx.Done(): - return fmt.Errorf("purge timeout: %w", ctx.Err()) - case <-time.After(waitInterval): - } - } - }, - } - - cmd.Flags().DurationVar(&timeout, "timeout", 5*time.Minute, "Overall timeout for purge") - cmd.Flags().DurationVar(&waitInterval, "interval", 5*time.Second, "Wait interval when runners are busy") - - return cmd -} diff --git a/old-version/internal/cli/root.go b/old-version/internal/cli/root.go deleted file mode 100644 index e6c9280..0000000 --- a/old-version/internal/cli/root.go +++ /dev/null @@ -1,73 +0,0 @@ -package cli - -import ( - "os" - "path/filepath" - "time" - - "gh-runners-tool/internal/config" - "github.com/spf13/cobra" -) - -var ( - configPath string - interval time.Duration -) - -func Execute() error { - root := &cobra.Command{ - Use: "ghr", - Short: "GitHub runner controller (macOS)", - } - - root.PersistentFlags().StringVar(&configPath, "config", "config.yaml", "Path to configuration file") - root.PersistentFlags().DurationVar(&interval, "interval", 15*time.Second, "Reconciliation interval for daemon") - - 
root.AddCommand(daemonCmd()) - root.AddCommand(applyCmd()) - root.AddCommand(statusCmd()) - root.AddCommand(purgeCmd()) - - return root.Execute() -} - -func defaultStateDir() string { - if dir := os.Getenv("GHR_STATE_DIR"); dir != "" { - return dir - } - system := filepath.Join("/var/lib/ghr/state") - if err := os.MkdirAll(system, 0o755); err == nil { - return system - } - home, err := os.UserHomeDir() - if err != nil { - return system - } - return filepath.Join(home, ".local", "state", "ghr") -} - -func pidFilePath() string { - return filepath.Join(defaultStateDir(), "daemon.pid") -} - -func uniqueWorkdirs(cfg *config.Config) []string { - seen := make(map[string]struct{}) - add := func(path string) { - if path == "" { - return - } - if _, ok := seen[path]; ok { - return - } - seen[path] = struct{}{} - } - add(cfg.Defaults.WorkdirBase) - for _, g := range cfg.Groups { - add(g.WorkdirBase) - } - out := make([]string, 0, len(seen)) - for k := range seen { - out = append(out, k) - } - return out -} diff --git a/old-version/internal/cli/status.go b/old-version/internal/cli/status.go deleted file mode 100644 index 1d36503..0000000 --- a/old-version/internal/cli/status.go +++ /dev/null @@ -1,160 +0,0 @@ -package cli - -import ( - "errors" - "fmt" - "os" - "path/filepath" - "strconv" - "strings" - "syscall" - - "gh-runners-tool/internal/config" - "github.com/spf13/cobra" -) - -func statusCmd() *cobra.Command { - cmd := &cobra.Command{ - Use: "status", - Short: "Show daemon presence (pid file)", - RunE: func(cmd *cobra.Command, args []string) error { - cfg, err := config.Load(configPath) - if err != nil { - return fmt.Errorf("load config %s: %w", configPath, err) - } - - pidBytes, err := os.ReadFile(pidFilePath()) - if err != nil { - return fmt.Errorf("daemon not running or pid file missing (%s): %w", pidFilePath(), err) - } - pid, err := strconv.Atoi(strings.TrimSpace(string(pidBytes))) - if err != nil { - return fmt.Errorf("invalid pid file: %w", err) - } - - alive, err 
:= pidAlive(pid) - if err != nil { - return fmt.Errorf("probe daemon pid %d: %w", pid, err) - } - - stats, total, warnings, err := collectRunnerStats(cfg) - if err != nil { - return err - } - - cmd.Printf("daemon: %s (pid=%d)\n", ternary(alive, "running", "not responding"), pid) - cmd.Printf("config: %s\n", configPath) - - for _, g := range cfg.Groups { - s := stats[g.Name] - cmd.Printf("group %-20s desired=%-3d running=%-3d stale=%-3d unknown=%-3d base=%s\n", - g.Name, g.Count, s.Running, s.Stale, s.Unknown, g.WorkdirBase) - } - cmd.Printf("total runners: running=%d stale=%d unknown=%d\n", total.Running, total.Stale, total.Unknown) - for _, w := range warnings { - cmd.Printf("warning: %s\n", w) - } - return nil - }, - } - return cmd -} - -type runnerStats struct { - Running int - Stale int - Unknown int -} - -func collectRunnerStats(cfg *config.Config) (map[string]runnerStats, runnerStats, []string, error) { - stats := make(map[string]runnerStats, len(cfg.Groups)) - for _, g := range cfg.Groups { - stats[g.Name] = runnerStats{} - } - - baseToGroup := make(map[string]string, len(cfg.Groups)) - for _, g := range cfg.Groups { - baseToGroup[g.WorkdirBase] = g.Name - } - - var total runnerStats - var warnings []string - - for base, group := range baseToGroup { - entries, err := os.ReadDir(base) - if err != nil { - if os.IsNotExist(err) { - warnings = append(warnings, fmt.Sprintf("workdir base missing: %s", base)) - continue - } - return nil, total, warnings, fmt.Errorf("read workdir base %s: %w", base, err) - } - for _, entry := range entries { - if !entry.IsDir() { - continue - } - dir := filepath.Join(base, entry.Name()) - pidPath := filepath.Join(dir, ".ghr-pid") - pidBytes, err := os.ReadFile(pidPath) - if err != nil { - stats[group] = addUnknown(stats[group]) - total.Unknown++ - continue - } - pid, err := strconv.Atoi(strings.TrimSpace(string(pidBytes))) - if err != nil { - stats[group] = addUnknown(stats[group]) - total.Unknown++ - continue - } - alive, err := 
pidAlive(pid) - if err != nil { - return nil, total, warnings, fmt.Errorf("probe runner pid %d (%s): %w", pid, dir, err) - } - if alive { - stats[group] = addRunning(stats[group]) - total.Running++ - } else { - stats[group] = addStale(stats[group]) - total.Stale++ - } - } - } - return stats, total, warnings, nil -} - -func pidAlive(pid int) (bool, error) { - if pid <= 0 { - return false, fmt.Errorf("invalid pid %d", pid) - } - err := syscall.Kill(pid, 0) - if err == nil || errors.Is(err, syscall.EPERM) { - return true, nil - } - if errors.Is(err, syscall.ESRCH) { - return false, nil - } - return false, err -} - -func addRunning(s runnerStats) runnerStats { - s.Running++ - return s -} - -func addStale(s runnerStats) runnerStats { - s.Stale++ - return s -} - -func addUnknown(s runnerStats) runnerStats { - s.Unknown++ - return s -} - -func ternary[T any](cond bool, a, b T) T { - if cond { - return a - } - return b -} diff --git a/old-version/internal/config/config.go b/old-version/internal/config/config.go deleted file mode 100644 index af92390..0000000 --- a/old-version/internal/config/config.go +++ /dev/null @@ -1,110 +0,0 @@ -package config - -import ( - "fmt" - "os" - "path/filepath" - - "github.com/joho/godotenv" - "gopkg.in/yaml.v3" -) - -type GitHubScope string - -const ( - ScopeOrg GitHubScope = "org" - ScopeRepo GitHubScope = "repo" -) - -type GitHubConfig struct { - Scope GitHubScope `yaml:"scope"` - Owner string `yaml:"owner"` - Repo string `yaml:"repo,omitempty"` -} - -type RunnerDefaults struct { - WorkdirBase string `yaml:"workdir_base"` - CacheDir string `yaml:"cache_dir"` - Version string `yaml:"version"` // e.g. 
"2.319.1" or "latest" -} - -type GroupSpec struct { - Name string `yaml:"name"` - Count int `yaml:"count"` - Ephemeral bool `yaml:"ephemeral"` - Labels []string `yaml:"labels"` - WorkdirBase string `yaml:"workdir_base,omitempty"` - Version string `yaml:"version,omitempty"` -} - -type Config struct { - GitHub GitHubConfig `yaml:"github"` - Defaults RunnerDefaults `yaml:"defaults"` - Groups []GroupSpec `yaml:"groups"` -} - -// Load loads configuration from YAML and .env (env is mandatory for tokens). -func Load(path string) (*Config, error) { - if err := godotenv.Load(); err != nil && !os.IsNotExist(err) { - return nil, fmt.Errorf("loading .env: %w", err) - } - - bytes, err := os.ReadFile(path) - if err != nil { - return nil, fmt.Errorf("read config: %w", err) - } - - cfg := &Config{} - if err := yaml.Unmarshal(bytes, cfg); err != nil { - return nil, fmt.Errorf("parse config: %w", err) - } - - if err := validate(cfg); err != nil { - return nil, err - } - - if cfg.Defaults.WorkdirBase == "" { - cfg.Defaults.WorkdirBase = "/var/lib/ghr/groups" - } - if cfg.Defaults.CacheDir == "" { - cfg.Defaults.CacheDir = "/var/lib/ghr/cache" - } - if cfg.Defaults.Version == "" { - cfg.Defaults.Version = "latest" - } - - for i := range cfg.Groups { - if cfg.Groups[i].WorkdirBase == "" { - cfg.Groups[i].WorkdirBase = filepath.Join(cfg.Defaults.WorkdirBase, cfg.Groups[i].Name) - } - if cfg.Groups[i].Version == "" { - cfg.Groups[i].Version = cfg.Defaults.Version - } - } - - return cfg, nil -} - -func validate(cfg *Config) error { - if cfg.GitHub.Scope != ScopeOrg && cfg.GitHub.Scope != ScopeRepo { - return fmt.Errorf("github.scope must be 'org' or 'repo'") - } - if cfg.GitHub.Owner == "" { - return fmt.Errorf("github.owner is required") - } - if cfg.GitHub.Scope == ScopeRepo && cfg.GitHub.Repo == "" { - return fmt.Errorf("github.repo is required when scope=repo") - } - if len(cfg.Groups) == 0 { - return fmt.Errorf("at least one group is required") - } - for _, g := range cfg.Groups { - 
if g.Name == "" { - return fmt.Errorf("group.name is required") - } - if g.Count < 0 { - return fmt.Errorf("group.count must be >= 0") - } - } - return nil -} diff --git a/old-version/internal/domain/types.go b/old-version/internal/domain/types.go deleted file mode 100644 index b0a550c..0000000 --- a/old-version/internal/domain/types.go +++ /dev/null @@ -1,19 +0,0 @@ -package domain - -type Group struct { - Name string - Count int - Ephemeral bool - Labels []string - Workdir string - Version string -} - -type RunnerInstance struct { - ID string - GroupName string - Ephemeral bool - Workdir string - Labels []string - Version string -} diff --git a/old-version/internal/logging/logging.go b/old-version/internal/logging/logging.go deleted file mode 100644 index d0cbbb6..0000000 --- a/old-version/internal/logging/logging.go +++ /dev/null @@ -1,12 +0,0 @@ -package logging - -import ( - "log" - "os" -) - -// * Provides a basic logger configured for stdout. -func New() *log.Logger { - logger := log.New(os.Stdout, "[ghr] ", log.LstdFlags|log.Lmicroseconds) - return logger -} diff --git a/old-version/internal/provider/github/client.go b/old-version/internal/provider/github/client.go deleted file mode 100644 index d6c3782..0000000 --- a/old-version/internal/provider/github/client.go +++ /dev/null @@ -1,166 +0,0 @@ -package github - -import ( - "bytes" - "context" - "encoding/json" - "fmt" - "io" - "net/http" - "time" - - "gh-runners-tool/internal/config" -) - -type Client struct { - httpClient *http.Client - token string -} - -type registrationTokenResponse struct { - Token string `json:"token"` - ExpiresAt time.Time `json:"expires_at"` -} - -// New creates a GitHub API client using a PAT from env. 
-func New(token string) *Client { - return &Client{ - httpClient: &http.Client{Timeout: 15 * time.Second}, - token: token, - } -} - -type Runner struct { - ID int64 `json:"id"` - Name string `json:"name"` - Status string `json:"status"` - Busy bool `json:"busy"` -} - -type listRunnersResponse struct { - Runners []Runner `json:"runners"` -} - -// RegistrationToken requests a registration token for runners. -func (c *Client) RegistrationToken(ctx context.Context, gh config.GitHubConfig) (string, error) { - url := "" - switch gh.Scope { - case config.ScopeOrg: - url = fmt.Sprintf("https://api.github.com/orgs/%s/actions/runners/registration-token", gh.Owner) - case config.ScopeRepo: - url = fmt.Sprintf("https://api.github.com/repos/%s/%s/actions/runners/registration-token", gh.Owner, gh.Repo) - default: - return "", fmt.Errorf("unknown scope %s", gh.Scope) - } - - req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader([]byte("{}"))) - if err != nil { - return "", fmt.Errorf("build request: %w", err) - } - req.Header.Set("Accept", "application/vnd.github+json") - req.Header.Set("Authorization", "Bearer "+c.token) - - resp, err := c.httpClient.Do(req) - if err != nil { - return "", fmt.Errorf("request registration token: %w", err) - } - defer resp.Body.Close() - - if resp.StatusCode >= 300 { - return "", fmt.Errorf("registration token failed: status %d", resp.StatusCode) - } - - var decoded registrationTokenResponse - if err := json.NewDecoder(resp.Body).Decode(&decoded); err != nil { - return "", fmt.Errorf("decode response: %w", err) - } - if decoded.Token == "" { - return "", fmt.Errorf("empty token returned") - } - return decoded.Token, nil -} - -// ListRunners returns all runners for the configured scope (first page). 
-func (c *Client) ListRunners(ctx context.Context, gh config.GitHubConfig) ([]Runner, error) { - var all []Runner - page := 1 - - for { - url := "" - switch gh.Scope { - case config.ScopeOrg: - url = fmt.Sprintf("https://api.github.com/orgs/%s/actions/runners?per_page=100&page=%d", gh.Owner, page) - case config.ScopeRepo: - url = fmt.Sprintf("https://api.github.com/repos/%s/%s/actions/runners?per_page=100&page=%d", gh.Owner, gh.Repo, page) - default: - return nil, fmt.Errorf("unknown scope %s", gh.Scope) - } - - req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil) - if err != nil { - return nil, fmt.Errorf("build request: %w", err) - } - req.Header.Set("Accept", "application/vnd.github+json") - req.Header.Set("Authorization", "Bearer "+c.token) - - resp, err := c.httpClient.Do(req) - if err != nil { - return nil, fmt.Errorf("list runners: %w", err) - } - if resp.StatusCode >= 300 { - resp.Body.Close() - return nil, fmt.Errorf("list runners failed: status %d", resp.StatusCode) - } - - var decoded listRunnersResponse - if err := json.NewDecoder(resp.Body).Decode(&decoded); err != nil { - resp.Body.Close() - return nil, fmt.Errorf("decode response: %w", err) - } - - all = append(all, decoded.Runners...) - resp.Body.Close() - - if len(decoded.Runners) < 100 { - break - } - page++ - } - - return all, nil -} - -// DeleteRunner removes a runner registration by ID. 
-func (c *Client) DeleteRunner(ctx context.Context, gh config.GitHubConfig, id int64) error { - url := "" - switch gh.Scope { - case config.ScopeOrg: - url = fmt.Sprintf("https://api.github.com/orgs/%s/actions/runners/%d", gh.Owner, id) - case config.ScopeRepo: - url = fmt.Sprintf("https://api.github.com/repos/%s/%s/actions/runners/%d", gh.Owner, gh.Repo, id) - default: - return fmt.Errorf("unknown scope %s", gh.Scope) - } - - req, err := http.NewRequestWithContext(ctx, http.MethodDelete, url, nil) - if err != nil { - return fmt.Errorf("build request: %w", err) - } - req.Header.Set("Accept", "application/vnd.github+json") - req.Header.Set("Authorization", "Bearer "+c.token) - - resp, err := c.httpClient.Do(req) - if err != nil { - return fmt.Errorf("delete runner: %w", err) - } - defer resp.Body.Close() - - if resp.StatusCode == http.StatusNotFound { - return nil - } - if resp.StatusCode >= 300 { - body, _ := io.ReadAll(resp.Body) - return fmt.Errorf("delete runner failed: status %d body=%s", resp.StatusCode, string(body)) - } - return nil -} diff --git a/old-version/internal/reconciler/reconciler.go b/old-version/internal/reconciler/reconciler.go deleted file mode 100644 index c2db09a..0000000 --- a/old-version/internal/reconciler/reconciler.go +++ /dev/null @@ -1,160 +0,0 @@ -package reconciler - -import ( - "context" - "fmt" - "sync" - "time" - - "gh-runners-tool/internal/config" - "gh-runners-tool/internal/domain" - "gh-runners-tool/internal/provider/github" - "gh-runners-tool/internal/runner" -) - -type Logger interface { - Printf(string, ...any) -} - -type Reconciler struct { - logger Logger - gh *github.Client - runners *runner.Manager - - mu sync.Mutex - desired *config.Config - groupPools map[string]*slotPool - ghCfg config.GitHubConfig - - stopOnce sync.Once -} - -func New(logger Logger, gh *github.Client, runners *runner.Manager) *Reconciler { - return &Reconciler{ - logger: logger, - gh: gh, - runners: runners, - groupPools: make(map[string]*slotPool), 
- } -} - -func (r *Reconciler) SetDesired(cfg *config.Config) { - r.mu.Lock() - defer r.mu.Unlock() - r.desired = cfg - r.ghCfg = cfg.GitHub -} - -func (r *Reconciler) Run(ctx context.Context, interval time.Duration) error { - if interval <= 0 { - interval = 15 * time.Second - } - - ticker := time.NewTicker(interval) - defer ticker.Stop() - - if err := r.reconcile(ctx); err != nil { - r.logger.Printf("reconcile error: %v", err) - } - - for { - select { - case <-ctx.Done(): - return ctx.Err() - case <-ticker.C: - if err := r.reconcile(ctx); err != nil { - r.logger.Printf("reconcile error: %v", err) - } - } - } -} - -func (r *Reconciler) reconcile(ctx context.Context) error { - r.mu.Lock() - cfg := r.desired - r.mu.Unlock() - - if cfg == nil { - return fmt.Errorf("no desired config set") - } - - desired := make(map[string]domain.Group, len(cfg.Groups)) - for _, g := range cfg.Groups { - desired[g.Name] = domain.Group{ - Name: g.Name, - Count: g.Count, - Ephemeral: g.Ephemeral, - Labels: g.Labels, - Workdir: g.WorkdirBase, - Version: g.Version, - } - } - - // Remove pools for groups that are no longer desired. - for name := range r.groupPools { - if _, ok := desired[name]; !ok { - r.groupPools[name].stop() - delete(r.groupPools, name) - r.logger.Printf("group %s stopped", name) - } - } - - // Ensure pools exist and match desired count. - for name, grp := range desired { - pool, ok := r.groupPools[name] - if !ok { - pool = newSlotPool(r.logger, r.gh, r.runners, grp, r.ghCfg) - r.groupPools[name] = pool - r.logger.Printf("group %s started with %d slots", name, grp.Count) - } - pool.update(grp) - } - - return nil -} - -// Shutdown stops all slots and runners when the daemon exits. 
-func (r *Reconciler) Shutdown(ctx context.Context) { - r.stopOnce.Do(func() { - r.mu.Lock() - pools := r.snapshotPools() - r.mu.Unlock() - - stopPools(pools) - - waitCtx, cancel := context.WithTimeout(ctx, 30*time.Second) - defer cancel() - waitPools(waitCtx, pools, r.logger) - }) -} - -func (r *Reconciler) snapshotPools() []*slotPool { - out := make([]*slotPool, 0, len(r.groupPools)) - for _, p := range r.groupPools { - out = append(out, p) - } - return out -} - -func stopPools(pools []*slotPool) { - for _, p := range pools { - p.stop() - } -} - -func waitPools(ctx context.Context, pools []*slotPool, logger Logger) { - for _, p := range pools { - p.wait(ctx) - } -} - -// Status returns a snapshot of current slots by group. -func (r *Reconciler) Status() map[string]int { - r.mu.Lock() - defer r.mu.Unlock() - out := make(map[string]int) - for name, pool := range r.groupPools { - out[name] = pool.size() - } - return out -} diff --git a/old-version/internal/reconciler/slots.go b/old-version/internal/reconciler/slots.go deleted file mode 100644 index 68061ab..0000000 --- a/old-version/internal/reconciler/slots.go +++ /dev/null @@ -1,229 +0,0 @@ -package reconciler - -import ( - "context" - "fmt" - "math/rand" - "sync" - "time" - - "gh-runners-tool/internal/config" - "gh-runners-tool/internal/domain" - "gh-runners-tool/internal/provider/github" - "gh-runners-tool/internal/runner" -) - -type slotPool struct { - logger Logger - gh *github.Client - runners *runner.Manager - ghCfg config.GitHubConfig - - mu sync.Mutex - group domain.Group - slots map[int]context.CancelFunc - wg sync.WaitGroup - stopping bool -} - -func newSlotPool(logger Logger, gh *github.Client, runners *runner.Manager, group domain.Group, ghCfg config.GitHubConfig) *slotPool { - return &slotPool{ - logger: logger, - gh: gh, - runners: runners, - ghCfg: ghCfg, - group: group, - slots: make(map[int]context.CancelFunc), - } -} - -func (p *slotPool) update(group domain.Group) { - p.mu.Lock() - defer 
p.mu.Unlock() - p.group = group - target := group.Count - current := len(p.slots) - - if target > current { - for i := current; i < target; i++ { - p.startSlotLocked(i) - } - } - if target < current { - diff := current - target - i := 0 - for id, cancel := range p.slots { - if i >= diff { - break - } - cancel() - delete(p.slots, id) - i++ - } - } -} - -func (p *slotPool) startSlotLocked(id int) { - ctx, cancel := context.WithCancel(context.Background()) - p.slots[id] = cancel - p.wg.Add(1) - go p.runSlot(ctx, id) -} - -func (p *slotPool) runSlot(ctx context.Context, slotID int) { - defer p.wg.Done() - - const ( - minBackoff = 2 * time.Second - maxBackoff = 30 * time.Second - ) - backoff := minBackoff - - for { - group := p.currentGroup() - - select { - case <-ctx.Done(): - return - default: - } - - token, err := p.gh.RegistrationToken(ctx, p.ghCfg) - if err != nil { - p.logger.Printf("slot %d group=%s: registration token: %v", slotID, group.Name, err) - if !sleepOrDone(ctx, jitter(backoff)) { - return - } - if backoff < maxBackoff { - backoff *= 2 - if backoff > maxBackoff { - backoff = maxBackoff - } - } - continue - } - - inst := runner.NewRunnerInstance(group) - handle, err := p.runners.Start(ctx, inst, p.ghCfg, token) - if err != nil { - p.logger.Printf("slot %d group=%s: start runner: %v", slotID, group.Name, err) - if !sleepOrDone(ctx, jitter(backoff)) { - return - } - if backoff < maxBackoff { - backoff *= 2 - if backoff > maxBackoff { - backoff = maxBackoff - } - } - continue - } - - backoff = minBackoff - - err = handle.Wait() - if err != nil { - p.logger.Printf("slot %d group=%s: runner %s exited with error: %v", slotID, group.Name, handle.ID, err) - } else { - p.logger.Printf("slot %d group=%s: runner %s exited normally", slotID, group.Name, handle.ID) - } - - go func(h *runner.Handle) { - ctxUnreg, cancel := context.WithTimeout(ctx, 15*time.Second) - defer cancel() - if err := p.unregister(ctxUnreg, h); err != nil { - p.logger.Printf("slot %d group=%s: 
unregister %s: %v", slotID, group.Name, h.ID, err) - } - }(handle) - - if !sleepOrDone(ctx, jitter(minBackoff)) { - return - } - } -} - -func (p *slotPool) unregister(ctx context.Context, h *runner.Handle) error { - name := runnerName(h.Group, h.ID) - runners, err := p.gh.ListRunners(ctx, p.ghCfg) - if err != nil { - return err - } - for _, rn := range runners { - if rn.Name == name { - return p.gh.DeleteRunner(ctx, p.ghCfg, rn.ID) - } - } - return nil -} - -func (p *slotPool) stop() { - p.mu.Lock() - defer p.mu.Unlock() - if p.stopping { - return - } - p.stopping = true - for _, cancel := range p.slots { - cancel() - } -} - -func (p *slotPool) wait(ctx context.Context) { - done := make(chan struct{}) - go func() { - p.wg.Wait() - close(done) - }() - - select { - case <-done: - case <-ctx.Done(): - p.logger.Printf("slot pool group=%s wait timeout", p.group.Name) - } -} - -func sleepOrDone(ctx context.Context, d time.Duration) bool { - select { - case <-time.After(d): - return true - case <-ctx.Done(): - return false - } -} - -func (p *slotPool) size() int { - p.mu.Lock() - defer p.mu.Unlock() - return len(p.slots) -} - -func (p *slotPool) currentGroup() domain.Group { - p.mu.Lock() - defer p.mu.Unlock() - return p.group -} - -func jitter(d time.Duration) time.Duration { - if d <= 0 { - return time.Second - } - // * Apply ±20% jitter to avoid thundering herd on retries. 
- delta := d / 5 - if delta <= 0 { - delta = time.Millisecond - } - offset := rand.Int63n(int64(delta)*2+1) - int64(delta) - out := d + time.Duration(offset) - if out < time.Millisecond { - return time.Millisecond - } - return out -} - -func init() { - rand.Seed(time.Now().UnixNano()) -} - -func runnerName(group, id string) string { - return fmt.Sprintf("%s-%s", group, id) -} diff --git a/old-version/internal/runner/manager.go b/old-version/internal/runner/manager.go deleted file mode 100644 index 489eca1..0000000 --- a/old-version/internal/runner/manager.go +++ /dev/null @@ -1,398 +0,0 @@ -package runner - -import ( - "archive/tar" - "compress/gzip" - "context" - "crypto/rand" - "encoding/hex" - "encoding/json" - "errors" - "fmt" - "io" - "net/http" - "os" - "os/exec" - "path/filepath" - "runtime" - "strconv" - "strings" - "sync" - "time" - - "gh-runners-tool/internal/config" - "gh-runners-tool/internal/domain" -) - -const pidFileName = ".ghr-pid" - -type Manager struct { - cacheDir string - logger Logger - httpClient *http.Client - mu sync.Mutex -} - -type Logger interface { - Printf(string, ...any) -} - -type Handle struct { - ID string - Group string - Cmd *exec.Cmd - Workdir string - done chan struct{} - err error -} - -func (h *Handle) Wait() error { - <-h.done - return h.err -} - -func (h *Handle) Done() <-chan struct{} { - return h.done -} - -func New(cacheDir string, logger Logger) *Manager { - return &Manager{ - cacheDir: cacheDir, - logger: logger, - httpClient: &http.Client{Timeout: 60 * time.Second}, - } -} - -// Start prepares and launches a runner process for the given instance. 
-func (m *Manager) Start(ctx context.Context, inst domain.RunnerInstance, gh config.GitHubConfig, registrationToken string) (*Handle, error) { - baseDir, err := m.ensureRunnerBits(ctx, inst.Version) - if err != nil { - return nil, err - } - - if err := os.MkdirAll(inst.Workdir, 0o755); err != nil { - return nil, fmt.Errorf("create workdir: %w", err) - } - - if err := copyDir(baseDir, inst.Workdir); err != nil { - return nil, fmt.Errorf("copy runner files: %w", err) - } - - name := fmt.Sprintf("%s-%s", inst.GroupName, inst.ID) - url := runnerURL(gh) - - configArgs := []string{ - filepath.Join(inst.Workdir, "config.sh"), - "--unattended", - "--url", url, - "--token", registrationToken, - "--name", name, - } - if len(inst.Labels) > 0 { - configArgs = append(configArgs, "--labels", strings.Join(inst.Labels, ",")) - } - if inst.Ephemeral { - configArgs = append(configArgs, "--ephemeral") - } - - configCmd := exec.CommandContext(ctx, "bash", configArgs...) - configCmd.Dir = inst.Workdir - configCmd.Stdout = os.Stdout - configCmd.Stderr = os.Stderr - if err := configCmd.Run(); err != nil { - _ = os.RemoveAll(inst.Workdir) - return nil, fmt.Errorf("config runner: %w", err) - } - - runCmd := exec.CommandContext(ctx, filepath.Join(inst.Workdir, "run.sh")) - runCmd.Dir = inst.Workdir - runCmd.Stdout = os.Stdout - runCmd.Stderr = os.Stderr - - if err := runCmd.Start(); err != nil { - _ = os.RemoveAll(inst.Workdir) - return nil, fmt.Errorf("start runner: %w", err) - } - - if err := m.writePID(inst.Workdir, runCmd.Process.Pid); err != nil { - _ = runCmd.Process.Kill() - _ = os.RemoveAll(inst.Workdir) - return nil, fmt.Errorf("write pid: %w", err) - } - - handle := &Handle{ - ID: inst.ID, - Group: inst.GroupName, - Cmd: runCmd, - Workdir: inst.Workdir, - done: make(chan struct{}), - } - - go func() { - defer close(handle.done) - handle.err = runCmd.Wait() - // Cleanup workdir regardless of exit status. 
- _ = os.RemoveAll(inst.Workdir) - }() - - return handle, nil -} - -func (m *Manager) ensureRunnerBits(ctx context.Context, version string) (string, error) { - resolvedVersion, err := m.resolveVersion(ctx, version) - if err != nil { - return "", err - } - - m.mu.Lock() - defer m.mu.Unlock() - - targetDir := filepath.Join(m.cacheDir, resolvedVersion) - if _, err := os.Stat(targetDir); err == nil { - return targetDir, nil - } - - if err := os.MkdirAll(targetDir, 0o755); err != nil { - return "", fmt.Errorf("create cache dir: %w", err) - } - - archivePath := filepath.Join(m.cacheDir, fmt.Sprintf("actions-runner-%s.tar.gz", resolvedVersion)) - if err := m.downloadRunner(ctx, resolvedVersion, archivePath); err != nil { - return "", err - } - - if err := untar(archivePath, targetDir); err != nil { - return "", fmt.Errorf("untar: %w", err) - } - - return targetDir, nil -} - -func (m *Manager) downloadRunner(ctx context.Context, version, dest string) error { - url := runnerDownloadURL(version) - req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil) - if err != nil { - return fmt.Errorf("build request: %w", err) - } - - resp, err := m.httpClient.Do(req) - if err != nil { - return fmt.Errorf("download runner: %w", err) - } - defer resp.Body.Close() - - if resp.StatusCode >= 300 { - return fmt.Errorf("download runner failed: status %d", resp.StatusCode) - } - - f, err := os.Create(dest) - if err != nil { - return fmt.Errorf("create archive: %w", err) - } - defer f.Close() - - if _, err := io.Copy(f, resp.Body); err != nil { - return fmt.Errorf("write archive: %w", err) - } - - return nil -} - -func runnerDownloadURL(version string) string { - resolved := version - arch := "x64" - if runtime.GOARCH == "arm64" { - arch = "arm64" - } - return fmt.Sprintf("https://github.com/actions/runner/releases/download/v%s/actions-runner-osx-%s-%s.tar.gz", resolved, arch, resolved) -} - -func (m *Manager) resolveVersion(ctx context.Context, version string) (string, error) { - 
if version != "latest" { - return version, nil - } - - req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://api.github.com/repos/actions/runner/releases/latest", nil) - if err != nil { - return "", err - } - resp, err := m.httpClient.Do(req) - if err != nil { - return "", err - } - defer resp.Body.Close() - - if resp.StatusCode >= 300 { - return "", fmt.Errorf("latest version lookup failed: status %d", resp.StatusCode) - } - var payload struct { - TagName string `json:"tag_name"` - } - if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil { - return "", err - } - tag := strings.TrimPrefix(payload.TagName, "v") - if tag == "" { - return "", fmt.Errorf("empty tag from latest release") - } - return tag, nil -} - -func runnerURL(gh config.GitHubConfig) string { - if gh.Scope == config.ScopeRepo { - return fmt.Sprintf("https://github.com/%s/%s", gh.Owner, gh.Repo) - } - return fmt.Sprintf("https://github.com/%s", gh.Owner) -} - -func untar(src, dest string) error { - f, err := os.Open(src) - if err != nil { - return err - } - defer f.Close() - - gzr, err := gzip.NewReader(f) - if err != nil { - return err - } - defer gzr.Close() - - tr := tar.NewReader(gzr) - for { - header, err := tr.Next() - if errors.Is(err, io.EOF) { - break - } - if err != nil { - return err - } - - targetPath := filepath.Join(dest, header.Name) - - switch header.Typeflag { - case tar.TypeDir: - if err := os.MkdirAll(targetPath, os.FileMode(header.Mode)); err != nil { - return err - } - case tar.TypeReg: - if err := os.MkdirAll(filepath.Dir(targetPath), 0o755); err != nil { - return err - } - outFile, err := os.OpenFile(targetPath, os.O_CREATE|os.O_RDWR|os.O_TRUNC, os.FileMode(header.Mode)) - if err != nil { - return err - } - if _, err := io.Copy(outFile, tr); err != nil { - _ = outFile.Close() - return err - } - _ = outFile.Close() - default: - continue - } - } - return nil -} - -func copyDir(src, dst string) error { - return filepath.Walk(src, func(path string, info 
os.FileInfo, err error) error { - if err != nil { - return err - } - rel, err := filepath.Rel(src, path) - if err != nil { - return err - } - target := filepath.Join(dst, rel) - - if info.IsDir() { - return os.MkdirAll(target, info.Mode()) - } - - if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil { - return err - } - - srcFile, err := os.Open(path) - if err != nil { - return err - } - defer srcFile.Close() - - dstFile, err := os.OpenFile(target, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, info.Mode()) - if err != nil { - return err - } - defer dstFile.Close() - - if _, err := io.Copy(dstFile, srcFile); err != nil { - return err - } - - return nil - }) -} - -// NewRunnerInstance builds a runner instance descriptor with generated ID. -func NewRunnerInstance(group domain.Group) domain.RunnerInstance { - id := randID() - return domain.RunnerInstance{ - ID: id, - GroupName: group.Name, - Ephemeral: group.Ephemeral, - Workdir: filepath.Join(group.Workdir, id), - Labels: group.Labels, - Version: group.Version, - } -} - -func randID() string { - var b [4]byte - _, _ = rand.Read(b[:]) - return hex.EncodeToString(b[:]) -} - -func (m *Manager) writePID(workdir string, pid int) error { - pidPath := filepath.Join(workdir, pidFileName) - return os.WriteFile(pidPath, []byte(strconv.Itoa(pid)), 0o644) -} - -// CleanupStale removes leftover runner workdirs and terminates stray runner processes in known bases. 
-func (m *Manager) CleanupStale(bases []string) { - for _, base := range bases { - entries, err := os.ReadDir(base) - if err != nil { - m.logger.Printf("cleanup: skip base %s: %v", base, err) - continue - } - for _, entry := range entries { - if !entry.IsDir() { - continue - } - dir := filepath.Join(base, entry.Name()) - pidPath := filepath.Join(dir, pidFileName) - if pidBytes, err := os.ReadFile(pidPath); err == nil { - if pid, err := strconv.Atoi(strings.TrimSpace(string(pidBytes))); err == nil { - if err := killPID(pid); err != nil { - m.logger.Printf("kill stale pid %d (%s): %v", pid, dir, err) - } - } - } - if err := os.RemoveAll(dir); err != nil { - m.logger.Printf("remove stale workdir %s: %v", dir, err) - } - } - } -} - -func killPID(pid int) error { - if pid <= 0 { - return fmt.Errorf("invalid pid %d", pid) - } - proc, err := os.FindProcess(pid) - if err != nil { - return err - } - return proc.Kill() -} diff --git a/tests/complete/README.md b/tests/complete/README.md new file mode 100644 index 0000000..ec93874 --- /dev/null +++ b/tests/complete/README.md @@ -0,0 +1,106 @@ +# Complete Test + +Full end-to-end test of all ghr v2 features with 4 groups, 20 jobs, and all edge cases. 
+ +## What is tested + +### Scale set management +- 4 scale sets created at startup +- Scale sets deleted on shutdown (Ctrl+C) +- Per-group health override (ghr-deploy: runner_timeout=10m) + +### Scaling behavior +- Scale-up from 0 to max (ghr-heavy: 0 -> 2) +- Pre-provisioned idle runner (ghr-fast: min=1, ghr-single: min=1) +- Scale-up to max under load (ghr-fast: 1 -> 3) +- Job queuing when max reached (ghr-fast 4th job waits) +- Scale-down after job completion (ephemeral runners) +- Second wave of jobs after first batch completes +- Sequential enforcement with max=1 (ghr-deploy: 3 jobs one after another) +- Always-on min=max=1 (ghr-single: runner always available) + +### Runner lifecycle +- Runner provisioned (workdir copy, JIT config, process start) +- Job started (idle -> busy transition) +- Job completed success (stop, cleanup workdir) +- Job completed failure (runner.failed event, cleanup still happens) +- Instant job (fast provision/cleanup cycle) +- Multi-step job (steps share runner) +- High stdout output (100 lines of payload) + +### Health monitoring (check_interval=10s) +- Runner liveness checks (kill -0 on PIDs) +- Runner timeout detection (runner_timeout=5m, won't trigger in test) +- Idle timeout (idle_timeout=2m, triggers on min_runners idle runners after all jobs done) +- Disk space check (min_disk_space=500MB) +- Health issues -> notification events + +### Notifications (Discord) +- runner.failed event sent when edge-fail job fails +- health.* events sent on any health issue +- daemon.start / daemon.stop events + +### Monitoring (Uptime Kuma) +- Daemon health push every check_interval (10s) +- Per-group health push (4 groups, 4 push tokens) +- Degraded threshold at 0.5 + +### Logging +- Daemon log: {log_dir}/daemon/{date}.json +- Group logs: {log_dir}/groups/{group}/{date}.json (4 groups) +- Runner logs: {log_dir}/groups/{group}/runners/{runner}/{date}.json +- Console output in text format with debug level +- Runner stdout captured in runner log 
files + +### Shutdown +- Ctrl+C triggers graceful shutdown +- All idle runners killed +- All workdirs cleaned +- All scale sets deleted +- PID file removed +- State file removed +- Socket removed +- No orphan processes + +## Setup + +1. Copy `env.example` to `.env` and fill in your values +2. Edit `config.yaml` and set `github.url` to your org +3. Run: + +```bash +cd tests/complete +cp env.example .env +# Edit .env with your Discord webhook + Uptime Kuma URLs + +ghr run --config config.yaml --log-level debug +``` + +4. Copy `workflow.yml` to `.github/workflows/test-ghr-complete.yml` in your repo +5. Trigger from GitHub Actions > "Run workflow" + +## Verification checklist + +After the workflow completes: + +- [ ] All 20 jobs completed in GitHub Actions (19 success, 1 failure) +- [ ] ghr-fast scaled to 3 runners concurrently +- [ ] ghr-heavy scaled to 2 runners concurrently +- [ ] ghr-deploy ran 3 jobs sequentially (max=1) +- [ ] ghr-single had pre-provisioned runner at startup +- [ ] edge-fail shows `result=failed` in ghr logs +- [ ] Discord received a notification for the failed job +- [ ] Uptime Kuma shows pushes for daemon + 4 groups + +After Ctrl+C: + +- [ ] No runner processes remain (`ps aux | grep Runner.Listener`) +- [ ] Workdirs empty (`ls ~/.local/share/ghr/runners/`) +- [ ] No PID file (`ls ~/.local/state/ghr/daemon.pid`) +- [ ] No socket (`ls ~/.local/state/ghr/ghr.sock`) +- [ ] Log files exist with structured JSON entries + +After waiting 2+ minutes idle (before Ctrl+C): + +- [ ] Idle runners killed by health monitor (idle_timeout=2m) +- [ ] min_runners runners re-provisioned after idle kill diff --git a/tests/complete/config.yaml b/tests/complete/config.yaml new file mode 100644 index 0000000..278936e --- /dev/null +++ b/tests/complete/config.yaml @@ -0,0 +1,75 @@ +github: + url: "https://github.com/YOUR_ORG" + runner_group: "default" + +runner: + version: "latest" + +groups: + - name: "ghr-fast" + max_runners: 3 + min_runners: 1 + labels: + - "fast" 
+ - "macos" + + - name: "ghr-heavy" + max_runners: 2 + min_runners: 0 + labels: + - "heavy" + - "macos" + + - name: "ghr-deploy" + max_runners: 1 + min_runners: 0 + labels: + - "deploy" + - "macos" + health: + runner_timeout: "10m" + + - name: "ghr-single" + max_runners: 1 + min_runners: 1 + labels: + - "single" + - "macos" + +health: + enabled: true + check_interval: "10s" + runner_timeout: "5m" + idle_timeout: "2m" + divergence_timeout: "1m" + max_consecutive_failures: 3 + failure_cooldown: "30s" + min_disk_space: "500MB" + +logging: + level: "debug" + format: "text" + retention_days: 7 + runner_output: true + +notifications: + discord: + enabled: true + events: + - "health.*" + - "daemon.*" + - "runner.failed" + - "runner.timeout" + username: "ghr-test" + mentions: + error: "" + critical: "" + +monitoring: + uptime_kuma: + enabled: true + degraded_threshold: 0.5 + report_health_as_ping: true + +daemon: + shutdown_timeout: "15s" diff --git a/tests/complete/env.example b/tests/complete/env.example new file mode 100644 index 0000000..bb35ef1 --- /dev/null +++ b/tests/complete/env.example @@ -0,0 +1,10 @@ +# Discord webhook — required when notifications.discord.enabled = true +GHR_DISCORD_WEBHOOK_URL=https://discord.com/api/webhooks/XXXXXXXXXX/YYYYYYYY + +# Uptime Kuma — required when monitoring.uptime_kuma.enabled = true +GHR_UPTIME_KUMA_URL=https://uptime.example.com +GHR_UPTIME_KUMA_DAEMON_TOKEN=your-daemon-push-token +GHR_UPTIME_KUMA_TOKEN_GHR_FAST=your-fast-group-token +GHR_UPTIME_KUMA_TOKEN_GHR_HEAVY=your-heavy-group-token +GHR_UPTIME_KUMA_TOKEN_GHR_DEPLOY=your-deploy-group-token +GHR_UPTIME_KUMA_TOKEN_GHR_SINGLE=your-single-group-token diff --git a/tests/complete/validate.sh b/tests/complete/validate.sh new file mode 100755 index 0000000..fee7138 --- /dev/null +++ b/tests/complete/validate.sh @@ -0,0 +1,199 @@ +#!/bin/bash +set -uo pipefail + +LOG_DIR="${GHR_LOG_DIR:-$HOME/.local/share/ghr/logs}" +STATE_DIR="${GHR_STATE_DIR:-$HOME/.local/state/ghr}" 
+RUNNER_DIR="${GHR_RUNNER_DIR:-$HOME/.local/share/ghr/runners}" +PASS=0 +FAIL=0 +WARN=0 + +pass() { PASS=$((PASS + 1)); printf " \033[32m✓\033[0m %s\n" "$1"; } +fail() { FAIL=$((FAIL + 1)); printf " \033[31m✗\033[0m %s\n" "$1"; } +warn() { WARN=$((WARN + 1)); printf " \033[33m!\033[0m %s\n" "$1"; } +section() { printf "\n\033[1m%s\033[0m\n" "$1"; } + +TODAY=$(date +%Y-%m-%d) +DAEMON_LOG="$LOG_DIR/daemon/$TODAY.json" + +if [ ! -f "$DAEMON_LOG" ]; then + echo "ERROR: Daemon log not found at $DAEMON_LOG" + echo "Set GHR_LOG_DIR if logs are elsewhere." + exit 1 +fi + +section "=== Scale Set Management ===" + +GROUPS_STARTED=$(grep -c '"group listener started"' "$DAEMON_LOG" 2>/dev/null || echo 0) +if [ "$GROUPS_STARTED" -ge 4 ]; then pass "4 groups started ($GROUPS_STARTED listeners)" +else fail "Expected 4 groups, got $GROUPS_STARTED"; fi + +for g in ghr-fast ghr-heavy ghr-deploy ghr-single; do + if grep -q "\"group\":\"$g\"" "$DAEMON_LOG" 2>/dev/null; then + pass "Group $g active" + else + fail "Group $g not found in logs" + fi +done + +section "=== Runner Provisioning ===" + +TOTAL_PROVISIONED=$(grep -c '"runner provisioned"' "$DAEMON_LOG" 2>/dev/null || echo 0) +pass "Total runners provisioned: $TOTAL_PROVISIONED" + +for g in ghr-fast ghr-heavy ghr-deploy ghr-single; do + GROUP_LOG="$LOG_DIR/groups/$g/$TODAY.json" + if [ -f "$GROUP_LOG" ]; then + COUNT=$(grep -c '"runner provisioned"' "$GROUP_LOG" 2>/dev/null || echo 0) + pass " $g: $COUNT runners provisioned" + else + fail " $g: no group log found" + fi +done + +FAST_PROVISIONED=$(grep '"runner provisioned"' "$DAEMON_LOG" 2>/dev/null | grep -c '"group":"ghr-fast"' || echo 0) +if [ "$FAST_PROVISIONED" -ge 3 ]; then pass "ghr-fast scaled to 3+ runners" +else fail "ghr-fast only scaled to $FAST_PROVISIONED (expected >=3)"; fi + +section "=== Min Runners (Pre-provisioned) ===" + +DAEMON_START=$(grep '"ghr starting"' "$DAEMON_LOG" | head -1 | jq -r '.time' 2>/dev/null || echo "") +FIRST_LISTENER=$(grep '"group listener 
started"' "$DAEMON_LOG" | head -1 | jq -r '.time' 2>/dev/null || echo "") + +FAST_FIRST=$(grep '"runner provisioned"' "$DAEMON_LOG" | grep '"group":"ghr-fast"' | head -1 | jq -r '.time' 2>/dev/null || echo "") +FAST_FIRST_JOB=$(grep '"job started"' "$DAEMON_LOG" | grep '"group":"ghr-fast"' | head -1 | jq -r '.time' 2>/dev/null || echo "") + +if [ -n "$FAST_FIRST" ] && [ -n "$FAST_FIRST_JOB" ]; then + if [[ "$FAST_FIRST" < "$FAST_FIRST_JOB" ]]; then + pass "ghr-fast: runner provisioned BEFORE first job (min_runners=1)" + else + fail "ghr-fast: runner provisioned AFTER first job" + fi +else + warn "Cannot determine min_runners timing" +fi + +section "=== Job Execution ===" + +JOBS_STARTED=$(grep -c '"job started"' "$DAEMON_LOG" 2>/dev/null || echo 0) +JOBS_COMPLETED=$(grep -c '"job completed"' "$DAEMON_LOG" 2>/dev/null || echo 0) +JOBS_SUCCEEDED=$(grep '"job completed"' "$DAEMON_LOG" 2>/dev/null | grep -c '"result":"succeeded"' || echo 0) +JOBS_FAILED=$(grep '"job completed"' "$DAEMON_LOG" 2>/dev/null | grep -c '"result":"failed"' || echo 0) + +pass "Jobs started: $JOBS_STARTED" +pass "Jobs completed: $JOBS_COMPLETED" +pass " Succeeded: $JOBS_SUCCEEDED" +pass " Failed: $JOBS_FAILED" + +if [ "$JOBS_COMPLETED" -ge 18 ]; then pass "Enough jobs completed (>= 18)" +else fail "Only $JOBS_COMPLETED jobs completed (expected >= 18)"; fi + +if [ "$JOBS_FAILED" -ge 1 ]; then pass "At least 1 failed job detected (edge-fail)" +else fail "No failed job detected"; fi + +section "=== Concurrency ===" + +FAST_LOG="$LOG_DIR/groups/ghr-fast/$TODAY.json" +if [ -f "$FAST_LOG" ]; then + CONCURRENT=$(grep '"runner provisioned"' "$FAST_LOG" | head -3 | jq -r '.time[:19]' 2>/dev/null | sort -u | wc -l | tr -d ' ') + if [ "$CONCURRENT" -le 2 ]; then + pass "ghr-fast: 3 runners provisioned within same time window" + else + warn "ghr-fast: runners provisioned across $CONCURRENT distinct seconds" + fi +fi + +HEAVY_LOG="$LOG_DIR/groups/ghr-heavy/$TODAY.json" +if [ -f "$HEAVY_LOG" ]; then + 
HEAVY_PROV=$(grep -c '"runner provisioned"' "$HEAVY_LOG" 2>/dev/null || echo 0) + if [ "$HEAVY_PROV" -ge 2 ]; then pass "ghr-heavy: scaled to 2 runners" + else fail "ghr-heavy: only $HEAVY_PROV runners (expected >=2)"; fi +fi + +section "=== Sequential Enforcement (ghr-deploy max=1) ===" + +DEPLOY_LOG="$LOG_DIR/groups/ghr-deploy/$TODAY.json" +if [ -f "$DEPLOY_LOG" ]; then + DEPLOY_JOBS=$(grep -c '"job started"' "$DEPLOY_LOG" 2>/dev/null || echo 0) + DEPLOY_RUNNERS=$(grep '"runner provisioned"' "$DEPLOY_LOG" | jq -r '.runner' 2>/dev/null | sort -u | wc -l | tr -d ' ') + pass "ghr-deploy: $DEPLOY_JOBS jobs across $DEPLOY_RUNNERS unique runners" + if [ "$DEPLOY_JOBS" -ge 3 ]; then pass "ghr-deploy: all 3 deploy jobs ran" + else fail "ghr-deploy: only $DEPLOY_JOBS jobs (expected 3)"; fi +fi + +section "=== Job Failure Handling ===" + +FAILED_RUNNER=$(grep '"job completed"' "$DAEMON_LOG" | grep '"result":"failed"' | head -1 | jq -r '.runner' 2>/dev/null || echo "") +if [ -n "$FAILED_RUNNER" ]; then + pass "Failed job runner identified: $FAILED_RUNNER" + if grep -q "\"runner\":\"$FAILED_RUNNER\".*stopping" "$DAEMON_LOG" 2>/dev/null || \ + grep -q "stopping.*\"runner\":\"$FAILED_RUNNER\"" "$DAEMON_LOG" 2>/dev/null; then + pass "Failed runner was stopped and cleaned" + else + warn "Cannot confirm failed runner cleanup in logs" + fi +else + fail "No failed job runner found" +fi + +section "=== Runner Log Files ===" + +RUNNER_LOG_COUNT=$(find "$LOG_DIR/groups" -path "*/runners/*/$TODAY.json" -type f 2>/dev/null | wc -l | tr -d ' ') +pass "Runner log files created: $RUNNER_LOG_COUNT" + +for g in ghr-fast ghr-heavy ghr-deploy ghr-single; do + GROUP_RUNNERS=$(find "$LOG_DIR/groups/$g/runners" -name "$TODAY.json" -type f 2>/dev/null | wc -l | tr -d ' ') + pass " $g: $GROUP_RUNNERS runner logs" +done + +section "=== Duration Stats ===" + +if grep -q '"duration"' "$DAEMON_LOG" 2>/dev/null; then + pass "Job durations logged" + echo " Durations:" + grep '"job completed"' 
"$DAEMON_LOG" | jq -r ' " " + .runner + ": " + (.duration // "n/a")' 2>/dev/null | head -10 +else + warn "No duration data in logs" +fi + +section "=== Cleanup State ===" + +ORPHAN_PROCS=$(pgrep -f "Runner.Listener" 2>/dev/null | wc -l | tr -d ' ') +if [ "$ORPHAN_PROCS" -eq 0 ]; then pass "No orphan runner processes" +else fail "$ORPHAN_PROCS orphan processes found"; fi + +WORKDIR_CONTENT=$(find "$RUNNER_DIR" -mindepth 2 -maxdepth 2 -type d 2>/dev/null | wc -l | tr -d ' ') +if [ "$WORKDIR_CONTENT" -eq 0 ]; then pass "All runner workdirs cleaned" +else fail "$WORKDIR_CONTENT workdirs remain"; fi + +if [ ! -f "$STATE_DIR/daemon.pid" ]; then pass "PID file removed" +else fail "PID file still exists"; fi + +if [ ! -S "$STATE_DIR/ghr.sock" ]; then pass "Socket removed" +else fail "Socket still exists"; fi + +section "=== Log Structure ===" + +if [ -f "$LOG_DIR/daemon/$TODAY.json" ]; then pass "Daemon log exists" +else fail "Daemon log missing"; fi + +for g in ghr-fast ghr-heavy ghr-deploy ghr-single; do + if [ -f "$LOG_DIR/groups/$g/$TODAY.json" ]; then pass "Group log $g exists" + else fail "Group log $g missing"; fi +done + +DAEMON_LINES=$(wc -l < "$DAEMON_LOG" | tr -d ' ') +pass "Daemon log entries: $DAEMON_LINES" + +section "=== Notifications ===" + +NOTIF_EVENTS=$(grep '"runner.failed"\|"runner.started"\|"daemon.start"' "$DAEMON_LOG" 2>/dev/null | wc -l | tr -d ' ') +if [ "$NOTIF_EVENTS" -ge 1 ]; then pass "Notification events emitted: $NOTIF_EVENTS" +else warn "No notification events found in daemon log"; fi + +section "=========================================" +printf "\033[1m Results: \033[32m%d passed\033[0m, \033[31m%d failed\033[0m, \033[33m%d warnings\033[0m\n" "$PASS" "$FAIL" "$WARN" +section "=========================================" + +if [ "$FAIL" -gt 0 ]; then exit 1; fi +exit 0 diff --git a/tests/complete/workflow.yml b/tests/complete/workflow.yml new file mode 100644 index 0000000..d2efe36 --- /dev/null +++ b/tests/complete/workflow.yml @@ -0,0 
+1,548 @@ +name: ghr v2 complete test +on: + workflow_dispatch: + inputs: + stress_level: + description: "Number of parallel stress jobs per group" + default: "3" + type: choice + options: ["2", "3", "5"] + +jobs: + + # ============================================= + # PHASE 1: STARTUP VALIDATION + # Runs immediately — checks min_runners pre-provisioning + # ============================================= + + startup-fast: + runs-on: ghr-fast + steps: + - run: | + echo "Runner: $RUNNER_NAME" + echo "This runner should already exist (min_runners=1)" + echo "Startup time: $(date -u +%H:%M:%S)" + + startup-single: + runs-on: ghr-single + steps: + - run: | + echo "Runner: $RUNNER_NAME" + echo "This runner should already exist (min_runners=1)" + echo "Startup time: $(date -u +%H:%M:%S)" + + # ============================================= + # PHASE 2: CONCURRENT SCALE-UP + # Hit max_runners on each group simultaneously + # ============================================= + + fast-burst-1: + runs-on: ghr-fast + needs: [startup-fast] + steps: + - run: | + echo "fast-burst-1 | Runner: $RUNNER_NAME | $(date -u +%H:%M:%S)" + sleep 30 + + fast-burst-2: + runs-on: ghr-fast + needs: [startup-fast] + steps: + - run: | + echo "fast-burst-2 | Runner: $RUNNER_NAME | $(date -u +%H:%M:%S)" + sleep 30 + + fast-burst-3: + runs-on: ghr-fast + needs: [startup-fast] + steps: + - run: | + echo "fast-burst-3 | Runner: $RUNNER_NAME | $(date -u +%H:%M:%S)" + sleep 30 + + heavy-burst-1: + runs-on: ghr-heavy + needs: [startup-fast] + steps: + - run: | + echo "heavy-burst-1 | Runner: $RUNNER_NAME | $(date -u +%H:%M:%S)" + sleep 40 + + heavy-burst-2: + runs-on: ghr-heavy + needs: [startup-fast] + steps: + - run: | + echo "heavy-burst-2 | Runner: $RUNNER_NAME | $(date -u +%H:%M:%S)" + sleep 40 + + # ============================================= + # PHASE 3: QUEUING PRESSURE + # More jobs than max_runners — forces queuing + # ============================================= + + fast-queue-1: + runs-on: 
ghr-fast + needs: [startup-fast] + steps: + - run: | + echo "fast-queue-1 (may be queued) | Runner: $RUNNER_NAME" + sleep 20 + + fast-queue-2: + runs-on: ghr-fast + needs: [startup-fast] + steps: + - run: | + echo "fast-queue-2 (may be queued) | Runner: $RUNNER_NAME" + sleep 20 + + fast-queue-3: + runs-on: ghr-fast + needs: [startup-fast] + steps: + - run: | + echo "fast-queue-3 (may be queued) | Runner: $RUNNER_NAME" + sleep 20 + + heavy-queue-1: + runs-on: ghr-heavy + needs: [startup-fast] + steps: + - run: | + echo "heavy-queue-1 (may be queued) | Runner: $RUNNER_NAME" + sleep 25 + + # ============================================= + # PHASE 4: REAL WORKLOADS + # Simulate actual CI tasks + # ============================================= + + real-checkout: + runs-on: ghr-fast + needs: [fast-burst-1] + steps: + - uses: actions/checkout@v4 + - run: | + echo "=== Checkout completed ===" + echo "Files: $(find . -type f | wc -l)" + echo "Disk: $(du -sh .)" + ls -la + + real-build: + runs-on: ghr-heavy + needs: [heavy-burst-1] + steps: + - run: | + echo "=== Simulating real build ===" + mkdir -p build/output + for i in $(seq 1 20); do + dd if=/dev/urandom bs=1024 count=100 of=build/output/artifact-$i.bin 2>/dev/null + echo "Built artifact $i/20" + done + echo "Total size: $(du -sh build/)" + echo "Disk free: $(df -h / | tail -1)" + + real-test-matrix: + runs-on: ghr-fast + needs: [fast-burst-2] + strategy: + matrix: + test-suite: [unit, integration, e2e] + fail-fast: false + steps: + - run: | + echo "=== Test suite: ${{ matrix.test-suite }} ===" + echo "Runner: $RUNNER_NAME" + case "${{ matrix.test-suite }}" in + unit) sleep 10; echo "47 tests passed" ;; + integration) sleep 15; echo "23 tests passed" ;; + e2e) sleep 20; echo "8 tests passed" ;; + esac + + real-cpu-stress: + runs-on: ghr-heavy + needs: [heavy-burst-2] + steps: + - run: | + echo "=== CPU stress test ===" + echo "Runner: $RUNNER_NAME" + echo "Cores: $(sysctl -n hw.ncpu)" + echo "Starting prime 
calculation..." + start=$(date +%s) + python3 -c " + import math + primes = [] + for n in range(2, 50000): + if all(n % p != 0 for p in primes): + primes.append(n) + print(f'Found {len(primes)} primes up to 50000') + " + end=$(date +%s) + echo "Duration: $((end - start))s" + + real-disk-io: + runs-on: ghr-heavy + needs: [real-build] + steps: + - run: | + echo "=== Disk I/O test ===" + echo "Runner: $RUNNER_NAME" + mkdir -p /tmp/ghr-io-test + echo "Writing 100MB..." + dd if=/dev/zero bs=1M count=100 of=/tmp/ghr-io-test/testfile 2>&1 + echo "Reading back..." + dd if=/tmp/ghr-io-test/testfile of=/dev/null bs=1M 2>&1 + rm -rf /tmp/ghr-io-test + echo "Disk free after cleanup: $(df -h / | tail -1)" + + real-network: + runs-on: ghr-fast + needs: [fast-burst-3] + steps: + - run: | + echo "=== Network test ===" + echo "Runner: $RUNNER_NAME" + echo "DNS resolution..." + nslookup github.com + echo "HTTP request..." + curl -s -o /dev/null -w "Status: %{http_code}\nTime: %{time_total}s\nSize: %{size_download} bytes\n" https://api.github.com + echo "IP: $(curl -s ifconfig.me)" + + real-env-check: + runs-on: ghr-single + needs: [startup-single] + steps: + - run: | + echo "=== Environment check ===" + echo "Runner: $RUNNER_NAME" + echo "OS: $(sw_vers -productName) $(sw_vers -productVersion)" + echo "Arch: $(uname -m)" + echo "Shell: $SHELL" + echo "User: $(whoami)" + echo "Home: $HOME" + echo "Cores: $(sysctl -n hw.ncpu)" + echo "Memory: $(sysctl -n hw.memsize | awk '{print $1/1024/1024/1024 " GB"}')" + echo "Disk: $(df -h / | tail -1)" + echo "Go: $(go version 2>/dev/null || echo 'not installed')" + echo "Python: $(python3 --version 2>/dev/null || echo 'not installed')" + echo "Node: $(node --version 2>/dev/null || echo 'not installed')" + echo "Git: $(git --version)" + + # ============================================= + # PHASE 5: ERROR HANDLING + # Various failure modes + # ============================================= + + error-exit-1: + runs-on: ghr-fast + needs: 
[fast-queue-1] + steps: + - run: echo "About to fail with exit 1" + - run: exit 1 + + error-exit-2: + runs-on: ghr-fast + needs: [fast-queue-2] + steps: + - run: exit 2 + + error-bad-command: + runs-on: ghr-fast + needs: [fast-queue-3] + steps: + - run: this-command-does-not-exist-at-all + continue-on-error: true + - run: echo "Continued after bad command" + + error-timeout: + runs-on: ghr-fast + needs: [error-exit-1] + if: always() + timeout-minutes: 1 + steps: + - run: | + echo "This job has a 1 minute timeout" + echo "Sleeping 45s (under the limit)..." + sleep 45 + echo "Finished before timeout" + + error-recovery: + runs-on: ghr-fast + needs: [error-exit-1, error-exit-2] + if: always() + steps: + - run: | + echo "=== Recovery after failures ===" + echo "Runner: $RUNNER_NAME" + echo "This proves the group still works after failed jobs" + + # ============================================= + # PHASE 6: SEQUENTIAL PIPELINE (ghr-deploy) + # Strict ordering, max=1 enforcement + # ============================================= + + deploy-validate: + runs-on: ghr-deploy + needs: [real-build, real-test-matrix] + steps: + - run: | + echo "=== Deploy: validation ===" + echo "Runner: $RUNNER_NAME | $(date -u +%H:%M:%S)" + sleep 10 + echo "Validation passed" + + deploy-staging: + runs-on: ghr-deploy + needs: [deploy-validate] + steps: + - run: | + echo "=== Deploy: staging ===" + echo "Runner: $RUNNER_NAME | $(date -u +%H:%M:%S)" + sleep 15 + echo "Staging deployed" + + deploy-smoke-test: + runs-on: ghr-deploy + needs: [deploy-staging] + steps: + - run: | + echo "=== Deploy: smoke test ===" + echo "Runner: $RUNNER_NAME | $(date -u +%H:%M:%S)" + sleep 5 + echo "Smoke test passed" + + deploy-production: + runs-on: ghr-deploy + needs: [deploy-smoke-test] + steps: + - run: | + echo "=== Deploy: production ===" + echo "Runner: $RUNNER_NAME | $(date -u +%H:%M:%S)" + sleep 15 + echo "Production deployed" + + deploy-verify: + runs-on: ghr-deploy + needs: [deploy-production] + steps: 
+ - run: | + echo "=== Deploy: verification ===" + echo "Runner: $RUNNER_NAME | $(date -u +%H:%M:%S)" + sleep 5 + echo "Production verified" + + # ============================================= + # PHASE 7: SECOND WAVE + # After first batch completes — tests scale-down then scale-up + # ============================================= + + wave2-fast-1: + runs-on: ghr-fast + needs: [error-recovery, real-network] + steps: + - run: | + echo "=== Wave 2 fast-1 ===" + echo "Runner: $RUNNER_NAME" + echo "Runners should have scaled down then back up" + sleep 15 + + wave2-fast-2: + runs-on: ghr-fast + needs: [error-recovery, real-network] + steps: + - run: | + echo "=== Wave 2 fast-2 ===" + echo "Runner: $RUNNER_NAME" + sleep 15 + + wave2-fast-3: + runs-on: ghr-fast + needs: [error-recovery, real-network] + steps: + - run: | + echo "=== Wave 2 fast-3 ===" + echo "Runner: $RUNNER_NAME" + sleep 15 + + wave2-heavy: + runs-on: ghr-heavy + needs: [real-disk-io, real-cpu-stress] + steps: + - run: | + echo "=== Wave 2 heavy ===" + echo "Runner: $RUNNER_NAME" + sleep 20 + + wave2-single-1: + runs-on: ghr-single + needs: [real-env-check] + steps: + - run: | + echo "=== Wave 2 single-1 ===" + echo "Runner: $RUNNER_NAME" + sleep 10 + + wave2-single-2: + runs-on: ghr-single + needs: [wave2-single-1] + steps: + - run: | + echo "=== Wave 2 single-2 ===" + echo "Runner: $RUNNER_NAME" + sleep 10 + + # ============================================= + # PHASE 8: RAPID FIRE + # Many instant jobs to stress provisioning/cleanup + # ============================================= + + rapid-1: + runs-on: ghr-fast + needs: [wave2-fast-1] + steps: + - run: echo "rapid-1 | $RUNNER_NAME" + + rapid-2: + runs-on: ghr-fast + needs: [wave2-fast-1] + steps: + - run: echo "rapid-2 | $RUNNER_NAME" + + rapid-3: + runs-on: ghr-fast + needs: [wave2-fast-1] + steps: + - run: echo "rapid-3 | $RUNNER_NAME" + + rapid-4: + runs-on: ghr-fast + needs: [rapid-1] + steps: + - run: echo "rapid-4 | $RUNNER_NAME" + + rapid-5: + 
runs-on: ghr-fast + needs: [rapid-2] + steps: + - run: echo "rapid-5 | $RUNNER_NAME" + + rapid-6: + runs-on: ghr-fast + needs: [rapid-3] + steps: + - run: echo "rapid-6 | $RUNNER_NAME" + + # ============================================= + # PHASE 9: LONG RUNNING + # Tests that runners survive for longer periods + # ============================================= + + long-running: + runs-on: ghr-heavy + needs: [wave2-heavy] + steps: + - run: | + echo "=== Long running job ===" + echo "Runner: $RUNNER_NAME" + echo "Start: $(date -u +%H:%M:%S)" + for i in $(seq 1 12); do + echo "Tick $i/12: $(date -u +%H:%M:%S) | Memory: $(vm_stat | head -5 | tail -1)" + sleep 10 + done + echo "End: $(date -u +%H:%M:%S)" + echo "Total: ~2 minutes" + + # ============================================= + # PHASE 10: CROSS-GROUP JOB OUTPUTS + # Tests data passing between jobs on different groups + # ============================================= + + output-producer: + runs-on: ghr-heavy + needs: [long-running] + outputs: + build_id: ${{ steps.gen.outputs.build_id }} + timestamp: ${{ steps.gen.outputs.timestamp }} + steps: + - id: gen + run: | + BUILD_ID="build-$(date +%s)-$(openssl rand -hex 4)" + echo "build_id=$BUILD_ID" >> $GITHUB_OUTPUT + echo "timestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> $GITHUB_OUTPUT + echo "Generated: $BUILD_ID" + + output-consumer-fast: + runs-on: ghr-fast + needs: [output-producer] + steps: + - run: | + echo "=== Cross-group output ===" + echo "Build ID from heavy group: ${{ needs.output-producer.outputs.build_id }}" + echo "Timestamp: ${{ needs.output-producer.outputs.timestamp }}" + + output-consumer-deploy: + runs-on: ghr-deploy + needs: [output-producer, deploy-verify] + steps: + - run: | + echo "=== Deploy with build ID ===" + echo "Deploying build: ${{ needs.output-producer.outputs.build_id }}" + sleep 5 + + # ============================================= + # SUMMARY + # ============================================= + + summary: + runs-on: ghr-fast + needs:
+ - wave2-fast-2 + - wave2-fast-3 + - wave2-single-2 + - rapid-4 + - rapid-5 + - rapid-6 + - error-timeout + - error-bad-command + - output-consumer-fast + - output-consumer-deploy + - long-running + if: always() + steps: + - run: | + echo "=========================================" + echo " ghr v2 complete test — SUMMARY" + echo "=========================================" + echo "" + echo "Time: $(date -u)" + echo "Runner: $RUNNER_NAME | Host: $(hostname)" + echo "" + echo "Groups exercised:" + echo " ghr-fast (max=3, min=1) — burst, queue, matrix, rapid fire" + echo " ghr-heavy (max=2, min=0) — build, CPU, disk I/O, long run" + echo " ghr-deploy (max=1, min=0) — 5-stage pipeline, sequential" + echo " ghr-single (max=1, min=1) — always-on, env check" + echo "" + echo "Scenarios tested:" + echo " [Scale] Pre-provisioned min_runners" + echo " [Scale] Burst to max_runners" + echo " [Scale] Queuing under pressure" + echo " [Scale] Scale-down between waves" + echo " [Scale] Scale-up on second wave" + echo " [Work] Git checkout" + echo " [Work] File I/O (100MB write/read)" + echo " [Work] CPU stress (prime sieve)" + echo " [Work] Network (DNS + HTTP)" + echo " [Work] Matrix strategy (3 suites)" + echo " [Work] Cross-group job outputs" + echo " [Work] Long running (2 min)" + echo " [Error] exit 1 / exit 2" + echo " [Error] Bad command + continue-on-error" + echo " [Error] Job timeout (1 min limit)" + echo " [Error] Recovery after failures" + echo " [Pipeline] 5-stage deploy (validate→staging→smoke→prod→verify)" + echo " [Rapid] 6 instant jobs back-to-back" + echo " [Env] System info (OS, arch, memory, disk)" + echo "" + echo "=========================================" diff --git a/tests/simple/README.md b/tests/simple/README.md new file mode 100644 index 0000000..8d5c6ee --- /dev/null +++ b/tests/simple/README.md @@ -0,0 +1,22 @@ +# Simple Test + +Minimal test: 1 group, 1 runner, 1 job. 
+ +## Setup + +```bash +ghr login +ghr run --config tests/simple/config.yaml +``` + +## Trigger + +Copy `workflow.yml` to `.github/workflows/test-simple.yml` in your repo. +Run it from GitHub Actions > "Run workflow". + +## Expected + +- 1 scale set created +- 1 runner provisioned on job dispatch +- Job completes, runner cleaned up +- Ctrl+C stops cleanly diff --git a/tests/simple/config.yaml b/tests/simple/config.yaml new file mode 100644 index 0000000..23dc600 --- /dev/null +++ b/tests/simple/config.yaml @@ -0,0 +1,3 @@ +groups: + - name: "test-simple" + max_runners: 1 diff --git a/tests/simple/workflow.yml b/tests/simple/workflow.yml new file mode 100644 index 0000000..9569eef --- /dev/null +++ b/tests/simple/workflow.yml @@ -0,0 +1,8 @@ +name: ghr simple test +on: workflow_dispatch + +jobs: + hello: + runs-on: test-simple + steps: + - run: echo "Hello from ghr!"