Skip to content

feat(operator): operator UI, GitHub PAT auth, AI rule suggester#7

Merged
cbullinger merged 25 commits intomainfrom
feat/operator-ui-audit
May 2, 2026
Merged

feat(operator): operator UI, GitHub PAT auth, AI rule suggester#7
cbullinger merged 25 commits intomainfrom
feat/operator-ui-audit

Conversation

@cbullinger
Copy link
Copy Markdown
Collaborator

@cbullinger cbullinger commented Apr 21, 2026

Summary

Builds out the operator UI into a production-ready diagnostic + self-service tool, gated by GitHub PAT authentication with repo-derived roles, and adds an AI rule suggester backed by the hosted Anthropic API through the Grove Foundry APIM gateway.

Commits on feat/operator-ui-auditmain:

SHA What
8f39997 Comprehensive operator UI with writer-facing features
fe927fd Optional GitHub PAT auth with role + per-repo access checks
80c5da9 Remove token auth mode, require GitHub PAT + auth repo
8c1a76a AI rule suggester with full UI-based Ollama management
2735584 gosec G107/G704 SSRF hardening on the GitHub permission check
3551a0b Show operator UI + AI settings in the startup banner
c8508e5 Wire operator UI, enable audit, populate changelog
473bdb3 Restore v0.3.0 heading in CHANGELOG
62b6f33 Anthropic LLM provider for AI rule suggester
b75df04 Route Anthropic client through Grove Foundry APIM gateway
579edb9 cmd/test-llm smoke-test CLI

Configuration

Env var Required Description
OPERATOR_UI_ENABLED for operator UI true to mount /operator/ routes
OPERATOR_AUTH_REPO when UI enabled owner/repo — the user's permission on this repo determines their role
OPERATOR_REPO_SLUG optional For clickable GitHub links in audit/trace rows
OPERATOR_RELEASE_GITHUB_TOKEN optional PAT with contents:write to create release tags from the UI
LLM_PROVIDER optional anthropic (default in prod) or ollama (local)
LLM_BASE_URL optional Per-provider default. For the Grove Foundry gateway: https://grove-gateway-prod.azure-api.net/grove-foundry-prod/anthropic
LLM_MODEL optional Per-provider default. claude-haiku-4-5 for the gateway, qwen2.5-coder:7b for Ollama
ANTHROPIC_API_KEY for anthropic (dev) Direct value; overrides Secret Manager if set
ANTHROPIC_API_KEY_SECRET_NAME for anthropic (prod) Secret Manager resource to load on boot; used in Cloud Run

Startup fails if OPERATOR_UI_ENABLED=true but OPERATOR_AUTH_REPO is missing — prevents an accidentally-open operator UI. Missing ANTHROPIC_API_KEY is non-fatal: the UI shows "not configured" and every other feature keeps working.

Authentication & authorization

Each user authenticates with their personal GitHub PAT. Their permission on OPERATOR_AUTH_REPO determines their role:

GitHub permission Operator UI role Can do
admin / maintain operator Everything: view, replay, release, AI settings
write / triage / read writer View workflows, audit, recent copies, AI rule suggester. No replay, no release, no AI settings
None / 404 denied 401 Unauthorized

write is deliberately mapped to writer, not operator: most writers on docs repos have write on the auth repo, so mapping write → operator would give every writer replay and release capability. Operator actions require an explicit admin or maintain grant.

On top of the role, replay is repo-scoped: the user's PAT must also have read access to the source repo of the webhook being replayed. This prevents someone who has operator access to the copier from replaying webhooks for private source repos they can't read. Checked on every replay call and cached per (token, repo) with a 5-minute TTL.

What's new

Backend APIs

  • GET /operator/api/workflows — parsed copier config for the workflow browser
  • POST /operator/api/replay — replay a webhook delivery (in-flight dedup, merged-PR check, per-repo permission check)
  • GET /operator/api/logs?delivery_id=X — per-delivery log viewer via context-tagged ring buffer
  • GET /operator/api/me — authenticated user: {login, avatar_url, role}
  • GET /operator/api/repo-permission?repos=... — batch permission check for frontend button state
  • POST /operator/api/suggest-rule — LLM-generated copier rule from a source→target example, self-verified against the in-process PatternMatcher
  • GET /operator/api/llm/status — provider, reachability, active model, installed models, supports_model_mgmt flag
  • POST /operator/api/llm/settings — switch active model / base URL at runtime (in-memory, resets on restart)
  • POST /operator/api/llm/pull (Ollama only) — NDJSON progress streaming for model pulls
  • DELETE /operator/api/llm/model?name= (Ollama only) — remove a pulled model
  • PR number + file path filters on GET /operator/api/audit/events
  • CommitSHA added to WebhookTraceEntry and CopierFileMeta

AI rule suggester

  • Two providers, one interfaceLLMClient interface with concrete ollamaClient and anthropicClient implementations. NewLLMClient dispatches on LLM_PROVIDER.
  • Anthropic (hosted, default in prod) — calls /v1/messages through the Grove Foundry APIM gateway. Sends both x-api-key (native Anthropic) and api-key (Azure APIM) headers so the same client works with either endpoint. Ping uses a 1-token /v1/messages call (the gateway doesn't expose /v1/models). ListModels falls back to a static list when the endpoint isn't reachable.
  • Ollama (local) — full UI-based management: pull models with streaming progress, delete, switch active. Used for dev or any environment with a reachable Ollama VM.
  • Self-verifying — every generated rule is run through the in-process PatternMatcher before display; mismatches surface a warning.

Bug fixes

  • ObjectID decoding for mongo-driver v2 (ObjectIDAsHexString) — audit queries work again
  • Empty-commit PR creation bug (errTreeUnchanged sentinel) — skips cleanly instead of 422'ing on empty branches
  • Keyboard-shortcut overlay wouldn't close — .help-bg[hidden] now wins over the base display:flex
  • File match tester returned no matches for Java files — JavaScript RegExp doesn't support Python-style (?P<name>); rewrites (?P<(?< before compilation

Frontend

  • 5-tab layout: Overview, Webhooks, Audit, Workflows, System
  • Writer/Operator mode toggle (writers see a trimmed tab set)
  • Sticky status bar: version, uptime, MongoDB health, last webhook, connection status, GitHub avatar + username
  • Dark mode with full CSS variable theming
  • Collapsible sections (persisted per-section in localStorage)
  • Grouped metrics with delta indicators, sparklines, health accent borders
  • Audit event detail drawer with replay button
  • File match tester — client-side glob/regex/move/copy matching with computed target paths
  • AI rule suggester panel — source + target paths in, generated rule out, with Verified/Not-Verified badge
  • AI settings panel — status line labels the provider ("Anthropic connected at…" vs "Ollama connected at…"); pull/delete sections hidden for hosted providers; provider-specific onboarding when disconnected
  • PR lookup combining traces + audit into one timeline
  • Recent copies card-based feed
  • Workflow browser with search and getting-started help
  • Per-delivery log viewer in a side drawer (levels color-coded)
  • Replay buttons hidden for writers; disabled with tooltip for source repos the user can't access
  • CSV export for audit events
  • Shareable state URLs (?tab=audit&pr=123&mode=writer)
  • Keyboard shortcuts: 1-5 tabs, R refresh, D dark mode, T time format, W writer/operator, ? help

Security

  • Token-based auth removed entirely — GitHub PAT is the only supported mechanism
  • gosec G107/G704 SSRF findings fixed: GitHub API URL construction validates path components against strict RE2-compatible whitelists (ghUsernameRe, ghRepoNameRe) and escapes them with url.PathEscape
  • slack_notifier.go #nosec annotation extended to cover NewRequestWithContext
  • Anthropic API key loaded from GCP Secret Manager in prod (same pattern as webhook secret and Mongo URI); never logged

Deploy prep

  • .github/workflows/ci.yml — adds operator UI + AI env vars to the Cloud Run deploy, flips AUDIT_ENABLED to true (aligning with the v0.3.0 "enabled by default" CHANGELOG entry)
  • env-cloudrun.yaml — parity with the CI deploy for the manual fallback path
  • CHANGELOG.md[Unreleased] populated so scripts/release.sh vX.Y.Z can extract release notes cleanly
  • Startup banner — surfaces provider, model, base URL, auth repo at boot
  • cmd/test-llm — smoke-test CLI to verify provider connectivity end-to-end after a key rotation or gateway change

One-time deploy prep (before cutting the next tag)

Create the Anthropic gateway key secret on the prod GCP project and grant the Cloud Run runtime service account read access:

# Secret name must match ANTHROPIC_API_KEY_SECRET_NAME in the deploy env
gcloud secrets create anthropic-api-key --data-file=- \
  --project=github-copy-code-examples <<< "$GATEWAY_KEY"

# Find the runtime SA (or check existing bindings on webhook-secret):
RUNTIME_SA=$(gcloud run services describe examples-copier \
  --region=us-central1 --project=github-copy-code-examples \
  --format='value(spec.template.spec.serviceAccountName)')

gcloud secrets add-iam-policy-binding anthropic-api-key \
  --member=serviceAccount:${RUNTIME_SA} \
  --role=roles/secretmanager.secretAccessor \
  --project=github-copy-code-examples

After merge, ./scripts/release.sh v0.4.0 cuts the tag and the Cloud Run deploy picks up the new provider.

Test plan

Local dev

  • Export OPERATOR_UI_ENABLED=true without OPERATOR_AUTH_REPO — startup fails with a clear error
  • Export OPERATOR_UI_ENABLED=true + OPERATOR_AUTH_REPO=<repo> — server starts, /operator/ loads
  • Paste your GitHub PAT — user chip appears with correct role (operator for admin/write, writer for read)
  • As a writer, verify replay/release/AI-settings are hidden
  • As an operator, Replay a webhook trace for a repo you have access to — reprocesses cleanly
  • Attempt replay for a source repo you can't read — button disabled with tooltip; API returns 403
  • File match tester — matching rules highlighted with correct target paths
  • AI rule suggester (Anthropic via Grove gateway) — verified via cmd/test-llm -env .env.test; Ping, ListModels, and a real suggester prompt all succeed
  • AI rule suggester (Ollama, local) — connect, pull a model, switch active, generate a rule

Hosted (Cloud Run) — after merge

  • Create anthropic-api-key secret and grant Secret Manager access to the runtime SA (see above)
  • ./scripts/release.sh v0.4.0 → verify Cloud Run deploys the new revision
  • Open https://<service>/operator/ — all 5 tabs load with a valid PAT
  • AI settings shows "Anthropic connected at https://grove-gateway-prod.azure-api.net/…"
  • Generate a rule via the AI suggester — verified badge appears
  • Merge a PR that matches a workflow — appears in Webhooks, Audit, and Recent copies
  • Click Logs on a trace — log entries appear in the drawer

@cbullinger cbullinger changed the title feat(operator): comprehensive operator UI + GitHub PAT auth feat(operator): operator UI, GitHub PAT auth, AI rule suggester Apr 22, 2026
cbullinger added a commit that referenced this pull request Apr 22, 2026
- handleRepoPermission: return {allowed, error?} per repo instead of a
  raw bool map. The frontend now shows GitHub rate-limit / transient
  errors in replay-button tooltips instead of generic "no access". UI
  keeps boolean-shape compat for partial rollouts. (review #7)

- githubCreateVersionTag: apply the same ghUsernameRe / ghRepoNameRe
  whitelist + url.PathEscape treatment ghAPIGetRepoPermission already
  uses. Also add a ghBranchNameRe check. Env-vars only today, but keeps
  the SSRF hardening story consistent across the package. (review #8)

- sharedGithubHTTPClient package var replaces per-call allocation in
  githubHTTPClient(). (review minor)

- ReleaseAPIMode typed string with ReleaseAPIDisabled /
  ReleaseAPITagCreateEnabled constants. (review minor)

- truncate() is now rune-aware so multi-byte LLM output isn't cut
  mid-glyph in error logs. (review minor)

- Replay delivery IDs append a 3-byte random suffix so two replays in
  the same millisecond can't collide on the trace ring. (review minor)

- Anthropic fallback model list trimmed to one stable alias
  (claude-haiku-4-5). Listing rotating dated IDs here ships dead
  dropdown options on every Claude release. (review #9)

- Normalize the Anthropic default model to the aliased form
  ("claude-haiku-4-5") in configs/environment.go and llm_anthropic.go
  so every default path matches ci.yml and env-cloudrun.yaml. (review #10)

- AI settings UI hint calls out that settings are process-global,
  affect all operators, and revert on restart. Fixes the "two operators
  clobber each other silently" footgun. (review #5)
Wire MongoDB copy audit events after target uploads; add /operator
dashboard with audit feed, deployment metadata, and optional tag API.

Made-with: Cursor
Backend:
- Add workflow config API (GET /operator/api/workflows) serving parsed config
- Add webhook replay API (POST /operator/api/replay) with safety guards
  (in-flight dedup, merged-PR check, synthetic delivery ID)
- Add per-delivery log viewer (GET /operator/api/logs) with context-tagged
  log capture via DeliveryLogBuffer ring buffer
- Add PR number and file path filters to audit query API
- Fix ObjectID decoding for mongo-driver v2 (ObjectIDAsHexString)
- Fix constant-time token comparison leaking token length
- Fix empty-commit PR creation bug (errTreeUnchanged sentinel)
- Add CommitSHA to WebhookTraceEntry and CopierFileMeta
- Enrich webhook trace detail with target repos and file counts
- Add uptime + MongoDB health check to deployment API

Frontend - operator UI rewrite:
- 5-tab layout: Overview, Webhooks, Audit, Workflows, System
- Writer/Operator mode toggle (hides infrastructure tabs for writers)
- Sticky status bar: version, uptime, MongoDB health, last webhook, connection
- Dark mode with full CSS variable theming
- Collapsible sections with localStorage persistence
- Grouped metrics with delta indicators, sparklines, health accent borders
- Grouped deployment cards with copy-to-clipboard
- Structured detail badges in webhook trace Detail column
- Audit event detail drawer with all fields + replay button
- File match tester: client-side glob/regex/move/copy matching against
  loaded workflow config with target path computation
- PR lookup combining traces + audit into unified timeline view
- Recent copies card-based feed
- Workflow browser with search and visual cards
- In-app help / getting started documentation
- CSV export for audit events
- Toast notifications for background auto-refresh errors
- Connection heartbeat (60s ping with disconnected indicator)
- Auto-refresh with persisted checkbox state
- Shareable state URLs via history.replaceState
- Deep-linkable filters, tabs, and mode (?tab=audit&pr=123&mode=writer)
- Keyboard shortcuts (1-5 tabs, R refresh, D dark, T time, W mode, ? help)
- Empty state messages, expandable errors, GitHub links throughout
- Responsive audit table (hides SHA/Type columns on narrow screens)
- Token show/hide toggle, release confirmation dialog, inline favicon
- Relative/absolute time toggle with periodic update
…checks

Adds a second authentication mode (OPERATOR_AUTH_MODE=github) that validates
the bearer token as a GitHub Personal Access Token instead of a shared secret.
The existing shared-token mode remains the default for backward compatibility.

Backend:
- New config: OPERATOR_AUTH_MODE (token|github), OPERATOR_AUTH_REPO
- operator_auth.go: GitHub PAT validation with GET /user + per-user permission
  check on OPERATOR_AUTH_REPO. Maps GitHub permissions to operator/writer/denied
  roles. 5-minute cache keyed by PAT to avoid hitting GitHub on every request.
- operator_ui.go: wrapAPI supports both auth modes, wrapOperatorOnly enforces
  the operator role on write endpoints (replay, release).
- Per-repo permission check on POST /operator/api/replay — the user's PAT must
  have at least read access to the source repo being replayed (in github mode
  only; token mode unchanged). Cached per token+repo pair.
- New endpoints: GET /operator/api/me returns the authenticated user + role;
  GET /operator/api/repo-permission?repos=... batch-checks repo read access
  for frontend button state.
- /operator/api/status now includes auth_mode so the frontend can adapt.

Frontend:
- User chip in top bar shows GitHub avatar + username + role suffix
- Dynamic token input label: "GitHub token" in github mode
- body.role-writer hides elements with .operator-only class (replay, release)
- Replay buttons pre-check per-repo access via fetchRepoPermissions and
  disable with an explanatory tooltip for inaccessible source repos
- Token clear resets user state and permission cache
… repo

Simplifies the operator UI authentication to a single path: GitHub PAT
validation with role assignment from OPERATOR_AUTH_REPO permissions.

Prior: two auth modes (shared secret via OPERATOR_UI_TOKEN, or GitHub PAT via
OPERATOR_AUTH_MODE=github). The shared-secret mode had no user identity, no
per-repo restrictions, and a single token that could be leaked.

Now: OPERATOR_UI_ENABLED=true always uses GitHub PAT auth. Each user
authenticates with their personal token, and OPERATOR_AUTH_REPO (required)
determines their role (operator for write/admin, writer for read/triage).

Config:
- Remove OPERATOR_UI_TOKEN env var and Config.OperatorUIToken field
- Remove OPERATOR_AUTH_MODE env var and Config.OperatorAuthMode field
- validateOperatorAuth now requires OPERATOR_AUTH_REPO in owner/repo format
  when the UI is enabled — startup fails with a clear error otherwise

Backend (operator_ui.go):
- Remove operatorAuthOK() and crypto/subtle import (no shared secret to compare)
- wrapAPI: always validate bearer token as GitHub PAT, attach user to context
- wrapOperatorOnly: always enforce operator role
- handleOperatorStatus: return auth_repo instead of auth_mode
- handleMe: user always present in context; 500 if somehow missing
- handleRepoPermission: always checks real permissions via ghCache
- Replay handler: unconditionally enforces source-repo permission check
- ghCache always initialized (no longer gated on mode)

Frontend (index.html):
- Token input label always "GitHub Personal Access Token" with auth repo shown
- Remove conditional label/placeholder logic based on auth_mode
- Status check reads auth_repo and displays it in the token input label

Startup:
- app.go root page shows "authenticate with a GitHub PAT; role from <repo>"
- operator_ui.go startup log includes auth_repo
Adds an LLM-powered rule suggester in the Workflows tab. Given a source
file path and a desired target file path, the LLM proposes a copier
workflow rule and the server verifies the rule actually produces the
target before returning it to the user.

All Ollama management (connect/disconnect, list installed models,
pull/delete, switch active model, change base URL) happens from the UI.
No LLM_ENABLED env var or restart required — availability is detected
at runtime via an Ollama ping.

Backend:
- services/llm_client.go: pluggable LLMClient interface with an Ollama
  implementation. Mutex-protected mutable state (active model, base URL)
  so operators can switch models at runtime. Streams pull progress as
  NDJSON, exposes ListModels/Ping/DeleteModel/PullModel.
- services/operator_suggest_rule.go: POST /operator/api/suggest-rule
  prompts the LLM with a bounded schema (move/copy/glob/regex) and
  verifies the generated rule against the user's source/target using the
  existing PatternMatcher + PathTransformer. Returns YAML + explanation
  + verification status + computed_path so writers see exactly what the
  rule would produce.
- services/operator_llm_admin.go: four admin endpoints
    GET  /operator/api/llm/status    reachability, models, active model
    POST /operator/api/llm/settings  set active model / base URL at runtime
    POST /operator/api/llm/pull      stream NDJSON progress from Ollama
    DELETE /operator/api/llm/model   remove a model
  Write operations require operator role.
- operator_ui.go: always instantiate the LLM client (no env gate), expose
  llm_available via /operator/api/status.
- configs/environment.go: LLM_BASE_URL and LLM_MODEL are now optional
  initial defaults; LLM_ENABLED is removed entirely.

Frontend (index.html):
- AI settings panel in the Workflows tab (operator-only): connection
  status chip, installed-models list with delete buttons, active-model
  selector, pull-model input with live streaming progress bar, editable
  base URL. Shows install instructions with a link to ollama.com/download
  when Ollama isn't reachable.
- AI rule suggester panel: source + target inputs (plus optional target
  repo), generate button with long-running spinner, result card with
  verification badge (green or red), computed path on failure, YAML
  code block with copy-to-clipboard, and the LLM's explanation.
- Pre-fills the AI suggester's source from the file match tester if set.
Gosec CI was failing on 3 SSRF findings via taint analysis:

operator_auth.go (2 findings): ghAPIGetRepoPermission constructs a GitHub
API URL using the repo owner/name and username. While the host is
hardcoded, gosec flags the user-derived path components.
- Add strict input validation: ghUsernameRe and ghRepoNameRe whitelists
  enforce GitHub's username and repo name rules before URL construction
  (using RE2-compatible syntax — no lookahead)
- Use url.PathEscape on path components as defense in depth
- Add #nosec G107 G704 with justification on the Request and Do calls

slack_notifier.go (1 finding): sendPayload had an existing nosec on the
Do call but gosec now also flags NewRequestWithContext on the line above.
- Extend the nosec annotation to cover the Request call

No behavioral change — only input validation tightening and nosec
coverage on calls where the URL is either hardcoded (GitHub API) or
comes from trusted server config (Slack webhook).
Surface OPERATOR_UI_ENABLED, auth repo, AI model, and AI base URL in the
startup banner so local dev runs make the active operator/AI configuration
visible at a glance. Adds a truncMiddle helper (ASCII ellipsis) to keep
long paths and URLs from breaking the banner's byte-count-aligned padding.
Prep the Cloud Run deploy for the operator UI + AI rule suggester work:

- CI deploy env vars (ci.yml):
  - OPERATOR_UI_ENABLED=true
  - OPERATOR_AUTH_REPO=grove-platform/github-copier (required when UI is on)
  - OPERATOR_REPO_SLUG=grove-platform/github-copier (for audit-row deep links)
  - AUDIT_ENABLED flipped false → true, aligning with the v0.3.0 "enabled by
    default" CHANGELOG entry; the operator UI's Audit tab is inert without it
    and MONGO_URI_SECRET_NAME is already wired.

- env-cloudrun.yaml: drop the stale OPERATOR_UI_TOKEN comment (token auth
  mode was removed earlier on this branch) and document the PAT/auth-repo
  model plus optional LLM_* overrides.

- CHANGELOG.md: populate [Unreleased] so release.sh has content to extract
  when the next tag is cut. Covers the operator UI, PAT auth, AI rule
  suggester, per-delivery logs, empty-commit fix, audit-decode fix, and the
  gosec SSRF hardening.

LLM_* deliberately left unset in the deploy — defaults point at localhost,
which is unreachable from Cloud Run; operators can configure a real endpoint
from the UI at runtime.
The previous commit's edit accidentally dropped the '## [v0.3.0] - 2026-04-14'
heading, leaving its Changed/Fixed/Security sections orphaned under
[Unreleased]. Re-add the heading so the release history stays intact.
Adds an Anthropic Messages API client alongside the existing Ollama client so
the AI rule suggester works on Cloud Run without standing up a model-serving
VM. Ollama stays as the local/dev option.

Backend
- New services/llm_anthropic.go implementing LLMClient against /v1/messages
  and /v1/models. Pins anthropic-version=2023-06-01 and sends a JSON-only
  guardrail in the system prompt (Anthropic has no native JSON mode); strips
  ``` code fences from responses as a defensive post-process.
- llm_client.go: add ErrModelManagementNotSupported sentinel; change
  NewLLMClient to take LLMClientOptions (provider, base URL, model, API key)
  and dispatch anthropic/ollama.
- operator_llm_admin.go: include provider + supports_model_mgmt in the /llm/status
  response so the UI can branch; map the sentinel to 400 on pull/delete.
- configs/environment.go: add AnthropicAPIKey and AnthropicAPIKeySecretName
  fields + env vars; per-provider LLM_BASE_URL / LLM_MODEL defaults (Anthropic
  defaults to api.anthropic.com + claude-haiku-4-5-20251001).
- github_auth.go: LoadAnthropicAPIKey follows the existing Secret Manager
  pattern used for the webhook secret and Mongo URI.
- app.go: call LoadAnthropicAPIKey on boot (non-fatal — UI shows
  "not configured" if missing); banner now prints AI Provider.

UI
- web/operator/index.html: renderAISettings branches on provider/
  supports_model_mgmt to hide Ollama-only pull/delete sections for hosted
  providers; separate onboarding text for each. Status line labels the
  provider ("Anthropic connected at…" vs "Ollama connected at…").

Deploy config
- ci.yml + env-cloudrun.yaml: select anthropic, default to
  claude-haiku-4-5-20251001, load the API key from Secret Manager via
  ANTHROPIC_API_KEY_SECRET_NAME=anthropic-api-key. Operators can still switch
  the active model from the UI at runtime.

Before deploy, a one-time step is needed on the prod project:
  gcloud secrets create anthropic-api-key --data-file=<key-file>
  gcloud secrets add-iam-policy-binding anthropic-api-key \
    --member=serviceAccount:<runtime-sa> --role=roles/secretmanager.secretAccessor
The Anthropic access we have is an APIM-fronted gateway, not the native
Anthropic API. Two changes needed to match the gateway's contract:

- Auth header: gateway uses "api-key" (Azure API Management convention),
  not Anthropic's native "x-api-key". Send both — direct API ignores
  "api-key", gateway ignores "x-api-key", so one client works with either.
- Ping: switch from GET /v1/models (not exposed by the gateway) to a
  minimal POST /v1/messages with max_tokens=1. Cost ≈ 1 input + 1 output
  token per refresh. ListModels already falls back to a static list when
  /v1/models fails, so this keeps the UI usable.

Deploy config:
- LLM_BASE_URL points at the Grove Foundry gateway.
- LLM_MODEL uses the aliased "claude-haiku-4-5" (verified to work through
  the gateway with a smoke-test curl).
Small CLI for verifying LLM provider connectivity end-to-end — calls Ping,
ListModels, and a real rule-suggester GenerateJSON against the configured
provider. Useful after rotating ANTHROPIC_API_KEY, changing LLM_BASE_URL,
or pointing at a new APIM gateway.

Mirrors the shape of the existing cmd/test-pem and cmd/test-webhook tools.
Narrow the operator role mapping so write access no longer grants operator
privileges. Most writers on docs repos have write on the auth repo, which
would have given them replay and release capability — the point of the
role split is to gate those actions behind an explicit grant.

  admin / maintain         → operator  (was: admin / maintain / write)
  write / triage / read    → writer    (was: read / triage)
  none                     → denied

Updates env-cloudrun.yaml and CHANGELOG to match. Frontend has no hardcoded
permission labels; it just branches on the role string, so no UI change.
Rewrite the rule-suggester system prompt around the actual single-rule output
shape (llmSuggestedRule): one transform_type, one rule per request. The
previous prompt described the types accurately but gave the model no concrete
examples to anchor the decision between move / copy / glob / regex.

New prompt adds:
- Explicit response-shape template with field-level "for move or copy" /
  "for glob or regex" guidance.
- Default rules for destination_branch ("main") and commit_strategy
  ("pull_request" unless clearly a direct commit).
- Four worked Input/Output examples — one per transform type — using
  realistic paths (mflix-java-spring prefix rename, mflix README copy,
  agg/python glob with ${relative_path}, tutorials/v2 regex with named
  captures).

Also exports the prompt as services.SuggestRuleSystemPrompt and wires it
into cmd/test-llm so the smoke test exercises the exact prompt writers hit
via the UI. Verified end-to-end against the Grove Foundry APIM gateway:
claude-haiku-4-5 now correctly picks "glob" with ${relative_path} for a
.py-in-directory example (previously chose "move" with the exact file path).
Two security fixes called out in PR review:

1. Soft-fail on permission check granted writer on 404 → auth bypass.
   Previously, ANY error from /repos/.../collaborators/.../permission
   (including GitHub's 404 for "not a collaborator") returned the default
   writer role. Combined with --allow-unauthenticated on the Cloud Run
   service, any valid GitHub PAT got read access to audit logs, webhook
   traces, workflows, logs, and the AI suggester (real token cost).

   Now: only transient 5xx responses keep the writer role (so a GitHub
   outage doesn't lock every operator out). 404/401/403/network/parse
   errors all surface an error AND set RoleDenied. The distinction is
   carried by a new ghAPIError type that exposes StatusCode and
   IsTransient().

2. Raw PATs were live in the heap as cache keys for the 5-min TTL. A
   memory dump of the running Cloud Run instance would have leaked every
   active operator's token. Cache now keys on SHA-256 hex digests via
   hashToken(). Cache API unchanged — hashing happens at the boundary.

Also introduces githubAPIBaseURL (package var) so tests can point the GitHub
client at an httptest.Server. Not user-controllable, so SSRF posture is
unchanged (gosec clean).

Tests: full role-mapping matrix, explicit 404→denied and 5xx→writer cases,
and an assertion that raw PATs never appear as map keys.
Two LLM cost controls flagged in PR review:

1. /operator/api/suggest-rule was unbounded per PAT — combined with the
   earlier soft-fail auth hole, any valid GitHub user could have spent
   Anthropic tokens at will. Adds a fixed-window rate limiter keyed by
   hashed PAT at 30 calls/hour/user. Denied requests return 429 with a
   Retry-After header. Normal usage is well under the cap.

2. /operator/api/llm/status was issuing a real /v1/messages call on every
   refresh (1 input + 1 output Anthropic token). Writers poll this tab for
   the connection indicator, so on a busy dashboard that added up. Cache
   the ping outcome for 30s; settings mutations (SetActiveModel / SetBaseURL)
   invalidate the entry so operators see fresh state after a change.

tokenBucket and llmPingCache are small internal types in operator_ratelimit.go
and operator_ui.go respectively. Eviction of rate-limit buckets is
opportunistic (soft cap at 256) — matches the existing ghAuthCache pattern.
- handleRepoPermission: return {allowed, error?} per repo instead of a
  raw bool map. The frontend now shows GitHub rate-limit / transient
  errors in replay-button tooltips instead of generic "no access". UI
  keeps boolean-shape compat for partial rollouts. (review #7)

- githubCreateVersionTag: apply the same ghUsernameRe / ghRepoNameRe
  whitelist + url.PathEscape treatment ghAPIGetRepoPermission already
  uses. Also add a ghBranchNameRe check. Env-vars only today, but keeps
  the SSRF hardening story consistent across the package. (review #8)

- sharedGithubHTTPClient package var replaces per-call allocation in
  githubHTTPClient(). (review minor)

- ReleaseAPIMode typed string with ReleaseAPIDisabled /
  ReleaseAPITagCreateEnabled constants. (review minor)

- truncate() is now rune-aware so multi-byte LLM output isn't cut
  mid-glyph in error logs. (review minor)

- Replay delivery IDs append a 3-byte random suffix so two replays in
  the same millisecond can't collide on the trace ring. (review minor)

- Anthropic fallback model list trimmed to one stable alias
  (claude-haiku-4-5). Listing rotating dated IDs here ships dead
  dropdown options on every Claude release. (review #9)

- Normalize the Anthropic default model to the aliased form
  ("claude-haiku-4-5") in configs/environment.go and llm_anthropic.go
  so every default path matches ci.yml and env-cloudrun.yaml. (review #10)

- AI settings UI hint calls out that settings are process-global,
  affect all operators, and revert on restart. Fixes the "two operators
  clobber each other silently" footgun. (review #5)
Addresses PR review #6 — the security-critical surface had zero tests.
Adds table-driven coverage for:

- verifySuggestedRule, the invariant the AI suggester depends on to
  decide whether to show a "not verified" warning. Covers every
  transform type (move/copy/glob/regex), pattern mismatches, target
  mismatches, and invalid regex.
- NewLLMClient dispatch: empty/ollama/anthropic/unsupported, plus the
  "anthropic provider requires ANTHROPIC_API_KEY" guard.
- anthropic client getters/setters and ErrModelManagementNotSupported
  returned by PullModel / DeleteModel.
- stripJSONFences edge cases (fenced, unfenced, nested JSON).
- Rune-aware truncate (multi-byte glyphs not cut mid-byte).

The earlier operator_auth_test.go covers validateGitHubPAT role mapping
and the 404→denied / 5xx→writer auth semantics; this completes the
review's minimum bar.
Bring the doc set up-to-date with what this PR ships so new devs and
operators aren't figuring out a live feature set from source.

- README.md: new "Operator UI" section under Monitoring covering enable
  flags, role mapping (admin/maintain → operator, write/triage/read →
  writer), per-repo replay authorization, and AI-suggester providers.
  Enhanced Features list gains an "Operator UI" group. Tools list gains
  test-llm and test-pem entries.

- docs/CONFIG-REFERENCE.md: new "Operator UI" and "AI Rule Suggester
  (LLM)" env-var tables covering OPERATOR_UI_ENABLED, OPERATOR_AUTH_REPO,
  OPERATOR_REPO_SLUG, OPERATOR_RELEASE_*, LLM_PROVIDER, LLM_BASE_URL,
  LLM_MODEL, ANTHROPIC_API_KEY, ANTHROPIC_API_KEY_SECRET_NAME. Calls out
  the 30/hour/PAT rate limit on /suggest-rule.

- docs/DEPLOYMENT.md: Secret Manager step #4 for anthropic-api-key plus
  the IAM binding; pre-deploy checklist gains the operator-UI auth repo
  bullet; post-deploy smoke test for the operator UI + AI settings.

- docs/LOCAL-TESTING.md: "Optional (for Operator UI + AI rule suggester)"
  env-var block and a step-by-step "Testing the Operator UI Locally"
  section that points at cmd/test-llm for provider verification.

- docs/FAQ.md: new "Operator UI" section (what it is, who can access,
  how the AI suggester works, how to debug "not connected").

- AGENT.md: full rewrite. Expanded file map covers all operator_*.go,
  llm_*.go, web/operator/index.html embed, webhook_trace_buffer, and
  log_buffer. New sections on authorization model, security posture
  (auth fail-closed, PAT hashing, SSRF defense-in-depth, LLM cost cap),
  and edit patterns for operator UI / LLM provider work. Key doc table
  rebuilt with clickable links.
CI staticcheck flagged SA1012 — passing nil as a context argument, even
where the function signature permits it, is a lint error. My local
golangci-lint run was pulling a cached rule set that didn't enforce this;
CI's fresh download did.

Swap the two nil args in TestAnthropicClient_ModelManagementNotSupported
for a shared context.Background(). No behavior change.
@cbullinger cbullinger force-pushed the feat/operator-ui-audit branch from 14c8340 to 338a39c Compare April 24, 2026 13:49
A RoleOperator could redirect the Anthropic base URL to any host via
POST /operator/api/llm/settings, exfiltrating the bearer credential on the
next ping or /suggest-rule call. Address per review feedback:

- validateLLMBaseURL enforces https for Anthropic (+ any non-Ollama
  provider), rejects userinfo / opaque URIs, and honours an optional host
  allowlist via LLM_BASE_URL_ALLOWED_HOSTS.
- handleLLMSettings now persists a config_change AuditEvent (actor =
  GitHub login, old/new base URL + model) so changes are detectable
  after the fact.
- AuditEvent gains an Actor field; AuditLogger gains LogConfigChangeEvent.
Indirect dependency bump (otel, otel/metric, otel/trace) from v1.39.0.
Picked up via `go mod tidy`.
Copy link
Copy Markdown
Collaborator

@dacharyc dacharyc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back to you with a handful of auth/scoping-related issues, and some suggestions to add test coverage for the rule suggester, @cbullinger !

Comment thread services/operator_ui.go
_ = json.NewEncoder(w).Encode(map[string]any{"events": []any{}})
return
}
events, err := o.container.AuditLogger.QueryAuditEvents(ctx, q)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GET /operator/api/audit/events is wrapped with wrapAPI, so both RoleOperator and RoleWriter can read it. The handler passes the user-supplied filters straight to QueryAuditEvents with no scoping to the caller's GitHub permissions, so any writer (anyone with read/triage/write on OperatorAuthRepo) sees every audit row across every source/target repo the copier has touched — including repos they have no access to on GitHub.

Each row includes source_repo, source_path, target_repo, target_path, commit_sha, pr_number, error_message, and the free-form additional_data map (audit_logger.go:23). Even just the repo+path tuples can leak internal project names, unreleased work, or service topology to a writer who shouldn't see them.

The building block to fix this already exists in the PR: ghCache.CanUserReadRepo is used at line 276 to gate the per-repo replay button. We should probably post-filter results in handleAuditEvents — drop any event whose source_repo (and/or target_repo) the caller can't read. The existing repo-perm cache absorbs most of the GitHub API cost; worst case is one API call per distinct repo per uncached PAT.

Comment thread services/audit_logger.go
ErrorMessage string `bson:"error_message,omitempty" json:"error_message,omitempty"`
DurationMs int64 `bson:"duration_ms,omitempty" json:"duration_ms,omitempty"`
FileSize int64 `bson:"file_size,omitempty" json:"file_size,omitempty"`
AdditionalData map[string]any `bson:"additional_data,omitempty" json:"additional_data,omitempty"`
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AdditionalData map[string]any is serialized to the audit-events JSON response verbatim and rendered in the operator UI. Worth auditing every call site that populates this field (and ErrorMessage) to make sure nothing sensitive lands here — raw GitHub API error bodies, tokens accidentally included in error context, internal IDs, etc. Once /operator/api/audit/events is reachable by the writer role, anything in this map is reachable by them too.

Comment thread services/operator_ui.go
_ = json.NewEncoder(w).Encode(map[string]any{"events": events})
}

func parseAuditListQuery(r *http.Request) (AuditListQuery, error) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor follow-on to the authorization issue: if/when audit results get scoped to the caller's accessible repos, consider also accepting an explicit source_repo / target_repo query param. Right now path is the only way to narrow by repo (substring match against source_path/target_path), which is awkward and won't match the indexed source_repo field. It looks like there's already a Mongo index on source_repo (audit_logger.go:153) that's currently unused by this query path.

Comment thread services/operator_ui.go
mux.HandleFunc("/operator/api/audit/events", o.wrapAPI(o.handleAuditEvents))
mux.HandleFunc("/operator/api/audit/overview", o.wrapAPI(o.handleAuditOverview))
mux.HandleFunc("/operator/api/observability/deliveries", o.wrapAPI(o.handleObservabilityDeliveries))
mux.HandleFunc("/operator/api/observability/webhook-traces", o.wrapAPI(o.handleObservabilityWebhookTraces))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting here the same authorization gap as /operator/api/audit/eventswrapAPI lets writers read this, and WebhookTraceEntry (webhook_trace_buffer.go:11) includes Repo, BaseBranch, CommitSHA, PRNumber, plus a free-form 500-char Detail. A writer with no GitHub access to repo X still sees X's webhook activity here, including whatever is stuffed into Detail. Recommend the same fix as audit: post-filter by ghCache.CanUserReadRepo against entry.Repo.

It's probably also worth scanning every AppendWebhookTrace(...) call site (webhook_handler_new.go:91 onward) to confirm Detail doesn't carry tokens or other secrets. The 500-char truncation is a length cap, not a redaction.

Comment thread services/operator_ui.go
mux.HandleFunc("/operator/api/status", o.handleOperatorStatus)
mux.HandleFunc("/operator/api/audit/events", o.wrapAPI(o.handleAuditEvents))
mux.HandleFunc("/operator/api/audit/overview", o.wrapAPI(o.handleAuditOverview))
mux.HandleFunc("/operator/api/observability/deliveries", o.wrapAPI(o.handleObservabilityDeliveries))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lower-severity, but flagging for completeness: this is also wrapAPI. The payload (DeliverySnapshot: delivery_id, seen_at, duplicate) doesn't carry repo info, so a writer can only see "we received some webhooks." Probably fine as-is, but worth deciding intentionally rather than by default.

Comment thread services/operator_ui.go
mux.HandleFunc("/operator/api/release", o.wrapOperatorOnly(o.handleRelease))
mux.HandleFunc("/operator/api/replay", o.wrapOperatorOnly(o.handleReplay))
mux.HandleFunc("/operator/api/workflows", o.wrapAPI(o.handleWorkflows))
mux.HandleFunc("/operator/api/logs", o.wrapAPI(o.handleDeliveryLogs))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrapAPI again, lookup by delivery_id. A writer can pivot from /observability/webhook-traces (which leaks delivery IDs from repos they don't have access to) into the per-delivery logs. Whether that matters depends on what DeliveryLogs captures. We might want to do a spot check to confirm log lines don't carry secrets or content that wouldn't otherwise reach a writer.

Comment thread services/operator_ui.go
mux.HandleFunc("/operator/api/deployment", o.wrapAPI(o.handleDeployment))
mux.HandleFunc("/operator/api/release", o.wrapOperatorOnly(o.handleRelease))
mux.HandleFunc("/operator/api/replay", o.wrapOperatorOnly(o.handleReplay))
mux.HandleFunc("/operator/api/workflows", o.wrapAPI(o.handleWorkflows))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with the audit/traces endpoints scoped per repo, this endpoint hands a writer the entire workflow config — every source repo, target repo, and path pattern the copier knows about. If "writer" is supposed to be a scoped role, this is the largest exposure of repo topology in the PR; if the design is "writers are trusted to know what the copier does," that's fine but worth documenting next to RoleWriter (operator_auth.go:63) so the boundary is explicit.

Comment thread services/operator_auth.go
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was asking Claude questions about the implementation here, and Claude suggested this fix, so take it with a grain of salt - seems plausible but it's not in my domain expertise: Auth cache + PAT validation are an unbounded amplifier for bad input

ghAuthCache.set / setRepoPerm cache errors at the same TTL as successes (operator_auth.go:120, 151), and the size-triggered sweeps (lines 130, 160) only evict already-expired entries. Combined with the fact that any Bearer string is forwarded to GET /user before validation, an attacker can:

  1. Hold ~200 bytes/entry in the cache for 5 min per unique random token they send.
  2. Force 1–2 outbound GitHub calls per request, draining the server's GitHub QPS budget.

Two cheap fixes:

  • Reject Bearer tokens that don't match a known GitHub PAT prefix (ghp_, github_pat_, gho_, ghs_, ghu_) before any cache or network access.
  • Cap absolute map size (LRU or simple "drop random entry on overflow") rather than relying on TTL-based eviction.

Not exploitable into role escalation, but a meaningful DoS amplifier for an internet-exposed operator UI.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm noting the test coverage here on the suggester is missing some load-bearing pieces. Each transform type has one happy-path verification test, but the coverage misses the parts most likely to break in production:

  1. renderRuleYAML is completely untested; this is the artifact the user copies into config. Minimum: one golden test per transform type, plus the req.SourceRepo == "" branch.
  2. handleSuggestRule is completely untested — no coverage of the 405/503/429/400/502 branches, the Verified=false warning path, or default field application in askLLMForRule. Easy table-driven tests with a fake LLMClient.
  3. Move uses strings.HasPrefix without a path-boundary check (operator_suggest_rule.go:229). from=agg/py against source=agg/python/foo.py would pass verification with rel=thon/foo.py. The prompt doesn't force from to end at a / boundary, so this is reachable. We should probably add a boundary check in verifySuggestedRule or pin the current behavior with a test that demonstrates the divergence between verifier and server-side transform (whichever is correct).
  4. No "rule applies but target mismatch" test for glob or regex — only "rule doesn't apply at all." This is the most likely real failure mode for an LLM-generated rule (close-but-wrong template).
  5. computeGlobRelativePath corner cases (no wildcard, leading wildcard, brackets) aren't exercised — small table test would lock these down.

#1, #2, and #3 I'd consider blocking; the rest are nice-to-have.

Per code review: writers (read/triage/write on the auth repo) could
enumerate every repo the copier touches via the read-only operator
endpoints, regardless of GitHub access to the underlying repos. Each
endpoint leaks a different slice of topology — repo names, paths,
commit SHAs, free-form Detail fields, full source→target workflow
config — to anyone holding any valid PAT for the auth repo.

Add a per-request repoFilter that memoises CanUserReadRepo lookups
and post-filters rows. Operators (admin/maintain) bypass; writers
only see rows whose referenced repos they can read on GitHub.

- /audit/events: drop rows where source_repo or target_repo is
  unreadable; config_change rows (no repo) are operator-only.
- /observability/webhook-traces: drop traces with no repo or with
  unreadable repos; total stays unfiltered so retention vs
  visibility stays distinguishable.
- /logs (per-delivery): look up the delivery's repo via a new
  WebhookTraceBuffer.RepoForDelivery helper; deny writers when
  no trace exists or the repo is unreadable.
- /workflows: drop workflows whose source or destination repo is
  unreadable; surface hidden_count so the UI can hint without
  naming the hidden rows.

RoleWriter doc updated to describe the post-filter behaviour
explicitly so the boundary is discoverable from the type.
Per code review: verifySuggestedRule used strings.HasPrefix without a
trailing-separator check, so a "move" rule with from="agg/py" would
falsely pass verification for source="agg/python/foo.py" — the LLM's
intended rule and the verified rule diverge silently. After the prefix
the next character must now be either nothing (whole-path rename) or
"/" (subpath rename).

Test coverage on the suggester also expanded per review:
- renderRuleYAML golden tests for each transform type (the YAML the
  user copies into config) plus the empty-source-repo branch.
- handleSuggestRule branch coverage: 405/503/429/400/502, the
  Verified=false warning path, and the askLLMForRule default-fill
  path (name, branch, dest_repo, commit_strategy).
- Move-boundary regression test plus an exact-match-at-boundary case.
@cbullinger
Copy link
Copy Markdown
Collaborator Author

cbullinger commented May 1, 2026

@dacharyc Thanks for the thorough review!

Addressed items 1–4 and the blocking suggester pieces in commits 739ab56 and f41bc52.

Addressed

  • /audit/events, /observability/webhook-traces, /logs, /workflows — new repoFilter (services/operator_repo_filter.go) post-filters all four endpoints by CanUserReadRepo. Operators bypass; writers only see rows whose referenced repos they can read on GitHub. config_change audit rows (no repo) are operator-only. /logs resolves a delivery's repo via a new WebhookTraceBuffer.RepoForDelivery helper and denies writers when the trace has aged out. /workflows returns a hidden_count so the UI can hint without leaking names. RoleWriter doc updated to make the boundary explicit.
  • Move-transform path-boundary bugverifySuggestedRule now requires the prefix to end at a / boundary, so from=agg/py no longer falsely matches agg/python/.... Regression test added.
  • Suggester test coverage — golden renderRuleYAML tests per transform type + empty-source-repo branch; handleSuggestRule branch tests for 405/503/429/400/502, the Verified=false warning path, and askLLMForRule default application (name, branch, dest_repo, commit_strategy).

Deferred to follow-up issues (intentionally — would balloon this PR's diff)

  • AdditionalData / ErrorMessage hygiene sweep — call-site audit to confirm nothing sensitive lands in those fields, especially now that they're reachable by writers (after this PR's scoping).
  • Auth-cache DoS hardening — PAT prefix check before validation/cache + bounded LRU on ghAuthCache. Plausible mitigation, but separate subsystem.
  • Suggester nice-to-haves — glob/regex "rule applies but target mismatch" tests, computeGlobRelativePath corner cases.

Copy link
Copy Markdown
Collaborator

@dacharyc dacharyc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks for fixing up the blockers I called out - let's get this delightful addition merged so we can all enjoy the beautiful new dashboard 😁

@cbullinger cbullinger merged commit c02c3b4 into main May 2, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants