An uptime monitor + status page built as a Kubernetes application around an
UptimeCheck custom resource and operator: checks are cluster objects,
probe results live in their status, history flows into Postgres, and state
changes fire signed webhooks. Packaged as a Helm chart, hardened with
restricted Pod Security and default-deny NetworkPolicies.

$ kubectl get uptimechecks
NAME          URL                      STATE   CODE   LATENCY(MS)   SINCE
always-down   https://broken.invalid   down    0      3002          2026-06-06T22:41:07+00:00
example       https://example.com      up      200    89            2026-06-06T22:41:05+00:00
github        https://github.com       up      200    142           2026-06-06T22:41:04+00:00

make cluster-up # k3d cluster, ingress mapped to localhost:8080
make build # docker build + import into the cluster
make deploy # namespace (kubectl, carries PSS labels) + helm install
make smoke # 10-step verification
open http://localhost:8080

Requires: Docker, k3d, kubectl, helm. No AWS account needed for local dev.

UptimeCheck CRs ──watched──→ operator (kopf) ──probe loop per check──→ targets
  ▲       │     status patched back                     │
  │       ▼                                             ├─→ Postgres (results, transitions)
 helm    api (FastAPI) ──reads──┐                       └─→ webhook alert on state change (HMAC-signed)
values    │                     └── uptime %, events ← Postgres
          ▼
     ingress → status page (fetch-polling)
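
The diagram mentions HMAC-signed webhook alerts. A minimal sketch of what signing and verifying such an alert could look like, assuming HMAC-SHA256 over the raw JSON body with a hex-encoded signature. The function names, the payload fields, and the choice of SHA-256 are illustrative assumptions, not necessarily what Vigil does:

```python
import hashlib
import hmac
import json


def sign_payload(secret: bytes, body: bytes) -> str:
    """Return a hex HMAC-SHA256 digest of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_payload(secret: bytes, body: bytes, signature: str) -> bool:
    """Recompute the digest and compare it in constant time."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, signature)


# Hypothetical alert payload; the field names are illustrative only.
alert = {"check": "example", "from": "up", "to": "down"}
body = json.dumps(alert, sort_keys=True).encode("utf-8")
signature = sign_payload(b"shared-secret", body)
```

A receiver would typically call `verify_payload` on the raw bytes it received, before parsing the JSON, so that re-serialization differences cannot change the digest.
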
- Operator pattern: create/edit/delete an UptimeCheck and the probe loop reconciles within seconds, with no restarts and no config files. Transition detection is seeded from prior status, so a fix-the-URL edit still alerts.
- Two ServiceAccounts, least privilege: the operator may watch/patch checks and their status; the api is read-only.
- History is best-effort by design: Postgres down → monitoring continues, only uptime %/events suffer.
- Hardening: restricted Pod Security (all pods non-root, read-only rootfs, no capabilities, seccomp), default-deny NetworkPolicies with five explicit allows (dns, ingress→api, api/operator→db, operator→probe targets), demo Postgres included with the same constraints.
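
Two of the points above, seeding transition detection from prior status and treating history as best-effort, can be sketched in a few lines. This is an illustration under stated assumptions rather than the operator's actual code: `detect_transition`, `record_best_effort`, and `failing_write` are hypothetical names, and treating a never-probed check (`None`) as "no transition" is an assumption.

```python
from typing import Callable, Optional


def detect_transition(prior_state: Optional[str], new_state: str) -> bool:
    """True when the state changed relative to a known prior state.

    prior_state is seeded from the resource's existing status, so a probe
    loop restarted by an edit still compares against the last known state.
    None means the check was never probed (assumed: not a transition).
    """
    return prior_state is not None and prior_state != new_state


def record_best_effort(write: Callable[[dict], None], row: dict) -> bool:
    """Try to persist a result; report failure instead of raising."""
    try:
        write(row)
        return True
    except Exception:
        # History store unavailable: monitoring continues, this row is lost.
        return False


def failing_write(row: dict) -> None:
    """Stand-in for a database writer while the database is down."""
    raise ConnectionError("history store unreachable")


history: list = []
stored = record_best_effort(history.append, {"check": "example", "state": "up"})
dropped = record_best_effort(failing_write, {"check": "example", "state": "up"})
```

The probe loop only looks at the boolean, so a database outage degrades uptime % and events without interrupting probing or alerting.
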

- kubectl apply -f dir/ is alphabetical: the namespace raced resources into NotFound. Deploy order is now explicit.
- Rollouts 502'd briefly: the ingress kept routing to the dying pod. Fixed with a preStop sleep; the smoke test demands a streak of 200s before trusting the ingress (round-robin can sneak one dying pod past a single probe).
- Namespace deletion deadlocked on kopf's finalizers (operator was gone, so nothing removed them). Our delete cleanup is in-memory only, so the handler is now optional=True: no finalizer, deletions never block on Vigil.

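
The "streak of 200s" idea from the rollout fix can be sketched as follows. This is a hypothetical helper, not the project's smoke test: `wait_for_streak`, its parameters, and the injected `fetch_status` callable (which would wrap an HTTP GET against the ingress) are all assumptions made for illustration.

```python
import time
from typing import Callable


def wait_for_streak(fetch_status: Callable[[], int], needed: int = 5,
                    max_attempts: int = 60, delay_s: float = 0.0) -> bool:
    """Return True once `needed` consecutive probes return HTTP 200.

    Any non-200 response resets the streak, so a single lucky hit on a
    healthy pod is not enough to trust the ingress.
    """
    streak = 0
    for _ in range(max_attempts):
        streak = streak + 1 if fetch_status() == 200 else 0
        if streak >= needed:
            return True
        time.sleep(delay_s)
    return False
```

Injecting the fetch function keeps the helper testable without a cluster; a real run would pass a small function that requests the status page and returns the response code.
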
- ConfigMap-driven checker + status page on k3d
- UptimeCheck CRD + kopf operator
- Postgres history, uptime %, transitions + signed webhook alerts
- Restricted PSS, default-deny NetworkPolicies, Helm chart
- EKS via Terraform (public-subnet nodes, no NAT Gateway, IRSA, ALB, ECR); see docs/eks-runbook.md

The same chart runs on EKS — only the cluster and ingress class change. Full steps in docs/eks-runbook.md; in short:
make eks-up # Terraform: VPC (no NAT) + EKS + ECR + ALB-controller IRSA + budget
make eks-kubeconfig
make eks-push # image → ECR
# install the AWS Load Balancer Controller (runbook step 4), then:
helm upgrade --install vigil charts/vigil -f charts/vigil/values.yaml \
-f charts/vigil/values-eks.yaml --set image.repository=$ECR -n vigil
make eks-down # ALWAYS — EKS bills ~$0.50/session, ~$3/day if forgotten