feat: CR-configurable per-cluster domain metrics on the Prometheus /metrics endpoint by xrl · Pull Request #382 · etcd-io/etcd-operator

xrl · 2026-06-18T23:15:34Z

What

Exposes custom per-cluster domain metrics for every reconciled EtcdCluster on the operator's existing Prometheus /metrics endpoint, and makes Prometheus observability configurable via the CR (spec.metrics).

The operator already serves a controller-runtime metrics endpoint backed by a global registry; this PR registers custom collectors against that same registry, so there is no new HTTP server or port. Every series is labelled with the cluster namespace and name, so a single operator managing many clusters yields one time series per cluster.

Metrics (subsystem `etcd_operator_cluster_`)

metric	type	meaning
`member_count_desired`	gauge	desired members (`spec.size`)
`member_count`	gauge	members from the etcd member-list API (`status.memberCount`)
`member_count_ready`	gauge	ready etcd pods (StatefulSet `readyReplicas`)
`member_count_healthy`	gauge	members reported healthy
`learner_count`	gauge	non-voting (learner) members
`has_quorum`	gauge 1/0	healthy members form a strict majority
`has_leader`	gauge 1/0	a leader is currently known
`tls_enabled`	gauge 1/0	cluster configured for TLS
`leader_changes_total`	counter	observed leader-ID transitions
`reconcile_duration_seconds`	histogram	reconcile wall-clock
`reconcile_errors_total`	counter	reconciles returning an error

Gauge values are derived from the freshly-computed EtcdClusterStatus at the end of updateStatus — no extra API calls. Reconcile duration/errors are observed in Reconcile. Leader changes are tracked per cluster (the first observation and leaderless windows do not count, so operator restarts don't inflate the counter). When a cluster is deleted, its series are dropped to bound cardinality.

CR config knob (`spec.metrics`)

spec:
  metrics:
    enabled: true            # default true; false drops this cluster's series
    podMonitor:
      enabled: true          # emit a prometheus-operator PodMonitor for the etcd member pods
      interval: "30s"
      port: "client"

spec.metrics.enabled (*bool, default true) toggles/scopes the operator domain metrics per cluster.
spec.metrics.podMonitor, when enabled, makes the operator reconcile a PodMonitor (owned by the EtcdCluster, so GC'd on delete) selecting the etcd member pods (app/controller labels) so Prometheus scrapes etcd's own /metrics.

Why unstructured for the PodMonitor

The PodMonitor is created via unstructured so the operator takes no hard dependency on the prometheus-operator Go module and does not require its CRDs to be installed unless a user opts in. If the CRD is absent, the operator logs and skips PodMonitor reconciliation without failing the rest of the reconcile (meta.IsNoMatchError).

Files

NEW internal/metrics/ — collector definitions, registration (MustRegister/RegisterTo), RecordClusterMetrics, ObserveReconcile, DeleteClusterMetrics, per-cluster leader tracker.
NEW internal/controller/podmonitor.go — PodMonitor reconcile/cleanup.
api/v1alpha1/etcdcluster_types.go — MetricsSpec / PodMonitorSpec + MetricsEnabled()/PodMonitorEnabled() helpers (CRD/deepcopy regenerated via the pinned controller-gen).
internal/controller/etcdcluster_controller.go — minimal, localized instrumentation (time the loop, record metrics after status, reconcile PodMonitor, drop series on delete) + PodMonitor RBAC marker.
cmd/main.go — register collectors at startup.
NEW test/e2e/metrics_test.go — compile-verified e2e that creates a cluster with spec.metrics.podMonitor and scrapes the operator /metrics.
config/samples/sample_metrics_etcdcluster.yaml — example CR.

Testing

Unit tests (internal/metrics/metrics_test.go) register collectors to an isolated registry, drive cluster state, scrape via prometheus/client_golang/testutil, and assert exact values + labels — including quorum/leader edge cases, leader-change counting across leaderless windows, reconcile error counting, and series deletion (and leader-tracker reset) on cluster delete. Spec-helper defaults are covered too.
go build ./..., go vet ./..., go vet ./test/e2e/..., and the controller envtest suite all pass.
No live KinD run in this PR; the e2e test is authored to compile in CI and run under make test-e2e.

🤖 Generated with Claude Code

Add an internal/metrics package that defines and registers custom Prometheus collectors for every reconciled EtcdCluster against the controller-runtime global registry, so they are served on the operator's existing /metrics endpoint. All series are labelled with the cluster namespace and name. Metrics: member_count_desired, member_count, member_count_ready, member_count_healthy, learner_count, has_quorum, has_leader, tls_enabled, leader_changes_total, reconcile_duration_seconds, reconcile_errors_total. Values are derived from the freshly-computed EtcdClusterStatus at the end of updateStatus (no extra API calls); reconcile duration/errors are observed in Reconcile; leader changes are tracked per cluster. Series are deleted when a cluster is removed to bound cardinality. Add spec.metrics to EtcdCluster: spec.metrics.enabled (default true) toggles/scopes the domain metrics per cluster, and spec.metrics.podMonitor makes the operator reconcile a prometheus-operator PodMonitor (owned by the cluster) for the etcd member pods. The PodMonitor is addressed via unstructured so the operator takes no hard dependency on the prometheus-operator module and degrades gracefully when the CRD is absent. Unit tests register to an isolated registry, drive state, scrape with prometheus testutil, and assert values/labels. A compile-verified e2e test creates a cluster with spec.metrics.podMonitor and scrapes /metrics. Refs etcd-io#381 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>

k8s-ci-robot · 2026-06-18T23:15:38Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xrl
Once this PR has been reviewed and has the lgtm label, please assign justinsb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2026-06-18T23:15:45Z

Hi @xrl. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

…eletion/leaderloss Two correctness fixes in the reconcile path that fed the per-cluster metrics: - Gate ObserveReconcile/updateStatus on state != nil in the Reconcile defer. On the NotFound path fetchAndValidateState already drops every series via DeleteClusterMetrics; observing afterwards re-created reconcile_duration_seconds (and the other series via updateStatus) for the just-deleted cluster on every tombstone reconcile, defeating the cardinality bound the package exists to keep. - Clear Status.LeaderID to "" when no leader is currently known in updateStatus. Previously LeaderID was only ever set, never cleared, so has_leader (derived from LeaderID != "") stuck at 1 forever once any leader was seen, hiding the exact leaderless outage it should report. observeLeader already treats "" as 'no change', so leader_changes_total is unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>

RecordClusterMetrics computed quorum as hasQuorum(MemberCount, healthy), but both totals included learners. etcd raft quorum is over voting members only, so a healthy learner could mask a degraded voting set (e.g. 2 voting / 1 healthy plus 1 healthy learner reported quorum=1 during a real outage). Track healthyVoting (healthy and not a learner) and derive voting = MemberCount - learners, then call hasQuorum(voting, healthyVoting). Replace the learner test with one that exercises the learner-exclusion case and add a fully-healthy voting-set-with-learner case. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>

…a value The e2e scrape hits the operator pod's :8080 over plain HTTP via pod-proxy, but the default overlay serves secure HTTPS metrics on :8443 behind authn/authz, so the scrape reached a port nothing served (or 401'd) and the test only ever 'compile-verified'. Add a config/e2e patch-metrics.yaml overriding the manager args to --metrics-bind-address=:8080 --metrics-secure=false (composed with the existing patch-env.yaml), so the scrape actually reaches the manager. Also upgrade an assertion from a metric-name substring (which passes from HELP/TYPE lines even with zero series) to a value check on member_count_desired{name=...} 3, proving the per-cluster gauge was recorded. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>

The PodMonitor reconcile (the riskiest new control flow) had no coverage. Add table of fake-client tests: created-with-expected-selector/port/interval/owner when enabled, default port when unset, CRD-absent (NoMatchError) treated as a soft skip with no panic, delete-on-disable, and NotFound-on-delete swallowed. Directly validates the headline 'never fails reconcile when the CRD is absent'. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>

Completes the observability surface for the per-cluster domain metrics: - Add etcd_operator_cluster_available (mirrors the Available status condition) and etcd_operator_cluster_tls_ready (only present for TLS-configured clusters; 1 when the TLS-secured cluster is Available), wired into RecordClusterMetrics and DeleteClusterMetrics with unit tests using prometheus testutil. The has_quorum gauge and leader_changes_total counter already existed and remain the recovery feature's signals. - Add config/prometheus/alerts.yaml (PrometheusRule) with quorum-lost, no-leader, members-not-ready, tls-not-ready, reconcile-errors, leader-flapping, and not-available alerts over the domain metrics, and wire it into the config/prometheus kustomization alongside the existing ServiceMonitor. - Add config/grafana/etcd-operator-dashboard.json and a ConfigMap- generator kustomization (grafana_dashboard sidecar label) covering cluster health, membership, and reconcile stability with datasource/ namespace/cluster template variables. - Add docs/metrics.md documenting every metric, its labels, the spec.metrics CR knobs, the ServiceMonitor vs PodMonitor scrape paths, and the shipped dashboard/alert assets. Verified: go build/vet across the module, metrics unit tests, and the controller envtest suite pass; kustomize build renders config/prometheus and config/grafana, and the prometheus component composes under the default overlay's namespace/prefix transforms. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>

Strengthen TestDomainMetricsExposed from a name-substring check (which the test's own comment admitted "passes purely from collector registration") into a true end-to-end proof of the feature's derived values. Improvements: - Value-assert the derived gauges for a healthy cluster instead of just metric names: member_count{,_ready,_healthy,_desired} == size, has_quorum == 1, has_leader == 1, learner_count == 0, tls_enabled == 0. has_quorum/has_leader are the dangerous booleans that require a real etcd round-trip + populated Status.Members/LeaderID, so they distinguish "operator fully reconciled etcd" from "operator never talked to etcd" (the previous member_count_desired check is sourced from spec.Size and needs no etcd round-trip at all). - Assert reconcile_duration_seconds_count >= 1 (the _count child series), proving the reconcile-timing hook actually fired; a histogram's # HELP line exists with zero observations so the name check proved nothing. - Assert tls_ready is ABSENT for the plaintext cluster — the documented invariant that keeps "tls configured but not ready" alerts from firing on plaintext clusters. - Assert the operator actually created a PodMonitor for the cluster with endpoint port "client" and interval "30s". reconcilePodMonitor swallows all errors (CRD-absent is a silent skip), so without observing the object the whole podMonitor half of the feature was dead setup. The e2e suite installs the prometheus-operator CRDs, so the object must exist. - Poll the scrape with wait.For instead of a one-shot read after STS readiness. The derived gauges are written by the async reconcile loop, which races STS ReadyReplicas; a one-shot scrape in that window would flake the new value assertions. No fixed sleeps anywhere. - Optimize: drop the cluster from size 3 to size 1. Quorum (1/1), a leader, healthy members, the PodMonitor path, and every derived gauge are all exercised at size 1; the quorum-over-voting-members arithmetic is covered by the metrics package unit tests. This cuts the dominant cost (sequential StatefulSet bootstrap + member-join): live e2e wall-clock 202.8s -> 108.1s. - Drop the dead `var _ = corev1.Pod{}` keepalive and its unused import, and rewrite the misleading "authored primarily for compile-time verification" doc comment to describe what the test now proves. Live-validated on KinD (kindest/node:v1.32.0): both runs green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>

A namespaced/release-scoped Prometheus selects PodMonitors by label (e.g. podMonitorSelector matchLabels release: kvs-prometheus). The operator-generated PodMonitor only carried fixed app.kubernetes.io/* labels, so such a Prometheus never discovered it and downstream had to hand-author a PodMonitor. Add spec.metrics.podMonitor.labels (map[string]string). The labels are merged on top of the operator-managed labels and win on conflict; nil/ empty preserves today's behavior. Regenerated CRD + deepcopy and added unit coverage asserting release: kvs-prometheus lands on the object and that user keys override managed defaults. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>

k8s-ci-robot added the needs-ok-to-test label Jun 18, 2026

k8s-ci-robot added the size/XXL label Jun 18, 2026

xrl and others added 7 commits June 18, 2026 20:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: CR-configurable per-cluster domain metrics on the Prometheus /metrics endpoint#382

feat: CR-configurable per-cluster domain metrics on the Prometheus /metrics endpoint#382
xrl wants to merge 8 commits into
etcd-io:mainfrom
xrl:pr/domain-metrics

xrl commented Jun 18, 2026

Uh oh!

k8s-ci-robot commented Jun 18, 2026

Uh oh!

k8s-ci-robot commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

xrl commented Jun 18, 2026

What

Metrics (subsystem etcd_operator_cluster_)

CR config knob (spec.metrics)

Why unstructured for the PodMonitor

Files

Testing

Uh oh!

k8s-ci-robot commented Jun 18, 2026

Uh oh!

k8s-ci-robot commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Metrics (subsystem `etcd_operator_cluster_`)

CR config knob (`spec.metrics`)