Skip to content

feat: CR-configurable per-cluster domain metrics on the Prometheus /metrics endpoint#382

Open
xrl wants to merge 8 commits into
etcd-io:mainfrom
xrl:pr/domain-metrics
Open

feat: CR-configurable per-cluster domain metrics on the Prometheus /metrics endpoint#382
xrl wants to merge 8 commits into
etcd-io:mainfrom
xrl:pr/domain-metrics

Conversation

@xrl

@xrl xrl commented Jun 18, 2026

Copy link
Copy Markdown

Refs #381

What

Exposes custom per-cluster domain metrics for every reconciled EtcdCluster on the operator's existing Prometheus /metrics endpoint, and makes Prometheus observability configurable via the CR (spec.metrics).

The operator already serves a controller-runtime metrics endpoint backed by a global registry; this PR registers custom collectors against that same registry, so there is no new HTTP server or port. Every series is labelled with the cluster namespace and name, so a single operator managing many clusters yields one time series per cluster.

Metrics (subsystem etcd_operator_cluster_)

metric type meaning
member_count_desired gauge desired members (spec.size)
member_count gauge members from the etcd member-list API (status.memberCount)
member_count_ready gauge ready etcd pods (StatefulSet readyReplicas)
member_count_healthy gauge members reported healthy
learner_count gauge non-voting (learner) members
has_quorum gauge 1/0 healthy members form a strict majority
has_leader gauge 1/0 a leader is currently known
tls_enabled gauge 1/0 cluster configured for TLS
leader_changes_total counter observed leader-ID transitions
reconcile_duration_seconds histogram reconcile wall-clock
reconcile_errors_total counter reconciles returning an error

Gauge values are derived from the freshly-computed EtcdClusterStatus at the end of updateStatus — no extra API calls. Reconcile duration/errors are observed in Reconcile. Leader changes are tracked per cluster (the first observation and leaderless windows do not count, so operator restarts don't inflate the counter). When a cluster is deleted, its series are dropped to bound cardinality.

CR config knob (spec.metrics)

spec:
  metrics:
    enabled: true            # default true; false drops this cluster's series
    podMonitor:
      enabled: true          # emit a prometheus-operator PodMonitor for the etcd member pods
      interval: "30s"
      port: "client"
  • spec.metrics.enabled (*bool, default true) toggles/scopes the operator domain metrics per cluster.
  • spec.metrics.podMonitor, when enabled, makes the operator reconcile a PodMonitor (owned by the EtcdCluster, so GC'd on delete) selecting the etcd member pods (app/controller labels) so Prometheus scrapes etcd's own /metrics.

Why unstructured for the PodMonitor

The PodMonitor is created via unstructured so the operator takes no hard dependency on the prometheus-operator Go module and does not require its CRDs to be installed unless a user opts in. If the CRD is absent, the operator logs and skips PodMonitor reconciliation without failing the rest of the reconcile (meta.IsNoMatchError).

Files

  • NEW internal/metrics/ — collector definitions, registration (MustRegister/RegisterTo), RecordClusterMetrics, ObserveReconcile, DeleteClusterMetrics, per-cluster leader tracker.
  • NEW internal/controller/podmonitor.go — PodMonitor reconcile/cleanup.
  • api/v1alpha1/etcdcluster_types.goMetricsSpec / PodMonitorSpec + MetricsEnabled()/PodMonitorEnabled() helpers (CRD/deepcopy regenerated via the pinned controller-gen).
  • internal/controller/etcdcluster_controller.go — minimal, localized instrumentation (time the loop, record metrics after status, reconcile PodMonitor, drop series on delete) + PodMonitor RBAC marker.
  • cmd/main.go — register collectors at startup.
  • NEW test/e2e/metrics_test.go — compile-verified e2e that creates a cluster with spec.metrics.podMonitor and scrapes the operator /metrics.
  • config/samples/sample_metrics_etcdcluster.yaml — example CR.

Testing

  • Unit tests (internal/metrics/metrics_test.go) register collectors to an isolated registry, drive cluster state, scrape via prometheus/client_golang/testutil, and assert exact values + labels — including quorum/leader edge cases, leader-change counting across leaderless windows, reconcile error counting, and series deletion (and leader-tracker reset) on cluster delete. Spec-helper defaults are covered too.
  • go build ./..., go vet ./..., go vet ./test/e2e/..., and the controller envtest suite all pass.
  • No live KinD run in this PR; the e2e test is authored to compile in CI and run under make test-e2e.

🤖 Generated with Claude Code

Add an internal/metrics package that defines and registers custom
Prometheus collectors for every reconciled EtcdCluster against the
controller-runtime global registry, so they are served on the operator's
existing /metrics endpoint. All series are labelled with the cluster
namespace and name.

Metrics: member_count_desired, member_count, member_count_ready,
member_count_healthy, learner_count, has_quorum, has_leader, tls_enabled,
leader_changes_total, reconcile_duration_seconds, reconcile_errors_total.

Values are derived from the freshly-computed EtcdClusterStatus at the end
of updateStatus (no extra API calls); reconcile duration/errors are
observed in Reconcile; leader changes are tracked per cluster. Series are
deleted when a cluster is removed to bound cardinality.

Add spec.metrics to EtcdCluster: spec.metrics.enabled (default true)
toggles/scopes the domain metrics per cluster, and spec.metrics.podMonitor
makes the operator reconcile a prometheus-operator PodMonitor (owned by the
cluster) for the etcd member pods. The PodMonitor is addressed via
unstructured so the operator takes no hard dependency on the
prometheus-operator module and degrades gracefully when the CRD is absent.

Unit tests register to an isolated registry, drive state, scrape with
prometheus testutil, and assert values/labels. A compile-verified e2e test
creates a cluster with spec.metrics.podMonitor and scrapes /metrics.

Refs etcd-io#381

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
@k8s-ci-robot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xrl
Once this PR has been reviewed and has the lgtm label, please assign justinsb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Copy link
Copy Markdown

Hi @xrl. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

xrl and others added 7 commits June 18, 2026 20:03
…eletion/leaderloss

Two correctness fixes in the reconcile path that fed the per-cluster metrics:

- Gate ObserveReconcile/updateStatus on state != nil in the Reconcile defer.
  On the NotFound path fetchAndValidateState already drops every series via
  DeleteClusterMetrics; observing afterwards re-created reconcile_duration_seconds
  (and the other series via updateStatus) for the just-deleted cluster on every
  tombstone reconcile, defeating the cardinality bound the package exists to keep.

- Clear Status.LeaderID to "" when no leader is currently known in updateStatus.
  Previously LeaderID was only ever set, never cleared, so has_leader (derived
  from LeaderID != "") stuck at 1 forever once any leader was seen, hiding the
  exact leaderless outage it should report. observeLeader already treats "" as
  'no change', so leader_changes_total is unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
RecordClusterMetrics computed quorum as hasQuorum(MemberCount, healthy), but
both totals included learners. etcd raft quorum is over voting members only, so
a healthy learner could mask a degraded voting set (e.g. 2 voting / 1 healthy
plus 1 healthy learner reported quorum=1 during a real outage).

Track healthyVoting (healthy and not a learner) and derive voting = MemberCount
- learners, then call hasQuorum(voting, healthyVoting). Replace the learner test
with one that exercises the learner-exclusion case and add a fully-healthy
voting-set-with-learner case.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
…a value

The e2e scrape hits the operator pod's :8080 over plain HTTP via pod-proxy, but
the default overlay serves secure HTTPS metrics on :8443 behind authn/authz, so
the scrape reached a port nothing served (or 401'd) and the test only ever
'compile-verified'. Add a config/e2e patch-metrics.yaml overriding the manager
args to --metrics-bind-address=:8080 --metrics-secure=false (composed with the
existing patch-env.yaml), so the scrape actually reaches the manager.

Also upgrade an assertion from a metric-name substring (which passes from
HELP/TYPE lines even with zero series) to a value check on
member_count_desired{name=...} 3, proving the per-cluster gauge was recorded.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
The PodMonitor reconcile (the riskiest new control flow) had no coverage. Add
table of fake-client tests: created-with-expected-selector/port/interval/owner
when enabled, default port when unset, CRD-absent (NoMatchError) treated as a
soft skip with no panic, delete-on-disable, and NotFound-on-delete swallowed.
Directly validates the headline 'never fails reconcile when the CRD is absent'.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
Completes the observability surface for the per-cluster domain metrics:

- Add etcd_operator_cluster_available (mirrors the Available status
  condition) and etcd_operator_cluster_tls_ready (only present for
  TLS-configured clusters; 1 when the TLS-secured cluster is Available),
  wired into RecordClusterMetrics and DeleteClusterMetrics with unit
  tests using prometheus testutil. The has_quorum gauge and
  leader_changes_total counter already existed and remain the recovery
  feature's signals.

- Add config/prometheus/alerts.yaml (PrometheusRule) with quorum-lost,
  no-leader, members-not-ready, tls-not-ready, reconcile-errors,
  leader-flapping, and not-available alerts over the domain metrics, and
  wire it into the config/prometheus kustomization alongside the existing
  ServiceMonitor.

- Add config/grafana/etcd-operator-dashboard.json and a ConfigMap-
  generator kustomization (grafana_dashboard sidecar label) covering
  cluster health, membership, and reconcile stability with datasource/
  namespace/cluster template variables.

- Add docs/metrics.md documenting every metric, its labels, the
  spec.metrics CR knobs, the ServiceMonitor vs PodMonitor scrape paths,
  and the shipped dashboard/alert assets.

Verified: go build/vet across the module, metrics unit tests, and the
controller envtest suite pass; kustomize build renders config/prometheus
and config/grafana, and the prometheus component composes under the
default overlay's namespace/prefix transforms.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
Strengthen TestDomainMetricsExposed from a name-substring check (which the
test's own comment admitted "passes purely from collector registration")
into a true end-to-end proof of the feature's derived values.

Improvements:
- Value-assert the derived gauges for a healthy cluster instead of just
  metric names: member_count{,_ready,_healthy,_desired} == size,
  has_quorum == 1, has_leader == 1, learner_count == 0, tls_enabled == 0.
  has_quorum/has_leader are the dangerous booleans that require a real etcd
  round-trip + populated Status.Members/LeaderID, so they distinguish
  "operator fully reconciled etcd" from "operator never talked to etcd"
  (the previous member_count_desired check is sourced from spec.Size and
  needs no etcd round-trip at all).
- Assert reconcile_duration_seconds_count >= 1 (the _count child series),
  proving the reconcile-timing hook actually fired; a histogram's # HELP
  line exists with zero observations so the name check proved nothing.
- Assert tls_ready is ABSENT for the plaintext cluster — the documented
  invariant that keeps "tls configured but not ready" alerts from firing on
  plaintext clusters.
- Assert the operator actually created a PodMonitor for the cluster with
  endpoint port "client" and interval "30s". reconcilePodMonitor swallows
  all errors (CRD-absent is a silent skip), so without observing the object
  the whole podMonitor half of the feature was dead setup. The e2e suite
  installs the prometheus-operator CRDs, so the object must exist.
- Poll the scrape with wait.For instead of a one-shot read after STS
  readiness. The derived gauges are written by the async reconcile loop,
  which races STS ReadyReplicas; a one-shot scrape in that window would
  flake the new value assertions. No fixed sleeps anywhere.
- Optimize: drop the cluster from size 3 to size 1. Quorum (1/1), a leader,
  healthy members, the PodMonitor path, and every derived gauge are all
  exercised at size 1; the quorum-over-voting-members arithmetic is covered
  by the metrics package unit tests. This cuts the dominant cost (sequential
  StatefulSet bootstrap + member-join): live e2e wall-clock 202.8s -> 108.1s.
- Drop the dead `var _ = corev1.Pod{}` keepalive and its unused import, and
  rewrite the misleading "authored primarily for compile-time verification"
  doc comment to describe what the test now proves.

Live-validated on KinD (kindest/node:v1.32.0): both runs green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
A namespaced/release-scoped Prometheus selects PodMonitors by label
(e.g. podMonitorSelector matchLabels release: kvs-prometheus). The
operator-generated PodMonitor only carried fixed app.kubernetes.io/*
labels, so such a Prometheus never discovered it and downstream had to
hand-author a PodMonitor.

Add spec.metrics.podMonitor.labels (map[string]string). The labels are
merged on top of the operator-managed labels and win on conflict; nil/
empty preserves today's behavior. Regenerated CRD + deepcopy and added
unit coverage asserting release: kvs-prometheus lands on the object and
that user keys override managed defaults.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants