feat: CR-configurable per-cluster domain metrics on the Prometheus /metrics endpoint#382
feat: CR-configurable per-cluster domain metrics on the Prometheus /metrics endpoint#382xrl wants to merge 8 commits into
Conversation
Add an internal/metrics package that defines and registers custom Prometheus collectors for every reconciled EtcdCluster against the controller-runtime global registry, so they are served on the operator's existing /metrics endpoint. All series are labelled with the cluster namespace and name. Metrics: member_count_desired, member_count, member_count_ready, member_count_healthy, learner_count, has_quorum, has_leader, tls_enabled, leader_changes_total, reconcile_duration_seconds, reconcile_errors_total. Values are derived from the freshly-computed EtcdClusterStatus at the end of updateStatus (no extra API calls); reconcile duration/errors are observed in Reconcile; leader changes are tracked per cluster. Series are deleted when a cluster is removed to bound cardinality. Add spec.metrics to EtcdCluster: spec.metrics.enabled (default true) toggles/scopes the domain metrics per cluster, and spec.metrics.podMonitor makes the operator reconcile a prometheus-operator PodMonitor (owned by the cluster) for the etcd member pods. The PodMonitor is addressed via unstructured so the operator takes no hard dependency on the prometheus-operator module and degrades gracefully when the CRD is absent. Unit tests register to an isolated registry, drive state, scrape with prometheus testutil, and assert values/labels. A compile-verified e2e test creates a cluster with spec.metrics.podMonitor and scrapes /metrics. Refs etcd-io#381 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: xrl The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @xrl. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
…eletion/leaderloss Two correctness fixes in the reconcile path that fed the per-cluster metrics: - Gate ObserveReconcile/updateStatus on state != nil in the Reconcile defer. On the NotFound path fetchAndValidateState already drops every series via DeleteClusterMetrics; observing afterwards re-created reconcile_duration_seconds (and the other series via updateStatus) for the just-deleted cluster on every tombstone reconcile, defeating the cardinality bound the package exists to keep. - Clear Status.LeaderID to "" when no leader is currently known in updateStatus. Previously LeaderID was only ever set, never cleared, so has_leader (derived from LeaderID != "") stuck at 1 forever once any leader was seen, hiding the exact leaderless outage it should report. observeLeader already treats "" as 'no change', so leader_changes_total is unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
RecordClusterMetrics computed quorum as hasQuorum(MemberCount, healthy), but both totals included learners. etcd raft quorum is over voting members only, so a healthy learner could mask a degraded voting set (e.g. 2 voting / 1 healthy plus 1 healthy learner reported quorum=1 during a real outage). Track healthyVoting (healthy and not a learner) and derive voting = MemberCount - learners, then call hasQuorum(voting, healthyVoting). Replace the learner test with one that exercises the learner-exclusion case and add a fully-healthy voting-set-with-learner case. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
…a value
The e2e scrape hits the operator pod's :8080 over plain HTTP via pod-proxy, but
the default overlay serves secure HTTPS metrics on :8443 behind authn/authz, so
the scrape reached a port nothing served (or 401'd) and the test only ever
'compile-verified'. Add a config/e2e patch-metrics.yaml overriding the manager
args to --metrics-bind-address=:8080 --metrics-secure=false (composed with the
existing patch-env.yaml), so the scrape actually reaches the manager.
Also upgrade an assertion from a metric-name substring (which passes from
HELP/TYPE lines even with zero series) to a value check on
member_count_desired{name=...} 3, proving the per-cluster gauge was recorded.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
The PodMonitor reconcile (the riskiest new control flow) had no coverage. Add table of fake-client tests: created-with-expected-selector/port/interval/owner when enabled, default port when unset, CRD-absent (NoMatchError) treated as a soft skip with no panic, delete-on-disable, and NotFound-on-delete swallowed. Directly validates the headline 'never fails reconcile when the CRD is absent'. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
Completes the observability surface for the per-cluster domain metrics: - Add etcd_operator_cluster_available (mirrors the Available status condition) and etcd_operator_cluster_tls_ready (only present for TLS-configured clusters; 1 when the TLS-secured cluster is Available), wired into RecordClusterMetrics and DeleteClusterMetrics with unit tests using prometheus testutil. The has_quorum gauge and leader_changes_total counter already existed and remain the recovery feature's signals. - Add config/prometheus/alerts.yaml (PrometheusRule) with quorum-lost, no-leader, members-not-ready, tls-not-ready, reconcile-errors, leader-flapping, and not-available alerts over the domain metrics, and wire it into the config/prometheus kustomization alongside the existing ServiceMonitor. - Add config/grafana/etcd-operator-dashboard.json and a ConfigMap- generator kustomization (grafana_dashboard sidecar label) covering cluster health, membership, and reconcile stability with datasource/ namespace/cluster template variables. - Add docs/metrics.md documenting every metric, its labels, the spec.metrics CR knobs, the ServiceMonitor vs PodMonitor scrape paths, and the shipped dashboard/alert assets. Verified: go build/vet across the module, metrics unit tests, and the controller envtest suite pass; kustomize build renders config/prometheus and config/grafana, and the prometheus component composes under the default overlay's namespace/prefix transforms. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
Strengthen TestDomainMetricsExposed from a name-substring check (which the
test's own comment admitted "passes purely from collector registration")
into a true end-to-end proof of the feature's derived values.
Improvements:
- Value-assert the derived gauges for a healthy cluster instead of just
metric names: member_count{,_ready,_healthy,_desired} == size,
has_quorum == 1, has_leader == 1, learner_count == 0, tls_enabled == 0.
has_quorum/has_leader are the dangerous booleans that require a real etcd
round-trip + populated Status.Members/LeaderID, so they distinguish
"operator fully reconciled etcd" from "operator never talked to etcd"
(the previous member_count_desired check is sourced from spec.Size and
needs no etcd round-trip at all).
- Assert reconcile_duration_seconds_count >= 1 (the _count child series),
proving the reconcile-timing hook actually fired; a histogram's # HELP
line exists with zero observations so the name check proved nothing.
- Assert tls_ready is ABSENT for the plaintext cluster — the documented
invariant that keeps "tls configured but not ready" alerts from firing on
plaintext clusters.
- Assert the operator actually created a PodMonitor for the cluster with
endpoint port "client" and interval "30s". reconcilePodMonitor swallows
all errors (CRD-absent is a silent skip), so without observing the object
the whole podMonitor half of the feature was dead setup. The e2e suite
installs the prometheus-operator CRDs, so the object must exist.
- Poll the scrape with wait.For instead of a one-shot read after STS
readiness. The derived gauges are written by the async reconcile loop,
which races STS ReadyReplicas; a one-shot scrape in that window would
flake the new value assertions. No fixed sleeps anywhere.
- Optimize: drop the cluster from size 3 to size 1. Quorum (1/1), a leader,
healthy members, the PodMonitor path, and every derived gauge are all
exercised at size 1; the quorum-over-voting-members arithmetic is covered
by the metrics package unit tests. This cuts the dominant cost (sequential
StatefulSet bootstrap + member-join): live e2e wall-clock 202.8s -> 108.1s.
- Drop the dead `var _ = corev1.Pod{}` keepalive and its unused import, and
rewrite the misleading "authored primarily for compile-time verification"
doc comment to describe what the test now proves.
Live-validated on KinD (kindest/node:v1.32.0): both runs green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
A namespaced/release-scoped Prometheus selects PodMonitors by label (e.g. podMonitorSelector matchLabels release: kvs-prometheus). The operator-generated PodMonitor only carried fixed app.kubernetes.io/* labels, so such a Prometheus never discovered it and downstream had to hand-author a PodMonitor. Add spec.metrics.podMonitor.labels (map[string]string). The labels are merged on top of the operator-managed labels and win on conflict; nil/ empty preserves today's behavior. Regenerated CRD + deepcopy and added unit coverage asserting release: kvs-prometheus lands on the object and that user keys override managed defaults. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
Refs #381
What
Exposes custom per-cluster domain metrics for every reconciled
EtcdClusteron the operator's existing Prometheus/metricsendpoint, and makes Prometheus observability configurable via the CR (spec.metrics).The operator already serves a controller-runtime metrics endpoint backed by a global registry; this PR registers custom collectors against that same registry, so there is no new HTTP server or port. Every series is labelled with the cluster
namespaceandname, so a single operator managing many clusters yields one time series per cluster.Metrics (subsystem
etcd_operator_cluster_)member_count_desiredspec.size)member_countstatus.memberCount)member_count_readyreadyReplicas)member_count_healthylearner_counthas_quorumhas_leadertls_enabledleader_changes_totalreconcile_duration_secondsreconcile_errors_totalGauge values are derived from the freshly-computed
EtcdClusterStatusat the end ofupdateStatus— no extra API calls. Reconcile duration/errors are observed inReconcile. Leader changes are tracked per cluster (the first observation and leaderless windows do not count, so operator restarts don't inflate the counter). When a cluster is deleted, its series are dropped to bound cardinality.CR config knob (
spec.metrics)spec.metrics.enabled(*bool, default true) toggles/scopes the operator domain metrics per cluster.spec.metrics.podMonitor, when enabled, makes the operator reconcile aPodMonitor(owned by theEtcdCluster, so GC'd on delete) selecting the etcd member pods (app/controllerlabels) so Prometheus scrapes etcd's own/metrics.Why unstructured for the PodMonitor
The PodMonitor is created via
unstructuredso the operator takes no hard dependency on the prometheus-operator Go module and does not require its CRDs to be installed unless a user opts in. If the CRD is absent, the operator logs and skips PodMonitor reconciliation without failing the rest of the reconcile (meta.IsNoMatchError).Files
internal/metrics/— collector definitions, registration (MustRegister/RegisterTo),RecordClusterMetrics,ObserveReconcile,DeleteClusterMetrics, per-cluster leader tracker.internal/controller/podmonitor.go— PodMonitor reconcile/cleanup.api/v1alpha1/etcdcluster_types.go—MetricsSpec/PodMonitorSpec+MetricsEnabled()/PodMonitorEnabled()helpers (CRD/deepcopy regenerated via the pinned controller-gen).internal/controller/etcdcluster_controller.go— minimal, localized instrumentation (time the loop, record metrics after status, reconcile PodMonitor, drop series on delete) + PodMonitor RBAC marker.cmd/main.go— register collectors at startup.test/e2e/metrics_test.go— compile-verified e2e that creates a cluster withspec.metrics.podMonitorand scrapes the operator/metrics.config/samples/sample_metrics_etcdcluster.yaml— example CR.Testing
internal/metrics/metrics_test.go) register collectors to an isolated registry, drive cluster state, scrape viaprometheus/client_golang/testutil, and assert exact values + labels — including quorum/leader edge cases, leader-change counting across leaderless windows, reconcile error counting, and series deletion (and leader-tracker reset) on cluster delete. Spec-helper defaults are covered too.go build ./...,go vet ./...,go vet ./test/e2e/..., and the controller envtest suite all pass.make test-e2e.🤖 Generated with Claude Code