feat: automatic quorum-loss disaster recovery#387
Conversation
Detect when an EtcdCluster has lost quorum — a majority of members unreachable with no leader electable, sustained past a grace window — and automatically recover it using etcd's documented disaster-recovery path: rebuild a single-member cluster from a surviving member with --force-new-cluster, then re-add the remaining members one at a time via the operator's existing learner-add path. The recovery is an idempotent controller state machine (Detecting → Rebuilding → ScalingOut → Completed) guarded so it ONLY triggers on true quorum loss, never on transient single-member failures or size-1 clusters. Progress is surfaced via a new Recovering status Condition, a Recovery status block, and Events. - internal/controller/quorum_recovery.go: detection (assessQuorum) + state machine + force-new-cluster rebuild, all self-contained. - etcdcluster_controller.go: one localized hook in the health/reconcile path that routes a failed health check through the recovery gate. - api: EtcdClusterStatus.Recovery (RecoveryStatus/RecoveryPhase); CRD + deepcopy regenerated via controller-gen. - Unit tests: table-driven detection guard + state-machine transitions (fake client). New e2e test/e2e/quorum_recovery_test.go breaks a 3-member cluster (delete 2 pods + PVCs) and asserts recovery to a healthy 3-member cluster with data intact. Refs etcd-io#383 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: xrl The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @xrl. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
…build config Three correctness fixes to the quorum-loss recovery state machine, all on the detection/rebuild path: - assessQuorum: exclude learners from the reachable (voting) count. A learner answers health probes but cannot vote, so a surviving voter + healthy learner was wrongly read as "electable" when real voting quorum is gone. Aligns with countHealthyVoting, which already excludes learners. - recoveryScaleOut: do not delegate to reconcileClusterState when the member list is unusable this loop (memberListErr != nil or memberListResp == nil). Previously a transient member-list blip during ScalingOut made memberCnt=0, tripping the scale-IN branch and shrinking the StatefulSet mid-rebuild, reversing recovery. We now hold ownership and requeue. The member-list error is plumbed through advanceRecovery. - recoveryRebuild: rewrite the <name>-state ConfigMap to a single-member "new" bootstrap before injecting --force-new-cluster, so pod-0's environment is consistent with the single-member rebuild instead of advertising a contradictory 3-member "existing" cluster. --force-new-cluster masks this when pod-0's data dir survives, but an empty/replaced PVC would otherwise hang on a 3-member "existing" bootstrap. The scale-out path restores the multi-member state as members are re-added. Also adds a clusterHealthFn seam on the reconciler (defaults to etcdutils.ClusterHealth) so the survivor-health gate is unit-testable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
…g removal Unit tests for the three correctness fixes: - assessQuorum: a surviving voter + healthy learner with no leader is quorum LOST (proves learners don't satisfy quorum). - recoveryScaleOut: an unusable member list (memberListErr set) holds and requeues instead of delegating, so reconcileClusterState never scales the STS down during recovery. Existing ScalingOut cases now seed a real memberListResp. - recoveryRebuild: drives Detecting -> Rebuilding -> (survivor healthy, self-leader) -> ScalingOut via the clusterHealthFn seam, asserting --force-new-cluster and its marker annotation are removed from the live StatefulSet and that a RecoveryScalingOut event is emitted (FakeRecorder). Also asserts the state ConfigMap is rewritten to single-member "new" on rebuild entry. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
The quorum-recovery e2e previously read the sentinel back from the untouched survivor (etcd-0) and asserted only phase=Completed + readiness, so a no-op recovery — or one that left --force-new-cluster permanently on the StatefulSet — would pass. Add two high-value assertions: - After Completed, assert the StatefulSet carries neither the operator.etcd.io/force-new-cluster annotation nor the --force-new-cluster arg, guarding against the worst failure mode (pod-0 re-forking the cluster on its next restart). - Read the sentinel from a destroyed-and-re-added member (etcd-1) in addition to the survivor, proving the rebuilt cluster actually replicated to the rejoined member. Live validation deferred: Docker/kind unavailable in this environment. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
Quorum-loss recovery via --force-new-cluster retains only what the chosen survivor had committed locally; any write the destroyed majority had committed but not yet replicated to the survivor is gone. That loss was previously silent. Make it loud and auditable, and grant the operator the RBAC its recovery path actually uses. Data-loss accounting: - New status.recovery.dataLoss (DataLossInfo): survivor member id, retained revision, raft index, recovered time, and an operator-facing message, captured once when the rebuilt single-member cluster first reports healthy and self-leader (so a later post-recovery write can't inflate the recorded revision). - New DataLossPossible condition, set True and LEFT True as an audit marker awaiting human acknowledgement (distinct from the transient Recovering condition). - A Warning PossibleDataLoss Event plus a structured "POSSIBLE DATA LOSS" log, and a data-loss caveat added to the RecoveryStarted Event. Recovery counter: - New status.recovery.attempts, a durable, never-reset count bumped each time the operator commits to a destructive rebuild — repeated bumps point at an underlying infra problem. Surfaced via the existing Event channel (no new metrics framework, to stay cohesive with this branch). RBAC + survivor gate: - Add pods get;list;watch kubebuilder marker (regenerated role.yaml) and make it load-bearing: before arming the irreversible --force-new-cluster flag the rebuild now confirms the survivor pod exists, holding/requeueing otherwise so we never fork from a not-yet-created / freshly-replaced ordinal-0. Tests: - Unit: rebuild records dataLoss (id/revision/raftIndex), sets DataLossPossible True, emits the Warning event, bumps attempts, and does not overwrite the captured revision on later loops; survivor-pod-missing holds without arming the flag. Existing rebuild tests updated to create the survivor pod. - e2e (compile-only; Docker unavailable): assert DataLossPossible + dataLoss accounting are surfaced after recovery. Deferred: deeper false-trigger / survivor-selection redesign. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
The quorum-loss e2e deleted a majority of members' pods+PVCs once and
expected permanent quorum loss. Live-validation showed that never triggers
recovery: the StatefulSet (OrderedReady, Retain) eagerly recreates the
empty pods, and etcd's persisted raft membership lets each rejoin and
re-form quorum within seconds — the survivor reports a leader throughout,
so assessQuorum (correctly) classifies it as recoverable-by-restart and
the operator does nothing. The unmodified test failed at the detection
assess after 345s, having observed no recovery; this was a test bug, not a
product bug (no controller change needed).
Drive a real, sustained majority outage and strengthen the proof:
- holdMajorityDown: force-delete (grace=0) the recreated majority pods+PVCs
on a tight loop until the operator COMMITS to recovery
(Rebuilding/ScalingOut/Completed). This keeps a majority continuously
unreachable past the 30s linearizable member-list timeout + 60s grace —
the only state that is true, sustained quorum loss. The hold ends the
moment the operator scales to the single survivor, so it never fights the
rebuild and never touches ordinal 0. Replaces the racy
waitForRecoveringObserved (which could pass having seen only the finished
phase) with a durable, committed-phase signal.
- Strengthen assertions per adversarial review:
* re-added member is read node-LOCAL + serializable (--endpoints=
127.0.0.1 --consistency=s) so it provably reads its own store instead
of being routed to the leader/survivor;
* write 6 keys and assert the full keyspace replicated to the re-added
member, not just one round-trip;
* assert all members share ONE cluster id (catches a --force-new-cluster
re-fork / split-brain the STS-spec check cannot);
* assert SurvivorRevision > 1 (the load-bearing retained-revision
accounting), Attempts >= 1, and a loss-describing message;
* assert the survivor's PVC exists before the break (the precondition the
rebuild relies on).
- Speed/robustness: drop the redundant setup waitForNoLearners; tighten the
completion poll to 5s.
Live: unmodified test FAIL (345s, no recovery observed); improved test PASS
(212s) — recovery committed to Rebuilding, completed to 3 healthy members,
single shared cluster id, data-loss surfaced (revision 7, attempts 1),
sentinel + full keyspace replicated to the re-added member.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
Detects sustained quorum loss on an
EtcdCluster(majority of members unreachable / no electable leader for a sustained window) and automatically recovers by rebuilding from a surviving member (snapshot-restore /--force-new-cluster), then re-adds members one at a time via the existing learner path. Guarded to trigger only on true quorum loss (not transient single-member failures), idempotent, with aRecoveringstatus Condition + Events and detailed logging.Includes unit tests for the detection + recovery state transitions, and a self-contained e2e (
test/e2e/quorum_recovery_test.go) that breaks a 3-node cluster and asserts recovery + data integrity. The e2e is authored and compile-verified; the live kind run is still pending (Docker was unavailable in the build environment).Refs #383