Skip to content

feat: automatic quorum-loss disaster recovery#387

Open
xrl wants to merge 6 commits into
etcd-io:mainfrom
xrl:pr/quorum-loss-recovery
Open

feat: automatic quorum-loss disaster recovery#387
xrl wants to merge 6 commits into
etcd-io:mainfrom
xrl:pr/quorum-loss-recovery

Conversation

@xrl

@xrl xrl commented Jun 18, 2026

Copy link
Copy Markdown

Detects sustained quorum loss on an EtcdCluster (majority of members unreachable / no electable leader for a sustained window) and automatically recovers by rebuilding from a surviving member (snapshot-restore / --force-new-cluster), then re-adds members one at a time via the existing learner path. Guarded to trigger only on true quorum loss (not transient single-member failures), idempotent, with a Recovering status Condition + Events and detailed logging.

Includes unit tests for the detection + recovery state transitions, and a self-contained e2e (test/e2e/quorum_recovery_test.go) that breaks a 3-node cluster and asserts recovery + data integrity. The e2e is authored and compile-verified; the live kind run is still pending (Docker was unavailable in the build environment).

Refs #383

Detect when an EtcdCluster has lost quorum — a majority of members
unreachable with no leader electable, sustained past a grace window — and
automatically recover it using etcd's documented disaster-recovery path:
rebuild a single-member cluster from a surviving member with
--force-new-cluster, then re-add the remaining members one at a time via
the operator's existing learner-add path.

The recovery is an idempotent controller state machine (Detecting →
Rebuilding → ScalingOut → Completed) guarded so it ONLY triggers on true
quorum loss, never on transient single-member failures or size-1 clusters.
Progress is surfaced via a new Recovering status Condition, a Recovery
status block, and Events.

- internal/controller/quorum_recovery.go: detection (assessQuorum) +
  state machine + force-new-cluster rebuild, all self-contained.
- etcdcluster_controller.go: one localized hook in the health/reconcile
  path that routes a failed health check through the recovery gate.
- api: EtcdClusterStatus.Recovery (RecoveryStatus/RecoveryPhase);
  CRD + deepcopy regenerated via controller-gen.
- Unit tests: table-driven detection guard + state-machine transitions
  (fake client). New e2e test/e2e/quorum_recovery_test.go breaks a
  3-member cluster (delete 2 pods + PVCs) and asserts recovery to a
  healthy 3-member cluster with data intact.

Refs etcd-io#383

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
@k8s-ci-robot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xrl
Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Copy link
Copy Markdown

Hi @xrl. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

xrl and others added 5 commits June 18, 2026 20:02
…build config

Three correctness fixes to the quorum-loss recovery state machine, all on the
detection/rebuild path:

- assessQuorum: exclude learners from the reachable (voting) count. A learner
  answers health probes but cannot vote, so a surviving voter + healthy learner
  was wrongly read as "electable" when real voting quorum is gone. Aligns with
  countHealthyVoting, which already excludes learners.

- recoveryScaleOut: do not delegate to reconcileClusterState when the member
  list is unusable this loop (memberListErr != nil or memberListResp == nil).
  Previously a transient member-list blip during ScalingOut made memberCnt=0,
  tripping the scale-IN branch and shrinking the StatefulSet mid-rebuild,
  reversing recovery. We now hold ownership and requeue. The member-list error
  is plumbed through advanceRecovery.

- recoveryRebuild: rewrite the <name>-state ConfigMap to a single-member "new"
  bootstrap before injecting --force-new-cluster, so pod-0's environment is
  consistent with the single-member rebuild instead of advertising a contradictory
  3-member "existing" cluster. --force-new-cluster masks this when pod-0's data
  dir survives, but an empty/replaced PVC would otherwise hang on a 3-member
  "existing" bootstrap. The scale-out path restores the multi-member state as
  members are re-added.

Also adds a clusterHealthFn seam on the reconciler (defaults to
etcdutils.ClusterHealth) so the survivor-health gate is unit-testable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
…g removal

Unit tests for the three correctness fixes:

- assessQuorum: a surviving voter + healthy learner with no leader is quorum LOST
  (proves learners don't satisfy quorum).
- recoveryScaleOut: an unusable member list (memberListErr set) holds and requeues
  instead of delegating, so reconcileClusterState never scales the STS down during
  recovery. Existing ScalingOut cases now seed a real memberListResp.
- recoveryRebuild: drives Detecting -> Rebuilding -> (survivor healthy, self-leader)
  -> ScalingOut via the clusterHealthFn seam, asserting --force-new-cluster and its
  marker annotation are removed from the live StatefulSet and that a
  RecoveryScalingOut event is emitted (FakeRecorder). Also asserts the state
  ConfigMap is rewritten to single-member "new" on rebuild entry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
The quorum-recovery e2e previously read the sentinel back from the untouched
survivor (etcd-0) and asserted only phase=Completed + readiness, so a no-op
recovery — or one that left --force-new-cluster permanently on the StatefulSet —
would pass. Add two high-value assertions:

- After Completed, assert the StatefulSet carries neither the
  operator.etcd.io/force-new-cluster annotation nor the --force-new-cluster arg,
  guarding against the worst failure mode (pod-0 re-forking the cluster on its
  next restart).
- Read the sentinel from a destroyed-and-re-added member (etcd-1) in addition to
  the survivor, proving the rebuilt cluster actually replicated to the rejoined
  member.

Live validation deferred: Docker/kind unavailable in this environment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
Quorum-loss recovery via --force-new-cluster retains only what the chosen
survivor had committed locally; any write the destroyed majority had
committed but not yet replicated to the survivor is gone. That loss was
previously silent. Make it loud and auditable, and grant the operator the
RBAC its recovery path actually uses.

Data-loss accounting:
- New status.recovery.dataLoss (DataLossInfo): survivor member id, retained
  revision, raft index, recovered time, and an operator-facing message,
  captured once when the rebuilt single-member cluster first reports healthy
  and self-leader (so a later post-recovery write can't inflate the recorded
  revision).
- New DataLossPossible condition, set True and LEFT True as an audit marker
  awaiting human acknowledgement (distinct from the transient Recovering
  condition).
- A Warning PossibleDataLoss Event plus a structured "POSSIBLE DATA LOSS" log,
  and a data-loss caveat added to the RecoveryStarted Event.

Recovery counter:
- New status.recovery.attempts, a durable, never-reset count bumped each time
  the operator commits to a destructive rebuild — repeated bumps point at an
  underlying infra problem. Surfaced via the existing Event channel (no new
  metrics framework, to stay cohesive with this branch).

RBAC + survivor gate:
- Add pods get;list;watch kubebuilder marker (regenerated role.yaml) and make
  it load-bearing: before arming the irreversible --force-new-cluster flag the
  rebuild now confirms the survivor pod exists, holding/requeueing otherwise so
  we never fork from a not-yet-created / freshly-replaced ordinal-0.

Tests:
- Unit: rebuild records dataLoss (id/revision/raftIndex), sets DataLossPossible
  True, emits the Warning event, bumps attempts, and does not overwrite the
  captured revision on later loops; survivor-pod-missing holds without arming
  the flag. Existing rebuild tests updated to create the survivor pod.
- e2e (compile-only; Docker unavailable): assert DataLossPossible + dataLoss
  accounting are surfaced after recovery.

Deferred: deeper false-trigger / survivor-selection redesign.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
The quorum-loss e2e deleted a majority of members' pods+PVCs once and
expected permanent quorum loss. Live-validation showed that never triggers
recovery: the StatefulSet (OrderedReady, Retain) eagerly recreates the
empty pods, and etcd's persisted raft membership lets each rejoin and
re-form quorum within seconds — the survivor reports a leader throughout,
so assessQuorum (correctly) classifies it as recoverable-by-restart and
the operator does nothing. The unmodified test failed at the detection
assess after 345s, having observed no recovery; this was a test bug, not a
product bug (no controller change needed).

Drive a real, sustained majority outage and strengthen the proof:

- holdMajorityDown: force-delete (grace=0) the recreated majority pods+PVCs
  on a tight loop until the operator COMMITS to recovery
  (Rebuilding/ScalingOut/Completed). This keeps a majority continuously
  unreachable past the 30s linearizable member-list timeout + 60s grace —
  the only state that is true, sustained quorum loss. The hold ends the
  moment the operator scales to the single survivor, so it never fights the
  rebuild and never touches ordinal 0. Replaces the racy
  waitForRecoveringObserved (which could pass having seen only the finished
  phase) with a durable, committed-phase signal.

- Strengthen assertions per adversarial review:
  * re-added member is read node-LOCAL + serializable (--endpoints=
    127.0.0.1 --consistency=s) so it provably reads its own store instead
    of being routed to the leader/survivor;
  * write 6 keys and assert the full keyspace replicated to the re-added
    member, not just one round-trip;
  * assert all members share ONE cluster id (catches a --force-new-cluster
    re-fork / split-brain the STS-spec check cannot);
  * assert SurvivorRevision > 1 (the load-bearing retained-revision
    accounting), Attempts >= 1, and a loss-describing message;
  * assert the survivor's PVC exists before the break (the precondition the
    rebuild relies on).

- Speed/robustness: drop the redundant setup waitForNoLearners; tighten the
  completion poll to 5s.

Live: unmodified test FAIL (345s, no recovery observed); improved test PASS
(212s) — recovery committed to Rebuilding, completed to 3 healthy members,
single shared cluster id, data-loss surfaced (revision 7, attempts 1),
sentinel + full keyspace replicated to the re-added member.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants