In the case of a network partition or a complete data-center failure, the remaining etcd members must still hold quorum to continue operating. Since quorum is a majority of the members, there is no way to split members across two data centers so that failover always succeeds. E.g. with 3 members in a 2+1 split, if the DC with 2 members fails, the remaining single member won't be operational. As for the other part of your question, there's always a load balancer of some sort in front of the Kubernetes control plane, and you can configure it to direct traffic to the available DC. You could split your control-plane nodes across 3 DCs/AZs; in that case a failure of a single DC still leaves the other two DCs operational. Keep in mind, though, that from etcd's point of view a loss of network connectivity is equivalent to a failure, so at any moment two DCs must remain connected. etcd is also sensitive to network latency, which is why these failure domains are usually availability zones.
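To make the majority math above concrete, here is a small sketch computing quorum and tolerated failures for common cluster sizes (quorum is floor(n/2)+1, and the cluster survives n minus quorum failures):

```shell
# For an etcd cluster of n members, quorum is floor(n/2)+1 and the
# cluster tolerates (n - quorum) simultaneous member failures.
for n in 1 3 5; do
  quorum=$(( n / 2 + 1 ))
  tolerated=$(( n - quorum ))
  echo "members=$n quorum=$quorum tolerated_failures=$tolerated"
done
# members=1 quorum=1 tolerated_failures=0
# members=3 quorum=2 tolerated_failures=1
# members=5 quorum=3 tolerated_failures=2
```

This is exactly why a 2+1 split across two DCs fails: losing the 2-member DC leaves 1 member, which is below the quorum of 2.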
I have a scenario where I want to deploy a Talos cluster spanning two different physical data centers for increased availability. The obvious problem with this in Kubernetes is etcd's need for quorum.
So, what I'm wondering is whether there is any good way to "manually fail over" the control plane in a Talos Linux cluster in the event that the primary data center goes down. Would simply running `talosctl etcd remove-member` to remove the other nodes work?
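For context, a sketch of what that manual step might look like, with placeholder node names and the caveat that the exact commands should be checked against the Talos documentation for your version. Note that etcd membership changes are themselves committed through Raft, so `remove-member` only works while the cluster still has quorum; once quorum is already lost, the Talos docs describe a separate disaster-recovery path based on restoring from an etcd snapshot instead.

```shell
# While quorum still holds: list members from a surviving control-plane node,
# then remove the unreachable members from the lost data center.
talosctl -n <surviving-node-ip> etcd members
talosctl -n <surviving-node-ip> etcd remove-member <lost-node-hostname>
```

If the 2-member DC is already gone in a 2+1 layout, these commands cannot succeed on the lone survivor, which is the crux of the quorum problem discussed above.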