Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .github/agents/documentation-agent.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
description: 'Agent for documentation tasks in the DocumentDB Kubernetes Operator project.'
tools: [execute, read, terminal]
tools: [execute, read, edit]
---
# Documentation Agent Instructions

Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,350 @@
---
title: Multi-region failover procedures
description: Step-by-step runbooks for planned and unplanned DocumentDB failovers across regions, including verification and rollback procedures.
tags:
- multi-region
- failover
- disaster-recovery
- operations
---

## Overview

A **failover** promotes a replica Kubernetes cluster to become the new primary, making it
accept write operations. The previous primary (if still available) becomes a replica
replicating from the new primary.

**When to perform failover:**

- **Planned maintenance:** Region maintenance, infrastructure upgrades, cost optimization
- **Disaster recovery:** Primary region outage, network partition, catastrophic failure
- **Performance optimization:** Move primary closer to write-heavy workload
- **Testing:** Validate disaster recovery procedures

## Failover types

### Planned failover

This is a failover where the primary is safely demoted, writes are all flushed,
and then the new primary is promoted. This kind of failover has no data loss, a
set window where writes aren't accepted, and the same number of replicas before
and after.

### Unplanned failover (disaster recovery)

This is a failover where the primary becomes unavailable and has to be forced out
of the DocumentDB cluster entirely. Downtime depends on how quickly primary degradation
is detected and, if HA is enabled, how long it takes to scale up the new primary.
Some writes to the failed primary might be lost, but with high availability enabled,
clients can determine when writes were not committed to replicas. After an unplanned
failover, the DocumentDB cluster has one fewer region, and you will need to add
the region back when the failed Kubernetes cluster is back online, or add a replacement
region. See the [add region playground guide](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/fleet-add-region/README.md)
for an example.

## Prerequisites

Before performing any failover:

- **Replica health:** The target replica Kubernetes cluster is running and replication is current
- **Network access:** You have `kubectl` access to all Kubernetes clusters involved
- **Backup available:** Recent backup exists for rollback if needed
- **Monitoring:** Metrics and logs are accessible for verification
- **Communication:** Stakeholders are notified (for planned failover)
- **Application readiness:** Applications can handle brief connection interruption
- **kubectl-documentdb plugin:** Install the plugin for streamlined failover operations in [kubectl-plugin](../kubectl-plugin.md)

### Check current replication status

Identify the current primary and verify replica health:

```bash
# View current primary setting
kubectl --context hub get documentdb documentdb-preview \
-n documentdb-preview-ns -o jsonpath='{.spec.clusterReplication.primary}'

# Check replication status on primary
kubectl --context primary exec -it -n documentdb-preview-ns \
documentdb-preview-1 -- psql -U postgres -c "SELECT * FROM pg_stat_replication;"
```

Expected output shows active replication to all replicas:

```text
pid | usename | application_name | client_addr | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | sync_state
-----+----------+------------------+-------------+-----------+------------+------------+------------+------------+------------
123 | postgres | replica1 | 10.2.1.5 | streaming | 0/30000A8 | 0/30000A8 | 0/30000A8 | 0/30000A8 | async
124 | postgres | replica2 | 10.3.1.5 | streaming | 0/30000A8 | 0/30000A8 | 0/30000A8 | 0/30000A8 | async
```

**Key indicators:**

- **state:** Should be `streaming`
- **LSN values:** `replay_lsn` should be close to `sent_lsn` (low replication lag)

## Planned failover procedure

Use this procedure when the primary Kubernetes cluster is healthy and you want to switch primary regions in a controlled manner.

### Step 1: Pre-failover verification

Verify system health before starting:

```bash
# 1. Check all DocumentDB clusters are ready
kubectl --context hub get documentdb -A

# 2. Verify replication lag is low (< 1 second)
kubectl --context current-primary exec -it -n documentdb-preview-ns \
documentdb-preview-1 -- psql -U postgres -c \
"SELECT client_addr, state, replay_lag FROM pg_stat_replication;"

# 3. Check target replica is healthy
kubectl --context new-primary get pods -n documentdb-preview-ns
```

All checks should show healthy status before proceeding.

### Step 2: Perform failover

Promote the replica to become the new primary Kubernetes cluster.

=== "Plugin"

!!! note
KubeFleet deployment required
The plugin handles the CRD change and automatically waits for convergence:

```bash
kubectl documentdb promote \
--documentdb documentdb-preview \
--namespace documentdb-preview-ns \
--target-cluster new-primary-cluster-name \
--hub-context hub
```

=== "kubectl patch (KubeFleet)"

Update the DocumentDB resource on the hub Kubernetes cluster:

```bash
kubectl --context hub patch documentdb documentdb-preview \
-n documentdb-preview-ns \
--type='merge' \
-p '{"spec":{"clusterReplication":{"primary":"new-primary-cluster-name"}}}'
```

The fleet controller propagates the change to all member Kubernetes clusters automatically.

=== "kubectl patch (manual)"

Update the DocumentDB resource on **all** Kubernetes clusters:

```bash
# Update on all Kubernetes clusters (use a loop or run individually)
for context in cluster1 cluster2 cluster3; do
kubectl --context "$context" patch documentdb documentdb-preview \
-n documentdb-preview-ns \
--type='merge' \
-p '{"spec":{"clusterReplication":{"primary":"new-primary-cluster-name"}}}'
done
```

**What happens:**

1. The operator detects the primary change
2. The old primary becomes a replica after flushing writes
3. The new primary Kubernetes cluster scales up (if HA) and starts to accept writes
4. Replication direction reverses (new primary → replicas including old primary)

### Step 3: Monitor failover progress

Watch operator logs and DocumentDB status:

```bash
# Watch DocumentDB status on new primary
watch kubectl --context new-primary get documentdb -n documentdb-preview-ns

# Monitor operator logs on new primary
kubectl --context new-primary logs -n documentdb-operator \
-l app.kubernetes.io/name=documentdb-operator -f

# Check pod status
kubectl --context new-primary get pods -n documentdb-preview-ns -w
```

### Step 4: Verify promoted primary

Confirm the new primary accepts writes:

```bash
# Port forward to new primary
kubectl --context new-primary port-forward \
-n documentdb-preview-ns svc/documentdb-preview 10260:10260 &

# Connect with mongosh
mongosh "mongodb://admin:password@localhost:10260/?tls=true&tlsAllowInvalidCertificates=true"

# Test write operation
db.testCollection.insertOne({
message: "Write test after failover",
timestamp: new Date()
})

# Should succeed without errors
```

### Step 5: Verify old primary as replica

Check that the old primary is now replicating from the new primary:

```bash
# Verify replication status ON NEW PRIMARY
kubectl --context new-primary exec -it -n documentdb-preview-ns \
documentdb-preview-1 -- psql -U postgres -c "SELECT * FROM pg_stat_replication;"
```

You should see the old primary listed as a replica receiving replication stream.

### Step 6: Post-failover validation

Run comprehensive checks:

```bash
# 1. Verify all Kubernetes clusters are in sync
for context in cluster1 cluster2 cluster3; do
echo "=== $context ==="
kubectl --context "$context" get documentdb -n documentdb-preview-ns
done

# 2. Check application health
kubectl --context new-primary get pods -n app-namespace

# 3. Review metrics and logs for errors
# (use your monitoring system, such as Prometheus, Grafana, or CloudWatch)

# 4. Verify data consistency (read from all replicas)
```

## Unplanned failover procedure (disaster recovery)

Use this procedure when the primary Kubernetes cluster is unavailable and you need to immediately promote a replica.

!!! danger "Data loss risk"
Unplanned failover may result in data loss if the primary DocumentDB cluster failed before replicating recent writes. Assess replication lag before deciding which replica to promote.

### Step 1: Assess the situation

Determine the scope of the outage:

```bash
# 1. Check primary Kubernetes cluster accessibility
kubectl --context primary get nodes
# If this fails, the primary Kubernetes cluster is unreachable

# 2. Check replica Kubernetes cluster health
kubectl --context replica1 get documentdb -n documentdb-preview-ns
kubectl --context replica2 get documentdb -n documentdb-preview-ns

# 3. Check cloud provider status pages for regional outages
```

### Step 2: Select target replica

Choose which replica to promote based on:

- **Replication lag:** Prefer the replica with the lowest lag (most recent data)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to provide any command?

- **Geographic location:** Consider application proximity
- **Kubernetes cluster health:** Ensure the target Kubernetes cluster is fully operational

If you cannot query the primary, check the last known replication status from monitoring dashboards or logs.

### Step 3: Promote replica to primary

Immediately promote the selected replica to become the new primary.

=== "Plugin"

!!! note
KubeFleet deployment required

```bash
kubectl documentdb promote \
--documentdb documentdb-preview \
--namespace documentdb-preview-ns \
--target-cluster replica-cluster-name \
--hub-context hub \
--failover \
--wait-timeout 15m
```

The plugin handles the change to `clusterList` and `primary` and monitors for
successful convergence. Use `--skip-wait` if you need to return immediately
and verify manually.

=== "kubectl patch (KubeFleet)"

```bash
# Remove failed primary from cluster list and set new primary in one command
kubectl --context hub patch documentdb documentdb-preview \
-n documentdb-preview-ns \
--type='merge' \
-p '{"spec":{"clusterReplication":{"primary":"replica-cluster-name","clusterList":[{"name":"replica-cluster-name"},{"name":"other-replica-cluster-name"}]}}}'
```

Replace the `clusterList` entries with your actual list of healthy Kubernetes clusters, excluding the failed primary.

=== "kubectl patch (manual)"

```bash
# Update on all accessible Kubernetes clusters
# Remove failed primary from cluster list and set new primary in one command
for context in replica1 replica2; do
kubectl --context "$context" patch documentdb documentdb-preview \
-n documentdb-preview-ns \
--type='merge' \
-p '{"spec":{"clusterReplication":{"primary":"replica-cluster-name","clusterList":[{"name":"replica-cluster-name"},{"name":"other-replica-cluster-name"}]}}}'
done
```

Replace the `clusterList` entries with your actual list of healthy Kubernetes clusters, excluding the failed primary.

**What happens:**

1. The operator detects the primary and cluster list changes
2. The new primary Kubernetes cluster scales up (if HA) and starts to accept writes
3. The old primary is removed from replication

### Step 4: Verify new primary

Confirm the promoted replica is accepting writes:

```bash
# Check status
kubectl --context new-primary get documentdb documentdb-preview \
-n documentdb-preview-ns

# Test write access
kubectl --context new-primary port-forward \
-n documentdb-preview-ns svc/documentdb-preview 10260:10260 &

mongosh "mongodb://admin:password@localhost:10260/?tls=true&tlsAllowInvalidCertificates=true"
db.testCollection.insertOne({message: "DR failover test"})
```

### Step 5: Monitor recovery

```bash
# Application pod logs
kubectl --context app-cluster logs -l app=your-app --tail=100 -f

# DocumentDB operator logs
kubectl --context new-primary logs -n documentdb-operator \
-l app.kubernetes.io/name=documentdb-operator -f
```

### Step 6: Handle failed primary recovery

When the failed primary Kubernetes cluster recovers, you need to re-add it to the DocumentDB cluster
as a replica. For detailed guidance on adding a region back to your DocumentDB cluster,
see the [add region playground guide](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/fleet-add-region/README.md).
Loading
Loading