-
Notifications
You must be signed in to change notification settings - Fork 18
DOCS Multi-region #314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
alaye-ms
wants to merge
3
commits into
documentdb:main
Choose a base branch
from
alaye-ms:users/alaye/multi-region-docs
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
DOCS Multi-region #314
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
350 changes: 350 additions & 0 deletions
350
...tor-public-documentation/preview/multi-region-deployment/failover-procedures.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,350 @@ | ||
| --- | ||
| title: Multi-region failover procedures | ||
| description: Step-by-step runbooks for planned and unplanned DocumentDB failovers across regions, including verification and rollback procedures. | ||
| tags: | ||
| - multi-region | ||
| - failover | ||
| - disaster-recovery | ||
| - operations | ||
| --- | ||
|
|
||
| ## Overview | ||
|
|
||
| A **failover** promotes a replica Kubernetes cluster to become the new primary, making it | ||
| accept write operations. The previous primary (if still available) becomes a replica | ||
| replicating from the new primary. | ||
|
|
||
| **When to perform failover:** | ||
|
|
||
| - **Planned maintenance:** Region maintenance, infrastructure upgrades, cost optimization | ||
| - **Disaster recovery:** Primary region outage, network partition, catastrophic failure | ||
| - **Performance optimization:** Move primary closer to write-heavy workload | ||
| - **Testing:** Validate disaster recovery procedures | ||
|
|
||
| ## Failover types | ||
|
|
||
| ### Planned failover | ||
|
|
||
| This is a failover where the primary is safely demoted, writes are all flushed, | ||
| and then the new primary is promoted. This kind of failover has no data loss, a | ||
| set window where writes aren't accepted, and the same number of replicas before | ||
| and after. | ||
|
|
||
| ### Unplanned failover (disaster recovery) | ||
|
|
||
| This is a failover where the primary becomes unavailable and has to be forced out | ||
| of the DocumentDB cluster entirely. Downtime depends on how quickly primary degradation | ||
| is detected and, if HA is enabled, how long it takes to scale up the new primary. | ||
| Some writes to the failed primary might be lost, but with high availability enabled, | ||
| clients can determine when writes were not committed to replicas. After an unplanned | ||
| failover, the DocumentDB cluster has one fewer region, and you will need to add | ||
| the region back when the failed Kubernetes cluster is back online, or add a replacement | ||
| region. See the [add region playground guide](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/fleet-add-region/README.md) | ||
| for an example. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| Before performing any failover: | ||
|
|
||
| - **Replica health:** The target replica Kubernetes cluster is running and replication is current | ||
| - **Network access:** You have `kubectl` access to all Kubernetes clusters involved | ||
| - **Backup available:** Recent backup exists for rollback if needed | ||
| - **Monitoring:** Metrics and logs are accessible for verification | ||
| - **Communication:** Stakeholders are notified (for planned failover) | ||
| - **Application readiness:** Applications can handle brief connection interruption | ||
| - **kubectl-documentdb plugin:** Install the plugin for streamlined failover operations in [kubectl-plugin](../kubectl-plugin.md) | ||
|
|
||
| ### Check current replication status | ||
|
|
||
| Identify the current primary and verify replica health: | ||
|
|
||
| ```bash | ||
| # View current primary setting | ||
| kubectl --context hub get documentdb documentdb-preview \ | ||
| -n documentdb-preview-ns -o jsonpath='{.spec.clusterReplication.primary}' | ||
|
|
||
| # Check replication status on primary | ||
| kubectl --context primary exec -it -n documentdb-preview-ns \ | ||
| documentdb-preview-1 -- psql -U postgres -c "SELECT * FROM pg_stat_replication;" | ||
| ``` | ||
|
|
||
| Expected output shows active replication to all replicas: | ||
|
|
||
| ```text | ||
| pid | usename | application_name | client_addr | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | sync_state | ||
| -----+----------+------------------+-------------+-----------+------------+------------+------------+------------+------------ | ||
| 123 | postgres | replica1 | 10.2.1.5 | streaming | 0/30000A8 | 0/30000A8 | 0/30000A8 | 0/30000A8 | async | ||
| 124 | postgres | replica2 | 10.3.1.5 | streaming | 0/30000A8 | 0/30000A8 | 0/30000A8 | 0/30000A8 | async | ||
| ``` | ||
|
|
||
| **Key indicators:** | ||
|
|
||
| - **state:** Should be `streaming` | ||
| - **LSN values:** `replay_lsn` should be close to `sent_lsn` (low replication lag) | ||
|
|
||
| ## Planned failover procedure | ||
|
|
||
| Use this procedure when the primary Kubernetes cluster is healthy and you want to switch primary regions in a controlled manner. | ||
|
|
||
| ### Step 1: Pre-failover verification | ||
|
|
||
| Verify system health before starting: | ||
|
|
||
| ```bash | ||
| # 1. Check all DocumentDB clusters are ready | ||
| kubectl --context hub get documentdb -A | ||
|
|
||
| # 2. Verify replication lag is low (< 1 second) | ||
| kubectl --context current-primary exec -it -n documentdb-preview-ns \ | ||
| documentdb-preview-1 -- psql -U postgres -c \ | ||
| "SELECT client_addr, state, replay_lag FROM pg_stat_replication;" | ||
|
|
||
| # 3. Check target replica is healthy | ||
| kubectl --context new-primary get pods -n documentdb-preview-ns | ||
| ``` | ||
|
|
||
| All checks should show healthy status before proceeding. | ||
|
|
||
| ### Step 2: Perform failover | ||
|
|
||
| Promote the replica to become the new primary Kubernetes cluster. | ||
|
|
||
| === "Plugin" | ||
|
|
||
| !!! note | ||
| KubeFleet deployment required | ||
| The plugin handles the CRD change and automatically waits for convergence: | ||
|
|
||
| ```bash | ||
| kubectl documentdb promote \ | ||
| --documentdb documentdb-preview \ | ||
| --namespace documentdb-preview-ns \ | ||
| --target-cluster new-primary-cluster-name \ | ||
| --hub-context hub | ||
| ``` | ||
|
|
||
| === "kubectl patch (KubeFleet)" | ||
|
|
||
| Update the DocumentDB resource on the hub Kubernetes cluster: | ||
|
|
||
| ```bash | ||
| kubectl --context hub patch documentdb documentdb-preview \ | ||
| -n documentdb-preview-ns \ | ||
| --type='merge' \ | ||
| -p '{"spec":{"clusterReplication":{"primary":"new-primary-cluster-name"}}}' | ||
| ``` | ||
|
|
||
| The fleet controller propagates the change to all member Kubernetes clusters automatically. | ||
|
|
||
| === "kubectl patch (manual)" | ||
|
|
||
| Update the DocumentDB resource on **all** Kubernetes clusters: | ||
|
|
||
| ```bash | ||
| # Update on all Kubernetes clusters (use a loop or run individually) | ||
| for context in cluster1 cluster2 cluster3; do | ||
| kubectl --context "$context" patch documentdb documentdb-preview \ | ||
| -n documentdb-preview-ns \ | ||
| --type='merge' \ | ||
| -p '{"spec":{"clusterReplication":{"primary":"new-primary-cluster-name"}}}' | ||
| done | ||
| ``` | ||
|
|
||
| **What happens:** | ||
|
|
||
| 1. The operator detects the primary change | ||
| 2. The old primary becomes a replica after flushing writes | ||
| 3. The new primary Kubernetes cluster scales up (if HA) and starts to accept writes | ||
| 4. Replication direction reverses (new primary → replicas including old primary) | ||
|
|
||
| ### Step 3: Monitor failover progress | ||
|
|
||
| Watch operator logs and DocumentDB status: | ||
|
|
||
| ```bash | ||
| # Watch DocumentDB status on new primary | ||
| watch kubectl --context new-primary get documentdb -n documentdb-preview-ns | ||
|
|
||
| # Monitor operator logs on new primary | ||
| kubectl --context new-primary logs -n documentdb-operator \ | ||
| -l app.kubernetes.io/name=documentdb-operator -f | ||
|
|
||
| # Check pod status | ||
| kubectl --context new-primary get pods -n documentdb-preview-ns -w | ||
| ``` | ||
|
|
||
| ### Step 4: Verify promoted primary | ||
|
|
||
| Confirm the new primary accepts writes: | ||
|
|
||
| ```bash | ||
| # Port forward to new primary | ||
| kubectl --context new-primary port-forward \ | ||
| -n documentdb-preview-ns svc/documentdb-preview 10260:10260 & | ||
|
|
||
| # Connect with mongosh | ||
| mongosh "mongodb://admin:password@localhost:10260/?tls=true&tlsAllowInvalidCertificates=true" | ||
|
|
||
| # Test write operation | ||
| db.testCollection.insertOne({ | ||
| message: "Write test after failover", | ||
| timestamp: new Date() | ||
| }) | ||
|
|
||
| # Should succeed without errors | ||
| ``` | ||
|
|
||
| ### Step 5: Verify old primary as replica | ||
|
|
||
| Check that the old primary is now replicating from the new primary: | ||
|
|
||
| ```bash | ||
| # Verify replication status ON NEW PRIMARY | ||
| kubectl --context new-primary exec -it -n documentdb-preview-ns \ | ||
| documentdb-preview-1 -- psql -U postgres -c "SELECT * FROM pg_stat_replication;" | ||
| ``` | ||
|
|
||
| You should see the old primary listed as a replica receiving replication stream. | ||
|
|
||
| ### Step 6: Post-failover validation | ||
|
|
||
| Run comprehensive checks: | ||
|
|
||
| ```bash | ||
| # 1. Verify all Kubernetes clusters are in sync | ||
| for context in cluster1 cluster2 cluster3; do | ||
| echo "=== $context ===" | ||
| kubectl --context "$context" get documentdb -n documentdb-preview-ns | ||
| done | ||
|
|
||
| # 2. Check application health | ||
| kubectl --context new-primary get pods -n app-namespace | ||
|
|
||
| # 3. Review metrics and logs for errors | ||
| # (use your monitoring system, such as Prometheus, Grafana, or CloudWatch) | ||
|
|
||
| # 4. Verify data consistency (read from all replicas) | ||
| ``` | ||
|
|
||
| ## Unplanned failover procedure (disaster recovery) | ||
|
|
||
| Use this procedure when the primary Kubernetes cluster is unavailable and you need to immediately promote a replica. | ||
|
|
||
| !!! danger "Data loss risk" | ||
| Unplanned failover may result in data loss if the primary DocumentDB cluster failed before replicating recent writes. Assess replication lag before deciding which replica to promote. | ||
|
|
||
| ### Step 1: Assess the situation | ||
|
|
||
| Determine the scope of the outage: | ||
|
|
||
| ```bash | ||
| # 1. Check primary Kubernetes cluster accessibility | ||
| kubectl --context primary get nodes | ||
| # If this fails, the primary Kubernetes cluster is unreachable | ||
|
|
||
| # 2. Check replica Kubernetes cluster health | ||
| kubectl --context replica1 get documentdb -n documentdb-preview-ns | ||
| kubectl --context replica2 get documentdb -n documentdb-preview-ns | ||
|
|
||
| # 3. Check cloud provider status pages for regional outages | ||
| ``` | ||
|
|
||
| ### Step 2: Select target replica | ||
|
|
||
| Choose which replica to promote based on: | ||
|
|
||
| - **Replication lag:** Prefer the replica with the lowest lag (most recent data) | ||
| - **Geographic location:** Consider application proximity | ||
| - **Kubernetes cluster health:** Ensure the target Kubernetes cluster is fully operational | ||
|
|
||
| If you cannot query the primary, check the last known replication status from monitoring dashboards or logs. | ||
|
|
||
| ### Step 3: Promote replica to primary | ||
|
|
||
| Immediately promote the selected replica to become the new primary. | ||
|
|
||
| === "Plugin" | ||
|
|
||
| !!! note | ||
| KubeFleet deployment required | ||
|
|
||
| ```bash | ||
| kubectl documentdb promote \ | ||
| --documentdb documentdb-preview \ | ||
| --namespace documentdb-preview-ns \ | ||
| --target-cluster replica-cluster-name \ | ||
| --hub-context hub \ | ||
| --failover \ | ||
| --wait-timeout 15m | ||
| ``` | ||
|
|
||
| The plugin handles the change to `clusterList` and `primary` and monitors for | ||
| successful convergence. Use `--skip-wait` if you need to return immediately | ||
| and verify manually. | ||
|
|
||
| === "kubectl patch (KubeFleet)" | ||
|
|
||
| ```bash | ||
| # Remove failed primary from cluster list and set new primary in one command | ||
| kubectl --context hub patch documentdb documentdb-preview \ | ||
| -n documentdb-preview-ns \ | ||
| --type='merge' \ | ||
| -p '{"spec":{"clusterReplication":{"primary":"replica-cluster-name","clusterList":[{"name":"replica-cluster-name"},{"name":"other-replica-cluster-name"}]}}}' | ||
| ``` | ||
|
|
||
| Replace the `clusterList` entries with your actual list of healthy Kubernetes clusters, excluding the failed primary. | ||
|
|
||
| === "kubectl patch (manual)" | ||
|
|
||
| ```bash | ||
| # Update on all accessible Kubernetes clusters | ||
| # Remove failed primary from cluster list and set new primary in one command | ||
| for context in replica1 replica2; do | ||
| kubectl --context "$context" patch documentdb documentdb-preview \ | ||
| -n documentdb-preview-ns \ | ||
| --type='merge' \ | ||
| -p '{"spec":{"clusterReplication":{"primary":"replica-cluster-name","clusterList":[{"name":"replica-cluster-name"},{"name":"other-replica-cluster-name"}]}}}' | ||
| done | ||
| ``` | ||
|
|
||
| Replace the `clusterList` entries with your actual list of healthy Kubernetes clusters, excluding the failed primary. | ||
|
|
||
| **What happens:** | ||
|
|
||
| 1. The operator detects the primary and cluster list changes | ||
| 2. The new primary Kubernetes cluster scales up (if HA) and starts to accept writes | ||
| 3. The old primary is removed from replication | ||
|
|
||
| ### Step 4: Verify new primary | ||
|
|
||
| Confirm the promoted replica is accepting writes: | ||
|
|
||
| ```bash | ||
| # Check status | ||
| kubectl --context new-primary get documentdb documentdb-preview \ | ||
| -n documentdb-preview-ns | ||
|
|
||
| # Test write access | ||
| kubectl --context new-primary port-forward \ | ||
| -n documentdb-preview-ns svc/documentdb-preview 10260:10260 & | ||
|
|
||
| mongosh "mongodb://admin:password@localhost:10260/?tls=true&tlsAllowInvalidCertificates=true" | ||
| db.testCollection.insertOne({message: "DR failover test"}) | ||
| ``` | ||
|
|
||
| ### Step 5: Monitor recovery | ||
|
|
||
| ```bash | ||
| # Application pod logs | ||
| kubectl --context app-cluster logs -l app=your-app --tail=100 -f | ||
|
|
||
| # DocumentDB operator logs | ||
| kubectl --context new-primary logs -n documentdb-operator \ | ||
| -l app.kubernetes.io/name=documentdb-operator -f | ||
| ``` | ||
|
|
||
| ### Step 6: Handle failed primary recovery | ||
|
|
||
| When the failed primary Kubernetes cluster recovers, you need to re-add it to the DocumentDB cluster | ||
| as a replica. For detailed guidance on adding a region back to your DocumentDB cluster, | ||
| see the [add region playground guide](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/fleet-add-region/README.md). | ||
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to provide any command?