diff --git a/.github/agents/documentation-agent.md b/.github/agents/documentation-agent.md index 8f0a1d18..8915af38 100644 --- a/.github/agents/documentation-agent.md +++ b/.github/agents/documentation-agent.md @@ -1,6 +1,6 @@ --- description: 'Agent for documentation tasks in the DocumentDB Kubernetes Operator project.' -tools: [execute, read, terminal] +tools: [execute, read, edit] --- # Documentation Agent Instructions diff --git a/docs/operator-public-documentation/preview/multi-region-deployment/failover-procedures.md b/docs/operator-public-documentation/preview/multi-region-deployment/failover-procedures.md new file mode 100644 index 00000000..24dc4d0a --- /dev/null +++ b/docs/operator-public-documentation/preview/multi-region-deployment/failover-procedures.md @@ -0,0 +1,350 @@ +--- +title: Multi-region failover procedures +description: Step-by-step runbooks for planned and unplanned DocumentDB failovers across regions, including verification and rollback procedures. +tags: + - multi-region + - failover + - disaster-recovery + - operations +--- + +## Overview + +A **failover** promotes a replica Kubernetes cluster to become the new primary, making it +accept write operations. The previous primary (if still available) becomes a replica +replicating from the new primary. + +**When to perform failover:** + +- **Planned maintenance:** Region maintenance, infrastructure upgrades, cost optimization +- **Disaster recovery:** Primary region outage, network partition, catastrophic failure +- **Performance optimization:** Move primary closer to write-heavy workload +- **Testing:** Validate disaster recovery procedures + +## Failover types + +### Planned failover + +In a planned failover, the primary is safely demoted, all in-flight writes are flushed +to the replicas, and then the new primary is promoted. This kind of failover involves +no data loss, a bounded window during which writes aren't accepted, and the same number +of replicas before and after. 
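During that bounded window, write attempts fail until the new primary is promoted. As a minimal sketch (the wrapper, its retry budget, and the connection string are illustrative, not part of the operator), an operational script can ride out the pause with a simple backoff loop:

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a command with linear backoff so that a short
# write-freeze window during a planned failover is absorbed by the caller.
retry() {
  local attempts=$1; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0                # success: stop retrying
    ((i < attempts)) && sleep "$i"  # back off before the next attempt
  done
  return 1                          # still failing after all attempts
}

# Illustrative usage (connection string and collection are placeholders):
# retry 5 mongosh "$CONN_STRING" --eval 'db.health.insertOne({ts: new Date()})'
```

Applications using official MongoDB drivers can get similar behavior by enabling retryable writes in the connection string.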
+ +### Unplanned failover (disaster recovery) + +An unplanned failover occurs when the primary becomes unavailable and must be forced out +of the DocumentDB cluster entirely. Downtime depends on how quickly primary degradation +is detected and, if HA is enabled, how long it takes to scale up the new primary. +Some writes to the failed primary might be lost, but with high availability enabled, +clients can determine which writes were not committed to replicas. After an unplanned +failover, the DocumentDB cluster has one fewer region, and you will need to add +the region back when the failed Kubernetes cluster is back online, or add a replacement +region. See the [add region playground guide](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/fleet-add-region/README.md) +for an example. + +## Prerequisites + +Before performing any failover: + +- **Replica health:** The target replica Kubernetes cluster is running and replication is current +- **Network access:** You have `kubectl` access to all Kubernetes clusters involved +- **Backup available:** Recent backup exists for rollback if needed +- **Monitoring:** Metrics and logs are accessible for verification +- **Communication:** Stakeholders are notified (for planned failover) +- **Application readiness:** Applications can handle brief connection interruption +- **kubectl-documentdb plugin:** Install the plugin for streamlined failover operations; see [kubectl-plugin](../kubectl-plugin.md) + +### Check current replication status + +Identify the current primary and verify replica health: + +```bash +# View current primary setting +kubectl --context hub get documentdb documentdb-preview \ + -n documentdb-preview-ns -o jsonpath='{.spec.clusterReplication.primary}' + +# Check replication status on primary +kubectl --context primary exec -it -n documentdb-preview-ns \ + documentdb-preview-1 -- psql -U postgres -c "SELECT * FROM pg_stat_replication;" +``` + +Expected output shows active 
replication to all replicas: + +```text + pid | usename | application_name | client_addr | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | sync_state +-----+----------+------------------+-------------+-----------+------------+------------+------------+------------+------------ + 123 | postgres | replica1 | 10.2.1.5 | streaming | 0/30000A8 | 0/30000A8 | 0/30000A8 | 0/30000A8 | async + 124 | postgres | replica2 | 10.3.1.5 | streaming | 0/30000A8 | 0/30000A8 | 0/30000A8 | 0/30000A8 | async +``` + +**Key indicators:** + +- **state:** Should be `streaming` +- **LSN values:** `replay_lsn` should be close to `sent_lsn` (low replication lag) + +## Planned failover procedure + +Use this procedure when the primary Kubernetes cluster is healthy and you want to switch primary regions in a controlled manner. + +### Step 1: Pre-failover verification + +Verify system health before starting: + +```bash +# 1. Check all DocumentDB clusters are ready +kubectl --context hub get documentdb -A + +# 2. Verify replication lag is low (< 1 second) +kubectl --context current-primary exec -it -n documentdb-preview-ns \ + documentdb-preview-1 -- psql -U postgres -c \ + "SELECT client_addr, state, replay_lag FROM pg_stat_replication;" + +# 3. Check target replica is healthy +kubectl --context new-primary get pods -n documentdb-preview-ns +``` + +All checks should show healthy status before proceeding. + +### Step 2: Perform failover + +Promote the replica to become the new primary Kubernetes cluster. + +=== "Plugin" + + !!! 
note "KubeFleet deployment required" + + The plugin handles the CRD change and automatically waits for convergence: + + ```bash + kubectl documentdb promote \ + --documentdb documentdb-preview \ + --namespace documentdb-preview-ns \ + --target-cluster new-primary-cluster-name \ + --hub-context hub + ``` + +=== "kubectl patch (KubeFleet)" + + Update the DocumentDB resource on the hub Kubernetes cluster: + + ```bash + kubectl --context hub patch documentdb documentdb-preview \ + -n documentdb-preview-ns \ + --type='merge' \ + -p '{"spec":{"clusterReplication":{"primary":"new-primary-cluster-name"}}}' + ``` + + The fleet controller propagates the change to all member Kubernetes clusters automatically. + +=== "kubectl patch (manual)" + + Update the DocumentDB resource on **all** Kubernetes clusters: + + ```bash + # Update on all Kubernetes clusters (use a loop or run individually) + for context in cluster1 cluster2 cluster3; do + kubectl --context "$context" patch documentdb documentdb-preview \ + -n documentdb-preview-ns \ + --type='merge' \ + -p '{"spec":{"clusterReplication":{"primary":"new-primary-cluster-name"}}}' + done + ``` + +**What happens:** + +1. The operator detects the primary change +2. The old primary becomes a replica after flushing writes +3. The new primary Kubernetes cluster scales up (if HA) and starts to accept writes +4. 
Replication direction reverses (new primary → replicas including old primary) + +### Step 3: Monitor failover progress + +Watch operator logs and DocumentDB status: + +```bash +# Watch DocumentDB status on new primary +watch kubectl --context new-primary get documentdb -n documentdb-preview-ns + +# Monitor operator logs on new primary +kubectl --context new-primary logs -n documentdb-operator \ + -l app.kubernetes.io/name=documentdb-operator -f + +# Check pod status +kubectl --context new-primary get pods -n documentdb-preview-ns -w +``` + +### Step 4: Verify promoted primary + +Confirm the new primary accepts writes: + +```bash +# Port forward to new primary +kubectl --context new-primary port-forward \ + -n documentdb-preview-ns svc/documentdb-preview 10260:10260 & + +# Connect with mongosh +mongosh "mongodb://admin:password@localhost:10260/?tls=true&tlsAllowInvalidCertificates=true" + +# Test write operation +db.testCollection.insertOne({ + message: "Write test after failover", + timestamp: new Date() +}) + +# Should succeed without errors +``` + +### Step 5: Verify old primary as replica + +Check that the old primary is now replicating from the new primary: + +```bash +# Verify replication status ON NEW PRIMARY +kubectl --context new-primary exec -it -n documentdb-preview-ns \ + documentdb-preview-1 -- psql -U postgres -c "SELECT * FROM pg_stat_replication;" +``` + +You should see the old primary listed as a replica receiving replication stream. + +### Step 6: Post-failover validation + +Run comprehensive checks: + +```bash +# 1. Verify all Kubernetes clusters are in sync +for context in cluster1 cluster2 cluster3; do + echo "=== $context ===" + kubectl --context "$context" get documentdb -n documentdb-preview-ns +done + +# 2. Check application health +kubectl --context new-primary get pods -n app-namespace + +# 3. Review metrics and logs for errors +# (use your monitoring system, such as Prometheus, Grafana, or CloudWatch) + +# 4. 
Verify data consistency (read from all replicas) +``` + +## Unplanned failover procedure (disaster recovery) + +Use this procedure when the primary Kubernetes cluster is unavailable and you need to immediately promote a replica. + +!!! danger "Data loss risk" + Unplanned failover may result in data loss if the primary DocumentDB cluster failed before replicating recent writes. Assess replication lag before deciding which replica to promote. + +### Step 1: Assess the situation + +Determine the scope of the outage: + +```bash +# 1. Check primary Kubernetes cluster accessibility +kubectl --context primary get nodes +# If this fails, the primary Kubernetes cluster is unreachable + +# 2. Check replica Kubernetes cluster health +kubectl --context replica1 get documentdb -n documentdb-preview-ns +kubectl --context replica2 get documentdb -n documentdb-preview-ns + +# 3. Check cloud provider status pages for regional outages +``` + +### Step 2: Select target replica + +Choose which replica to promote based on: + +- **Replication lag:** Prefer the replica with the lowest lag (most recent data) +- **Geographic location:** Consider application proximity +- **Kubernetes cluster health:** Ensure the target Kubernetes cluster is fully operational + +If you cannot query the primary, check the last known replication status from monitoring dashboards or logs. + +### Step 3: Promote replica to primary + +Immediately promote the selected replica to become the new primary. + +=== "Plugin" + + !!! note + KubeFleet deployment required + + ```bash + kubectl documentdb promote \ + --documentdb documentdb-preview \ + --namespace documentdb-preview-ns \ + --target-cluster replica-cluster-name \ + --hub-context hub \ + --failover \ + --wait-timeout 15m + ``` + + The plugin handles the change to `clusterList` and `primary` and monitors for + successful convergence. Use `--skip-wait` if you need to return immediately + and verify manually. 
+ +=== "kubectl patch (KubeFleet)" + + ```bash + # Remove failed primary from cluster list and set new primary in one command + kubectl --context hub patch documentdb documentdb-preview \ + -n documentdb-preview-ns \ + --type='merge' \ + -p '{"spec":{"clusterReplication":{"primary":"replica-cluster-name","clusterList":[{"name":"replica-cluster-name"},{"name":"other-replica-cluster-name"}]}}}' + ``` + + Replace the `clusterList` entries with your actual list of healthy Kubernetes clusters, excluding the failed primary. + +=== "kubectl patch (manual)" + + ```bash + # Update on all accessible Kubernetes clusters + # Remove failed primary from cluster list and set new primary in one command + for context in replica1 replica2; do + kubectl --context "$context" patch documentdb documentdb-preview \ + -n documentdb-preview-ns \ + --type='merge' \ + -p '{"spec":{"clusterReplication":{"primary":"replica-cluster-name","clusterList":[{"name":"replica-cluster-name"},{"name":"other-replica-cluster-name"}]}}}' + done + ``` + + Replace the `clusterList` entries with your actual list of healthy Kubernetes clusters, excluding the failed primary. + +**What happens:** + +1. The operator detects the primary and cluster list changes +2. The new primary Kubernetes cluster scales up (if HA) and starts to accept writes +3. 
The old primary is removed from replication + +### Step 4: Verify new primary + +Confirm the promoted replica is accepting writes: + +```bash +# Check status +kubectl --context new-primary get documentdb documentdb-preview \ + -n documentdb-preview-ns + +# Test write access +kubectl --context new-primary port-forward \ + -n documentdb-preview-ns svc/documentdb-preview 10260:10260 & + +mongosh "mongodb://admin:password@localhost:10260/?tls=true&tlsAllowInvalidCertificates=true" +db.testCollection.insertOne({message: "DR failover test"}) +``` + +### Step 5: Monitor recovery + +```bash +# Application pod logs +kubectl --context app-cluster logs -l app=your-app --tail=100 -f + +# DocumentDB operator logs +kubectl --context new-primary logs -n documentdb-operator \ + -l app.kubernetes.io/name=documentdb-operator -f +``` + +### Step 6: Handle failed primary recovery + +When the failed primary Kubernetes cluster recovers, you need to re-add it to the DocumentDB cluster +as a replica. For detailed guidance on adding a region back to your DocumentDB cluster, +see the [add region playground guide](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/fleet-add-region/README.md). diff --git a/docs/operator-public-documentation/preview/multi-region-deployment/overview.md b/docs/operator-public-documentation/preview/multi-region-deployment/overview.md new file mode 100644 index 00000000..341b3e7d --- /dev/null +++ b/docs/operator-public-documentation/preview/multi-region-deployment/overview.md @@ -0,0 +1,200 @@ +--- +title: Multi-region deployment overview +description: Understand multi-region DocumentDB deployments for disaster + recovery, low-latency access, and compliance with geographic data residency + requirements. 
+tags: + - multi-region + - disaster-recovery + - high-availability + - architecture +--- + +## Use cases + +### Disaster recovery (DR) + +Protect against regional outages by maintaining database replicas in separate +geographic regions. If the primary region fails, promote a replica in another +region to maintain service availability. + +### Low-latency global access + +Reduce application response times and distribute load by deploying read replicas +closer to end users. + +### Compliance and data residency + +Meet regulatory requirements for data storage location by deploying replicas in +specific regions. Ensure that data resides within required geographic +boundaries while maintaining availability. + +## Architecture + +### Primary-replica model + +DocumentDB uses a primary-replica architecture where: + +- **Primary cluster:** Accepts both read and write operations +- **Replica clusters:** Accept read-only operations and replicate changes from + the primary Kubernetes cluster +- **Replication:** PostgreSQL streaming replication propagates changes from + the primary Kubernetes cluster to replica Kubernetes clusters + +### DocumentDB cluster components + +Each regional Kubernetes cluster includes: + +- **Gateway containers:** Provide MongoDB-compatible API and connection management +- **PostgreSQL containers:** Store data and handle replication (managed by + CloudNative-PG) +- **Persistent storage:** Regional block storage for data persistence +- **Service endpoints:** LoadBalancer or ClusterIP for client connections +- **Self-name ConfigMap:** A ConfigMap that stores the Kubernetes cluster name + (must match `clusterList[].name`) + +### Replication configuration + +Multi-region replication is configured in the `DocumentDB` resource: + +```yaml +apiVersion: documentdb.io/preview +kind: DocumentDB +metadata: + name: documentdb-preview + namespace: documentdb-preview-ns +spec: + clusterReplication: + primary: member-eastus2-cluster + clusterList: + - name: 
member-westus3-cluster + - name: member-uksouth-cluster + - name: member-eastus2-cluster +``` + +The operator handles: + +- Creating replica Kubernetes clusters in specified regions +- Establishing streaming replication from the primary to replicas +- Monitoring replication lag and health +- Coordinating failover operations + +## Network requirements + +### Inter-region connectivity + +Use cloud-native VNet/VPC peering for direct Kubernetes cluster-to-cluster communication: + +- **Azure:** VNet peering between AKS clusters +- **AWS:** VPC peering between EKS clusters +- **GCP:** VPC peering between GKE clusters + +### Port requirements + +DocumentDB replication requires these ports between Kubernetes clusters: + +| Port | Protocol | Purpose | +|------|----------|---------| +| 5432 | TCP | PostgreSQL streaming replication | +| 443 | TCP | Kubernetes API (for KubeFleet, optional) | + +Ensure firewall rules and network security groups allow traffic on these ports +between regional Kubernetes clusters. + +### DNS and service discovery + +The operator uses the DocumentDB cluster name and the generated service for the +corresponding CNPG cluster to connect regional deployments. You must make sure +those connections can resolve and route correctly between Kubernetes clusters. +You can also use either of the built-in networking integrations. + +#### Istio networking + +If Istio is installed on the Kubernetes cluster, Istio networking is enabled, +and an east-west gateway is present connecting each Kubernetes cluster, then +the operator generates services that automatically route the default service +names across regions. + +#### Fleet networking + +If Fleet networking is installed on each Kubernetes cluster, instead of using +default service names, the operator creates ServiceExports and +MultiClusterServices on each Kubernetes cluster. It then uses those generated +cross-regional services to connect CNPG instances to one another. 
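As a rough illustration of the objects the operator generates (you do not create these by hand; the service name and the Azure Fleet Networking API version shown here are assumptions, so treat this as a sketch only):

```yaml
# Hypothetical example of operator-generated Fleet networking objects.
apiVersion: networking.fleet.azure.com/v1alpha1
kind: ServiceExport
metadata:
  name: documentdb-preview-rw          # assumed CNPG service name
  namespace: documentdb-preview-ns
---
apiVersion: networking.fleet.azure.com/v1alpha1
kind: MultiClusterService
metadata:
  name: documentdb-preview-rw
  namespace: documentdb-preview-ns
spec:
  serviceImport:
    name: documentdb-preview-rw        # imports the exported service fleet-wide
```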
+ +## Deployment models + +### Managed fleet orchestration + +Use a multi-cluster orchestration system such as KubeFleet to manage +deployments of resources across Kubernetes clusters and centrally manage +changes, ensuring your topology stays synchronized between regions. + +**Example:** [AKS Fleet Deployment](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/aks-fleet-deployment/README.md) + +### Manual multi-cluster management + +Deploy DocumentDB resources individually to each Kubernetes cluster, manually +ensuring that each DocumentDB CRD is in sync. + +## Performance considerations + +### Replication lag + +Distance between regions affects replication lag. Monitor replication lag with +PostgreSQL metrics and adjust application read patterns accordingly. + +### Storage performance + +Each region requires independent storage resources, and each replica must have +an equal or greater volume of available storage compared to the primary. + +## Security considerations + +### TLS encryption + +Enable TLS for all connections: + +- **Client-to-gateway:** Encrypt application connections (see [TLS configuration](../configuration/tls.md)) +- **Replication traffic:** PostgreSQL SSL for inter-cluster replication +- **Service mesh:** mTLS for cross-cluster service communication + +### Authentication and authorization + +Credentials must be synchronized across regions: + +- **Kubernetes Secrets:** Replicate secrets to all Kubernetes clusters + (KubeFleet handles this automatically) +- **RBAC policies:** Apply consistent access controls across regions +- **Credential rotation:** Coordinate credential changes across all Kubernetes + clusters + +### Network security + +Restrict network access between regions: + +- **Private connectivity:** Use VNet/VPC peering instead of public internet +- **Network policies:** Kubernetes NetworkPolicy to limit pod-to-pod + communication +- **Firewall rules:** Allow only required ports between regional 
Kubernetes + clusters + +## Monitoring and observability + +Track multi-region health and performance: + +- **Replication lag:** Monitor `pg_stat_replication` metrics +- **Kubernetes cluster health:** Pod status, resource usage, and connection counts +- **Network metrics:** Bandwidth, latency, packet loss between regions +- **Application performance:** Request latency, error rates per region + +See [Telemetry examples](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/telemetry/README.md) +for OpenTelemetry, Prometheus, and Grafana setup. + +## Next steps + +- [Multi-region setup guide](setup.md) - Deploy your first multi-region + DocumentDB cluster +- [Failover procedures](failover-procedures.md) - Learn how to handle planned + and unplanned failovers +- [AKS Fleet deployment example](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/aks-fleet-deployment/README.md) diff --git a/docs/operator-public-documentation/preview/multi-region-deployment/setup.md b/docs/operator-public-documentation/preview/multi-region-deployment/setup.md new file mode 100644 index 00000000..1b7890d7 --- /dev/null +++ b/docs/operator-public-documentation/preview/multi-region-deployment/setup.md @@ -0,0 +1,405 @@ +--- +title: Multi-region setup guide +description: Step-by-step instructions for deploying DocumentDB across multiple Kubernetes clusters with replication, prerequisites, and configuration examples. 
+tags: + - multi-region + - setup + - deployment + - replication +--- + +## Prerequisites + +### Infrastructure requirements + +Before deploying DocumentDB in multi-region mode, ensure you have: + +- **Multiple Kubernetes clusters:** 2 or more Kubernetes clusters in different regions +- **Network connectivity:** Kubernetes clusters can communicate over private networking or the internet +- **Storage:** CSI-compatible storage class in each Kubernetes cluster with snapshot support +- **Load balancing:** LoadBalancer or Ingress capability for external access (optional) + +### Required components + +Install these components on **all** Kubernetes clusters: + +#### 1. cert-manager + +Required for TLS certificate management between Kubernetes clusters. + +```bash +helm repo add jetstack https://charts.jetstack.io +helm repo update +helm install cert-manager jetstack/cert-manager \ + --namespace cert-manager \ + --create-namespace \ + --set installCRDs=true +``` + +Verify installation: + +```bash +kubectl get pods -n cert-manager +``` + +See [Get Started](../index.md#install-cert-manager) for detailed cert-manager setup. + +#### 2. DocumentDB operator + +Install the operator on each Kubernetes cluster. + +```bash +helm repo add documentdb https://documentdb.github.io/documentdb-kubernetes-operator +helm repo update +helm install documentdb-operator documentdb/documentdb-operator \ + --namespace documentdb-operator \ + --create-namespace +``` + +Verify installation: + +```bash +kubectl get pods -n documentdb-operator +``` + +#### 3. Kubernetes cluster identity ConfigMap + +Each Kubernetes cluster in a multi-region deployment must identify itself with +a unique Kubernetes cluster name. Create a ConfigMap on each Kubernetes cluster: + +```bash +# Run on each Kubernetes cluster and replace with your actual cluster name. 
+CLUSTER_NAME="member-eastus2-cluster" # for example: member-eastus2-cluster, member-westus3-cluster + +kubectl create configmap cluster-identity \ + --namespace kube-system \ + --from-literal=cluster-name="${CLUSTER_NAME}" +``` + +!!! note + The Kubernetes cluster name in this ConfigMap must exactly match one + of the member Kubernetes cluster names in `spec.clusterReplication.clusterList[].name`. + +This is required because the DocumentDB CRD is the same across primaries and +replicas, and each Kubernetes cluster must identify its own role in the topology. + +### Network configuration + +#### VNet/VPC peering (single cloud provider) + +For Kubernetes clusters in the same cloud provider, configure VNet or VPC peering: + +=== "Azure (AKS)" + + Create VNet peering between all AKS cluster VNets: + + ```bash + az network vnet peering create \ + --name peer-to-cluster2 \ + --resource-group cluster1-rg \ + --vnet-name cluster1-vnet \ + --remote-vnet /subscriptions/.../cluster2-vnet \ + --allow-vnet-access + ``` + + Repeat for all Kubernetes cluster pairs in a full mesh topology. + + See [AKS Fleet Deployment](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/aks-fleet-deployment/README.md) for automated Azure multi-region setup with VNet peering. + +=== "AWS (EKS)" + + Create VPC peering connections between EKS cluster VPCs: + + ```bash + aws ec2 create-vpc-peering-connection \ + --vpc-id vpc-cluster1 \ + --peer-vpc-id vpc-cluster2 \ + --peer-region us-west-2 + ``` + + Update route tables to allow traffic between VPCs. 
+ +=== "GCP (GKE)" + + Enable VPC peering between GKE cluster networks: + + ```bash + gcloud compute networks peerings create peer-to-cluster2 \ + --network=cluster1-network \ + --peer-network=cluster2-network + ``` + +#### Networking management + +Configure inter-cluster networking using `spec.clusterReplication.crossCloudNetworkingStrategy`: + +**Valid options:** + +- **None** (default): Direct service-to-service connections using standard Kubernetes service names for the PostgreSQL backend server +- **Istio**: Use Istio service mesh for cross-cluster connectivity +- **AzureFleet**: Use Azure Fleet Networking for cross-cluster communication (separate from KubeFleet) + +**Example:** + +```yaml +spec: + clusterReplication: + primary: member-eastus2-cluster + crossCloudNetworkingStrategy: Istio # or AzureFleet, None + clusterList: + - name: member-eastus2-cluster + - name: member-westus3-cluster +``` + +## Deployment options + +Choose a deployment approach based on your infrastructure and operational preferences. + +### With KubeFleet (recommended) + +Fleet management systems such as KubeFleet simplify multi-region operations: + +- **Centralized control:** Define resources once, deploy everywhere +- **Automatic propagation:** Resources sync to member Kubernetes clusters automatically +- **Coordinated updates:** Roll out changes across regions consistently + +#### Step 1: Deploy fleet infrastructure + +Install KubeFleet or another fleet management system, then configure member Kubernetes +clusters to join the fleet. See the "KUBEFLEET SETUP" section of +[deploy-fleet-bicep.sh](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/aks-fleet-deployment/deploy-fleet-bicep.sh) +for a complete automated setup example. 
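Once members have joined, resources are propagated to them with `ClusterResourcePlacement` objects, which the following steps rely on. A minimal sketch, assuming the KubeFleet `v1beta1` placement API; the CRP manifests linked below are the authoritative versions:

```yaml
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacement
metadata:
  name: documentdb-operator-crp        # illustrative name
spec:
  resourceSelectors:
    - group: ""
      version: v1
      kind: Namespace
      name: documentdb-operator        # selects the namespace and everything in it
  policy:
    placementType: PickAll             # place on every member cluster
```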
+ +#### Step 2: Install cert-manager and DocumentDB operator + +Install cert-manager and the DocumentDB operator on the hub per the +[Required Components](#required-components) section, then create `ClusterResourcePlacements` +to deploy them both to all member Kubernetes clusters. + +- [cert-manager CRP](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/aks-fleet-deployment/cert-manager-crp.yaml) +- [documentdb-operator CRP](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/aks-fleet-deployment/documentdb-operator-crp.yaml) + +#### Step 3: Deploy multi-region DocumentDB + +Create a DocumentDB resource with replication configuration. The example is written for +script-driven substitution, so replace all of the {{PLACEHOLDERS}} with your own values. + +- [DocumentDB CRD, secret, and CRP](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/aks-fleet-deployment/documentdb-resource-crp.yaml) + +Within the CRD, the `clusterReplication` section enables multi-region deployment: +`primary` specifies which Kubernetes cluster accepts write operations, and `clusterList` +lists all member Kubernetes clusters that host DocumentDB instances (including the +primary); each entry also accepts optional per-cluster `environment` and `storageClass` settings. + +### Without KubeFleet + +If you are not using KubeFleet, deploy DocumentDB resources to each Kubernetes cluster individually. + +#### Step 1: Identify Kubernetes cluster names + +Determine the name for each Kubernetes cluster. 
These names are used in the replication configuration: + +```bash +# List your clusters +kubectl config get-contexts + +# Or for cloud-managed clusters: +az aks list --query "[].name" -o table # Azure +aws eks list-clusters --query "clusters" --output table # AWS +gcloud container clusters list --format="table(name)" # GCP +``` + +#### Step 2: Create Kubernetes cluster identification + +On each Kubernetes cluster, create a ConfigMap to identify the Kubernetes cluster name: + +```bash +# Run on each Kubernetes cluster +CLUSTER_NAME="cluster-region-name" # for example: member-eastus2-cluster + +kubectl create configmap cluster-identity \ + --namespace kube-system \ + --from-literal=cluster-name="${CLUSTER_NAME}" +``` + +#### Step 3: Deploy cert-manager and DocumentDB operator to each cluster + +Install cert-manager and the DocumentDB operator on each Kubernetes cluster per the +[Required Components](#required-components) section. +When you change any resource, apply the same change on every Kubernetes cluster so +they all stay in sync; the operator works under the assumption that all members have +the same resources. + +### Storage configuration + +Each Kubernetes cluster in a multi-region deployment can use different storage classes. 
+Configure storage at the global level or override per member Kubernetes cluster: + +**Global storage configuration:** + +```yaml +spec: + resource: + storage: + pvcSize: 100Gi + storageClass: default-storage-class # Used by all Kubernetes clusters +``` + +**Per-Kubernetes-cluster storage override:** + +```yaml +spec: + resource: + storage: + pvcSize: 100Gi + storageClass: default-storage-class # Fallback + clusterReplication: + primary: member-eastus2-cluster + clusterList: + - name: member-westus3-cluster + storageClass: managed-csi-premium # Override for this Kubernetes cluster + - name: member-uksouth-cluster + storageClass: azuredisk-standard-ssd # Override for this Kubernetes cluster + - name: member-eastus2-cluster + # Uses global storageClass +``` + +**Cloud-specific storage classes:** + +=== "Azure (AKS)" + + ```yaml + - name: member-eastus2-cluster + storageClass: managed-csi # Azure Disk managed CSI driver + environment: aks + ``` + +=== "AWS (EKS)" + + ```yaml + - name: member-us-east-1-cluster + storageClass: gp3 # AWS EBS GP3 volumes + environment: eks + ``` + +=== "GCP (GKE)" + + ```yaml + - name: member-us-central1-cluster + storageClass: standard-rwo # GCP Persistent Disk + environment: gke + ``` + +### Service exposure + +Configure how DocumentDB is exposed in each region: + +=== "LoadBalancer" + + **Best for:** Production deployments with external access + + ```yaml + spec: + exposeViaService: + serviceType: LoadBalancer + ``` + + Each Kubernetes cluster gets a public IP for client connections. When you use the `environment` + configuration at either the DocumentDB cluster or Kubernetes cluster level, the tags for the + LoadBalancer change. See the + cloud-specific setup docs for more details. + +=== "ClusterIP" + + **Best for:** In-cluster access only or Ingress-based routing + + ```yaml + spec: + exposeViaService: + serviceType: ClusterIP + ``` + + Clients must access DocumentDB through Ingress or service mesh. 
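Putting exposure and the cloud-aware `environment` setting together, a spec that exposes each region through a LoadBalancer might look like the following. This is a hedged sketch: field values are illustrative, and placement of `environment` follows the cluster-level and per-cluster usage described above.

```yaml
spec:
  environment: aks                  # illustrative: adjusts LoadBalancer tags for AKS
  exposeViaService:
    serviceType: LoadBalancer
  clusterReplication:
    primary: member-eastus2-cluster
    clusterList:
      - name: member-eastus2-cluster
      - name: member-westus3-cluster
        environment: aks            # per-Kubernetes-cluster override
```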
+ +## Troubleshooting + +### Replication not established + +If replicas don't receive data from the primary: + +1. **Verify network connectivity:** + + ```bash + # From a replica Kubernetes cluster, test connectivity to primary + kubectl --context replica1 run test-pod --rm -it --image=nicolaka/netshoot -- \ + nc -zv primary-service-endpoint 5432 + ``` + +2. **Check PostgreSQL replication status on primary:** + + ```bash + kubectl --context primary exec -it -n documentdb-preview-ns \ + documentdb-preview-1 -- psql -U postgres -c "SELECT * FROM pg_stat_replication;" + ``` + +3. **Review operator logs:** + + ```bash + kubectl --context primary logs -n documentdb-operator \ + -l app.kubernetes.io/name=documentdb-operator --tail=100 + ``` + +### Kubernetes cluster name mismatch + +If a Kubernetes cluster doesn't recognize itself as primary or replica: + +1. **Check cluster-identity ConfigMap:** + + ```bash + kubectl --context cluster1 get configmap cluster-identity \ + -n kube-system -o jsonpath='{.data.cluster-name}' + ``` + +2. **Verify the name matches the DocumentDB spec:** + + The returned name must exactly match one of the Kubernetes cluster names in `spec.clusterReplication.clusterList[*].name`. + +3. **Update ConfigMap if incorrect:** + + ```bash + kubectl --context cluster1 create configmap cluster-identity \ + --namespace kube-system \ + --from-literal=cluster-name="correct-cluster-name" \ + --dry-run=client -o yaml | kubectl apply -f - + ``` + +### Storage issues + +If PVCs aren't provisioning: + +1. **Verify storage class exists:** + + ```bash + kubectl --context cluster1 get storageclass + ``` + +2. **Check for VolumeSnapshotClass (required for backups):** + + ```bash + kubectl --context cluster1 get volumesnapshotclass + ``` + +3. 
**Review PVC events:** + + ```bash + kubectl --context cluster1 get events -n documentdb-preview-ns \ + --field-selector involvedObject.kind=PersistentVolumeClaim + ``` + +## Next steps + +- [Failover procedures](failover-procedures.md) - Learn how to perform planned and unplanned failovers +- [Backup and restore](../backup-and-restore.md) - Configure multi-region backup strategies +- [TLS configuration](../configuration/tls.md) - Secure connections with proper TLS certificates +- [AKS Fleet deployment example](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/aks-fleet-deployment/README.md) - Automated Azure multi-region setup diff --git a/documentdb-playground/aks-fleet-deployment/README.md b/documentdb-playground/aks-fleet-deployment/README.md index 853b1c7f..4b107a8d 100644 --- a/documentdb-playground/aks-fleet-deployment/README.md +++ b/documentdb-playground/aks-fleet-deployment/README.md @@ -158,7 +158,7 @@ Load aliases: source ~/.bashrc ``` -## Fleet Management +## KubeFleet ```bash # List member clusters diff --git a/documentdb-playground/aks-fleet-deployment/deploy-fleet-bicep.sh b/documentdb-playground/aks-fleet-deployment/deploy-fleet-bicep.sh index eab2d86f..dae119e0 100755 --- a/documentdb-playground/aks-fleet-deployment/deploy-fleet-bicep.sh +++ b/documentdb-playground/aks-fleet-deployment/deploy-fleet-bicep.sh @@ -109,6 +109,8 @@ while read -r cluster; do if [[ "$cluster" == *"$HUB_REGION"* ]]; then HUB_CLUSTER="$cluster"; fi done <<< "$MEMBER_CLUSTER_NAMES" +######### KUBEFLEET SETUP ######### + kubeDir=$(mktemp -d) git clone https://github.com/kubefleet-dev/kubefleet.git $kubeDir pushd $kubeDir @@ -200,6 +202,8 @@ done <<< "$MEMBER_CLUSTER_NAMES" popd +####### END KUBEFLEET SETUP ######## + # Create kubectl aliases and export FLEET_ID (k-hub and k-) persisted in ~/.bashrc ALIASES_BLOCK_START="# BEGIN aks aliases" ALIASES_BLOCK_END="# END aks aliases" diff --git a/mkdocs.yml b/mkdocs.yml index 
1bc4e919..f642b5c6 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -32,6 +32,10 @@ nav: - Networking: preview/configuration/networking.md - TLS: preview/configuration/tls.md - Storage: preview/configuration/storage.md + - Multi-Region Deployment: + - Overview: preview/multi-region-deployment/overview.md + - Setup Guide: preview/multi-region-deployment/setup.md + - Failover Procedures: preview/multi-region-deployment/failover-procedures.md - Advanced Configuration: preview/advanced-configuration/README.md - Backup and Restore: preview/backup-and-restore.md - API Reference: preview/api-reference.md