Commit b7296c5

Update disaster recovery documentation for Harbor
- Clarified the access address for the Primary Object Storage in the disaster recovery steps.
- Renamed the "Primary-Standby Switchover Procedure" section to "Failover" for better clarity.
- Expanded the "Disaster Recovery" section to include recovery steps for the original Primary Harbor.
- Added details on automatic start/stop mechanisms for the disaster recovery instance, including configuration and script examples for managing Harbor and PostgreSQL instances.
1 parent d9c3db7 commit b7296c5

docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md

Lines changed: 199 additions & 3 deletions
@@ -108,7 +108,7 @@ You need to create a CephObjectStoreUser in advance to obtain the access credent
You only need to create the CephObjectStoreUser on the Primary Object Storage. The user information will be automatically synchronized to the Secondary Object Storage through the disaster recovery replication mechanism.
:::

-2. This `PRIMARY_OBJECT_STORAGE_ADDRESS` is the access address of the Object Storage, you can get it from the step [Configure External Access for Primary Zone](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_object.html#configure-external-access-for-primary-zone) of `Object Storage Disaster Recovery`.
+2. This `PRIMARY_OBJECT_STORAGE_ADDRESS` is the access address of the Object Storage; you can get it from the step [Configure External Access for Primary Zone](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_object.html#address) of `Object Storage Disaster Recovery`.

3. Create a Harbor registry bucket on Primary Object Storage using mc; in this example, the bucket name is `harbor-registry`.
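
   A minimal sketch of this step, assuming an mc alias named `primary` and that `PRIMARY_OBJECT_STORAGE_ADDRESS`, `ACCESS_KEY`, and `SECRET_KEY` hold the endpoint and the CephObjectStoreUser credentials obtained above:

   ```bash
   # Register the Primary Object Storage under a local mc alias (use https:// if TLS is enabled)
   mc alias set primary "http://${PRIMARY_OBJECT_STORAGE_ADDRESS}" "${ACCESS_KEY}" "${SECRET_KEY}"

   # Create the bucket used by the Harbor registry
   mc mb primary/harbor-registry
   ```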

@@ -279,7 +279,7 @@ spec:
replicas: 0
```

-### Primary-Standby Switchover Procedure in Disaster Scenarios
+### Failover

1. First confirm that none of the Primary Harbor components are still running; if any are, stop all Primary Harbor components first.
2. Promote Secondary PostgreSQL to Primary PostgreSQL following the switchover procedure in the `PostgreSQL Hot Standby Cluster Configuration Guide`.
@@ -311,7 +311,27 @@ spec:
5. Test image push and pull to verify that Harbor is working properly.
6. Switch external access addresses to Secondary Harbor.

-### Disaster Recovery Data Check
+### Disaster Recovery

When the primary cluster recovers from a disaster, you can restore the original Primary Harbor to operate as a Secondary Harbor. Follow these steps to perform the recovery:

1. Set the replica count of all Harbor components to 0 (a kubectl sketch follows this list).
2. Configure the original Primary PostgreSQL to operate as Secondary PostgreSQL according to the `PostgreSQL Hot Standby Cluster Configuration Guide`.
3. Convert the original Primary Object Storage to Secondary Object Storage.

```bash
# From within the recovered zone, pull the latest realm configuration from the current master zone:
radosgw-admin realm pull --url={url-to-master-zone-gateway} \
  --access-key={access-key} --secret={secret}

# Make the recovered zone the master zone:
radosgw-admin zone modify --rgw-realm=<realm-name> --rgw-zonegroup=<zone-group-name> --rgw-zone=<primary-zone-name> --master
```
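
For step 1, a minimal sketch that reuses the same Harbor CR patch shown in the `stop.sh` example later in this document (the namespace `harbor-ns` and Harbor CR name `harbor` are illustrative):

```bash
# Scale all Harbor components of the original Primary Harbor down to 0 replicas
kubectl -n harbor-ns patch harbor harbor --type=merge \
  -p '{"spec":{"helmValues":{"core":{"replicas":0},"portal":{"replicas":0},"jobservice":{"replicas":0},"registry":{"replicas":0},"trivy":{"replicas":0}}}}'
```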

After completing these steps, the original Primary Harbor will operate as a Secondary Harbor.

If you need the original Primary Harbor to resume its role as the Primary Harbor, follow the Failover procedure to promote the current Secondary Harbor (the original Primary) back to Primary Harbor, and then configure the Harbor that served as Primary during the disaster to operate as Secondary Harbor.

### Data Sync Check

Check the synchronization status of Object Storage and PostgreSQL to ensure that the disaster recovery is successful.
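
One way to spot-check both, assuming the `rook-ceph-tools` toolbox Deployment and the illustrative PostgreSQL cluster used later in this document (`acid-pg` in namespace `pg-namespace`):

```bash
# Object Storage: report the multisite replication status from the rook-ceph toolbox
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- radosgw-admin sync status

# PostgreSQL: on the primary cluster, confirm the standby is streaming
kubectl -n pg-namespace exec acid-pg-0 -- psql -U postgres -c \
  "SELECT client_addr, state, sync_state FROM pg_stat_replication;"
```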

@@ -353,3 +373,179 @@ The RTO represents the maximum acceptable downtime during disaster recovery. Thi
The operational steps are similar to building a Harbor disaster recovery solution with `Alauda Build of Rook-Ceph` and `Alauda support for PostgreSQL`. Simply replace Object Storage and PostgreSQL with other object storage and PostgreSQL solutions.

Ensure that the Object Storage and PostgreSQL solutions support disaster recovery capabilities.

## Automatic Start/Stop of Disaster Recovery Instance

This mechanism enables automatic activation of the Secondary Harbor instance when a disaster occurs. It supports custom check mechanisms through user-defined scripts and provides control over Harbor dependency configurations.

```mermaid
flowchart TD
    Start[Monitoring Program] --> CheckScript[Check if Instance Should Start]
    CheckScript -->|"Yes (Script exit 0)"| StartScript[Execute StartScript]
    CheckScript -->|"No (Script exit Non-zero)"| StopScript[Execute StopScript]
```

### How to Configure and Run the Auto Start/Stop Program

1. Prepare the configuration file `config.yaml`:

```yaml
check_script: /path/to/check.sh # Path to the script that checks if the instance should start
start_script: /path/to/start.sh # Path to the script that starts the Harbor instance
stop_script: /path/to/stop.sh # Path to the script that stops the Harbor instance
check_interval: 30s # Interval between consecutive check runs
failure_threshold: 3 # Consecutive check failures tolerated before the stop script is executed
script_timeout: 10s # Timeout for a single script execution
```

2. Create the corresponding script files:

- **check.sh**: This script must be customized to your environment. It should exit with code 0 when the current cluster's instance should be started, and with a non-zero code otherwise. The following is a simple DNS IP check example (do not use it directly in production):

```bash
#!/bin/bash
HARBOR_DOMAIN="${HARBOR_DOMAIN:-}"
HARBOR_IP="${HARBOR_IP:-}"

# Start this cluster's instance only if the Harbor domain currently resolves to this cluster's IP
RESOLVED_IP=$(nslookup "$HARBOR_DOMAIN" 2>/dev/null | grep -A 1 "Name:" | grep "Address:" | awk '{print $2}' | head -n 1)
if [ "$RESOLVED_IP" = "$HARBOR_IP" ]; then
  exit 0
else
  exit 1
fi
```

- **start.sh**: The start script should include checks for Harbor dependencies and the startup of the Harbor instance.

```bash
#!/bin/bash
# Check and control dependencies here, such as verifying that the database is the
# primary instance and that the object storage is ready (replace with your own logic)

# Start Harbor - this section is required
HARBOR_NAMESPACE="${HARBOR_NAMESPACE:-harbor-ns}"
HARBOR_NAME="${HARBOR_NAME:-harbor}"
HARBOR_REPLICAS="${HARBOR_REPLICAS:-1}"
kubectl -n "$HARBOR_NAMESPACE" patch harbor "$HARBOR_NAME" --type=merge -p "{\"spec\":{\"helmValues\":{\"core\":{\"replicas\":$HARBOR_REPLICAS},\"portal\":{\"replicas\":$HARBOR_REPLICAS},\"jobservice\":{\"replicas\":$HARBOR_REPLICAS},\"registry\":{\"replicas\":$HARBOR_REPLICAS},\"trivy\":{\"replicas\":$HARBOR_REPLICAS}}}}"
```

- **stop.sh**: The stop script should include shutdown procedures for Harbor dependencies and the Harbor instance.

```bash
#!/bin/bash
# Stop Harbor - this section is required
HARBOR_NAMESPACE="${HARBOR_NAMESPACE:-harbor-ns}"
HARBOR_NAME="${HARBOR_NAME:-harbor}"
kubectl -n "$HARBOR_NAMESPACE" patch harbor "$HARBOR_NAME" --type=merge -p '{"spec":{"helmValues":{"core":{"replicas":0},"portal":{"replicas":0},"jobservice":{"replicas":0},"registry":{"replicas":0},"trivy":{"replicas":0}}}}'

# Check and control dependencies here, such as setting the database to replica mode
# (replace with your own dependency stop logic)
```

3. Deploy the control program as a Deployment in the Harbor namespace:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: harbor-disaster-recovery-controller
  namespace: harbor-ns # Use the same namespace where Harbor is deployed
spec:
  replicas: 1
  selector:
    matchLabels:
      app: harbor-disaster-recovery-controller
  template:
    metadata:
      labels:
        app: harbor-disaster-recovery-controller
    spec:
      containers:
        - name: controller
          image: xxx # Replace with your control program image
          command: ["<control-program-binary>", "-c", "/opt/config/config.yaml"] # Replace <control-program-binary> with the entrypoint of your control program
          volumeMounts:
            - name: script
              mountPath: /opt/script
            - name: config
              mountPath: /opt/config
      volumes:
        - name: script
          hostPath:
            path: <script dir> # Replace with your script directory path
        - name: config
          hostPath:
            path: <config dir> # Replace with your config directory path
```

> **Note**: Ensure that the ServiceAccount used by the Deployment has the necessary RBAC permissions to operate on Harbor resources and any other resources controlled by your custom scripts (such as database resources, object storage configurations, etc.) in the target namespace. The control program needs permission to modify the Harbor CRD resources in order to start and stop Harbor components, as well as permissions for any resources managed by the custom start/stop scripts. An illustrative Role is sketched after the apply command below.

Apply the Deployment:

```bash
kubectl apply -f harbor-disaster-recovery-controller.yaml
```

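A minimal sketch of the RBAC described in the note above, assuming the Harbor CR is served by the `goharbor.io` API group (the group name, namespace, and verbs are assumptions; adjust them to match your installation) and that the Role is bound to the Deployment's ServiceAccount:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: harbor-dr-controller
  namespace: harbor-ns # Same namespace as the Harbor CR and the controller Deployment
rules:
  # Allow the controller to read and patch the Harbor CR to scale its components
  - apiGroups: ["goharbor.io"]
    resources: ["harbors"]
    verbs: ["get", "list", "patch"]
```

If the start/stop scripts also manage resources in other namespaces (for example the PostgreSQL cluster), grant the equivalent permissions there with an additional Role or a ClusterRole, and bind it to the controller's ServiceAccount with a RoleBinding or ClusterRoleBinding.
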
### `Alauda support for PostgreSQL` Start/Stop Script Examples

When using the `Alauda support for PostgreSQL` solution with the `PostgreSQL Hot Standby Cluster Configuration Guide` to configure a disaster recovery cluster, you need to configure replication information in both Primary and Secondary PostgreSQL clusters. This ensures that during automatic failover, you only need to modify `clusterReplication.isReplica` and `numberOfInstances` to complete the switchover:

**Primary Configuration:**

```yaml
clusterReplication:
  enabled: true
  isReplica: false
  peerHost: 192.168.130.206 # Secondary cluster node IP
  peerPort: 31661 # Secondary cluster NodePort
  replSvcType: NodePort
  bootstrapSecret: standby-bootstrap-secret
```

The `standby-bootstrap-secret` should be configured according to the `Standby Cluster Configuration` section in the `PostgreSQL Hot Standby Cluster Configuration Guide`, using the same value as the Secondary cluster.

**Secondary Configuration:**

```yaml
clusterReplication:
  enabled: true
  isReplica: true
  peerHost: 192.168.12.108 # Primary cluster node IP
  peerPort: 30078 # Primary cluster NodePort
  replSvcType: NodePort
  bootstrapSecret: standby-bootstrap-secret
```

#### Start Script Example

```bash
POSTGRES_NAMESPACE="${POSTGRES_NAMESPACE:-pg-namespace}"
POSTGRES_CLUSTER="${POSTGRES_CLUSTER:-acid-pg}"
# Promote this cluster: take it out of replica mode and scale up to two instances
kubectl -n "$POSTGRES_NAMESPACE" patch pg "$POSTGRES_CLUSTER" --type=merge -p '{"spec":{"clusterReplication":{"isReplica":false},"numberOfInstances":2}}'
```

#### Stop Script Example

```bash
POSTGRES_NAMESPACE="${POSTGRES_NAMESPACE:-pg-namespace}"
POSTGRES_CLUSTER="${POSTGRES_CLUSTER:-acid-pg}"
# Demote this cluster: switch it back to replica mode and scale down to one instance
kubectl -n "$POSTGRES_NAMESPACE" patch pg "$POSTGRES_CLUSTER" --type=merge -p '{"spec":{"clusterReplication":{"isReplica":true},"numberOfInstances":1}}'
```

### Alauda Build of Rook-Ceph Start/Stop Script Examples

- **Start Script Example**: For more details, refer to [Object Storage Disaster Recovery](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_object.html)

```bash
REALM_NAME="${REALM_NAME:-real}"
ZONE_GROUP_NAME="${ZONE_GROUP_NAME:-group}"
ZONE_NAME="${ZONE_NAME:-zone}"

# Gather the realm credentials, zone endpoint, and toolbox pod name from the Rook-Ceph resources
ACCESS_KEY=$(kubectl -n rook-ceph get secrets "${REALM_NAME}-keys" -o jsonpath='{.data.access-key}' 2>/dev/null | base64 -d)
SECRET_KEY=$(kubectl -n rook-ceph get secrets "${REALM_NAME}-keys" -o jsonpath='{.data.secret-key}' 2>/dev/null | base64 -d)
ENDPOINT=$(kubectl -n rook-ceph get cephobjectzone realm-zone -o jsonpath='{.spec.customEndpoints[0]}')
TOOLS_POD=$(kubectl -n rook-ceph get po -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)

# Pull the realm from the current master zone, then make the local zone the master
kubectl -n rook-ceph exec "$TOOLS_POD" -- radosgw-admin realm pull --url="$ENDPOINT" --access-key="$ACCESS_KEY" --secret="$SECRET_KEY"
kubectl -n rook-ceph exec "$TOOLS_POD" -- radosgw-admin zone modify --rgw-realm="$REALM_NAME" --rgw-zonegroup="$ZONE_GROUP_NAME" --rgw-zone="$ZONE_NAME" --master
```

- **Stop Script**: No action is required when stopping Alauda Build of Rook-Ceph, so you can add an empty script or skip this step.
