diff --git a/docs/en/solutions/How_to_perform_disaster_recovery_for_gitlab.md b/docs/en/solutions/How_to_perform_disaster_recovery_for_gitlab.md new file mode 100644 index 00000000..656cbcf8 --- /dev/null +++ b/docs/en/solutions/How_to_perform_disaster_recovery_for_gitlab.md @@ -0,0 +1,479 @@ +--- +kind: + - Solution +products: + - Alauda DevOps +ProductsVersion: + - 4.x +id: KB251200003 +--- + +# How to Perform Disaster Recovery for GitLab + +## Issue + +This solution describes how to build a GitLab disaster recovery solution based on Ceph and PostgreSQL disaster recovery capabilities. The solution implements a **hot data, cold compute** architecture, where data is continuously synchronized to the secondary cluster through Ceph and PostgreSQL disaster recovery mechanisms. When the primary cluster fails, a secondary GitLab instance is deployed, and the secondary GitLab will quickly start using the disaster recovery data and provide services. The solution primarily focuses on data disaster recovery processing, and users need to implement their own GitLab access address switching mechanism. + +## Environment + +GitLab CE Operator: >=v17.11.1 + +## Terminology + +| Term | Description | +|-------------------------|-----------------------------------------------------------------------------| +| **Primary GitLab** | The active GitLab instance that serves normal business operations and user requests. This instance is fully operational with all components running. | +| **Secondary GitLab** | The standby GitLab instance planned to be deployed in a different cluster/region, remaining dormant until activated during disaster recovery scenarios. | +| **Primary PostgreSQL** | The active PostgreSQL database cluster that handles all data transactions and serves as the source for data replication to the secondary database. | +| **Secondary PostgreSQL**| The hot standby PostgreSQL database that receives real-time data replication from the primary database. It can be promoted to primary role during failover. | +| **Primary Object Storage**| The active S3-compatible object storage system that stores all GitLab attachment data and serves as the source for object storage replication. | +| **Secondary Object Storage**| The synchronized backup object storage system that receives data replication from the primary storage. It ensures data availability during disaster recovery. | +| **Gitaly** | Responsible for Git repository storage. | +| **Rails Secret**| The encryption key used by the GitLab Rails application to encrypt sensitive data. Primary GitLab and Secondary GitLab instances **must use the same key**. | +| **Recovery Point Objective (RPO)** | The maximum acceptable amount of data loss measured in time (e.g., 5 minutes, 1 hour). It defines how much data can be lost during a disaster before it becomes unacceptable. | +| **Recovery Time Objective (RTO)** | The maximum acceptable downtime measured in time (e.g., 15 minutes, 2 hours). It defines how quickly the system must be restored after a disaster. | +| **Failover** | The process of switching from the primary system to the secondary system when the primary system becomes unavailable or fails. | +| **Data Synchronization**| The continuous process of replicating data from primary systems to secondary systems to maintain consistency and enable disaster recovery. | +| **Hot Data, Cold Compute**| An architectural pattern where data is continuously synchronized (hot), while compute resources remain inactive (cold) until failover. 
| + +## Architecture + +![gitlab dr](../../public/gitlab-disaster-recovery.drawio.svg) + +The GitLab disaster recovery solution implements a **hot data, cold compute architecture** for GitLab services. This architecture provides disaster recovery capabilities through near-real-time data synchronization and manual GitLab service failover procedures. The architecture consists of two GitLab instances deployed across different clusters or regions, with the secondary GitLab instance not deployed in advance until activated during disaster scenarios, while the database and storage layers maintain continuous synchronization. + +### Data Synchronization Strategy + +The solution leverages three independent data synchronization mechanisms: + +1. **Database Layer**: PostgreSQL streaming replication ensures real-time transaction log synchronization between primary and secondary databases, including GitLab application database and Praefect metadata database +2. **Gitaly Storage Layer**: Block storage replication through Ceph disaster recovery mechanisms ensures Git repository data synchronization to the secondary cluster +3. **Attachment Storage Layer**: Object storage replication maintains GitLab attachment data consistency between primary and secondary storage systems + +::: tip +The following data is stored in attachment storage. If you assess that this data is not important, you can choose not to perform disaster recovery. + +| Object Type | Function Description | Default Bucket Name | +|--------------------|----------|--------------------| +| uploads | User uploaded files (avatars, attachments, etc.) | gitlab-uploads | +| lfs | Git LFS large file objects | gitlab-lfs | +| artifacts | CI/CD Job artifacts | gitlab-artifacts | +| packages | Package management data (e.g., PyPI, Maven, NuGet) | gitlab-packages | +| external_mr_diffs | Merge Request diff data | gitlab-mr-diffs | +| terraform_state | Terraform state files | gitlab-terraform-state | +| ci_secure_files | CI secure files (sensitive certificates, keys, etc.) | gitlab-ci-secure-files | +| dependency_proxy | Dependency proxy cache | gitlab-dependency-proxy | +| pages | GitLab Pages content | gitlab-pages | + +::: + +### Disaster Recovery Configuration + +1. **Deploy Primary GitLab**: Configure the primary instance in high availability mode, configure domain access, connect to the primary PostgreSQL database (GitLab and Praefect databases), use primary object storage for attachments, and configure Gitaly to use block storage +2. **Prepare Secondary GitLab Deployment Environment**: Configure the PV, PVC, and Secret resources required for the secondary instance to enable rapid recovery when disasters occur + +### Failover Procedure + +When a disaster occurs, the following steps ensure transition to the secondary environment: + +1. **Verify Primary Failure**: Confirm that all primary GitLab components are unavailable +2. **Promote Database**: Use database failover procedures to promote secondary PostgreSQL to primary +3. **Promote Object Storage**: Activate secondary object storage as primary +4. **Promote Ceph RBD**: Promote secondary Ceph RBD to primary +5. **Restore PVCs Used by Gitaly**: According to the Ceph block storage disaster recovery documentation, restore the PVCs used by Gitaly in the secondary cluster +6. **Deploy Secondary GitLab**: Quickly deploy the GitLab instance in the secondary cluster using disaster recovery data +7. 
**Update Routing**: Switch external access addresses to point to the secondary GitLab instance + +## GitLab Disaster Recovery Configuration + +::: warning + +To simplify the configuration process and reduce configuration difficulty, it is recommended to use consistent information in both primary and secondary environments, including: + +- Consistent database instance names and passwords +- Consistent Redis instance names and passwords +- Consistent Ceph storage pool names and storage class names +- Consistent GitLab instance names +- Consistent namespace names + +::: + +### Prerequisites + +1. Prepare a primary cluster and a disaster recovery cluster (or a cluster containing different regions) in advance. +2. Complete the deployment of `Alauda support for PostgreSQL` disaster recovery configuration. +3. Complete the deployment of `Alauda Build of Rook-Ceph` object storage disaster recovery configuration ([optional if conditions are met](#data-synchronization-strategy)). +4. Complete the deployment of `Alauda Build of Rook-Ceph` block storage disaster recovery configuration. + +:::warning +For `Alauda Build of Rook-Ceph` block storage disaster recovery configuration, you need to set a reasonable [synchronization interval](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html#create-volumereplicationclass), which directly affects the RPO metric of disaster recovery. +::: + +### Building PostgreSQL Disaster Recovery Cluster with `Alauda support for PostgreSQL` + +Refer to `PostgreSQL Hot Standby Cluster Configuration Guide` to build a disaster recovery cluster using `Alauda support for PostgreSQL`. + +Ensure that Primary PostgreSQL and Secondary PostgreSQL are in different clusters (or different regions). + +You can search for `PostgreSQL Hot Standby Cluster Configuration Guide` on [Alauda Knowledge](https://cloud.alauda.io/knowledges#/) to obtain it. + +:::warning + +`PostgreSQL Hot Standby Cluster Configuration Guide` is a document that describes how to build a disaster recovery cluster using `Alauda support for PostgreSQL`. Please ensure compatibility with the appropriate ACP version when using this configuration. + +::: + +### Building Block Storage Disaster Recovery Cluster with `Alauda Build of Rook-Ceph` + +Build a block storage disaster recovery cluster using `Alauda Build of Rook-Ceph`. Refer to [Block Storage Disaster Recovery](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html) documentation to build a disaster recovery cluster. + +### Building Object Storage Disaster Recovery Cluster with `Alauda Build of Rook-Ceph` + +Build an object storage disaster recovery cluster using `Alauda Build of Rook-Ceph`. Refer to [Object Storage Disaster Recovery](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_object.html) documentation to build an object storage disaster recovery cluster. + +You need to create a CephObjectStoreUser in advance to obtain the access credentials for Object Storage, and prepare a GitLab object storage bucket on Primary Object Storage: + +1. Create a CephObjectStoreUser on Primary Object Storage to obtain access credentials: [Create CephObjectStoreUser](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/create_object_user.html). + + :::info + You only need to create the CephObjectStoreUser on the Primary Object Storage. 
The user information will be automatically synchronized to the Secondary Object Storage through the disaster recovery replication mechanism. + ::: + +2. Obtain the object storage access address `PRIMARY_OBJECT_STORAGE_ADDRESS`. You can get it from the step [Configure External Access for Primary Zone](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_object.html#configure-external-access-for-primary-zone) of `Object Storage Disaster Recovery`. + + ```bash + $ mc alias set primary-s3 + Added `primary-s3` successfully. + $ mc alias list + primary-s3 + URL : + AccessKey : + SecretKey : + API : s3v4 + Path : auto + Src : /home/demo/.mc/config.json + ``` + +3. Use mc to create GitLab object storage buckets on Primary Object Storage. In this example, two buckets `gitlab-uploads` and `gitlab-lfs` are created. + + ```bash + # Create + mc mb primary-s3/gitlab-uploads + mc mb primary-s3/gitlab-lfs + + # Check + mc ls primary-s3/gitlab-uploads + mc ls primary-s3/gitlab-lfs + ``` + + :::info + Depending on the GitLab features used, you may also need to use [other buckets](#data-synchronization-strategy), which can be created as needed. + ::: + +### Set Up Primary GitLab + +Deploy the Primary GitLab instance by following the [GitLab Instance Deployment](https://docs.alauda.io/alauda-build-of-gitlab/17.11/en/install/03_gitlab_deploy.html#deploying-from-the-gitlab-high-availability-template) guide. Configure it in high availability mode, configure domain access, connect to the Primary PostgreSQL database (GitLab application database and Praefect database), use Primary Object Storage for attachments, and configure Gitaly to use Primary block storage. + +Configuration example (only includes configuration items related to disaster recovery, see product documentation for complete configuration items): + +```yaml +apiVersion: operator.alaudadevops.io/v1alpha1 +kind: GitlabOfficial +metadata: + name: + namespace: +spec: + externalURL: http://gitlab-ha.example.com # GitLab access domain + helmValues: + gitlab: + gitaly: + persistence: # Configure gitaly storage, use ceph RBD storage class, high availability mode will automatically create 3 replicas + enabled: true + size: 5Gi + storageClass: ceph-rdb # Storage class name, specify as the storage class configured for disaster recovery + webservice: + ingress: + enabled: true + global: + appConfig: + object_store: + connection: # Configure object storage, connect to primary object storage + secret: gitlab-object-storage + key: connection + enabled: true + praefect: # Configure praefect database, connect to primary PostgreSQL database + dbSecret: + key: password + secret: gitlab-pg-prefact + enabled: true + psql: + dbName: gitlab_prefact + host: acid-gitlab.test.svc + port: 5432 + sslMode: require + user: postgres + virtualStorages: + - gitalyReplicas: 3 + maxUnavailable: 1 + name: default + psql: # Configure application database, connect to primary PostgreSQL database + database: gitlab + host: acid-gitlab.test.svc + password: + key: password + secret: gitlab-pg + port: 5432 + username: postgres +``` + +After deploying Primary GitLab, you need to configure RBD Mirror for the PVCs used by the Gitaly component. After configuration, PVC data will be periodically synchronized to the secondary Ceph cluster. For specific parameter configuration, refer to [Ceph RBD Mirror](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html#enable-mirror-for-pvc). 
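In high availability mode, Gitaly has one PVC per replica (three by default), and each of these PVCs needs its own `VolumeReplication` resource. A minimal sketch to enumerate them first, assuming the Gitaly PVCs carry the `app=gitaly` and `release=$GITLAB_NAME` labels referenced in the backup steps later in this document:

```bash
# List the Gitaly PVCs that need a VolumeReplication resource.
# Assumes GITLAB_NAMESPACE / GITLAB_NAME are set to the namespace and name of the
# GitLab instance, and that the Gitaly PVCs carry the app=gitaly,release=$GITLAB_NAME labels.
kubectl -n "$GITLAB_NAMESPACE" get pvc \
  -l app=gitaly,release="$GITLAB_NAME" \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
```

Create one `VolumeReplication` per PVC name returned above, using the template below: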
+ +```bash +cat << EOF | kubectl apply -f - +apiVersion: replication.storage.openshift.io/v1alpha1 +kind: VolumeReplication +metadata: + name: + namespace: +spec: + autoResync: true # Auto resync + volumeReplicationClass: rbd-volumereplicationclass + replicationState: primary # Mark as primary cluster + dataSource: + apiGroup: "" + kind: PersistentVolumeClaim + name: +EOF +``` + +Check the Ceph RBD Mirror status. You can see that all three PVCs of Gitaly have been configured with Ceph RBD Mirror. + +```bash +❯ kubectl -n $GITLAB_NAMESPACE get volumereplication +NAME AGE VOLUMEREPLICATIONCLASS PVCNAME DESIREDSTATE CURRENTSTATE +repo-data-dr-gitlab-ha-gitaly-default-0 15s rbd-volumereplicationclass repo-data-dr-gitlab-ha-gitaly-default-0 primary Primary +repo-data-dr-gitlab-ha-gitaly-default-1 15s rbd-volumereplicationclass repo-data-dr-gitlab-ha-gitaly-default-1 primary Primary +repo-data-dr-gitlab-ha-gitaly-default-2 14s rbd-volumereplicationclass repo-data-dr-gitlab-ha-gitaly-default-2 primary Primary +``` + +Check the Ceph RBD Mirror status from the Ceph side. `CEPH_BLOCK_POOL` is the name of the Ceph RBD storage pool. The `SCHEDULE` column indicates the synchronization frequency (the example below shows synchronization every 1 minute). + +```bash +❯ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- rbd mirror snapshot schedule ls --pool $CEPH_BLOCK_POOL --recursive +POOL NAMESPACE IMAGE SCHEDULE +myblock csi-vol-135ec569-0a3a-49c1-a0b1-46d669510200 every 1m +myblock csi-vol-459e6f28-a158-4ae9-b5da-163448c35119 every 1m +myblock csi-vol-7f13040d-d543-40ed-b416-3ecf639cf4c9 every 1m +``` + +Check the Ceph RBD Mirror status. A state of `up+stopped` (primary cluster normal) and peer_sites.state of `up+replaying` (secondary cluster normal) indicates normal synchronization. + +```bash +❯ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- rbd mirror image status $CEPH_BLOCK_POOL/$GITALY_BLOCK_IMAGE_NAME +csi-vol-459e6f28-a158-4ae9-b5da-163448c35119: + global_id: 98bbf3bf-7c61-42b4-810b-cb2a7cd6d6b1 + state: up+stopped + description: local image is primary + service: a on 192.168.129.233 + last_update: 2025-11-19 01:42:07 + peer_sites: + name: ecf558fa-1e8a-43f1-bf6b-1478e73f272e + state: up+replaying + description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":5742592.0,"last_snapshot_bytes":5742592,"last_snapshot_sync_seconds":0,"local_snapshot_timestamp":1763516344,"remote_snapshot_timestamp":1763516344,"replay_state":"idle"} + last_update: 2025-11-19 01:42:27 + snapshots: + 75 .mirror.primary.98bbf3bf-7c61-42b4-810b-cb2a7cd6d6b1.3d3402a5-f298-4048-8c50-84979949355d (peer_uuids:[66d8fb19-c610-438c-ae73-42a95ea4e86e]) +``` + +### Set Up Secondary GitLab + +:::warning +When Ceph RBD is in secondary state, the synchronized storage blocks cannot be mounted, so GitLab in the secondary cluster cannot be deployed successfully. + +If you need to verify whether GitLab in the secondary cluster can be deployed successfully, you can temporarily promote the Ceph RBD of the secondary cluster to primary, and after testing is complete, set it back to secondary state. At the same time, you need to delete all gitlabofficial, PV, and PVC resources created during testing. +::: + +1. Backup the Secrets used by Primary GitLab +2. Backup the PVC and PV resource YAMLs of the Primary cluster GitLab Gitaly component (note: high availability mode will have at least 3 PVC and PV resources) +3. Backup the Primary cluster GitLab gitlabofficial resource YAML +4. 
Deploy the Redis instance used by Secondary GitLab + +#### Backup Secrets Used by Primary GitLab + +Obtain the PostgreSQL Secret YAML used by Primary GitLab and create the Secret in the secondary cluster with the same namespace name. + +```bash +export GITLAB_NAMESPACE= +export GITLAB_NAME= +``` + +```bash +# PostgreSQL Secret +PG_SECRET=$(kubectl -n "$GITLAB_NAMESPACE" get gitlabofficial "$GITLAB_NAME" -o jsonpath='{.spec.helmValues.global.psql.password.secret}') +[[ -n "$PG_SECRET" ]] && kubectl -n "$GITLAB_NAMESPACE" get secret "$PG_SECRET" -o yaml > pg-secret.yaml + +# Praefect PostgreSQL Secret +PRAEFECT_PG_SECRET=$(kubectl -n "$GITLAB_NAMESPACE" get gitlabofficial "$GITLAB_NAME" -o jsonpath='{.spec.helmValues.global.praefect.dbSecret.secret}') +[[ -n "$PRAEFECT_PG_SECRET" ]] && kubectl -n "$GITLAB_NAMESPACE" get secret "$PRAEFECT_PG_SECRET" -o yaml > praefect-secret.yaml + +# Rails Secret +RAILS_SECRET=$(kubectl -n "$GITLAB_NAMESPACE" get gitlabofficial "$GITLAB_NAME" -o jsonpath='{.spec.helmValues.global.railsSecrets.secret}' || echo "${GITLAB_NAME}-rails-secret") +[[ -z "$RAILS_SECRET" ]] && export RAILS_SECRET="${GITLAB_NAME}-rails-secret" # use default secret name if not found +[[ -n "$RAILS_SECRET" ]] && kubectl -n "$GITLAB_NAMESPACE" get secret "$RAILS_SECRET" -o yaml > rails-secret.yaml + +# Object Storage Secret +OBJECT_STORAGE_SECRET=$(kubectl -n "$GITLAB_NAMESPACE" get gitlabofficial "$GITLAB_NAME" -o jsonpath='{.spec.helmValues.global.appConfig.object_store.connection.secret}') +[[ -n "$OBJECT_STORAGE_SECRET" ]] && kubectl -n "$GITLAB_NAMESPACE" get secret "$OBJECT_STORAGE_SECRET" -o yaml > object-storage-secret.yaml + +# Root Password Secret +ROOT_USER_SECRET=$(kubectl -n "$GITLAB_NAMESPACE" get gitlabofficial "$GITLAB_NAME" -o jsonpath='{.spec.helmValues.global.initialRootPassword.secret}') +[[ -n "$ROOT_USER_SECRET" ]] && kubectl -n "$GITLAB_NAMESPACE" get secret "$ROOT_USER_SECRET" -o yaml > root-user-secret.yaml +``` + +Make the following modifications to the backed up files: + +- pg-secret.yaml: Change the `host` and `password` fields to the PostgreSQL connection address and password of the secondary cluster +- praefect-secret.yaml: Change the `host` and `password` fields to the Praefect PostgreSQL connection address and password of the secondary cluster +- object-storage-secret.yaml: Change the `endpoint` field in `connection` to the object storage connection address of the secondary cluster + +Create the backed up YAML files in the disaster recovery environment with the same namespace name. + +#### Backup PVC and PV Resources of Primary GitLab Gitaly Component + +:::tip +PV resources contain volume attribute information, which is critical information for disaster recovery restoration and needs to be backed up properly. 
+ +```bash + volumeAttributes: + clusterID: rook-ceph + imageFeatures: layering + imageFormat: "2" + imageName: csi-vol-459e6f28-a158-4ae9-b5da-163448c35119 + journalPool: myblock + pool: myblock + storage.kubernetes.io/csiProvisionerIdentity: 1763446982673-7963-rook-ceph.rbd.csi.ceph.com +``` + +::: + +Execute the following command to backup the PVC and PV resources of the Primary GitLab Gitaly component to the current directory (if other PVCs are used, they need to be backed up manually): + +```bash +kubectl -n "$GITLAB_NAMESPACE" \ + get pvc -l app=gitaly,release="$GITLAB_NAME" \ + -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \ +| while read -r pvc; do + + echo "=> Exporting PVC $pvc" + + # Export PVC + kubectl -n "$GITLAB_NAMESPACE" get pvc "$pvc" -o yaml > "pvc-${pvc}.yaml" + + # Get PV + PV=$(kubectl -n "$GITLAB_NAMESPACE" get pvc "$pvc" -o jsonpath='{.spec.volumeName}') + + if [[ -n "$PV" ]]; then + echo " ↳ Exporting PV $PV" + kubectl get pv "$PV" -o yaml > "pv-${PV}.yaml" + fi + + echo "" +done +``` + +Modify the three backed up PV files and delete all `spec.claimRef` fields in the yaml. + +Create the backed up PVC and PV YAML files directly in the disaster recovery environment with the same namespace name. + +#### Backup Primary GitLab Instance YAML + +```bash +kubectl -n "$GITLAB_NAMESPACE" get gitlabofficial "$GITLAB_NAME" -oyaml > gitlabofficial.yaml +``` + +Modify the information in `gitlabofficial.yaml` according to the actual situation of the disaster recovery environment, including PostgreSQL connection address, Redis connection address, etc. + +:::warning +The `GitlabOfficial` resource **does not need** to be created in the disaster recovery environment immediately. It only needs to be created in the secondary cluster when a disaster occurs and disaster recovery switchover is performed. +::: + +:::warning +If you need to perform disaster recovery drills, you can follow the steps in [Primary-Secondary Switchover Procedure in Disaster Scenarios](#primary-secondary-switchover-procedure-in-disaster-scenarios) for drills. After the drill is complete, you need to perform the following cleanup operations in the disaster recovery environment: + +- Delete the `GitlabOfficial` instance in the disaster recovery environment +- Delete the created PVCs and PVs +- Switch the PostgreSQL cluster to secondary state +- Switch the Ceph object storage to secondary state +- Switch the Ceph RBD to secondary state + +::: + +#### Deploy Redis Instance Used by Secondary GitLab + +Refer to the Redis instance configuration of the primary cluster, and deploy a Redis instance in the disaster recovery environment with the same namespace name using the same instance name and password. + +### Recovery Objectives + +#### Recovery Point Objective (RPO) + +The RPO represents the maximum acceptable data loss during a disaster recovery scenario. 
In this GitLab disaster recovery solution: + +- **Database Layer**: Near-zero data loss due to PostgreSQL hot standby streaming replication (applicable to GitLab application database and Praefect metadata database) +- **Attachment Storage Layer**: Near-zero data loss due to object storage streaming replication used by GitLab attachment storage +- **Gitaly Storage Layer**: Due to Ceph RBD block storage replication for Git repository data, synchronized through scheduled snapshots, data loss depends on the synchronization interval, which can be [configured](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html#create-volumereplicationclass) +- **Overall RPO**: The overall RPO depends on the synchronization interval of Ceph RBD block storage replication. + +#### Recovery Time Objective (RTO) + +The RTO represents the maximum acceptable downtime during disaster recovery. This solution provides: + +- **Manual Components**: GitLab service activation and external routing updates require manual intervention +- **Typical RTO**: 6-16 minutes for complete service restoration + +**RTO Breakdown:** + +- Database failover: 1-2 minutes (manual) +- Object storage failover: 1-2 minutes (manual) +- Ceph RBD failover: 1-2 minutes (manual) +- GitLab service activation: 2-5 minutes (manual) +- External routing updates: 1-5 minutes (manual, depends on DNS propagation) + +## Primary-Secondary Switchover Procedure in Disaster Scenarios + +1. **Confirm Primary GitLab Failure**: Confirm that all primary GitLab components are in non-working state, otherwise stop all primary GitLab components first. + +2. **Promote Secondary PostgreSQL**: Promote Secondary PostgreSQL to Primary PostgreSQL. Refer to the switchover procedure in `PostgreSQL Hot Standby Cluster Configuration Guide`. + +3. **Promote Secondary Object Storage**: Promote Secondary Object Storage to Primary Object Storage. Refer to the switchover procedure in [Alauda Build of Rook-Ceph Failover](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_object.html#procedures-1). + +4. **Promote Secondary Ceph RBD**: Promote Secondary Ceph RBD to Primary Ceph RBD. Refer to the switchover procedure in [Alauda Build of Rook-Ceph Failover](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html#procedures-1). + +5. 
**Restore PVC and PV Resources**: Restore the backed up PVC and PV resources to the disaster recovery environment with the same namespace name, and check whether the PVC status in the secondary cluster is `Bound`: + + ```bash + ❯ kubectl -n $GITLAB_NAMESPACE get pvc,pv + NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE + persistentvolumeclaim/repo-data-dr-gitlab-ha-gitaly-default-0 Bound pvc-231a9021-2548-433e-8583-f7b56d74aca7 5Gi RWO ceph-rdb 45s + persistentvolumeclaim/repo-data-dr-gitlab-ha-gitaly-default-1 Bound pvc-2995a8a7-648c-4e99-a3d3-c73a483a601b 5Gi RWO ceph-rdb 30s + persistentvolumeclaim/repo-data-dr-gitlab-ha-gitaly-default-2 Bound pvc-e4a94d84-d5e2-419f-bbbd-285fa88b6b5e 5Gi RWO ceph-rdb 19s + + NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE + persistentvolume/pvc-231a9021-2548-433e-8583-f7b56d74aca7 5Gi RWO Delete Bound fm-1-ns/repo-data-dr-gitlab-ha-gitaly-default-0 ceph-rdb 63s + persistentvolume/pvc-2995a8a7-648c-4e99-a3d3-c73a483a601b 5Gi RWO Delete Bound fm-1-ns/repo-data-dr-gitlab-ha-gitaly-default-1 ceph-rdb 30s + persistentvolume/pvc-e4a94d84-d5e2-419f-bbbd-285fa88b6b5e 5Gi RWO Delete Bound fm-1-ns/repo-data-dr-gitlab-ha-gitaly-default-2 ceph-rdb 19s + ``` + +6. **Deploy Secondary GitLab**: Restore the backed up `gitlabofficial.yaml` to the disaster recovery environment with the same namespace name. GitLab will automatically start using the disaster recovery data. + +7. **Verify GitLab Components**: Verify that all GitLab components are running and healthy. Test GitLab functionality (repository access, CI/CD pipelines, user authentication) to verify that GitLab is working properly. + +8. **Switch Access Address**: Switch external access addresses to Secondary GitLab. + +## Building GitLab Disaster Recovery Solution with Other Object Storage and PostgreSQL + +The operational steps are similar to building a GitLab disaster recovery solution with `Alauda Build of Rook-Ceph` and `Alauda support for PostgreSQL`. Simply replace storage and PostgreSQL with other object storage and PostgreSQL solutions that support disaster recovery. + +:::warning +Ensure that the selected storage and PostgreSQL solutions support disaster recovery capabilities, and perform sufficient disaster recovery drills before using in production environments. +::: + diff --git a/docs/en/solutions/How_to_perform_disaster_recovery_for_nexus.md b/docs/en/solutions/How_to_perform_disaster_recovery_for_nexus.md new file mode 100644 index 00000000..80bcd4ea --- /dev/null +++ b/docs/en/solutions/How_to_perform_disaster_recovery_for_nexus.md @@ -0,0 +1,306 @@ +--- +kind: + - Solution +products: + - Alauda DevOps +ProductsVersion: + - 4.x +id: KB251200004 +--- + +# How to Perform Disaster Recovery for Nexus + +## Issue + +This solution describes how to build a Nexus disaster recovery solution based on Ceph block storage disaster recovery capabilities. The solution implements a **hot data, cold compute** architecture, where data is continuously synchronized to the secondary cluster through Ceph block storage disaster recovery mechanisms. When the primary cluster fails, a secondary Nexus instance is deployed, and the secondary Nexus will quickly start using the disaster recovery data and provide services. The solution primarily focuses on data disaster recovery processing, and users need to implement their own Nexus access address switching mechanism. 
+ +## Environment + +Nexus Operator: >=v3.81.1 + +## Terminology + +| Term | Description | +|-------------------------|-----------------------------------------------------------------------------| +| **Primary Nexus** | The active Nexus instance that serves normal business operations and user requests. This instance is fully operational with all components running. | +| **Secondary Nexus** | The standby Nexus instance planned to be deployed in a different cluster/region, remaining dormant until activated during disaster recovery scenarios. | +| **Primary Block Storage**| The active block storage system that stores all Nexus data, serving as the source for block storage replication. | +| **Secondary Block Storage**| The synchronized backup block storage system that receives data replication from the primary block storage. It ensures data availability during disaster recovery. | +| **Recovery Point Objective (RPO)** | The maximum acceptable amount of data loss measured in time (e.g., 5 minutes, 1 hour). It defines how much data can be lost during a disaster before it becomes unacceptable. | +| **Recovery Time Objective (RTO)** | The maximum acceptable downtime measured in time (e.g., 15 minutes, 2 hours). It defines how quickly the system must be restored after a disaster. | +| **Failover** | The process of switching from the primary system to the secondary system when the primary system becomes unavailable or fails. | +| **Data Synchronization**| The continuous process of replicating data from primary systems to secondary systems to maintain consistency and enable disaster recovery. | +| **Hot Data, Cold Compute**| An architectural pattern where data is continuously synchronized (hot), while compute resources remain inactive (cold) until failover. | + +## Architecture + +The Nexus disaster recovery solution implements a **hot data, cold compute architecture** for Nexus services. This architecture provides disaster recovery capabilities through near-real-time data synchronization and manual Nexus service failover procedures. The architecture consists of two Nexus instances deployed across different clusters or regions, with the secondary Nexus instance not deployed in advance until activated during disaster scenarios, while the storage layer maintains continuous synchronization. + +### Data Synchronization Strategy + +The solution ensures Nexus data synchronization to the secondary cluster through Ceph RBD Mirror block storage replication. All Nexus data is stored in PVCs, which are periodically synchronized to the secondary cluster through the Ceph RBD Mirror mechanism. + +### Disaster Recovery Configuration + +1. **Deploy Primary Nexus**: Configure domain access, use primary block storage for data storage +2. **Prepare Secondary Nexus Deployment Environment**: Configure PV, PVC, and Secret resources required for the secondary instance to enable rapid recovery when disasters occur + +### Failover Procedure + +When a disaster occurs, the following steps ensure transition to the secondary environment: + +1. **Verify Primary Failure**: Confirm that all primary Nexus components are unavailable +2. **Promote Ceph RBD**: Promote secondary Ceph RBD to primary Ceph RBD +3. **Restore PVC and PV Resources**: According to the Ceph block storage disaster recovery documentation, restore the PVCs used by Nexus in the secondary cluster +4. **Deploy Secondary Nexus**: Quickly deploy the Nexus instance in the secondary cluster using disaster recovery data +5. 
**Update Routing**: Switch external access addresses to point to the secondary Nexus instance + +## Nexus Disaster Recovery Configuration + +::: warning + +To simplify the configuration process and reduce configuration difficulty, it is recommended to use consistent information in both primary and secondary environments, including: + +- Consistent Ceph storage pool names and storage class names +- Consistent Nexus instance names +- Consistent namespace names + +::: + +### Prerequisites + +1. Prepare a primary cluster and a disaster recovery cluster (or a cluster containing different regions) in advance. +2. Complete the deployment of `Alauda Build of Rook-Ceph` block storage disaster recovery configuration. + +:::warning +The `Alauda Build of Rook-Ceph` block storage disaster recovery configuration requires setting a reasonable [synchronization interval](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html#create-volumereplicationclass), which directly affects the RPO metric of disaster recovery. +::: + +### Building Block Storage Disaster Recovery Cluster with `Alauda Build of Rook-Ceph` + +Build a block storage disaster recovery cluster using `Alauda Build of Rook-Ceph`. Refer to the [Block Storage Disaster Recovery](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html) documentation to build the disaster recovery cluster. + +### Set Up Primary Nexus + +Deploy the Primary Nexus instance by following the Nexus instance deployment guide. Configure domain access, use primary block storage for data storage. + +Configuration example (only includes configuration items related to disaster recovery, see product documentation for complete configuration items): + +```yaml +apiVersion: operator.alaudadevops.io/v1alpha1 +kind: Nexus +metadata: + name: + namespace: +spec: + externalURL: http://nexus-ddrs.alaudatech.net + helmValues: + pvc: + storage: 5Gi + volumeClaimTemplate: + enabled: true + storageClass: + name: ceph-rdb # Set the configured storage class name +``` + +After deploying the primary Nexus, you need to configure RBD Mirror for the PVCs used by Nexus components. After configuration, PVC data will be periodically synchronized to the secondary Ceph cluster. For specific parameter configuration, refer to [Ceph RBD Mirror](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html#enable-mirror-for-pvc). + +```bash +export NEXUS_NAMESPACE= +export NEXUS_NAME= +export NEXUS_PVC_NAME=nexus-data-${NEXUS_NAME}-nxrm-ha-0 + +cat << EOF | kubectl apply -f - +apiVersion: replication.storage.openshift.io/v1alpha1 +kind: VolumeReplication +metadata: + name: ${NEXUS_PVC_NAME} + namespace: ${NEXUS_NAMESPACE} +spec: + autoResync: true # Auto sync + volumeReplicationClass: rbd-volumereplicationclass + replicationState: primary # Mark as primary cluster + dataSource: + apiGroup: "" + kind: PersistentVolumeClaim + name: ${NEXUS_PVC_NAME} +EOF +``` + +Check the Ceph RBD Mirror status to see that the Nexus PVC has been configured with Ceph RBD Mirror. + +```bash +❯ kubectl -n $NEXUS_NAMESPACE get volumereplication +NAME AGE VOLUMEREPLICATIONCLASS PVCNAME DESIREDSTATE CURRENTSTATE +nexus-data-nexus-ddrs-nxrm-ha-0 15s rbd-volumereplicationclass nexus-data-nexus-ddrs-nxrm-ha-0 primary Primary +``` + +View the Ceph RBD Mirror status from the Ceph side. `CEPH_BLOCK_POOL` is the name of the Ceph RBD storage pool. 
The `SCHEDULE` column indicates the synchronization frequency (the example below shows synchronization every 1 minute). + +```bash +❯ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- rbd mirror snapshot schedule ls --pool $CEPH_BLOCK_POOL --recursive +POOL NAMESPACE IMAGE SCHEDULE +myblock csi-vol-459e6f28-a158-4ae9-b5da-163448c35119 every 1m +``` + +Check the Ceph RBD Mirror status. When state is `up+stopped` (primary cluster normal) and peer_sites.state is `up+replaying` (secondary cluster normal), it indicates normal synchronization. + +```bash +❯ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- rbd mirror image status $CEPH_BLOCK_POOL/$NEXUS_BLOCK_IMAGE_NAME +csi-vol-459e6f28-a158-4ae9-b5da-163448c35119: + global_id: 98bbf3bf-7c61-42b4-810b-cb2a7cd6d6b1 + state: up+stopped + description: local image is primary + service: a on 192.168.129.233 + last_update: 2025-11-19 01:42:07 + peer_sites: + name: ecf558fa-1e8a-43f1-bf6b-1478e73f272e + state: up+replaying + description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":5742592.0,"last_snapshot_bytes":5742592,"last_snapshot_sync_seconds":0,"local_snapshot_timestamp":1763516344,"remote_snapshot_timestamp":1763516344,"replay_state":"idle"} + last_update: 2025-11-19 01:42:27 + snapshots: + 75 .mirror.primary.98bbf3bf-7c61-42b4-810b-cb2a7cd6d6b1.3d3402a5-f298-4048-8c50-84979949355d (peer_uuids:[66d8fb19-c610-438c-ae73-42a95ea4e86e]) +``` + +### Set Up Secondary Nexus + +:::warning +When Ceph RBD is in secondary state, the synchronized storage blocks cannot be mounted, so Nexus in the secondary cluster cannot be deployed successfully. + +If you need to verify whether Nexus in the secondary cluster can be deployed successfully, you can temporarily promote the Ceph RBD of the secondary cluster to primary, and after testing is complete, set it back to secondary state. At the same time, you need to delete all Nexus, PV, and PVC resources created during testing. +::: + +1. Backup Secrets Used by Primary Nexus +2. Backup PVC and PV Resource YAMLs of Primary Nexus Components +3. Backup Primary Nexus Instance YAML + +#### Backup Secrets Used by Primary Nexus + +Get the Password Secret YAML used by the primary Nexus and create the Secret in the secondary cluster with the same namespace name. + +```bash +apiVersion: v1 +data: + password: xxxxxx +kind: Secret +metadata: + name: nexus-root-password + namespace: nexus-dr +type: Opaque +``` + +#### Backup PVC and PV Resources of Primary Nexus Components + +:::tip +The PV resource contains volume attribute information, which is critical information for disaster recovery restoration and needs to be backed up properly. 
+ +```bash + volumeAttributes: + clusterID: rook-ceph + imageFeatures: layering + imageFormat: "2" + imageName: csi-vol-459e6f28-a158-4ae9-b5da-163448c35119 + journalPool: myblock + pool: myblock + storage.kubernetes.io/csiProvisionerIdentity: 1763446982673-7963-rook-ceph.rbd.csi.ceph.com +``` + +::: + +Execute the following command to backup the PVC and PV resources of the primary Nexus components to the current directory: + +```bash +export NEXUS_PVC_NAME= + +echo "=> Exporting PVC $NEXUS_PVC_NAME" + +# Export PVC +kubectl -n "$NEXUS_NAMESPACE" get pvc "$NEXUS_PVC_NAME" -o yaml > "pvc-${NEXUS_PVC_NAME}.yaml" + +# Get PV +PV=$(kubectl -n "$NEXUS_NAMESPACE" get pvc "$NEXUS_PVC_NAME" -o jsonpath='{.spec.volumeName}') + +if [[ -n "$PV" ]]; then + echo " ↳ Exporting PV $PV" + kubectl get pv "$PV" -o yaml > "pv-${PV}.yaml" +fi +``` + +Modify the backed up PV file and delete all `spec.claimRef` fields in the yaml. + +Create the backed up PVC and PV YAML files directly in the disaster recovery environment with the same namespace name. + +#### Backup Primary Nexus Instance YAML + +```bash +kubectl -n "$NEXUS_NAMESPACE" get nexus "$NEXUS_NAME" -oyaml > nexus.yaml +``` + +Modify the information in `nexus.yaml` according to the actual situation of the disaster recovery environment. + +:::warning +The `Nexus` resource **does not need** to be created in the disaster recovery environment immediately. It only needs to be created in the secondary cluster when a disaster occurs and disaster recovery switchover is performed. +::: + +:::warning +If you need to perform disaster recovery drills, you can follow the steps in [Disaster Switchover](#disaster-switchover) for drills. After the drill is complete, you need to perform the following cleanup operations in the disaster recovery environment: + +- Delete the `Nexus` instance in the disaster recovery environment +- Delete the created PVCs and PVs +- Switch Ceph RBD back to secondary state + +::: + +### Recovery Objectives + +#### Recovery Point Objective (RPO) + +The RPO represents the maximum acceptable data loss during a disaster recovery scenario. In this Nexus disaster recovery solution: + +- **Storage Layer**: Due to Ceph RBD block storage replication for Nexus data, through periodic snapshot synchronization, data loss depends on the synchronization interval, which can be [configured](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html#create-volumereplicationclass) +- **Overall RPO**: The overall RPO depends on the synchronization interval of Ceph RBD block storage replication. + +#### Recovery Time Objective (RTO) + +The RTO represents the maximum acceptable downtime during disaster recovery. This solution provides: + +- **Manual Components**: Nexus service activation and external routing updates require manual intervention +- **Typical RTO**: 4-10 minutes for complete service restoration + +**RTO Breakdown:** + +- Ceph RBD failover: 1-2 minutes (manual) +- Nexus service activation: 2-5 minutes (manual) +- External routing updates: 1-3 minutes (manual, depends on DNS propagation) + +## Disaster Switchover + +1. **Confirm Primary Nexus Failure**: Confirm that all primary Nexus components are in non-working state, otherwise stop all primary Nexus components first. + +2. **Promote Secondary Ceph RBD**: Promote secondary Ceph RBD to primary Ceph RBD. 
Refer to the switchover procedure in [Alauda Build of Rook-Ceph Failover](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html#procedures-1). + +3. **Restore PVC and PV Resources**: Restore the backed up PVC and PV resources to the disaster recovery environment with the same namespace name, and check that the PVC status in the secondary cluster is `Bound`: + + ```bash + ❯ kubectl -n $NEXUS_NAMESPACE get pvc,pv + NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE + persistentvolumeclaim/nexus-data-nexus-ddrs-nxrm-ha-0 Bound pvc-231a9021-2548-433e-8583-f7b56d74aca7 5Gi RWO ceph-rdb 45s + + NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE + persistentvolume/pvc-231a9021-2548-433e-8583-f7b56d74aca7 5Gi RWO Delete Bound nexus-dr/nexus-data-nexus-ddrs-nxrm-ha-0 ceph-rdb 63s + ``` + +4. **Deploy Secondary Nexus**: Restore the backed up `nexus.yaml` to the disaster recovery environment with the same namespace name. Nexus will automatically start using the disaster recovery data. + +5. **Verify Nexus Components**: Verify that all Nexus components are running and healthy. Test Nexus functionality (repository access, package upload/download, user authentication) to verify that Nexus is working properly. + +6. **Switch Access Address**: Switch external access addresses to Secondary Nexus. + +## Building Nexus Disaster Recovery Solution with Other Block Storage + +The operational steps are similar to building a Nexus disaster recovery solution with `Alauda Build of Rook-Ceph`. Simply replace block storage with other block storage solutions that support disaster recovery. + +:::warning +Ensure that the selected block storage solution supports disaster recovery capabilities, and perform sufficient disaster recovery drills before using in production environments. +::: + diff --git a/docs/en/solutions/How_to_perform_disaster_recovery_for_sonarqube.md b/docs/en/solutions/How_to_perform_disaster_recovery_for_sonarqube.md new file mode 100644 index 00000000..12dad367 --- /dev/null +++ b/docs/en/solutions/How_to_perform_disaster_recovery_for_sonarqube.md @@ -0,0 +1,216 @@ +--- +kind: + - Solution +products: + - Alauda DevOps +ProductsVersion: + - 4.x +id: KB251200005 +--- + +# How to Perform Disaster Recovery for SonarQube + +## Issue + +This solution describes how to build a SonarQube disaster recovery solution based on PostgreSQL disaster recovery capabilities. The solution implements a **hot data, cold compute** architecture, where data is continuously synchronized to the secondary cluster through PostgreSQL disaster recovery mechanisms. When the primary cluster fails, a secondary SonarQube instance is deployed, and the secondary SonarQube will quickly start using the disaster recovery data and provide services. The solution primarily focuses on data disaster recovery processing, and users need to implement their own SonarQube access address switching mechanism. + +## Environment + +SonarQube Operator: >=v2025.1.0 + +## Terminology + +| Term | Description | +|-------------------------|-----------------------------------------------------------------------------| +| **Primary SonarQube** | The active SonarQube instance that serves normal business operations and user requests. This instance is fully operational with all components running. 
| +| **Secondary SonarQube** | The standby SonarQube instance planned to be deployed in a different cluster/region, remaining dormant until activated during disaster recovery scenarios. | +| **Primary PostgreSQL** | The active PostgreSQL database cluster that handles all data transactions and serves as the source for data replication to the secondary database. | +| **Secondary PostgreSQL**| The hot standby PostgreSQL database that receives real-time data replication from the primary database. It can be promoted to primary role during failover. | +| **Recovery Point Objective (RPO)** | The maximum acceptable amount of data loss measured in time (e.g., 5 minutes, 1 hour). It defines how much data can be lost during a disaster before it becomes unacceptable. | +| **Recovery Time Objective (RTO)** | The maximum acceptable downtime measured in time (e.g., 15 minutes, 2 hours). It defines how quickly the system must be restored after a disaster. | +| **Failover** | The process of switching from the primary system to the secondary system when the primary system becomes unavailable or fails. | +| **Data Synchronization**| The continuous process of replicating data from primary systems to secondary systems to maintain consistency and enable disaster recovery. | +| **Hot Data, Cold Compute**| An architectural pattern where data is continuously synchronized (hot), while compute resources remain inactive (cold) until failover. | + +## Architecture + +The SonarQube disaster recovery solution implements a **hot data, cold compute architecture** for SonarQube services. This architecture provides disaster recovery capabilities through near-real-time data synchronization and manual SonarQube service failover procedures. The architecture consists of two SonarQube instances deployed across different clusters or regions, with the secondary SonarQube instance not deployed in advance until activated during disaster scenarios, while the database layer maintains continuous synchronization. + +### Data Synchronization Strategy + +The solution ensures real-time transaction log synchronization between primary and secondary databases through PostgreSQL streaming replication, including all SonarQube application data + +### Disaster Recovery Configuration + +1. **Deploy Primary SonarQube**: Configure domain access, connect to the primary PostgreSQL database +2. **Prepare Secondary SonarQube Deployment Environment**: Configure the Secret resources required for the secondary instance to enable rapid recovery when disasters occur + +### Failover Procedure + +When a disaster occurs, the following steps ensure transition to the secondary environment: + +1. **Verify Primary Failure**: Confirm that all primary SonarQube components are unavailable +2. **Promote Database**: Use database failover procedures to promote secondary PostgreSQL to primary +3. **Deploy Secondary SonarQube**: Quickly deploy the SonarQube instance in the secondary cluster using disaster recovery data +4. **Update Routing**: Switch external access addresses to point to the secondary SonarQube instance + +## SonarQube Disaster Recovery Configuration + +::: warning + +To simplify the configuration process and reduce configuration difficulty, it is recommended to use consistent information in both primary and secondary environments, including: + +- Consistent database instance names and passwords +- Consistent SonarQube instance names +- Consistent namespace names + +::: + +### Prerequisites + +1. 
Prepare a primary cluster and a disaster recovery cluster (or a cluster containing different regions) in advance. +2. Complete the deployment of `Alauda support for PostgreSQL` disaster recovery configuration. + +### Building PostgreSQL Disaster Recovery Cluster with `Alauda support for PostgreSQL` + +Refer to `PostgreSQL Hot Standby Cluster Configuration Guide` to build a disaster recovery cluster using `Alauda support for PostgreSQL`. + +Ensure that Primary PostgreSQL and Secondary PostgreSQL are in different clusters (or different regions). + +You can search for `PostgreSQL Hot Standby Cluster Configuration Guide` on [Alauda Knowledge](https://cloud.alauda.io/knowledges#/) to obtain it. + +:::warning + +`PostgreSQL Hot Standby Cluster Configuration Guide` is a document that describes how to build a disaster recovery cluster using `Alauda support for PostgreSQL`. Please ensure compatibility with the appropriate ACP version when using this configuration. + +::: + +### Set Up Primary SonarQube + +Deploy the Primary SonarQube instance by following the SonarQube instance deployment guide. Configure domain access, connect to the primary PostgreSQL database. + +Configuration example (only includes configuration items related to disaster recovery, see product documentation for complete configuration items): + +```yaml +apiVersion: operator.alaudadevops.io/v1alpha1 +kind: Sonarqube +metadata: + name: + namespace: +spec: + externalURL: http://dr-sonar.alaudatech.net # Configure domain and resolve to primary cluster + helmValues: + ingress: + enabled: true + hosts: + - name: dr-sonar.alaudatech.net + jdbcOverwrite: + enable: true + jdbcSecretName: sonarqube-pg + jdbcUrl: jdbc:postgresql://sonar-dr.sonar-dr:5432/sonar_db? # Connect to primary PostgreSQL + jdbcUsername: postgres +``` + +### Set Up Secondary SonarQube + +:::warning +When PostgreSQL is in secondary state, the secondary database cannot accept write operations, so SonarQube in the secondary cluster cannot be deployed successfully. + +If you need to verify whether SonarQube in the secondary cluster can be deployed successfully, you can temporarily promote the PostgreSQL of the secondary cluster to primary, and after testing is complete, set it back to secondary state. At the same time, you need to delete `sonarqube` resource created during testing. +::: + +1. Create Secrets Used by Secondary SonarQube +2. Backup Primary SonarQube Instance YAML + +#### Create Secrets Used by Secondary SonarQube + +Secondary SonarQube requires two secrets, one for database connection (connect to secondary PostgreSQL) and one for root password. Refer to [SonarQube Deployment Documentation](https://docs.alauda.cn/alauda-build-of-sonarqube/2025.1/install/02_sonarqube_credential.html#pg-credentials) to create them (keep the Secret names consistent with those used in Primary SonarQube configuration). 
+ +Example: + +```bash +apiVersion: v1 +stringData: + host: sonar-dr.sonar-dr + port: "5432" + username: postgres + jdbc-password: xxxx + database: sonar_db +kind: Secret +metadata: + name: sonarqube-pg + namespace: $SONARQUBE_NAMESPACE +type: Opaque +--- +apiVersion: v1 +stringData: + password: xxxxx +kind: Secret +metadata: + name: sonarqube-root-password + namespace: $SONARQUBE_NAMESPACE +type: Opaque +``` + +#### Backup Primary SonarQube Instance YAML + +```bash +kubectl -n "$SONARQUBE_NAMESPACE" get sonarqube "$SONARQUBE_NAME" -oyaml > sonarqube.yaml +``` + +Modify the information in `sonarqube.yaml` according to the actual situation of the disaster recovery environment, including PostgreSQL connection address, etc. + +:::warning +The `sonarqube` resource **does not need** to be created in the disaster recovery environment immediately. It only needs to be created in the secondary cluster when a disaster occurs and disaster recovery switchover is performed. +::: + +:::warning +If you need to perform disaster recovery drills, you can follow the steps in [Primary-Secondary Switchover Procedure in Disaster Scenarios](#disaster-switchover) for drills. After the drill is complete, you need to perform the following cleanup operations in the disaster recovery environment: + +- Delete the `sonarqube` instance in the disaster recovery environment +- Switch the PostgreSQL cluster to secondary state + +::: + +### Recovery Objectives + +#### Recovery Point Objective (RPO) + +The RPO represents the maximum acceptable data loss during a disaster recovery scenario. In this SonarQube disaster recovery solution: + +- **Database Layer**: Near-zero data loss due to PostgreSQL hot standby streaming replication +- **Overall RPO**: The overall RPO is near-zero, depending on the delay of PostgreSQL streaming replication + +#### Recovery Time Objective (RTO) + +The RTO represents the maximum acceptable downtime during disaster recovery. This solution provides: + +- **Manual Components**: SonarQube service activation and external routing updates require manual intervention +- **Typical RTO**: 5-20 minutes for complete service restoration + +**RTO Breakdown:** + +- Database failover: 1-2 minutes (manual) +- SonarQube service activation: 3-15 minutes (manual) +- External routing updates: 1-3 minutes (manual, depends on DNS propagation) + +## Disaster Switchover + +1. **Confirm Primary SonarQube Failure**: Confirm that all primary SonarQube components are in non-working state, otherwise stop all primary SonarQube components first. + +2. **Promote Secondary PostgreSQL**: Promote Secondary PostgreSQL to Primary PostgreSQL. Refer to the switchover procedure in `PostgreSQL Hot Standby Cluster Configuration Guide`. + +3. **Deploy Secondary SonarQube**: Restore the backed up `sonarqube.yaml` to the disaster recovery environment with the same namespace name. SonarQube will automatically start using the disaster recovery data. + +4. **Verify SonarQube Components**: Verify that all SonarQube components are running and healthy. Test SonarQube functionality (project access, code analysis, user authentication) to verify that SonarQube is working properly. + +5. **Switch Access Address**: Switch external access addresses to Secondary SonarQube. + +## Building SonarQube Disaster Recovery Solution with Other PostgreSQL + +The operational steps are similar to building a SonarQube disaster recovery solution with `Alauda support for PostgreSQL`. 
Simply replace PostgreSQL with other PostgreSQL solutions that support disaster recovery. + +:::warning +Ensure that the selected PostgreSQL solution supports disaster recovery capabilities, and perform sufficient disaster recovery drills before using in production environments. +::: + diff --git a/docs/public/gitlab-disaster-recovery.drawio.svg b/docs/public/gitlab-disaster-recovery.drawio.svg new file mode 100644 index 00000000..4824dcc5 --- /dev/null +++ b/docs/public/gitlab-disaster-recovery.drawio.svg @@ -0,0 +1,570 @@ + + + + + + + + + + +
+[Diagram content (drawio SVG): User → DNS (or other switching mechanism) → Primary GitLab / Secondary GitLab; the secondary performs disaster recovery using a backup instance, and enabling the standby GitLab requires manually switching access traffic. Primary DB → Secondary DB (Sync Data); Primary S3 object storage → Secondary S3 (Sync Data); Primary Ceph RBD → Secondary Ceph RBD (Snapshot Mirror, Git repo data); secondary resources remain read-only until failover.]
diff --git a/docs/zh/solutions/How_to_perform_disaster_recovery_for_gitlab.md b/docs/zh/solutions/How_to_perform_disaster_recovery_for_gitlab.md new file mode 100644 index 00000000..6902a368 --- /dev/null +++ b/docs/zh/solutions/How_to_perform_disaster_recovery_for_gitlab.md @@ -0,0 +1,480 @@ +--- +kind: + - Solution +products: + - Alauda DevOps +ProductsVersion: + - 4.x +id: KB251200003 +--- + +# 如何为 GitLab 执行灾难恢复 + +## 问题 + +本解决方案描述了如何基于 Ceph 和 PostgreSQL 的灾难恢复能力构建 GitLab 灾难恢复解决方案。该解决方案实现了**热数据、冷计算**架构,其中数据通过 Ceph 和 PostgreSQL 灾难恢复机制持续同步到备用集群,当主集群发生故障时部署备用 GitLab 实例,备用 GitLab 会使用容灾数据快速启动并提供服务。该解决方案主要关注数据灾难恢复处理,用户需要自行实现 GitLab 访问地址切换机制。 + +## 环境 + +GitLab CE Operator: >=v17.11.1 + +## 术语 + +| 术语 | 描述 | +|-------------------------|-----------------------------------------------------------------------------| +| **主 GitLab** | 处理正常业务操作和用户请求的活跃 GitLab 实例。该实例完全运行,所有组件都在运行。 | +| **备用 GitLab** | 计划部署在不同集群/区域的备用 GitLab 实例,在灾难恢复场景激活之前保持休眠状态。 | +| **主 PostgreSQL** | 处理所有数据事务的活跃 PostgreSQL 数据库集群,作为数据复制到备用数据库的源。 | +| **备用 PostgreSQL**| 从主数据库接收实时数据复制的热备用 PostgreSQL 数据库。它可以在故障转移期间提升为主角色。 | +| **主对象存储**| 存储所有 GitLab 附件数据的活跃 S3 兼容对象存储系统,作为对象存储复制的源。 | +| **备用对象存储**| 从主对象存储接收数据复制的同步备份对象存储系统。它确保在灾难恢复期间的数据可用性。 | +| **Gitaly** | 负责 Git 仓库存储。 | +| **Rails Secret**| GitLab Rails 应用程序用于加密敏感数据的加密密钥。主 GitLab 和备用 GitLab 实例**必须使用相同的密钥**。 | +| **恢复点目标 (RPO)** | 以时间衡量的最大可接受数据丢失量(例如,5 分钟,1 小时)。它定义了在灾难发生前可以丢失多少数据才变得不可接受。 | +| **恢复时间目标 (RTO)** | 以时间衡量的最大可接受停机时间(例如,15 分钟,2 小时)。它定义了系统在灾难后必须恢复的速度。 | +| **故障转移** | 当主系统变得不可用或失败时,从主系统切换到备用系统的过程。 | +| **数据同步**| 从主系统到备用系统持续复制数据以保持一致性并启用灾难恢复的过程。 | +| **热数据,冷计算**| 一种架构模式,其中数据持续同步(热),而计算资源保持非活动状态(冷),直到故障转移。 | + +## 架构 + +![gitlab dr](../../public/gitlab-disaster-recovery.drawio.svg) + +GitLab 灾难恢复解决方案为 GitLab 服务实现了**热数据、冷计算架构**。这种架构通过准实时数据同步和手动 GitLab 服务故障转移程序提供灾难恢复能力。架构由部署在不同集群或区域的两个 GitLab 实例组成,备用 GitLab 并不会提前部署,直到在灾难场景中激活,而数据库和存储层保持持续同步。 + +### 数据同步策略 + +该解决方案利用三种独立的数据同步机制: + +1. **数据库层**:通过 PostgreSQL 流式复制确保主数据库和备用数据库之间的实时事务日志同步,包括 GitLab 应用程序数据库和 Praefect 元数据数据库 +2. **Gitaly 存储层**:通过 Ceph 灾难恢复机制的块存储复制确保 Git 仓库数据同步到备用集群 +3. **附件存储层**:通过对象存储复制保持主存储和备用存储系统之间 GitLab 附件数据一致性 + +::: tip +附件存储中保存以下数据,如果评估这些数据不重要,可以选择不进行容灾。 + +| 对象类型 | 功能说明 | 默认 bucket 名称 | +|--------------------|----------|--------------------| +| uploads | 用户上传文件(头像、附件等) | gitlab-uploads | +| lfs | Git LFS 大文件对象 | gitlab-lfs | +| artifacts | CI/CD Job 产物(artifacts) | gitlab-artifacts | +| packages | 包管理数据(如 PyPI、Maven、NuGet) | gitlab-packages | +| external_mr_diffs | Merge Request 差异数据 | gitlab-mr-diffs | +| terraform_state | Terraform 状态文件 | gitlab-terraform-state | +| ci_secure_files | CI 安全文件(敏感证书、密钥等) | gitlab-ci-secure-files | +| dependency_proxy | 依赖代理缓存 | gitlab-dependency-proxy | +| pages | GitLab Pages 内容 | gitlab-pages | + +::: + +### 灾难恢复配置 + +1. **部署主 GitLab**:在高可用模式下配置主实例,配置域名访问,连接到主 PostgreSQL 数据库(GitLab 和 Praefect 数据库),使用主对象存储存储附件,并配置 Gitaly 使用块存储 +2. **准备备用 GitLab 部署环境**:配置备用实例所需要的 pv、pvc 和 secret 资源,以便于灾难发生时快速恢复 + +### 故障转移程序 + +当发生灾难时,以下步骤确保转换到备用环境: + +1. **验证主故障**:确认所有主 GitLab 组件都不可用 +2. **提升数据库**:使用数据库故障转移程序将备用 PostgreSQL 提升为主 +3. **提升对象存储**:将备用对象存储激活为主 +4. **提升 Ceph RBD**:将备用 Ceph RBD 提升为主 +5. **恢复 Gitaly 所使用的 PVC**:根据 Ceph 块存储灾难恢复文档,将 Gitaly 所使用的 PVC 在备集群恢复 +6. **部署备用 GitLab**:在备集群使用灾备数据快速部署 GitLab 实例 +7. **更新路由**:将外部访问地址切换到指向备用 GitLab 实例 + +## GitLab 容灾配置 + +::: warning + +为了简化配置过程,降低配置难度,推荐主备两个环境中使用一致的信息,包括: + +- 一致的数据库实例名称和密码 +- 一致的 Redis 实例名称和密码 +- 一致的 Ceph 存储池名称和存储类名称 +- 一致的 GitLab 实例名称 +- 一致的命名空间名称 + +::: + +### 前置条件 + +1. 提前准备一个主集群和一个灾难恢复集群(或包含不同区域的集群)。 +2. 
完成 `Alauda support for PostgreSQL` 灾难恢复配置的部署。 +3. 完成 `Alauda Build of Rook-Ceph` 对象存储的灾难恢复配置的部署([满足条件可选](#数据同步策略))。 +4. 完成 `Alauda Build of Rook-Ceph` 块存储的灾难恢复配置的部署。 + +:::warning +`Alauda Build of Rook-Ceph` 块存储的灾难恢复配置,需要设置合理的[同步间隔时间](https://docs.alauda.cn/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html#create-volumereplicationclass),这会直接影响容灾的 RPO 指标。 +::: + +### 使用 `Alauda support for PostgreSQL` 构建 PostgreSQL 灾难恢复集群 + +参考 `PostgreSQL 热备用集群配置指南`,使用 `Alauda support for PostgreSQL` 构建灾难恢复集群。 + +确保主 PostgreSQL 和备用 PostgreSQL 位于不同的集群(或不同的区域)。 + +您可以在 [Alauda Knowledge](https://cloud.alauda.io/knowledges#/) 上搜索 `PostgreSQL 热备用集群配置指南` 来获取它。 + +:::warning + +`PostgreSQL 热备用集群配置指南` 是一份描述如何使用 `Alauda support for PostgreSQL` 构建灾难恢复集群的文档。使用此配置时,请确保与相应的 ACP 版本兼容。 + +::: + +### 使用 `Alauda Build of Rook-Ceph` 构建块存储灾难恢复集群 + +使用 `Alauda Build of Rook-Ceph` 构建块存储灾难恢复集群。参考 [块存储灾难恢复](https://docs.alauda.cn/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html) 文档构建灾难恢复集群。 + +### 使用 `Alauda Build of Rook-Ceph` 构建对象存储灾难恢复集群 + +使用 `Alauda Build of Rook-Ceph` 构建对象存储灾难恢复集群。参考 [对象存储灾难恢复](https://docs.alauda.cn/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_object.html) 文档构建对象存储灾难恢复集群。 + +您需要提前创建一个 CephObjectStoreUser 以获取对象存储的访问凭据,并在主对象存储上准备一个 GitLab 对象存储桶: + +1. 在主对象存储上创建一个 CephObjectStoreUser 以获取访问凭据:[创建 CephObjectStoreUser](https://docs.alauda.cn/container_platform/4.1/storage/storagesystem_ceph/how_to/create_object_user.html)。 + + :::info + 您只需要在主对象存储上创建 CephObjectStoreUser。用户信息将通过灾难恢复复制机制自动同步到备用对象存储。 + ::: + +2. 获取对象存储的访问地址 `PRIMARY_OBJECT_STORAGE_ADDRESS`,您可以从 `对象存储灾难恢复` 的步骤 [为主区域配置外部访问](https://docs.alauda.cn/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_object.html#configure-external-access-for-primary-zone) 中获取。 + + ```bash + $ mc alias set primary-s3 + Added `primary-s3` successfully. + $ mc alias list + primary-s3 + URL : + AccessKey : + SecretKey : + API : s3v4 + Path : auto + Src : /home/demo/.mc/config.json + ``` + +3. 
使用 mc 在主对象存储上创建 GitLab 对象存储桶,在此示例中,创建了 `gitlab-uploads` 和 `gitlab-lfs` 两个存储桶。 + + ```bash + # 创建 + mc mb primary-s3/gitlab-uploads + mc mb primary-s3/gitlab-lfs + + # 检查 + mc ls primary-s3/gitlab-uploads + mc ls primary-s3/gitlab-lfs + ``` + + :::info + 根据使用的 GitLab 功能不同,可能还需要使用到[其他存储桶](#数据同步策略),可按照需要创建。 + ::: + +### 设置主 GitLab + +按照 [GitLab 实例部署](https://docs.alauda.cn/alauda-build-of-gitlab/17.11/en/install/03_gitlab_deploy.html#deploying-from-the-gitlab-high-availability-template) 指南部署主 GitLab 实例。在高可用模式下配置它,配置域名访问,连接到主 PostgreSQL 数据库(GitLab 应用程序数据库和 Praefect 数据库),使用主对象存储存储附件,并配置 Gitaly 使用主块存储。 + +配置示例(仅包含了容灾关注的配置项,完整配置项见产品文档): + +```yaml +apiVersion: operator.alaudadevops.io/v1alpha1 +kind: GitlabOfficial +metadata: + name: + namespace: +spec: + externalURL: http://gitlab-ha.example.com # GitLab 访问域名 + helmValues: + gitlab: + gitaly: + persistence: # 配置 gitaly 存储,使用 ceph RBD 存储类,因为是高可用模式,会自动创建3个副本 + enabled: true + size: 5Gi + storageClass: ceph-rdb # 存储类名称,指定为配置到好了容灾的存储类 + webservice: + ingress: + enabled: true + global: + appConfig: + object_store: + connection: # 配置对象存储,连接到主对象存储 + secret: gitlab-object-storage + key: connection + enabled: true + praefect: # 配置 praefect 数据库,连接到主 PostgreSQL 数据库 + dbSecret: + key: password + secret: gitlab-pg-prefact + enabled: true + psql: + dbName: gitlab_prefact + host: acid-gitlab.test.svc + port: 5432 + sslMode: require + user: postgres + virtualStorages: + - gitalyReplicas: 3 + maxUnavailable: 1 + name: default + psql: # 配置应用数据库,连接到主 PostgreSQL 数据库 + database: gitlab + host: acid-gitlab.test.svc + password: + key: password + secret: gitlab-pg + port: 5432 + username: postgres +``` + +部署主 GitLab 后,需要为 Gitaly 组件使用的 PVC 配置 RBD Mirror,配置后才会将 PVC 数据定时同步到备 Ceph 集群。具体参数配置参考 [Ceph RBD Mirror](https://docs.alauda.cn/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html#enable-mirror-for-pvc)。 + +```bash +cat << EOF | kubectl apply -f - +apiVersion: replication.storage.openshift.io/v1alpha1 +kind: VolumeReplication +metadata: + name: + namespace: +spec: + autoResync: true # 自动同步 + volumeReplicationClass: rbd-volumereplicationclass + replicationState: primary # 标记为主集群 + dataSource: + apiGroup: "" + kind: PersistentVolumeClaim + name: +EOF +``` + +检查 Ceph RBD Mirror 状态,可以看到 Gitaly 的三个 pvc 都已经配置了 Ceph RBD Mirror。 + +```bash +❯ kubectl -n $GITLAB_NAMESPACE get volumereplication +NAME AGE VOLUMEREPLICATIONCLASS PVCNAME DESIREDSTATE CURRENTSTATE +repo-data-dr-gitlab-ha-gitaly-default-0 15s rbd-volumereplicationclass repo-data-dr-gitlab-ha-gitaly-default-0 primary Primary +repo-data-dr-gitlab-ha-gitaly-default-1 15s rbd-volumereplicationclass repo-data-dr-gitlab-ha-gitaly-default-1 primary Primary +repo-data-dr-gitlab-ha-gitaly-default-2 14s rbd-volumereplicationclass repo-data-dr-gitlab-ha-gitaly-default-2 primary Primary +``` + +从 Ceph 端查看 Ceph RBD Mirror 状态,`CEPH_BLOCK_POOL` 是 Ceph RBD 存储池的名称。`SCHEDULE` 列标识了同步的频率(下面的示例是 1 分钟同步一次)。 + +```bash +❯ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- rbd mirror snapshot schedule ls --pool $CEPH_BLOCK_POOL --recursive +POOL NAMESPACE IMAGE SCHEDULE +myblock csi-vol-135ec569-0a3a-49c1-a0b1-46d669510200 every 1m +myblock csi-vol-459e6f28-a158-4ae9-b5da-163448c35119 every 1m +myblock csi-vol-7f13040d-d543-40ed-b416-3ecf639cf4c9 every 1m +``` + +检查 Ceph RBD Mirror 状态,state 为 `up+stopped`(主集群正常)并且 peer_sites.state 为 `up+replaying`(备集群正常)表示同步正常。 + +```bash +❯ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- rbd mirror image status $CEPH_BLOCK_POOL/$GITALY_BLOCK_IMAGE_NAME 
+csi-vol-459e6f28-a158-4ae9-b5da-163448c35119: + global_id: 98bbf3bf-7c61-42b4-810b-cb2a7cd6d6b1 + state: up+stopped + description: local image is primary + service: a on 192.168.129.233 + last_update: 2025-11-19 01:42:07 + peer_sites: + name: ecf558fa-1e8a-43f1-bf6b-1478e73f272e + state: up+replaying + description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":5742592.0,"last_snapshot_bytes":5742592,"last_snapshot_sync_seconds":0,"local_snapshot_timestamp":1763516344,"remote_snapshot_timestamp":1763516344,"replay_state":"idle"} + last_update: 2025-11-19 01:42:27 + snapshots: + 75 .mirror.primary.98bbf3bf-7c61-42b4-810b-cb2a7cd6d6b1.3d3402a5-f298-4048-8c50-84979949355d (peer_uuids:[66d8fb19-c610-438c-ae73-42a95ea4e86e]) +``` + +### 设置备用 GitLab + +:::warning +当 Ceph RBD 处于备用状态时,同步过来的存储块无法挂载,因此备集群的 GitLab 无法部署成功。 + +如需验证备集群 GitLab 是否可以部署成功,可以临时将备集群的 Ceph RBD 提升为主集群,测试完成后再设置回备用状态。同时需要将测试过程中创建的 gitlabofficial、PV 和 PVC 资源都删除。 +::: + +1. 备份主 GitLab 使用的 Secret +2. 备份主集群 GitLab Gitaly 组件的 PVC 和 PV 资源 YAML(注意,高可用模式至少会有3个 PVC 和 PV 资源) +3. 备份主集群 GitLab 的 gitlabofficial 资源 YAML +4. 部署备 GitLab 使用的 Redis 实例 + +#### 备份主 GitLab 使用的 Secret + +获取主 GitLab 使用的 PostgreSQL Secret YAML,并将 Secret 创建到备集群同名命名空间中。 + +```bash +export GITLAB_NAMESPACE= +export GITLAB_NAME= +``` + +```bash +# PostgreSQL Secret +PG_SECRET=$(kubectl -n "$GITLAB_NAMESPACE" get gitlabofficial "$GITLAB_NAME" -o jsonpath='{.spec.helmValues.global.psql.password.secret}') +[[ -n "$PG_SECRET" ]] && kubectl -n "$GITLAB_NAMESPACE" get secret "$PG_SECRET" -o yaml > pg-secret.yaml + +# Praefect PostgreSQL Secret +PRAEFECT_PG_SECRET=$(kubectl -n "$GITLAB_NAMESPACE" get gitlabofficial "$GITLAB_NAME" -o jsonpath='{.spec.helmValues.global.praefect.dbSecret.secret}') +[[ -n "$PRAEFECT_PG_SECRET" ]] && kubectl -n "$GITLAB_NAMESPACE" get secret "$PRAEFECT_PG_SECRET" -o yaml > praefect-secret.yaml + +# Rails Secret +RAILS_SECRET=$(kubectl -n "$GITLAB_NAMESPACE" get gitlabofficial "$GITLAB_NAME" -o jsonpath='{.spec.helmValues.global.railsSecrets.secret}' || echo "${GITLAB_NAME}-rails-secret") +[[ -z "$RAILS_SECRET" ]] && export RAILS_SECRET="${GITLAB_NAME}-rails-secret" # use default secret name if not found +[[ -n "$RAILS_SECRET" ]] && kubectl -n "$GITLAB_NAMESPACE" get secret "$RAILS_SECRET" -o yaml > rails-secret.yaml + +# Object Storage Secret +OBJECT_STORAGE_SECRET=$(kubectl -n "$GITLAB_NAMESPACE" get gitlabofficial "$GITLAB_NAME" -o jsonpath='{.spec.helmValues.global.appConfig.object_store.connection.secret}') +[[ -n "$OBJECT_STORAGE_SECRET" ]] && kubectl -n "$GITLAB_NAMESPACE" get secret "$OBJECT_STORAGE_SECRET" -o yaml > object-storage-secret.yaml + +# Root Password Secret +ROOT_USER_SECRET=$(kubectl -n "$GITLAB_NAMESPACE" get gitlabofficial "$GITLAB_NAME" -o jsonpath='{.spec.helmValues.global.initialRootPassword.secret}') +[[ -n "$ROOT_USER_SECRET" ]] && kubectl -n "$GITLAB_NAMESPACE" get secret "$ROOT_USER_SECRET" -o yaml > root-user-secret.yaml +``` + +对备份出来的文件做如下修改: + +- pg-secret.yaml:将 `host` 和 `password` 字段改成备集群的 PostgreSQL 连接地址和密码 +- praefect-secret.yaml:将 `host` 和 `password` 字段改成备集群的 Praefect PostgreSQL 连接地址和密码 +- object-storage-secret.yaml:将 `connection` 中的 `endpoint` 字段改成备集群的对象存储连接地址 + +将备份的 YAML 文件在容灾环境同名命名空间中创建。 + +#### 备份主 GitLab Gitaly 组件的 PVC 和 PV 资源 + +:::tip +PV 资源中保存了 volume 属性信息,这些信息是容灾恢复时的关键信息,需要备份好。 + +```bash + volumeAttributes: + clusterID: rook-ceph + imageFeatures: layering + imageFormat: "2" + imageName: csi-vol-459e6f28-a158-4ae9-b5da-163448c35119 + journalPool: myblock + pool: myblock + 
storage.kubernetes.io/csiProvisionerIdentity: 1763446982673-7963-rook-ceph.rbd.csi.ceph.com +``` + +::: + +执行以下命令将主 GitLab Gitaly 组件的 PVC 和 PV 资源备份到当前目录(如果使用的是其他 PVC,需要手动备份): + +```bash +kubectl -n "$GITLAB_NAMESPACE" \ + get pvc -l app=gitaly,release="$GITLAB_NAME" \ + -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \ +| while read -r pvc; do + + echo "=> Exporting PVC $pvc" + + # 导出 PVC + kubectl -n "$GITLAB_NAMESPACE" get pvc "$pvc" -o yaml > "pvc-${pvc}.yaml" + + # 获取 PV + PV=$(kubectl -n "$GITLAB_NAMESPACE" get pvc "$pvc" -o jsonpath='{.spec.volumeName}') + + if [[ -n "$PV" ]]; then + echo " ↳ Exporting PV $PV" + kubectl get pv "$PV" -o yaml > "pv-${PV}.yaml" + fi + + echo "" +done +``` + +修改备份出来的三个 pv 文件,将 yaml 中的 `spec.claimRef` 字段全部删除。 + +将备份出来的 PVC 和 PV YAML 文件直接创建到容灾环境同名命名空间中。 + +#### 备份主 GitLab 实例 YAML + +```bash +kubectl -n "$GITLAB_NAMESPACE" get gitlabofficial "$GITLAB_NAME" -oyaml > gitlabofficial.yaml +``` + +根据容灾环境实际情况修改 `gitlabofficial.yaml` 中的信息,包括 PostgreSQL 连接地址、Redis 连接地址等。 + +:::warning +`GitlabOfficial` 资源**不需要**立即创建在容灾环境,只需要在灾难发生时,执行容灾切换时创建到备集群即可。 +::: + +:::warning +如需进行容灾演练,可以按照 [灾难切换](#灾难切换) 中的步骤进行演练。演练完毕后需要在容灾环境完成以下清理操作: + +- 将容灾环境中的 `GitlabOfficial` 实例删除 +- 将创建的 PVC 和 PV 删除 +- 将 PostgreSQL 集群切换为备用状态 +- 将 Ceph 对象存储切换为备用状态 +- 将 Ceph RBD 切换为备用状态 + +::: + +#### 部署备 GitLab 使用的 Redis 实例 + +参考主集群的 redis 实例配置,使用相同的实例名称和密码在容灾环境同名命名空间部署 Redis 实例。 + +### 恢复目标 + +#### 恢复点目标 (RPO) + +RPO 表示在灾难恢复场景中最大可接受的数据丢失。在此 GitLab 灾难恢复解决方案中: + +- **数据库层**:由于 PostgreSQL 热备用流式复制(适用于 GitLab 应用程序数据库和 Praefect 元数据数据库),数据丢失接近零 +- **附件存储层**:由于 GitLab 附件存储使用的对象存储流式复制,数据丢失接近零 +- **Gitaly 存储层**:由于 Git 仓库数据的 Ceph RBD 块存储复制,通过快照定时同步,数据丢失情况取决于同步间隔,间隔时间可以[配置](https://docs.alauda.cn/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html#create-volumereplicationclass) +- **总体 RPO**:总体 RPO 取决 Ceph RBD 块存储复制的同步间隔时间。 + +#### 恢复时间目标 (RTO) + +RTO 表示在灾难恢复期间最大可接受的停机时间。此解决方案提供: + +- **手动组件**:GitLab 服务激活和外部路由更新需要手动干预 +- **典型 RTO**:完整服务恢复需要 6-16 分钟 + +**RTO 分解:** + +- 数据库故障转移:1-2 分钟(手动) +- 对象存储故障转移:1-2 分钟(手动) +- Ceph RBD 故障转移:1-2 分钟(手动) +- GitLab 服务激活:2-5 分钟(手动) +- 外部路由更新:1-5 分钟(手动,取决于 DNS 传播) + +## 灾难切换 + +1. **确认主 GitLab 故障**:确认所有主 GitLab 组件都处于非工作状态,否则先停止所有主 GitLab 组件。 + +2. **提升备用 PostgreSQL**:将备用 PostgreSQL 提升为主 PostgreSQL。参考 `PostgreSQL 热备用集群配置指南` 的切换程序。 + +3. **提升备用对象存储**:将备用对象存储提升为主对象存储。参考 [Alauda Build of Rook-Ceph 故障转移](https://docs.alauda.cn/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_object.html#procedures-1) 的切换程序。 + +4. **提升备用 Ceph RBD**:将备用 Ceph RBD 提升为主 Ceph RBD。参考 [Alauda Build of Rook-Ceph 故障转移](https://docs.alauda.cn/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html#procedures-1) 的切换程序。 + +5. 
**恢复 PVC 和 PV 资源**:恢复备份的 PVC 和 PV 资源到容灾环境同名命名空间中,并检查备集群 PVC 状态是否为 `Bound` 状态: + + ```bash + ❯ kubectl -n $GITLAB_NAMESPACE get pvc,pv + NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE + persistentvolumeclaim/repo-data-dr-gitlab-ha-gitaly-default-0 Bound pvc-231a9021-2548-433e-8583-f7b56d74aca7 5Gi RWO ceph-rdb 45s + persistentvolumeclaim/repo-data-dr-gitlab-ha-gitaly-default-1 Bound pvc-2995a8a7-648c-4e99-a3d3-c73a483a601b 5Gi RWO ceph-rdb 30s + persistentvolumeclaim/repo-data-dr-gitlab-ha-gitaly-default-2 Bound pvc-e4a94d84-d5e2-419f-bbbd-285fa88b6b5e 5Gi RWO ceph-rdb 19s + + NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE + persistentvolume/pvc-231a9021-2548-433e-8583-f7b56d74aca7 5Gi RWO Delete Bound fm-1-ns/repo-data-dr-gitlab-ha-gitaly-default-0 ceph-rdb 63s + persistentvolume/pvc-2995a8a7-648c-4e99-a3d3-c73a483a601b 5Gi RWO Delete Bound fm-1-ns/repo-data-dr-gitlab-ha-gitaly-default-1 ceph-rdb 30s + persistentvolume/pvc-e4a94d84-d5e2-419f-bbbd-285fa88b6b5e 5Gi RWO Delete Bound fm-1-ns/repo-data-dr-gitlab-ha-gitaly-default-2 ceph-rdb 19s + ``` + +6. **部署备用 GitLab**:恢复备份的 `gitlabofficial.yaml` 到容灾环境同名命名空间中。GitLab 会利用容灾数据自动启动。 + +7. **验证 GitLab 组件**:验证所有 GitLab 组件正在运行且健康。测试 GitLab 功能(仓库访问、CI/CD 流水线、用户认证)以验证 GitLab 是否正常工作。 + +8. **切换访问地址**:将外部访问地址切换到备用 GitLab。 + + + +## 使用其他对象存储和 PostgreSQL 构建 GitLab 灾难恢复解决方案 + +操作步骤与使用 `Alauda Build of Rook-Ceph` 和 `Alauda support for PostgreSQL` 构建 GitLab 灾难恢复解决方案类似。只需将存储和 PostgreSQL 替换为其他支持灾难恢复的存储和 PostgreSQL 解决方案。 + +:::warning +确保所选存储和 PostgreSQL 解决方案支持灾难恢复能力,并在生产环境使用前进行充分的容灾演练。 +::: diff --git a/docs/zh/solutions/How_to_perform_disaster_recovery_for_nexus.md b/docs/zh/solutions/How_to_perform_disaster_recovery_for_nexus.md new file mode 100644 index 00000000..4edd452d --- /dev/null +++ b/docs/zh/solutions/How_to_perform_disaster_recovery_for_nexus.md @@ -0,0 +1,306 @@ +--- +kind: + - Solution +products: + - Alauda DevOps +ProductsVersion: + - 4.x +id: KB251200004 +--- + +# 如何为 Nexus 执行灾难恢复 + +## 问题 + +本解决方案描述了如何基于 Ceph 块存储的灾难恢复能力构建 Nexus 灾难恢复解决方案。该解决方案实现了**热数据、冷计算**架构,其中数据通过 Ceph 块存储灾难恢复机制持续同步到备用集群,当主集群发生故障时部署备用 Nexus 实例,备用 Nexus 会使用容灾数据快速启动并提供服务。该解决方案主要关注数据灾难恢复处理,用户需要自行实现 Nexus 访问地址切换机制。 + +## 环境 + +Nexus Operator: >=v3.81.1 + +## 术语 + +| 术语 | 描述 | +|-------------------------|-----------------------------------------------------------------------------| +| **主 Nexus** | 处理正常业务操作和用户请求的活跃 Nexus 实例。该实例完全运行,所有组件都在运行。 | +| **备用 Nexus** | 计划部署在不同集群/区域的备用 Nexus 实例,在灾难恢复场景激活之前保持休眠状态。 | +| **主块存储**| 存储所有 Nexus 数据的活跃块存储系统,作为块存储复制的源。 | +| **备用块存储**| 从主块存储接收数据复制的同步备份块存储系统。它确保在灾难恢复期间的数据可用性。 | +| **恢复点目标 (RPO)** | 以时间衡量的最大可接受数据丢失量(例如,5 分钟,1 小时)。它定义了在灾难发生前可以丢失多少数据才变得不可接受。 | +| **恢复时间目标 (RTO)** | 以时间衡量的最大可接受停机时间(例如,15 分钟,2 小时)。它定义了系统在灾难后必须恢复的速度。 | +| **故障转移** | 当主系统变得不可用或失败时,从主系统切换到备用系统的过程。 | +| **数据同步**| 从主系统到备用系统持续复制数据以保持一致性并启用灾难恢复的过程。 | +| **热数据,冷计算**| 一种架构模式,其中数据持续同步(热),而计算资源保持非活动状态(冷),直到故障转移。 | + +## 架构 + +Nexus 灾难恢复解决方案为 Nexus 服务实现了**热数据、冷计算架构**。这种架构通过准实时数据同步和手动 Nexus 服务故障转移程序提供灾难恢复能力。架构由部署在不同集群或区域的两个 Nexus 实例组成,备用 Nexus 并不会提前部署,直到在灾难场景中激活,而存储层保持持续同步。 + +### 数据同步策略 + +该解决方案通过 Ceph RBD Mirror 块存储复制确保 Nexus 数据同步到备用集群。Nexus 的所有数据都存储在 PVC 中,通过 Ceph RBD Mirror 机制定时同步到备用集群。 + +### 灾难恢复配置 + +1. **部署主 Nexus**:配置域名访问,使用主块存储存储数据 +2. **准备备用 Nexus 部署环境**:配置备用实例所需要的 pv、pvc 和 secret 资源,以便于灾难发生时快速恢复 + +### 故障转移程序 + +当发生灾难时,以下步骤确保转换到备用环境: + +1. **验证主故障**:确认所有主 Nexus 组件都不可用 +2. **提升 Ceph RBD**:将备用 Ceph RBD 提升为主 Ceph RBD +3. **恢复 PVC 和 PV 资源**:根据 Ceph 块存储灾难恢复文档,将 Nexus 所使用的 PVC 在备集群恢复 +4. 
**部署备用 Nexus**:在备集群使用灾备数据快速部署 Nexus 实例 +5. **更新路由**:将外部访问地址切换到指向备用 Nexus 实例 + +## Nexus 容灾配置 + +::: warning + +为了简化配置过程,降低配置难度,推荐主备两个环境中使用一致的信息,包括: + +- 一致的 Ceph 存储池名称和存储类名称 +- 一致的 Nexus 实例名称 +- 一致的命名空间名称 + +::: + +### 前置条件 + +1. 提前准备一个主集群和一个灾难恢复集群(或包含不同区域的集群)。 +2. 完成 `Alauda Build of Rook-Ceph` 块存储的灾难恢复配置的部署。 + +:::warning +`Alauda Build of Rook-Ceph` 块存储的灾难恢复配置,需要设置合理的[同步间隔时间](https://docs.alauda.cn/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html#create-volumereplicationclass),这会直接影响容灾的 RPO 指标。 +::: + +### 使用 `Alauda Build of Rook-Ceph` 构建块存储灾难恢复集群 + +使用 `Alauda Build of Rook-Ceph` 构建块存储灾难恢复集群。参考 [块存储灾难恢复](https://docs.alauda.cn/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html) 文档构建灾难恢复集群。 + +### 设置主 Nexus + +按照 Nexus 实例部署指南部署主 Nexus 实例。配置域名访问,使用主块存储存储数据。 + +配置示例(仅包含了容灾关注的配置项,完整配置项见产品文档): + +```yaml +apiVersion: operator.alaudadevops.io/v1alpha1 +kind: Nexus +metadata: + name: + namespace: +spec: + externalURL: http://nexus-ddrs.alaudatech.net + helmValues: + pvc: + storage: 5Gi + volumeClaimTemplate: + enabled: true + storageClass: + name: ceph-rdb # 设置已经配置了存储类名称 +``` + +部署主 Nexus 后,需要为 Nexus 组件使用的 PVC 配置 RBD Mirror,配置后才会将 PVC 数据定时同步到备 Ceph 集群。具体参数配置参考 [Ceph RBD Mirror](https://docs.alauda.cn/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html#enable-mirror-for-pvc)。 + +```bash +export NEXUS_NAMESPACE= +export NEXUS_NAME= +export NEXUS_PVC_NAME=nexus-data-${NEXUS_NAME}-nxrm-ha-0 + +cat << EOF | kubectl apply -f - +apiVersion: replication.storage.openshift.io/v1alpha1 +kind: VolumeReplication +metadata: + name: ${NEXUS_PVC_NAME} + namespace: ${NEXUS_NAMESPACE} +spec: + autoResync: true # 自动同步 + volumeReplicationClass: rbd-volumereplicationclass + replicationState: primary # 标记为主集群 + dataSource: + apiGroup: "" + kind: PersistentVolumeClaim + name: ${NEXUS_PVC_NAME} +EOF +``` + +检查 Ceph RBD Mirror 状态,可以看到 Nexus 的 PVC 已经配置了 Ceph RBD Mirror。 + +```bash +❯ kubectl -n $NEXUS_NAMESPACE get volumereplication +NAME AGE VOLUMEREPLICATIONCLASS PVCNAME DESIREDSTATE CURRENTSTATE +nexus-data-nexus-ddrs-nxrm-ha-0 15s rbd-volumereplicationclass nexus-data-nexus-ddrs-nxrm-ha-0 primary Primary +``` + +从 Ceph 端查看 Ceph RBD Mirror 状态,`CEPH_BLOCK_POOL` 是 Ceph RBD 存储池的名称。`SCHEDULE` 列标识了同步的频率(下面的示例是 1 分钟同步一次)。 + +```bash +❯ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- rbd mirror snapshot schedule ls --pool $CEPH_BLOCK_POOL --recursive +POOL NAMESPACE IMAGE SCHEDULE +myblock csi-vol-459e6f28-a158-4ae9-b5da-163448c35119 every 1m +``` + +检查 Ceph RBD Mirror 状态,state 为 `up+stopped`(主集群正常)并且 peer_sites.state 为 `up+replaying`(备集群正常)表示同步正常。 + +```bash +❯ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- rbd mirror image status $CEPH_BLOCK_POOL/$NEXUS_BLOCK_IMAGE_NAME +csi-vol-459e6f28-a158-4ae9-b5da-163448c35119: + global_id: 98bbf3bf-7c61-42b4-810b-cb2a7cd6d6b1 + state: up+stopped + description: local image is primary + service: a on 192.168.129.233 + last_update: 2025-11-19 01:42:07 + peer_sites: + name: ecf558fa-1e8a-43f1-bf6b-1478e73f272e + state: up+replaying + description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":5742592.0,"last_snapshot_bytes":5742592,"last_snapshot_sync_seconds":0,"local_snapshot_timestamp":1763516344,"remote_snapshot_timestamp":1763516344,"replay_state":"idle"} + last_update: 2025-11-19 01:42:27 + snapshots: + 75 .mirror.primary.98bbf3bf-7c61-42b4-810b-cb2a7cd6d6b1.3d3402a5-f298-4048-8c50-84979949355d 
(peer_uuids:[66d8fb19-c610-438c-ae73-42a95ea4e86e]) +``` + +### 设置备用 Nexus + +:::warning +当 Ceph RBD 处于备用状态时,同步过来的存储块无法挂载,因此备集群的 Nexus 无法部署成功。 + +如需验证备集群 Nexus 是否可以部署成功,可以临时将备集群的 Ceph RBD 提升为主集群,测试完成后再设置回备用状态。同时需要将测试过程中创建的 Nexus、PV 和 PVC 资源都删除。 +::: + +1. 备份主 Nexus 使用的 Secret +2. 备份主集群 Nexus 组件的 PVC 和 PV 资源 YAML +3. 备份主集群 Nexus 的 Nexus 资源 YAML + +#### 备份主 Nexus 使用的 Secret + +获取主 Nexus 使用的 Password Secret YAML,并将 Secret 创建到备集群同名命名空间中。 + +```bash +apiVersion: v1 +data: + password: xxxxxx +kind: Secret +metadata: + name: nexus-root-password + namespace: nexus-dr +type: Opaque +``` + +#### 备份主 Nexus 组件的 PVC 和 PV 资源 + +:::tip +PV 资源中保存了 volume 属性信息,这些信息是容灾恢复时的关键信息,需要备份好。 + +```bash + volumeAttributes: + clusterID: rook-ceph + imageFeatures: layering + imageFormat: "2" + imageName: csi-vol-459e6f28-a158-4ae9-b5da-163448c35119 + journalPool: myblock + pool: myblock + storage.kubernetes.io/csiProvisionerIdentity: 1763446982673-7963-rook-ceph.rbd.csi.ceph.com +``` + +::: + +执行以下命令将主 Nexus 组件的 PVC 和 PV 资源备份到当前目录: + +```bash +export NEXUS_PVC_NAME= + +echo "=> Exporting PVC $NEXUS_PVC_NAME" + +# 导出 PVC +kubectl -n "$NEXUS_NAMESPACE" get pvc "$NEXUS_PVC_NAME" -o yaml > "pvc-${NEXUS_PVC_NAME}.yaml" + +# 获取 PV +PV=$(kubectl -n "$NEXUS_NAMESPACE" get pvc "$NEXUS_PVC_NAME" -o jsonpath='{.spec.volumeName}') + +if [[ -n "$PV" ]]; then + echo " ↳ Exporting PV $PV" + kubectl get pv "$PV" -o yaml > "pv-${PV}.yaml" +fi +``` + +修改备份出来的 PV 文件,将 yaml 中的 `spec.claimRef` 字段全部删除。 + +将备份出来的 PVC 和 PV YAML 文件直接创建到容灾环境同名命名空间中。 + +#### 备份主 Nexus 实例 YAML + +```bash +kubectl -n "$NEXUS_NAMESPACE" get nexus "$NEXUS_NAME" -oyaml > nexus.yaml +``` + +根据容灾环境实际情况修改 `nexus.yaml` 中的信息。 + +:::warning +`Nexus` 资源**不需要**立即创建在容灾环境,只需要在灾难发生时,执行容灾切换时创建到备集群即可。 +::: + +:::warning +如需进行容灾演练,可以按照 [灾难切换](#灾难切换) 中的步骤进行演练。演练完毕后需要在容灾环境完成以下清理操作: + +- 将容灾环境中的 `Nexus` 实例删除 +- 将创建的 PVC 和 PV 删除 +- 将 Ceph RBD 切换为备用状态 + +::: + +### 恢复目标 + +#### 恢复点目标 (RPO) + +RPO 表示在灾难恢复场景中最大可接受的数据丢失。在此 Nexus 灾难恢复解决方案中: + +- **存储层**:由于 Nexus 数据的 Ceph RBD 块存储复制,通过快照定时同步,数据丢失情况取决于同步间隔,间隔时间可以[配置](https://docs.alauda.cn/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html#create-volumereplicationclass) +- **总体 RPO**:总体 RPO 取决于 Ceph RBD 块存储复制的同步间隔时间。 + +#### 恢复时间目标 (RTO) + +RTO 表示在灾难恢复期间最大可接受的停机时间。此解决方案提供: + +- **手动组件**:Nexus 服务激活和外部路由更新需要手动干预 +- **典型 RTO**:完整服务恢复需要 4-10 分钟 + +**RTO 分解:** + +- Ceph RBD 故障转移:1-2 分钟(手动) +- Nexus 服务激活:2-5 分钟(手动) +- 外部路由更新:1-3 分钟(手动,取决于 DNS 传播) + +## 灾难切换 + +1. **确认主 Nexus 故障**:确认所有主 Nexus 组件都处于非工作状态,否则先停止所有主 Nexus 组件。 + +2. **提升备用 Ceph RBD**:将备用 Ceph RBD 提升为主 Ceph RBD。参考 [Alauda Build of Rook-Ceph 故障转移](https://docs.alauda.cn/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_block.html#procedures-1) 的切换程序。 + +3. **恢复 PVC 和 PV 资源**:恢复备份的 PVC 和 PV 资源到容灾环境同名命名空间中,并检查备集群 PVC 状态是否为 `Bound` 状态: + + ```bash + ❯ kubectl -n $NEXUS_NAMESPACE get pvc,pv + NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE + persistentvolumeclaim/nexus-data-nexus-ddrs-nxrm-ha-0 Bound pvc-231a9021-2548-433e-8583-f7b56d74aca7 5Gi RWO ceph-rdb 45s + + NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE + persistentvolume/pvc-231a9021-2548-433e-8583-f7b56d74aca7 5Gi RWO Delete Bound nexus-dr/nexus-data-nexus-ddrs-nxrm-ha-0 ceph-rdb 63s + ``` + +4. **部署备用 Nexus**:恢复备份的 `nexus.yaml` 到容灾环境同名命名空间中。Nexus 会利用容灾数据自动启动。 + +5. **验证 Nexus 组件**:验证所有 Nexus 组件正在运行且健康。测试 Nexus 功能(仓库访问、包上传下载、用户认证)以验证 Nexus 是否正常工作。 + +6. 
**切换访问地址**:将外部访问地址切换到备用 Nexus。 + +## 使用其他块存储构建 Nexus 灾难恢复解决方案 + +操作步骤与使用 `Alauda Build of Rook-Ceph` 构建 Nexus 灾难恢复解决方案类似。只需将块存储替换为其他支持灾难恢复的块存储解决方案。 + +:::warning +确保所选块存储解决方案支持灾难恢复能力,并在生产环境使用前进行充分的容灾演练。 +::: + diff --git a/docs/zh/solutions/How_to_perform_disaster_recovery_for_sonarqube.md b/docs/zh/solutions/How_to_perform_disaster_recovery_for_sonarqube.md new file mode 100644 index 00000000..8d151172 --- /dev/null +++ b/docs/zh/solutions/How_to_perform_disaster_recovery_for_sonarqube.md @@ -0,0 +1,215 @@ +--- +kind: + - Solution +products: + - Alauda DevOps +ProductsVersion: + - 4.x +id: KB251200005 +--- + +# 如何为 SonarQube 执行灾难恢复 + +## 问题 + +本解决方案描述了如何基于 PostgreSQL 的灾难恢复能力构建 SonarQube 灾难恢复解决方案。该解决方案实现了**热数据、冷计算**架构,其中数据通过 PostgreSQL 灾难恢复机制持续同步到备用集群,当主集群发生故障时部署备用 SonarQube 实例,备用 SonarQube 会使用容灾数据快速启动并提供服务。该解决方案主要关注数据灾难恢复处理,用户需要自行实现 SonarQube 访问地址切换机制。 + +## 环境 + +SonarQube Operator: >=v2025.1.0 + +## 术语 + +| 术语 | 描述 | +|-------------------------|-----------------------------------------------------------------------------| +| **主 SonarQube** | 处理正常业务操作和用户请求的活跃 SonarQube 实例。该实例完全运行,所有组件都在运行。 | +| **备用 SonarQube** | 计划部署在不同集群/区域的备用 SonarQube 实例,在灾难恢复场景激活之前保持休眠状态。 | +| **主 PostgreSQL** | 处理所有数据事务的活跃 PostgreSQL 数据库集群,作为数据复制到备用数据库的源。 | +| **备用 PostgreSQL**| 从主数据库接收实时数据复制的热备用 PostgreSQL 数据库。它可以在故障转移期间提升为主角色。 | +| **恢复点目标 (RPO)** | 以时间衡量的最大可接受数据丢失量(例如,5 分钟,1 小时)。它定义了在灾难发生前可以丢失多少数据才变得不可接受。 | +| **恢复时间目标 (RTO)** | 以时间衡量的最大可接受停机时间(例如,15 分钟,2 小时)。它定义了系统在灾难后必须恢复的速度。 | +| **故障转移** | 当主系统变得不可用或失败时,从主系统切换到备用系统的过程。 | +| **数据同步**| 从主系统到备用系统持续复制数据以保持一致性并启用灾难恢复的过程。 | +| **热数据,冷计算**| 一种架构模式,其中数据持续同步(热),而计算资源保持非活动状态(冷),直到故障转移。 | + +## 架构 + +SonarQube 灾难恢复解决方案为 SonarQube 服务实现了**热数据、冷计算架构**。这种架构通过准实时数据同步和手动 SonarQube 服务故障转移程序提供灾难恢复能力。架构由部署在不同集群或区域的两个 SonarQube 实例组成,备用 SonarQube 并不会提前部署,直到在灾难场景中激活,而数据库层保持持续同步。 + +### 数据同步策略 + +该解决方案通过 PostgreSQL 流式复制确保主数据库和备用数据库之间的实时事务日志同步,包括所有 SonarQube 应用程序数据 + +### 灾难恢复配置 + +1. **部署主 SonarQube**:配置域名访问,连接到主 PostgreSQL 数据库 +2. **准备备用 SonarQube 部署环境**:配置备用实例所需要的 secret 资源,以便于灾难发生时快速恢复 + +### 故障转移程序 + +当发生灾难时,以下步骤确保转换到备用环境: + +1. **验证主故障**:确认所有主 SonarQube 组件都不可用 +2. **提升数据库**:使用数据库故障转移程序将备用 PostgreSQL 提升为主 +3. **部署备用 SonarQube**:在备集群使用灾备数据快速部署 SonarQube 实例 +4. **更新路由**:将外部访问地址切换到指向备用 SonarQube 实例 + +## SonarQube 容灾配置 + +::: warning + +为了简化配置过程,降低配置难度,推荐主备两个环境中使用一致的信息,包括: + +- 一致的数据库实例名称和密码 +- 一致的 SonarQube 实例名称 +- 一致的命名空间名称 + +::: + +### 前置条件 + +1. 提前准备一个主集群和一个灾难恢复集群(或包含不同区域的集群)。 +2. 完成 `Alauda support for PostgreSQL` 灾难恢复配置的部署。 + +### 使用 `Alauda support for PostgreSQL` 构建 PostgreSQL 灾难恢复集群 + +参考 `PostgreSQL 热备用集群配置指南`,使用 `Alauda support for PostgreSQL` 构建灾难恢复集群。 + +确保主 PostgreSQL 和备用 PostgreSQL 位于不同的集群(或不同的区域)。 + +您可以在 [Alauda Knowledge](https://cloud.alauda.io/knowledges#/) 上搜索 `PostgreSQL 热备用集群配置指南` 来获取它。 + +:::warning + +`PostgreSQL 热备用集群配置指南` 是一份描述如何使用 `Alauda support for PostgreSQL` 构建灾难恢复集群的文档。使用此配置时,请确保与相应的 ACP 版本兼容。 + +::: + +### 设置主 SonarQube + +按照 SonarQube 实例部署指南部署主 SonarQube 实例。配置域名访问,连接到主 PostgreSQL 数据库。 + +配置示例(仅包含了容灾关注的配置项,完整配置项见产品文档): + +```yaml +apiVersion: operator.alaudadevops.io/v1alpha1 +kind: Sonarqube +metadata: + name: + namespace: +spec: + externalURL: http://dr-sonar.alaudatech.net # 配置域名并解析到主集群 + helmValues: + ingress: + enabled: true + hosts: + - name: dr-sonar.alaudatech.net + jdbcOverwrite: + enable: true + jdbcSecretName: sonarqube-pg + jdbcUrl: jdbc:postgresql://sonar-dr.sonar-dr:5432/sonar_db? 
# 连接到主 PostgreSql + jdbcUsername: postgres +``` + +### 设置备用 SonarQube + +:::warning +当 PostgreSQL 处于备用状态时,备用数据库无法接受写操作,因此备集群的 SonarQube 无法部署成功。 + +如需验证备集群 SonarQube 是否可以部署成功,可以临时将备集群的 PostgreSQL 提升为主集群,测试完成后再设置回备用状态。同时需要将测试过程中创建的 SonarQube 资源都删除。 +::: + +1. 创建备 SonarQube 使用的 Secret +2. 备份主 SonarQube 实例 YAML + +#### 创建备 SonarQube 使用的 Secret + +备 SonarQube 需要两个 secret,分别保存数据库连接 (连接到备 PostgreSQL) 和 root 密码。参考 [SonarQube 部署文档](https://docs.alauda.cn/alauda-build-of-sonarqube/2025.1/install/02_sonarqube_credential.html#pg-credentials) 创建(Secret 名称保持和主 SonarQube 配置时使用的名称一致)。 + +示例: + +```bash +apiVersion: v1 +stringData: + host: sonar-dr.sonar-dr + port: "5432" + username: postgres + jdbc-password: xxxx + database: sonar_db +kind: Secret +metadata: + name: sonarqube-pg + namespace: $SONARQUBE_NAMESPACE +type: Opaque +--- +apiVersion: v1 +stringData: + password: xxxxx +kind: Secret +metadata: + name: sonarqube-root-password + namespace: $SONARQUBE_NAMESPACE +type: Opaque +``` + +#### 备份主 SonarQube 实例 YAML + +```bash +kubectl -n "$SONARQUBE_NAMESPACE" get sonarqube "$SONARQUBE_NAME" -oyaml > sonarqube.yaml +``` + +根据容灾环境实际情况修改 `sonarqube.yaml` 中的信息,包括 PostgreSQL 连接地址等。 + +:::warning +`SonarQube` 资源**不需要**立即创建在容灾环境,只需要在灾难发生时,执行容灾切换时创建到备集群即可。 +::: + +:::warning +如需进行容灾演练,可以按照 [灾难场景中的主备切换程序](#灾难切换) 中的步骤进行演练。演练完毕后需要在容灾环境完成以下清理操作: + +- 将容灾环境中的 `SonarQube` 实例删除 +- 将 PostgreSQL 集群切换为备用状态 + +::: + +### 恢复目标 + +#### 恢复点目标 (RPO) + +RPO 表示在灾难恢复场景中最大可接受的数据丢失。在此 SonarQube 灾难恢复解决方案中: + +- **数据库层**:由于 PostgreSQL 热备用流式复制,数据丢失接近零 +- **总体 RPO**:总体 RPO 接近零,取决于 PostgreSQL 流式复制的延迟 + +#### 恢复时间目标 (RTO) + +RTO 表示在灾难恢复期间最大可接受的停机时间。此解决方案提供: + +- **手动组件**:SonarQube 服务激活和外部路由更新需要手动干预 +- **典型 RTO**:完整服务恢复需要 7-20 分钟 + +**RTO 分解:** + +- 数据库故障转移:1-2 分钟(手动) +- SonarQube 服务激活:5-15 分钟(手动) +- 外部路由更新:1-3 分钟(手动,取决于 DNS 传播) + +## 灾难切换 + +1. **确认主 SonarQube 故障**:确认所有主 SonarQube 组件都处于非工作状态,否则先停止所有主 SonarQube 组件。 + +2. **提升备用 PostgreSQL**:将备用 PostgreSQL 提升为主 PostgreSQL。参考 `PostgreSQL 热备用集群配置指南` 的切换程序。 + +3. **部署备用 SonarQube**:恢复备份的 `sonarqube.yaml` 到容灾环境同名命名空间中。SonarQube 会利用容灾数据自动启动。 + +4. **验证 SonarQube 组件**:验证所有 SonarQube 组件正在运行且健康。测试 SonarQube 功能(项目访问、代码分析、用户认证)以验证 SonarQube 是否正常工作。 + +5. **切换访问地址**:将外部访问地址切换到备用 SonarQube。 + +## 使用其他 PostgreSQL 构建 SonarQube 灾难恢复解决方案 + +操作步骤与使用 `Alauda support for PostgreSQL` 构建 SonarQube 灾难恢复解决方案类似。只需将 PostgreSQL 替换为其他支持灾难恢复的 PostgreSQL 解决方案。 + +:::warning +确保所选 PostgreSQL 解决方案支持灾难恢复能力,并在生产环境使用前进行充分的容灾演练。 +:::
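+在更换为其他 PostgreSQL 方案时,SonarQube 侧通常只需调整数据库连接信息(`sonarqube-pg` Secret 以及 `Sonarqube` 资源中的 `jdbcUrl`)。下面是一个示意性片段,其中 `pg-primary.example.com` 与 `pg-standby.example.com` 为假设的外部 PostgreSQL 主/备实例地址,仅用于说明需要修改的字段,请以所选方案的实际信息为准:
+
+```yaml
+# 示意:主环境中 sonarqube-pg Secret 指向所选 PostgreSQL 方案的主实例
+# pg-primary.example.com / pg-standby.example.com 均为假设地址
+apiVersion: v1
+kind: Secret
+metadata:
+  name: sonarqube-pg
+  namespace: $SONARQUBE_NAMESPACE
+type: Opaque
+stringData:
+  host: pg-primary.example.com   # 容灾环境中的同名 Secret 改为备实例地址,如 pg-standby.example.com
+  port: "5432"
+  username: postgres
+  jdbc-password: xxxx
+  database: sonar_db
+```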