
Conversation


@kycheng kycheng commented Nov 18, 2025

  • Clarified the access address for the Primary Object Storage in the disaster recovery steps.
  • Renamed the "Primary-Standby Switchover Procedure" section to "Failover" for better clarity.
  • Expanded the "Disaster Recovery" section to include recovery steps for the original Primary Harbor.
  • Added details on automatic start/stop mechanisms for the disaster recovery instance, including configuration and script examples for managing Harbor and PostgreSQL instances.

Summary by CodeRabbit

  • Documentation
    • Restructured and renamed disaster recovery content with a clearer "Failover" section and a full recovery workflow covering database/object storage promotion and component scaling.
    • Expanded automation: added an "Automatic Start/Stop" workflow, flowchart, configuration guidance, orchestration scripts, deployment patterns, and RBAC guidance.
    • Added comprehensive runnable examples and templates for object storage, PostgreSQL, and application components to support automated recovery operations.



coderabbitai bot commented Nov 18, 2025

Walkthrough

Documentation overhaul of Harbor disaster-recovery: corrected an anchor, renamed and expanded Failover and Disaster Recovery workflows to include promoting Secondary PostgreSQL and object storage and scaling Harbor, and added an "Automatic Start/Stop of Disaster Recovery Instance" automation section with manifests, scripts, and RBAC guidance.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Disaster Recovery Documentation**<br>`docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md` | Fixed PRIMARY_OBJECT_STORAGE_ADDRESS anchor; renamed "Primary-Standby Switchover Procedure in Disaster Scenarios" to "Failover" and expanded steps to promote Secondary PostgreSQL and object storage and scale Secondary Harbor components (includes replica YAML and kubectl patch/scale snippets). |
| **Recovery Workflow Expansion**<br>`docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md` | Replaced "Disaster Recovery Data Check" with a full "Disaster Recovery" workflow: setting Harbor replicas to zero, converting Primary↔Secondary pipelines, pulling realm config, promoting zones, and rollback guidance to revert to primary. |
| **Automatic Start/Stop Automation**<br>`docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md` | Added "Automatic Start/Stop of Disaster Recovery Instance" with Mermaid flowchart, config.yaml template, check/start/stop/status scripts, DR controller deployment manifest, and RBAC notes for automation across Harbor, PostgreSQL, and object storage. |
| **Start/Stop Script & Examples**<br>`docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md` | Added extensive start/stop examples and templates for PostgreSQL hot-standby promotion, Rook-Ceph/Alauda object storage activation, Harbor components (core/portal/jobservice/registry/trivy), Docker and kubectl invocation examples. |
| **Formatting & Cross-refs**<br>`docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md` | Minor formatting refinements and cross-reference updates to integrate automation content with disaster-recovery guidance. |

Sequence Diagram(s)

```mermaid
%%{init: {"themeVariables":{"actorTextFill":"#000000","actorBorder":"#7F8C8D"}}}%%
sequenceDiagram
    participant Admin
    participant DRController as DR Controller
    participant K8s as Kubernetes API
    participant Scripts as Start/Stop Scripts
    participant Harbor as Harbor Components

    rect rgb(230,245,255)
      Admin->>DRController: apply config.yaml (enable automation)
      DRController->>K8s: create Deployment + RBAC
      K8s-->>DRController: Deployment running
    end

    rect rgb(245,255,230)
      Admin->>DRController: trigger start/recover
      DRController->>Scripts: invoke start.sh (promote DB, activate storage)
      Scripts->>K8s: kubectl patch/scale/postgres/object-storage
      K8s->>Harbor: scale/start components
      Harbor-->>Admin: health checks OK
    end

    rect rgb(255,240,240)
      Admin->>DRController: trigger stop
      DRController->>Scripts: invoke stop.sh (scale down/demote)
      Scripts->>K8s: kubectl patch/scale to 0 / demote services
      K8s-->>Harbor: services stopped
    end
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–25 minutes

  • Check ordering and correctness of DB promotion and object-storage conversion commands.
  • Validate YAML manifests, kubectl patch/scale snippets, and DR controller RBAC scope.
  • Ensure consistency between diagrams, scripts, and written steps.

Possibly related PRs

Suggested reviewers

  • Tongcaiyun
  • tyzhou111
  • chengjingtao

Poem

🐰 I nibbled yaml by moonlit glow,
scripts in paw and patches in tow.
Start the DR, promote with care,
stop it gentle, roots laid bare.
Harbor dreams safe — nibble, hop, and go. 🌿✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately describes the main objective of the PR - updating Harbor disaster recovery documentation with expanded failover, recovery, and automation guidance. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md (3)

405-415: Consider more portable DNS/connectivity check methods in the example.

The check.sh example on line 409 uses nslookup for DNS resolution, which may not be available in lightweight container images (e.g., alpine-based) commonly used in Kubernetes environments. The example includes a disclaimer at line 403 not to use directly in production, but you may want to suggest more portable alternatives for users who adapt this example:

  • dig (from bind-utils package, similar availability to nslookup)
  • getent hosts (uses system resolver, widely available)
  • curl with health endpoint (e.g., curl -f https://harbor-domain/api/v2.0/health)
  • Direct network socket checks (e.g., /dev/tcp)

Consider adding a note recommending that users verify tool availability in their target container environment.
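For illustration, a more portable variant of the check might look like the sketch below. This is not the document's script; HARBOR_DOMAIN is a placeholder, and the health-endpoint step assumes Harbor's standard `/api/v2.0/health` API is reachable from the check environment:

```bash
#!/usr/bin/env bash
# Hedged sketch of a portable check: tries getent, then dig, then a bash
# /dev/tcp probe, and finally confirms the Harbor health endpoint responds.
# HARBOR_DOMAIN is a placeholder and must be set for your environment.
set -euo pipefail
HARBOR_DOMAIN="${HARBOR_DOMAIN:-harbor.example.com}"

if command -v getent >/dev/null 2>&1; then
  getent hosts "$HARBOR_DOMAIN" >/dev/null
elif command -v dig >/dev/null 2>&1; then
  [ -n "$(dig +short "$HARBOR_DOMAIN")" ]
else
  # Bash built-in TCP probe; needs no external binaries.
  timeout 5 bash -c "exec 3<>/dev/tcp/${HARBOR_DOMAIN}/443"
fi

# Harbor exposes a health endpoint at /api/v2.0/health.
curl -fsS --max-time 5 "https://${HARBOR_DOMAIN}/api/v2.0/health" >/dev/null
```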


428-428: Normalize JSON quoting in kubectl patch commands for consistency.

The start.sh script on line 428 uses double-quoted JSON with escaped quotes ("{\"spec\":{...}}"), while the stop.sh script on line 437 uses single quotes ('{"spec":{...}}'). Although both are syntactically valid, using single quotes consistently (as in stop.sh) is cleaner, more readable, and reduces escaping complexity. This improves maintainability for users who adapt these examples.

Apply this diff to normalize the quoting style:

  # Start Harbor script - this section is required
  HARBOR_NAMESPACE="${HARBOR_NAMESPACE:-harbor-ns}"
  HARBOR_NAME="${HARBOR_NAME:-harbor}"
  HARBOR_REPLICAS="${HARBOR_REPLICAS:-1}"
- kubectl -n "$HARBOR_NAMESPACE" patch harbor "$HARBOR_NAME" --type=merge -p "{\"spec\":{\"helmValues\":{\"core\":{\"replicas\":$HARBOR_REPLICAS},\"portal\":{\"replicas\":$HARBOR_REPLICAS},\"jobservice\":{\"replicas\":$HARBOR_REPLICAS},\"registry\":{\"replicas\":$HARBOR_REPLICAS},\"trivy\":{\"replicas\":$HARBOR_REPLICAS}}}}"
+ kubectl -n "$HARBOR_NAMESPACE" patch harbor "$HARBOR_NAME" --type=merge -p '{"spec":{"helmValues":{"core":{"replicas":'$HARBOR_REPLICAS'},"portal":{"replicas":'$HARBOR_REPLICAS'},"jobservice":{"replicas":'$HARBOR_REPLICAS'},"registry":{"replicas":'$HARBOR_REPLICAS'},"trivy":{"replicas":'$HARBOR_REPLICAS'}}}}'

Alternatively, consider using a separate JSON file or heredoc for better readability:

kubectl -n "$HARBOR_NAMESPACE" patch harbor "$HARBOR_NAME" --type=merge -p @- <<EOF
{
  "spec": {
    "helmValues": {
      "core": {"replicas": $HARBOR_REPLICAS},
      "portal": {"replicas": $HARBOR_REPLICAS},
      "jobservice": {"replicas": $HARBOR_REPLICAS},
      "registry": {"replicas": $HARBOR_REPLICAS},
      "trivy": {"replicas": $HARBOR_REPLICAS}
    }
  }
}
EOF

479-493: PostgreSQL examples correctly omit peerHost/peerPort updates—consider adding explanatory note.

The Start and Stop script examples for PostgreSQL correctly modify only isReplica and numberOfInstances without updating peerHost and peerPort parameters. This aligns with the PostgreSQL hot standby cluster design where the standby cluster's connection information is established during initial configuration and persists correctly during switchover operations.

To improve clarity for users who may expect these parameters to change during failover, consider adding a brief inline comment or footnote explaining why peerHost and peerPort are not modified during the switchover, similar to the note already present in the Primary/Secondary Configuration sections (lines 449-451). Based on learnings
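As a hedged illustration of what such a note could accompany: the entire switchover patch in this scheme touches only two fields, with the peer connection settings left as configured at bootstrap (the resource layout below is inferred from the snippets quoted in this review, not taken verbatim from the document):

```yaml
# Illustrative merge patch for promoting the standby: only isReplica and
# numberOfInstances change; peerHost/peerPort keep the values established
# during initial configuration (field layout is an assumption).
spec:
  numberOfInstances: 2
  clusterReplication:
    isReplica: false   # true while the cluster acts as standby
    # peerHost / peerPort intentionally untouched
```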

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d9c3db7 and 633da2d.

📒 Files selected for processing (1)
  • docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-15T09:04:27.174Z
Learnt from: SuJinpei
Repo: alauda/knowledge PR: 70
File: docs/en/solutions/How_to_Use_PostgreSQL_Hot_Standby_Cluster.md:249-271
Timestamp: 2025-10-15T09:04:27.174Z
Learning: In PostgreSQL hot standby cluster setups with ACP (Alauda Container Platform), the standby cluster's host and port information are stored in the database during initial setup. Therefore, when performing switchover operations between primary and standby clusters, it is not necessary to update the peerHost and peerPort parameters in the clusterReplication configuration - the replication will continue to work correctly without these updates.

Applied to files:

  • docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md
🔇 Additional comments (4)
docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md (4)

111-111: Verify that the anchor link target exists in the referenced documentation.

Line 111 references an anchor #address in the external Object Storage Disaster Recovery documentation. Confirm that the target URL includes this section anchor and that the link will correctly navigate users to the "Configure External Access for Primary Zone" section.

To verify, you can:

  1. Visit the URL in a browser: https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_object.html#address
  2. Confirm that the page contains a section with this anchor identifier
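Assuming the page serves the anchor in its static HTML rather than injecting it client-side, a scripted spot-check might be:

```bash
# Sketch: fetch the page and look for the anchor id. This only works if
# the docs site emits the anchor in static markup.
URL="https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_object.html"
curl -fsSL "$URL" | grep -q 'id="address"' && echo "anchor found" || echo "anchor missing"
```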

282-313: Clear and logically sequenced failover procedure.

The restructured "Failover" section presents the disaster recovery activation steps in a well-organized manner. The procedural flow is logical, and the YAML configuration example correctly demonstrates how to scale Harbor components. The section effectively bridges the architecture overview with practical execution steps.


322-329: Verify Ceph object storage zone recovery command completeness.

The radosgw-admin zone modify command on line 327 promotes the recovered zone to master status. However, verify whether the --master flag alone is sufficient or if an additional --default flag is needed to set it as the default zone. Ceph documentation may require both flags or a separate command to complete the recovery.

Consider checking the Ceph RGW documentation or providing a link to the Object Storage Disaster Recovery guide that details the complete zone promotion procedure.
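For comparison, upstream Ceph's multisite failover guidance promotes the zone with both flags and then commits the period so the change propagates to peers; a sketch of that sequence (realm/zonegroup/zone names are placeholders):

```bash
# Promote the recovered zone to master AND default, then commit the period
# so the updated configuration propagates to the other zones.
radosgw-admin zone modify --rgw-realm=<realm-name> \
  --rgw-zonegroup=<zone-group-name> \
  --rgw-zone=<primary-zone-name> --master --default
radosgw-admin period update --commit
```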


377-512: Well-structured automation section with practical examples.

The new "Automatic Start/Stop of Disaster Recovery Instance" section is comprehensive and well-organized. The Mermaid flowchart clearly illustrates the decision logic, the configuration structure is straightforward, and the script examples provide practical templates. The technology-specific subsections for PostgreSQL and Rook-Ceph give users clear guidance for their chosen infrastructure. The use of environment variables with sensible defaults improves usability.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 633da2d and b7296c5.

📒 Files selected for processing (1)
  • docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md (4 hunks)
🔇 Additional comments (6)
docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md (6)

111-111: Verify the link anchor exists in the target documentation.

Line 111 updates the link anchor from #configure-external-access-for-primary-zone to #address. Ensure that the target documentation at dr_object.html contains this anchor.


282-313: Failover section improvements are well-structured.

The renamed "Failover" section provides a clear, sequential procedure with appropriate YAML examples. The steps correctly sequence database promotion, storage failover, and Harbor component scaling. References to external configuration guides are properly cited.


314-333: Disaster Recovery section provides complete recovery workflow.

The expanded "Disaster Recovery" section clearly documents how to restore the original Primary Harbor as a Secondary Harbor after failover. The radosgw-admin commands for realm and zone configuration are properly formatted with variable substitution. The closing guidance about potential full restoration is helpful context.


403-415: DNS check script example is appropriately cautious.

The DNS-based health check example is clearly marked as non-production and encourages customization based on internal requirements. The documentation correctly emphasizes that the check script should be tailored to specific deployment scenarios. This balanced approach is good documentation practice.


487-531: PostgreSQL switchover approach aligns with platform capabilities.

The PostgreSQL start/stop script examples correctly demonstrate that switchover requires only modifying clusterReplication.isReplica and numberOfInstances, without touching peerHost and peerPort. This approach aligns with ACP's PostgreSQL hot standby architecture. Based on learnings


533-549: Rook-Ceph automation examples are properly structured.

The Rook-Ceph start/stop script examples correctly use the Kubernetes API to retrieve credentials and dynamically locate the tools pod, then execute the proper radosgw-admin commands for realm and zone configuration. Providing a no-op stop script is reasonable for this use case.

```bash
# Check and control dependencies, such as verifying if the database is the primary instance
# and if the object storage is ready
dependencies start script
```

Add some explicit comments to clarify the startup "order" of the different components, for example:

####
#  Start Your PG
####

####
#  Start Your S3
####


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md (1)

484-484: Deployment command syntax remains incomplete—duplicate of past review issue. Line 484 shows command: ["--", "-c", "/opt/config/config.yaml"], which is invalid and was flagged in the prior review. The command array must start with the executable path, e.g., ["/opt/app/controller", "-c", "/opt/config/config.yaml"], or be omitted if the image already has the correct ENTRYPOINT/CMD.

Apply this diff to fix the command field:

      - name: controller
        image: xxx  # Replace with your control program image
-       command: ["--", "-c", "/opt/config/config.yaml"]
+       command: ["/path/to/controller", "-c", "/opt/config/config.yaml"]
🧹 Nitpick comments (1)
docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md (1)

420-438: Enhance startup order comments for clarity. While the start script includes sections for PostgreSQL, Object Storage, and Harbor, the comments could be more prominent to match the previously requested format for clear component startup order documentation.

Consider restructuring the section markers like this for better visibility:

      # Check and control dependencies, such as verifying if the database is the primary instance
      # and if the object storage is ready
+     #####################################
+     #  Start Your PostgreSQL
+     #####################################
      #####################################
      # Add your PostgreSQL start script here.
      # This script should promote the secondary PostgreSQL to primary role and ensure
      # the database is ready to serve Harbor before starting Harbor components.
      #####################################

+     #####################################
+     #  Start Your Object Storage (S3)
+     #####################################
      #####################################
      # Add your S3/Object Storage start script here.
      # This script should promote the secondary object storage to primary role and ensure
      # the storage system is ready to serve Harbor before starting Harbor components.
      #####################################

+     #####################################
+     #  Start Your Harbor
+     #####################################
      # Start Harbor script - this section is required

This makes the startup sequence more visually explicit for operators implementing the scripts.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b7296c5 and 5522794.

📒 Files selected for processing (1)
  • docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (7)
docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md (7)

111-111: Anchor reference update is correct. The clarification to point to #address makes the documentation more precise and helpful for users locating the correct configuration step.


282-312: Failover section is well-structured and comprehensive. The expanded procedure with step-by-step guidance and YAML examples clearly guides operators through the failover process. The inclusion of a test step before switching external routing is a good practice.


314-332: Recovery workflow is comprehensive and well-documented. The section correctly outlines the steps to restore the original Primary Harbor as a Secondary Harbor, including proper radosgw-admin commands with clear intent comments. The guidance on reverting to primary via the failover procedure is helpful.


507-551: PostgreSQL configuration and scripts correctly implement switchover pattern. The approach of configuring peerHost and peerPort during initial setup (lines 511-535) and managing failover through only isReplica and numberOfInstances changes (lines 542, 550) aligns with disaster recovery best practices and avoids unnecessary reconfiguration during switchover. This reduces operational complexity during critical failover moments.


381-420: Flowchart, configuration template, and script structure are well-designed. The Mermaid flowchart clearly illustrates the check-and-decide-then-act pattern, and the config.yaml template provides all necessary parameters (check_script, start_script, stop_script, check_interval, failure_threshold, script_timeout) with sensible defaults. The check.sh example appropriately warns against production use and includes a disclaimer.
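To make the parameter set concrete, a minimal config.yaml along the lines this comment describes might look like the sketch below; the keys and units are examples only and must be checked against the template in the document itself:

```yaml
# Hypothetical shape of the DR controller configuration; values are
# illustrative and the exact keys must match the documented template.
check_script: /opt/scripts/check.sh    # decides whether the DR instance should run
start_script: /opt/scripts/start.sh    # promotes DB/storage, then scales Harbor up
stop_script: /opt/scripts/stop.sh      # scales Harbor down, then demotes dependencies
check_interval: 30s                    # how often check_script is executed
failure_threshold: 3                   # consecutive failures before start_script fires
script_timeout: 300s                   # abort any script that runs longer than this
```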


553-569: Rook-Ceph start/stop examples are practical and production-ready. The start script properly demonstrates dynamic secret retrieval, tools pod discovery, and the sequence of realm pull followed by zone modification to promote the secondary zone to master. The note that no action is required on stop is appropriate for the disaster recovery use case.


465-497: Verify RBAC permissions before deploying the controller.

The Deployment YAML structure is correct. Before applying this manifest, ensure the ServiceAccount has permissions to:

  • Patch Harbor CRD resources in the target namespace
  • Access/modify resources managed by your custom start/stop scripts (PostgreSQL, object storage, etc.)

To verify RBAC configuration in your environment, run:

# Replace <service-account-name> and <namespace> with actual values
SERVICE_ACCOUNT="<service-account-name>"
NAMESPACE="<namespace>"

# Check role bindings for the ServiceAccount
kubectl get rolebindings,clusterrolebindings -A -o jsonpath='{range .items[?(@.subjects[*].name=="'"$SERVICE_ACCOUNT"'")]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}'

# Check permissions for harbor resources
kubectl auth can-i patch harbor --as=system:serviceaccount:"$NAMESPACE":"$SERVICE_ACCOUNT" -n "$NAMESPACE"

```bash
--access-key={access-key} --secret={secret}
# Make the recovered zone the master and default zone:
radosgw-admin zone modify --rgw-realm=<realm-name> --rgw-zonegroup=<zone-group-name> --rgw-zone=<primary-zone-name> --master
```

By this point, the original secondary (now primary) node has usually been running for some time.

Can the new data produced by PG and S3 during that period be synced back correctly?

- Clarified the access address for the Primary Object Storage in the disaster recovery steps.
- Renamed the "Primary-Standby Switchover Procedure" section to "Failover" for better clarity.
- Expanded the "Disaster Recovery" section to include recovery steps for the original Primary Harbor.
- Added details on automatic start/stop mechanisms for the disaster recovery instance, including configuration and script examples for managing Harbor and PostgreSQL instances.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md (1)

564-564: Replace placeholder container image.

Line 564 still contains the placeholder xxx for the container image. This makes the Deployment manifest non-functional. Replace it with the actual container image that includes required tools (kubectl, radosgw-admin, etc.) for executing the disaster recovery scripts.

Example correction:

       containers:
       - name: controller
-        image: xxx  # Replace with the correct tool image that contains the tools required by the script.
+        image: <your-registry>/<disaster-recovery-tools>:<tag>  # Image must include kubectl, radosgw-admin, and other required tools
         command:

Ensure the selected image includes:

  • kubectl for Harbor and PostgreSQL patching
  • radosgw-admin for Ceph operations (if using Alauda Build of Rook-Ceph)
  • Any other tools required by your custom scripts
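One way to sanity-check a candidate image before wiring it into the Deployment (the image reference is a placeholder):

```bash
# Verify the required CLIs exist in the image; replace the reference with
# your actual disaster-recovery tools image.
docker run --rm --entrypoint sh <your-registry>/<dr-tools>:<tag> -c \
  'command -v kubectl && command -v radosgw-admin'
```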
🧹 Nitpick comments (2)
docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md (2)

485-506: Add explicit component startup order comments in start.sh.

The start.sh script contains placeholders for PostgreSQL and Object Storage startup, but lacks explicit comments about the required startup sequence. Past reviews requested clear comments about component startup order. Clarify that Harbor depends on both PostgreSQL and Object Storage being ready first.

Add explicit section markers to make the startup order unambiguous:

 set -euo pipefail
 # Check and control dependencies, such as verifying if the database is the primary instance
 # and if the object storage is ready
+
+#####################################
+# Step 1: Start PostgreSQL
+# Promote the secondary PostgreSQL to primary role and ensure
+# the database is ready to serve Harbor before starting Harbor components.
 #####################################
 # Add your PostgreSQL start script here.
 # This script should promote the secondary PostgreSQL to primary role and ensure
 # the database is ready to serve Harbor before starting Harbor components.
 #####################################
+
+#####################################
+# Step 2: Start Object Storage
+# Promote the secondary object storage to primary role and ensure
+# the storage system is ready to serve Harbor before starting Harbor components.
 #####################################
 # Add your S3/Object Storage start script here.
 # This script should promote the secondary object storage to primary role and ensure
 # the storage system is ready to serve Harbor before starting Harbor components.
 #####################################
+
+#####################################
+# Step 3: Start Harbor
+# Scale up Harbor components only after dependencies are ready.
 #####################################

 # Start Harbor script - this section is required

510-529: Add explicit component shutdown order comments in stop.sh.

Mirror the improvement suggested for start.sh by adding explicit comments about shutdown order in stop.sh (typically reverse of startup order).

 set -euo pipefail
+
+#####################################
+# Step 1: Stop Harbor
+# Scale down Harbor components first.
+#####################################
+
 # Stop Harbor script - this section is required
 HARBOR_NAMESPACE="${HARBOR_NAMESPACE:-harbor-ns}"
 HARBOR_NAME="${HARBOR_NAME:-harbor}"
 kubectl -n "$HARBOR_NAMESPACE" patch harbor "$HARBOR_NAME" --type=merge -p '{"spec":{"helmValues":{"core":{"replicas":0},"portal":{"replicas":0},"jobservice":{"replicas":0},"registry":{"replicas":0},"trivy":{"replicas":0}}}}'
+
+#####################################
+# Step 2: Stop PostgreSQL
+# Configure the PostgreSQL cluster to operate as a replica
 # Check and control dependencies, such as setting the database to replica mode
 #####################################
 # Add your PostgreSQL stop script here.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5522794 and 29df184.

📒 Files selected for processing (1)
  • docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md (4 hunks)
🔇 Additional comments (3)
docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md (3)

110-111: Verify the external documentation anchor reference.

The anchor reference was updated from an unspecified target to #address. Ensure this matches the correct section heading in the external Alauda Ceph documentation at the referenced URL.


282-313: Section reorganization looks good.

The renamed "Failover" section is clearer and the structured steps are well-organized. The YAML examples are correct and properly guide users through the failover process.


314-332: Disaster recovery workflow is well-structured.

The expanded section properly documents the recovery procedure for the original Primary Harbor after failover. The workflow is logical and references to external configuration guides are appropriate. The PostgreSQL switchover approach (updating isReplica and numberOfInstances without changing peerHost/peerPort) aligns with the documented PostgreSQL replication architecture where standby information is stored during initial setup.

- Streamlined the description of the `condition_check.sh` script for clarity.
- Added Kubernetes resource definitions for ServiceAccount, ClusterRole, ClusterRoleBinding, and ConfigMap to support disaster recovery operations.
- Updated the Deployment configuration to reflect new naming conventions and improved structure for better readability and maintainability.
- Included notes on necessary RBAC permissions for the ServiceAccount to operate effectively within the Harbor namespace.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md (1)

541-544: Placeholder RBAC permissions require user action.

The ClusterRole definition (lines 541-544) leaves rules: [] empty with a comment to add permissions. While documentation is provided (lines 654-684), consider moving the concrete permissions example directly into the manifest template to reduce the likelihood of users deploying with incomplete RBAC. Currently users must copy rules from documentation after the manifest, creating an extra step.

For better user experience, update the manifest to include the documented permissions directly, or at minimum add a clear cross-reference.
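For example, a starting point for the rules block could look like the sketch below. The Harbor CRD's API group is an assumption and must be confirmed with kubectl api-resources; the pods/exec rule is only needed if the scripts exec into a tools pod:

```yaml
# Hypothetical least-privilege rules for the DR controller. The Harbor CRD
# group below is a placeholder; substitute the group your cluster reports.
rules:
- apiGroups: ["<harbor-crd-group>"]   # e.g. from: kubectl api-resources | grep -i harbor
  resources: ["harbors"]
  verbs: ["get", "list", "patch"]
- apiGroups: ["apps"]
  resources: ["deployments", "statefulsets"]
  verbs: ["get", "list", "patch"]
- apiGroups: [""]
  resources: ["pods", "secrets"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/exec"]            # required to run radosgw-admin in the tools pod
  verbs: ["create"]
```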

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 29df184 and 40a9b23.

📒 Files selected for processing (1)
  • docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md (4 hunks)
🔇 Additional comments (7)
docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md (7)

111-111: Documentation clarification is helpful.

The updated anchor reference now specifically directs readers to the correct section (#address) for obtaining the PRIMARY_OBJECT_STORAGE_ADDRESS. This improves clarity for users following the guide.


282-313: Failover section is well-structured.

The rename to "Failover" and expanded procedure with clear numbered steps (PostgreSQL promotion, storage activation, Harbor scaling, routing update) provides a comprehensive, actionable failover workflow. External references to detailed procedures are appropriate.


314-333: Disaster recovery procedure is comprehensive.

The expanded section now provides clear recovery steps for restoring the original Primary Harbor to secondary role, including specific PostgreSQL and object storage commands. The guidance aligns with known PostgreSQL switchover best practices—not requiring peerHost/peerPort updates—and notes the reversible nature of failover.


377-404: Automation architecture and configuration are well-explained.

The Mermaid flowchart clearly depicts the condition-check-to-action flow, and the configuration structure with documented parameters (including failure threshold and script timeout) provides necessary safeguards against transient failures.


410-527: Bash scripts are well-constructed with proper error handling.

The scripts include defensive practices: set -euo pipefail, proper timeout handling in curl commands (line 444), error handling for kubectl queries (lines 450-454), and the previous syntax error at the elif conditional (line 469) has been corrected with the required semicolon before then. The status.sh script appropriately returns three states (started/stopped/unknown) to handle edge cases, and example scripts include templated sections for user customization.
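As a hedged sketch of that three-state contract, a status.sh skeleton could look like this; the Deployment label selector is an assumption and must match the actual Harbor install:

```bash
#!/usr/bin/env bash
# Skeleton status script returning started/stopped/unknown. The label
# selector below is hypothetical; adjust it to your Harbor deployment.
set -euo pipefail
HARBOR_NAMESPACE="${HARBOR_NAMESPACE:-harbor-ns}"

replicas=$(kubectl -n "$HARBOR_NAMESPACE" get deploy \
  -l app=harbor,component=core \
  -o jsonpath='{.items[0].spec.replicas}' 2>/dev/null) || { echo "unknown"; exit 0; }

if [ "${replicas:-0}" -gt 0 ]; then
  echo "started"
else
  echo "stopped"
fi
```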


600-606: Verify container image is replaced for production use.

The container command structure is now correct (previous syntax issue has been fixed). However, the image reference points to a test registry (build-harbor.alauda.cn/test/...). Ensure that deployment documentation or comments clearly indicate that users must replace this with their production image before deploying.


686-764: PostgreSQL configuration examples align with switchover best practices.

The Primary and Secondary configuration examples, along with the start/stop script examples, correctly focus on changing isReplica and numberOfInstances without modifying peerHost/peerPort. This aligns with known best practices where replication information configured during initial setup persists correctly through failover without requiring manual updates.

Comment on lines +640 to +650
```yaml
      - configMap:
          name: disaster-recovery-config
        name: scripts
      - configMap:
          items:
          - key: config.yaml
            path: config.yaml
          name: disaster-recovery-config
        name: config
            path: config.yaml
```

⚠️ Potential issue | 🔴 Critical

Fix YAML indentation error in volumes section.

Line 649 contains a duplicate and misaligned path: config.yaml entry that will cause YAML parsing errors. This line should be removed—the volume definition is complete at line 648.

Apply this diff to fix the YAML structure:

       - configMap:
           items:
           - key: config.yaml
             path: config.yaml
           name: disaster-recovery-config
         name: config
-            path: config.yaml
🤖 Prompt for AI Agents
In docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md around lines
640 to 650, there is a YAML indentation/duplication error: a second misaligned
"path: config.yaml" (line 649) is duplicated and will break parsing; remove that
duplicate line so the volume items block only contains a single "path:
config.yaml" entry under the configMap items, preserving correct indentation for
the surrounding configMap/volume entries.

Comment on lines +816 to +830
- **Start Script Example**: For more details, refer to [Object Storage Disaster Recovery](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_object.html)
```bash
REALM_NAME="${REALM_NAME:-real}"
ZONE_GROUP_NAME="${ZONE_GROUP_NAME:-group}"
ZONE_NAME="${ZONE_NAME:-zone}"
ACCESS_KEY=$(kubectl -n rook-ceph get secrets "${REALM_NAME}-keys" -o jsonpath='{.data.access-key}' 2>/dev/null | base64 -d)
SECRET_KEY=$(kubectl -n rook-ceph get secrets "${REALM_NAME}-keys" -o jsonpath='{.data.secret-key}' 2>/dev/null | base64 -d)
ENDPOINT=$(kubectl -n rook-ceph get cephobjectzone realm-zone -o jsonpath='{.spec.customEndpoints[0]}')
TOOLS_POD=$(kubectl -n rook-ceph get po -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
kubectl -n rook-ceph exec "$TOOLS_POD" -- radosgw-admin realm pull --url="$ENDPOINT" --access-key="$ACCESS_KEY" --secret="$SECRET_KEY";
kubectl -n rook-ceph exec "$TOOLS_POD" -- radosgw-admin zone modify --rgw-realm="$REALM_NAME" --rgw-zonegroup="$ZONE_GROUP_NAME" --rgw-zone="$ZONE_NAME" --master
```

⚠️ Potential issue | 🟡 Minor

Rook-Ceph automation examples are well-structured.

The start script uses appropriate Kubernetes patterns to execute Ceph administrative commands within the tools pod, and the environment variables provide necessary customization points. However, line 819 shows a potentially problematic default value for REALM_NAME: "real" appears to be a typo for "realm".

Verify whether the default value on line 819 should be corrected:

- REALM_NAME="${REALM_NAME:-real}"
+ REALM_NAME="${REALM_NAME:-realm}"

Unless "real" is intentional and matches your actual realm name, this should be a more generic placeholder.

🤖 Prompt for AI Agents
In docs/en/solutions/How_to_perform_disaster_recovery_for_harbor.md around lines
816–830, the REALM_NAME default is likely a typo ("real"); update the default to
a correct placeholder (for example "realm" or a clearly generic value like
"my-realm") so the script uses an intentional default; modify the REALM_NAME
assignment accordingly and ensure the README text or examples elsewhere remain
consistent with the chosen placeholder.

        - -c
        - |
          cp /disaster-recovery /opt/bin/disaster-recovery && chmod +x /opt/bin/disaster-recovery
        image: build-harbor.alauda.cn/test/harbor-disaster-recovery:2.12.4-dev-7b8c78a-kychen

Remember to change this.

