@nanjingfm nanjingfm commented Nov 19, 2025

Summary by CodeRabbit

  • Documentation
    • Added comprehensive GitLab disaster recovery guide (hot-data/cold-compute) covering architecture, Ceph/object storage and PostgreSQL replication, primary/secondary setup, failover/switchover procedures, RPO/RTO, backups, drills and remediation.
    • Added comprehensive SonarQube disaster recovery guide covering architecture, PostgreSQL streaming replication, standby deployment/activation, manual failover runbook, validation, backups, and drills.
    • Added comprehensive Nexus disaster recovery guide (Ceph block storage) with replication, setup, failover workflow, backups, validation, RPO/RTO and drill guidance.
    • All guides available in English and Chinese.

✏️ Tip: You can customize this high-level summary in your review settings.

coderabbitai bot commented Nov 19, 2025

Walkthrough

Adds six new disaster-recovery solution documents (English and Chinese) for GitLab, SonarQube, and Nexus describing hot-data/cold-compute architectures, data replication mechanisms (Ceph RBD, PostgreSQL streaming, object replication), failover/runbook procedures, setup steps, backup guidance, and RPO/RTO targets.

Changes

| Cohort / File(s) | Summary |
|---|---|
| GitLab DR docs: docs/en/solutions/How_to_perform_disaster_recovery_for_gitlab.md, docs/zh/solutions/How_to_perform_disaster_recovery_for_gitlab.md | New comprehensive guides for GitLab DR: architecture, components (Primary/Secondary GitLab, PostgreSQL, Gitaly, object storage), multi-layer synchronization (DB, Gitaly, attachments), Ceph RBD mirror config, object storage provisioning, secrets/PVC/PV backup, switchover/runbook, RPO/RTO, and drill/remediation steps. |
| SonarQube DR docs: docs/en/solutions/How_to_perform_disaster_recovery_for_sonarqube.md, docs/zh/solutions/How_to_perform_disaster_recovery_for_sonarqube.md | New SonarQube DR guides: hot-data/cold-compute pattern with PostgreSQL streaming replication, primary/secondary deployment steps, secret and Kubernetes resource examples, manual failover/runbook, promotion procedures, RPO/RTO, and drill/validation guidance. |
| Nexus DR docs: docs/en/solutions/How_to_perform_disaster_recovery_for_nexus.md, docs/zh/solutions/How_to_perform_disaster_recovery_for_nexus.md | New Nexus DR guides: Ceph block (RBD) based replication, Ceph RBD Mirror configuration, PVC/PV backup and restore, deployment notes for secondary Nexus, switchover workflow, RPO/RTO, drill procedures, and YAML/command examples for validation. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant Primary_App as Primary App (GitLab/SonarQube/Nexus)
  participant Primary_DB as Primary DB (Postgres)
  participant Object_Storage as Object Storage (S3/Ceph)
  participant Block_Storage as Block Storage (Ceph RBD)
  participant Replica_DB as DR DB (standby)
  participant Secondary_App as Secondary (cold compute)

  rect rgb(230,248,230)
    Note over Primary_DB,Replica_DB: Continuous DB replication / WAL streaming
  end

  rect rgb(230,248,230)
    Note over Block_Storage: Block-level replication (RBD Mirror) / PV snapshots
  end

  User->>Primary_App: Normal read/write traffic
  Primary_App->>Primary_DB: write transactions
  Primary_App->>Object_Storage: upload objects/attachments
  Primary_App->>Block_Storage: PV writes (Gitaly / repo data)
  Primary_DB-->>Replica_DB: WAL stream / async replication
  Object_Storage-->>Object_Storage: bucket/object replication
  Block_Storage-->>Block_Storage: RBD mirror replication
  Primary_App->>Secondary_App: periodic backups / PV snapshots (cold)
sequenceDiagram
  autonumber
  participant Operator
  participant Replica_DB as Promote DB
  participant Secondary_App as Activate Secondary
  participant DNS
  participant User

  rect rgb(255,244,230)
    Note over Operator,Replica_DB: Failover runbook (promote DB, restore PVs)
  end

  Operator->>Replica_DB: promote standby to primary
  Operator->>Secondary_App: restore/attach PVs, apply secrets, start services
  Operator->>DNS: update external routing / load balancer
  DNS-->>User: traffic routed to Secondary_App
  Secondary_App->>Replica_DB: accept writes (system active)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Heterogeneous, long procedural docs in two languages require careful technical verification.
  • Review focus:
    • PostgreSQL promotion and WAL/streaming steps
    • Ceph RBD mirror commands and state checks
    • PVC/PV snapshot/restore and YAML snippets
    • Secrets handling and example commands
    • Consistency between English and Chinese versions

Suggested reviewers

  • tyzhou111
  • chengjingtao
  • Tongcaiyun

Poem

🐰 I hop through docs by lantern-light,
I mirror bytes to keep them tight.
When primaries fall and alarms chime,
I stitch the standby, one small hop at a time —
DR safe, I nibble logs till daylight.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately describes the primary change: adding disaster recovery solution documentation for GitLab, which is the main focus despite also including DR docs for SonarQube and Nexus. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch doc/gitlab-dr


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
docs/en/solutions/How_to_perform_disaster_recovery_for_gitlab.md (2)

344-344: Use hyphenated compound adjective form for "backed-up" before nouns.

When "backed up" modifies a noun that follows, it should be hyphenated as "backed-up" per standard English grammar. This affects multiple locations: lines 344, 350, 395, 397, 462, and 477.

Apply these diffs. The first one rewords the `pg-secret.yaml` bullet for readability and is separate from the hyphenation fix:

- pg-secret.yaml: Change the `host` and `password` fields to the PostgreSQL connection address and password of the secondary cluster
+ pg-secret.yaml: Change the `host` and `password` fields to the secondary cluster's PostgreSQL connection address and password

Alternatively, apply the hyphenation directly at each occurrence:

- Modify the backed up files:
+ Modify the backed-up files:
- Create the backed up YAML files in the disaster recovery environment
+ Create the backed-up YAML files in the disaster recovery environment
- Modify the three backed up PV files
+ Modify the three backed-up PV files
- Restore the backed up PVC and PV resources
+ Restore the backed-up PVC and PV resources
- Restore the backed up `gitlabofficial.yaml`
+ Restore the backed-up `gitlabofficial.yaml`

Also applies to: 350-350, 395-395, 397-397, 462-462, 477-477
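
Because the change is mechanical, it could also be scripted. The following is a minimal sketch, assuming every affected phrase has "the" or "three" directly before "backed up", as in the occurrences quoted above. The function name `hyphenate_backed_up` is hypothetical, and the output should be reviewed before committing because the pattern is deliberately narrow.

```shell
# Hyphenate "backed up" only when a determiner ("the" or "three") comes
# directly before it, which is the adjective pattern flagged in this review.
# Verb uses such as "the data is backed up daily" are left untouched.
hyphenate_backed_up() {
  sed -E 's/(the|three) backed up /\1 backed-up /g'
}

# Example: filter one of the flagged sentences through the function.
printf '%s\n' 'Modify the three backed up PV files' | hyphenate_backed_up
```

To rewrite a file, redirect its content through the function into a temporary file and compare the two with `diff` before replacing the original.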


115-117: Reduce repetition in prerequisite sentences.

Lines 115–117 begin with identical phrasing ("Complete the deployment of..."). Rewording for variety improves readability.

- 2. Complete the deployment of `Alauda support for PostgreSQL` disaster recovery configuration.
- 3. Complete the deployment of `Alauda Build of Rook-Ceph` object storage disaster recovery configuration ([optional if conditions are met](#data-synchronization-strategy)).
- 4. Complete the deployment of `Alauda Build of Rook-Ceph` block storage disaster recovery configuration.
+ 2. Deploy `Alauda support for PostgreSQL` disaster recovery configuration.
+ 3. Set up `Alauda Build of Rook-Ceph` object storage disaster recovery configuration ([optional if conditions are met](#data-synchronization-strategy)).
+ 4. Configure `Alauda Build of Rook-Ceph` block storage disaster recovery.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1dcd960 and c172627.

⛔ Files ignored due to path filters (1)
  • docs/public/gitlab-disaster-recovery.drawio.svg is excluded by !**/*.svg
📒 Files selected for processing (2)
  • docs/en/solutions/How_to_perform_disaster_recovery_for_gitlab.md (1 hunks)
  • docs/zh/solutions/How_to_perform_disaster_recovery_for_gitlab.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/en/solutions/How_to_perform_disaster_recovery_for_gitlab.md

[style] ~65-~65: Consider an alternative for the often overused word ‘important’.
Context: ...ge. If you assess that this data is not important, you can choose not to perform disaster...

(NOT_IMPORTANT)


[style] ~117-~117: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...t](#data-synchronization-strategy)). 4. Complete the deployment of `Alauda Build of Rook...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[grammar] ~344-~344: Use a hyphen to join words.
Context: ...he following modifications to the backed up files: - pg-secret.yaml: Change the ...

(QB_NEW_EN_HYPHEN)


[grammar] ~350-~350: Use a hyphen to join words.
Context: ...the secondary cluster Create the backed up YAML files in the disaster recovery e...

(QB_NEW_EN_HYPHEN)


[grammar] ~395-~395: Use a hyphen to join words.
Context: ...cho "" done ``` Modify the three backed up PV files and delete all `spec.claimRe...

(QB_NEW_EN_HYPHEN)


[grammar] ~397-~397: Use a hyphen to join words.
Context: ...` fields in the yaml. Create the backed up PVC and PV YAML files directly in the...

(QB_NEW_EN_HYPHEN)


[style] ~412-~412: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...rills. After the drill is complete, you need to perform the following cleanup operation...

(REP_NEED_TO_VB)


[grammar] ~462-~462: Use a hyphen to join words.
Context: ...C and PV Resources**: Restore the backed up PVC and PV resources to the disaster ...

(QB_NEW_EN_HYPHEN)


[grammar] ~477-~477: Use a hyphen to join words.
Context: ...y Secondary GitLab**: Restore the backed up gitlabofficial.yaml to the disaster...

(QB_NEW_EN_HYPHEN)

docs/zh/solutions/How_to_perform_disaster_recovery_for_gitlab.md

[uncategorized] ~108-~108: A verb modifier usually takes the form 'adjective (adverb) + 地 + verb'. Did you mean: 一致"地"命名
Context: ... Ceph 存储池名称和存储类名称 - 一致的 GitLab 实例名称 - 一致的命名空间名称 ::: ### 前置条件 1. 提前准备一个主集群和一个灾难...

(wb4)

🪛 markdownlint-cli2 (0.18.1)
docs/zh/solutions/How_to_perform_disaster_recovery_for_gitlab.md

412-412: Link fragments should be valid

(MD051, link-fragments)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/en/solutions/How_to_perform_disaster_recovery_for_gitlab.md (1)

344-344: Use hyphenated compound adjectives for "backed-up".

When "backed up" functions as a compound adjective modifying a noun, it should be hyphenated as "backed-up" for proper English grammar. Review and update the following phrases:

  • Line 344: "backed up files" → "backed-up files"
  • Line 350: "backed up YAML files" → "backed-up YAML files"
  • Line 395: "backed up PV files" → "backed-up PV files"
  • Line 397: "backed up PVC and PV YAML files" → "backed-up PVC and PV YAML files"
  • Line 462: "backed up PVC and PV resources" → "backed-up PVC and PV resources"
  • Line 477: "backed up gitlabofficial.yaml" → "backed-up gitlabofficial.yaml"

Also applies to: 350-350, 395-395, 397-397, 462-462, 477-477

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c172627 and b832054.

📒 Files selected for processing (4)
  • docs/en/solutions/How_to_perform_disaster_recovery_for_gitlab.md (1 hunks)
  • docs/en/solutions/How_to_perform_disaster_recovery_for_sonarqube.md (1 hunks)
  • docs/zh/solutions/How_to_perform_disaster_recovery_for_gitlab.md (1 hunks)
  • docs/zh/solutions/How_to_perform_disaster_recovery_for_sonarqube.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/zh/solutions/How_to_perform_disaster_recovery_for_sonarqube.md

[uncategorized] ~72-~72: A verb modifier usually takes the form 'adjective (adverb) + 地 + verb'. Did you mean: 一致"地"命名
Context: ... 一致的数据库实例名称和密码 - 一致的 SonarQube 实例名称 - 一致的命名空间名称 ::: ### 前置条件 1. 提前准备一个主集群和一个灾难...

(wb4)

docs/zh/solutions/How_to_perform_disaster_recovery_for_gitlab.md

[uncategorized] ~108-~108: A verb modifier usually takes the form 'adjective (adverb) + 地 + verb'. Did you mean: 一致"地"命名
Context: ... Ceph 存储池名称和存储类名称 - 一致的 GitLab 实例名称 - 一致的命名空间名称 ::: ### 前置条件 1. 提前准备一个主集群和一个灾难...

(wb4)

docs/en/solutions/How_to_perform_disaster_recovery_for_sonarqube.md

[style] ~175-~175: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...rills. After the drill is complete, you need to perform the following cleanup operation...

(REP_NEED_TO_VB)

docs/en/solutions/How_to_perform_disaster_recovery_for_gitlab.md

[style] ~65-~65: Consider an alternative for the often overused word ‘important’.
Context: ...ge. If you assess that this data is not important, you can choose not to perform disaster...

(NOT_IMPORTANT)


[style] ~117-~117: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...t](#data-synchronization-strategy)). 4. Complete the deployment of `Alauda Build of Rook...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[grammar] ~344-~344: Use a hyphen to join words.
Context: ...he following modifications to the backed up files: - pg-secret.yaml: Change the ...

(QB_NEW_EN_HYPHEN)


[grammar] ~350-~350: Use a hyphen to join words.
Context: ...the secondary cluster Create the backed up YAML files in the disaster recovery e...

(QB_NEW_EN_HYPHEN)


[grammar] ~395-~395: Use a hyphen to join words.
Context: ...cho "" done ``` Modify the three backed up PV files and delete all `spec.claimRe...

(QB_NEW_EN_HYPHEN)


[grammar] ~397-~397: Use a hyphen to join words.
Context: ...` fields in the yaml. Create the backed up PVC and PV YAML files directly in the...

(QB_NEW_EN_HYPHEN)


[style] ~412-~412: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...rills. After the drill is complete, you need to perform the following cleanup operation...

(REP_NEED_TO_VB)


[grammar] ~462-~462: Use a hyphen to join words.
Context: ...C and PV Resources**: Restore the backed up PVC and PV resources to the disaster ...

(QB_NEW_EN_HYPHEN)


[grammar] ~477-~477: Use a hyphen to join words.
Context: ...y Secondary GitLab**: Restore the backed up gitlabofficial.yaml to the disaster...

(QB_NEW_EN_HYPHEN)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
docs/en/solutions/How_to_perform_disaster_recovery_for_nexus.md (1)

236-236: Apply consistent hyphenation for compound adjectives with "backed up".

When "backed up" modifies a noun (compound adjective), it should be hyphenated: "backed-up PV file", "backed-up PVC and PV YAML files", etc. This improves grammatical clarity.

Apply this diff to fix the hyphenation:

- Modify the backed up PV file and delete all `spec.claimRef` fields in the yaml.
+ Modify the backed-up PV file and delete all `spec.claimRef` fields in the yaml.
- Create the backed up PVC and PV YAML files directly in the disaster recovery environment with the same namespace name.
+ Create the backed-up PVC and PV YAML files directly in the disaster recovery environment with the same namespace name.
- Restore the backed up PVC and PV resources to the disaster recovery environment with the same namespace name, and check that the PVC status in the secondary cluster is `Bound`:
+ Restore the backed-up PVC and PV resources to the disaster recovery environment with the same namespace name, and check that the PVC status in the secondary cluster is `Bound`:
- 4. **Deploy Secondary Nexus**: Restore the backed up `nexus.yaml` to the disaster recovery environment with the same namespace name. Nexus will automatically start using the disaster recovery data.
+ 4. **Deploy Secondary Nexus**: Restore the backed-up `nexus.yaml` to the disaster recovery environment with the same namespace name. Nexus will automatically start using the disaster recovery data.

Also applies to: 238-238, 289-289, 300-300

docs/en/solutions/How_to_perform_disaster_recovery_for_gitlab.md (1)

395-395: Apply consistent hyphenation for compound adjectives with "backed up".

Similar to the Nexus documentation, when "backed up" modifies a noun, it should be hyphenated for grammatical consistency. Line 395 should use "backed-up" in compound adjective form.

Apply this diff:

- Modify the three backed up PV files and delete all `spec.claimRef` fields in the yaml.
+ Modify the three backed-up PV files and delete all `spec.claimRef` fields in the yaml.

Also applies to: 397-397

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b832054 and d3e41ab.

📒 Files selected for processing (4)
  • docs/en/solutions/How_to_perform_disaster_recovery_for_gitlab.md (1 hunks)
  • docs/en/solutions/How_to_perform_disaster_recovery_for_nexus.md (1 hunks)
  • docs/zh/solutions/How_to_perform_disaster_recovery_for_gitlab.md (1 hunks)
  • docs/zh/solutions/How_to_perform_disaster_recovery_for_nexus.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/zh/solutions/How_to_perform_disaster_recovery_for_nexus.md

[uncategorized] ~73-~73: A verb modifier usually takes the form 'adjective (adverb) + 地 + verb'. Did you mean: 一致"地"命名
Context: ...的 Ceph 存储池名称和存储类名称 - 一致的 Nexus 实例名称 - 一致的命名空间名称 ::: ### 前置条件 1. 提前准备一个主集群和一个灾难...

(wb4)

docs/zh/solutions/How_to_perform_disaster_recovery_for_gitlab.md

[uncategorized] ~108-~108: A verb modifier usually takes the form 'adjective (adverb) + 地 + verb'. Did you mean: 一致"地"命名
Context: ... Ceph 存储池名称和存储类名称 - 一致的 GitLab 实例名称 - 一致的命名空间名称 ::: ### 前置条件 1. 提前准备一个主集群和一个灾难...

(wb4)

docs/en/solutions/How_to_perform_disaster_recovery_for_gitlab.md

[style] ~65-~65: Consider an alternative for the often overused word ‘important’.
Context: ...ge. If you assess that this data is not important, you can choose not to perform disaster...

(NOT_IMPORTANT)


[style] ~117-~117: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...t](#data-synchronization-strategy)). 4. Complete the deployment of `Alauda Build of Rook...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~412-~412: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...rills. After the drill is complete, you need to perform the following cleanup operation...

(REP_NEED_TO_VB)

docs/en/solutions/How_to_perform_disaster_recovery_for_nexus.md

[grammar] ~236-~236: Use a hyphen to join words.
Context: ...pv-${PV}.yaml" fi ``` Modify the backed up PV file and delete all `spec.claimRef...

(QB_NEW_EN_HYPHEN)


[grammar] ~238-~238: Use a hyphen to join words.
Context: ...` fields in the yaml. Create the backed up PVC and PV YAML files directly in the...

(QB_NEW_EN_HYPHEN)


[style] ~253-~253: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...rills. After the drill is complete, you need to perform the following cleanup operation...

(REP_NEED_TO_VB)


[grammar] ~289-~289: Use a hyphen to join words.
Context: ...C and PV Resources**: Restore the backed up PVC and PV resources to the disaster ...

(QB_NEW_EN_HYPHEN)


[grammar] ~300-~300: Use a hyphen to join words.
Context: ...oy Secondary Nexus**: Restore the backed up nexus.yaml to the disaster recovery...

(QB_NEW_EN_HYPHEN)

🔇 Additional comments (18)
docs/en/solutions/How_to_perform_disaster_recovery_for_nexus.md (1)

115-135: Ensure consistency in storage class naming conventions.

Line 110 references ceph-rdb as the storage class name with a comment suggesting it should be "the configured storage class name." This is appropriately flexible, and the documentation later provides commands to verify the setup. However, ensure this naming convention is consistently documented across the Nexus, GitLab, and other DR solutions. The naming appears consistent with the GitLab DR documentation, so this is acceptable.

docs/zh/solutions/How_to_perform_disaster_recovery_for_gitlab.md (2)

412-412: Verify markdown link fragment matches target heading.

Line 412 references the internal link [灾难切换](#灾难切换) which correctly targets the section heading ## 灾难切换 at line 452. This appears to have been properly fixed from the previous review. The link should work correctly in rendered documentation.

Also applies to: 452-452


41-41: Diagram file reference is valid.

The file public/gitlab-disaster-recovery.drawio.svg exists in the repository. The relative path ../../public/gitlab-disaster-recovery.drawio.svg from the Chinese documentation location correctly resolves to the diagram file.

docs/en/solutions/How_to_perform_disaster_recovery_for_gitlab.md (2)

412-412: Verify markdown link fragment matches target heading.

Line 412 references the internal link [Primary-Secondary Switchover Procedure in Disaster Scenarios](#primary-secondary-switchover-procedure-in-disaster-scenarios) which correctly targets the section heading ## Primary-Secondary Switchover Procedure in Disaster Scenarios at line 452. The link fragment will correctly resolve when rendered.

Also applies to: 452-452
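
A fragment check like this one can be approximated from the command line. The sketch below assumes a GitHub-style slug rule for ASCII headings (lowercase, punctuation dropped, spaces turned into hyphens); the documentation site's generator may apply different rules, and non-ASCII headings such as the Chinese one are out of scope. The function name `slugify` is hypothetical.

```shell
# Approximate a GitHub-style heading slug for ASCII headings: lowercase the
# text, drop everything except letters, digits, spaces and hyphens, then
# turn each space into a hyphen.
slugify() {
  printf '%s\n' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9 -]//g' -e 's/ /-/g'
}

# Example: compute the slug for the heading referenced at line 412.
slugify 'Primary-Secondary Switchover Procedure in Disaster Scenarios'
```

Comparing the printed slug with the text after `#` in the link shows whether the fragment would resolve under this rule.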


41-41: Diagram file reference is valid and the file exists.

The diagram file gitlab-disaster-recovery.drawio.svg exists at docs/public/gitlab-disaster-recovery.drawio.svg, which correctly resolves from the relative path ../../public/gitlab-disaster-recovery.drawio.svg referenced in line 41. No action required.
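
The same existence check can be reproduced locally. This is a small sketch, assuming it is run from the repository root; the helper name `link_target_exists` is hypothetical.

```shell
# Succeed when a link target written relative to a Markdown file exists.
# $1: path of the Markdown file, $2: relative link target from that file.
link_target_exists() {
  [ -f "$(dirname "$1")/$2" ]
}

# Example: check the diagram reference from the English GitLab guide.
if link_target_exists \
  docs/en/solutions/How_to_perform_disaster_recovery_for_gitlab.md \
  ../../public/gitlab-disaster-recovery.drawio.svg; then
  echo "diagram found"
else
  echo "diagram missing"
fi
```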

docs/zh/solutions/How_to_perform_disaster_recovery_for_nexus.md (13)

1-16: Well-structured metadata and clear problem statement.

The front matter is properly formatted, and the problem section clearly communicates the architecture approach and scope. The note that users must implement their own address-switching mechanism appropriately sets expectations.


22-33: Comprehensive and well-organized terminology section.

All key concepts (RPO, RTO, failover, etc.) are pre-defined before use, making the document accessible to operators less familiar with DR terminology.


35-64: Clear architecture explanation with well-defined components.

The hot-data/cold-compute pattern and five-step failover procedure are clearly explained. The approach of pre-deploying Nexus with zero replicas in the standby cluster is sound.


65-85: Sound prerequisites with practical consistency recommendations.

The guidance to maintain consistent naming and pool names across environments significantly simplifies failover procedures. The prerequisite list is realistic and appropriately references external Ceph DR documentation.


261-282: Realistic recovery objectives with transparent component breakdown.

The RPO correctly identifies dependency on Ceph RBD sync intervals and the RTO appropriately acknowledges manual procedures with realistic time estimates. The DNS propagation caveat for external routing is a thoughtful detail that operators should consider.


283-305: Clear and sequential disaster switchover procedure with verification steps.

The six-step failover sequence is logical and includes appropriate status checks. The example output showing PVC in "Bound" state provides operators with a concrete success criterion. Deferring external address switch to the final step is the correct approach.
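
The "Bound" success criterion could also be checked from a script rather than read from the table output. This is a sketch, assuming `kubectl` is already configured for the secondary cluster; the helper names `pvc_phase` and `pvc_is_bound` are hypothetical and do not appear in the reviewed document.

```shell
# Print the phase of a PVC, for example "Bound" or "Pending".
# $1: namespace, $2: PVC name.
pvc_phase() {
  kubectl -n "$1" get pvc "$2" -o jsonpath='{.status.phase}'
}

# Succeed only when the PVC reports the Bound phase.
# $1: namespace, $2: PVC name.
pvc_is_bound() {
  [ "$(pvc_phase "$1" "$2")" = "Bound" ]
}
```

A runbook step could then call `pvc_is_bound "$NEXUS_NAMESPACE" "$NEXUS_PVC_NAME"` and stop the switchover when it fails.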


306-312: Appropriate scope disclaimer with production readiness caution.

The warning to verify DR capability of alternative block storage and conduct testing before production use is sound operational guidance.


68-75: Grammar suggestion from static analysis appears to be a false positive.

The static analysis tool suggested using "一致地命名" (naming consistently) instead of "一致的 ... 名称" (consistent ... names), but the current phrasing is grammatically correct. The context is a list of noun phrases describing attributes (consistent naming, not the act of naming), so using "一致的" as an adjective is appropriate.


96-111: YAML structure appears valid, but verify storage class naming.

The Nexus YAML example is syntactically correct. However, line 110 shows storageClass: name: ceph-rdb — note whether this should be ceph-rbd (RBD = RADOS Block Device). This may be environment-specific, but the common Ceph RBD storage class naming convention uses "rbd" not "rdb". Please verify this matches your actual storage class names.
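
To see how widely the `ceph-rdb` spelling is used before deciding whether it is a typo, a recursive search is enough. This is a sketch only: the function name `find_rdb_spelling` is hypothetical, and it reports matches without rewriting anything, since the name may be intentional in a given environment.

```shell
# List every line under a directory that contains the spelling "ceph-rdb".
# Output uses grep's path:line:content format. $1: directory to scan.
find_rdb_spelling() {
  grep -rn 'ceph-rdb' "$1"
}
```

Running `find_rdb_spelling docs/` from the repository root shows each occurrence with its file and line number, which can then be compared against the storage class names actually deployed.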


115-170: Command examples use standard kubectl and Ceph syntax with realistic output samples.

The bash and kubectl commands follow standard conventions. The output examples (e.g., VolumeReplication status at lines 141-143 and rbd mirror image status at lines 157-170) are realistic and would help operators verify correct setup.

Confirm that the Ceph RBD Mirror command syntax (rbd mirror snapshot schedule ls, rbd mirror image status) and the expected output formats match the actual tools available in the target environment (Rook-Ceph version 4.x).


1-312: Comprehensive and well-structured disaster recovery documentation with strong operational guidance.

This documentation provides a complete runbook for Nexus disaster recovery using Ceph RBD Mirror replication. Strengths include:

  • Clear architecture explanation with terminology glossary
  • Defensive warnings preventing common mistakes (premature Nexus creation, correct PV backup procedures)
  • Realistic RPO/RTO targets with transparent component breakdown
  • Sequential, actionable failover procedures with concrete status checks
  • Helpful distinction between manual procedures and their typical time costs

The document aligns well with the PR objective to provide disaster recovery guidance for multiple services (mentioned alongside GitLab and SonarQube).

Please verify the following before merging:

  1. External documentation links (docs.alauda.io/container_platform/4.1/...) are accessible and current
  2. Storage class naming in line 110 (ceph-rdb) is correct for your environment (vs. typical ceph-rbd)
  3. Ceph RBD Mirror command syntax and expected output formats match your actual Rook-Ceph/Ceph version
  4. The PV backup procedure (removing spec.claimRef at line 236) aligns with Kubernetes PV lifecycle best practices
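
For item 4, the manual edit of the exported PV file could be replaced by a filter. This is a minimal sketch, assuming the two-space indentation that `kubectl get pv -o yaml` emits, with `claimRef` as a direct child of `spec`; the function name `strip_claim_ref` is hypothetical and the fixture is illustrative, not taken from a real cluster.

```shell
# Remove the spec.claimRef block from a PV manifest read on standard input.
# Assumes two-space indentation, so the block starts with "  claimRef:" and
# its nested lines start with at least four spaces.
strip_claim_ref() {
  awk '
    /^  claimRef:/    { skip = 1; next }   # start of the block under spec
    skip && /^    /   { next }             # nested lines inside the block
                      { skip = 0; print }  # any other line ends the block
  '
}

# Example with an illustrative fixture.
printf '%s\n' \
  'spec:' \
  '  claimRef:' \
  '    name: demo' \
  '  persistentVolumeReclaimPolicy: Retain' \
  | strip_claim_ref
```

With the exported file from the backup step, the call would look like `strip_claim_ref < "pv-${PV}.yaml" > "pv-${PV}.clean.yaml"`.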

83-83: All external documentation links are currently accessible and functional.

Verification testing confirms all four referenced links to docs.alauda.io return HTTP 200 status codes:

  • #create-volumereplicationclass - ✓ accessible
  • Base dr_block.html page - ✓ accessible
  • #enable-mirror-for-pvc - ✓ accessible
  • #procedures-1 - ✓ accessible

The documentation URLs and anchor fragments are valid and pointing to existing content. No broken links or inaccessible endpoints were found.


86-171: Documentation commands and output formats are verified as accurate.

Both Ceph RBD Mirror commands reference correct syntax and output formats:

  1. rbd mirror snapshot schedule ls --pool $CEPH_BLOCK_POOL --recursive — Command syntax matches official Ceph documentation; output columns (POOL, NAMESPACE, IMAGE, SCHEDULE) and format examples ("every 1m") are accurate.

  2. rbd mirror image status $CEPH_BLOCK_POOL/$NEXUS_BLOCK_IMAGE_NAME — Command syntax and output structure (global_id, state, description, service, last_update, peer_sites) align with official Ceph documentation. State values ("up+stopped", "up+replaying") are documented and correct. The nested JSON in the description field reflects real Ceph behavior.

The practical setup instructions, YAML examples, and status-check outputs are technically sound and consistent with the Alauda Container Platform documentation pattern.
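
Operators who want to script the state check could extract the `state` field instead of reading the full output. This is a sketch, assuming the plain-text layout described above, where the first `state:` line belongs to the local image; the function name `mirror_state` is hypothetical and the fixture is illustrative rather than captured from a real cluster.

```shell
# Print the value of the first "state:" field found in the plain-text output
# of `rbd mirror image status`, read on standard input.
mirror_state() {
  awk '$1 == "state:" { print $2; exit }'
}

# Example with an illustrative fixture.
printf '%s\n' \
  'demo-image:' \
  '  global_id:   00000000-0000-0000-0000-000000000000' \
  '  state:       up+replaying' \
  | mirror_state
```

Against a live cluster the call would be `rbd mirror image status "$CEPH_BLOCK_POOL/$NEXUS_BLOCK_IMAGE_NAME" | mirror_state`.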

Comment on lines +172 to +260
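### Set Up the Secondary Nexus

:::warning
While Ceph RBD is in the secondary state, the replicated block images cannot be mounted, so Nexus cannot be deployed successfully in the secondary cluster.

To verify that Nexus can be deployed in the secondary cluster, temporarily promote the secondary cluster's Ceph RBD to primary, then demote it back to secondary once testing is complete. Also delete all Nexus, PV, and PVC resources created during the test.
:::

1. Back up the Secret used by the primary Nexus
2. Back up the PVC and PV resource YAML of the primary cluster's Nexus components
3. Back up the Nexus resource YAML of the primary cluster's Nexus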

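#### Back Up the Secret Used by the Primary Nexus

Get the YAML of the password Secret used by the primary Nexus, and create the Secret in the namespace of the same name in the secondary cluster.

```yaml
apiVersion: v1
data:
  password: xxxxxx
kind: Secret
metadata:
  name: nexus-root-password
  namespace: nexus-dr
type: Opaque
```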

#### Back Up the PVC and PV Resources of the Primary Nexus Components

:::tip
The PV resource stores the volume attributes. This information is critical for disaster recovery and must be backed up carefully.

```yaml
volumeAttributes:
  clusterID: rook-ceph
  imageFeatures: layering
  imageFormat: "2"
  imageName: csi-vol-459e6f28-a158-4ae9-b5da-163448c35119
  journalPool: myblock
  pool: myblock
  storage.kubernetes.io/csiProvisionerIdentity: 1763446982673-7963-rook-ceph.rbd.csi.ceph.com
```

:::

Run the following commands to back up the PVC and PV resources of the primary Nexus components to the current directory:

```bash
export NEXUS_PVC_NAME=<PVC_NAME>

echo "=> Exporting PVC $NEXUS_PVC_NAME"

# Export the PVC
kubectl -n "$NEXUS_NAMESPACE" get pvc "$NEXUS_PVC_NAME" -o yaml > "pvc-${NEXUS_PVC_NAME}.yaml"

# Look up the PV bound to the PVC
PV=$(kubectl -n "$NEXUS_NAMESPACE" get pvc "$NEXUS_PVC_NAME" -o jsonpath='{.spec.volumeName}')

if [[ -n "$PV" ]]; then
  echo " ↳ Exporting PV $PV"
  kubectl get pv "$PV" -o yaml > "pv-${PV}.yaml"
fi
```

Modify the backed-up PV file and delete all `spec.claimRef` fields in the YAML.

Create the backed-up PVC and PV YAML files directly in the namespace of the same name in the disaster recovery environment.

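#### Back Up the Primary Nexus Instance YAML

```bash
kubectl -n "$NEXUS_NAMESPACE" get nexus "$NEXUS_NAME" -oyaml > nexus.yaml
```

Modify the information in `nexus.yaml` to match the actual disaster recovery environment.

:::warning
The `Nexus` resource does **not** need to be created in the disaster recovery environment right away. Create it in the secondary cluster only when a disaster occurs and the switchover is performed.
:::

:::warning
To run a disaster recovery drill, follow the steps in [Disaster Switchover](#灾难切换). After the drill, complete the following cleanup in the disaster recovery environment:

- Delete the `Nexus` instance in the disaster recovery environment
- Delete the PVCs and PVs that were created
- Switch Ceph RBD back to the secondary state

:::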

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Kubernetes PersistentVolume claimRef removal reuse across clusters procedure


<web_search_result>
Short procedure to remove a PVC claimRef and reuse the underlying volume (including across clusters)

Important notes up front

  • Always snapshot/backup the underlying volume before you touch PV/PVC metadata. Cloud disks and CSI details vary and you can lose data. [1][2]
  • Dynamic volumes with reclaimPolicy: Delete will remove the cloud disk when the PVC is deleted — change to Retain first. [1][4]

Steps (cluster A → cluster B)

  1. Snapshot/backup the cloud volume (provider console or CSI snapshot). [1]
  2. In cluster A: change the PV reclaimPolicy to Retain so Kubernetes will not delete the backing disk.
    kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}' [3][4]
  3. Ensure the pod using the PVC is stopped and the PVC is deleted (or will be deleted). Wait until the PV status becomes Released. [1][5]
  4. Remove the claimRef from the PV metadata (or at minimum remove the uid field). Two common approaches:
    • Edit the PV and remove the entire spec.claimRef block: run kubectl edit pv <pv-name>, then delete the claimRef section. [3][4]
    • Safer: edit and only remove the uid/resourceVersion lines inside claimRef so a new PVC in the same namespace can bind predictably. Some docs recommend removing only uid. [3][4]
  5. At this point the PV should become Available (or you can recreate a PV manifest that points to the same underlying disk but without claimRef). If you plan to move across clusters, you usually must detach the cloud disk and create a new PV in cluster B that references the same disk (provider-specific: GCE PD, AWS EBS, Azure Disk, or CSI volumeHandle). Example: create a PV manifest in cluster B with the correct volume identifier or CSI volumeHandle and desired access mode/storageClass. [1][6]
  6. In cluster B: apply the PV (or let dynamic provisioning re-create if using a snapshot + new PVC). Then create a matching PVC in cluster B (same size/accessModes/storageClass) so it binds to the PV. Verify Bound. [1][5]

Additional cautions and tips

  • Many providers require detaching the disk from cluster A nodes before attaching to cluster B; follow provider instructions (GCP/AWS/Azure). [1]
  • If the PV was managed by a CSI driver, prefer using CSI snapshot/restore or provider tooling to move volumes rather than manual claimRef edits. CSI volumeHandle values are required when creating a PV that points to an existing volume. [1][6]
  • If you find a finalizer blocking PVC deletion (pvc-protection), remove it only after you are certain the PV reclaim policy is Retain and you have a backup. [4]

References
[1] Kubernetes/GKE docs on PersistentVolumes and reclaiming volumes.
[2] Provider guidance: snapshot/backup before operations (GKE/AWS/Azure docs).
[3] StackOverflow / community steps to patch PV to Retain and remove claimRef.
[4] Kubernetes/OKD docs on manual reclaim and reclaimPolicy semantics.
[5] Alibaba Cloud ACK docs & migration notes (example of removing claimRef and setting Retain).
[6] CSI/provider migration recommendations for creating PVs that point to existing volumes.

</web_search_result>


The documented PV backup and recovery procedure is incomplete; critical operational steps are missing that could result in data loss.

While removing spec.claimRef is the correct approach for reusing PVs across clusters, the documentation omits several essential steps:

  1. Missing reclaimPolicy change — Before removing the PVC, the PV's reclaimPolicy should be set to Retain to prevent the backing volume from being deleted
  2. No status verification — The procedure should wait for the PV status to transition to Released before reusing it
  3. CSI volumeHandle considerations — For Ceph RBD CSI volumes, the volumeHandle in the PV spec needs to be preserved or correctly referenced in the new cluster; simply removing claimRef is insufficient without this context

The section should explicitly document:

  • Patch the PV to set persistentVolumeReclaimPolicy: Retain before PVC deletion
  • Verify the PV reaches Released status after PVC deletion
  • Include guidance on CSI volumeHandle requirements when creating a new PV in the standby cluster that points to the same RBD image
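
As a rough illustration of these points, the sketch below uses only standard `kubectl patch` and `kubectl get -o jsonpath` calls. The function names `retain_pv` and `pv_summary` are hypothetical, and whether the documented Nexus procedure needs these steps depends on whether the primary PVC is ever deleted.

```bash
#!/usr/bin/env bash
# Both functions are hypothetical helpers built from standard kubectl calls.

# Set the PV reclaim policy to Retain so that deleting the PVC leaves the
# backing volume in place.
retain_pv() {
  kubectl patch pv "$1" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
}

# Print reclaim policy, phase, and CSI volumeHandle on one tab-separated line,
# e.g. to confirm the PV is Released and to record the handle for the standby PV.
pv_summary() {
  kubectl get pv "$1" -o jsonpath='{.spec.persistentVolumeReclaimPolicy}{"\t"}{.status.phase}{"\t"}{.spec.csi.volumeHandle}{"\n"}'
}

# Example usage:
#   retain_pv "$PV" && pv_summary "$PV"
```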
