add recovery guide for two dc deployment #9804

@@ -105,3 +105,13 @@ PD is configurable using command-line flags and environment variables.

- The address of Prometheus Pushgateway, which does not push data to Prometheus by default.
- Default: `""`

## `--force-new-cluster`

- Forces the creation of a new cluster using the current nodes.
- Default: `false`
- It is recommended to use this flag only to recover the service when PD has lost most of its replicas. Note that doing so might cause data loss.
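
For example, a minimal sketch of restarting a surviving PD node with this flag; the node name, data directory, and URLs below are placeholder assumptions rather than values from this document:

```shell
# Force this PD node to form a new cluster from its local data.
# Replace the name, data directory, and URLs with the values of your deployment.
pd-server --force-new-cluster \
    --name="pd-1" \
    --data-dir="/data/pd-1" \
    --client-urls="http://0.0.0.0:2379" \
    --peer-urls="http://0.0.0.0:2380"
```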

## `-V`, `--version`

- Output the version of PD and then exit.
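
For example, checking the version of the binary:

```shell
# Print the PD version information and exit.
pd-server -V
```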

@@ -463,6 +463,26 @@ success!

> - The argument of the `-p` option specifies the PD endpoints without the `http` prefix. Specifying the PD endpoints is to query whether the specified `region_id` is validated or not.
> - You need to run this command for all stores where specified Regions' peers are located.

### Recover ACID-inconsistent data

To recover data that breaks ACID consistency, for example after the loss of most replicas or incomplete data replication, you can use the `reset-to-version` command. When using this command, you need to provide an old version number at which ACID consistency is guaranteed. `tikv-ctl` then cleans up all data written after the specified version.

- The `-v` option specifies the version number to recover to. To get the value of the `-v` option, you can use the `pd-ctl min-resolved-ts` command in PD Control, as shown in the sketch below.
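
A minimal sketch of querying that value, assuming PD is reachable at `127.0.0.1:2379` (the exact output format can vary between versions):

```shell
# Query the cluster-wide minimum resolved timestamp from PD.
# The returned timestamp can then be passed to `tikv-ctl reset-to-version -v`.
pd-ctl -u http://127.0.0.1:2379 min-resolved-ts
```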

```shell
tikv-ctl --host 127.0.0.1:20160 reset-to-version -v 430315739761082369
```

```
success!
```

> **Note:**
>
> - The preceding command only supports the online mode. Before running the command, you need to stop the processes that write data to TiKV, such as TiDB. After the command runs successfully, `success!` is returned in the output.
> - You need to run the same command for all TiKV nodes in the cluster.
> - Stop all PD scheduling tasks before running the command (a possible way is sketched below).
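
One possible way to pause scheduling with pd-ctl, as a hedged sketch (the scheduler names and the pause duration are examples, and your cluster might run additional schedulers):

```shell
# Inside pd-ctl, pause the main scheduling tasks for 300 seconds.
# Resume them afterwards with `scheduler resume <scheduler-name>`.
scheduler pause balance-leader-scheduler 300
scheduler pause balance-region-scheduler 300
scheduler pause balance-hot-region-scheduler 300
```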

### Ldb Command

The `ldb` command line tool offers multiple data access and database administration commands. Some examples are listed below. For more information, refer to the help message displayed when running `tikv-ctl ldb` or check the RocksDB documentation.

@@ -212,7 +212,6 @@ The replication mode is controlled by PD. You can configure the replication mode

```
primary-replicas = 2
dr-replicas = 1
wait-store-timeout = "1m"
wait-sync-timeout = "1m"
```

- Method 2: If you have deployed a cluster, use pd-ctl commands to modify the configurations of PD, as in the sketch below.
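
For example, a minimal sketch of such pd-ctl commands; the label key `dc` is an assumption for illustration, not a value taken from this document:

```shell
# Inside pd-ctl, switch a running cluster to the DR auto-sync replication mode
# and set the label key that distinguishes the two data centers.
config set replication-mode dr-auto-sync
config set replication-mode dr-auto-sync label-key dc
```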

@@ -274,14 +273,34 @@ The details for the status switch are as follows:

### Disaster recovery

This section introduces the disaster recovery solution of the two data centers (DCs) in one city deployment. The disaster discussed in this section refers to the situation where the primary DC fails as a whole, or where multiple TiKV nodes in the primary or secondary DC fail, causing the loss of most replicas and a service shutdown.

> **Tip:**
>
> If you need support for disaster recovery, you can contact the TiDB team for a recovery solution.

#### Overall failure of the primary data center

In this situation, all Regions have lost the replicas in the primary DC, that is, most of their replicas, so the cluster is down. To recover the service, you need to use the secondary DC. The recovery capability is determined by the replication status before the failure (you can query this status as shown in the sketch after the following list):

- If the cluster was in the synchronous replication mode before the failure (the status code is `sync` or `async_wait`), you can use the secondary DC to recover the data with `RPO = 0`.

- If the cluster was in the asynchronous replication mode before the failure (the status code is `async`), after you recover the primary DC with the data of the secondary DC, the data that was written to the primary DC in the asynchronous replication mode but not yet replicated to the secondary DC before the failure is lost. A typical scenario is that the primary DC disconnects from the secondary DC, switches to the asynchronous replication mode, and provides service for a while before it fails as a whole.

- If the cluster was in the synchronous recovery mode before the failure (the status code is `sync-recover`), after you use the secondary DC to recover the service, some data written by the primary DC in the asynchronous replication mode might be lost. This might break ACID consistency, so you need to additionally recover the ACID-inconsistent data. A typical scenario is that the primary DC disconnects from the secondary DC, some data is written to the primary DC in the asynchronous replication mode, and the connection is then restored; before the asynchronous replication between the primary and secondary DCs catches up, the primary DC fails as a whole.
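
A minimal sketch of querying that status, assuming a PD node is reachable at `127.0.0.1:2379` (the exact JSON fields can differ between versions):

```shell
# Query the current replication mode and its state from PD.
# The "state" field corresponds to the status codes mentioned above,
# such as sync, async_wait, async, and sync-recover.
curl http://127.0.0.1:2379/pd/api/v1/replication_mode/status
```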

The process of disaster recovery is as follows. A consolidated command-line sketch of steps 2 through 6 is provided after this procedure.

1. Stop all PD, TiKV, and TiDB services of the secondary DC.

2. Start the PD nodes of the secondary DC in the single replica mode with the [`--force-new-cluster`](/command-line-flags-for-pd-configuration.md#--force-new-cluster) flag.

    > Review comment: What is single replica mode? It is not mentioned anywhere.

3. Follow the instructions in [Online Unsafe Recovery](/online-unsafe-recovery.md) to process the TiKV data of the secondary DC. The parameter is the list of all Store IDs in the primary DC.

4. Write a new placement rule configuration file and apply it using [PD Control](/pd-control.md). In the configuration file, set the Voter replica count of the Regions to the same number of replicas that the original cluster placed in the secondary DC.

5. Start the PD and TiKV services of the primary DC.

6. To perform an ACID-consistent data recovery (when the `DR_STATE` status recorded by the old PD is `sync-recover`), use [the `reset-to-version` command of TiKV Control](/tikv-control.md#recover-acid-inconsistent-data) to process the TiKV data. The `version` parameter can be obtained by running `pd-ctl min-resolved-ts` in PD Control.

7. Start the TiDB service in the primary DC and check the data integrity and consistency.
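
The following is a minimal, non-authoritative sketch of what steps 2 through 6 can look like on the command line. All addresses, node names, store IDs, and file paths are placeholder assumptions, and the exact pd-ctl and tikv-ctl subcommands available depend on the TiDB version in use:

```shell
# Step 2: restart a surviving PD node of the secondary DC, forcing it to form
# a new cluster from its local data (example name, directory, and URLs).
pd-server --force-new-cluster \
    --name="pd-dr" \
    --data-dir="/data/pd-dr" \
    --client-urls="http://0.0.0.0:2379" \
    --peer-urls="http://0.0.0.0:2380"

# Step 3: remove the failed stores of the primary DC from the Region metadata
# (Online Unsafe Recovery). "1,4,5" stands for the Store IDs in the primary DC.
pd-ctl -u http://127.0.0.1:2379 unsafe remove-failed-stores 1,4,5
pd-ctl -u http://127.0.0.1:2379 unsafe remove-failed-stores show

# Step 4: apply a placement rule file that keeps only the Voter replicas
# hosted by the secondary DC (the content of rules.json is deployment-specific).
pd-ctl -u http://127.0.0.1:2379 config placement-rules save --in=rules.json

# Step 6: if the old DR_STATE was sync-recover, trim the data on every TiKV
# node of the secondary DC back to an ACID-consistent version.
pd-ctl -u http://127.0.0.1:2379 min-resolved-ts
tikv-ctl --host 127.0.0.1:20160 reset-to-version -v <version>
```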