
add recovery guide for two dc deployment #9804


Draft: wants to merge 4 commits into master

10 changes: 10 additions & 0 deletions command-line-flags-for-pd-configuration.md
@@ -105,3 +105,13 @@ PD is configurable using command-line flags and environment variables.

- The address of Prometheus Pushgateway, which does not push data to Prometheus by default.
- Default: `""`

## `--force-new-cluster`

- Forcibly creates a new cluster using the current nodes.
- Default: `false`
- It is recommended to use this flag only to recover the service when PD loses the majority of its replicas. Note that using this flag might cause data loss.
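
A minimal usage sketch follows; the node name, data directory, and client URL are placeholder values for illustration only, not values from this document.

```shell
# Restart a surviving PD node so that it forcibly forms a new cluster on its own.
pd-server --name="pd-1" \
    --data-dir="/path/to/pd-data" \
    --client-urls="http://0.0.0.0:2379" \
    --force-new-cluster
```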

## `-V`, `--version`

- Output the version of PD and then exit.
20 changes: 20 additions & 0 deletions tikv-control.md
@@ -463,6 +463,26 @@ success!
> - The argument of the `-p` option specifies the PD endpoints without the `http` prefix. The PD endpoints are used to query whether the specified `region_id` is validated or not.
> - You need to run this command for all stores where specified Regions' peers are located.

### Recover ACID-inconsistent data

To recover data that breaks ACID consistency, for example, after the loss of most replicas or incomplete data replication, you can use the `reset-to-version` command. When using this command, you need to provide an old version number at which ACID consistency is guaranteed. `tikv-ctl` then cleans up all data written after the specified version.

- The `-v` option specifies the version number to recover to. To get the value for the `-v` option, you can run the `pd-ctl min-resolved-ts` command in PD Control.
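
For example, a sketch of querying the minimum resolved timestamp (the PD address is a placeholder; adjust it to your deployment):

```shell
# Query the minimum resolved timestamp from PD and use it as the value of -v.
pd-ctl -u http://127.0.0.1:2379 min-resolved-ts
```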

```shell
tikv-ctl --host 127.0.0.1:20160 reset-to-version -v 430315739761082369
```

```
success!
```

> **Note:**
>
> - The preceding command only supports the online mode. Before running the command, you need to stop all processes that write data to TiKV, such as the TiDB processes. After the command runs successfully, `success!` is returned in the output.
> - You need to run the same command for all TiKV nodes in the cluster.
> - Stop all PD scheduling tasks before running the command.

### Ldb Command

The `ldb` command line tool offers multiple data access and database administration commands. Some examples are listed below. For more information, refer to the help message displayed when running `tikv-ctl ldb` or check the documents from RocksDB.
33 changes: 26 additions & 7 deletions two-data-centers-in-one-city-deployment.md
@@ -212,7 +212,6 @@ The replication mode is controlled by PD. You can configure the replication mode
primary-replicas = 2
dr-replicas = 1
wait-store-timeout = "1m"
```

- Method 2: If you have deployed a cluster, use pd-ctl commands to modify the configurations of PD.
@@ -274,14 +273,34 @@ The details for the status switch are as follows:

### Disaster recovery

This section introduces the disaster recovery solution for the two data centers (DCs) in one city deployment. The disaster discussed in this section refers to the situation in which the primary DC fails as a whole, or multiple TiKV nodes in the primary or secondary DC fail, causing the loss of most replicas and a service outage.

> **Tip:**
>
> If you need support for disaster recovery, contact the TiDB team for a recovery solution.

#### Overall failure of the primary data center

In this situation, all Regions in the primary DC have lost most of their replicas, so the cluster is down. To recover the service, you need to use the secondary DC. The recovery capability is determined by the replication status before the failure:

- If the cluster before the failure is in the synchronous replication mode (the status code is `sync` or `async_wait`), you can use the secondary DC to recover data with `RPO = 0`.

- If the cluster before the failure is in the asynchronous replication mode (the status code is `async`), after you recover the service with the data of the secondary DC, the data that was written in the asynchronous replication mode but not yet replicated from the primary DC to the secondary DC before the failure is lost. A typical scenario is that the primary DC disconnects from the secondary DC, switches to the asynchronous replication mode, and provides service for a while before the overall failure.

- If the cluster before the failure is switching from the asynchronous to the synchronous replication mode (the status code is `sync-recover`), after you use the secondary DC to recover the service, some data that is asynchronously replicated from the primary DC to the secondary DC will be lost. This might break ACID consistency, and you need to recover the ACID-inconsistent data accordingly. A typical scenario is that the primary DC disconnects from the secondary DC and writes some data in the asynchronous replication mode; after the connection is recovered, the data is being replicated asynchronously to the secondary DC when errors occur again and cause the primary DC to fail as a whole.
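
For reference, the following sketch shows one way to check the current replication status while the cluster is still healthy. The PD address is a placeholder, and the `replication_mode/status` endpoint is assumed to be available in your PD version, so verify it against the PD API documentation.

```shell
# Query the dr-auto-sync replication status from PD; the returned state corresponds to
# the status codes mentioned above (sync, async_wait, async, sync-recover).
curl http://127.0.0.1:2379/pd/api/v1/replication_mode/status
```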

The process of disaster recovery is as follows:

1. Stop all PD, TiKV, and TiDB services of the secondary DC.

2. Start the PD nodes of the secondary DC from a single replica using the [`--force-new-cluster`](/command-line-flags-for-pd-configuration.md#--force-new-cluster) flag.

3. Follow the instructions in [Online Unsafe Recovery](/online-unsafe-recovery.md) to process the TiKV data of the secondary DC. The parameter to pass is the list of all Store IDs in the primary DC (see the example after this list).

4. Write a new placement rule configuration file and apply it using [PD Control](/pd-control.md) (see the example after this list). In the configuration file, the Voter replica count of the Regions is the same as that of the original cluster in the secondary DC.

5. Start the PD and TiKV services of the secondary DC.

6. To perform an ACID-consistent data recovery (that is, when the `DR_STATE` status in the old PD is `sync-recover`), use [the `reset-to-version` command of TiKV Control](/tikv-control.md#recover-acid-inconsistent-data) to process the TiKV data. The value of the `-v` option can be obtained by running `pd-ctl min-resolved-ts` in PD Control.

7. Start the TiDB service in the secondary DC and check data integrity and consistency.
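
As a rough illustration of steps 3 and 4, the following sketch shows how the related PD Control commands might be invoked. The PD address, Store IDs, and rule file name are placeholders for illustration only; adapt them to your deployment and follow the Online Unsafe Recovery and placement rules documents for the exact procedure.

```shell
# Step 3 (sketch): remove the failed stores of the primary DC from the Region metadata.
# 1,4,5 are placeholder Store IDs of the TiKV nodes in the primary DC.
pd-ctl -u http://127.0.0.1:2379 unsafe remove-failed-stores 1,4,5

# Check the progress of Online Unsafe Recovery.
pd-ctl -u http://127.0.0.1:2379 unsafe remove-failed-stores show

# Step 4 (sketch): apply a new placement rule configuration file (rules.json is a placeholder).
pd-ctl -u http://127.0.0.1:2379 config placement-rules save --in=rules.json
```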