From 190a308dfb4b7f542c230620afb6cd9cfba4dae4 Mon Sep 17 00:00:00 2001 From: Aolin Date: Thu, 4 Aug 2022 16:45:58 +0800 Subject: [PATCH 1/4] add recovery guide --- command-line-flags-for-pd-configuration.md | 10 +++++++ tikv-control.md | 20 ++++++++++++++ two-data-centers-in-one-city-deployment.md | 31 +++++++++++++++++----- 3 files changed, 54 insertions(+), 7 deletions(-) diff --git a/command-line-flags-for-pd-configuration.md b/command-line-flags-for-pd-configuration.md index c12a03b8a3bf2..44f2f28a02bd6 100644 --- a/command-line-flags-for-pd-configuration.md +++ b/command-line-flags-for-pd-configuration.md @@ -105,3 +105,13 @@ PD is configurable using command-line flags and environment variables. - The address of Prometheus Pushgateway, which does not push data to Prometheus by default. - Default: `""` + +## `--force-new-cluster` + +- Force to create a new cluster using current nodes. +- Default: `false` +- It is recommended to use this flag only when recovering services due to PD losing most replicas, which might cause data loss. + +## `-V`, `--version` + +- Output the version of PD and then exit. \ No newline at end of file diff --git a/tikv-control.md b/tikv-control.md index ccce9656a2bd9..c8498d283765a 100644 --- a/tikv-control.md +++ b/tikv-control.md @@ -463,6 +463,26 @@ success! > - The argument of the `-p` option specifies the PD endpoints without the `http` prefix. Specifying the PD endpoints is to query whether the specified `region_id` is validated or not. > - You need to run this command for all stores where specified Regions' peers are located. +### Recover from ACID inconsistency data + +To recover data from ACID inconsistency, such as the loss of most replicas or incomplete data synchronization, you can use the `reset-to-version` command. When using this command, you need to provide an old version number that can promise the ACID consistency. Then, `tikv-ctl` cleans up all data after the specified version. + +- The `-v` option is used to specify the version number to restore. To get the value of the `-v` parameter, you can use the `pd-ctl min-resolved-ts` command. + +```shell +tikv-ctl --host 127.0.0.1:20160 reset-to-version -v 430315739761082369 +``` + +``` +success! +``` + +> **Note:** +> +> - The preceding command only supports the online mode. Before executing the command, you need to stop processes that will write data to TiKV, such as TiDB. After the command is executed successfully, it will return `success!`. +> - You need to execute the same command for all TiKV nodes in the cluster. +> - All PD scheduling tasks should be stopped before executing the command. + ### Ldb Command The `ldb` command line tool offers multiple data access and database administration commands. Some examples are listed below. For more information, refer to the help message displayed when running `tikv-ctl ldb` or check the documents from RocksDB. diff --git a/two-data-centers-in-one-city-deployment.md b/two-data-centers-in-one-city-deployment.md index 7ab247c531800..a47fbf16c50ea 100644 --- a/two-data-centers-in-one-city-deployment.md +++ b/two-data-centers-in-one-city-deployment.md @@ -212,7 +212,6 @@ The replication mode is controlled by PD. You can configure the replication mode primary-replicas = 2 dr-replicas = 1 wait-store-timeout = "1m" - wait-sync-timeout = "1m" ``` - Method 2: If you have deployed a cluster, use pd-ctl commands to modify the configurations of PD. 
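For reference, a minimal sketch of what Method 2 might look like with pd-ctl is shown below. It mirrors the `primary-replicas` and `dr-replicas` keys from the TOML above; the PD endpoint is a placeholder, and the nested `dr-auto-sync` key syntax is an assumption that should be verified against the PD Control reference.

```shell
# Placeholder PD endpoint; point pd-ctl at one PD address of your cluster.
pd-ctl -u http://127.0.0.1:2379 config set replication-mode dr-auto-sync
# Assumed nested-key form for the dr-auto-sync settings shown in the TOML above.
pd-ctl -u http://127.0.0.1:2379 config set replication-mode dr-auto-sync primary-replicas 2
pd-ctl -u http://127.0.0.1:2379 config set replication-mode dr-auto-sync dr-replicas 1
```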
@@ -274,14 +273,32 @@ The details for the status switch are as follows: ### Disaster recovery -This section introduces the disaster recovery solution of the two data centers in one city deployment. +This section introduces the disaster recovery solution of the two data centers in one city deployment. The disaster discussed in this section is the overall failure of the primary data center, or multiple TiKV nodes in the primary/secondary data center fail, resulting in the loss of most replicas and it is unable to provide services. -When a disaster occurs to a cluster in the synchronous replication mode, you can perform data recovery with `RPO = 0`: +#### Overall failure of the primary data center -- If the primary data center fails and most of the Voter replicas are lost, but complete data exists in the DR data center, the lost data can be recovered from the DR data center. At this time, manual intervention is required with professional tools. You can contact the TiDB team for a recovery solution. +In this situation, all Regions in the primary data center have lost most of their replicas, so the cluster is unable to use. At this time, it is necessary to use the secondary data center to recover the service. The replication status before failure determines the recovery ability: -- If the DR center fails and a few Voter replicas are lost, the cluster automatically switches to the asynchronous replication mode. +- If the status before failure is in the synchronous replication mode (the status code is `sync` or `async_wait`), you can use the secondary data center to recover using `RPO = 0`. -When a disaster occurs to a cluster that is not in the synchronous replication mode and you cannot perform data recovery with `RPO = 0`: +- If the status before failure is in the asynchronous replication mode (the status code is `async`), the written data in the primary data center in the asynchronous replication mode is lost after using the secondary data center to recover. A typical scenario is that the primary data center disconnects from the secondary data center and the primary data center switches to the asynchronous replication mode and provides service for a while before the overall failure. -- If most of the Voter replicas are lost, manual intervention is required with professional tools. You can contact the TiDB team for a recovery solution. \ No newline at end of file +- If the status before failure is switching from the asynchronous to synchronous (the status code is `sync-recover`), part of the written data in the primary data center in the asynchronous replication mode is lost after using the secondary data center to recover. This might cause the ACID inconsistency, and you need to recover it additionally. A typical scenario is that the primary data center disconnects from the secondary data center, the connection is restored after switching to the asynchronous mode, and data is written. But during the data synchronization between primary and secondary, something goes wrong and causes the overall failure of the primary data center. + +The process of disaster recovery is as follows: + +1. Stop all PD, TiKV, and TiDB services of the secondary data center. + +2. Start PD nodes of the secondary data center using the single replica mode with the [`--force-new-cluster`](/command-line-flags-for-pd-configuration.md#--force-new-cluster) flag. + +3. 
Use the [Online Unsafe Recovery](/online-unsafe-recovery.md) to process the TiKV data in the secondary data center and the parameters are the list of all Store IDs in the primary data center. + +4. Write a new configuration of placement rule using [PD Control](/pd-control.md), and the Voter replica configuration of the Region is the same as the original cluster in the secondary data center. + +5. Start the PD and TiKV services of the primary data center. + +6. To recover ACID consistency (the status of `DR_STATE` in the old PD is `sync-recover`), you can use [`reset-to-version`](/tikv-control.md#recover-from-acid-inconsistency-data) to process TiKV data and the `version` parameter used can be obtained from `pd-ctl min-resolved-ts`. + +7. Start the TiDB service in the primary data center and check the data integrity and consistency. + +If you need support for disaster recovery, you can contact the TiDB team for a recovery solution. From 10c7d5949e2bbaf02850f96a404b9362b156094f Mon Sep 17 00:00:00 2001 From: Aolin Date: Mon, 8 Aug 2022 14:12:10 +0800 Subject: [PATCH 2/4] apply suggestions from code review --- tikv-control.md | 12 ++++----- two-data-centers-in-one-city-deployment.md | 30 ++++++++++++---------- 2 files changed, 22 insertions(+), 20 deletions(-) diff --git a/tikv-control.md b/tikv-control.md index c8498d283765a..a1a6819d49bce 100644 --- a/tikv-control.md +++ b/tikv-control.md @@ -463,11 +463,11 @@ success! > - The argument of the `-p` option specifies the PD endpoints without the `http` prefix. Specifying the PD endpoints is to query whether the specified `region_id` is validated or not. > - You need to run this command for all stores where specified Regions' peers are located. -### Recover from ACID inconsistency data +### Recover ACID-inconsistent data -To recover data from ACID inconsistency, such as the loss of most replicas or incomplete data synchronization, you can use the `reset-to-version` command. When using this command, you need to provide an old version number that can promise the ACID consistency. Then, `tikv-ctl` cleans up all data after the specified version. +To recover data that breaks ACID consistency, such as the loss of most replicas or incomplete data replication, you can use the `reset-to-version` command. When using this command, you need to provide an old version number that can guarantee ACID consistency. Then, `tikv-ctl` cleans up all data after the specified version. -- The `-v` option is used to specify the version number to restore. To get the value of the `-v` parameter, you can use the `pd-ctl min-resolved-ts` command. +- The `-v` option is used to specify the version number to recover. To get the value of the `-v` option, you can use the `pd-ctl min-resolved-ts` command in PD Control. ```shell tikv-ctl --host 127.0.0.1:20160 reset-to-version -v 430315739761082369 ``` ``` success! ``` > **Note:** > -> - The preceding command only supports the online mode. Before executing the command, you need to stop processes that will write data to TiKV, such as TiDB. After the command is executed successfully, it will return `success!`. -> - You need to execute the same command for all TiKV nodes in the cluster. -> - All PD scheduling tasks should be stopped before executing the command. +> - The preceding command only supports the online mode. Before running the command, you need to stop the processes that will write data to TiKV, such as TiDB. After the command is run successfully, `success!` is returned in the output.
+> - You need to run the same command for all TiKV nodes in the cluster. +> - Stop all PD scheduling tasks before running the command. ### Ldb Command diff --git a/two-data-centers-in-one-city-deployment.md b/two-data-centers-in-one-city-deployment.md index a47fbf16c50ea..8ff51a9f0fdb7 100644 --- a/two-data-centers-in-one-city-deployment.md +++ b/two-data-centers-in-one-city-deployment.md @@ -273,32 +273,34 @@ The details for the status switch are as follows: ### Disaster recovery -This section introduces the disaster recovery solution of the two data centers in one city deployment. The disaster discussed in this section is the overall failure of the primary data center, or multiple TiKV nodes in the primary/secondary data center fail, resulting in the loss of most replicas and it is unable to provide services. +This section introduces the disaster recovery solution of the two data centers (DCs) in one city deployment. The disaster discussed in this section refers to the situation where the primary DC fails as a whole, or multiple TiKV nodes in the primary/secondary DC fail, causing the loss of most replicas and service shutdown. + +> **Tip:** +> +> If you need support for disaster recovery, you can contact the TiDB team for a recovery solution. #### Overall failure of the primary data center -In this situation, all Regions in the primary data center have lost most of their replicas, so the cluster is unable to use. At this time, it is necessary to use the secondary data center to recover the service. The replication status before failure determines the recovery ability: +In this situation, all Regions in the primary DC have lost most of their replicas, so the cluster is down. At this time, to recover the service, the secondary DC is needed. The recovery capability is determined by the replication status before the failure: -- If the status before failure is in the synchronous replication mode (the status code is `sync` or `async_wait`), you can use the secondary data center to recover using `RPO = 0`. +- If the cluster before failure is in the synchronous replication mode (the status code is `sync` or `async_wait`), you can use the secondary DC to recover with `RPO = 0`. -- If the status before failure is in the asynchronous replication mode (the status code is `async`), the written data in the primary data center in the asynchronous replication mode is lost after using the secondary data center to recover. A typical scenario is that the primary data center disconnects from the secondary data center and the primary data center switches to the asynchronous replication mode and provides service for a while before the overall failure. +- If the cluster before failure is in the asynchronous replication mode (the status code is `async`), after recovering the primary DC with the data of the secondary DC, the data written from the primary DC to the secondary DC before the failure in the asynchronous replication mode will be lost. A typical scenario is that the primary DC disconnects from the secondary DC and the primary DC switches to the asynchronous replication mode and provides service for a while before the overall failure. -- If the status before failure is switching from the asynchronous to synchronous (the status code is `sync-recover`), part of the written data in the primary data center in the asynchronous replication mode is lost after using the secondary data center to recover. This might cause the ACID inconsistency, and you need to recover it additionally. 
A typical scenario is that the primary data center disconnects from the secondary data center, the connection is restored after switching to the asynchronous mode, and data is written. But during the data synchronization between primary and secondary, something goes wrong and causes the overall failure of the primary data center. +- If the cluster before failure is in synchronous recovery mode (the status code is `sync-recover`). After using the secondary DC to recover the service, some data written by the primary DC in the asynchronous replication mode might be lost. This might break the ACID consistency and you need to recover the ACID-inconsistent data additionally. A typical scenario is that the primary DC disconnects from the secondary DC and the connection is recovered after some data is written to the primary DC in the asynchronous replication mode. But during the asynchronous replication between primary and secondary, something goes wrong and causes the primary DC to fail as a whole. The process of disaster recovery is as follows: 1. Stop all PD, TiKV, and TiDB services of the secondary data center. - -2. Start PD nodes of the secondary data center using the single replica mode with the [`--force-new-cluster`](/command-line-flags-for-pd-configuration.md#--force-new-cluster) flag. +1. Stop all PD, TiKV, and TiDB services of the secondary DC. -3. Use the [Online Unsafe Recovery](/online-unsafe-recovery.md) to process the TiKV data in the secondary data center and the parameters are the list of all Store IDs in the primary data center. +2. Start PD nodes of the secondary DC in the single replica mode with the [`--force-new-cluster`](/command-line-flags-for-pd-configuration.md#--force-new-cluster) flag. -4. Write a new configuration of placement rule using [PD Control](/pd-control.md), and the Voter replica configuration of the Region is the same as the original cluster in the secondary data center. +3. Follow the instructions in [Online Unsafe Recovery](/online-unsafe-recovery.md) to process the TiKV data of the secondary DC. The parameters are the list of all Store IDs in the primary DC. -5. Start the PD and TiKV services of the primary data center. +4. Write a new placement rule configuration file and use it in [PD Control](/pd-control.md). In the configuration file, the Voter replica count of the Region is the same as that of the original cluster in the secondary DC. -6. To recover ACID consistency (the status of `DR_STATE` in the old PD is `sync-recover`), you can use [`reset-to-version`](/tikv-control.md#recover-from-acid-inconsistency-data) to process TiKV data and the `version` parameter used can be obtained from `pd-ctl min-resolved-ts`. +5. Start the PD and TiKV services of the primary DC. -7. Start the TiDB service in the primary data center and check the data integrity and consistency. +6. To perform an ACID-consistent data recovery (the status of `DR_STATE` in the old PD is `sync-recover`), you can use [the `reset-to-version` command of TiKV Control](/tikv-control.md#recover-acid-inconsistent-data) to process TiKV data. The `version` parameter can be obtained by running `pd-ctl min-resolved-ts` in PD Control, as shown in the example below. -If you need support for disaster recovery, you can contact the TiDB team for a recovery solution. +7. Start the TiDB service in the primary DC and check the data integrity and consistency.
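For step 6 above, the following is a minimal sketch of obtaining the version and applying it on each TiKV node. The addresses are placeholders, and the exact output format of `min-resolved-ts` is an assumption that should be verified against your pd-ctl version.

```shell
# Query the minimum resolved timestamp from PD (placeholder endpoint).
pd-ctl -u http://127.0.0.1:2379 min-resolved-ts
# Assumed output shape: {"is_real_time": true, "min_resolved_ts": 430315739761082369}

# Pass the returned timestamp to reset-to-version on every TiKV node.
tikv-ctl --host 127.0.0.1:20160 reset-to-version -v 430315739761082369
tikv-ctl --host 127.0.0.2:20160 reset-to-version -v 430315739761082369
```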
From 2cb14261ebe111488a10c85a29a8e249cc58d78e Mon Sep 17 00:00:00 2001 From: Aolin Date: Mon, 8 Aug 2022 18:13:26 +0800 Subject: [PATCH 3/4] apply suggestions from code review --- command-line-flags-for-pd-configuration.md | 4 ++-- tikv-control.md | 2 +- two-data-centers-in-one-city-deployment.md | 10 +++++----- 3 files changed, 8 insertions(+), 8 deletions(-) diff --git a/command-line-flags-for-pd-configuration.md b/command-line-flags-for-pd-configuration.md index 44f2f28a02bd6..444b5fe08b272 100644 --- a/command-line-flags-for-pd-configuration.md +++ b/command-line-flags-for-pd-configuration.md @@ -108,9 +108,9 @@ PD is configurable using command-line flags and environment variables. ## `--force-new-cluster` -- Force to create a new cluster using current nodes. +- Forcibly creates a new cluster using current nodes. - Default: `false` -- It is recommended to use this flag only when recovering services due to PD losing most replicas, which might cause data loss. +- It is recommended to use this flag only for recovering services when PD loses most of its replicas, which might cause data loss. ## `-V`, `--version` diff --git a/tikv-control.md b/tikv-control.md index a1a6819d49bce..e45b34bf74334 100644 --- a/tikv-control.md +++ b/tikv-control.md @@ -479,7 +479,7 @@ success! > **Note:** > -> - The preceding command only supports the online mode. Before running the command, you need to stop the processes that will write data to TiKV, such as TiDB. After the command is run successfully, `success!` is returned in the output. +> - The preceding command only supports the online mode. Before running the command, you need to stop the processes that will write data to TiKV, such as the TiDB processes. After the command is run successfully, `success!` is returned in the output. > - You need to run the same command for all TiKV nodes in the cluster. > - Stop all PD scheduling tasks before running the command. diff --git a/two-data-centers-in-one-city-deployment.md b/two-data-centers-in-one-city-deployment.md index 8ff51a9f0fdb7..6d25980f17aff 100644 --- a/two-data-centers-in-one-city-deployment.md +++ b/two-data-centers-in-one-city-deployment.md @@ -277,27 +277,27 @@ This section introduces the disaster recovery solution of the two data centers ( > **Tip:** > -> If you need support for disaster recovery, you can contact the TiDB team for a recovery solution. +> If you need support for disaster recovery, contact the TiDB team for a recovery solution. #### Overall failure of the primary data center In this situation, all Regions in the primary DC have lost most of their replicas, so the cluster is down. At this time, to recover the service, the secondary DC is needed. The recovery capability is determined by the replication status before the failure: -- If the cluster before failure is in the synchronous replication mode (the status code is `sync` or `async_wait`), you can use the secondary DC to recover with `RPO = 0`. +- If the cluster before failure is in the synchronous replication mode (the status code is `sync` or `async_wait`), you can use the secondary DC to recover data with `RPO = 0`. - If the cluster before failure is in the asynchronous replication mode (the status code is `async`), after recovering the primary DC with the data of the secondary DC, the data written from the primary DC to the secondary DC before the failure in the asynchronous replication mode will be lost. 
A typical scenario is that the primary DC disconnects from the secondary DC and the primary DC switches to the asynchronous replication mode and provides service for a while before the overall failure. -- If the cluster before failure is in synchronous recovery mode (the status code is `sync-recover`). After using the secondary DC to recover the service, some data written by the primary DC in the asynchronous replication mode might be lost. This might break the ACID consistency and you need to recover the ACID-inconsistent data additionally. A typical scenario is that the primary DC disconnects from the secondary DC and the connection is recovered after some data is written to the primary DC in the asynchronous replication mode. But during the asynchronous replication between primary and secondary, something goes wrong and causes the primary DC to fail as a whole. +- If the cluster before failure is switching from asynchronous to synchronous mode (the status code is `sync-recover`), after using the secondary DC to recover the service, some data asynchronously replicated from the primary DC to the secondary DC will be lost. This might break the ACID consistency, and you need to recover the ACID-inconsistent data accordingly. A typical scenario is that the primary DC disconnects from the secondary DC. After some data is written to the primary DC in the asynchronous mode, the connection is recovered. But during the asynchronous replication, errors occur again and cause the primary DC to fail as a whole. The process of disaster recovery is as follows: 1. Stop all PD, TiKV, and TiDB services of the secondary DC. -2. Start PD nodes of the secondary DC in the single replica mode with the [`--force-new-cluster`](/command-line-flags-for-pd-configuration.md#--force-new-cluster) flag. +2. Start PD nodes of the secondary DC using a replica with the [`--force-new-cluster`](/command-line-flags-for-pd-configuration.md#--force-new-cluster) flag. 3. Follow the instructions in [Online Unsafe Recovery](/online-unsafe-recovery.md) to process the TiKV data of the secondary DC. The parameters are the list of all Store IDs in the primary DC. -4. Write a new placement rule configuration file and use it in [PD Control](/pd-control.md). In the configuration file, the Voter replica count of the Region is the same as that of the original cluster in the secondary DC. +4. Write a new placement rule configuration file and use it in [PD Control](/pd-control.md). In the configuration file, the Voter replica count of the Region is the same as that of the original cluster in the secondary DC. 5. Start the PD and TiKV services of the primary DC. From 8d2b228417aa986f55ffd37b65d22b09fc7e1aa1 Mon Sep 17 00:00:00 2001 From: Aolin Date: Mon, 8 Aug 2022 18:30:00 +0800 Subject: [PATCH 4/4] apply suggestions from code review --- two-data-centers-in-one-city-deployment.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/two-data-centers-in-one-city-deployment.md b/two-data-centers-in-one-city-deployment.md index 6d25980f17aff..27da6709bda25 100644 --- a/two-data-centers-in-one-city-deployment.md +++ b/two-data-centers-in-one-city-deployment.md @@ -293,7 +293,7 @@ The process of disaster recovery is as follows: 1. Stop all PD, TiKV, and TiDB services of the secondary DC. -2. Start PD nodes of the secondary DC using a replica with the [`--force-new-cluster`](/command-line-flags-for-pd-configuration.md#--force-new-cluster) flag. +2.
Start PD nodes of the secondary DC with a single replica using the [`--force-new-cluster`](/command-line-flags-for-pd-configuration.md#--force-new-cluster) flag. 3. Follow the instructions in [Online Unsafe Recovery](/online-unsafe-recovery.md) to process the TiKV data of the secondary DC. The parameters are the list of all Store IDs in the primary DC.
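To make steps 2 to 4 of the recovery flow more concrete, the following is a minimal sketch. The PD startup flags other than `--force-new-cluster`, the addresses, the Store IDs, and the `rules.json` file are placeholders, and the exact pd-ctl subcommand forms should be verified against the Online Unsafe Recovery and PD Control documents.

```shell
# Step 2: restart a PD node of the secondary DC as a new single-member cluster.
pd-server --name=pd-dr-1 \
    --data-dir=/data/pd-dr-1 \
    --client-urls=http://0.0.0.0:2379 \
    --peer-urls=http://0.0.0.0:2380 \
    --force-new-cluster

# Step 3: Online Unsafe Recovery, passing the Store IDs of the failed
# primary DC (1,2,3 are placeholders).
pd-ctl -u http://127.0.0.1:2379 unsafe remove-failed-stores 1,2,3

# Step 4: load a rewritten placement rule file whose Voter count matches the
# replicas kept in the secondary DC (rules.json is a file you prepare).
pd-ctl -u http://127.0.0.1:2379 config placement-rules save --in=rules.json
```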