Risk of data loss due to Patroni configuration defaults #1153

@nsakkos

During a recent outage, we observed data loss when a former primary with divergent history rejoined the cluster. I believe this is caused by the charm's default Patroni configuration, which lets Patroni remove and recreate a node's data directory when timelines have diverged.

Charm defaults for the Patroni configuration:

use_pg_rewind: true
remove_data_directory_on_rewind_failure: true
remove_data_directory_on_diverged_timelines: true

From the official Patroni documentation:

remove_data_directory_on_diverged_timelines: Patroni will remove the PostgreSQL data directory and recreate the replica if it notices that timelines are diverging and the former primary can not start streaming from the new primary.
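To make the risk easier to audit, here is a minimal sketch of a check that flags the two destructive options in an already-parsed Patroni configuration. The option names come from the defaults listed above. The helper name `destructive_settings` is hypothetical, and the assumption that the options sit under a `postgresql` section should be verified against the configuration file the charm actually renders.

```python
# Hypothetical audit helper, not part of Patroni or the charm.
# Assumption: the options live under the "postgresql" section of the parsed config.
DESTRUCTIVE_KEYS = (
    "remove_data_directory_on_rewind_failure",
    "remove_data_directory_on_diverged_timelines",
)


def destructive_settings(config):
    """Return the destructive options that are explicitly enabled."""
    postgresql = config.get("postgresql") or {}
    return [key for key in DESTRUCTIVE_KEYS if postgresql.get(key) is True]


# The charm defaults quoted in this issue, expressed as a parsed config.
charm_defaults = {
    "postgresql": {
        "use_pg_rewind": True,
        "remove_data_directory_on_rewind_failure": True,
        "remove_data_directory_on_diverged_timelines": True,
    }
}

for key in destructive_settings(charm_defaults):
    print(f"WARNING: {key} is enabled; a diverged node may wipe its data directory")
```

With both options set to `false`, a node that cannot rewind or stream from the new primary would presumably stay out of the cluster until an operator decides what to do with its data, which is the tradeoff this issue asks to have documented.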

Steps to reproduce

  1. Deploy a 3-unit Postgres cluster using the charm
  2. Manipulate the quorum so that two sub-clusters can both elect a leader
  3. Write data to the isolated primary
  4. Force a leadership change on the other primary (the one in the 2-node sub-cluster) to increment the timeline
  5. Restore connectivity and/or restart Patroni on the isolated primary, so that it rejoins the cluster
  6. Observe whether the isolated primary wipes its data directory when joining the cluster

Expected behavior

  • A diverged primary should not automatically discard its data without human intervention.
  • Documentation should clearly highlight the tradeoffs of the current defaults (risk of data loss vs. automatic cluster recovery).

Actual behavior

In the observed incident, a node broke away from the cluster but continued to receive writes from the consuming application. When it was manually restarted and attempted to rejoin, its timeline was behind that of the cluster's leader. As a result, it rejoined as a replica and discarded its local data to sync to the leader's timeline, which caused the data loss.
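As an illustration of how an operator might spot this situation before restarting the node, the sketch below compares each member's timeline with the leader's. It assumes the input is shaped like the JSON returned by Patroni's REST API `/cluster` endpoint (a `members` list with `name`, `role`, and `timeline` fields); the function name and the sample data are hypothetical. An isolated node may not appear in this output at all, so an empty result is not proof that no divergence exists.

```python
def members_off_leader_timeline(cluster):
    """Return names of members whose timeline differs from the leader's.

    Assumption: `cluster` looks like Patroni's /cluster response, with a
    "members" list whose entries carry "name", "role", and "timeline".
    """
    members = cluster.get("members", [])
    leader = next((m for m in members if m.get("role") == "leader"), None)
    if leader is None or leader.get("timeline") is None:
        return []  # no leader visible, so there is nothing to compare against
    return [
        m.get("name")
        for m in members
        if m is not leader
        and m.get("timeline") is not None
        and m["timeline"] != leader["timeline"]
    ]


# Hypothetical sample: one member is still on an older timeline.
sample = {
    "members": [
        {"name": "pg-1", "role": "leader", "timeline": 5},
        {"name": "pg-2", "role": "replica", "timeline": 5},
        {"name": "pg-0", "role": "replica", "timeline": 4},
    ]
}

for name in members_off_leader_timeline(sample):
    print(f"{name} is not on the leader's timeline; back up its data directory before it rejoins")
```

A timeline mismatch is only a hint that histories may have diverged. Deciding whether the node holds writes that exist nowhere else still requires inspecting its WAL or data by hand.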

Versions

The data-loss incident occurred on the following Juju/charm versions, but the default Patroni configuration is the same in newer versions of the charm.

Operating system:

Juju CLI: 3.6.8-ubuntu-amd64

Juju agent: 3.1.8

Charm revision: 331

LXD:

Log output

Juju debug log:

I apologize for the internal-only links:

https://pastebin.canonical.com/p/Cr92FPB8wZ/
https://pastebin.canonical.com/p/TNS6BYK4dd/

Additional context

Labels: bug (Something isn't working as expected)