
Can't provision new replicas #835

Open
@gregory-schiano

Description

Steps to reproduce

  1. On an existing PostgreSQL cluster with a 194 GB database, create a backup
  2. Deploy a new PostgreSQL charm with 1 unit
  3. Restore the backup of the previous cluster on the new cluster
  4. Add an extra unit

Expected behavior

The unit is provisioned and becomes part of the cluster as a replica.

Actual behavior

The unit stays in "executing" with the status message "awaiting for member to start".
See juju show-status-log:

Time                   Type       Status     Message
15 Apr 2025 15:14:44Z  juju-unit  executing  running s3-parameters-relation-joined hook for big-postgresql-backups-s3-integrator/0
15 Apr 2025 15:14:46Z  juju-unit  executing  running upgrade-relation-changed hook
15 Apr 2025 15:14:47Z  juju-unit  executing  running database-peers-relation-changed hook
15 Apr 2025 15:14:50Z  juju-unit  executing  running restart-relation-joined hook for big-postgresql/3
15 Apr 2025 15:14:52Z  juju-unit  executing  running database-relation-joined hook for remote-02e36cfa4a0f46408c03afa363dbe8d3/0
15 Apr 2025 15:14:55Z  juju-unit  executing  running upgrade-relation-joined hook for big-postgresql/3
15 Apr 2025 15:14:57Z  juju-unit  executing  running database-relation-changed hook for remote-02e36cfa4a0f46408c03afa363dbe8d3/0
15 Apr 2025 15:15:00Z  juju-unit  executing  running upgrade-relation-changed hook for big-postgresql/3
15 Apr 2025 15:15:02Z  juju-unit  executing  running database-peers-relation-joined hook for big-postgresql/3
15 Apr 2025 15:15:04Z  juju-unit  executing  running database-peers-relation-changed hook for big-postgresql/3
15 Apr 2025 15:15:07Z  juju-unit  executing  running restart-relation-changed hook for big-postgresql/3
15 Apr 2025 15:15:10Z  juju-unit  executing  running s3-parameters-relation-changed hook for big-postgresql-backups-s3-integrator/0
15 Apr 2025 15:15:13Z  juju-unit  executing  running database-relation-joined hook for remote-02e36cfa4a0f46408c03afa363dbe8d3/1
15 Apr 2025 15:15:17Z  juju-unit  executing  running database-relation-changed hook for remote-02e36cfa4a0f46408c03afa363dbe8d3/1
15 Apr 2025 15:15:20Z  juju-unit  executing  running database-relation-joined hook for remote-02e36cfa4a0f46408c03afa363dbe8d3/2
15 Apr 2025 15:15:23Z  juju-unit  executing  running database-relation-changed hook for remote-02e36cfa4a0f46408c03afa363dbe8d3/2
15 Apr 2025 15:15:26Z  juju-unit  executing  running database-relation-joined hook for remote-02e36cfa4a0f46408c03afa363dbe8d3/3
15 Apr 2025 15:15:29Z  juju-unit  executing  running database-relation-changed hook for remote-02e36cfa4a0f46408c03afa363dbe8d3/3
15 Apr 2025 15:15:32Z  juju-unit  idle
15 Apr 2025 15:42:17Z  workload   waiting    awaiting for member to start

Versions

Operating system: Ubuntu 22.04.5 LTS

Juju CLI: 3.6.3-ubuntu-amd64

Juju agent: 3.1.8

Charm revision: 405

LXD: N/A

Log output

Juju debug log:

unit-big-postgresql-6: 16:17:07 DEBUG unit.big-postgresql/6.juju-log ops 2.12.0 up and running.
unit-big-postgresql-6: 16:17:07 DEBUG unit.big-postgresql/6.juju-log Re-emitting deferred event <ConfigChangedEvent via PostgresqlOperatorCharm/on/config_changed[33]>.
unit-big-postgresql-6: 16:17:07 DEBUG unit.big-postgresql/6.juju-log Starting new HTTP connection (1): 10.141.37.164:8008
unit-big-postgresql-6: 16:17:07 DEBUG unit.big-postgresql/6.juju-log http://10.141.37.164:8008 "GET /cluster HTTP/1.1" 200 None
unit-big-postgresql-6: 16:17:07 DEBUG unit.big-postgresql/6.juju-log Starting new HTTP connection (1): 10.141.37.164:8008
unit-big-postgresql-6: 16:17:07 DEBUG unit.big-postgresql/6.juju-log http://10.141.37.164:8008 "GET /cluster HTTP/1.1" 200 None
unit-big-postgresql-6: 16:17:07 DEBUG unit.big-postgresql/6.juju-log Defer on_config_changed: Cannot connect to database
unit-big-postgresql-6: 16:17:07 DEBUG unit.big-postgresql/6.juju-log Deferring <ConfigChangedEvent via PostgresqlOperatorCharm/on/config_changed[33]>.
unit-big-postgresql-6: 16:17:07 DEBUG unit.big-postgresql/6.juju-log Re-emitting deferred event <SecretChangedEvent via PostgresqlOperatorCharm/on/secret_changed[41]>.
unit-big-postgresql-6: 16:17:07 DEBUG unit.big-postgresql/6.juju-log Starting new HTTP connection (1): 10.141.37.164:8008
unit-big-postgresql-6: 16:17:07 DEBUG unit.big-postgresql/6.juju-log http://10.141.37.164:8008 "GET /cluster HTTP/1.1" 200 None
unit-big-postgresql-6: 16:17:07 DEBUG unit.big-postgresql/6.juju-log Starting new HTTP connection (1): 10.141.37.164:8008
unit-big-postgresql-6: 16:17:07 DEBUG unit.big-postgresql/6.juju-log http://10.141.37.164:8008 "GET /cluster HTTP/1.1" 200 None
unit-big-postgresql-6: 16:17:07 DEBUG unit.big-postgresql/6.juju-log Building PostgreSQL parameters for profile='production' and available_memory=16765079552
unit-big-postgresql-6: 16:17:07 DEBUG unit.big-postgresql/6.juju-log Partner addresses: {'10.141.37.41'}
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log Starting new HTTP connection (1): 10.141.37.164:8008
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log http://10.141.37.164:8008 "GET /health HTTP/1.1" 503 None
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log Early exit update_config: Patroni not started yet
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log Starting new HTTP connection (1): 10.141.37.164:8008
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log http://10.141.37.164:8008 "GET /health HTTP/1.1" 503 None
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log Deferring on_peer_relation_changed: awaiting for member to start
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log Deferring <SecretChangedEvent via PostgresqlOperatorCharm/on/secret_changed[41]>.
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log Re-emitting deferred event <RelationChangedEvent via PostgresqlOperatorCharm/on/database_peers_relation_changed[62]>.
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log Starting new HTTP connection (1): 10.141.37.164:8008
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log http://10.141.37.164:8008 "GET /cluster HTTP/1.1" 200 None
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log Starting new HTTP connection (1): 10.141.37.164:8008
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log http://10.141.37.164:8008 "GET /cluster HTTP/1.1" 200 None
unit-ubuntu-advantage-69: 16:17:08 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log Building PostgreSQL parameters for profile='production' and available_memory=16765079552
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log Partner addresses: {'10.141.37.41'}
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log Starting new HTTP connection (1): 10.141.37.164:8008
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log http://10.141.37.164:8008 "GET /health HTTP/1.1" 503 None
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log Early exit update_config: Patroni not started yet
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log Starting new HTTP connection (1): 10.141.37.164:8008
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log http://10.141.37.164:8008 "GET /health HTTP/1.1" 503 None
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log Deferring on_peer_relation_changed: awaiting for member to start
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log Deferring <RelationChangedEvent via PostgresqlOperatorCharm/on/database_peers_relation_changed[62]>.
unit-big-postgresql-6: 16:17:08 DEBUG unit.big-postgresql/6.juju-log Re-emitting deferred event <RelationChangedEvent via PostgresqlOperatorCharm/on/database_peers_relation_changed[90]>.
unit-big-postgresql-6: 16:17:09 DEBUG unit.big-postgresql/6.juju-log Starting new HTTP connection (1): 10.141.37.164:8008
unit-big-postgresql-6: 16:17:09 DEBUG unit.big-postgresql/6.juju-log http://10.141.37.164:8008 "GET /cluster HTTP/1.1" 200 None
unit-big-postgresql-6: 16:17:09 DEBUG unit.big-postgresql/6.juju-log Starting new HTTP connection (1): 10.141.37.164:8008
unit-big-postgresql-6: 16:17:09 DEBUG unit.big-postgresql/6.juju-log http://10.141.37.164:8008 "GET /cluster HTTP/1.1" 200 None
unit-big-postgresql-6: 16:17:09 DEBUG unit.big-postgresql/6.juju-log Building PostgreSQL parameters for profile='production' and available_memory=16765079552
unit-big-postgresql-6: 16:17:09 DEBUG unit.big-postgresql/6.juju-log Partner addresses: {'10.141.37.41'}
unit-big-postgresql-6: 16:17:09 DEBUG unit.big-postgresql/6.juju-log Starting new HTTP connection (1): 10.141.37.164:8008
unit-big-postgresql-6: 16:17:09 DEBUG unit.big-postgresql/6.juju-log http://10.141.37.164:8008 "GET /health HTTP/1.1" 503 None
unit-big-postgresql-6: 16:17:09 DEBUG unit.big-postgresql/6.juju-log Early exit update_config: Patroni not started yet
unit-big-postgresql-6: 16:17:09 DEBUG unit.big-postgresql/6.juju-log Starting new HTTP connection (1): 10.141.37.164:8008
unit-big-postgresql-6: 16:17:09 DEBUG unit.big-postgresql/6.juju-log http://10.141.37.164:8008 "GET /health HTTP/1.1" 503 None
unit-big-postgresql-6: 16:17:09 DEBUG unit.big-postgresql/6.juju-log Deferring on_peer_relation_changed: awaiting for member to start
unit-big-postgresql-6: 16:17:09 DEBUG unit.big-postgresql/6.juju-log Deferring <RelationChangedEvent via PostgresqlOperatorCharm/on/database_peers_relation_changed[90]>.
unit-big-postgresql-6: 16:17:09 DEBUG unit.big-postgresql/6.juju-log Emitting Juju event update_status.
unit-big-postgresql-6: 16:17:09 DEBUG unit.big-postgresql/6.juju-log Starting new HTTP connection (1): 10.141.37.164:8008
unit-big-postgresql-6: 16:17:09 DEBUG unit.big-postgresql/6.juju-log http://10.141.37.164:8008 "GET /health HTTP/1.1" 503 None
unit-big-postgresql-6: 16:17:14 INFO unit.big-postgresql/6.juju-log restarted PostgreSQL because it was not running

Patroni logs on the second unit:

2025-04-15 15:17:41 UTC [32809]: INFO: trying to bootstrap from leader 'big-postgresql-3'
2025-04-15 15:17:41 UTC [32809]: ERROR: Error when fetching backup: pg_basebackup exited with code=1
2025-04-15 15:17:41 UTC [32809]: WARNING: Trying again in 5 seconds
2025-04-15 15:17:46 UTC [32809]: ERROR: Error when fetching backup: pg_basebackup exited with code=1
2025-04-15 15:17:46 UTC [32809]: ERROR: failed to bootstrap from leader 'big-postgresql-3'
2025-04-15 15:17:46 UTC [32809]: INFO: Removing data directory: /var/snap/charmed-postgresql/common/var/lib/postgresql
2025-04-15 15:17:56 UTC [32809]: INFO: Removing data directory: /var/snap/charmed-postgresql/common/var/lib/postgresql
2025-04-15 15:18:01 UTC [32809]: INFO: Lock owner: big-postgresql-3; I am big-postgresql-6
2025-04-15 15:18:01 UTC [32809]: INFO: trying to bootstrap from leader 'big-postgresql-3'
2025-04-15 15:18:01 UTC [32809]: INFO: Lock owner: big-postgresql-3; I am big-postgresql-6
2025-04-15 15:18:01 UTC [32809]: INFO: bootstrap from leader 'big-postgresql-3' in progress
2025-04-15 15:18:01 UTC [32809]: ERROR: Error when fetching backup: pg_basebackup exited with code=1

PostgreSQL logs on the primary unit:

2025-04-15 15:18:11 UTC [411975]: user=replication,db=[unknown],app=[unknown],client=10.141.37.164,line=1 FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
2025-04-15 15:18:16 UTC [412360]: user=replication,db=[unknown],app=[unknown],client=10.141.37.164,line=1 FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
2025-04-15 15:18:21 UTC [412364]: user=replication,db=[unknown],app=[unknown],client=10.141.37.164,line=1 FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
2025-04-15 15:18:26 UTC [412365]: user=replication,db=[unknown],app=[unknown],client=10.141.37.164,line=1 FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
2025-04-15 15:18:31 UTC [412367]: user=replication,db=[unknown],app=[unknown],client=10.141.37.164,line=1 FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
2025-04-15 15:18:36 UTC [412380]: user=replication,db=[unknown],app=[unknown],client=10.141.37.164,line=1 FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
2025-04-15 15:18:41 UTC [412382]: user=replication,db=[unknown],app=[unknown],client=10.141.37.164,line=1 FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
2025-04-15 15:18:46 UTC [412383]: user=replication,db=[unknown],app=[unknown],client=10.141.37.164,line=1 FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
2025-04-15 15:18:51 UTC [412385]: user=replication,db=[unknown],app=[unknown],client=10.141.37.164,line=1 FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
2025-04-15 15:18:56 UTC [412386]: user=replication,db=[unknown],app=[unknown],client=10.141.37.164,line=1 FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
2025-04-15 15:19:34 UTC [392567]: user=replication,db=[unknown],app=pg_basebackup,client=10.141.37.13,line=14 STATEMENT:  BASE_BACKUP LABEL 'pg_basebackup base backup' PROGRESS   NOWAIT    MANIFEST 'yes'
2025-04-15 15:19:34 UTC [392567]: user=replication,db=[unknown],app=pg_basebackup,client=10.141.37.13,line=15 FATAL:  connection to client lost
2025-04-15 15:19:36 UTC [412422]: user=replication,db=[unknown],app=pg_basebackup,client=10.141.37.164,line=1 LOG:  received replication command: SHOW data_directory_mode
2025-04-15 15:19:36 UTC [412422]: user=replication,db=[unknown],app=pg_basebackup,client=10.141.37.164,line=2 STATEMENT:  SHOW data_directory_mode
2025-04-15 15:19:36 UTC [412422]: user=replication,db=[unknown],app=pg_basebackup,client=10.141.37.164,line=3 LOG:  received replication command: SHOW wal_segment_size
2025-04-15 15:19:36 UTC [412422]: user=replication,db=[unknown],app=pg_basebackup,client=10.141.37.164,line=4 STATEMENT:  SHOW wal_segment_size
2025-04-15 15:19:36 UTC [412422]: user=replication,db=[unknown],app=pg_basebackup,client=10.141.37.164,line=5 LOG:  received replication command: IDENTIFY_SYSTEM
2025-04-15 15:19:36 UTC [412422]: user=replication,db=[unknown],app=pg_basebackup,client=10.141.37.164,line=6 STATEMENT:  IDENTIFY_SYSTEM
2025-04-15 15:19:36 UTC [412422]: user=replication,db=[unknown],app=pg_basebackup,client=10.141.37.164,line=7 LOG:  received replication command: BASE_BACKUP LABEL 'pg_basebackup base backup' PROGRESS   NOWAIT    MANIFEST 'yes'
2025-04-15 15:19:36 UTC [412422]: user=replication,db=[unknown],app=pg_basebackup,client=10.141.37.164,line=8 STATEMENT:  BASE_BACKUP LABEL 'pg_basebackup base backup' PROGRESS   NOWAIT    MANIFEST 'yes'

Journalctl logs on the second unit:

Apr 15 15:52:42 juju-007f82-prod-chat-synapse-db-61 systemd[1]: snap.charmed-postgresql.patroni.service: Failed with result 'timeout'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit snap.charmed-postgresql.patroni.service has entered the 'failed' state with result 'timeout'.
Apr 15 15:52:42 juju-007f82-prod-chat-synapse-db-61 systemd[1]: snap.charmed-postgresql.patroni.service: State 'final-sigterm' timed out. Killing.
Apr 15 15:52:42 juju-007f82-prod-chat-synapse-db-61 systemd[1]: snap.charmed-postgresql.patroni.service: Killing process 40275 (python3) with signal SIGKILL.
Apr 15 15:52:42 juju-007f82-prod-chat-synapse-db-61 systemd[1]: snap.charmed-postgresql.patroni.service: Killing process 41463 (pg_basebackup) with signal SIGKILL.
Apr 15 15:52:42 juju-007f82-prod-chat-synapse-db-61 systemd[1]: snap.charmed-postgresql.patroni.service: Killing process 40292 (python3) with signal SIGKILL.
Apr 15 15:52:42 juju-007f82-prod-chat-synapse-db-61 systemd[1]: snap.charmed-postgresql.patroni.service: Killing process 40295 (n/a) with signal SIGKILL.

Additional context

The max_wal_senders exhaustion is a cause of Patroni retrying, not a consequence of it: terminating the stale processes with SELECT pg_terminate_backend(PID); does not help, as extra WAL sender connections keep appearing.
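
For reference, the WAL sender backends filling the slots can be watched directly on the primary. A minimal sketch using psycopg2; the host is the primary's IP from the logs above, and the operator credentials are placeholders that would come from the charm's secrets in a real deployment:

import psycopg2

# Placeholders: host is the primary's IP seen in the logs; the operator
# password must be fetched from the charm's secrets, not hard-coded.
conn = psycopg2.connect(host="10.141.37.41", dbname="postgres",
                        user="operator", password="<operator-password>")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT pid, application_name, client_addr, state, backend_start
        FROM pg_stat_replication
        ORDER BY backend_start;
    """)
    for pid, app, addr, state, started in cur.fetchall():
        print(pid, app, addr, state, started)
    # Freeing a slot manually only helps briefly, because the retry loop
    # described below keeps opening new replication connections:
    # cur.execute("SELECT pg_terminate_backend(%s);", (pid,))
conn.close()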

The systemd logs show the process being killed after a "stop" timeout, but additional logs (which I did not keep) showed that "something" was stopping the Patroni service.

Searching the charm code for the juju debug-log message "restarted PostgreSQL because it was not running" pointed me to:
https://github.com/canonical/postgresql-operator/blob/rev405/src/charm.py#L1286
The line just before it restarts Patroni, which matches the observed behavior!

The method is called from:
https://github.com/canonical/postgresql-operator/blob/rev405/src/charm.py#L1243
and is triggered by the update-status event.

My conclusion is that Patroni starts by taking a base backup of the primary unit. Because of the database size this takes a long time, and an update-status hook is triggered while the backup job is still running. The charm interprets the service as unhealthy and restarts Patroni. After 90 seconds systemd considers the process as not shutting down properly and sends SIGKILL to the Patroni service and its subprocesses. After the restart, Patroni retries the backup, and the cycle repeats.
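
To make the interaction concrete, here is a rough sketch of the kind of guard that would avoid it. The names are illustrative, not the charm's actual code; it is only an illustration of the failure mode under the assumptions above:

import subprocess
import requests

PATRONI_URL = "http://10.141.37.164:8008"  # unit IP taken from the debug log above


def basebackup_in_progress() -> bool:
    # A pg_basebackup process on the unit means the bootstrap from the leader
    # is still copying data.
    return subprocess.run(["pgrep", "-x", "pg_basebackup"],
                          capture_output=True).returncode == 0


def member_started() -> bool:
    # Patroni's /health endpoint returns 200 only once the local member is up;
    # during bootstrap it returns 503, as seen in the debug log.
    try:
        return requests.get(f"{PATRONI_URL}/health", timeout=5).status_code == 200
    except requests.RequestException:
        return False


def on_update_status() -> None:
    if member_started():
        return  # healthy, nothing to do
    if basebackup_in_progress():
        # Bootstrap is still running: restarting Patroni here would make systemd
        # SIGKILL pg_basebackup and restart the 194 GB copy from scratch.
        return
    print("restarting Patroni")  # stand-in for the charm's restart logic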

Applied workaround

Change the update-status frequency in the model config to give Patroni enough time to take the base backup and initiate replication:
juju model-config update-status-hook-interval=40m
Once Patroni is done, restore the initial update-status interval:
juju model-config update-status-hook-interval=5m
Then force-trigger an update-status hook to refresh the charm status:
juju exec -u big-postgresql/6 -- JUJU_DISPATCH_PATH='hooks/update-status' ./dispatch
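
To tell when Patroni is done before restoring the interval, one option is to poll Patroni's REST API until the new member reports a running or streaming state. A small sketch, assuming the :8008 endpoint seen in the debug log and that the member name follows the unit name:

import time
import requests

PATRONI_URL = "http://10.141.37.164:8008"  # any cluster member works
NEW_MEMBER = "big-postgresql-6"

while True:
    members = requests.get(f"{PATRONI_URL}/cluster", timeout=10).json()["members"]
    state = next((m["state"] for m in members if m["name"] == NEW_MEMBER), None)
    print(f"{NEW_MEMBER}: {state}")
    if state in ("running", "streaming"):
        break
    time.sleep(60)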
