Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
265 changes: 265 additions & 0 deletions upgrade-postgres-to-14/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,265 @@
# CircleCI Server: Postgres 12 → 14 upgrade script

`upgrade-postgres-to-14.sh` automates the on-disk PostgreSQL major-version upgrade inside your CircleCI Server installation. It renders and applies a one-shot Kubernetes Job that runs `pg_upgrade --link` against your existing Postgres PVC, then prints the helm values block to update and the `helm upgrade` command to run.

The full upgrade procedure — including platform-specific guidance on snapshots, rollback, and recovery — is documented separately at **docs.circleci.com** (CircleCI Server upgrade guide). This README focuses on operating the script and the end-to-end flow at a high level.

## What the script does

- Discovers your PVC name, postgres password secret, and (while postgres is still running) the source-cluster encoding/locale. Every value is overridable.
- Verifies that your application-layer deployments are scaled to 0 before doing anything.
- If the postgres StatefulSet is still running, captures encoding/locale from the live cluster and **scales postgres to 0** automatically, then waits for the underlying volume to detach.
- Renders a `pg_upgrade` Job manifest, applies it, and streams the Job's logs.
- Verifies completion by waiting for the Job's `Complete` condition.
- On success, prints the helm values block to paste into your CircleCI Server values file and the exact `helm upgrade` command to run.

## What the script does NOT do

- **It does not take backups.** Snapshotting the PVC or pre-provisioning a PVC clone is your responsibility and platform-specific (CSI VolumeSnapshot, your cloud provider's snapshot CLI, Velero, etc.). The script assumes you have a known-good restore path before you invoke it.
- **It does not scale your application layer.** You scale `layer=application` deployments to 0 before invoking the script. The script verifies this and refuses to proceed otherwise. (It does scale the postgres StatefulSet itself — see above.)
- **It does not run `helm upgrade`.** That step is yours — the script tells you exactly what to run. The `helm upgrade` afterward restores replicas to their correct counts for both postgres and your application layer.

## End-to-end upgrade procedure

1. **Pre-flight.**
- Confirm your cluster can pull the target `server-postgres:14.22.x` image. By default the script pulls from CircleCI's ACR (`cciserver.azurecr.io`); use `--dockerhub` if you pull from Docker Hub.
- Decide on a rollback strategy: a point-in-time snapshot of the postgres PVC, or a pre-provisioned PVC clone, or both. *How* you take either depends on your platform — see the detailed upgrade guide on docs.circleci.com.
- Source-cluster encoding and locale: in the standard flow, the script captures these for you (it queries `template1` while postgres is still running, then scales postgres to 0 itself). You don't need to query manually. If you'd like to verify ahead of time:
```
PGP=$(kubectl -n <ns> get secret <secret-name> \
-o jsonpath='{.data.postgres-password}' | base64 -d)
kubectl -n <ns> exec postgresql-0 -- env PGPASSWORD="$PGP" \
psql -U postgres -tAc \
"SELECT pg_encoding_to_char(encoding), datcollate, datctype \
FROM pg_database WHERE datname='template1'"
```
The script's built-in defaults are `UTF8` / `C.UTF-8` / `C.UTF-8`. Common alternative locales are `en_US.UTF-8` for `LC_COLLATE` / `LC_CTYPE` on some Bitnami builds. If you prefer to scale postgres down yourself (instead of letting the script do it), capture these values *before* scaling and pass them to the script via:
```
./upgrade-postgres-to-14.sh -n <ns> \
--initdb-encoding <encoding> \
--initdb-lc-collate <lc_collate> \
--initdb-lc-ctype <lc_ctype>
```
A locale or encoding mismatch causes `pg_upgrade` to abort partway through its consistency checks with an error naming the offending database and value.
- Plan a maintenance window. Total downtime depends on your database size, storage performance, the number of databases and extensions, and post-upgrade `VACUUM ANALYZE` time. We recommend at least 30-60 minutes.

2. **Take a backup.** Snapshot the PVC, or pre-provision a clone PVC, or both. The remaining steps assume you can restore from one of these if anything goes wrong.

3. **Quiesce the application layer.** Scale your application deployments to 0. Leave the postgres StatefulSet alone — the script will scale it down itself after capturing the live cluster's encoding/locale:
```
kubectl -n <ns> scale deploy -l layer=application --replicas=0
```
The script refuses to proceed if any `layer=application` deployment still has replicas > 0.

4. **Run the upgrade.** With the application layer quiesced:
```
./upgrade-postgres-to-14.sh -n <namespace>
```
See [Flags](#flags) for overrides. The script will:
- Query the live postgres pod for encoding/locale.
- Prompt to scale the postgres StatefulSet to 0, then wait for the underlying volume to detach (the cloud-side detach round-trip typically takes 30–90 seconds).
- Apply the `pg_upgrade` Job and stream its logs.
- Exit 0 once `pg_upgrade` reports `Upgrade Complete` and the new PG14 cluster is in place at `/bitnami/postgresql/data`.

If you'd rather scale postgres down yourself, do so before invoking the script — but capture encoding/locale *before* you scale, since the script can't query a stopped pod.

5. **Update your helm values.** On success, the script prints the exact `postgresql:` block to paste into your CircleCI Server values file. The change is in the `postgresql.image` registry/repository/tag — for the default (ACR) run, the diff against the previous PG12 pin looks like:

```diff
postgresql:
image:
registry: cciserver.azurecr.io
repository: circleci/server-postgres
tag: 14.22.4094-4922444
```

If you ran the script with `--dockerhub`, `registry: docker.io` stays — only the `tag` changes.

6. **Roll forward.** Run `helm upgrade` against your CircleCI Server release. This upgrades the chart, deploys the new postgres image, and restores correct replica counts for postgres and your application workloads — you don't need a separate scale-up:
```
helm upgrade circleci-server \
oci://cciserver.azurecr.io/circleci-server \
--version <server-version> \
-f <path-to-your-values-file>
```
The new postgres pod mounts the upgraded data directory and starts on PG14.

7. **Validate.**
```
PGP=$(kubectl -n <ns> get secret <secret-name> \
-o jsonpath='{.data.postgres-password}' | base64 -d)
kubectl -n <ns> exec -it postgresql-0 -- \
env PGPASSWORD="$PGP" psql -U postgres -c 'SELECT version();'
kubectl -n <ns> exec -it postgresql-0 -- \
env PGPASSWORD="$PGP" vacuumdb -U postgres --all --analyze-in-stages
```
`pg_upgrade` does NOT carry optimizer statistics across — `vacuumdb --analyze-in-stages` rebuilds them and keeps query plans sane. Smoke-test your CircleCI install (log in, trigger a workflow) and watch the postgres logs for any startup warnings.

8. **Clean up.** After at least 24 hours of healthy operation on PG14:
- Delete the completed upgrade Job (its pod has terminated, but the Job object stays in the namespace until you remove it):
```
kubectl -n <ns> delete job postgres-upgrade-12-to-14
```
- Inside the postgres pod, remove the old data directory left behind by `pg_upgrade`:
```
kubectl -n <ns> exec -it postgresql-0 -- bash
# then, inside the pod:
bash /bitnami/postgresql/upgrade-logs/delete_old_cluster.sh
```
- Delete your snapshot or pre-provisioned clone PVC if you no longer need them.

## Quick start

```bash
# Standard install: everything auto-discovered, ACR by default
./upgrade-postgres-to-14.sh -n circleci-server

# Pull server-postgres from Docker Hub instead of ACR
./upgrade-postgres-to-14.sh -n circleci-server --dockerhub

# Use a newer PG14 tag than the script's default
./upgrade-postgres-to-14.sh -n circleci-server -t 14.22.5000-newsha

# Preview the rendered Job manifest without applying
./upgrade-postgres-to-14.sh -n circleci-server --dry-run

# Fully explicit, no confirmation prompts
./upgrade-postgres-to-14.sh -n circleci-server \
--pvc-name data-postgresql-0 --secret-name postgresql \
--initdb-lc-collate en_US.UTF-8 --initdb-lc-ctype en_US.UTF-8 \
-y
```

## Flags

| Flag | Default | Purpose |
|---|---|---|
| `-n, --namespace NS` | *(required)* | Namespace where your postgres StatefulSet runs. |
| `-t, --pg14-tag TAG` | `14.22.4094-4922444` | server-postgres image tag to upgrade to. |
| `--pvc-name NAME` | auto-discovered | Source PVC. Defaults to whatever has `app.kubernetes.io/name=postgresql` in your namespace. |
| `--secret-name NAME` | auto-discovered | Secret holding the postgres superuser password. |
| `--secret-key KEY` | `postgres-password` | Key within that secret. |
| `--initdb-encoding ENC` | discovered, else `UTF8` | New cluster encoding. |
| `--initdb-lc-collate LOC` | discovered, else `C.UTF-8` | New cluster `LC_COLLATE`. Must match source. |
| `--initdb-lc-ctype LOC` | discovered, else `C.UTF-8` | New cluster `LC_CTYPE`. Must match source. |
| `--dockerhub` | off | Pull `server-postgres` from Docker Hub (`circleci/server-postgres`) instead of ACR. |
| `--acr-path PATH` | `cciserver.azurecr.io/circleci/server-postgres` | Override the ACR repository path. |
| `--upgrade-job-image IMG` | `tianon/postgres-upgrade:12-to-14` | Image used by the `pg_upgrade` Job itself (always pulled from Docker Hub). |
| `-y, --yes` | off | Skip confirmation prompts. |
| `--dry-run` | off | Render the Job manifest and exit without applying. |
| `-h, --help` | — | Show usage. |

## Prerequisites

- `kubectl` configured for the target cluster (verify with `kubectl config current-context`).
- A standard CircleCI Server internal installation of `postgresql`.
- Network reachability from your cluster to the registry hosting the PG14 image (ACR by default), with `imagePullSecrets` configured accordingly.
- Network reachability from your cluster to Docker Hub for the `tianon/postgres-upgrade:12-to-14` utility image used by the Job. If your cluster can't reach Docker Hub, mirror that image to your own registry and pass it via `--upgrade-job-image`.
- Your application-layer deployments (`layer=application`) scaled to 0 before invocation — the script verifies this and refuses to proceed otherwise. The postgres StatefulSet does *not* need to be scaled in advance; the script handles that.

## Auto-discovery behavior

The script reads the cluster to fill in placeholder values, so a typical invocation needs only `--namespace`.

- **PVC name and secret name** are looked up via the `app.kubernetes.io/name=postgresql` label.
- **Encoding and locale** are queried from the live postgres pod's `template1` catalog when the script starts. Discovery happens *before* the script scales postgres down, so it succeeds in the standard flow. If you'd prefer to scale postgres down yourself before invoking the script, you must pass the values via flags — by the time the script runs, the pod is gone and discovery has no live cluster to query.

If discovery fails or you scaled postgres down ahead of time and didn't pass flags, the script falls back to its built-in defaults (`UTF8` / `C.UTF-8` / `C.UTF-8`) and prints a warning. A mismatch with the source cluster's actual settings causes `pg_upgrade` to abort partway through its consistency checks — see [Pre-flight](#end-to-end-upgrade-procedure) (step 1) for how to override.

## Image source defaults

- **server-postgres image** defaults to `cciserver.azurecr.io/circleci/server-postgres:<tag>` from CircleCI's ACR. Use `--dockerhub` to pull `circleci/server-postgres:<tag>` from Docker Hub instead, or `--acr-path` to override the ACR path.
- **pg_upgrade utility image** is `tianon/postgres-upgrade:12-to-14`, always pulled from Docker Hub regardless of `--dockerhub`. If your cluster cannot reach Docker Hub, mirror this image and pass `--upgrade-job-image` accordingly.

## What success looks like

The script prints the next-step block at the end of a successful run:

```
═══════════════════════════════════════════════════════════════════════════
NEXT STEPS
═══════════════════════════════════════════════════════════════════════════

1. UPDATE your helm values file. Replace the postgresql block with:

postgresql:
image:
registry: cciserver.azurecr.io
repository: circleci/server-postgres
tag: 14.22.4094-4922444

2. RUN helm upgrade ...
3. VALIDATE ...
4. AFTER ≥24h of healthy operation, clean up ...
═══════════════════════════════════════════════════════════════════════════
```

After updating the values file, run `helm upgrade circleci-server oci://cciserver.azurecr.io/circleci-server --version <server-version> -f <path-to-your-values-file>` to roll forward. This single command rolls the chart, applies the new postgres image, and restores correct replica counts for postgres and your application workloads — no separate scale-up needed.

## Troubleshooting

### Re-running after a failed Job

The Job's first step refuses to re-run if `data-12`, `data-14`, or `data-12.preupgrade` exist on the PVC. To recover, run a one-shot pod that mounts the PVC and resets the state:

```yaml
apiVersion: v1
kind: Pod
metadata:
name: pg-upgrade-cleanup
namespace: <your-namespace>
spec:
restartPolicy: Never
securityContext:
runAsUser: 0
runAsGroup: 0
containers:
- name: cleanup
image: busybox:latest
command:
- sh
- -c
- |
set -ex
cd /bitnami/postgresql
if [ -d data-12 ] && [ ! -e data ]; then mv data-12 data; fi
rm -rf data-14 upgrade-logs
chown -R 1001:1001 data
volumeMounts:
- name: data
mountPath: /bitnami/postgresql
volumes:
- name: data
persistentVolumeClaim:
claimName: <your-postgres-pvc>
```

Apply, wait for it to reach phase `Succeeded`, delete it, then re-run `upgrade-postgres-to-14.sh`.

### Common pg_upgrade failures

- **Locale mismatch** — `lc_collate values for database "<name>" do not match` aborts the consistency checks. Re-run with `--initdb-lc-collate` / `--initdb-lc-ctype` set to the source cluster's actual values.
- **Missing extension build** — if your PG12 databases use extensions (e.g. `pg_stat_statements`, `pgcrypto`, `uuid-ossp`), the target PG14 image must include them. Catalog all extensions per database during pre-flight:
```
for db in $(kubectl -n <ns> exec postgresql-0 -- env PGPASSWORD="$PGP" \
psql -U postgres -tAc \
"SELECT datname FROM pg_database WHERE datistemplate=false AND datname!='postgres'"); do
echo "=== $db ==="
kubectl -n <ns> exec postgresql-0 -- env PGPASSWORD="$PGP" \
psql -U postgres -d "$db" -c "SELECT extname, extversion FROM pg_extension"
done
```
- **Authentication failure connecting to source cluster** — confirm `POSTGRES_SECRET_NAME` and `POSTGRES_SECRET_KEY` are correct by running:
```
PGP=$(kubectl -n <ns> get secret <secret-name> \
-o jsonpath='{.data.<secret-key>}' | base64 -d)
kubectl -n <ns> exec postgresql-0 -- env PGPASSWORD="$PGP" \
psql -U postgres -c 'SELECT 1'
```
*before* quiescing the cluster.

For anything outside these common cases — including rollback procedures — refer to the CircleCI Server upgrade guide on docs.circleci.com.

## Where to go next

The detailed customer-facing upgrade documentation (with platform-specific snapshot and PVC clone procedures, rollback playbooks, and the full troubleshooting reference) is on **docs.circleci.com** under the CircleCI Server upgrade guide.
Loading