Cluster Recovery

When a cluster is in a broken state where manual intervention is necessary, we need to properly identify the cluster state and then bring the nodes back in the correct order. This document describes how to bring an a9s PostgreSQL cluster back after resetting the cluster state.

This document describes a default set of steps that can always be used to reset and restore an a9s PostgreSQL cluster.

a9s PostgreSQL Startup

Before going ahead, we need a brief overview of how the a9s PostgreSQL cluster starts. The cluster first needs a running primary; even when standby nodes are starting, if a primary is not available, they will wait until the valid primary is up.

After the primary is running, any starting standby node will clone from the primary (always during pre-start, and only if necessary when running monit start). With the data up-to-date, PostgreSQL is started on the standby node.

In the following sections, we explain how to safely accomplish this.

Pre Recovery

Check and Repair When Continuous Archiving is Enabled

note
  • This step is only required if Continuous Archiving is enabled.
  • This case might happen when the a9s Parachute limit is reached, for example.

Although mechanisms are in place to reduce the risk, there is still a small chance that, during a cluster crash in which the PostgreSQL or Backup Agent processes are stopped, some WAL files are not successfully backed up to blob storage. This can compromise the WAL file history. Therefore, it is important to verify and ensure consistency.

To ensure this consistency, follow the steps below:

  1. Check for Unarchived WAL Files

On each node with the PostgreSQL process stopped, check if there are any WAL files that have not been backed up:

# Replace <postgresqlXX> with your PostgreSQL version (e.g., postgresql17, postgresql15, etc)
ls /var/vcap/store/<postgresqlXX>/pg_wal/archive_status | grep ".ready"

Example Output:

000000010000000000000017.ready
000000010000000000000018.ready

These files are pending upload to blob storage.

  2. Manually Upload WAL Files

If <wal-file>.ready files are found, upload each one manually from each node:

note

Ensure both the Backup Agent (and Consul as a dependency) and Backup Manager are running and healthy before proceeding.

# Replace <wal-file> with the WAL filename (without the `.ready` extension)
/var/vcap/jobs/postgresql-ha/bin/backup/backup-wal-file.sh pg_wal/<wal-file> <wal-file>

Operation Example:

/var/vcap/jobs/postgresql-ha/bin/backup/backup-wal-file.sh pg_wal/000000010000000000000017 000000010000000000000017
/var/vcap/jobs/postgresql-ha/bin/backup/backup-wal-file.sh pg_wal/000000010000000000000018 000000010000000000000018

After uploading, mark each WAL file as processed by renaming its <wal-file>.ready file to <wal-file>.done:

mv /var/vcap/store/<postgresqlXX>/pg_wal/archive_status/<wal-file>.ready \
/var/vcap/store/<postgresqlXX>/pg_wal/archive_status/<wal-file>.done

Alternatively, use the script snippet below to achieve the same result:

#!/bin/bash
set -e

trap 'echo ">> Error: Script failed at line $LINENO. Check the logs!"; exit 1' ERR

# Replace the <postgresqlXX> value
postgresql_version="<postgresqlXX>"
for wal in $(ls /var/vcap/store/${postgresql_version}/pg_wal/archive_status | grep ".ready"); do
  # Strip the .ready suffix to get the WAL file name
  wal_file_name="${wal%.ready}"
  echo ">> Trying to upload ${wal_file_name}."

  /var/vcap/jobs/postgresql-ha/bin/backup/backup-wal-file.sh pg_wal/${wal_file_name} ${wal_file_name}

  mv /var/vcap/store/${postgresql_version}/pg_wal/archive_status/${wal_file_name}.ready \
     /var/vcap/store/${postgresql_version}/pg_wal/archive_status/${wal_file_name}.done

  echo ">> ${wal_file_name} upload OK!"
done

echo ">> The end."

  3. Verify Upload in Blob Storage

Confirm that the WAL files are present in the blob storage:
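
A hedged way to produce such a listing, assuming an S3-compatible blob store and the AWS CLI with access to the backup bucket (bucket name and prefix are placeholders):

# Replace <bucket> and <prefix> with your backup bucket and path
aws s3 ls s3://<bucket>/<prefix>/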

S3 blob storage Example:

(...)
2025-04-23 23:41:28 413136 000000010000000000000016-0
2025-04-23 23:41:29 363 000000010000000000000016.json
2025-04-23 23:41:28 423136 000000010000000000000017-0
2025-04-23 23:41:29 363 000000010000000000000017.json
2025-04-23 23:41:44 514624 000000010000000000000018-0
2025-04-23 23:41:45 363 000000010000000000000018.json
(...)
info

Once all the above steps are completed and all WAL files are present in blob storage, the data history can be considered consistent. As a result, a disaster recovery operation can be performed if needed. For more information, see Disaster Recovery.

Cluster Recovery

0. Identify the Cluster Primary

Before going ahead, make sure you know the correct primary of the cluster by following the procedures described in Cluster Status.

info

Always back up the a9s PostgreSQL primary before executing any destructive steps.

danger

To prevent data loss, it is essential to select the correct primary node before executing any destructive step. Choosing an incorrect node could result in data loss due to outdated information. We strongly recommend proceeding with utmost care during this process.

1. Stop the Cluster

After you have identified the correct primary of the cluster following the previous steps, it is time to stop the nodes in an order that avoids unwanted promotion of a new primary.

Stop the Standby Nodes

First, stop the standby nodes. We describe this using monit stop, so you first need to SSH into the instance:

$ bosh -d <deployment> ssh pg/<index>
$ sudo -i
$ monit stop all

Check that the node has stopped with monit summary.
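
A hedged example of what monit summary might report once everything is stopped (the process names below are illustrative and depend on the release):

$ monit summary
The Monit daemon 5.2.5 uptime: 4h 32m

Process 'postgresql-ha'             not monitored
System 'system_localhost'           running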

Cleanup repmgr

Before stopping the primary, clean up the repmgr database.

Access the repmgr database on the primary node.

a9s PostgreSQL 15 and above

$ source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
$ chpst -u postgres:vcap psql repmgr_cluster

a9s PostgreSQL 13 and below

As vcap user:

$ su - vcap
$ source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
$ psql repmgr_cluster

Delete all entries in the repmgr nodes table:

DELETE FROM repmgr.nodes;
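
To verify that the table is now empty, you can run:

SELECT COUNT(*) FROM repmgr.nodes;

The query should return 0.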

Stop the Primary Node

Stop the primary using the same commands described in Stop the Standby Nodes.

2. Backup the Primary

With the whole cluster stopped, you need to make a backup of the current data directory of the primary. The directory storing this data is located at /var/vcap/store/postgresqlXX.

The backup can be made to a local directory under /var/vcap/store. If the node does not have enough space available, back up to your local machine using bosh scp (see the sketch after the command below).

cp -rp /var/vcap/store/postgresqlXX /var/vcap/store/postgresqlXX.bkp
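
If you need to copy the backup off the VM instead, a hedged sketch using bosh scp from your workstation (deployment name, instance index, and local target path are placeholders):

# Recursively copy the backed-up data directory to the local machine
bosh -d <deployment> scp --recursive pg/<index>:/var/vcap/store/postgresqlXX.bkp ./postgresqlXX.bkp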

3. Cleanup Cluster State

After you back up the primary data, it is time to clean up the nodes.

Some information about the cluster state is stored under the locks directory, usually data that needs to be available before the primary or the local PostgreSQL is up and running.

Cleanup Lock Files

The files are:

  • previous_address: File holding the address of the local machine on the latest startup. It is used to detect network changes so that the necessary steps can be executed.
  • upstream_node.lock: File holding the id (spec.id) of the current upstream of this node. It is used to decide which node is the primary based on election and quorum. Note that we do not choose which node to promote; this file only reflects the repmgr state.
  • master_role.lock: Signals which node is the current primary of the majority in this cluster. This file is used to decide if the node is a valid primary, and it is handled by a9s PostgreSQL Info.

On all nodes, make sure you remove all files under /var/vcap/store/postgresqlXX-locks/.
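
For example (replace XX with the PostgreSQL version in use):

rm -f /var/vcap/store/postgresqlXX-locks/*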

Cleanup Startup Lock Directory

A file containing a PID that has since been recycled by the operating system may be left in the startup lock directory. In this case, restarting the postgresql-ha process can fail because it assumes a startup process is already running, when in fact another process is reusing the PID.

To avoid this issue, after completely stopping the postgresql-ha process, check whether any related process is still running with ps aux. If no related process is running, remove the contents of the startup locks directory:

rm /tmp/postgresql-startup-locks/*

Cleanup Standby Data Directories

Since you are sure which node is the primary, you can remove (or move to another location) the PostgreSQL data directory on the standby nodes.

Before going ahead, make sure you are executing this command on the right nodes (the standby nodes) and that the primary's data exists.

rm -rf /var/vcap/store/postgresqlXX/*

Note that the pre-start script usually executes these steps. However, in order to force cloning from the primary again, we can remove the data (on standby nodes only) so that it is cloned again during pre-start or monit start.

pre-start always cleans up the data directory of a standby node and clones it if a primary is available at that moment. Cloning during monit start only happens if the data directory does not contain the bootstrap_first.lock file, or if this is a former primary returning as a standby that does not have the recovery.conf file in its data directory.
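
If you are unsure which of these marker files exist on a node, a quick check (replace XX with the PostgreSQL version in use; missing files are simply not listed):

ls -l /var/vcap/store/postgresqlXX/bootstrap_first.lock \
      /var/vcap/store/postgresqlXX/recovery.conf 2>/dev/null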

4. Recreate upstream_node.lock File

Recreate the upstream_node.lock file on all nodes with the node-id of the master-node of your PostgreSQL cluster. The file must be owned by the user running PostgreSQL to avoid permission issues; in any case, the system user must be able to read and write the file, since it is the user that runs the startup scripts and the processes themselves.

PostgreSQL User Information

The versions differ in the user that runs the main PostgreSQL processes.

  • a9s PostgreSQL 13 and below use the vcap user.
  • a9s PostgreSQL 15 and above use the postgres system user.

You can get the node-id of the master-node by checking what IP address is assigned to the master-alias of the PostgreSQL cluster and comparing it to the assigned IP addresses of the VMs.

Example:

We have a running cluster with the following VMs:

Deployment 'my-psql-cluster'

Instance Process State AZ IPs
pg/8eb9ef9f-c436-4eda-8dd4-6c61ab33f040 running z2 172.28.39.52 [...]
pg/8fcb86bd-1643-4176-900d-a070aa4c8ace running z1 172.28.38.65 [...]
pg/eb433d91-f3d2-4344-89cc-e7a3f9db4e5b running z3 172.28.40.83 [...]

We can then use dig to resolve the IP address of the master-node inside the cluster.

Please keep in mind to replace these placeholders in the following commands:

  • [CONSUL-DOMAIN] with the Consul domain. The value can be found in your IaaS config under iaas.consul.domain.
  • my-psql-cluster with the actual BOSH deployment name of your a9s PostgreSQL cluster instance.

export MASTER_IP=$(dig +short my-psql-cluster-psql-master-alias.node.dc1.[CONSUL-DOMAIN])

## MASTER_IP=172.28.38.65

The output of the following command then reflects the node-id of the master-node of your PostgreSQL cluster:

$ bosh -d my-psql-cluster vms | grep $MASTER_IP | awk '{print $1}' | cut -d"/" -f2
8fcb86bd-1643-4176-900d-a070aa4c8ace

After you have the node-id, you can start to create the upstream_node.lock file on each VM in the cluster (do not forget to replace XX with the actual PostgreSQL version in use):

echo "<node-id>" > /var/vcap/store/postgresqlXX-locks/upstream_node.lock

The created file should only contain the node-id, for example:

$ cat /var/vcap/store/postgresql13-locks/upstream_node.lock
8fcb86bd-1643-4176-900d-a070aa4c8ace
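
If the file ownership does not match, you can adjust it; a hedged sketch assuming a9s PostgreSQL 15 and above (use the vcap user for version 13 and below):

# Make the lock file readable and writable by the PostgreSQL system user
chown postgres:vcap /var/vcap/store/postgresqlXX-locks/upstream_node.lock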

5. Recreate bootstrap_first.lock File

The /var/vcap/store/postgresqlXX/bootstrap_first.lock lock file signals that the cluster has already been initialized, and therefore initdb must never be executed again. To make sure the primary does not lose data when starting, we need to make sure this file exists.

If the file /var/vcap/store/postgresqlXX/bootstrap_first.lock does not exist, create it with the following command:

date > /var/vcap/store/postgresqlXX/bootstrap_first.lock

Once again, the owner of this file must be the system user, and that user must be able to read and write the file.

PostgreSQL User Information

The versions differ in the user that runs the main PostgreSQL processes.

  • a9s PostgreSQL 13 and below use the vcap user.
  • a9s PostgreSQL 15 and above use the postgres system user.

6. Start the Cluster

Starting Primary

Now that you have updated the upstream_node.lock file on all nodes, start the primary.

monit start all

Note that the PostgreSQL process will not start yet, even though monit shows it as running.

The reason is that monit shows the status of the startup script, and this script waits until there is enough quorum ((n/2) + 1 nodes, including the waiting node itself, agree on the same upstream) to make sure this is the correct primary. This means that in a three-node cluster, the first started node waits until at least a second a9s PostgreSQL Info process is started, which happens when we start the first standby node.
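
To observe what the startup script is waiting for, you can follow the job logs; a minimal sketch assuming BOSH's default log location for the postgresql-ha job:

tail -f /var/vcap/sys/log/postgresql-ha/*.log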

Starting Standbys

With the primary waiting to finish the initialization, it is time to start the standby nodes again.

First, execute the pre-start script. This step will make sure that everything that needs to be created is created and that data is cloned:

/var/vcap/jobs/postgresql-ha/bin/pre-start

The pre-start script currently runs the a9s PostgreSQL Info webservice on that node, so after executing the pre-start script, the primary may already be running but blocked. The node will remain blocked until enough a9s PostgreSQL Info processes agree that the primary is valid.

In any case, we are ready to start the first standby node:

monit start all

After this point, the cluster will already be accessible and you can execute the same steps on the next standby nodes.

7. Checking Cluster Health

After executing these steps, make sure everything is working as it should.

  • Check the debug/status.sh script: You can refer to the Cluster Status document, and compare the values. All nodes should be running, unblocked and following the same primary.

  • Check replication stream: The cluster is correctly replicating data when all standby nodes are streaming data from the primary. Check with the following query on the primary node:

SELECT * FROM pg_stat_replication;

The output should show all standby nodes streaming from the primary. E.g.:

SELECT * FROM pg_stat_replication;
  pid  | usesysid |  usename  | application_name | client_addr  | client_hostname | client_port |        backend_start          | backend_xmin |   state   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+-----------+------------------+--------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+--------------+------------
31003 | 16504 | repl_user | 10.0.0.10 | 10.0.0.10 | | 39542 | 2019-11-12 09:03:49.12856+00 | 267053161 | streaming | 21/8A06DB30 | 21/8A06DB30 | 21/8A06DB30 | 21/8A06DB30 | 0 | async
4186 | 16504 | repl_user | 10.0.1.10 | 10.0.1.10 | | 46232 | 2019-11-12 08:19:31.267585+00 | 267053162 | streaming | 21/8A06DB30 | 21/8A06DB30 | 21/8A06DB30 | 21/8A06DAC0 | 0 | async
(2 rows)