a9s PostgreSQL Cluster Recovery
When a cluster is in a broken state that requires manual intervention, we need to properly identify the cluster state and then bring the cluster back in the correct order. This document describes how to bring an a9s PostgreSQL cluster back after resetting the cluster state.
This document is intended to describe a default set of steps that can always reset and restore an a9s PostgreSQL cluster.
a9s PostgreSQL Startup
Before going ahead, we need a brief overview of how the a9s PostgreSQL cluster starts. First, it needs a running primary; even when standby nodes are starting, if a primary is not available they will wait until a valid primary is up.
After the primary is running, any starting standby node will clone from the primary (always during pre-start, and only if necessary when running monit start).
With the data up to date, PostgreSQL on the standby node is started.
Cluster Recovery
0. Identify the Cluster Primary
Before going ahead, make sure you know the correct primary of the cluster by following the procedures described in Cluster Status.
Always back up the a9s PostgreSQL primary before executing any destructive steps.
To prevent data loss, it is essential to select the correct primary node before executing any destructive step. Choosing an incorrect node could result in data loss due to outdated information. We strongly recommend proceeding with utmost care during this process.
1. Stop the Cluster
After you have identified the correct primary of the cluster following the previous steps, it is time to stop the nodes in an order that avoids unwanted promotion of a new primary.
Stop the Standby Nodes
We stop the standby nodes first. We will describe this using monit stop, so you first need to ssh into the instance:
$ bosh -d <deployment> ssh pg/<index>
$ sudo -i
$ monit stop all
Check that the node has stopped with monit summary.
Cleanup repmgr
Before stopping the primary, clean up the repmgr database.
Access the repmgr database on the primary node.
a9s PostgreSQL 15 and above
$ source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
$ chpst -u postgres:vcap psql repmgr_cluster
a9s PostgreSQL 13 and below
As the vcap user:
$ su - vcap
$ source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
$ psql repmgr_cluster
Delete all entries from the repmgr.nodes status table:
DELETE FROM repmgr.nodes;
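If you want to verify that the table is now empty before proceeding, you can run:
SELECT count(*) FROM repmgr.nodes;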
Stop the Primary Node
Stop the primary using the same commands described in Stop the Standby Nodes.
2. Backup the Primary
With the whole cluster stopped, you need to make a backup of the current data directory of the primary. The directory storing this data is located at /var/vcap/store/postgresqlXX.
The backup can be made to a local directory under /var/vcap/store.
If the node does not have space available, back up to your local machine using bosh scp.
cp -rp /var/vcap/store/postgresqlXX /var/vcap/store/postgresqlXX.bkp
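If you need to copy the backup to your local machine instead, the following is a minimal sketch using bosh scp; the archive name and the /tmp path are only examples, and XX stands for the PostgreSQL version in use:
# On the primary VM: pack the data directory into a single archive
tar -czf /tmp/postgresqlXX.bkp.tar.gz -C /var/vcap/store postgresqlXX
# From your workstation: fetch the archive via the BOSH CLI
bosh -d <deployment> scp pg/<index>:/tmp/postgresqlXX.bkp.tar.gz ./postgresqlXX.bkp.tar.gz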
3. Cleanup Cluster State
After you have backed up the primary data, it is time to clean up the nodes.
Some information about the cluster state is stored under the locks directory, usually data that needs to be available before the primary or the local PostgreSQL is up and running.
Cleanup Lock Files
The files are:
- previous_address: File holding the address of the local machine at the latest startup. This is used to detect network changes in order to execute the necessary steps.
- upstream_node.lock: File holding the id (spec.id) of the current upstream of this node. This is used to decide who is the primary based on election and quorum. Note that we do not choose which node to promote; this file only reflects the repmgr state.
- master_role.lock: Signals which node is the current primary of the majority in this cluster. This file is used to decide if the node is a valid primary, and it is handled by a9s PostgreSQL Info.
On all nodes, make sure you remove all files under /var/vcap/store/postgresqlXX-locks/.
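For example, on each node (replace XX with the PostgreSQL version in use):
rm /var/vcap/store/postgresqlXX-locks/*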
Cleanup Startup Lock Directory
A file may be left in the startup lock directory containing a PID that has been recycled by the operating system. In this case, when trying to restart the postgresql-ha process, it can fail because it assumes a startup process is already running, when in fact another process is reusing the PID.
To avoid this issue, after completely stopping the postgresql-ha process, check if there is any related process running with ps aux. If no related process is running, remove the content of the startup locks directory:
rm /tmp/postgresql-startup-locks/*
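A possible way to check for leftover processes before removing the lock files (the grep pattern is only an example; adjust it to your environment):
ps aux | grep -E 'postgres|postgresql-ha' | grep -v grep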
Cleanup Standby Data Directories
Since you are sure which node is the primary, you can remove (or move to another location) the PostgreSQL data directory on the standby nodes.
Before going ahead, make sure you are executing this command on the right nodes (the standby nodes) and that the data of the primary exists.
rm -rf /var/vcap/store/postgresqlXX/*
Note that, usually, the pre-start script executes these steps. But in order to force cloning from the primary again, we can remove the data (on standby nodes only) and it will be cloned again during pre-start or monit start.
pre-start always cleans up the data directory of a standby node and clones if there is a primary available at the moment. Cloning during monit start will only happen if the data directory does not contain the bootstrap_first.lock file, or if this is a former primary returning as a standby and it does not have the recovery.conf file in its data directory.
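If you are unsure whether a standby will clone on the next start, you can check which of these markers are currently present (XX stands for the PostgreSQL version in use):
ls -l /var/vcap/store/postgresqlXX/bootstrap_first.lock /var/vcap/store/postgresqlXX/recovery.conf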
4. Recreate upstream_node.lock File
Recreate the upstream_node.lock file on all nodes with the node-id of the master node of your PostgreSQL cluster. The file should be owned by the user running PostgreSQL to avoid permission issues. In any case, the system user should be able to read and write the files, since this is the user that runs the startup scripts and the processes themselves.
The versions differ in the user that runs the main PostgreSQL processes:
- a9s PostgreSQL 13 and below use the vcap user.
- a9s PostgreSQL 15 and above use the postgres system user.
You can get the node-id of the master node by checking which IP address is assigned to the master-alias of the PostgreSQL cluster and comparing it to the assigned IP addresses of the VMs.
Example:
We have a running cluster with the following VMs:
Deployment 'my-psql-cluster'
Instance Process State AZ IPs
pg/8eb9ef9f-c436-4eda-8dd4-6c61ab33f040 running z2 172.28.39.52 [...]
pg/8fcb86bd-1643-4176-900d-a070aa4c8ace running z1 172.28.38.65 [...]
pg/eb433d91-f3d2-4344-89cc-e7a3f9db4e5b running z3 172.28.40.83 [...]
We can then use dig to resolve the IP address of the master node inside the cluster.
Please keep in mind to replace these placeholders in the following commands:
- [CONSUL-DOMAIN] with the Consul domain. The value can be found in your IaaS config under iaas.consul.domain.
- my-psql-cluster with the actual BOSH deployment name of your a9s PostgreSQL cluster instance.
export MASTER_IP=$(dig +short my-psql-cluster-psql-master-alias.node.dc1.[CONSUL-DOMAIN])
## MASTER_IP=172.28.38.65
The output of this command then reflects the node-id of the master node of your PostgreSQL cluster:
$ bosh -d my-psql-cluster vms | grep $MASTER_IP | awk '{print $1}' | cut -d"/" -f2
8fcb86bd-1643-4176-900d-a070aa4c8ace
After you have the node-id, you can start to create the upstream_node.lock file on each VM in the cluster (do not forget to replace XX with the actual PostgreSQL version in use):
echo "<node-id>" > /var/vcap/store/postgresqlXX-locks/upstream_node.lock
The created file should only contain the node-id, for example:
$ cat /var/vcap/store/postgresql13-locks/upstream_node.lock
8fcb86bd-1643-4176-900d-a070aa4c8ace
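Since the file must be readable and writable by the system user, you may need to adjust its ownership after creating it. A minimal sketch, assuming the user and group combinations already used in this document; verify against the ownership of other files in the locks directory, and replace XX:
# a9s PostgreSQL 13 and below (assumption: vcap:vcap, matching the vcap system user)
chown vcap:vcap /var/vcap/store/postgresqlXX-locks/upstream_node.lock
# a9s PostgreSQL 15 and above (assumption: postgres:vcap, matching the chpst invocation above)
chown postgres:vcap /var/vcap/store/postgresqlXX-locks/upstream_node.lock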
5. Recreate bootstrap_first.lock File
The /var/vcap/store/postgresqlXX/bootstrap_first.lock lock file signals that the cluster has already been initialized, and therefore initdb should never be executed again. In order to make sure our primary will not lose data when starting, we need to make sure this file exists.
If the file /var/vcap/store/postgresqlXX/bootstrap_first.lock does not exist, create it with the following command:
date > /var/vcap/store/postgresqlXX/bootstrap_first.lock
Once again, the owner of this file must be the system user, and it should be able to read and write this file.
The versions differ in the user that runs the main PostgreSQL processes:
- a9s PostgreSQL 13 and below use the vcap user.
- a9s PostgreSQL 15 and above use the postgres system user.
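As with the upstream_node.lock file, a minimal sketch for adjusting ownership, under the same assumptions about the user and group (replace XX and pick the line matching your version):
# a9s PostgreSQL 13 and below (assumed vcap:vcap)
chown vcap:vcap /var/vcap/store/postgresqlXX/bootstrap_first.lock
# a9s PostgreSQL 15 and above (assumed postgres:vcap)
chown postgres:vcap /var/vcap/store/postgresqlXX/bootstrap_first.lock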
6. Start the Cluster
Starting Primary
Now that you have updated the upstream_node.lock file on all nodes, start the primary.
monit start all
Note that the PostgreSQL process will not start immediately, even though monit shows it as running. The reason is that monit shows the status of the startup script, and this script waits until there is quorum ((n/2) + 1 nodes, including the node currently waiting, have the same upstream) to make sure this is the correct primary. This means that in a 3-node cluster, the first started node will wait until at least a second a9s PostgreSQL Info is started; this should happen when we start the first standby node.
Starting Standbys
With the primary waiting to finish the initialization, it is time to start the standby nodes again.
First, execute the pre-start script. This step will make sure that everything that needs to be created is created and that data is cloned:
/var/vcap/jobs/postgresql-ha/bin/pre-start
The pre-start script currently runs the a9s PostgreSQL Info webservice on that node, so after executing the pre-start script, the primary may already be running but blocked. The node will remain blocked until enough a9s PostgreSQL Info processes agree that the primary is valid.
In any case, we are ready to start the first standby node:
monit start all
After this point, the cluster will already be accessible, and you can execute the same steps on the remaining standby nodes.
7. Checking Cluster Health
After executing these steps, make sure everything is working as it should.
- Check the debug/status.sh script: you can refer to the Cluster Status document and compare the values. All nodes should be running, unblocked, and following the same primary.
- Check the replication stream: the cluster is correctly replicating data when all standby nodes are streaming data from the primary. Check with the following query on the primary node:
SELECT * FROM pg_stat_replication;
The output should show all standby nodes streaming from the primary. E.g.:
SELECT * FROM pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+-----------+------------------+--------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+--------------+------------
31003 | 16504 | repl_user | 10.0.0.10 | 10.0.0.10 | | 39542 | 2019-11-12 09:03:49.12856+00 | 267053161 | streaming | 21/8A06DB30 | 21/8A06DB30 | 21/8A06DB30 | 21/8A06DB30 | 0 | async
4186 | 16504 | repl_user | 10.0.1.10 | 10.0.1.10 | | 46232 | 2019-11-12 08:19:31.267585+00 | 267053162 | streaming | 21/8A06DB30 | 21/8A06DB30 | 21/8A06DB30 | 21/8A06DAC0 | 0 | async
(2 rows)
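As an additional quick check, you can confirm that the number of streaming standbys matches the number of standby nodes in the cluster (for example, 2 in a 3-node cluster):
SELECT count(*) FROM pg_stat_replication WHERE state = 'streaming';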