a9s PostgreSQL Cluster Recovery

When a cluster is in a broken state where manual intervention is necessary, we need to properly identify the cluster state and then bring it back in the correct order. This document describes how to bring an a9s PostgreSQL cluster back after resetting the cluster state.

This document is intended to describe a default set of steps that can always be used to reset and restore an a9s PostgreSQL cluster.

a9s PostgreSQL Startup

Before going ahead, we need a basic understanding of how the a9s PostgreSQL cluster starts. The cluster first needs a running primary; even when standby nodes are starting, if a primary is not available they will wait until the valid primary is up.

After the primary is running, any starting standby node will clone from the primary (always during pre-start, and only if necessary when running monit start). With the data up to date, PostgreSQL is started on the standby node.

Cluster Recovery

0. Identify the Cluster Primary

Before going ahead, make sure you know the correct primary of the cluster by following the procedures described in Cluster Status.

Always back up the a9s PostgreSQL primary before executing any destructive steps.

Warning: the primary alias is a good hint as to which node is the primary, but it must not be taken as certain on its own. Cross-check it against the data described in the procedure mentioned above.

1. Stop the Cluster

After you have identified the correct primary of the cluster following the previous steps, it is time to stop the nodes to avoid unwanted promotion of a new primary.

First, stop the standby nodes. We describe this using monit stop, so you first need to SSH into the instance:

$ bosh -d <deployment> ssh pg/<index>
$ sudo -i
# monit stop all

Check that the node has stopped with monit summary. Stop the primary using the same commands.
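For example, on each node you can confirm the shutdown before moving on; with the monit version shipped on BOSH stemcells, stopped jobs are reported as "not monitored" (the exact job names depend on your deployment):

# monit summary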

2. Backup the Primary

With the whole cluster stopped, you need to make a backup of the current data directory of the primary. The directory storing this data is located at /var/vcap/store/postgresqlXX.

The backup can be made to a local directory under /var/vcap/store. If the node does not have enough space available, back up to your local machine using bosh scp instead.

# cp -rp /var/vcap/store/postgresqlXX /var/vcap/store/postgresqlXX.bkp
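If the node is short on space, a sketch of the bosh scp approach, run from your workstation, is shown below; flag names may vary slightly between BOSH CLI versions, and if file permissions prevent the SSH user from reading the data directory, you may need to create a root-owned, readable tar archive first:

$ bosh -d <deployment> scp --recursive pg/<index>:/var/vcap/store/postgresqlXX ./postgresqlXX.bkp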

3. Cleanup Cluster State

After you have backed up the primary data, it is time to clean up the nodes.

Some information about the cluster state is stored under the locks directory, usually data that needs to be available before the primary or the local PostgreSQL is up and running.

Cleanup Lock Files

The files are:

  • previous_address: File holding the address of the local machine on the latest startup. This is used to detect network changes in order to execute the necessary steps.
  • upstream_node.lock: File holding the id (spec.id) of the current upstream of this node. This is used to decide which node is the primary based on election and quorum. Note that we do not choose which node to promote; this file only reflects the repmgr state.
  • master_role.lock: Signals which node is the current primary of the majority in this cluster. This file is used to decide whether the node is a valid primary, and it is handled by a9s PostgreSQL Info.

On all nodes, make sure you remove all files under /var/vcap/store/postgresqlXX-locks/.
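For example, on every node, list the directory first to see which of the files above are present, then remove them:

# ls /var/vcap/store/postgresqlXX-locks/
# rm /var/vcap/store/postgresqlXX-locks/*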

Cleanup Startup Lock Directory

A file may be left in the startup lock directory containing a PID that has since been recycled by the operating system. In this case, trying to restart the postgresql-ha process can fail because a startup process appears to already be running, when in fact another process is reusing the PID.

To avoid this issue, after completely stopping the postgresql-ha process, check if there is any related process running with ps aux. If no related process is running, remove the content of the startup locks directory:

# rm /tmp/postgresql-startup-locks/*
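The ps check mentioned above can be as simple as the following; the pattern is only an example, adjust it to the processes used by your deployment:

# ps aux | grep -E '[p]ostgres|[r]epmgr'   # the brackets keep grep itself out of the output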

Cleanup Standby Data Directories

Since you are sure which node is the primary, you can remove (or move to another location) the PostgreSQL data directory on the standby nodes.

Before going ahead, make sure you are executing this command on the right nodes (standby nodes) and that the primary's data exists.

# rm -rf /var/vcap/store/postgresqlXX/*

Note that usually the pre-start script executes these steps. But in order to force cloning from the primary again, we can remove the data (on standby nodes only) and it will be cloned again during pre-start or monit start.

pre-start always cleans up the data directory of a standby node and clones if a primary is available at that moment. Cloning during monit start will only happen if the data directory does not contain bootstrap_first.lock, or if this is a former primary returning as a standby and its data directory does not have the recovery.conf file.

4. Recreate upstream_node.lock File

Recreate the upstream_node.lock file on all nodes as the vcap user to avoid permission issues. In any case, the vcap user must be able to read and write these files, since this is the user that runs the startup scripts and the processes themselves.

You can get the node-id with bosh instances, where the Instance column shows pg/<node-id>; if you are inside the instance, your prompt will also show pg/<node-id>.

$ echo "<node-id>" > /var/vcap/store/postgresqlXX-locks/upstream_node.lock
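If the file ends up owned by root (for example because you created it from a sudo shell), you can fix the ownership afterwards:

# chown vcap:vcap /var/vcap/store/postgresqlXX-locks/upstream_node.lock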

5. Recreate bootstrap_first.lock File

The /var/vcap/store/postgresqlXX/bootstrap_first.lock lock file signals that the cluster has already been initialized, and therefore initdb should never be executed again. To make sure our primary will not lose data when starting, we need to make sure this file exists.

If the file /var/vcap/store/postgresqlXX/bootstrap_first.lock does not exist, create it with the following command:

$ date > /var/vcap/store/postgresqlXX/bootstrap_first.lock

Once again, the owner of this file must be the vcap user, which must be able to read and write it.
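A quick way to verify and, if needed, fix the ownership:

# ls -l /var/vcap/store/postgresqlXX/bootstrap_first.lock
# chown vcap:vcap /var/vcap/store/postgresqlXX/bootstrap_first.lock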

6. Fix Key-Value Store (a9s PostgreSQL 9.4 only)

Check the Key-Value Store

a9s PostgreSQL 9.4 uses a different code base; this version stores part of the information about which node is the primary in Consul. So, check whether this value points to the correct primary of the cluster.

First, find out the address of a valid Consul cluster node. You can check that with bosh -d consul-dns vms; pick an address from the IPs column of a running consul/<id> node.

Query for the key against any of the running Consul instances in your consul-dns deployment:

$ source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
$ curl "http://<consul-node-ip>:8500/v1/kv/${SERVICE_NAME}-psql-master-hostname?raw"; echo

This will give something like pg/0, which is the primary node (<instance-group>/<index>).

If the response to the request is empty or failover_in_progress, manual fixing is necessary.

If a proper value is given, compare this information with the debug/status.sh script output; if the value points to a node index different from the correct primary, update this value in Consul.

Update Key-Value Store

After identifying the proper primary of the cluster following the steps described in the Identify the Cluster Primary section, take the index of the correct primary and update the key with the proper pg/<index> value.

Note that if the value is stuck in failover_in_progress, the cluster will wait forever for the promotion to happen and the upgrade will fail.

The value stored follows the format <instance-group>/<master-node-index>; for the current version, the instance group is always pg.

To update the value, execute against any Consul node:

$ curl -X PUT -d "pg/<master-node-index>" "http://<consul-node-ip>:8500/v1/kv/<deployment-name>-psql-master-hostname"
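To confirm the update took effect, you can repeat the earlier query and check that it now returns the expected pg/<master-node-index> value:

$ curl "http://<consul-node-ip>:8500/v1/kv/<deployment-name>-psql-master-hostname?raw"; echo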

7. Start the Cluster

Starting Primary

Now that you have updated the upstream_node.lock file on all nodes, start the primary.

# monit start all

Note that the PostgreSQL process will not start yet, even though monit shows it as running.

The reason is that monit shows the status of the startup script, and this script waits until there is enough quorum ((n/2) + 1 nodes, including the current waiting node, have the same upstream) to make sure this is the correct primary. This means that in a 3-node cluster, the first started node will wait until at least a second a9s PostgreSQL Info process is started, which should happen when we start the first standby node.

Starting Standbys

With the primary waiting to finish the initialization, it is time to start the standby nodes again.

First, execute the pre-start script. This step will make sure that everything that needs to be created is created and that data is cloned:

# /var/vcap/jobs/postgresql-ha/bin/pre-start

The pre-start script currently runs the a9s PostgreSQL Info webservice on that node. So, after executing the pre-start script, the primary may already be running but blocked; it will remain blocked until enough a9s PostgreSQL Info processes agree that it is the valid primary.

In any case, we are ready to start the first standby node:

# monit start all

After this point, the cluster will already be accessible and you can execute the same steps on the next standby nodes.

8. Check Cluster Health

After executing these steps, make sure everything is working as it should.

  • Check the debug/status.sh script: You can refer to the Cluster Status document, and compare the values. All nodes should be running, unblocked and following the same primary.

  • Check replication stream: The cluster is correctly replicating data when all standby nodes are streaming data from the primary. Check with the following query on the primary node:

postgres=# SELECT * FROM pg_stat_replication;

The output should show all standby nodes streaming from the primary. E.g.:

postgres=# SELECT * FROM pg_stat_replication;
  pid  | usesysid |  usename  | application_name | client_addr  | client_hostname | client_port |        backend_start          | backend_xmin |   state   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+-----------+------------------+--------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+--------------+------------
 31003 |    16504 | repl_user |    10.0.0.10     |  10.0.0.10   |                 |       39542 | 2019-11-12 09:03:49.12856+00  |    267053161 | streaming | 21/8A06DB30   | 21/8A06DB30    | 21/8A06DB30    | 21/8A06DB30     |             0 | async
  4186 |    16504 | repl_user |    10.0.1.10     |  10.0.1.10   |                 |       46232 | 2019-11-12 08:19:31.267585+00 |    267053162 | streaming | 21/8A06DB30   | 21/8A06DB30    | 21/8A06DB30    | 21/8A06DAC0     |             0 | async
(2 rows)