Skip to main content
Version: Develop

Replication Lag

This document describes how a9s PostgreSQL deals with replication lag and how to fix the cluster in the case of a replication issue.

Max Replication Lag

On the current repmgr version used to monitor the cluster, the failover algorithm does not consider replication lag, as result, a node with replication broken can be promoted to primary.

In this scenario, if the primary goes down for any reason, an outdated standby node can be promoted to primary. The old primary rejoining the cluster will overwrite the correct data with the outdated version during cloning, causing data loss.

To avoid this scenario, it was introduced the postgresql-ha.cluster.replication.max_replication_lag, where if a standby node has a lag bigger than the configured value in seconds (Default 1800), the promotion will fail.

This configuration is a way to limit the size of the data loss in case of a catastrophe. It defines the limit of 30 min (default) since the last time this standby node replicated data from the primary, where a standby can still be promoted.

You must monitor all a9s PostgreSQL clusters to detect replication failures and act as fast as possible to fix it. You can know more about this in the Metrics documentation. a9s Prometheus can be used to trigger alerts and visualize the metrics.

How to Check Replication

Replication is configured using replication slots. You can check the pg_stat_replication and the pg_replication_slots views on the primary database.

# su - vcap
$ source /var/vcap/jobs/postgresql-ha/helpers/
$ psql postgres
# SELECT * FROM pg_stat_replication;
# SELECT * FROM pg_replication_slots;

It is supposed to have the same number of slots as the number of standby nodes.

If location values are increasing in between queries, then the standby nodes are replicating.

How to Fix a Node not Replicating

Replication slots are recreated during the clone phase. To trigger a new clone, the pre-start script must be executed.

First, stop the PostgreSQL process:

$ bosh -d <deployment> stop pg/<id-of-affected-node>

Cleanup the standby node as described on Cluster Recovery.

And start node again:

$ bosh -d <deployment> start pg/<id-of-affected-node>

During node startup, clone will take place and replication will be rebuild.

How to Fix a Cluster which Promotion Has Been Prevented

If the replication lag was too big and as result, the promotion has been prevented, the cluster will be in a broken state, requiring manual intervention.

The two scenarios described below are the most likely to happen.

No standby Node Has Been Promoted, but Primary is Down.

Stop all nodes in the cluster wither with bosh stop:

$ bosh -d <deployment> stop pg/<id>

Start the primary:

$ bosh -d <deployment> stop pg/<id>

Cleanup the standby nodes as described on Cluster Recovery.

And start all nodes again, starting with the primary.

$ bosh -d <deployment> start pg/<id>

One of the standby Nodes Has Been Promoted

In this scenario, you might have a primary, a standby, and either another primary or a node restarting. In this case, follow the procedure described in Cluster Recovery to fix the cluster.