Version: Develop

Failover Monitor

This document describes how the failover monitor works and how to configure this feature.

a9s PostgreSQL clusters fail over using repmgr.

Repmgr detects that a primary is not reachable anymore, elects a standby, and promotes this standby to a new primary.

Some versions of repmgr do not retry a failed promotion. As a result, a cluster can be left without a promoted primary, and in some valid scenarios repmgr can even fail to decide which node should be promoted.

a9s PostgreSQL Failover Monitor

a9s PostgreSQL colocates a postgresql-info-webservice process with PostgreSQL on each node. The postgresql-info-webservice continuously monitors the state of the cluster. The failover monitor checks whether repmgr failed to take action during a failure by verifying that no primary is available and that none of the standby nodes is replicating. When this happens, it waits for a configurable timeout (default: 10 minutes). After the timeout, it picks the valid standby with the most advanced checkpoint and promotes that node to primary.

If repmgr does not automatically start following the new primary, the failover monitor tries to make the node follow the new primary.
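
The selection step described above (pick the valid standby whose checkpoint is furthest ahead) can be illustrated with a minimal sketch. This is not the monitor's actual implementation: the Standby type, its fields, and the function names are hypothetical, and the sketch assumes that checkpoints are compared by their PostgreSQL LSN in the usual textual form (for example 0/3000060).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Standby:
    """Hypothetical view of a standby node as seen by the monitor."""
    name: str            # illustrative node name
    checkpoint_lsn: str  # PostgreSQL LSN in text form, e.g. "0/3000060"
    valid: bool          # assumption: the monitor has marked the node promotable


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' into a comparable integer.

    The text form is two hexadecimal numbers: the high 32 bits and the
    low 32 bits of a 64-bit write-ahead log position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def pick_promotion_candidate(standbys: Sequence[Standby]) -> Optional[Standby]:
    """Return the valid standby with the highest checkpoint LSN, if any."""
    candidates = [s for s in standbys if s.valid]
    if not candidates:
        return None
    return max(candidates, key=lambda s: lsn_to_int(s.checkpoint_lsn))
```

Comparing LSNs as integers rather than as strings matters because the textual form is not zero-padded, so a plain string comparison could order positions incorrectly.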

This monitor executes only on cluster deployments, and it only takes action in the following scenarios:

The majority of the valid standby nodes time out without a promotion: The majority of the standby nodes were following the same primary, that primary is not reachable, the standby nodes have therefore not been replicating for the configured timeout, and no failover is in progress. In this case, the monitor executes repmgr promote.
Only one primary is reachable: If there is a running primary that is not the upstream of a standby node, the monitor tries to fix the upstream with repmgr follow. This can fail if the primary is not a valid primary for this node or if the replication slot has already been deleted (by default, 3 hours after the standby stopped replicating).
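
The two scenarios above can be read as a small decision function. The following is a simplified sketch under stated assumptions, not the monitor's real code: it takes a cluster-wide view, the StandbyState type and decide_action function are hypothetical, and "majority" is interpreted as more than half of the valid standby nodes.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StandbyState:
    """Hypothetical snapshot of one standby node."""
    name: str
    valid: bool                         # assumption: node is eligible for promotion
    seconds_without_replication: float  # 0 while the node is still replicating
    upstream: Optional[str]             # name of the primary this node follows


def decide_action(
    standbys: Sequence[StandbyState],
    reachable_primaries: Sequence[str],
    failover_in_progress: bool,
    timeout_seconds: float = 600.0,     # the 10 minute default mentioned in the text
) -> str:
    """Return 'promote', 'follow', or 'none' for the observed cluster state."""
    # Scenario 2: exactly one primary is reachable, but a standby has a
    # different upstream, so the upstream should be fixed with a follow.
    if len(reachable_primaries) == 1:
        primary = reachable_primaries[0]
        if any(s.upstream != primary for s in standbys):
            return "follow"
        return "none"

    # Several primaries, or a failover already running: leave the cluster alone.
    if reachable_primaries or failover_in_progress:
        return "none"

    # Scenario 1: no primary is reachable and the majority of the valid
    # standby nodes have not replicated for at least the timeout.
    valid = [s for s in standbys if s.valid]
    timed_out = [s for s in valid if s.seconds_without_replication >= timeout_seconds]
    if valid and 2 * len(timed_out) > len(valid):
        return "promote"
    return "none"
```

For example, with two valid standby nodes that have both been without replication for more than 600 seconds, no reachable primary, and no failover in progress, the sketch returns 'promote'.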

Configuring a9s PostgreSQL Failover Monitor

postgresql-info-webservice.enable_failover_monitoring: The failover monitor is enabled by default. It can be disabled by setting this property under the postgresql-ha job properties in the service instance ops file.
postgresql-info-webservice.promotion_monitor.timeout: The time the monitor waits before trying to promote a standby after the standby stops replicating from the primary and no primary is available. This property is also under the postgresql-ha job properties and can be configured via the service instance ops file.

Influence of Parameters on the Promotion Process

Before starting to promote a new primary, repmgr tries to reconnect to the current primary node. Once the timeout (reconnect_interval * reconnect_attempts) is reached, the standby nodes start electing a new primary.

repmgr uses postgresql-ha.cluster.status.reconnect_interval and tries to reconnect every 5 seconds to the latest primary before starting the promotion process.
repmgr uses postgresql-ha.cluster.status.reconnect_attempts and makes 6 attempts to reconnect to the latest primary before starting the promotion process.
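
As a worked example of the formula above, the following sketch computes how long repmgr keeps retrying before an election starts. It assumes that the values quoted in this document (5 seconds, 6 attempts, and the 10 minute monitor timeout) are the effective settings. The constant and function names are illustrative only.

```python
# Values quoted in this document; the names are illustrative.
RECONNECT_INTERVAL_SECONDS = 5               # postgresql-ha.cluster.status.reconnect_interval
RECONNECT_ATTEMPTS = 6                       # postgresql-ha.cluster.status.reconnect_attempts
PROMOTION_MONITOR_TIMEOUT_SECONDS = 10 * 60  # failover monitor timeout (10 min)


def repmgr_election_delay(interval_seconds: int, attempts: int) -> int:
    """Seconds repmgr spends reconnecting before an election starts."""
    return interval_seconds * attempts


delay = repmgr_election_delay(RECONNECT_INTERVAL_SECONDS, RECONNECT_ATTEMPTS)
print(delay)  # 30

# With these values, repmgr's own window is much shorter than the
# failover monitor's timeout.
print(delay < PROMOTION_MONITOR_TIMEOUT_SECONDS)  # True
```

With these values, repmgr gets 30 seconds to reconnect before the election begins, well before the failover monitor's 10 minute timeout expires.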