Failover Monitor
This document describes how the failover monitor works and how to configure this feature.
Failover Monitor
a9s PostgreSQL clusters fail over using repmgr.
repmgr detects that a primary is no longer reachable, elects a standby, and promotes that standby to be the new primary.
Some versions of repmgr do not retry a failed promotion. In some scenarios, this can leave a cluster without a promoted primary, or even unable to decide which node should be promoted, although the scenario is a valid one for failover.
a9s PostgreSQL Failover Monitor
a9s PostgreSQL colocates a postgresql-info-webservice process on each node alongside PostgreSQL.
The postgresql-info-webservice continuously monitors the state of the cluster. The failover monitor
checks whether repmgr failed to take action during a failure: it verifies that no primary is
available and that none of the standby nodes are replicating. When this is the case, it waits for a
configurable timeout (default 10 min). After the timeout, it picks the valid standby with the most
advanced checkpoint and promotes this node to primary.
If repmgr does not automatically make a standby follow the new primary, the failover monitor tries
to make that node follow the new primary.
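The selection step described above can be sketched as follows. This is only an illustration, not the actual postgresql-info-webservice implementation: the `Standby` class, its fields, and the function names are hypothetical, and the sketch assumes that checkpoints are compared as PostgreSQL LSNs in the usual `X/Y` hexadecimal notation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Standby:
    # Hypothetical view of a standby node; field names are assumptions.
    name: str
    valid: bool          # node is a usable promotion candidate
    replicating: bool    # node is still streaming from a primary
    checkpoint_lsn: str  # PostgreSQL LSN, e.g. "0/3000060"


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' into a comparable integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def pick_promotion_candidate(
    primary_available: bool,
    standbys: List[Standby],
    seconds_without_primary: float,
    timeout_seconds: float = 600,  # default of 10 minutes from the text above
) -> Optional[Standby]:
    """Return the standby to promote, or None if the monitor should not act."""
    if primary_available:
        return None  # a primary exists, nothing to do
    if any(s.replicating for s in standbys):
        return None  # at least one standby still replicates
    if seconds_without_primary < timeout_seconds:
        return None  # still waiting for the timeout
    candidates = [s for s in standbys if s.valid]
    if not candidates:
        return None
    # The valid standby with the most advanced checkpoint wins.
    return max(candidates, key=lambda s: lsn_to_int(s.checkpoint_lsn))
```

For example, given two valid standbys at `0/3000060` and `0/A000000` and an elapsed time above the timeout, the sketch returns the node at `0/A000000`.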
This monitor runs only on cluster deployments, and it takes action only in the following scenarios:
- The majority of valid standby nodes time out without a promotion: The majority of the standby nodes were following the same primary, that primary is not reachable, the standby nodes have therefore not been replicating for the given timeout, and no failover is in progress. In this case, the monitor executes repmgr promote.
- Exactly one primary is reachable, but it is not the upstream of a standby node: The monitor tries to fix the upstream with repmgr follow. This can fail if the primary is not a valid primary for this node or if the replication slot has already been deleted (by default, 3h after the standby stopped replicating).
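The two scenarios can be read as a small decision function evaluated from the point of view of one standby node. The sketch below is a simplified reading of the text, not the monitor's real code: the `Action` enum, the function name, and all parameters are hypothetical.

```python
from enum import Enum
from typing import List, Optional


class Action(Enum):
    NONE = "none"
    PROMOTE = "promote"  # corresponds to the promote scenario above
    FOLLOW = "follow"    # corresponds to the follow scenario above


def decide_action(
    reachable_primaries: List[str],
    my_upstream: Optional[str],
    valid_standbys: int,
    timed_out_standbys: int,
    failover_in_progress: bool,
) -> Action:
    """Map the observed cluster state to one of the monitor's actions."""
    # Scenario 2: exactly one running primary that is not this node's upstream.
    if len(reachable_primaries) == 1:
        if my_upstream != reachable_primaries[0]:
            return Action.FOLLOW
        return Action.NONE
    # Scenario 1: no primary, no failover in progress, and a strict
    # majority of the valid standbys has exceeded the timeout.
    if (
        not reachable_primaries
        and not failover_in_progress
        and timed_out_standbys * 2 > valid_standbys
    ):
        return Action.PROMOTE
    return Action.NONE
```

With two valid standbys that have both timed out and no reachable primary, the function returns `Action.PROMOTE`. If only one of the two has timed out, there is no majority and it returns `Action.NONE`.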
Configuring a9s PostgreSQL Failover Monitor
- postgresql-info-webservice.enable_failover_monitoring: The failover monitor is enabled by default. To disable it, configure this property under the postgresql-ha job properties in the Service Instance ops file.
- postgresql-info-webservice.promotion_monitor.timeout: The time the monitor waits before trying to promote a standby after it stops replicating from the primary and no primary is available. This property is also under the postgresql-ha job properties and can be configured via the Service Instance ops file.
Parameters Influence on the Promotion Process
Before starting to promote a new primary, repmgr tries to reconnect to the current primary node.
When the reconnection window (reconnect_interval * reconnect_attempts) elapses without success, the standby nodes start electing
a new primary.
- repmgr uses postgresql-ha.cluster.status.reconnect_interval and tries to reconnect to the latest primary every 5 seconds before starting the promotion process.
- repmgr uses postgresql-ha.cluster.status.reconnect_attempts and makes 6 attempts to reconnect to the latest primary before starting the promotion process.
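As a quick worked example of the formula above, the values quoted in this section (5 seconds, 6 attempts) give a 30-second window before the standby nodes start an election. The function and constant names below are illustrative only.

```python
def election_delay_seconds(reconnect_interval: int = 5, reconnect_attempts: int = 6) -> int:
    """Time repmgr spends reconnecting before an election starts."""
    return reconnect_interval * reconnect_attempts


# Failover monitor default timeout (10 minutes) expressed in seconds,
# shown only for comparison with the repmgr reconnection window.
MONITOR_TIMEOUT_SECONDS = 10 * 60

print(election_delay_seconds())  # 30
print(election_delay_seconds() < MONITOR_TIMEOUT_SECONDS)  # True
```

With these values, repmgr gets its chance to elect and promote a standby long before the failover monitor's default timeout expires.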