Failover Monitor
This document describes how the failover monitor works and how to configure this feature.
Failover Monitor
a9s PostgreSQL clusters fail over using repmgr.
repmgr detects that a primary is no longer reachable, elects a standby, and promotes this standby to be the new primary.
Some versions of repmgr do not retry a failed promotion. In some scenarios this can leave a cluster without a promoted primary, or repmgr can fail to decide which node should be promoted even though the scenario is valid.
a9s PostgreSQL Failover Monitor
a9s PostgreSQL collocates a postgresql-info-webservice process on each node alongside PostgreSQL. The postgresql-info-webservice continuously monitors the state of the cluster. The failover monitor checks whether repmgr failed to take action during a failure: it verifies that no primary is available and that none of the standby nodes is replicating. When this happens, it waits for a given timeout (default 10 min). After the timeout, it picks the valid standby with the most advanced checkpoint and promotes this node to primary.
If repmgr does not automatically start following the new primary, the failover monitor tries to make the node follow the new primary.
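The candidate selection described above can be sketched in a few lines. This is only an illustration, not the monitor's actual code: the `Standby` type, its `valid` flag, and the idea of comparing checkpoints as PostgreSQL LSN strings are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Standby:
    name: str
    valid: bool          # hypothetical flag: node passed the monitor's validity checks
    checkpoint_lsn: str  # assumed representation: PostgreSQL LSN text, e.g. "0/3000060"


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' into a comparable integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def pick_promotion_candidate(standbys: List[Standby]) -> Optional[Standby]:
    """Return the valid standby with the most advanced checkpoint, or None."""
    candidates = [s for s in standbys if s.valid]
    if not candidates:
        return None
    return max(candidates, key=lambda s: lsn_to_int(s.checkpoint_lsn))


nodes = [
    Standby("pg/0", valid=True, checkpoint_lsn="0/3000060"),
    Standby("pg/1", valid=True, checkpoint_lsn="0/5000028"),
    Standby("pg/2", valid=False, checkpoint_lsn="1/0"),  # ahead, but not valid
]
chosen = pick_promotion_candidate(nodes)
print(chosen.name)  # pg/1
```

Invalid nodes are filtered out before the comparison, so a node that is further ahead but not valid is never promoted.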
This monitor runs only on cluster deployments, and it takes action only in the following scenarios:
The majority of valid standby nodes time out without a promotion: the majority of the standby nodes were following the same primary, that primary is not reachable, the standby nodes have therefore not been replicating for the given timeout, and no failover is in progress. In this case the monitor executes repmgr promote.
Only one primary is reachable, so an attempt to follow it is made: if there is a running primary that is not the upstream of a standby node, the monitor tries to fix the upstream with repmgr follow. This can fail if this is not a valid primary for this node or if the replication slot has already been deleted (default 3 h after the standby stopped replicating).
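The two scenarios can be read as a small decision function. The sketch below is a hypothetical model rather than the monitor's implementation: `ClusterView`, its fields, and the strict-majority rule are assumptions introduced for illustration, and the cases the text does not cover (for example more than one reachable primary) simply result in no action.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Action(Enum):
    NONE = "none"
    PROMOTE = "repmgr promote"
    FOLLOW = "repmgr follow"


@dataclass
class ClusterView:
    # All fields are hypothetical inputs as seen from one standby node.
    reachable_primaries: List[str]
    local_upstream: str        # node this standby currently follows
    valid_standbys: int
    timed_out_standbys: int    # valid standbys not replicating for the whole timeout
    failover_in_progress: bool


def decide(view: ClusterView) -> Action:
    """Map the observed cluster state to one of the documented actions."""
    if len(view.reachable_primaries) == 1:
        # Exactly one running primary: fix the upstream if it differs.
        if view.local_upstream != view.reachable_primaries[0]:
            return Action.FOLLOW
        return Action.NONE
    if view.reachable_primaries or view.failover_in_progress:
        # Several primaries or an ongoing failover: not covered, do nothing.
        return Action.NONE
    # No primary reachable: promote only if a strict majority timed out.
    if view.timed_out_standbys * 2 > view.valid_standbys:
        return Action.PROMOTE
    return Action.NONE


print(decide(ClusterView([], "pg/0", 2, 2, False)).value)        # repmgr promote
print(decide(ClusterView(["pg/1"], "pg/0", 2, 0, False)).value)  # repmgr follow
```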
Configuring a9s PostgreSQL Failover Monitor
postgresql-info-webservice.enable_failover_monitoring
The failover monitor is enabled by default. It can be disabled by setting the property postgresql-info-webservice.enable_failover_monitoring under the postgresql-ha job properties in the service instance ops file.
postgresql-info-webservice.promotion_monitor.timeout
The time that the monitor waits before trying to promote a standby after it stops replicating from the primary and no primary is available. This property is under the postgresql-ha job properties and can be configured via the service instance ops file.
Parameters Influence on the Promotion Process
Before starting to promote a new primary, repmgr tries to reconnect to the current primary node. When the timeout (reconnect_interval * reconnect_attempts) is reached, the standby nodes start electing a new primary.
repmgr uses postgresql-ha.cluster.status.reconnect_interval and tries to reconnect to the latest primary every 5 seconds before starting the promotion process.
repmgr uses postgresql-ha.cluster.status.reconnect_attempts and makes 6 attempts to reconnect to the latest primary before starting the promotion process.
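As a quick worked example of the timeout formula, the snippet below multiplies the two values stated above (5 seconds, 6 attempts). The function name is hypothetical, and the result is only the nominal window; the time each connection attempt itself takes is not modelled.

```python
def detection_window_seconds(reconnect_interval: int, reconnect_attempts: int) -> int:
    """Nominal seconds repmgr keeps retrying the old primary before an election."""
    if reconnect_interval < 0 or reconnect_attempts < 0:
        raise ValueError("both values must be non-negative")
    return reconnect_interval * reconnect_attempts


window = detection_window_seconds(reconnect_interval=5, reconnect_attempts=6)
print(f"election starts after about {window} seconds")  # about 30 seconds
```

Raising either property lengthens the time before the standby nodes start an election, and lowering either one shortens it.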