Switchover
This document describes how the switchover operation works and how to configure this feature.
Description
a9s PostgreSQL clusters perform a switchover to reduce downtime during cluster instance updates.
When an a9s PostgreSQL cluster instance is being updated, and while the current primary node is shutting down, the a9s PostgreSQL automation uses repmgr to promote a standby node to primary, and the existing primary node is stopped.
This keeps downtime during the update to a minimum, because a standby node is promoted to primary immediately instead of waiting for repmgr to fail over once the primary has stopped.
The a9s PostgreSQL switchover is executed only when the cluster is healthy, and only while the update is running on the current primary node. If the switchover operation fails to run, it is skipped. As a consequence, the update might cause a longer downtime.
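The flow described above can be sketched as follows. This is only an illustration of the ordering, not the actual automation: `cluster_healthy`, `promote_standby` and `stop_primary` are hypothetical callables standing in for the health check, the repmgr-driven promotion, and the node stop.

```python
def update_primary(cluster_healthy, promote_standby, stop_primary):
    """Return "switchover" if a standby was promoted first, else "failover".

    All three arguments are hypothetical callables; none of them is part
    of a9s PostgreSQL or repmgr.
    """
    # Fallback path: repmgr only reacts once the primary has stopped.
    path = "failover"
    if cluster_healthy():
        try:
            promote_standby()
            path = "switchover"
        except Exception:
            # A failed switchover is skipped; the update still goes on.
            pass
    stop_primary()
    return path
```

In this sketch the primary is stopped in every case; the only difference is whether a new primary already exists at that point.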
This feature is supported only for a9s PostgreSQL 13 or greater.
Switchover Requirements
- A healthy cluster has all its nodes running and following the same primary.
- No node should be blocked, and all processes are running. This information can be monitored through the cluster's Status Script.
- By default, the replication lag between the primary node and the standby node must not exceed 1MB.
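As a rough illustration, the requirements above could be checked as in the sketch below. The `NodeStatus` structure and its field names are hypothetical (for example, values collected by hand from the Status Script), and the `1MB` threshold is assumed here to mean 1 MiB, which this document does not state.

```python
from dataclasses import dataclass

# Assumption: "1MB" is interpreted as 1 MiB; the exact unit is not documented here.
MAX_LAG_BYTES = 1024 * 1024


@dataclass
class NodeStatus:
    """Hypothetical per-node view of the cluster state."""
    name: str
    running: bool
    blocked: bool
    primary: str         # name of the node this node treats as primary
    lag_bytes: int = 0   # replication lag behind the primary (0 on the primary)


def switchover_allowed(nodes, max_lag_bytes=MAX_LAG_BYTES):
    """Return (ok, reasons); ok is True only if every requirement holds."""
    reasons = []
    for node in nodes:
        if not node.running:
            reasons.append(f"{node.name} is not running")
        if node.blocked:
            reasons.append(f"{node.name} is blocked")
        if node.lag_bytes > max_lag_bytes:
            reasons.append(f"{node.name} lags by {node.lag_bytes} bytes")
    # All nodes, the primary included, must agree on who the primary is.
    if len({node.primary for node in nodes}) > 1:
        reasons.append("nodes follow different primaries")
    return (not reasons, reasons)
```

Returning the list of reasons alongside the boolean makes it easier to see which requirement caused a switchover to be skipped.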
Disable The Feature
The a9s PostgreSQL Switchover feature is enabled by default on all a9s PostgreSQL cluster service instances. However, at the Platform Operator's discretion, it can be disabled by setting the postgresql-info-webservice.switchover.enable property to false in individual cluster service instances.
If you want to disable this feature for all service instances, you can add the postgresql-info-webservice.switchover.enable property to your a9s PostgreSQL Data Service manifest using Template Custom Ops Files (Inline), as shown below.
# postgresql-service.yml
...
  - name: templates-uploader
    jobs:
      - name: template-uploader
        ...
        properties:
          template-uploader:
            template-custom-ops: |
              - type: replace
                path: /instance_groups/name=pg/jobs/name=postgresql-ha/properties/postgresql-info-webservice/switchover?/enable?
                value: false
...
Known Issues
A brief downtime still occurs. All connections established to the database are broken when the switchover process starts and can be re-established once the new primary is available.
The cluster update might take longer depending on the number of WAL files to be archived. When WAL file archiving is set up (e.g. on continuous archiving instances) and there is a backlog of files waiting to be archived, PostgreSQL will not complete its shutdown until all of them have been archived. The drain script handles this by waiting as long as necessary for PostgreSQL to shut down cleanly, ensuring that the cluster is in a healthy state after the process.
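To get a feel for how large such a backlog is, the pending segments can be counted. The sketch below is not the drain script, whose internals are not described here. It relies on PostgreSQL marking each WAL segment that still awaits archiving with a `.ready` file in `pg_wal/archive_status`; the function names, the timeout, and the polling interval are assumptions made for illustration.

```python
import time
from pathlib import Path


def pending_wal_segments(archive_status_dir):
    """Count WAL segments still waiting to be archived.

    PostgreSQL creates `<segment>.ready` in pg_wal/archive_status for each
    segment to archive and renames it to `<segment>.done` afterwards.
    """
    return sum(1 for _ in Path(archive_status_dir).glob("*.ready"))


def wait_for_archive_backlog(archive_status_dir, timeout_s=600.0, poll_s=1.0):
    """Poll until nothing is pending; return False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while pending_wal_segments(archive_status_dir) > 0:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True
```

A large count before an update suggests that the shutdown of the primary, and therefore the update as a whole, will take correspondingly longer.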