Version: 55.0.0

Switchover

This document describes how the switchover operation works and how to configure this feature.

Description

a9s PostgreSQL clusters perform a switchover to reduce downtime during cluster instance updates. When an a9s PostgreSQL cluster instance is being updated, and while the current primary node is shutting down, the a9s PostgreSQL automation uses repmgr to promote a standby node to primary and stops the existing primary node. This keeps the downtime of the update as short as possible, because a standby node is promoted to primary immediately instead of waiting for repmgr to fail over once the primary has stopped.

The a9s PostgreSQL switchover is executed only when the cluster is healthy, and only while the update is being applied to the current primary node. If the switchover operation cannot be run, it is skipped, which may result in a longer downtime.

This feature is supported only for a9s PostgreSQL 13 or greater.

Switchover Requirements

  • The cluster must be healthy: all of its nodes are running and following the same primary.
  • No node is blocked, and all processes are running. This information can be monitored through the cluster's Status Script.
  • By default, the replication lag between the primary node and the standby node must not be larger than 1MB.
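The requirements above can be read as a single precondition check. The following is a minimal sketch of that reading, not the actual a9s PostgreSQL automation: the `NodeStatus` type, its fields, and the `switchover_allowed` function are hypothetical names introduced for illustration, and 1MB is assumed to mean 1024 * 1024 bytes.

```python
from dataclasses import dataclass

# Default lag limit from the requirements (assumption: 1MB = 1024 * 1024 bytes).
MAX_LAG_BYTES = 1024 * 1024


@dataclass
class NodeStatus:
    """Hypothetical snapshot of one cluster node, for illustration only."""
    name: str
    running: bool    # node and all of its processes are running
    blocked: bool    # node is blocked
    upstream: str    # primary this node follows (the primary names itself)
    lag_bytes: int   # replication lag behind the primary, in bytes


def switchover_allowed(nodes, primary, max_lag_bytes=MAX_LAG_BYTES):
    """Return True only if every requirement listed above holds."""
    for node in nodes:
        # All nodes must be running and none may be blocked.
        if not node.running or node.blocked:
            return False
        # All nodes must follow the same primary.
        if node.upstream != primary:
            return False
        # Standby lag must not be larger than the limit.
        if node.name != primary and node.lag_bytes > max_lag_bytes:
            return False
    return True
```

For example, a three-node cluster in which every standby follows `pg-0` with less than 1MB of lag passes the check, while a single standby that is 2MB behind causes it to fail, in which case the switchover would be skipped.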

Disable the Feature

The a9s PostgreSQL Switchover feature is enabled by default on all a9s PostgreSQL cluster service instances. However, at the Platform Operator's discretion, it can be disabled for an individual cluster service instance by setting the postgresql-info-webservice.switchover.enable property to false.

If you want to disable this feature for all service instances, you can add the postgresql-info-webservice.switchover.enable property to your a9s PostgreSQL Data Service manifest as a Template Custom Ops File (Inline), as shown below.

# postgresql-service.yml
...
- name: templates-uploader
  jobs:
  - name: template-uploader
    ...
    properties:
      template-uploader:
        template-custom-ops: |
          - type: replace
            path: /instance_groups/name=pg/jobs/name=postgresql-ha/properties/postgresql-info-webservice/switchover?/enable?
            value: false
...

Known Issues

  • A brief downtime still occurs. All connections established to the database are broken when the switchover process starts and must be re-established once the new primary is available.

  • The cluster update might take longer depending on the number of WAL files to be archived. When WAL file archiving is set up (e.g. on continuous archiving instances) and there is a backlog of files waiting to be archived, PostgreSQL does not complete its shutdown until all of them have been archived. The drain script handles this case: it waits as long as necessary for PostgreSQL to shut down cleanly, ensuring that the cluster is in a healthy state after the process.
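To estimate how large the archiving backlog is before an update, one option is to count the WAL segments that PostgreSQL has marked as waiting for archiving. PostgreSQL records these as `.ready` files in the `pg_wal/archive_status` directory of the data directory. The sketch below is only an illustration and is not part of the drain script: `pending_wal_segments` is a hypothetical helper, and the location of the data directory on an a9s PostgreSQL node is an assumption you need to supply yourself.

```python
from pathlib import Path


def pending_wal_segments(archive_status_dir):
    """List WAL files that are still waiting to be archived.

    archive_status_dir is the path to pg_wal/archive_status inside the
    PostgreSQL data directory (the data directory location is an assumption
    and depends on the installation).
    """
    suffix = ".ready"
    # A "<name>.ready" marker means <name> has not been archived yet;
    # "<name>.done" markers belong to files that were already archived.
    return sorted(
        path.name[: -len(suffix)]
        for path in Path(archive_status_dir).glob("*" + suffix)
    )
```

A long list returned by this helper suggests that the shutdown of the primary, and therefore the update, may take noticeably longer.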