
Split Brain

This document describes the potential causes that can lead an a9s PostgreSQL cluster into a split brain scenario and the steps to fix it.

A split brain is a rare scenario in which the a9s PostgreSQL cluster splits into two individual clusters. It is important to be aware of such situations, as a split brain not only impacts the High Availability of your instance, but can also lead to conflicting decisions and an overall state of inconsistency.

Possible Causes

It is possible for an a9s PostgreSQL cluster to split when there is a network partition or partial outage and the primary node can't be reached by the standby nodes. As a result, one of the standby nodes is promoted to primary even though a primary node already exists and is operational.
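A quick way to confirm a suspected split brain, assuming PostgreSQL is reachable on every node, is to ask each node whether it considers itself a primary. The built-in function pg_is_in_recovery() returns false on a primary and true on a standby, so more than one node answering false indicates a split. The psql path and database user below are assumptions and may differ in your deployment:

# Run against each pg node; more than one "f" (false) means the cluster is split.
# The psql path and user are assumptions; adjust them to your deployment.
$ bosh -d <deployment> ssh pg/<index> -c \
    "/var/vcap/packages/postgresql/bin/psql -U vcap -d postgres -c 'SELECT pg_is_in_recovery();'"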

This can also be caused by missing disk space on the primary node or other IaaS issues that make the primary node unresponsive and lead to cluster separation. a9s PostgreSQL includes a mechanism that protects the cluster when a split happens: the primary in the split partition is blocked from receiving new writes in order to prevent further damage.

Fix the Split Brain

When a cluster split happens, manual intervention is necessary to fix the issue. The newly promoted primary node must be restarted in order to allow it to rejoin the cluster.

First, the new primary node and the original primary node must be identified. To do so, follow the steps described in the Identifying Current Valid Primary of the Cluster section.
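In addition to the steps in that section, the nodes' write-ahead log timelines can help to tell the two primaries apart: PostgreSQL increments the timeline ID on every promotion, so the newly promoted primary reports a higher timeline than the original one. A minimal check, using the same assumed psql path and user as above:

# The node reporting the higher timeline_id is the newly promoted primary.
$ bosh -d <deployment> ssh pg/<index> -c \
    "/var/vcap/packages/postgresql/bin/psql -U vcap -d postgres -c 'SELECT timeline_id FROM pg_control_checkpoint();'"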

Afterwards, the original primary node must be inspected to determine whether it is operable or has problems, for example with disk space. If the original primary node is not running properly, the issue must be fixed before taking any further steps.
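For example, the free space on a node's persistent disk can be checked directly. /var/vcap/store is the conventional BOSH mount point for the persistent disk; verify the path in your deployment:

# Inspect the free space on the persistent disk of the original primary node.
$ bosh -d <deployment> ssh pg/<index> -c 'df -h /var/vcap/store'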

Once all nodes have been identified and the original primary node is confirmed to be functional, the second primary node can be restarted to let the nodes reconnect to the original primary node:

$ bosh -d <deployment> restart pg/<index>

As a final step, the state of the cluster must be inspected according to the Identifying Current Valid Primary of the Cluster section to ensure the cluster has been fixed.
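As an additional sanity check, BOSH can confirm that all instances and their processes are reported as running after the restart:

# All pg instances and their monitored processes should show "running".
$ bosh -d <deployment> instances --ps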

Cluster Recovery

caution

Make sure to leave the correct primary node untouched throughout the whole process and only apply the described steps to the second primary node and the secondary nodes.

If the split can't be fixed by restarting the second primary node, a cluster recovery can restore the cluster. The steps to execute a cluster recovery are described in the Cluster recovery page.