Cluster Status

Since anynines-deployment v4.0.0, a9s PostgreSQL includes a debug script. This script helps operators properly identify the cluster state and the events currently happening.

The purpose of this document is to explain how to use this script and interpret its output, as well as how to properly identify the valid primary of the cluster.

Status Script

Each a9s PostgreSQL node runs the a9s PostgreSQL Info process alongside PostgreSQL, listening on port 63145. The main purpose of this service is to provide read-only information about the node state through an HTTP server.

There is a script under /var/vcap/jobs/postgresql-ha/bin/debug/status.sh that iterates over all PostgreSQL nodes and requests the current state of each node by executing a GET against the /v1/status endpoint. This way, executing the script from any node gives you the state of the cluster as seen from that node (this is especially important if you need to debug a network split).

You can execute the script without arguments:

$ /var/vcap/jobs/postgresql-ha/bin/debug/status.sh

The output contains information about the state of all nodes. You can check what each value represents in the tables below.
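You can also query a single node directly by performing the same request the script makes. A minimal sketch, where <node-ip> is a placeholder for the address of the node you want to inspect:

# Query the a9s PostgreSQL Info service on a single node directly;
# this is the same GET that status.sh performs against every node.
$ curl -s http://<node-ip>:63145/v1/status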

General Request Information

| Property | Type | Description |
|----------|------|-------------|
| message | string | General information identifying the request. |
| status | string | Status of this request; it can be: [succeeded]. |
| data | json | Payload with node-specific status information. |

Node-specific information (data)

| Property | Type | Description |
|----------|------|-------------|
| node_index | integer | Node index (spec.index). |
| node_ip | string | Node IP address (spec.ip). |
| id | string | Node ID (spec.id). |
| bootstrap | boolean | Whether this is the bootstrap node (spec.bootstrap). |
| firstboot | boolean | Whether this is the first boot of this node (read from the first boot lock). |
| status | string | Result of pg_isready against the local node. Possible values are: [running, rejecting_connections, no_response, no_attempt_made, unknown_error]. |
| mode | string | The mode this node is running in, extracted from pg_controldata. Possible values are: [master, standby, shutdown, shut_down_in_recovery, shutting_down, in_crash_recovery, unknown]. |
| master_in_majority_partition | boolean | Whether this is a valid primary for the majority of the cluster. |
| upstream_node_id | string | Upstream node id for this node; the node this node is replicating from. |
| failover_in_progress | boolean | Whether this node is going through a failover process, either being promoted to primary or starting to follow a recently promoted primary. |
| latest_checkpoint* | string | Latest checkpoint extracted from pg_controldata. This property is no longer used to identify the master of the cluster. |
| blocked | boolean | Whether this node was blocked by the monitor. A node is blocked when it is a primary not followed by the majority of the cluster (in which case data.master_in_majority_partition is false, for example). This property indicates whether the node can receive new connections. |
| processes | json | Status of the main processes, checked using ps. These values are used during post-start to decide whether it should wait for a node to finish cloning. Possible values are: [running, starting, cloning, stopped, unknown, error]. |
| pg_config | json | PostgreSQL configuration section. |

Process-specific information (processes)

| Property | Type | Description |
|----------|------|-------------|
| postgresql | string | Status of the PostgreSQL process running on the node. |
| repmgrd | string | Status of the repmgrd process running on the node. |

PostgreSQL configuration section (pg_config)

| Property | Type | Description |
|----------|------|-------------|
| max_connections | integer | max_connections value extracted from the PostgreSQL config file. This information is used to check whether a max_connections update is happening. Updates are currently supported only to a value greater than the current one. |

* latest_checkpoint is deprecated.
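To cross-check the value PostgreSQL is actually running with, you can ask the server itself. A minimal sketch, assuming you are logged in as vcap on the node and have sourced the helper variables (as shown in the repmgr section later in this document):

# Show the max_connections value currently in effect on this node.
$ psql -c 'SHOW max_connections;'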

Examining Status Output

This is the output of the script execution on a three-node healthy cluster:

{
  "message": "Node status",
  "status": "succeeded",
  "data": {
    "id": "0c40bd50-1743-419b-aa87-ce400e3c7e35",
    "node_index": 0,
    "node_ip": "10.244.28.56",
    "bootstrap": true,
    "firstboot": false,
    "status": "running",
    "blocked": false,
    "latest_checkpoint": "11/C58CB018",
    "mode": "master",
    "master_in_majority_partition": true,
    "upstream_node_id": "0c40bd50-1743-419b-aa87-ce400e3c7e35",
    "failover_in_progress": false,
    "processes": {
      "postgresql": "running",
      "repmgrd": "running"
    },
    "pg_config": {
      "max_connections": "1024"
    }
  }
}
{
  "message": "Node status",
  "status": "succeeded",
  "data": {
    "id": "13f921dd-32d2-46ef-8091-b3372a463158",
    "node_index": 1,
    "node_ip": "10.244.29.52",
    "bootstrap": false,
    "firstboot": false,
    "status": "running",
    "blocked": false,
    "latest_checkpoint": "11/C56E5878",
    "mode": "standby",
    "master_in_majority_partition": false,
    "upstream_node_id": "0c40bd50-1743-419b-aa87-ce400e3c7e35",
    "failover_in_progress": false,
    "processes": {
      "postgresql": "running",
      "repmgrd": "running"
    },
    "pg_config": {
      "max_connections": "1024"
    }
  }
}
{
  "message": "Node status",
  "status": "succeeded",
  "data": {
    "id": "70dabb0f-388c-4d58-8356-a0d9cf1fa0b2",
    "node_index": 2,
    "node_ip": "10.244.30.50",
    "bootstrap": false,
    "firstboot": false,
    "status": "running",
    "blocked": false,
    "latest_checkpoint": "11/C56E5878",
    "mode": "standby",
    "master_in_majority_partition": false,
    "upstream_node_id": "0c40bd50-1743-419b-aa87-ce400e3c7e35",
    "failover_in_progress": false,
    "processes": {
      "postgresql": "running",
      "repmgrd": "running"
    },
    "pg_config": {
      "max_connections": "1024"
    }
  }
}
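Each node replies with its own JSON document, so the script's output can be summarized per object. A minimal sketch, assuming jq is available on the machine where you run the script (it is not part of the release):

# Print one summary line per node: index, mode, upstream and blocked flag.
$ /var/vcap/jobs/postgresql-ha/bin/debug/status.sh \
    | jq -r '.data | "node \(.node_index): \(.mode) upstream=\(.upstream_node_id) blocked=\(.blocked)"'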

There is some information we can check right away from the output above:

  • All nodes should be in status running: It is normal for a node to go through different states while starting, but after the deployment finishes, all nodes should be running. A different case is bringing back a cluster where all three nodes were down: if the standby nodes start first, they will pass through the statuses rejecting_connections and no_response while the PostgreSQL process finishes starting. If this fails, check the PostgreSQL logs under /var/vcap/sys/log/monit.

  • The cluster must have 1 node in mode master and all others in mode standby: In a healthy cluster there must be exactly 1 master and N standby nodes following the same upstream_node_id. Note that on single-node deployments, there is only a master node.

  • All nodes belonging to the cluster are following the same upstream: It is possible to have an unknown value during first boot, but after that, all nodes should replicate from the same upstream node, the primary. A node not following the current valid upstream may be in a split-brain situation. Note that the monitor uses this information to decide whether a node should be blocked, so a master that is not followed by the majority of the nodes is blocked. In the same way, a standby node following an invalid node is also blocked.

  • The valid primary is the one followed by the majority: This means that a valid master is followed by at least (n / 2) + 1 nodes in the cluster, using integer division and including the primary following itself (see the arithmetic sketch after this list).

  • No node should be blocked: It is normal for a standby node recently added to the cluster to be blocked for a period (currently, the default is 60s), but after this period the node should be unblocked. Any other situation means that the cluster is not healthy. A blocked node is not necessarily the wrong node in the cluster; we can have a standby node failing to restart, or even a standby that for some unforeseen reason fails to follow the new primary. In case a node is blocked, check whether all nodes in the cluster are running and following the same upstream, and check the a9s PostgreSQL Info log file: /var/vcap/sys/log/postgresql-ha/postgresql-info-webservice.err.log.

  • The primary DNS alias must point to the correct primary: The alias should point to the address of the correct primary of this cluster and agree with the information shown by the debug/status.sh script. Since anynines-deployment v4.13.0, primary alias monitoring is enabled by default; it corrects the alias in case it points to a node that is not the valid primary of the cluster. This means that, for the alias to be updated, the /v1/master endpoint must return the status of the valid primary of the majority.

  • All processes must be running: Even when BOSH shows PostgreSQL and repmgrd as running, this status can report a different value. This happens because, while the startup script is still running, waiting for cloning, or starting the process, BOSH already reports the status running, but this endpoint shows the process as cloning or starting. In a healthy cluster, all processes listed there should be running. If the status is different from running or cloning, check the logs for PostgreSQL and repmgrd under /var/vcap/sys/log/monit/.
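As mentioned in the list above, the majority threshold uses integer division. A quick sketch of the arithmetic in shell:

# Majority threshold (n / 2) + 1 with integer division:
# n=3 -> 2 followers, n=5 -> 3 followers (the primary following
# itself counts towards this number).
$ n=3; echo $(( n / 2 + 1 ))
2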

Identifying Current Valid Primary of the Cluster

Use the method below that best fits your case, and combine more than one approach before deciding.

0. Using a9s PostgreSQL Info (/v1/master)

a9s PostgreSQL Info includes the GET /v1/master endpoint, which provides the status of the current valid primary of the cluster. It requests the current status from every node of the cluster; with this data an election is made, and the primary (mode master) that has the most nodes following it is identified as the primary of the cluster.

This endpoint can also respond with No Quorum when the node receiving the request cannot reach at least (n / 2) + 1 nodes of the cluster (including itself), or if no primary has at least (n / 2) + 1 standby nodes replicating from it.

This will always provide the current valid master of the cluster, even if the primary alias is pointing to a different node.
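As with /v1/status, you can issue this request against any node. A minimal sketch, where <node-ip> is any reachable node of the cluster:

# Ask one node which primary the majority of the cluster agrees on.
$ curl -s http://<node-ip>:63145/v1/master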

1. Using repmgr

When you can still connect to the PostgreSQL cluster, the safest way to discover which node is the primary is to connect to the repmgr_cluster database on each node.

$ bosh -d <deployment> ssh pg/X
$ sudo su - vcap
$ source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
$ psql repmgr_cluster

Once connected to the repmgr_cluster database, you can execute the following query:

On a9s PostgreSQL 11 and later versions:

repmgr_cluster=# SELECT * FROM repmgr.nodes;

On a9s PostgreSQL 10 and earlier versions:

repmgr_cluster=# SELECT * FROM repmgr_cluster.repl_nodes;

Then analyze the fields id (node_id), type, upstream_node_id, and name (node_name), where name is the current node id (spec.id), id is spec.index + 1, type is the node role (primary or standby), and upstream_node_id is the id of the upstream node.

Compare the results from all nodes, since a node in a split-brain partition can show out-of-date information.
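If you prefer to run the query non-interactively, the steps above can be combined into a single command per node. A sketch for a9s PostgreSQL 11 and later, following the bosh ssh convention used elsewhere in this document:

# Query repmgr's node table on one node without an interactive session;
# repeat for every instance and compare the answers.
$ bosh -d <deployment> ssh pg/0 "sudo su - vcap -c '. /var/vcap/jobs/postgresql-ha/helpers/vars.sh; psql repmgr_cluster -c \"SELECT node_id, node_name, type, upstream_node_id FROM repmgr.nodes;\"'"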

2. Using the Status Script

When executing the /var/vcap/jobs/postgresql-ha/bin/debug/status.sh script, we can look at the mode property on every node to find the one running as master. If the cluster is not healthy and more than one node presents itself as master, we can look at the upstream_node_id of all nodes and choose the one that has the most standby nodes replicating from it.
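To make this election visible at a glance, you can count how many nodes report each upstream. A sketch, again assuming jq is available:

# Count followers per upstream id; the id with the highest count
# (the majority) identifies the valid primary.
$ /var/vcap/jobs/postgresql-ha/bin/debug/status.sh \
    | jq -r '.data.upstream_node_id' | sort | uniq -c | sort -rn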

3. Check upstream_node.lock File

a9s PostgreSQL Info gets the upstream_node_id value from the upstream_node.lock file located at /var/vcap/store/postgresqlXX-locks/upstream_node.lock. So, when a9s PostgreSQL Info is not available, you can go through every node of the cluster and check the content of this file, then use this information as described above. A node containing unknown means that, right after executing repmgr register, repmgr promote, or repmgr follow, the startup script could not retrieve this information from the primary. This is a critical error.
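A sketch for inspecting the file on a node via BOSH (XX stands for the PostgreSQL major version, as in the path above):

# Print the upstream node id recorded on one instance; repeat for
# every node of the cluster and compare.
$ bosh -d <deployment> ssh pg/<instance-index-or-id> "sudo cat /var/vcap/store/postgresqlXX-locks/upstream_node.lock"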

4. Check the Key-Value Store (a9s PostgreSQL 9.4 only)

a9s PostgreSQL 9.4 stores the primary node of the cluster in Consul. Query for the key against any of the running Consul instances in your consul-dns deployment:

$ curl http://<consul-node-ip>:8500/v1/kv/<deployment-name>-psql-master-hostname?raw; echo

Example for a9s-pg:

$ curl http://10.244.5.15:8500/v1/kv/a9s-pg-psql-master-hostname?raw; echo

This will give you something like pg/0, which is the master node (<instance-group>/<instance-index>). When the answer to the request is empty or failover_in_progress, a different method to identify the master node must be used (see the following sections).

Forcing a change of this value manually can cause destructive actions, including data loss. Do not change it unless you are completely sure which node is the correct primary, and verify the data directory using pg_controldata to make sure the data belongs to a primary instead of a standby (PostgreSQL internally uses a different startup sequence for each case, therefore some action needs to be taken in order to promote a standby manually).

5. Check the Logs

You first need to identify which node was the primary when each node started by checking the /var/vcap/sys/log/postgresql-ha/pre-start.stdout.log file. On every node, check for the following line:

6838: Tue Jul  9 23:30:55 UTC 2019 - [debug][pre-start:254][init_standby_node] - Trying to find the master - master_id: d1df087d-bcea-406e-bdf5-17693a63e28a

Then check /var/vcap/sys/log/postgresql-ha/repmgr_follow_promote.log to trace the promote-follow history (keep the timestamps in mind to establish precedence); from there you can identify which node was promoted last. Compare the observations with /var/vcap/sys/log/postgresql-ha/repmgrd.log (/var/vcap/sys/log/postgresql-ha/repmgrd.err.log on a9s PostgreSQL 11). In the repmgrd log, look for a line like:

[2018-02-13 19:24:54] [NOTICE] STANDBY PROMOTE successful

If you cannot find this file, the node was never promoted to primary.

If you cannot find this file on any of the nodes, chances are that a failover never happened and the bootstrap node is still the master.
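A sketch for collecting both pieces of evidence on a node (run it on every node; on a9s PostgreSQL 11 use repmgrd.err.log instead of repmgrd.log):

# Which primary did this node see when it started?
$ grep 'Trying to find the master' /var/vcap/sys/log/postgresql-ha/pre-start.stdout.log
# Was this node ever promoted?
$ grep 'STANDBY PROMOTE successful' /var/vcap/sys/log/postgresql-ha/repmgrd.log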

6. Check Control Data

PostgreSQL keeps some information regarding the state of the cluster in its files. One way to access this information is by using the pg_controldata binary shipped with this release.

If the cluster was correctly stopped (this should be the case during a failed upgrade), the information about the last role of a node can be found using this approach. Of course, during an upgrade, multiple nodes can be stopped as primary, but this gives one more piece of information for deciding which node is the primary.

You can get this information by executing the following command against every node in the cluster:

$ bosh -d <deployment-name> ssh pg/<instance-index-or-id> "sudo /var/vcap/packages/postgresql94/bin/pg_controldata /var/vcap/store/postgresql94 | grep 'Database cluster state'"

Example for a9s-pg, node 2:

$ bosh -d a9s-pg ssh pg/2 "sudo /var/vcap/packages/postgresql94/bin/pg_controldata /var/vcap/store/postgresql94 | grep 'Database cluster state'"

The output depends on the node's role and state. If the node is a standby and down, you should see:

Database cluster state:               shut down in recovery

If the node is a primary and down, you should see:

Database cluster state:               shut down

If the node is a primary and running, you should see:

Database cluster state:               in production

If the node is a standby and running, you should see:

Database cluster state:               in archive recovery

Primaries will have the state either in production or shut down. If you see both states in one cluster, analyze the logs to identify the last primary.
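A sketch for checking all nodes of a three-node cluster in one pass, following the same command as above:

# Print the cluster state for pg/0, pg/1 and pg/2; adjust the range
# and the binary path to your deployment's PostgreSQL version.
$ for i in 0 1 2; do
    bosh -d <deployment-name> ssh pg/$i \
      "sudo /var/vcap/packages/postgresql94/bin/pg_controldata /var/vcap/store/postgresql94 | grep 'Database cluster state'"
  done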