Version: 58.0.0

Cluster Status

Since anynines-deployment v4.0.0, a9s PostgreSQL includes a debug script. This script should help operators to properly identify the cluster state and the events currently happening.

The purpose of this document is to explain how to use and the output of this script, as well as how to properly identify the valid primary of the cluster.

Status Script

a9s PostgreSQL collocates the a9s PostgreSQL Info process, which listens on port 63145. The main purpose of this service is to provide read-only information about the node state through an HTTP server.

There is a script under /var/vcap/jobs/postgresql-ha/bin/debug/status.sh that iterates over all PostgreSQL nodes and requests the current state of the node, by executing a GET against the /v1/status endpoint. This way, executing this script from any node will give you the state of the cluster as seen from that node (this is important especially if you need to debug a network split).

You can execute the script without arguments:

$ /var/vcap/jobs/postgresql-ha/bin/debug/status.sh

Where the output contains information about the state of all nodes.

The script can also be executed with /v1/replication_status as an argument in order to check the replication.

$ /var/vcap/jobs/postgresql-ha/bin/debug/status.sh /v1/replication_status

The script will then retrieve the checkpoint's information from the PostgreSQL process live (with the exception of the primary, as in this case it shows when it was promoted). The most relevant information there is upstream_node_id and repmgr_node_id.

As this information is extracted from the local repmgr database from each node it can be used to compare with the output of the default execution of the script if they diverge somehow, we strongly recommend looking into the logs in order to understand what happened with the cluster.

Keep in mind that when executed for single instances the output falls back to the default output, as replication status only makes sense for a cluster setting.

You can refer to the tables below in order to get detailed descriptions of what each value represents, for both types of execution.

General Request Information

Property	Type	Description
`message`	`string`	General information identifying the request.
`status`	`string`	Status of this request, it can be: [`success`].
`data`	`json`	Payload with node-specific status information.

Node specific information (data)

Property	Type	Description
`node_index`	`integer`	Node index (`spec.index`)
`node_ip`	`string`	Node IP address (`spec.ip`)
`id`	`string`	Node ID (`spec.id`)
`bootstrap`	`boolean`	If this is the bootstrap node (`spec.bootstrap`)
`firstboot`	`boolean`	If this is the first boot of this node (reads from first boot lock)
`status`	`string`	Result of pg_isready against the local node. Possible values are: [`running`, `rejecting_connections`, `no_response`, `no_attempt_made`, `unknown_error`]
`mode`	`string`	The mode this node is running extracted from pg_controldata. Possible values are: [`master`, `standby`, `shutdown`, `shut_down_in_recovery`, `shutting_down`, `in_crash_recovery`, `unknown`]
`master_in_majority_partition`	`boolean`	If this is a valid primary for the majority of the cluster.
`upstream_node_id`	`string`	Upstream node id for this node. Which node this node is replicating from
`failover_in_progress`	`boolean`	If this node is going through a failover process, whether being promoted to primary or starting to follow a recently promoted primary.
`upstream_node_id`	`string`	Upstream node id for this node. Which node this node is replicating from
`latest_checkpoint`*	`string`	Latest checkpoint extracted from `pg_controldata`, the string represents a LSN (Log Sequence Number). This property is not used to identify the master of the cluster anymore.
`upstream_node_id`	`string`	Upstream node id for this node. Which node this node is replicating from.
`blocked`	`boolean`	If this node was blocked by the monitor. The node is blocked when it is a primary not followed by the majority of the cluster, in this case, `data.master_in_majority_partition` is `false`, for example. This property is used to inform if the node can receive new connections.
`processes`	`json`	This section identifies the status of the main processes. This check is done using the `ps`. And they are used during the post-start to decide if it should wait for a node to finish cloning. Possible values for these properties are: [`running`, `starting`, `cloning`, `stopped`, `unknown`, `error`]
`pg_config`	`json`	PostgreSQL configuration section.

Processes specific information (processes)

Property	Type	Description
`postgresql`	`string`	Status of the PostgreSQL process running on the node.
`repmgrd`	`string`	Status of the repmgrd process running on the node.

PostgreSQL configuration section (pg_config)

Property	Type	Description
`max_connections`	`integer`	`max_connections` value extracted from the PostgreSQL config file. Information is used to check if a `max_connections` update is happening. We currently support updates only to a greater value than the current one.

Replication status information

Property	Type	Description
`latest_checkpoint`**	`string`	Latest checkpoint extracted from pg_last_wal_replay_lsn, the string represents a LSN (Log Sequence Number). Last write-ahead log location replayed during recovery. This number increases only when the node is in recovery, therefore for the primary this number should remain the same across time. Except when it goes through recovery mode (e.g. `PITR`).
`standby_latest_receive_wal`	`string`	Latest checkpoint extracted from pg_last_wal_receive_lsn for standby nodes, the string represents a LSN (Log Sequence Number). Get last write-ahead log location received and synced to disk by streaming replication. This number increases only when the node is in recovery, therefore for the primary this number should have the `empty_value_for_master_node` value. Moreover, eventually, this value might be bigger than `latest_checkpoint`.
`pg_mode`	`string`	The mode of the node. The value can be either `master`, `shutdown` or `standby`.
`repmgr_node_id`	`integer`	Repmgr node id for this node. In the case of the master node it is the same value as `upstream_node_id`. Fetched from the repmgr database
`standby_age`	`integer`	Indicates for how long this node is acting as a standby. This value has two special values `-1` and `-2`, single instances will have the `-2` value while on the cluster setting only the first primary (bootstrapped) will have `-2`, the primary after the first failover will have `-1` however. Fetched from the repmgr database
`upstream_node_id`	`integer`	Upstream node id for this node. Specifies which node this node is replicating from. Fetched from the repmgr database

* latest_checkpoint is deprecated for the /v1/status endpoint.

** It should be noted that, even though latest_checkpoint is a deprecated value within the default version of the script, it is still a valid source of truth when fetched by running the script with /v1/replication_status as the parameter. This information is live and processed via a different query, thus it is actual, factual and valid.

Examining Status Output

Depending on how the script was executed, the response will differ in certain properties. In order to address this, the information will be divided into two categories: Cluster Status (Default) and Replication Status.

Cluster Status (Default)

This is the output of the script execution on a three-node healthy cluster:

{
  "message": "Node status",
  "status": "succeeded",
  "data": {
    "id": "0c40bd50-1743-419b-aa87-ce400e3c7e35",
    "node_index": 0,
    "node_ip": "10.244.28.56",
    "bootstrap": true,
    "firstboot": false,
    "status": "running",
    "blocked": false,
    "latest_checkpoint": "11/C58CB018",
    "mode": "master",
    "master_in_majority_partition": true,
    "upstream_node_id": "0c40bd50-1743-419b-aa87-ce400e3c7e35",
    "failover_in_progress": false,
    "processes": {
      "postgresql": "running",
      "repmgrd": "running"
    },
    "pg_config": {
      "max_connections": "1024"
    }
  }
}
{
  "message": "Node status",
  "status": "succeeded",
  "data": {
    "id": "13f921dd-32d2-46ef-8091-b3372a463158",
    "node_index": 1,
    "node_ip": "10.244.29.52",
    "bootstrap": false,
    "firstboot": false,
    "status": "running",
    "blocked": false,
    "latest_checkpoint": "11/C56E5878",
    "mode": "standby",
    "master_in_majority_partition": false,
    "upstream_node_id": "0c40bd50-1743-419b-aa87-ce400e3c7e35",
    "failover_in_progress": false,
    "processes": {
      "postgresql": "running",
      "repmgrd": "running"
    },
    "pg_config": {
      "max_connections": "1024"
    }
  }
}
{
  "message": "Node status",
  "status": "succeeded",
  "data": {
    "id": "70dabb0f-388c-4d58-8356-a0d9cf1fa0b2",
    "node_index": 2,
    "node_ip": "10.244.30.50",
    "bootstrap": false,
    "firstboot": false,
    "status": "running",
    "blocked": false,
    "latest_checkpoint": "11/C56E5878",
    "mode": "standby",
    "master_in_majority_partition": false,
    "upstream_node_id": "0c40bd50-1743-419b-aa87-ce400e3c7e35",
    "failover_in_progress": false,
    "processes": {
      "postgresql": "running",
      "repmgrd": "running"
    },
    "pg_config": {
      "max_connections": "1024"
    }
  }
}

Replication Status

This is the output of the script execution, with /v1/replication_status as a parameter, on a three-node healthy cluster:

{
  "message": "Node Replication Stat",
  "status": "success",
  "data": {
    "node": {
      "id": "e4c623b8-5367-44a0-961b-442a3670935d",
      "index": 0,
      "ip": "10.244.28.50"
    },
    "pg_mode": "master",
    "upstream_node_id": "1",
    "repmgr_node_id": "1",
    "standby_age": -2,
    "latest_checkpoint": null,
    "standby_latest_receive_wal": "empty_value_for_master_node",
    "failover_in_progress": false,
    "processes": {
      "postgresql": "running",
      "repmgrd": "running"
    },
    "pg_config": {
      "version": 11,
      "max_connections": "128"
    }
  }
}
{
  "message": "Node Replication Stat",
  "status": "success",
  "data": {
    "node": {
      "id": "b76d9f58-943a-4162-b991-81e2279606a3",
      "index": 2,
      "ip": "10.244.28.56"
    },
    "pg_mode": "standby",
    "upstream_node_id": "1",
    "repmgr_node_id": "3",
    "standby_age": 3296,
    "latest_checkpoint": "0/5133F68",
    "standby_latest_receive_wal": "0/5133F78",
    "failover_in_progress": false,
    "processes": {
      "postgresql": "running",
      "repmgrd": "running"
    },
    "pg_config": {
      "version": 11,
      "max_connections": "128"
    }
  }
}
{
  "message": "Node Replication Stat",
  "status": "success",
  "data": {
    "node": {
      "id": "f98ef63a-91a2-4829-9d88-e52d67703c33",
      "index": 1,
      "ip": "10.244.28.52"
    },
    "pg_mode": "standby",
    "upstream_node_id": "1",
    "repmgr_node_id": "2",
    "standby_age": 3059,
    "latest_checkpoint": "0/5133F68",
    "standby_latest_receive_wal": "0/5133F78",
    "failover_in_progress": false,
    "processes": {
      "postgresql": "running",
      "repmgrd": "running"
    },
    "pg_config": {
      "version": 11,
      "max_connections": "128"
    }
  }
}

Values of Interest

There is some information that we can check rightway from the output above:

All nodes should be in status running: It is normal that the node gets into different states while starting, but after deployment finishes, all nodes should be running. You can find a different case when bringing back a cluster where the three nodes were down, in this case, if the standby nodes start first, it will pass through the status rejecting_connections and no_response while the PostgreSQL process finishes starting. If this fails, please check the PostgreSQL logs under /var/vcap/sys/log/monit.
Must have 1 node in mode master, and all others on mode standby: In a healthy cluster all nodes must have 1 master and N standby nodes following the same upstream_node_id. Note that on single deployments, we only have a master node.
All nodes belonging to the cluster must be following the same upstream: It is possible to have unknown value during first boot, but after that, all nodes should replicate from the same upstream node, the primary. A node not following the current valid upstream may be in a split-brain situation. Note that the monitor uses this information to decide if a node should be blocked or not, so a master that is not followed by the majority of the nodes, is blocked. The same way a standby node following an invalid node, is also blocked. To validate if the cluster is in a healthy state, you can run both the script for the cluster status (the default execution) and the script for the replication status (using /v1/replication_status as parameter), then compare the values for the upstream_node_id.
The valid primary is the one followed by the majority: This means that a valid master is followed by at least (n / 2) + 1 nodes in the cluster, this includes the primary following itself.
No node should be blocked: It is normal that a standby node recently added in the cluster to be blocked for a period. Currently, default is 60s. But after this period the node should be unblocked. Any other situation means that the cluster is not healthy. A blocked node is not necessarily the wrong node in the cluster, we can have a standby node failing to restart or even a standby that for any unforeseen reason fails to follow the new node. In case a node is blocked, check if all nodes in the cluster are running and following the same upstream and the a9s PostgreSQL Info log file: /var/vcap/job/sys/log/postgresql-ha/postgresql-info-webservice.err.log.
Primary DNS Alias must point to the correct primary: This means the alias should point to the address of the correct primary of this cluster, it should agree with the information shown by the debug/status.sh script. Since anynines-deployment v4.13.0 primary alias monitoring is enabled by default, and it should correct the alias in case it is pointing to a different node that is not the valid primary of the cluster. This means that for the alias to be updated /v1/master endpoint should return the status of the valid primary of the majority.
All processes must be running: Even when BOSH shows PostgreSQL and repmgrd running, this status can provide a different value. This happens because before starting the process the script is running, waiting for cloning or starting, for example, BOSH will report the status running, but this endpoint will show the process as cloning or starting. In a healthy cluster, all processes listed there should be running. If the status is different from running or cloning, check the logs for PostgreSQL and repmgrd under /var/vcap/sys/log/monit/.
The value of latest_checkpoint must change for the standby nodes When using the script with /v1/replication_status as a parameter, the value of latest_checkpoint must increase for every execution. If the value remains unchanging, then the replication process is broken. This, however, does not apply to the primary node, as by definition is not in standby mode.

Identifying Current Valid Primary of the Cluster

Use the method below that better fits your case and use more than one approach to decide.

0. Using a9s PostgreSQL Info (`/v1/master`)

a9s PostgreSQL Info includes the GET /v1/master endpoint. This endpoint provides the status of the current valid primary of the cluster. It will request to every node of the cluster for its current status, with this data an election is made and the primary (mode master) that has more nodes following, is identified as primary of the cluster.

This endpoint can also respond with No Quorum, when the node receiving the request can not reach at least (n/2) + 1 nodes of the cluster (including itself) or if no primary has at least (n/2) + 1 standby nodes replicating.

This will always provide the current valid master of the cluster, even if the primary alias is pointing to a different node.

1. Using repmgr

When you can still connect with the PostgreSQL cluster, the safest way to discover who is the primary is to connect with the repmgr_cluster database on each node.

$ bosh -d <deploymnet> ssh pg/X
$ sudo -i
$ source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
$ chpst -u postgres:vcap psql repmgr_cluster

a9s PostgreSQL 13 and below

$ bosh -d <deploymnet> ssh pg/X
$ sudo su - vcap
$ source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
$ psql repmgr_cluster

Connected to the repmgr_cluster database, you can execute the following query:

On a9s PostgreSQL 11 and later versions:

repmgr_cluster=## SELECT * FROM repmgr.nodes;

Then analyze the fields id (node_id), type, upstream_node_id and name (node_name), where name is the current node id (spec.id), id is spec.index + 1, type is the node role (primary or standby) and upstream_node_id is the id of the upstream node id.

Compare the results of all nodes, since a node in a split-brain partition can show out of date information.

2. Using the Status Script

When executing the /var/vcap/jobs/postgresql-ha/bin/debug/status.sh script, we can look for the mode property on every node, to find the one running as master. If the cluster is not healthy and more than one node is presenting himself as master, we can see the upstream_node_id of all nodes and choose the one who has more standby nodes replicating from it.

3. Check `upstream_node.lock` File

a9s PostgreSQL Info gets the upstream_node_id value from the upstream_node.lock file located at /var/vcap/store/postgresqlXX-locks/upstream_node.lock. So, when a9s PostgreSQL Info is not available, you can go through every node of the cluster and check the content of this file, then use this information as seen above. A node containing unknown means that right after executing repmgr register, repmgr promote or repmgr follow the startup script could not retrieve this information from the primary. This is a critical error.

4. Check the a9s PostgreSQL Metrics

If the metrics were collected, checking the a9s PostgreSQL metrics can help identify the primary node and monitor the cluster's health over time. For more information, see a9s PostgreSQL Metric groups.

note

This method should not be used alone; it should be combined with the other methods mentioned in this section.

5. Check the Logs

You first need to identify who was the primary when this node started checking the /var/vcap/sys/log/postgresql-ha/pre-start.stdout.log file. On every node, check for the following line:

6838: Tue Jul  9 23:30:55 UTC 2019 - [debug][pre-start:254][init_standby_node] - Trying to find the master - master_id: d1df087d-bcea-406e-bdf5-17693a63e28a

Then check the /var/vcap/sys/log/postgresql-ha/repmgr_follow_promote.log, to trace a promote-follow history (keep the timestamp in mind to discover the precedence), from there you can identify who was the last promoted node. Compare the observations with /var/vcap/sys/log/postgresql-ha/repmgrd.log (/var/vcap/sys/log/postgresql-ha/repmgrd.err.log on a9s PostgreSQL 11), on the repmgrd log look for the a line like:

[2018-02-13 19:24:54] [NOTICE] STANDBY PROMOTE successful

If you can not find this file this means that the node was never promoted to primary.

If you can not find this file in any of the nodes, chances are that a failover never happened and the bootstrap node is still the master.

6. Check Control Data

PostgreSQL keeps some information regarding the state of the cluster in its files. One way to access this information is by using the pg_controldata binary shipped with this release.

If the cluster was correctly stopped (This should be the case during a failed upgrade), the information about the last role of a node can be found using this approach. Of course, during the upgrade, multiple nodes can be stopped as primary. But this will give one more piece of information to make a decision about which node is the primary.

You can get this information executing the following command against every node in the cluster:

bosh -d <deployment-name> ssh pg/<instance-index-or-id> "sudo /var/vcap/packages/postgresql94/bin/pg_controldata /var/vcap/store/postgresql94 | grep 'Database cluster state'"

Example for a9s-pg, node 2:

bosh -d a9s-pg ssh pg/2 "sudo /var/vcap/packages/postgresql94/bin/pg_controldata /var/vcap/store/postgresql94 | grep 'Database cluster state'"

Which should give you the following output:

If the node is standby and down, you should see:

Database cluster state:               shut down in recovery

If the node is primary and down, you should see:

Database cluster state:               shut down

If the node is primary and running you should see:

Database cluster state:               in production

If the node is standby and running you should see:

Database cluster state:               in archive recovery

Primaries will have the state either in production or shut down. If you see both status in one cluster analyze the logs to identify the last primary.

Status Script​

Examining Status Output​

Cluster Status (Default)​

Replication Status​

Values of Interest​

Identifying Current Valid Primary of the Cluster​

0. Using a9s PostgreSQL Info (/v1/master)​

1. Using repmgr​

2. Using the Status Script​

3. Check upstream_node.lock File​

4. Check the a9s PostgreSQL Metrics​

5. Check the Logs​

6. Check Control Data​