Cluster Status
Since anynines-deployment v4.0.0, a9s PostgreSQL includes a debug script that helps operators identify the cluster state and the events currently happening.
The purpose of this document is to explain how to use this script and interpret its output, as well as how to identify the valid primary of the cluster.
Status Script
a9s PostgreSQL collocates the a9s PostgreSQL Info process, which listens on port 63145. The main purpose of this service is to provide read-only information about the node state through an HTTP server.
There is a script under `/var/vcap/jobs/postgresql-ha/bin/debug/status.sh` that iterates over all PostgreSQL nodes and requests the current state of each node by executing a GET against the `/v1/status` endpoint. Executing this script from any node therefore gives you the state of the cluster as seen from that node (this is especially important if you need to debug a network split).
You can execute the script without arguments:
$ /var/vcap/jobs/postgresql-ha/bin/debug/status.sh
The output contains information about the state of all nodes.
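Each node's reply can also be checked programmatically. The sketch below is illustrative only: the payload is a hand-written sample in the shape shown later in this document, and `node_is_healthy` is a hypothetical helper, not part of the release.

```python
import json

# Hand-written sample in the shape of a /v1/status reply (see the
# "Cluster Status (Default)" section below for a real one).
raw = """
{
  "message": "Node status",
  "status": "succeeded",
  "data": {
    "id": "0c40bd50-1743-419b-aa87-ce400e3c7e35",
    "status": "running",
    "mode": "master",
    "blocked": false,
    "upstream_node_id": "0c40bd50-1743-419b-aa87-ce400e3c7e35",
    "processes": {"postgresql": "running", "repmgrd": "running"}
  }
}
"""

def node_is_healthy(payload: dict) -> bool:
    """Quick sanity check on a single node's status payload."""
    data = payload["data"]
    return (
        data["status"] == "running"
        and not data["blocked"]
        and all(state == "running" for state in data["processes"].values())
    )

payload = json.loads(raw)
print(node_is_healthy(payload))  # True for the sample above
```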
The script can also be executed with `/v1/replication_status` as an argument in order to check the replication status.
$ /var/vcap/jobs/postgresql-ha/bin/debug/status.sh /v1/replication_status
The script will then retrieve the checkpoint information live from the PostgreSQL process (with the exception of the primary, for which it shows when it was promoted). The most relevant fields there are `upstream_node_id` and `repmgr_node_id`.
As this information is extracted from each node's local `repmgr` database, it can be compared with the output of the default execution of the script. If the two diverge, we strongly recommend looking into the logs in order to understand what happened to the cluster.
Keep in mind that when executed against a single instance, the output falls back to the default output, as replication status only makes sense in a cluster setting.
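A minimal way to automate that comparison is sketched below. Note that the two executions report upstream identifiers in different formats (the default output uses the node's `spec.id`, while the replication output uses the integer repmgr node id), so the sketch assumes you have already normalized them to a common form; the node names and ids are made up.

```python
# Upstream ids per node, already normalized to a common identifier scheme,
# as collected from the two script executions (values are made up).
status_upstreams = {"pg/0": "node-a", "pg/1": "node-a", "pg/2": "node-a"}
replication_upstreams = {"pg/0": "node-a", "pg/1": "node-a", "pg/2": "node-b"}

def diverging_nodes(status: dict, replication: dict) -> list:
    """Nodes whose upstream_node_id differs between the two executions."""
    return sorted(n for n in status if replication.get(n) != status[n])

print(diverging_nodes(status_upstreams, replication_upstreams))  # ['pg/2']
```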
You can refer to the tables below in order to get detailed descriptions of what each value represents, for both types of execution.
General Request Information
Property | Type | Description |
---|---|---|
message | string | General information identifying the request. |
status | string | Status of this request. Possible values: [`success`]. |
data | json | Payload with node-specific status information. |
Node specific information (`data`)
Property | Type | Description |
---|---|---|
node_index | integer | Node index (`spec.index`) |
node_ip | string | Node IP address (`spec.ip`) |
id | string | Node ID (`spec.id`) |
bootstrap | boolean | Whether this is the bootstrap node (`spec.bootstrap`) |
firstboot | boolean | Whether this is the first boot of this node (read from the first boot lock) |
status | string | Result of `pg_isready` against the local node. Possible values are: [`running`, `rejecting_connections`, `no_response`, `no_attempt_made`, `unknown_error`] |
mode | string | The mode this node is running in, extracted from `pg_controldata`. Possible values are: [`master`, `standby`, `shutdown`, `shut_down_in_recovery`, `shutting_down`, `in_crash_recovery`, `unknown`] |
master_in_majority_partition | boolean | Whether this is a valid primary for the majority of the cluster. |
upstream_node_id | string | Upstream node ID for this node, i.e. which node this node is replicating from. |
failover_in_progress | boolean | Whether this node is going through a failover process, either being promoted to primary or starting to follow a recently promoted primary. |
latest_checkpoint * | string | Latest checkpoint extracted from `pg_controldata`; the string represents an LSN (Log Sequence Number). This property is no longer used to identify the master of the cluster. |
blocked | boolean | Whether this node was blocked by the monitor. A node is blocked when it is a primary not followed by the majority of the cluster (in which case `data.master_in_majority_partition` is `false`, for example). This property indicates whether the node can receive new connections. |
processes | json | This section identifies the status of the main processes. The check is done using `ps`, and the values are used during post-start to decide whether to wait for a node to finish cloning. Possible values for these properties are: [`running`, `starting`, `cloning`, `stopped`, `unknown`, `error`] |
pg_config | json | PostgreSQL configuration section. |
pg_config | json | PostgreSQL configuration section. |
Process-specific information (`processes`)
Property | Type | Description |
---|---|---|
postgresql | string | Status of the PostgreSQL process running on the node. |
repmgrd | string | Status of the repmgrd process running on the node. |
PostgreSQL configuration section (`pg_config`)
Property | Type | Description |
---|---|---|
max_connections | integer | `max_connections` value extracted from the PostgreSQL config file. This information is used to check whether a `max_connections` update is happening. We currently support updates only to a value greater than the current one. |
Replication status information
Property | Type | Description |
---|---|---|
latest_checkpoint ** | string | Latest checkpoint extracted from `pg_last_wal_replay_lsn`; the string represents an LSN (Log Sequence Number): the last write-ahead log location replayed during recovery. This value increases only when the node is in recovery; therefore for the primary it should remain the same over time, except when the primary goes through recovery mode (e.g. PITR). |
standby_latest_receive_wal | string | Latest checkpoint extracted from `pg_last_wal_receive_lsn` for standby nodes; the string represents an LSN (Log Sequence Number): the last write-ahead log location received and synced to disk by streaming replication. This value increases only when the node is in recovery; therefore for the primary this property has the value `empty_value_for_master_node`. Eventually, this value might be greater than `latest_checkpoint`. |
pg_mode | string | The mode of the node. The value can be either `master`, `shutdown` or `standby`. |
repmgr_node_id | integer | Repmgr node ID for this node. For the master node it is the same value as `upstream_node_id`. Fetched from the `repmgr` database. |
standby_age | integer | Indicates how long this node has been acting as a standby. It has two special values, `-1` and `-2`: single instances always have `-2`; in a cluster setting only the first (bootstrapped) primary has `-2`, while a primary promoted after a failover has `-1`. Fetched from the `repmgr` database. |
upstream_node_id | integer | Upstream node ID for this node, i.e. which node this node is replicating from. Fetched from the `repmgr` database. |
* `latest_checkpoint` is deprecated for the `/v1/status` endpoint.
** Note that even though `latest_checkpoint` is deprecated in the default version of the script, it is still a valid source of truth when fetched by running the script with `/v1/replication_status` as the parameter. That information is retrieved live via a different query, so it is current and valid.
Examining Status Output
Depending on how the script was executed, the response will differ in certain properties. To address this, the information is divided into two categories: Cluster Status (Default) and Replication Status.
Cluster Status (Default)
This is the output of the script execution on a three-node healthy cluster:
{
"message": "Node status",
"status": "succeeded",
"data": {
"id": "0c40bd50-1743-419b-aa87-ce400e3c7e35",
"node_index": 0,
"node_ip": "10.244.28.56",
"bootstrap": true,
"firstboot": false,
"status": "running",
"blocked": false,
"latest_checkpoint": "11/C58CB018",
"mode": "master",
"master_in_majority_partition": true,
"upstream_node_id": "0c40bd50-1743-419b-aa87-ce400e3c7e35",
"failover_in_progress": false,
"processes": {
"postgresql": "running",
"repmgrd": "running"
},
"pg_config": {
"max_connections": "1024"
}
}
}
{
"message": "Node status",
"status": "succeeded",
"data": {
"id": "13f921dd-32d2-46ef-8091-b3372a463158",
"node_index": 1,
"node_ip": "10.244.29.52",
"bootstrap": false,
"firstboot": false,
"status": "running",
"blocked": false,
"latest_checkpoint": "11/C56E5878",
"mode": "standby",
"master_in_majority_partition": false,
"upstream_node_id": "0c40bd50-1743-419b-aa87-ce400e3c7e35",
"failover_in_progress": false,
"processes": {
"postgresql": "running",
"repmgrd": "running"
},
"pg_config": {
"max_connections": "1024"
}
}
}
{
"message": "Node status",
"status": "succeeded",
"data": {
"id": "70dabb0f-388c-4d58-8356-a0d9cf1fa0b2",
"node_index": 2,
"node_ip": "10.244.30.50",
"bootstrap": false,
"firstboot": false,
"status": "running",
"blocked": false,
"latest_checkpoint": "11/C56E5878",
"mode": "standby",
"master_in_majority_partition": false,
"upstream_node_id": "0c40bd50-1743-419b-aa87-ce400e3c7e35",
"failover_in_progress": false,
"processes": {
"postgresql": "running",
"repmgrd": "running"
},
"pg_config": {
"max_connections": "1024"
}
}
}
Replication Status
This is the output of the script execution, with `/v1/replication_status` as a parameter, on a three-node healthy cluster:
{
"message": "Node Replication Stat",
"status": "success",
"data": {
"node": {
"id": "e4c623b8-5367-44a0-961b-442a3670935d",
"index": 0,
"ip": "10.244.28.50"
},
"pg_mode": "master",
"upstream_node_id": "1",
"repmgr_node_id": "1",
"standby_age": -2,
"latest_checkpoint": null,
"standby_latest_receive_wal": "empty_value_for_master_node",
"failover_in_progress": false,
"processes": {
"postgresql": "running",
"repmgrd": "running"
},
"pg_config": {
"version": 11,
"max_connections": "128"
}
}
}
{
"message": "Node Replication Stat",
"status": "success",
"data": {
"node": {
"id": "b76d9f58-943a-4162-b991-81e2279606a3",
"index": 2,
"ip": "10.244.28.56"
},
"pg_mode": "standby",
"upstream_node_id": "1",
"repmgr_node_id": "3",
"standby_age": 3296,
"latest_checkpoint": "0/5133F68",
"standby_latest_receive_wal": "0/5133F78",
"failover_in_progress": false,
"processes": {
"postgresql": "running",
"repmgrd": "running"
},
"pg_config": {
"version": 11,
"max_connections": "128"
}
}
}
{
"message": "Node Replication Stat",
"status": "success",
"data": {
"node": {
"id": "f98ef63a-91a2-4829-9d88-e52d67703c33",
"index": 1,
"ip": "10.244.28.52"
},
"pg_mode": "standby",
"upstream_node_id": "1",
"repmgr_node_id": "2",
"standby_age": 3059,
"latest_checkpoint": "0/5133F68",
"standby_latest_receive_wal": "0/5133F78",
"failover_in_progress": false,
"processes": {
"postgresql": "running",
"repmgrd": "running"
},
"pg_config": {
"version": 11,
"max_connections": "128"
}
}
}
Values of Interest
There is some information we can check right away from the output above:

- All nodes should be in status `running`: it is normal for a node to go through different states while starting, but after the deployment finishes, all nodes should be `running`. A different case occurs when bringing back a cluster where all three nodes were down: if the standby nodes start first, they will pass through the statuses `rejecting_connections` and `no_response` while the PostgreSQL process finishes starting. If this fails, please check the PostgreSQL logs under `/var/vcap/sys/log/monit`.
- There must be exactly 1 node in mode `master`, and all others in mode `standby`: in a healthy cluster there is 1 master and N standby nodes following the same `upstream_node_id`. Note that single-node deployments only have a `master` node.
- All nodes belonging to the cluster must be following the same upstream: an `unknown` value is possible during first boot, but after that all nodes should replicate from the same upstream node, the primary. A node not following the current valid upstream may be in a split-brain situation. Note that the monitor uses this information to decide whether a node should be blocked, so a master that is not followed by the majority of the nodes is blocked; in the same way, a standby node following an invalid node is also blocked. To validate that the cluster is in a healthy state, you can run both the cluster status script (the default execution) and the replication status script (using `/v1/replication_status` as parameter), then compare the values of `upstream_node_id`.
- The valid primary is the one followed by the majority: a valid master is followed by at least `(n / 2) + 1` nodes in the cluster, including the primary following itself.
- No node should be blocked: it is normal for a standby node recently added to the cluster to be blocked for a period (currently 60s by default), but after this period the node should be unblocked. Any other situation means the cluster is not healthy. A blocked node is not necessarily the faulty node in the cluster; you can have a standby node failing to restart, or even a standby that for some unforeseen reason fails to follow the new node. In case a node is blocked, check whether all nodes in the cluster are running and following the same upstream, and check the a9s PostgreSQL Info log file: `/var/vcap/job/sys/log/postgresql-ha/postgresql-info-webservice.err.log`.
- The primary DNS alias must point to the correct primary: the alias should point to the address of the current primary of the cluster, and it should agree with the information shown by the `debug/status.sh` script. Since anynines-deployment v4.13.0, primary alias monitoring is enabled by default, and it corrects the alias in case it points to a node that is not the valid primary of the cluster. For the alias to be updated, the `/v1/master` endpoint must return the status of the valid primary of the majority.
- All processes must be running: even when BOSH shows PostgreSQL and repmgrd as running, this status can report a different value. This happens because, before starting the process, the script is running and waiting for `cloning` or `starting`, for example; BOSH will report the status `running`, but this endpoint will show the process as `cloning` or `starting`. In a healthy cluster, all processes listed there should be `running`. If the status is anything other than `running` or `cloning`, check the logs for PostgreSQL and repmgrd under `/var/vcap/sys/log/monit/`.
- The value of `latest_checkpoint` must change for the standby nodes: when using the script with `/v1/replication_status` as a parameter, the value of `latest_checkpoint` must increase with every execution. If the value remains unchanged, the replication process is broken. This does not apply to the primary node, which by definition is not in `standby` mode.
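The checks above can be condensed into a single predicate. The sketch below operates on the `data` objects of the default output; it is a simplified illustration (it ignores process states and the blocking grace period, for instance), not the monitor's actual logic.

```python
def cluster_is_healthy(nodes: list) -> bool:
    """Apply the checks above to a list of /v1/status 'data' dicts."""
    masters = [n for n in nodes if n["mode"] == "master"]
    if len(masters) != 1:  # exactly one master
        return False
    primary = masters[0]
    quorum = len(nodes) // 2 + 1  # (n / 2) + 1, integer division
    followers = sum(1 for n in nodes if n["upstream_node_id"] == primary["id"])
    return (
        followers >= quorum  # majority follows the primary
        and all(n["status"] == "running" for n in nodes)
        and not any(n["blocked"] for n in nodes)
    )

nodes = [
    {"id": "a", "mode": "master", "status": "running", "blocked": False,
     "upstream_node_id": "a"},
    {"id": "b", "mode": "standby", "status": "running", "blocked": False,
     "upstream_node_id": "a"},
    {"id": "c", "mode": "standby", "status": "running", "blocked": False,
     "upstream_node_id": "a"},
]
print(cluster_is_healthy(nodes))  # True
```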
Identifying Current Valid Primary of the Cluster
Use the method below that best fits your case, and use more than one approach to decide.
0. Using a9s PostgreSQL Info (`/v1/master`)
a9s PostgreSQL Info includes the GET `/v1/master` endpoint. This endpoint provides the status of the current valid primary of the cluster. It requests the current status from every node of the cluster; with this data an election is made, and the primary (mode `master`) that has the most nodes following it is identified as the primary of the cluster.
This endpoint can also respond with `No Quorum` when the node receiving the request cannot reach at least `(n/2) + 1` nodes of the cluster (including itself), or when no primary has at least `(n/2) + 1` `standby` nodes replicating from it.
This will always provide the current valid master of the cluster, even if the primary alias is pointing to a different node.
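The election rule described above can be sketched as follows. This is an illustration of the rule, not the endpoint's actual implementation; the input is a list of per-node status dicts shaped like the examples elsewhere in this document.

```python
def elect_primary(nodes: list):
    """Return the id of the master followed by at least (n/2)+1 nodes
    (including itself), or None, corresponding to a No Quorum response."""
    quorum = len(nodes) // 2 + 1
    for candidate in (n for n in nodes if n["mode"] == "master"):
        followers = sum(
            1 for n in nodes if n["upstream_node_id"] == candidate["id"]
        )
        if followers >= quorum:
            return candidate["id"]
    return None

nodes = [
    {"id": "a", "mode": "master", "upstream_node_id": "a"},
    {"id": "b", "mode": "standby", "upstream_node_id": "a"},
    {"id": "c", "mode": "standby", "upstream_node_id": "b"},
]
print(elect_primary(nodes))  # a is followed by itself and b: 2 >= (3/2)+1
```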
1. Using repmgr
When you can still connect to the PostgreSQL cluster, the safest way to discover the primary is to connect to the `repmgr_cluster` database on each node.
$ bosh -d <deployment> ssh pg/X
$ sudo -i
$ source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
$ chpst -u postgres:vcap psql repmgr_cluster
a9s PostgreSQL 13 and below
$ bosh -d <deployment> ssh pg/X
$ sudo su - vcap
$ source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
$ psql repmgr_cluster
Connected to the `repmgr_cluster` database, you can execute the following query:
On a9s PostgreSQL 11 and later versions:
repmgr_cluster=# SELECT * FROM repmgr.nodes;
On a9s PostgreSQL 10 and below versions:
repmgr_cluster=# SELECT * FROM repmgr_cluster.repl_nodes;
Then analyze the fields `id` (`node_id`), `type`, `upstream_node_id` and `name` (`node_name`), where `name` is the current node id (`spec.id`), `id` is `spec.index + 1`, `type` is the node role (`primary` or `standby`) and `upstream_node_id` is the `id` of the upstream node.
Compare the results of all nodes, since a node in a split-brain partition can show out-of-date information.
2. Using the Status Script
When executing the `/var/vcap/jobs/postgresql-ha/bin/debug/status.sh` script, we can look for the `mode` property on every node to find the one running as `master`. If the cluster is not healthy and more than one node presents itself as `master`, we can look at the `upstream_node_id` of all nodes and choose the one that has the most `standby` nodes replicating from it.
3. Check the `upstream_node.lock` File
a9s PostgreSQL Info gets the `upstream_node_id` value from the `upstream_node.lock` file located at `/var/vcap/store/postgresqlXX-locks/upstream_node.lock`. So, when a9s PostgreSQL Info is not available, you can go through every node of the cluster and check the content of this file, then use this information as described above. A node whose file contains `unknown` means that, right after executing `repmgr register`, `repmgr promote` or `repmgr follow`, the startup script could not retrieve this information from the primary. This is a critical error.
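Once the lock file contents have been collected from every node (for example via `bosh ssh`), picking the upstream most nodes agree on is a short tally. The sketch below is only an illustration; the file contents are made up.

```python
from collections import Counter

# Hypothetical contents of upstream_node.lock, one entry per node.
lock_contents = ["0c40bd50", "0c40bd50", "unknown"]

counts = Counter(lock_contents)
upstream, votes = counts.most_common(1)[0]
quorum = len(lock_contents) // 2 + 1  # majority threshold, (n/2) + 1
print(upstream if votes >= quorum else "no agreement")  # 0c40bd50
```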
4. Check the Logs
You first need to identify who was the primary when each node started by checking the `/var/vcap/sys/log/postgresql-ha/pre-start.stdout.log` file. On every node, check for the following line:
6838: Tue Jul 9 23:30:55 UTC 2019 - [debug][pre-start:254][init_standby_node] - Trying to find the master - master_id: d1df087d-bcea-406e-bdf5-17693a63e28a
Then check `/var/vcap/sys/log/postgresql-ha/repmgr_follow_promote.log` to trace a promote-follow history (keep the timestamps in mind to establish precedence); from there you can identify the last promoted node. Compare the observations with `/var/vcap/sys/log/postgresql-ha/repmgrd.log` (`/var/vcap/sys/log/postgresql-ha/repmgrd.err.log` on a9s PostgreSQL 11); in the repmgrd log, look for a line like:
[2018-02-13 19:24:54] [NOTICE] STANDBY PROMOTE successful
If you cannot find this file, the node was never promoted to primary.
If you cannot find this file on any of the nodes, chances are that a failover never happened and the bootstrap node is still the master.
5. Check Control Data
PostgreSQL keeps some information regarding the state of the cluster in its files. One way to access this information is by using the `pg_controldata` binary shipped with this release.
If the cluster was correctly stopped (which should be the case during a failed upgrade), information about the last role of a node can be found using this approach. Of course, during an upgrade, multiple nodes can have been stopped as primary, but this gives one more piece of information for deciding which node is the primary.
You can get this information by executing the following command against every node in the cluster:
bosh -d <deployment-name> ssh pg/<instance-index-or-id> "sudo /var/vcap/packages/postgresql94/bin/pg_controldata /var/vcap/store/postgresql94 | grep 'Database cluster state'"
Example for a9s-pg, node 2:
bosh -d a9s-pg ssh pg/2 "sudo /var/vcap/packages/postgresql94/bin/pg_controldata /var/vcap/store/postgresql94 | grep 'Database cluster state'"
Which should give you the following output:
If the node is standby and down, you should see:
Database cluster state: shut down in recovery
If the node is primary and down, you should see:
Database cluster state: shut down
If the node is primary and running you should see:
Database cluster state: in production
If the node is standby and running you should see:
Database cluster state: in archive recovery
Primaries will have the state either `in production` or `shut down`. If you see both statuses in one cluster, analyze the logs to identify the last primary.
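The mapping above can be summarized as a small lookup over the `Database cluster state` line. This is a convenience sketch based on the table above; `interpret` is a hypothetical helper, not part of the release.

```python
# Interpretation of pg_controldata's "Database cluster state" line,
# following the table above: (role, availability) per state.
CLUSTER_STATE = {
    "shut down in recovery": ("standby", "down"),
    "shut down": ("primary", "down"),
    "in production": ("primary", "running"),
    "in archive recovery": ("standby", "running"),
}

def interpret(controldata_output: str):
    """Extract and interpret the state line from a pg_controldata dump."""
    for line in controldata_output.splitlines():
        if line.startswith("Database cluster state:"):
            state = line.split(":", 1)[1].strip()
            return CLUSTER_STATE.get(state, ("unknown", "unknown"))
    return ("unknown", "unknown")

print(interpret("Database cluster state: shut down"))  # ('primary', 'down')
```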