a9s-pg Upgrade from 9.4
Until anynines-deployment version v12.4.0, the a9s PostgreSQL BOSH release used by a9s-pg was a9s PostgreSQL 9.4. This document describes the procedure to safely upgrade a9s-pg to a9s PostgreSQL 11.
Description
When deploying the new a9s-pg version with PostgreSQL 11, the pre-start script will detect that a major upgrade is going on based on the PG_VERSION file of the old data directory (/var/vcap/store/postgresql94). The existence of this file is necessary for the upgrade procedure.
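If you want to confirm beforehand that this marker file is in place, a minimal sketch is shown below; the instance name pg/0 is an assumption and should be adjusted to your deployment.
$ bosh -d a9s-pg ssh pg/0 "sudo cat /var/vcap/store/postgresql94/PG_VERSION"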
The data upgrade then happens on the bootstrap node. When the major upgrade is detected, this node checks whether the old part of the cluster has a valid primary, discards the current data in the old data directory, clones the data from the current primary to ensure it has the most recent version and to avoid replication issues, and migrates the fresh data using pg_upgrade. After that, this node is promoted to primary and a normal startup goes ahead.
Note that this will force the bootstrap node to start as a primary, and at this point your cluster will have two primary nodes, where the old portion has the majority. (To avoid losing data, please look at the Block Node section below.)
As the upgrade goes on to the next node, it will clone the data from the new primary and start following the primary on the new version. The cluster should be ready for use after the second node finishes upgrading.
Preparing to Upgrade
Some steps are necessary to ensure a safe upgrade and to make recovery possible in case of a disaster.
Increase the Persistent Disk
During pg_upgrade, the data will be duplicated between the old data directory (/var/vcap/store/postgresql94) and the new data directory (/var/vcap/store/postgresql11). Even though pg_upgrade runs with --link, it can use almost the same space as an upgrade without --link. For this reason, you will need more than double the currently used space in order to upgrade.
You must also keep the a9s Parachute limits in mind. This means that with a Parachute watermark of 80% and a data directory using 4GB of space, the total space necessary on the disk must be at least:
E.g.:
(2 * 4GB) + ((1 - 0.8) * (2 * 4GB)) = 9.6GB
Formula:
(2 * <data-size>) + ((1 - <a9s-parachute-watermark>) * (2 * <data-size>)) = <total>
While vacuuming the cluster after the upgrade, a little more space will be necessary, so it is advised to add at least 2GB on top of this. In the example above, this would give round(11.6GB) = 12GB, which is enough for a safe upgrade procedure. During post-start, after the new data is analyzed, the old data is discarded and you can decrease the size of the disk again.
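As a convenience, the sketch below evaluates the formula above; the data size, watermark, and vacuum head room are assumptions taken from the example and should be adjusted to your deployment.
#!/bin/bash
# Sketch: evaluate the disk-size formula above.
# d = current data size in GB, w = a9s Parachute watermark, extra = head room for the vacuum.
awk -v d=4 -v w=0.8 -v extra=2 'BEGIN {
  total = (2 * d) + ((1 - w) * (2 * d)) + extra
  printf "required persistent disk size: at least %.1fGB (round up to %dGB)\n", total, int(total) + (total > int(total))
}'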
Execute a Backup Using the a9s Backup Framework
Before triggering a backup, you need to check that it was configured correctly. If we look at the manifest of a9s-pg, we should see that the backup-agent and postgresql-backup-endpoint releases are installed:
releases:
- name: backup-agent
  version: ...
- name: postgresql-backup-endpoint
  version: ...
Examples:
- https://github.com/anynines/anynines-deployment/blob/master/a9s-pg/a9s-pg.yml#L28-L31
- https://github.com/anynines/anynines-deployment/blob/master/a9s-pg/a9s-pg.yml#L36-L39
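You can also check the running deployment directly; a minimal sketch that greps the deployed manifest for the two releases:
$ bosh -d a9s-pg manifest | grep -E "backup-agent|postgresql-backup-endpoint"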
Now we need to check that we have all the variables needed to link this a9s-pg to the backup-manager:
instance_groups:
...
- name: backup-agent
  release: backup-agent
  properties:
    consul:
      domain: ((iaas.consul.domain))
      dc: dc1
      agent_address: 127.0.0.1:8500
    backup-agent:
      endpoint:
        username: ((/a9s_pg_backup_agent_username))
        password: ((/a9s_pg_backup_agent_password))
...
- name: postgresql-backup-endpoint
  release: postgresql-backup-endpoint
  consumes:
    cluster_nodes:
      from: postgresql
  properties:
    postgresql-backup-endpoint:
      username: admin
      password: ((/a9s_pg_backup_endpoint_password))
      port: 3001
      cluster:
        postgresql:
          username: admin_user
          password: ((a9s_pg_admin_password))
        backup-agent:
          username: ((/a9s_pg_backup_agent_username))
          password: ((/a9s_pg_backup_agent_password))
        info-service:
          username: admin
          password: ((/a9s_pg_postgresql_info_webservice))
For more information, take a look at:
- https://github.com/anynines/anynines-deployment/blob/master/a9s-pg/a9s-pg.yml#L180-L190
- https://github.com/anynines/anynines-deployment/blob/master/a9s-pg/a9s-pg.yml#L223-L242
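If you want to verify that the referenced CredHub variables actually exist on your director, here is a minimal sketch, assuming the credhub CLI is already targeted and logged in:
#!/bin/bash
# Sketch: check that the backup-related CredHub variables referenced above exist.
# Assumes the credhub CLI is targeted and logged in against the director's CredHub.
for var in a9s_pg_backup_agent_username a9s_pg_backup_agent_password a9s_pg_backup_endpoint_password; do
  credhub find -n "${var}"
done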
After checking that everything is fine, we can trigger a backup. Instructions on how to do that can be found in a9s-pg Backup.
Even though a backup of a9s PostgreSQL 9.4 is not compatible with the a9s PostgreSQL 11 version, it is possible to reset the cluster, roll back to the old version, and restore a backup.
To know more about this, see the Disaster Recovery section below.
Manual Backup
It is also possible to create a local dump file that can be restored to the 9.4, 10, and 11 versions in case of a disaster, using the following command:
$ pg_dumpall --clean --oids --no-password --if-exists --quote-all-identifiers \
--host=<primary-addr> --port=5432 --username=admin_user -f a9s-pg.sql
The password for this user is in the CredHub variable a9s_pg_admin_password of the local a9s-pg deployment. It should be under the path /<bosh-directory-name>/a9s-pg/a9s_pg_admin_password.
If the password does not seem to be working, try setting the PGPASSWORD environment variable.
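For example, here is a minimal sketch that pulls the password from CredHub and exports it, assuming the credhub CLI and jq are available; <bosh-directory-name> is your BOSH director name:
$ export PGPASSWORD="$(credhub get -n /<bosh-directory-name>/a9s-pg/a9s_pg_admin_password --output-json | jq -r '.value')"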
Block Node
There is a window where the bootstrap node is upgrading but is no longer streaming from the primary, which is still on the old part of the cluster, and a node on the new version cannot stream from it. During this window, existing connections could write to a node whose data will later be discarded.
To avoid losing this portion of data, you can block the nodes with the following script:
#!/bin/bash
deployment="a9s-pg"

# Reject all external connections to PostgreSQL (port 5432) ...
block_node="sudo iptables -I INPUT -p tcp -s 0.0.0.0/0 --dport 5432 -j REJECT;"

# ... but keep accepting connections from the cluster members themselves.
for ip in $(bosh -d ${deployment} is --json | jq -r '.Tables[].Rows[].ips'); do
  block_node="${block_node} sudo iptables -I INPUT -p tcp -s ${ip} --dport 5432 -j ACCEPT;"
done

# Apply the rules on every node of the deployment.
for instance in $(bosh -d ${deployment} is --json | jq -r '.Tables[].Rows[].instance'); do
  echo "Inserting rules - node: ${instance}, command: ${block_node}"
  bosh -d ${deployment} ssh ${instance} "${block_node}"
done
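To verify that the rules were applied, you can inspect the INPUT chain on a node; a minimal sketch, where the instance name pg/0 is an assumption:
$ bosh -d a9s-pg ssh pg/0 "sudo iptables -L INPUT -n --line-numbers | head -n 10"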
It is important to know that we do not expect the VMs to be recreated during the upgrade; if that does not take place, you need to remove the iptables rules manually (see the unblock script in the Run Upgrade section below). Also, since we want to release the upgrade procedure as fast as possible, we do not provide an intermediate release that automates blocking the nodes against external requests.
Run Upgrade
The upgrade takes place when deploying the new version of the a9s-pg BOSH deployment. Please refer to the a9s-pg documentation to know more about this.
If the task finishes with success, your cluster will be ready for use. If it fails, please refer to the Disaster Recovery section below.
It is good to know that there are several timeouts in the process, from starting a node with the new version until it is ready:
- The first applies while waiting for a node to initialize the main process; it is 5 minutes.
- The second applies while waiting for a node after cloning the data from the new primary to the new standby node; it is 10 minutes.
- The third applies while waiting for the primary address when the previous one does not respond; it is 3 hours.
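Given these timeouts, you may want to watch the deployment while it converges; a minimal sketch:
$ watch -n 30 "bosh -d a9s-pg instances --ps"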
After the upgrade has finished, it may happen that the virtual machines were not recreated, and therefore the cluster is still blocked. You can run the following script to unblock it:
#!/bin/bash
deployment="a9s-pg"

# Flush all iptables rules on every node to unblock the cluster again.
for instance in $(bosh -d ${deployment} is --json | jq -r '.Tables[].Rows[].instance'); do
  echo "Dropping rules - node: ${instance}"
  bosh -d ${deployment} ssh ${instance} "sudo iptables -F"
done
Disaster Recovery
In case the upgrade procedure fails, and it has not been possible to fix it manually, you can:
- Clean up all nodes in the cluster:
# monit stop all
# rm -rf /var/vcap/store/postgresql11*
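If you prefer to run the cleanup from the outside, the sketch below applies the same commands to every node via BOSH; the monit path /var/vcap/bosh/bin/monit is the usual location on BOSH stemcells, and the 30-second sleep is a crude wait, so check monit summary before relying on it.
#!/bin/bash
# Sketch: run the cleanup above on all nodes of the cluster.
# This discards the PostgreSQL 11 data directories; only do this to roll back to 9.4.
deployment="a9s-pg"
for instance in $(bosh -d ${deployment} is --json | jq -r '.Tables[].Rows[].instance'); do
  echo "Cleaning up node: ${instance}"
  bosh -d ${deployment} ssh ${instance} "sudo /var/vcap/bosh/bin/monit stop all && sleep 30 && sudo rm -rf /var/vcap/store/postgresql11*"
done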
After cleaning all the nodes, deploy the previous version of your a9s-pg manifest to roll back the version to 9.4. Then trigger a restore using the a9s Backup Framework (see the Trigger a Restore Using the a9s Backup Framework section below), and try to upgrade to the new version again as described in the Run Upgrade section.
Note: If the upgrade process fails on the first node for any reason, it should be OK to trigger it again to try to finish the task, unless a primary is found in the cluster.
Trigger a Restore Using the a9s Backup Framework
Please refer to the a9s-pg documentation to know more about this.
After the restore has finished successfully, trigger the upgrade procedure again as described in the Run Upgrade section.
Restore Manual Backup
Using the local dump file created in the step above, restore this file to the cluster:
$ psql --host=<primary-addr> --username=admin_user --port=5432 --dbname=postgres < a9s-pg.sql
During the execution of this command, you might see some errors related to the repmgr database. This should be OK, since the repmgr database is currently in use and already contains the correct information.
After the restore has finished successfully, trigger the upgrade procedure again as described in the Run Upgrade section.