a9s-pg Upgrade from 9.4

Until anynines-deployment version v12.4.0, the a9s PostgreSQL BOSH release used by a9s-pg was a9s PostgreSQL 9.4. This document describes how to safely upgrade a9s-pg to a9s PostgreSQL 11.

Description

When deploying the new a9s-pg version with PostgreSQL 11, the pre-start script detects that a major upgrade is taking place based on the PG_VERSION file of the old data directory (/var/vcap/store/postgresql94). The existence of this file is necessary for the upgrade procedure.
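The detection idea can be sketched as follows. This is a minimal illustration, not the actual pre-start script, and the function name `detect_major_upgrade` is hypothetical:

```shell
#!/bin/bash

# Hypothetical sketch: a major upgrade is assumed whenever the old data
# directory still carries a PG_VERSION file with a version other than 11.
detect_major_upgrade() {
  local old_data_dir="$1"   # e.g. /var/vcap/store/postgresql94
  if [ -f "${old_data_dir}/PG_VERSION" ] && \
     [ "$(cat "${old_data_dir}/PG_VERSION")" != "11" ]; then
    echo "major upgrade from PostgreSQL $(cat "${old_data_dir}/PG_VERSION")"
    return 0
  fi
  return 1
}
```

If the file is missing (for example, on a fresh deployment), no upgrade is attempted.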

The data upgrade happens on the bootstrap node. When a major upgrade is detected, the node checks whether the old part of the cluster has a valid primary, discards the current data in the old data directory, and clones the data from the current primary to ensure it has the most recent version and to rule out replication issues. It then migrates the fresh data using pg_upgrade. After that, the node is promoted to primary and a normal startup proceeds.

Note that this forces the bootstrap node to start as a primary. At this point, your cluster has two primary nodes, and the old portion holds the majority. To avoid losing data, please see the Block Node section below.

As the upgrade moves on to the next node, that node clones the data from the new primary and starts following it on the new version. The cluster should be ready for use after the second node finishes upgrading.
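The data migration on the bootstrap node boils down to a pg_upgrade call between the two data directories. As a rough illustration only (this is not the actual script, and the binary paths under /var/vcap/packages are assumptions about the package layout):

```shell
# Illustrative pg_upgrade invocation between the two data directories;
# the --old-bindir/--new-bindir paths are assumed, not taken from the release.
pg_upgrade \
  --old-datadir=/var/vcap/store/postgresql94 \
  --new-datadir=/var/vcap/store/postgresql11 \
  --old-bindir=/var/vcap/packages/postgresql94/bin \
  --new-bindir=/var/vcap/packages/postgresql11/bin \
  --link
```

The --link flag hard-links files instead of copying them where possible, which is why the old data directory must stay untouched until the upgrade is verified.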

Preparing to Upgrade

Some steps are necessary to ensure a safe upgrade and the ability to recover in case of a disaster.

Increase the Persistent Disk

During pg_upgrade, the data is duplicated between the old data directory (/var/vcap/store/postgresql94) and the new one (/var/vcap/store/postgresql11). Even though pg_upgrade runs with --link, it can use almost the same space as an upgrade without --link, so you need more than double the space currently in use. You must also keep the a9s Parachute limits in mind. For example, with a Parachute watermark of 80% and a data directory using 4GB of space, the total disk space must be at least:

E.g.:

(2 * 4GB) + ((1 - 0.8) * (2 * 4GB)) = 9.6GB

Formula:

(2 * <data-size>) + ((1 - <a9s-parachute-watermark>) * (2 * <data-size>)) = <total>

Vacuuming the cluster after the upgrade requires a little more space, so it is advised to add at least 2GB on top of this. In the example above, this gives round(11.6) = 12GB, which is enough for a safe upgrade procedure. During post-start, after the new data has been analyzed, the old data is discarded and you can decrease the size of the disk again.
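To make the sizing concrete, the calculation above can be scripted. This is a small helper sketch; `required_disk_gb` is a hypothetical name, and the 2GB vacuum headroom is included:

```shell
#!/bin/bash

# Minimum disk size in GB for the upgrade: the data is duplicated (2 * data),
# the a9s Parachute watermark reserves (1 - watermark) of that, and 2GB of
# headroom is added for vacuuming after the upgrade. Rounded up to whole GB.
required_disk_gb() {
  local data_size_gb="$1" watermark="$2"
  awk -v d="${data_size_gb}" -v w="${watermark}" \
    'BEGIN { total = 2 * d + (1 - w) * 2 * d + 2; printf "%d\n", int(total + 0.999) }'
}

required_disk_gb 4 0.8   # 4GB data dir, 80% watermark -> prints 12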

Execute a Backup Using the a9s Backup Framework

Before triggering a backup, check that backups are configured correctly. The a9s-pg manifest should show that the backup-agent and postgresql-backup-endpoint releases are installed:

releases:
- name: backup-agent
  version: ...
- name: postgresql-backup-endpoint
  version: ...

Here are examples:

  • https://github.com/anynines/anynines-deployment/blob/master/a9s-pg/a9s-pg.yml#L28-L31
  • https://github.com/anynines/anynines-deployment/blob/master/a9s-pg/a9s-pg.yml#L36-L39

Now we need to check that all variables linking this a9s-pg deployment to the backup-manager are in place:

instance_groups:
...
  - name: backup-agent
    release: backup-agent
    properties:
      consul:
        domain: ((iaas.consul.domain))
        dc: dc1
        agent_address: 127.0.0.1:8500
      backup-agent:
        endpoint:
          username: ((/a9s_pg_backup_agent_username))
          password: ((/a9s_pg_backup_agent_password))
...
  - name: postgresql-backup-endpoint
    release: postgresql-backup-endpoint
    consumes:
      cluster_nodes:
        from: postgresql
    properties:
      postgresql-backup-endpoint:
        username: admin
        password: ((/a9s_pg_backup_endpoint_password))
        port: 3001
        cluster:
          postgresql:
            username: admin_user
            password: ((a9s_pg_admin_password))
          backup-agent:
            username: ((/a9s_pg_backup_agent_username))
            password: ((/a9s_pg_backup_agent_password))
          info-service:
            username: admin
            password: ((/a9s_pg_postgresql_info_webservice))

For more information, take a look at:

  • https://github.com/anynines/anynines-deployment/blob/master/a9s-pg/a9s-pg.yml#L180-L190
  • https://github.com/anynines/anynines-deployment/blob/master/a9s-pg/a9s-pg.yml#L223-L242

After checking that everything is fine, we can trigger a backup. Instructions on how to do that can be found in a9s-pg Backup.

Even though a backup of a9s PostgreSQL 9.4 is not compatible with a9s PostgreSQL 11, it is possible to reset the cluster, roll back to the old version, and restore a backup.

To know more about this, see the Disaster Recovery section below.

Manual Backup

It is also possible to create a local dump file, compatible with versions 9.4, 10, and 11, that can be restored in case of a disaster by using the following command:

$ pg_dumpall --clean --oids --no-password --if-exists --quote-all-identifiers \
  --host=<primary-addr> --port=5432 --username=admin_user -f a9s-pg.sql

The password for this user is stored in the CredHub variable a9s_pg_admin_password of the local a9s-pg deployment. It should be under the path /<bosh-directory-name>/a9s-pg/a9s_pg_admin_password.

If the password does not seem to be working, try to set the PGPASSWORD environment variable.
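For example, the password can be looked up in CredHub and exported before running the dump. This is a hypothetical session; <bosh-directory-name> and <primary-addr> must be replaced with the values from your environment:

```shell
# Look up the admin password (replace <bosh-directory-name> accordingly):
credhub get -n /<bosh-directory-name>/a9s-pg/a9s_pg_admin_password

# Export it so pg_dumpall does not prompt, then run the dump:
export PGPASSWORD='<value-from-credhub>'
pg_dumpall --clean --oids --no-password --if-exists --quote-all-identifiers \
  --host=<primary-addr> --port=5432 --username=admin_user -f a9s-pg.sql
```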


Block Node

There is a window in which the bootstrap node is upgrading but no longer streaming from the primary, which is on the old part of the cluster (a node on the new version cannot stream from it). During this window, existing connections can write to a node whose data will later be discarded.

To avoid losing this portion of data, you can block the nodes with the following script:

#!/bin/bash

deployment="a9s-pg"

# Reject all external connections to PostgreSQL (port 5432)...
block_node="iptables -I INPUT -p tcp -s 0.0.0.0/0 --dport 5432 -j REJECT;"
# ...but keep accepting connections from the cluster nodes themselves.
for ip in $(bosh -d ${deployment} is --json | jq -r '.Tables[].Rows[].ips'); do
  block_node="${block_node} iptables -I INPUT -p tcp -s ${ip} --dport 5432 -j ACCEPT;"
done

# Apply the rules on every instance of the deployment.
for instance in $(bosh -d ${deployment} is --json | jq -r '.Tables[].Rows[].instance'); do
  echo "Inserting rules - node: ${instance}, command: ${block_node}"
  bosh -d ${deployment} ssh ${instance} "sudo ${block_node}"
done

It is important to know that we do not expect the VMs to be recreated during the upgrade; if that does not happen, you need to remove the iptables rules manually (see the unblock script in the Run Upgrade section below). Also, since we wanted to release the upgrade procedure as fast as possible, we did not create an intermediate release to automate blocking the nodes from external requests.

Run Upgrade

The upgrade takes place when deploying the new version of a9s-pg BOSH deployment. Please, refer to the a9s-pg documentation to know more about this.

If the task finishes with success, your cluster will be ready for use. If it fails, please refer to the Disaster Recovery section below.

It is good to know that there are several timeouts in the process, from initializing a node with the new version until it is ready:

  • The first applies while waiting for a node to initialize its main process; it is 5 minutes.
  • The second applies while waiting for a node after cloning the data from the new primary to the new standby node; it is 10 minutes.
  • The third applies while waiting for the primary address when the previous one does not respond; it is 3 hours.

After the upgrade has finished, it may happen that the virtual machines were not recreated and the cluster is therefore still blocked. You can run the following script to unblock it:

#!/bin/bash

deployment="a9s-pg"

# Flush all iptables rules on every instance, removing the REJECT/ACCEPT
# rules inserted by the blocking script.
for instance in $(bosh -d ${deployment} is --json | jq -r '.Tables[].Rows[].instance'); do
  echo "Dropping rules - node: ${instance}"
  bosh -d ${deployment} ssh ${instance} "sudo iptables -F"
done

Disaster Recovery

In case the upgrade procedure fails and it has not been possible to fix it manually, you can:

  • Clean up all nodes in the cluster:
# monit stop all
# rm -rf /var/vcap/store/postgresql11*

After cleaning all the nodes, deploy the previous version of your a9s-pg manifest to roll back to version 9.4. Then trigger a restore using the a9s Backup Framework, and try to upgrade to the new version again as described in the Run Upgrade section.

Note: If the upgrade process fails on the first node for any reason, it should be safe to trigger it again to finish the task, unless a primary is found in the cluster.

Trigger a Restore Using the a9s Backup Framework

Please, refer to the a9s-pg documentation to know more about this.

After the restore has finished successfully, trigger the upgrade procedure again as described in the Run Upgrade section.

Restore Manual Backup

Using the local dump file created in the step above, restore it to the cluster:

$ psql --host=<primary-addr> --username=admin_user --port=5432 --dbname=postgres < a9s-pg.sql

During the execution of this command, you might see some errors related to the repmgr database. This should be fine, since the repmgr database is currently in use and already contains the correct information.

After the restore has finished successfully, trigger the upgrade procedure again as described in the Run Upgrade section.