
Manual Logical Backup Recovery

This document describes how to manually restore a logical backup on an a9s-pg deployment, potentially in another site.

info

This guide is written for the use-case of recovering an a9s-pg cluster by using the same existing deployment or using a new deployment in a new site (i.e. different BOSH director).

When using a new deployment in the same site, the other framework components must be reconfigured to connect to the new a9s-pg deployment. This guide does not cover that scenario.

Steps that apply only to the scenario in which a new site is used are indicated accordingly.

Limitations

Restoring the a9s-pg to a previous point in time can result in loss of information regarding deployed Service Instances, which in turn causes ghost deployments that can no longer be handled by the a9s Service Broker and a9s BOSH Deployer.

Requirements

warning

In order to download and restore backups, it is necessary to set up an encryption key, which must be stored securely. This is crucial in the case of a9s-pg, because without it no Logical Recovery is possible.

Therefore, it is strongly advised that the encryption key is stored securely, and that it is available when needed.

There are some requirements to accomplish this:

  • Crucial The backup encryption key (backup-encryption-key) and the Backup GUID (backup-guid) must be known by the Platform Operator. Please have a careful look at the Retrieve Required Information section.
  • If a new a9s-pg is deployed, try to keep the same Credhub credentials as the a9s-pg being restored.
  • aws-cli or azure-cli, with the necessary credentials to download the backup.
  • Access to the bucket where the backup is stored: The CLI must be configured with credentials that permit at least reading the files.

If you are not familiar with pg_dump and PostgreSQL dump files, please read the PostgreSQL documentation on the subject.

Command Usage Notice

The commands outlined in this document are designed to be executed on Linux operating systems. Using them on other Unix-based operating systems, such as macOS, may result in unforeseen outcomes; as a precaution, we recommend using Linux for these operations.

Retrieve Required Information

Please note the following:

  • Backup GUID: The ID that is used by the Backup Manager to identify all Backups of a9s-pg.
  • Backup ID: The ID that is used by the Backup Manager to retrieve a specific Backup, based on the timestamp. It has the format <backup-guid>-<timestamp>.
  • Backup Encryption Key: The encryption key used to decrypt the Backup.

Backup Encryption Key Pitfalls
  • The Platform Operator must have safely stored the backup-encryption-key and the backup-id, because it will not be possible to retrieve them if a9s-pg is in a disastrous state and unreachable. Therefore, it is essential to retrieve this information as soon as the a9s-pg and Backup Manager are up and running.
  • During the first Backup Service execution the backup-encryption-key will be generated automatically. It is possible to retrieve this information by following the How to Discover an Existing Backup Encryption Key section.
    • Note that the backup-encryption-key must be collected as soon as the a9s-pg and Backup Service are healthy; otherwise it will not be possible to gather this information.
  • In case the backup-encryption-key is changed, the new backup-encryption-key must be stored safely as soon as possible.
  • Even though the backup-encryption-key is generated automatically, it is a good idea to manually set an encryption key during the initial installation of the a9s Data Service Framework. This can be done once the a9s Backup Manager is installed.

Set the Backup Encryption Key Manually via a9s API V1

The encryption key can be set using the a9s API V1, see also Update Backup Configuration.
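
The commands below assume a bearer_token helper that prints a valid OAuth token for the a9s API. The following is only a minimal sketch, assuming the a9s API accepts tokens issued by the Cloud Foundry UAA; adjust it to however tokens are obtained in your environment:

# Hypothetical helper: reuse the OAuth token of the currently logged-in cf user.
# Assumes the a9s API accepts UAA-issued tokens; newer cf CLI versions print the
# token prefixed with "bearer", which matches the Authorization header used below.
bearer_token() {
  cf oauth-token
}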

url="https://a9s-dashboard.example.com/v1/instances/a9s-pg"

curl -X PATCH --insecure --header "Authorization: $(bearer_token)" \
--header "Content-Type: application/json" "${url}/backups-config" \
--data '{"encryption_key":"<encryption_key>" }'

After the encryption key is set, a new backup can be triggered.

$ curl -X POST --insecure --header "Authorization: $(bearer_token)" \
--header "Content-Type: application/json" "${url}/backups" \
--data '{"encryption_key":"<encryption_key>" }'
=>{"id": 11,"message": "job to backup is queued"}

# check the status of the backup (can be queued, running, done, failed, deleted):
$ curl --insecure --header "Authorization: $(bearer_token)" "${url}/backups/11"
=> {"id":11,"size":272,"status":"done","triggered_at":"2022-04-12T12:00:31.047Z",
"finished_at":"2022-04-12T12:00:50.478Z", "downloadable":true}

Alternatively, the API of the Backup Manager itself could be used, as described in Integration with the a9s Backup Manager.

How to Discover an Existing Backup Encryption Key

The encryption key is created automatically at the first Backup Service execution, or set by the platform operator using the Backup Service API or the a9s API V1.

Using a9s Backup Manager Ruby Console (IRB)

The a9s Backup Manager includes a script that runs the Interactive Ruby Shell already configured to access the current a9s Backup Manager database. To execute this script, access the a9s Backup Manager instance, become root and execute the following command:

/var/vcap/jobs/anynines-backup-manager/bin/rails_c

Inside the IRB shell, execute the following commands:

irb(main):001:0> instance = Instance.where(instance_id: "a9s-pg").first
irb(main):002:0> instance.backup_encryption_key

How to Discover the Backup ID and GUID

The backup id has the following format: <backup-guid>-<timestamp>.

The <backup-guid> is generated by the a9s Backup Manager to identify the a9s-pg Instance, and therefore links to all backups of a9s-pg. In order to identify the correct base backup to use when recovering a9s-pg, it is necessary to find the Backup GUID, as well as a corresponding timestamp.

Using a9s Backup Manager Ruby Console (IRB)

The a9s Backup Manager includes a script that runs the Interactive Ruby Shell already configured to access the current a9s Backup Manager database. To execute this script, access the a9s Backup Manager instance, become root and execute the following command:

/var/vcap/jobs/anynines-backup-manager/bin/rails_c

Inside the IRB shell, execute the following commands:

irb(main):001:0> Instance.where(instance_id: "a9s-pg").first.guid

Download Files

The logical backup storage follows the structure below:

AWS S3:

  • <bucket>: Or container where the backups are stored.
    • <backup-id>.json: Files holding metadata about the backup.
    • <backup-id>-<index>: Encrypted and split backup file.

Azure:

  • <bucket>: Or container where the backups are stored.
    • <backup-id>: Encrypted backup file.

Download Backup

The first step is to identify which available backup to use in your storage backend.

The following command lists the metadata files, where each file is named <backup-guid>-<timestamp>.json. Each .json file contains the metadata of one backup.

As mentioned, Backup ID is the name of the file without .json, i.e. it is the concatenation of the Backup GUID with a specific timestamp.

So the Backup ID you need might look like:

de2fe9e4-6535-423d-8709-dca87df7985c-1740130000

AWS S3:

Run the following to list all backup metadata files for a given Backup GUID:

aws s3 ls s3://<bucket>/<backup-guid>

Since the restore has to be based on the latest available backup, choose the backup with the most recent timestamp.

Note that backup files are split into multiple files of 1 GB each. This means that if your backup is larger than 1 GB, you will need to join these files before restoring. For the moment, download all files belonging to the backup.

Run the following to list all files for a specific backup:

aws s3 ls s3://<bucket>/<backup-id>

And the following command to download all backup related files:

aws s3 cp --exclude="*" --include="<backup-id>*" --recursive \
s3://<bucket> <tmp-base-backup-dir>
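
For example, with a hypothetical bucket named my-backups and the Backup ID shown earlier, the download could look like this:

# Download all parts and the metadata file of one backup into a local directory.
mkdir -p /tmp/a9s-pg-restore
aws s3 cp --exclude="*" --include="de2fe9e4-6535-423d-8709-dca87df7985c-1740130000*" --recursive \
s3://my-backups /tmp/a9s-pg-restore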

Azure:

Run the following to list all backup files for your instance:

az storage blob list --container-name <container> --prefix <backup-guid>- | jq '.[].name'

Since the restore has to be based on the latest available backup, choose the backup with the most recent timestamp.

To download the file, execute:

az storage blob download --container-name <container> --file <backup-id> --name <backup-id>
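
Note that the az CLI needs storage account credentials. For example, with a hypothetical container named backups and credentials passed explicitly (environment variables such as AZURE_STORAGE_CONNECTION_STRING also work):

# Download the encrypted backup file from the hypothetical "backups" container.
az storage blob download \
  --account-name <storage-account> --account-key <storage-account-key> \
  --container-name backups \
  --name de2fe9e4-6535-423d-8709-dca87df7985c-1740130000 \
  --file de2fe9e4-6535-423d-8709-dca87df7985c-1740130000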

Prepare Files

The files are split (if you are using AWS S3) and encrypted. Before restoring, you will need to prepare these files.

The first step after downloading is to decrypt them. Once the files are decrypted, you must join them, if necessary, before starting the recovery process.

Decrypt All Files

AWS S3:

To decrypt the backup, execute the command below on all files that belong to the backup. All split files must be decrypted together, and in the correct ascending order.

Usage of Compound Commands

The following commands make use of the compound command cat $(ls -1v ...), which assumes a Linux distribution operating system. If you are using a different operating system for your environment, it is recommended that you adjust this compound command accordingly.

cat $(ls -1v <backup-id>-*) \
| openssl enc -aes256 -md md5 -d -pass 'pass:<backup-encryption-key>' \
| gunzip -c > <dst-dir>/<backup-id>.dump

For example:

cat $(ls -1v b6f4c071-ef44-4af2-9608-531b4ce4823f-1569935509279-*) \
| openssl enc -aes256 -md md5 -d -pass 'pass:NYHD8MVmA55HEqoaYHpQaxfwEMcQ1wkI' \
| gunzip -c > /var/vcap/store/postgresql-restore/b6f4c071-ef44-4af2-9608-531b4ce4823f-1569935509279.dump

Azure:

Execute the following command on the backup file:

cat <file-name> \
| openssl enc -aes256 -md md5 -d -pass 'pass:<backup-encryption-key>' \
| gunzip -c > <dst-dir>/<file-name>.dump

For example:

cat b6f4c071-ef44-4af2-9608-531b4ce4823f-1569935509279 \
| openssl enc -aes256 -md md5 -d -pass 'pass:NYHD8MVmA55HEqoaYHpQaxfwEMcQ1wkI' \
| gunzip -c > /var/vcap/store/postgresql-restore/b6f4c071-ef44-4af2-9608-531b4ce4823f-1569935509279.dump
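
Regardless of the storage backend, it can be useful to confirm that decryption produced a readable SQL dump before continuing. A plain-text dump created with pg_dumpall starts with header comments such as "-- PostgreSQL database cluster dump"; a quick, illustrative check:

# The first lines should contain pg_dumpall header comments and SQL, not binary data.
head -n 5 <dst-dir>/<backup-id>.dump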

Deploy a New a9s-pg

note

This section applies only for fresh a9s-pg deployments in a new site.

Deploy a new a9s-pg instance, or at least make sure there is a new, empty a9s-pg deployment up and running that is available for recovering the data.

info

This step applies only for the New Site Scenario mentioned in the beginning of this guide.

Find a9s-pg Primary Node

Look up the IP address of the current primary node via its DNS alias with the following command:

nslookup a9s-pg-psql-master-alias.node.dc1.<iaas.consul.domain>

Then, find the Virtual Machine (VM) index or VM ID:

bosh -d <a9s-pg-deployment-name> instances
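
For example, assuming nslookup returned 10.0.0.10 (a placeholder address), the matching instance can be found by filtering the output:

# Placeholder IP: replace with the address returned by nslookup above.
bosh -d <a9s-pg-deployment-name> instances | grep '10.0.0.10'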

Alternatively, you can try the methods described in Discover which node is the primary.

Copy Files to the a9s-pg Instance

Before copying the files to your instance, make sure you have enough space to store both the backup file and the restored database, and make sure you are restoring the dump file on the current primary node. The recovery process must be started on the primary node; the data is then cloned by the standby nodes.
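
For a quick space check, you can compare the size of the dump file with the free space on the primary node's persistent disk, for example:

# On the machine holding the decrypted dump: size of the dump file.
du -sh <dst-dir>/<backup-id>.dump

# On the primary node (after bosh ssh): free space on the persistent disk.
df -h /var/vcap/store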

First, create a directory under /var/vcap/store/

bosh -d <a9s-pg-deployment-name> ssh pg/<primary-node-index-or-id>
sudo su -
mkdir /var/vcap/store/postgresql-restore
chown root.vcap -R /var/vcap/store/postgresql-restore
chmod g+w /var/vcap/store/postgresql-restore

With the directory prepared, copy the backup file (dump file) to the VM.

In the example below, the file is transferred using bosh scp:

bosh -d <a9s-pg-deployment-name> scp <backup-id>.dump pg/<primary-node-index>:/var/vcap/store/postgresql-restore

Prepare a9s-pg

note

Restoring a .dump file will generate as many new WAL files as necessary to load the whole dump file.

This action can use a lot of disk space since all the data in the backup file will be written to the database, generating new WAL files that can take some time to be archived.

You can restore the dump file within the current running cluster, or in a new deployment. In any case, the data must be restored on the primary, so that it can be streamed to the standby nodes.

Prepare the Standby Nodes

Stop Standby Nodes

Find the a9s-pg Primary Node, then stop the standby nodes with monit stop:

bosh -d <deployment-name> ssh pg/<index-or-id>
sudo su -

monit stop postgresql-ha

Stop only the postgresql-ha process. The repmgr process depends on postgresql-ha, so it will also be stopped by this command.
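
To verify that a standby node is fully stopped before continuing, you can, for example, check the monit status and look for leftover PostgreSQL processes:

# postgresql-ha and repmgr should no longer be reported as running.
monit summary

# Should print nothing if no PostgreSQL server process is left on the standby.
ps aux | grep '[p]ostgres'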

Prepare the Primary

Stop repmgr

The recovery of the dump will overwrite the content of the repmgr database. To avoid issues with repmgr, it needs to be stopped on the primary without stopping the PostgreSQL process:

bosh -d <deployment-name> ssh pg/<index-or-id>
sudo su -

monit stop repmgr

Drop replication slots

Drop the replication slots of the stopped standby nodes on the primary node with:

sudo su -
source /var/vcap/jobs/postgresql-ha/helpers/vars.sh

chpst -u postgres:vcap psql postgres
SELECT pg_drop_replication_slot(pg_replication_slots.slot_name) FROM pg_replication_slots WHERE active <> 't';
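
To confirm the cleanup, you can list the remaining replication slots; only slots that are still active, if any, should be left:

SELECT slot_name, active FROM pg_replication_slots;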

Terminate Running Process

To make sure no application is connected, it is recommended to block new connections to port 5432 with iptables, and execute the following command to drop existing active connections:

SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pid <> pg_backend_pid();
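
A minimal sketch for blocking new client connections with iptables, assuming clients connect over TCP port 5432; the rule excludes the loopback interface so local maintenance connections keep working, and it must be removed again once the recovery is finished:

# Reject new incoming client connections to PostgreSQL during the restore.
iptables -A INPUT ! -i lo -p tcp --dport 5432 -m state --state NEW -j REJECT

# Remove the rule again after the recovery is complete.
iptables -D INPUT ! -i lo -p tcp --dport 5432 -m state --state NEW -j REJECT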

Cleanup Startup Lock Directory

A file may be left in the startup lock directory containing a PID that has since been recycled by the operating system. In this case, restarting the postgresql-ha process can fail because it assumes a startup process is already running, when in fact another process is merely reusing the PID.

To avoid this issue, after completely stopping the postgresql-ha process, check if there is any related process running with ps aux. If no related process is running, remove the content of the startup locks directory:

rm /tmp/postgresql-startup-locks/*
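
For the process check mentioned above (to be done before removing the lock files), a simple, illustrative filter could be:

# Should print nothing if no PostgreSQL or startup-related process is still running.
ps aux | grep -E '[p]ostgres|[s]tartup'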

Recover Backup

The backup is a dump file generated with pg_dumpall.

To recover it, you can execute:

bosh -d <a9s-pg-deployment-name> ssh pg/<primary-node-index-or-id>
sudo -i
source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
chpst -u postgres:vcap psql --quiet postgres < /var/vcap/store/postgresql-restore/b6f4c071-ef44-4af2-9608-531b4ce4823f-1569935509279.dump
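
After the restore finishes, a quick sanity check is to confirm that the expected databases and roles exist again, for example:

# List the restored databases and roles.
chpst -u postgres:vcap psql postgres -c '\l'
chpst -u postgres:vcap psql postgres -c '\du'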

Update Database Roles

note

This section applies only for fresh a9s-pg deployments in a new site.

bosh -d <deployment-name> ssh pg/<primary-node-index-or-id>
sudo -i
source /var/vcap/jobs/postgresql-ha/helpers/vars.sh

chpst -u postgres:vcap psql postgres -f /var/vcap/jobs/postgresql-ha/data/create_or_update_roles_and_databases.sql

Finish Cluster Setup

Finish Primary Setup

This section explains how to finish the process of setting up the primary with the new data.

Clean up repmgr

After restoring the dump file, the repmgr database must be cleaned up before starting the repmgr process again, to make sure its previous state does not get in the way of the new repmgr cluster.

Access the repmgr database on the primary node.

bosh -d <deployment-name> ssh pg/<primary-node-index-or-id>
sudo su -
source /var/vcap/jobs/postgresql-ha/helpers/vars.sh

chpst -u postgres:vcap psql repmgr_cluster

Then, remove all entries in all repmgr tables:

DO $$ DECLARE
r record;
BEGIN
FOR r IN (SELECT tablename FROM pg_tables WHERE schemaname = 'repmgr') LOOP
EXECUTE 'TRUNCATE repmgr.' || quote_ident(r.tablename) || ' CASCADE';
END LOOP;
END $$;

Register repmgr

At this point, the repmgr state is empty. The primary must be registered in order to be able to start the standby nodes. The following steps must be executed in the primary:

bosh -d <deployment-name> ssh pg/<primary-node-index-or-id>
sudo su -

chpst -u postgres:vcap bash -c 'source /var/vcap/jobs/postgresql-ha/helpers/vars.sh; \
source /var/vcap/jobs/postgresql-ha/helpers/repmgr_utils.sh; \
register_master_node;'

Start repmgr

Now the repmgr process can be started again in the primary.

bosh -d <deployment-name> ssh pg/<primary-node-index-or-id>
sudo su -

monit start repmgr

After starting repmgr, check that the repmgr process is running.

Note that monit summary can show a process as running when it is actually waiting. To make sure the process is really running, see the Cluster Status section.

Start Standby Nodes

After configuring the primary node, clean up the data directory on the standby nodes:

rm -r /var/vcap/store/postgresqlXX/*

Then execute the pre-start script:

/var/vcap/jobs/postgresql-ha/bin/pre-start

At this point, data should have been cloned from the primary and it is possible to monit start postgresql-ha on the standby nodes.
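
For example, on each standby node:

# Start the PostgreSQL process on the standby.
monit start postgresql-ha

# Verify that postgresql-ha (and repmgr) are reported as running.
monit summary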

Finally, the cluster is ready to be used again.

Remember to clean up /var/vcap/store/postgresql-restore after the cluster is up and running.
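
For example, on the primary node:

# Remove the restore directory and the copied dump file.
rm -r /var/vcap/store/postgresql-restore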

Check Cluster Health

You can learn more about checking the cluster status here.

It is also possible to get some idea of the cluster health by looking at the metrics.