a9s PostgreSQL Manual Logical Backup Recovery

This document describes shortly how to manually restore a logical backup on a PostgreSQL deployment in another site.

Summary

Requirements

There are some requirements to acomplish this:

  • aws-cli or azure-cli: With the credentials necessary to download the backup.
  • Access to the bucket where the backup is stored: The CLI must be configured with credentials that permits at least reading the files.
  • Access to the old database where the backup metadata is stored: This is necessary to recover the secret used to encrypt the backup files. You will need to have an available Backup Manager able to read data from the database or take note previously of the secret.

If you are not familiarized with the pg_dump and PostgreSQL dump files concept, please read the PostgreSQL documentation about the subject.

How to Discover the Service Instance GUID

In order to identify the backup id on a9s Backup Manager, the Cloud Foundry service instance id must be identified. If the Cloud Foundry CLI can not be used (cf service <service-name> --guid), follow the step bellow.

Using Cloud Foundry and Service Broker Databases

Access the ccdb (Cloud Controller Database) database.

After connected, find the GUID of the desired service instance with:

SELECT name, guid FROM service_instances WHERE name = '<service-instance-name>';

How to Discover the a9s Backup Manager Service Instance ID

The backup id has the following format: <backup-manager-service-instance-id>-<timestamp>.

The <backup-manager-service-instance-id> is generated by the a9s Backup Manager to identify the instance. In order to identify the correct base backup to use when recovering an instance we need to find out the instance id on the a9s Backup Manager.

Note: The service instance ID generated by the a9s Backup Manager is not the same used by Cloud Foundry to identify the instance.

Using the a9s Backup Manager API

The a9s Backup Manager API supports querying for the backups of a given instance. You can use this method if the a9s Backup Manager for the failing site is still available.

Retrieve the a9s Backup Manager password for the admin user:

$ credhub get -n /backup_manager_password

Then, trigger the following request against your Backup Manager:

curl -u <backup-manager-username>:<backup-manager-password> \
  http://<backup-manager-host>:3000/instances/<service-instance-guid>

This command lists the backup and restore history, discover which backup to use checking the created_at and updated_at fields, and use the backup_id field content.

Using a9s Backup Manager Ruby Console (IRB)

The a9s Backup Manager includes a script that runs the Interactive Ruby Shell already configured to access the current a9s Backup Manager database. To execute this script, access the a9s Backup Manager instance, become root and execute the following command:

# /var/vcap/jobs/anynines-backup-manager/bin/rails_c

Inside the IRB shell, execute the following commands:

irb(main):001:0> Instance.where(instance_id: "<service-instance-guid>").first.guid

The <service-instance-guid> is the Cloud Foundry Service Instance GUID (How to Discover the Service Instance GUID).

Using the a9s Backup Manager Database

Use this method if your a9s Backup Manager is not available.

First, you need to access the backupmanager database.

# SELECT guid FROM instances WHERE instance_id='<service-instance-guid>';

The <service-instance-guid> is the Cloud Foundry Service Instance GUID (How to Discover the Service Instance GUID).

Download Files

The logical backup storage follows the structure bellow:

AWS S3:

  • <bucket>: Or container where the backups are stored.
    • <backup-manager-service-instance-id>.json: Files holding metadata about the backup.
    • backup_data_tmp: Directory holding the raw data of the backups.
      • <backup-manager-service-instance-id>-<index>: Encrypted and split backup file.

Azure:

  • <bucket>: Or container where the backups are stored.
    • <backup-manager-service-instance-id>: Encrypted backup file.

Download Backup

The first step is to identify which available backup to use in you storage backend.

AWS S3:

$ aws s3 ls s3://<bucket>/<backup-manager-service-instance-id>

This command should list the metadata files, each file looks like <backup-manager-service-instance-id>-<timestamp>.json. Each .json file is the metadata about a backup. Since we are restoring to the latest available backup, get the name of the backup with most recent timestamp.

The name of the backup (Backup ID) is the name of the file without .json.

It is important to note that backup files are split into multiple files containing 1GB each, this means that if your base backup has more than 1GB of size, you will need to put together this files before restoring, for the moment let's download all files benlonging to a backup.

$ aws s3 ls s3://<bucket>/backup_data_tmp/<backup-manager-service-instance-id>

To list these files.

And the following command to download all files that belongs to the backup:

$ aws s3 cp --exclude="*" --include="<backup-manager-service-instance-id>*" --recursive \
  s3://<bucket>/backup_data_tmp <tmp-base-backup-dir>

Azure:

$ az storage blob list --container-name <container> --prefix <backup-manager-service-instance-id>- | jq '.[].name'

This command will list all the available backups, where each file name looks like <backup-manager-service-instance-id>-<timestamp>. Since we are restoring to the latest available backup, get the name of the backup with most recent timestamp.

The name of the backup (Backup ID) is the name of the file.

To download the file, execute:

$ az storage blob download --container-name <container> --file <backup-manager-service-instance-id> --name <backup-manager-service-instance-id>

Prepare Files

The files are split (if you are using AWS S3) and encrypted. Before restoring, you will need to prepare these files.

The first step after downloading, is to decrypt them. Once the files are decrypted, you must join them, if necessary, before starting the recovery process.

Retrieve Backup Encryption Secret

Files are encrypted before stored in you backend storage. So in order to be able to extract this files we need to decrypt them.

Unless you have changed the secret used to encrypt the backup using the dashboard (in this case, you need to remember your secret), you will have to access the Backup Manager in the old site to retrieve the secret used to encrypt the backup.

$ bosh -d <deployment-name> ssh backup-manager # Sometimes called ancillary-services
$ sudo su -
$ /var/vcap/jobs/anynines-backup-manager/bin/rails_c
irb(main):001:0> Backup.where(backup_id: "<backup-manager-service-instance-id>").first.credentials[:filter_plugins] \
irb(main):002:0> .select{ |plugin| plugin[:name] == "encrypt" }.first[:password]

To know more about this, you can also check here.

Unfortunatelly, if you are not able to retrieve the credentials you will not be able to decrypt the files and go on with the restoring process.

Decrypt All Files

AWS S3:

To decrypt the backup execute the command bellow for all files that belongs to the backup:

$ cat <backup-manager-service-instance-id>-<index> \
    | openssl enc -aes256 -d -pass 'pass:<secret>' \
    | gunzip -c > <dst-dir>/<backup-manager-service-instance-id>-<index>

Note that if you are using AWS S3, your backup file might be split. In this case all files belonging to the backup file must be decrypted.

For example:

$ cat b6f4c071-ef44-4af2-9608-531b4ce4823f-1548967947154-0 \
  | openssl enc -aes256 -d -pass 'pass:NYHD8MVmA55HEqoaYHpQaxfwEMcQ1wkI' \
  | gunzip -c > /var/vcap/store/postgresql-restore/b6f4c071-ef44-4af2-9608-531b4ce4823f-1569935509279-0
$ cat b6f4c071-ef44-4af2-9608-531b4ce4823f-1548967947154-1 \
  | openssl enc -aes256 -d -pass 'pass:NYHD8MVmA55HEqoaYHpQaxfwEMcQ1wkI' \
  | gunzip -c > /var/vcap/store/postgresql-restore/b6f4c071-ef44-4af2-9608-531b4ce4823f-1569935509279-1

Join Split Files

As mentioned before, the backup file is split into multiple files with 1GB each, if your backup has less than this size you can skip the next step.

For each file belonging to the same base backup, join the files using cat.

$ cat <base-backup-manager-service-instance-id>-0 > <backup-manager-service-instance-id>.dump
$ cat <base-backup-manager-service-instance-id>-<index> >> <backup-manager-service-instance-id>.dump

E.g.:

$ cat b6f4c071-ef44-4af2-9608-531b4ce4823f-1548967947154-0 > b6f4c071-ef44-4af2-9608-531b4ce4823f-1569935509279.dump
$ cat b6f4c071-ef44-4af2-9608-531b4ce4823f-1548967947154-1 >> b6f4c071-ef44-4af2-9608-531b4ce4823f-1569935509279.dump
$ cat b6f4c071-ef44-4af2-9608-531b4ce4823f-1548967947154-2 >> b6f4c071-ef44-4af2-9608-531b4ce4823f-1569935509279.dump
(...)

Make sure there is no new lines when joining a file with the other, this can corrupt the file making the restore operation to fail.

Azure:

Execute the following command on the backup file:

$ cat <file-name> \
       | openssl enc -aes256 -d -pass 'pass:<secret>' \
       | gunzip -c > <dst-dir>/<file-name>

For example:

$ cat b6f4c071-ef44-4af2-9608-531b4ce4823f-1569935509279 \
  | openssl enc -aes256 -d -pass 'pass:NYHD8MVmA55HEqoaYHpQaxfwEMcQ1wkI' \
  | gunzip -c > /var/vcap/store/postgresql-restore/b6f4c071-ef44-4af2-9608-531b4ce4823f-1569935509279.dump

Copy Files to the Recovery Instance

Before copying the files to your instance, make sure you have enough space to store the backup file and the created database and make sure you are restoring this dump file in the current master node. The recovery process must be started in the master node, then cloned by the standby nodes.

First, create a directory under /var/vcap/store/

$ bosh -d <deployment-name> ssh pg/<master-node-index-or-id>
$ sudo su -
# mkdir /var/vcap/store/postgresql-restore
# chown root.vcap -R /var/vcap/store/postgresql-restore
# chmod g+w /var/vcap/store/postgresql-restore

With the directory prepared, copy the backup file (dump file) to the VM.

In the example bellow the file is transfered using bosh scp:

$ bosh -d <deployment-name> scp <backup-manager-service-instance-id>-<timestamp>.dump pg/<master-node-index>:/var/vcap/store/postgresql-restore

Prepare PostgreSQL

You can restore the dump file with the current running cluster, data must be restore on master and it will be streamed to the standby nodes.

If you choose to stop the standby nodes before restoring, remember to drop their replication slots within the master node with:

SELECT pg_drop_replication_slot(pg_replication_slots.slot_name) FROM pg_replication_slots WHERE active <> 't';

Note that restoring a .dump file will generate new WAL files as much as the size of the backup file, if you have continuous archiving enabled, this can use a lot of the space since all the data in the backup file will be written to the database generating new WAL files that can take some time to backup.

Make sure not aplication is connected, you might want to block new connection to port 5432 with iptables and execute the following command to drop existing active connections:

SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pid <> pg_backend_pid();

Stop PostgreSQL Recovery Cluster (Optional)

The next step applies only to cluster deployments.

Discover which node is the master, then stop the standby nodes. Make sure you stop the master node last.

For each standby node, execute:

$ bosh -d <deployment-name> ssh pg/<index-or-id>
$ sudo su -
# monit stop postgresql-ha

Stop only the postgresql-ha process. The repmgr process depends on postgresql-ha so it will also be stopped with this commands.

Cleanup Startup Lock Directory

A file may be left on the startup lock directory containing a PID that has been recycled by the operating system. In this case, when trying to restart the postgresql-ha process, it can fail due to a startup process already running when actually another process is reusing the PID.

To avoid this issue, after completely stopping the postgresql-ha process, check if there is any related process running with ps aux. If no related process is running, remove the content of the startup locks directory:

rm /tmp/postgresql-startup-locks/*

Recover Backup

The backup is a dump file generated with pg_dumpall.

So in order to recover, you can execute:

$ bosh -d <deployment-name> ssh pg/<master-index-or-id>
$ sudo su - vcap
$ source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
$ psql --quiet postgres < /var/vcap/store/postgresql-restore/b6f4c071-ef44-4af2-9608-531b4ce4823f-1569935509279.dump

Update Database Name

Because the database name is based on the deployment name, you will have to delete the current database that matches the current deployment name and rename the database being recovered to the current deployment name.

# su - vcap
$ source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
$ psql postgres

Once connected to the default database, fix its name:

DROP DATABASE IF EXISTS <new-deployment-name>;
ALTER DATABASE <old-deployment-name> RENAME TO <new-deployment-name>;

Exit the psql command line, and finish configuring users, database and extensions according to the deployment manifest:

$ source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
$ source /var/vcap/jobs/postgresql-ha/helpers/postgresql_utils.sh
$ create_extensions
$ create_or_update_users_and_databases
$ enable_statistics_collector # If you have app dynamincs support enabled
$ enable_app_dynamics_support # If you app dynamincs support enabled

After finish configuring, check that the repmgr process is running.

Note that monit summary can show a process as running when it is actually waiting. So make sure the process is running with ps aux.

Start Standy Nodes

After configuring the primary node, clean up the data diretory on the standby nodes:

# rm -r /var/vcap/store/postgresqlXX/*

Then execute the pre-start script:

# /var/vcap/jobs/postgresql-ha/bin/pre-start

At this point, data should have been cloned from the primary and it is possible to monit start postgresql-ha on the standby nodes, if running a cluster.

Now the cluster is ready to be used again.

Remember to cleanup /var/vcap/store/postgresql-restore after the the cluster is up and running.

Checking Cluster Health

You can know more about checking the cluster status here