Skip to main content
Version: 55.0.0

a9s PostgreSQL Manual Point-in-Time Recovery

This document describes shortly how to recover a backup to the latest available state of the data on a different site manually.

Requirements

There are some requirements to accomplish this:

  • aws-cli or azure-cli: With the credentials necessary to download the base backup and all necessary WAL files.
  • Access to the bucket where the backup is stored: The CLI must be configured with credentials that permits at least reading the files.
  • Access to the old database where the backup metadata is stored: This is necessary to recover the secret used to encrypt the backup files. You will need to have an available Backup Manager able to read data from the database or take note previously of the secret.
  • Backup was created with a9s Data Services version greater than 20.0.0. Due to a change in the directory organization older backups will not be found. If you are using an older version you can use the older documentation

If you are not familiarized with the continuous archiving and point-in-time recovery concepts, please read the PostgreSQL documentation about the subject.

Supported AWS S3 Storage Classes

While the a9s Backup Framework supports AWS S3 as a backup store, this does not extend to every AWS S3 Storage Class. Currently, the supported AWS S3 storage class is S3 Standard.

Command Usage Notice

The commands outlined in this document are designed to be executed on Linux operating systems. Using them on different Unix-based operating systems, such as macOS, may result in unforeseen outcomes; as a precaution, we recommend utilizing Linux for these operations.

How to Discover the Deployment Name

Some of the backup files are stored in directories that has the same name as the service instance deployment. If you know the deployment name you can skip this step, otherwise choose a method below according to your scenario.

Using Cloud Foundry CLI

The name of the database is the same used as deployment name, so it is possible to retrieve the node name with cf env, when you have an app bound to the service. You can also check a service key, if the service has any.

You can find the database name under:

VCAP_SERVICES: {
<service-name>: {
credentials: {
(...)
name: <database-name>,
(...)
}
}
}

Using Service Broker API

You can retrieve the broker information about the instance in order to retrieve the deployment name.

Retrieve service GUID:

cf service <service-instance-name> --guid

You can also check the ccdb database in order to find this guid, if you can not use cf service.

Once with the service instance GUID, retrieve the service broker password. The default username is admin:

credhub get -n /postgresql_service_broker_password

Then, trigger the following request against your service broker:

curl http://<broker-username>:<broker-password>@<broker-host>:3000/instances/<service-instance-guid>

The deployment name should be under:

{
(...)
"deployment_name": <deployment-name>,
(...)
}

Using Cloud Foundry and Service Broker Databases.

Use this method if your Cloud Foundry service or Service Broker is not available.

First, you need to access the ccdb (Cloud Controller Database) database.

After connected, find the GUID of the desired service instance with:

SELECT name, guid FROM service_instances WHERE name = '<service-instance-name>';

With the GUID in hands, check the deployment name in the a9s Service Broker database:

SELECT deployment_name FROM instances WHERE guid_at_tenant = '<service-instance-guid>';

How to Discover the a9s Backup Manager Service Instance ID

The backup id has the following format: <backup-manager-service-instance-id>-<timestamp>.

The <backup-manager-service-instance-id> is generated by the a9s Backup Manager to identify the instance. In order to identify the correct base backup to use when recovering an instance we need to find out the instance id on the a9s Backup Manager.

Note: The service instance ID generated by the a9s Backup Manager is not the same used by Cloud Foundry to identify the instance.

Using the a9s Backup Manager API

The a9s Backup Manager API supports querying for the backups of a given instance. You can use this method if the a9s Backup Manager for the failing site is still available.

First, retrieve service GUID:

cf service <service-instance-name> --guid

Retrieve the a9s Backup Manager password for the admin user:

credhub get -n /backup_manager_password

Then, trigger the following request against your Backup Manager:

curl -u <backup-manager-username>:<backup-manager-password> \
http://<backup-manager-host>:3000/instances/<service-instance-guid>

This command lists the backup and restore history, discover which backup to use checking the created_at and updated_at fields, and use the backup_id field content.

Using a9s Backup Manager Ruby Console (IRB)

The a9s Backup Manager includes a script that runs the Interactive Ruby Shell already configured to access the current a9s Backup Manager database. To execute this script, access the a9s Backup Manager instance, become root and execute the following command:

/var/vcap/jobs/anynines-backup-manager/bin/rails_c

Inside the IRB shell, execute the following commands:

irb(main):001:0> Instance.where(instance_id: "<service-instance-guid>").first.guid

The <service-instance-guid> is the Cloud Foundry Service Instance GUID, you can retrieve it with the following command or refer to the section Using Cloud Foundry and Service Broker Databases:

cf service <service-instance-name> --guid

Using the a9s Backup Manager Database

Use this method if your a9s Backup Manager is not available.

First, you need to access the backupmanager database.

SELECT guid FROM instances WHERE instance_id='<service-instance-guid>';

The <service-instance-guid> is the Cloud Foundry Service Instance GUID, you can retrieve it with the following command or refer to the section Using Cloud Foundry and Service Broker Databases:

cf service <service-instance-name> --guid

Download Files

The Continuous Archiving storage follows the structure below:

AWS S3:

  • <bucket>: Or container where the backups are stored.
    • <base-backup-id>.json: Files holding metadata about the base backups.
    • <base-backup-id>-<index>: Encrypted, split and compressed base backups files.
    • <instance-id>: Directory holding encrypted and compressed WAL files directory and the WAL files metdata.

Azure:

  • <bucket>: Or container where the backups are stored.
    • <base-backup-id>: Encrypted and compressed base backups files.
    • <deployment-name>: Encrypted and compressed WAL files directory.

Download Base Backup

The first step is to identify which available base backup to use in you storage backend.

AWS S3:

aws s3 ls s3://<bucket>/<backup-manager-service-instance-id>

This command lists files which the name follows the format <backup-manager-service-instance-id>-<timestamp>.json. Each .json file is the metadata about a base backup. Since we are restoring to the latest available point-in-time, get the name of the backup with most recent timestamp.

The name of the base backup (Backup ID) is the name of the file without .json.

It is important to note that backup files are split into multiple files containing 1GB each, this means that if your base backup has more than 1GB of size, you will need to put together these files before restoring, for the moment let's download all files belonging to a base backup:

aws s3 ls s3://<bucket>/<backup-id>

Now, let's download all files that forms the base backup:

aws s3 cp --exclude="*" --include="<backup-id>*" --recursive \
s3://<bucket> <tmp-base-backup-dir>

Azure:

az storage blob list --container-name <container> --prefix <backup-manager-service-instance-id>- | jq '.[].name'

This command will list all the available base backups, where each file name look like <backup-manager-service-instance-id>-<timestamp>. Since we are restoring to the latest available point-in-time, get the name of the backup with most recent timestamp.

The name of the base backup (Backup ID) is the name of the file without .json.

Let's download all files belonging to a base backup.

az storage blob download --container-name <container> --file <backup-id> --name <backup-id>

Download WAL Files

Download the WAL files directory:

AWS S3:

aws s3 cp --recursive s3://<bucket>/<instance-id>/ <target-dir>

Azure:

az storage blob download-batch --source <container-name> \
--pattern <deployment-name>/* --destination <destination-dir>

Prepare Files

The files are split (if you are using AWS S3), encrypted and compressed. Before restoring, you will need to prepare these files.

The first step after downloading the files is to decrypt them. Once the files are decrypted, you must join them, if necessary and decompress them before starting the recovery process.

Retrieve Backup Encryption Secret

Files are encrypted before stored in you backend storage. So in order to be able to extract this files we need to decrypt it.

Unless you have changed the secret used to encrypt the backup using the dashboard (in this case, you need to remember your secret), you will have to access the Backup Manager in the old site to retrieve the secret used to encrypt the backup.

$ bosh -d <deployment-name> ssh backup-manager ## Sometimes called ancillary-services
$ sudo su -
$ /var/vcap/jobs/anynines-backup-manager/bin/rails_c
irb(main):001:0> Backup.where(backup_id: "<backup-id>-<timestamp>").first.credentials[:filter_plugins]
irb(main):002:0* .select{ |plugin| plugin[:name] == "encrypt" }.first[:password]

You can find the secret of a WAL file with:

irb(main):001:0> BackupFragment.where(bucket_subdir: "backup-manager-service-instance-id").first.credentials[:filter_plugins] \
irb(main):002:0* .select{ |plugin| plugin[:name] == "encrypt" }.first[:password]

Note that if the encryption key changes there will be WAL files with different encryption keys, in this case you need to list the key for all of them:

irb(main):001:0> BackupFragment.where(bucket_subdir: "backup-manager-service-instance-id").each do |wal|
irb(main):002:0* pass = wal.credentials[:filter_plugins].select{ |plugin| plugin[:name] == "encrypt" }.first[:password]
irb(main):003:0* puts "#{wal.file_name}: #{pass}"
irb(main):004:0* end

To know more about this, you can also check here.

Unfortunately, if you are not able to retrieve the credentials you will not be able to decrypt the files and go on with the restoring process.

Decrypt All Files

For all files that belongs to the same base backup and for all WAL files downloaded, execute the following command:

cat <file-name> \
| openssl enc -aes256 -md md5 -d -pass 'pass:<secret>' | gunzip -c > <dst-dir>/<file-name>

For example:

cat 000000010000000000000002.00000028.backup-0 \
| openssl enc -aes256 -md md5 -d -pass 'pass:NYHD8MVmA55HEqoaYHpQaxfwEMcQ1wkI' | gunzip -c > /tmp/000000010000000000000002.00000028.backup-0

To decrypt the base backup execute the command bellow for all files that belongs to the backup. All split files must be decrypted together, and in the correct asc order (from index 0 to the last).

Usage of Compound Commands

The following commands make use of the following compound command cat $(ls -1v ...), which assumes the use of a Linux distribution operating system. Therefore, if you are using a different operating system for your environment, it is recommended that you adjust this compound command to accordingly.

cat $(ls -1v <backup-manager-service-instance-id>-*) \
| openssl enc -aes256 -md md5 -d -pass 'pass:<secret>' | gunzip -c > <dst-dir>/<base-backup-id>.tgz

To decrypt all WAL files, go inside the WAL files directories and execute:

for i in *-0; do
echo ${i};
cat ${i} | openssl enc -aes256 -md md5 -d -pass 'pass:<secret>' | gunzip -c > "${i::-2}.gz";
done
note

The expected output of the decrypt command is:

<name-of-decrypted-file>
*** WARNING : deprecated key derivation used.
Using -iter or -pbkdf2 would be better.

After decrypting the files you should confirm that the gzip files were generated running ls *.gz on the directory where the files were decrypted.

note

You may need to use different secrets if you have used many secrets during the service instance life. See the Retrieve Backup Encryption Secret section for more information.

This will create WAL files with the .gz extension in the current WAL files directory.

Copy Files to the Recovery Instance

Before copying the files to your instance, make sure you have enough space to store the WAL files after extracting (make an estimation where each WAL file has 16MB).

And make sure you are restoring this files in the current master node. The recovery process must be started in the master node, then cloned by the standby nodes.

First, create a directory under /var/vcap/store/

$ bosh -d <deployment-name> ssh pg/<master-node-index-or-id>
$ sudo su -
## mkdir /var/vcap/store/postgresql-restore
## chown root.vcap -R /var/vcap/store/postgresql-restore
## chmod g+w /var/vcap/store/postgresql-restore

With the directory prepared, copy the files to the VM.

In the example bellow, we put together the base backup and the directory containing the WAL files in a .tgz file named backup.tgz, the file is transferred using bosh scp.

$ bosh -d <deployment-name> scp backup.tgz pg/<master-node-index>:/var/vcap/store/postgresql-restore

Prepare PostgreSQL

You will need to stop the postgresql-ha process on all nodes as described bellow. And stop and start the nodes according to the order bellow.

Stop PostgreSQL Recovery Cluster

Discover which node is the master, if running a cluster. Then stop the standby nodes, and make sure you stop the master node last. If running a single instance, follow only the instruction that applies to the master node.

For each node, execute (starting with the standby nodes):

$ bosh -d <deployment-name> ssh pg/<index-or-id>
$ sudo su -
## monit stop postgresql-ha

Stop only the postgresql-ha process. The repmgr process depends on postgresql-ha so it will also be stopped with this commands.

Cleanup Startup Lock Directory

A file may be left on the startup lock directory containing a PID that has been recycled by the operating system. In this case, when trying to restart the postgresql-ha process, it can fail due to a startup process already running when actually another process is reusing the PID.

To avoid this issue, after completely stopping the postgresql-ha process, check if there is any related process running with ps aux. If no related process is running, remove the content of the startup locks directory:

rm /tmp/postgresql-startup-locks/*

Cleanup Persistent Storage

In order to properly restore, you have to clean up PostgreSQL data directory on master. a9s PostgreSQL is configured to store this data in the persistent disk attached to the BOSH VM and mounted under /var/vcap/store.

The data directory exists under /var/vcap/store/postgresqlXX, where XX is the major version.

rm -rf /var/vcap/store/postgresqlXX/*

Extract files

After extracting our backup.tgz file inside the /var/vcap/store/postgresql-restore, make sure you organize the content of the directory so all WAL files are in the <old-deployment-name>, this must match with the path used while creating the recovery.conf file. We will assume the directory have the following structure:

<backup-manager-service-instance-id>-<timestamp>.gz ## The decrypted base backup.
<backup-manager-service-instance-id>/ ## Directory containing all decrypted WAL files.

Extract Base Backup

Once the files are in you /var/vcap/store/postgresql-restore, the base backup must be restored inside the data directory.

tar -xzvf /var/vcap/store/postgresql-restore/<base-backup-id>.tgz --directory /var/vcap/store/postgresqlXX/

Extract WAL Files

With the WAL files decrypted, you will have some .gz files. It is time to extract this files and prepare them to be recovered.

for i in *.gz; do echo ${i}; gunzip -d ${i}; done

After extracted, each WAL file must have 16MB.

Recover Backup

Once the environment and the files are properly prepared, it is time to start PostgreSQL and finish configuring the database for the new deployment.

Create recovery.conf

With everything prepared, it is time to create the recovery.conf file that will be used to recover PostgreSQL.

When recovering an a9s PostgreSQL 13 and above

The recovery.conf file must be placed on /var/vcap/sys/tmp/postgresql-ha/recovery.conf.

touch /var/vcap/sys/tmp/postgresql-ha/recovery.conf
chown postgres:vcap /var/vcap/sys/tmp/postgresql-ha/recovery.conf
# for a9s PostgreSQL 13 the user must be `vcap`
chown vcap:vcap /var/vcap/sys/tmp/postgresql-ha/recovery.conf

Also, it is necessary to create the file /var/vcap/store/postgresqlXX/recovery.signal.

touch /var/vcap/store/postgresqlXX/recovery.signal
chown postgres:vcap /var/vcap/store/postgresqlXX/recovery.signal
# for a9s PostgreSQL 13 the user must be `vcap`
# chown vcap:vcap /var/vcap/store/postgresqlXX/recovery.signal
When recovering an a9s PostgreSQL 11 and below

The recovery.conf file must be placed on /var/vcap/store/postgresqlXX/recovery.conf, and be owned by the vcap user. The XX is the major version.

touch /var/vcap/store/postgresqlXX/recovery.conf
chown vcap:vcap /var/vcap/store/postgresqlXX/recovery.conf

The recovery.conf must contain the following content:

recovery_target_timeline = 'latest'
restore_command = 'cp /var/vcap/store/postgresql-restore/<backup-manager-service-instance-id>/%f %p'
recovery_target_action = 'promote'

Note that the restore_command must point to the directory where the WAL files have been extracted.

Start Master Node

After configuring the recovery.conf file, it is time to start the PostgreSQL process and let it run its magic.

Apply the correct PostgreSQL configuration, by copying the current configuration to the data directory:

$ source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
$ cp ${JOB_DIR}/config/{postgresql,pg_hba}.conf ${STORE_DIR}
$ chown postgres:vcap $STORE_DIR/{postgresql,pg_hba}.conf
a9s PostgreSQL 13 and below
$ source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
$ cp ${JOB_DIR}/config/{postgresql,pg_hba}.conf ${STORE_DIR}
$ chown vcap:vcap $STORE_DIR/{postgresql,pg_hba}.conf

Start PostgreSQL process:

monit start postgresql-ha

Once, PostgreSQL is running, finish configuring the users and databases.

With PostgreSQL master running, access it with

$ sudo -i
$ source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
$ chpst -u postgres:vcap psql postgres
a9s PostgreSQL 13 and below
$ su - vcap
$ source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
$ psql postgres

Before applying the right database, you must delete the current empty database that will be replaced in the next step.

DROP DATABASE <new-deployment-name>;
note

Note that, if you have followed the steps correctly the database is empty, therefore there is no issue deleting it.

Once connected to the default database, fix its name:

ALTER DATABASE <old-deployment-name> RENAME TO <new-deployment-name>;

Exit the psql command line, and finish configuring users, database and extensions according to the deployment manifest:

$ sudo -i
$ chpst -u postgres:vcap bash -c 'source /var/vcap/jobs/postgresql-ha/helpers/vars.sh; \
source /var/vcap/jobs/postgresql-ha/helpers/postgresql_utils.sh; \
create_extensions; \
create_or_update_users_and_databases;'
$ chpst -u postgres:vcap bash -c 'source /var/vcap/jobs/postgresql-ha/helpers/vars.sh; \
source /var/vcap/jobs/postgresql-ha/helpers/postgresql_utils.sh; \
enable_statistics_collector; \
enable_app_dynamics_support;' ## If you enabled app dynamincs support
a9s PostgreSQL 13 and below
$ su - vcap
$ source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
$ source /var/vcap/jobs/postgresql-ha/helpers/postgresql_utils.sh
$ create_extensions
$ create_or_update_users_and_databases
$ enable_statistics_collector ## If you enabled app dynamincs support
$ enable_app_dynamics_support ## If you enabled app dynamincs support

After finish configuring, if the repmgr process did not start, please try again with monit start repmgr. You must only go to the next step once you have all the processes running.

Note that monit summary can show a process as running when it is actually waiting. So make sure the process is running with ps aux.

Start Standby Nodes

After configuring the master node, monit start postgresql-ha in the standby nodes, if running a cluster. After that, should be back online after cloning.

Remember to cleanup /var/vcap/store/postgresql-restore after the the cluster is up and running.

Checking Cluster Health

You can know more about checking the cluster status here.