a9s PostgreSQL Manual Logical Backup Recovery
This document briefly describes how to manually restore a logical backup on a PostgreSQL deployment on another site.
Requirements
There are some requirements to accomplish this:
- aws-cli or azure-cli: With the credentials necessary to download the backup.
- Access to the bucket where the backup is stored: The CLI must be configured with credentials that permit at least reading the files.
- Access to the old database where the backup metadata is stored: This is necessary to recover the secret used to encrypt the backup files. You will need an available a9s Backup Manager that is able to read data from the database, or you must have taken note of the secret beforehand.
If you are not familiar with pg_dump and the PostgreSQL dump file concept, please read the official PostgreSQL documentation on the subject.
The commands outlined in this document are designed to be executed on Linux operating systems. Using them on different Unix-based operating systems, such as macOS, may result in unforeseen outcomes; as a precaution, we recommend utilizing Linux for these operations.
How to Discover the Service Instance GUID
In order to identify the Backup ID on the a9s Backup Manager, you first need the Cloud Foundry Service Instance GUID.
If the Cloud Foundry CLI cannot be used (cf service <service-name> --guid), follow the steps below.
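When the Cloud Foundry CLI is available, the lookup is a single command; the service instance name below is a hypothetical example:
cf service my-postgresql-instance --guid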
Using Cloud Foundry and Service Broker Databases
Access the ccdb (Cloud Controller Database).
Once you are inside the database, find the GUID of the desired Service Instance with:
SELECT name, guid FROM service_instances WHERE name = '<service-instance-name>';
How to Discover the a9s Backup Manager Service Instance ID
The backup ID has the following format: <backup-manager-service-instance-id>-<timestamp>.
The <backup-manager-service-instance-id> is generated by the a9s Backup Manager to identify the instance. In order to identify the correct base backup to use when recovering an instance, we need to find out the Service Instance ID on the a9s Backup Manager.
The Service Instance ID generated by the a9s Backup Manager is not the same as the one used by Cloud Foundry to identify the Service Instance.
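For example, using the values that appear in the examples later in this document, a Backup ID looks like:
b6f4c071-ef44-4af2-9608-531b4ce4823f-1548967947154
Here, b6f4c071-ef44-4af2-9608-531b4ce4823f is the <backup-manager-service-instance-id> and 1548967947154 is the <timestamp>.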
Using the a9s Backup Manager API
The a9s Backup Manager API supports querying for the backups of a given instance. You can use this method if the a9s Backup Manager for the failing site is still available.
- Retrieve the a9s Backup Manager password for the admin user:
credhub get -n /backup_manager_password
- Then, trigger the following request against your Backup Manager:
curl -u <backup-manager-username>:<backup-manager-password> \
http://<backup-manager-host>:3000/instances/<service-instance-guid>
This command lists the backup and restore history. Check the created_at and updated_at fields to decide which backup to use, and take note of the backup_id field content.
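For readability you can pretty-print the response with jq; a sketch, assuming the response is JSON and using the admin user retrieved above:
curl -u admin:<backup-manager-password> \
http://<backup-manager-host>:3000/instances/<service-instance-guid> | jq .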
Using a9s Backup Manager Ruby Console (IRB)
The a9s Backup Manager includes a script that runs the Interactive Ruby Shell already configured to access the current a9s Backup Manager database. To execute this script, access the a9s Backup Manager Virtual Machine (VM), become root and execute the following command:
/var/vcap/jobs/anynines-backup-manager/bin/rails_c
Inside the IRB shell, execute the following commands:
irb(main):001:0> Instance.where(instance_id: "<service-instance-guid>").first.guid
The <service-instance-guid> is the Cloud Foundry Service Instance GUID.
Using the a9s Backup Manager Database
Use this method if your a9s Backup Manager is not available.
First, you need to access the backupmanager database.
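How you reach this database depends on your setup. A minimal sketch, assuming the backupmanager PostgreSQL database is still reachable with psql and you know its credentials (host, user, and access method below are placeholders, not a documented procedure):
psql -h <backupmanager-database-host> -U <backupmanager-database-user> -d backupmanager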
SELECT guid FROM instances WHERE instance_id='<service-instance-guid>';
The <service-instance-guid> is the Cloud Foundry Service Instance GUID.
Download Files
The logical backup storage follows the structure below:
AWS S3:
- <bucket>: The bucket or container where the backups are stored.
- <backup-manager-service-instance-id>.json: Files holding metadata about the backup.
- <backup-manager-service-instance-id>-<index>: Encrypted and split backup file.
Azure:
- <bucket>: The bucket or container where the backups are stored.
- <backup-manager-service-instance-id>: Encrypted backup file.
Download Backup
The first step is to identify which available backup to use in your storage backend.
AWS S3:
aws s3 ls s3://<bucket>/<backup-manager-service-instance-id>
This command lists the metadata files; each file name looks like <backup-manager-service-instance-id>-<timestamp>.json.
Each .json file contains the metadata about a backup. Since we are restoring the latest available backup, get the name of the backup with the most recent timestamp.
The name of the backup (Backup ID) is the name of the file without the .json extension.
It is important to note that backups are split into multiple files of 1 GB each. This means that if your backup is larger than 1 GB, you will need to put these files together before restoring.
To list these files:
aws s3 ls s3://<bucket>/<backup-manager-service-instance-id>
Use the following command to download all files that belong to the backup:
aws s3 cp --exclude="*" --include="<backup-manager-service-instance-id>*" --recursive \
s3://<bucket> <tmp-backup-dir>
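For example, using the Backup ID from the examples in this document, a hypothetical bucket name, and a local working directory:
aws s3 cp --exclude="*" --include="b6f4c071-ef44-4af2-9608-531b4ce4823f-1548967947154*" --recursive \
s3://my-backup-bucket /tmp/postgresql-restore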
Azure:
az storage blob list --container-name <container> --prefix <backup-manager-service-instance-id>- | jq '.[].name'
This command lists all available backups, where each file name looks like <backup-manager-service-instance-id>-<timestamp>. Since we are restoring the latest available backup, get the name of the backup with the most recent timestamp.
The name of the backup (Backup ID) is the name of the file.
To download the file, execute:
az storage blob download --container-name <container> --file <backup-manager-service-instance-id> --name <backup-manager-service-instance-id>
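For example, with a hypothetical container name and the Backup ID used later in this document:
az storage blob download --container-name my-backup-container \
--file b6f4c071-ef44-4af2-9608-531b4ce4823f-1569935509279 \
--name b6f4c071-ef44-4af2-9608-531b4ce4823f-1569935509279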
Prepare Files
The files are split (if you are using AWS S3) and encrypted. Before restoring, you will need to prepare these files.
The first step after downloading is to decrypt them. Once the files are decrypted, you must join them, if necessary, before starting the recovery process.
Retrieve Backup Encryption Secret
All backup files are encrypted before they are stored in your backend storage, so in order to extract these files we need to decrypt them.
Unless you have changed the secret used to encrypt the backup using the dashboard (in this case, you need to remember your secret), you will have to access the Backup Manager in the old site to retrieve the secret used to encrypt the backup. You can retrieve the encryption secret with the following command:
bosh -d <deployment-name> ssh backup-manager
sudo su -
/var/vcap/jobs/anynines-backup-manager/bin/rails_c
irb(main):001:0> Backup.where(backup_id: "<backup-file-id>").first.credentials[:encryption][:key]
The accepted value for backup-file-id is shown in the table below:
Name | Description |
---|---|
<backup-manager-service-instance-id>-<timestamp> | The ID that the a9s Backup Manager internally assigns to the instance, with the timestamp appended at the end. This directly references a specific backup, which the command uses during execution. |
While the value of backup-file-id can be similar to backup-<storage>-filename, we cannot guarantee that it will be consistent from storage to storage. Thus, we recommend using the value described in the table above.
To learn more about this, you can also check here.
Unfortunately, if you are not able to retrieve the credentials, you will not be able to decrypt the files and continue with the restore process.
Decrypt All Files
AWS S3:
To decrypt the backup, execute the command below for all files that belong to the backup. All split files must be decrypted together, and in the correct ascending order.
The following commands make use of the compound command cat $(ls -1v ...), which assumes a Linux distribution operating system. Therefore, if you are using a different operating system in your environment, it is recommended that you adjust this compound command accordingly.
cat $(ls -1v <backup-file-id>-*) \
| openssl enc -aes256 -md md5 -d -pass 'pass:<secret>' \
| gunzip -c > <dst-dir>/<backup-manager-service-instance-id>.dump
For example:
cat $(ls -1v b6f4c071-ef44-4af2-9608-531b4ce4823f-1548967947154-*) \
| openssl enc -aes256 -md md5 -d -pass 'pass:NYHD8MVmA55HEqoaYHpQaxfwEMcQ1wkI' \
| gunzip -c > /var/vcap/store/postgresql-restore/b6f4c071-ef44-4af2-9608-531b4ce4823f-1569935509279.dump
Azure:
Execute the following command on the backup file:
cat <backup-file-id> \
| openssl enc -aes256 -md md5 -d -pass 'pass:<secret>' \
| gunzip -c > <dst-dir>/<file-name>
For example:
cat b6f4c071-ef44-4af2-9608-531b4ce4823f-1569935509279 \
| openssl enc -aes256 -md md5 -d -pass 'pass:NYHD8MVmA55HEqoaYHpQaxfwEMcQ1wkI' \
| gunzip -c > /var/vcap/store/postgresql-restore/b6f4c071-ef44-4af2-9608-531b4ce4823f-1569935509279.dump
The expected output of the decrypt command is:
<name-of-decrypted-file>
*** WARNING : deprecated key derivation used.
Using -iter or -pbkdf2 would be better.
After decrypting the files, you should confirm that the .dump file was generated by running `ls *.dump` in the directory where the files were decrypted.
Copy Files to the Recovery Instance
Before copying the files to your instance, make sure you have enough space to store the backup file and the restored database, and make sure you are restoring this dump file on the current primary node. The recovery process must be started on the primary node and is then cloned by the standby nodes.
First, create a directory under /var/vcap/store/:
bosh -d <deployment-name> ssh pg/<primary-node-index-or-id>
sudo su -
mkdir /var/vcap/store/postgresql-restore
chown root.vcap -R /var/vcap/store/postgresql-restore
chmod g+w /var/vcap/store/postgresql-restore
With the directory prepared, copy the backup file (dump file) to the VM.
In the example below the file is transferred using bosh scp:
bosh -d <deployment-name> scp <backup-manager-service-instance-id>-<timestamp>.dump \
pg/<primary-node-index>:/var/vcap/store/postgresql-restore
Prepare PostgreSQL
Restoring a .dump file will generate as many new WAL files as necessary to load the whole dump file.
This action can use a lot of disk space, since all the data in the backup file will be written to the database, generating new WAL files that can take some time to be archived.
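Before starting, you can check on the primary node that the persistent disk has enough free space compared to the size of the dump file; for example:
df -h /var/vcap/store
ls -lh /var/vcap/store/postgresql-restore/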
Prepare the Standby Nodes
This section applies only to cluster deployments.
Stop Standby Nodes
Discover which node is the primary, then stop the standby nodes with monit stop:
bosh -d <deployment-name> ssh pg/<index-or-id>
sudo su -
monit stop postgresql-ha
Stop only the postgresql-ha process. The repmgr process depends on postgresql-ha, so it will also be stopped by this command.
Prepare the Primary
Stop repmgr
This section applies only to cluster deployments.
The recovery of the dump will overwrite the content of the repmgr database. To avoid issues with repmgr, we need to stop it on the primary without stopping the PostgreSQL process:
bosh -d <deployment-name> ssh pg/<index-or-id>
sudo su -
monit stop repmgr
Drop replication slots
This section applies only to cluster deployments.
Drop the standby nodes' replication slots on the primary node with:
sudo su -
source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
chpst -u postgres:vcap psql postgres
a9s PostgreSQL 13 and below
sudo su - vcap
source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
psql postgres
SELECT pg_drop_replication_slot(pg_replication_slots.slot_name) FROM pg_replication_slots WHERE active <> 't';
Terminate Running Process
To make sure no application is connected, you might want to block new connections to port 5432 with iptables, and execute the following command to drop existing active connections:
SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pid <> pg_backend_pid();
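If you also want to temporarily block new client connections on port 5432 as mentioned above, here is a minimal iptables sketch, to be run as root in the shell (not inside psql); the rule is only an example and must be removed again after the restore:
# Reject new connections to PostgreSQL from anywhere except localhost
iptables -I INPUT -p tcp --dport 5432 ! -s 127.0.0.1 -j REJECT
# Remove the rule once the restore is finished
iptables -D INPUT -p tcp --dport 5432 ! -s 127.0.0.1 -j REJECT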
Cleanup Startup Lock Directory
A file may be left in the startup lock directory containing a PID that has been recycled by the operating system. In this case, when trying to restart the postgresql-ha process, it can fail because a startup process appears to already be running, when in fact another process is reusing the PID.
To avoid this issue, after completely stopping the postgresql-ha process on the required nodes, check whether any related process is still running with ps aux. If no related process is running, remove the content of the startup locks directory:
rm -f /tmp/postgresql-startup-locks/*
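For the process check mentioned above, a simple filter (the exact pattern is only an example) is:
ps aux | grep -E '[p]ostgres|[r]epmgr'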
Recover Backup
The backup is a dump file generated with pg_dumpall. So, in order to recover it, you can execute:
bosh -d <deployment-name> ssh pg/<primary-index-or-id>
sudo -i
source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
chpst -u postgres:vcap psql --quiet postgres < /var/vcap/store/postgresql-restore/<backup-manager-service-instance-id>.dump
a9s PostgreSQL 13 and below
bosh -d <deployment-name> ssh pg/<primary-index-or-id>
sudo su - vcap
source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
psql --quiet postgres < /var/vcap/store/postgresql-restore/<backup-manager-service-instance-id>.dump
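For a quick sanity check after the restore, you can list the databases that were recreated from the dump; a sketch, using the same session and helpers as the restore command above (adapt to the a9s PostgreSQL 13 and below variant if needed):
chpst -u postgres:vcap psql -c '\l' postgres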
Finish Cluster Setup for the New Deployment
Now that we have restored the dump file, we can start configuring the PostgreSQL cluster according to the new deployment.
Finish Primary Setup
In this section, we finish setting up the primary with the new data.
Clean up repmgr
This section applies only to cluster deployments.
After restoring the dump file, we need to clean up the repmgr database to make sure its previous state will not get in the way of the new repmgr cluster before starting the repmgr process again.
Access the repmgr database on the primary node.
bosh -d <deployment-name> ssh pg/<primary-index-or-id>
sudo su -
source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
chpst -u postgres:vcap psql repmgr_cluster
a9s PostgreSQL 13 and below
bosh -d <deployment-name> ssh pg/<primary-index-or-id>
sudo su - vcap
source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
psql repmgr_cluster
Then, remove all entries in all repmgr tables:
DO $$ DECLARE
r record;
BEGIN
FOR r IN (SELECT tablename FROM pg_tables WHERE schemaname = 'repmgr') LOOP
EXECUTE 'TRUNCATE repmgr.' || quote_ident(r.tablename) || ' CASCADE';
END LOOP;
END $$;
Update Database Name
Because the database name is based on the deployment name, you will have to delete the current database that matches the current deployment name and rename the database being recovered to the current deployment name.
bosh -d <deployment-name> ssh pg/<primary-index-or-id>
sudo -i
source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
chpst -u postgres:vcap psql postgres
a9s PostgreSQL 13 and below
su - vcap
source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
psql postgres
Then, fix the name of the default database:
DROP DATABASE IF EXISTS <new-deployment-name>;
ALTER DATABASE <old-deployment-name> RENAME TO <new-deployment-name>;
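For example, with hypothetical deployment names; note that database names containing hyphens must be double-quoted in PostgreSQL:
DROP DATABASE IF EXISTS "new-deployment-name";
ALTER DATABASE "old-deployment-name" RENAME TO "new-deployment-name";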
Configure Extensions, Users and Databases
During startup, the available extensions and the users and databases declared in the manifest are configured automatically. In this step, we handle this configuration manually.
bosh -d <deployment-name> ssh pg/<primary-node-index-or-id>
sudo -i
chpst -u postgres:vcap bash -c 'source /var/vcap/jobs/postgresql-ha/helpers/vars.sh; \
source /var/vcap/jobs/postgresql-ha/helpers/postgresql_utils.sh; \
create_extensions; \
create_or_update_users_and_databases;'
a9s PostgreSQL 13 and below
source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
source /var/vcap/jobs/postgresql-ha/helpers/postgresql_utils.sh
create_extensions
create_or_update_users_and_databases
Configure Statistics Collector and AppDynamics Support
In this section, we reconfigure the statistics collector and the AppDynamics support for the new deployment.
bosh -d <deployment-name> ssh pg/<primary-node-index-or-id>
sudo su -
chpst -u postgres:vcap bash -c 'source /var/vcap/jobs/postgresql-ha/helpers/vars.sh; \
source /var/vcap/jobs/postgresql-ha/helpers/postgresql_utils.sh; \
enable_statistics_collector; \
enable_app_dynamics_support;' ## Include enable_app_dynamics_support only if AppDynamics support is enabled
a9s PostgreSQL 13 and below
source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
source /var/vcap/jobs/postgresql-ha/helpers/postgresql_utils.sh
enable_statistics_collector
enable_app_dynamics_support ## Only if AppDynamics support is enabled
Register repmgr
This section applies only to cluster deployments.
At this point, the repmgr state is empty; we need to register the primary in order to be able to start the standby nodes. The following steps must be executed on the primary:
bosh -d <deployment-name> ssh pg/<primary-node-index-or-id>
sudo su -
chpst -u postgres:vcap bash -c 'source /var/vcap/jobs/postgresql-ha/helpers/vars.sh; \
source /var/vcap/jobs/postgresql-ha/helpers/repmgr_utils.sh; \
register_master_node;'
a9s PostgreSQL 13 and below
su - vcap
source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
source /var/vcap/jobs/postgresql-ha/helpers/repmgr_utils.sh
register_master_node
Start repmgr
This section applies only to cluster deployments.
At this point, we can start the repmgr process again in the primary.
bosh -d <deployment-name> ssh pg/<primary-node-index-or-id>
sudo su -
monit start repmgr
After starting repmgr, check that the repmgr process is running.
Note that monit summary can show a process as running when it is actually waiting, so make sure the process is really running; see the Checking Cluster Health section below.
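One way to verify the cluster state from the primary is repmgr's own overview command; a sketch, assuming you know the path to the repmgr configuration file used by your deployment (the path below is a placeholder):
source /var/vcap/jobs/postgresql-ha/helpers/vars.sh
chpst -u postgres:vcap repmgr -f <path-to-repmgr-conf> cluster show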
Start Standby Nodes
This section applies only to cluster deployments.
After configuring the primary node, clean up the data directory on the standby nodes:
rm -r /var/vcap/store/postgresqlXX/*
At this point, if you are running a cluster, it is possible to run monit start all on the standby nodes and let all processes start and the clone take place again.
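Putting the standby steps together, the sequence per standby node looks like this (the node index is a placeholder and XX matches your deployment's PostgreSQL directory):
bosh -d <deployment-name> ssh pg/<standby-index-or-id>
sudo su -
rm -r /var/vcap/store/postgresqlXX/*
monit start all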
After the processes are started, the cluster is ready to be used again.
Remember to clean up /var/vcap/store/postgresql-restore after the cluster is up and running.
Checking Cluster Health
You can learn more about checking the cluster status here.