a9s Data Service Framework Recovery
Introduction
Purpose of this document
This document describes the conceptual background and the detailed procedures for the disaster recovery process of the a9s Data Services.
Target Group
This document is aimed at Platform Operators and other groups that either install or operate one of the pertinent systems.
Recoverability Requirements
In the a9s Data Services, data service instances are redundant and clustered, with the exception of instances created from the single plans of a data service. This reduces the number of scenarios in which recovery has to be done by importing backups. For the remaining worst-case scenarios (human error, malicious intent, or acts of nature), backups of the stateful data have to be prepared. Depending on how severe the impact on the system is, it may be enough to redeploy the broken deployment. One or more of the following scenarios and solutions may apply to the situation.
Because the architecture chosen to reduce the risk of failure is large and complex, recovering from such an event takes a great deal of time until all affected components are redeployed. During recovery, the affected components might not be available, and dependent components are affected as well. For example, if a data service deployment has to be recovered, none of the dependent apps can access its data until the recovery process is completed.
Deployments require a blob store, such as Amazon S3 or the equivalent offered by the IaaS in use, to store objects such as release blobs. The BOSH directors store compiled artifacts in the blob store, which can be reused in a worst-case scenario. This makes system restoration somewhat faster because compilation steps are skipped.
Conceptual Scenarios and Solutions
This section describes several scenarios of system failures that could happen to the environment. Additionally, every scenario includes a series of steps to resolve the problem and get the environment running again.
Scenario: Full Disaster Recovery
This scenario describes a full platform recovery, from scratch, on the same IaaS platform.
Requirements
- Cloud Foundry is still available and knows the data service instances
Required Data
- a9s Data Service Framework deployment manifest
- User defined ops files
- a9s-pg backup, see download backup
- Service instance backups
- CF internal database (the table with the service bindings is sufficient)
Solution Steps
- Deploy the BOSH director (see Deploying BOSH) or use BBR to restore a previous installation, see Restoring with BBR
- Deploy Consul DNS
- Deploy a9s-pg and recover its data, see Scenario: a9s-pg Breaks
- Deploy a9s CF Service Guard
- Deploy Data Services
- Deploy Backup Manager
- Run the deployment_updater errand for each a9s Data Service deployment to redeploy missing service instances:
$ bosh run-errand deployment-updater -d <data-service>
- Service instances can be restored via the service dashboard or via the a9s API.
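When several data service deployments are affected, the errand step above can be scripted. The following is a minimal sketch, not an official a9s tool: the function name run_updater_errands and the DRY_RUN switch are assumptions made for illustration, and only the bosh run-errand command itself is taken from the step above.

```shell
# Sketch: run the deployment-updater errand for each given data service deployment.
# Set DRY_RUN=1 to print the commands instead of executing them.
run_updater_errands() {
  for deployment in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "bosh run-errand deployment-updater -d ${deployment}"
    else
      # Stop at the first failing errand so it can be investigated.
      bosh run-errand deployment-updater -d "${deployment}" || return 1
    fi
  done
}
```

A dry run such as DRY_RUN=1 run_updater_errands <data-service-1> <data-service-2> prints the commands first, so the list of deployments can be reviewed before anything is executed.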
Scenario: a9s-pg Breaks
This scenario describes the recovery of a9s-pg when the previous VMs and persistent disks were lost.
Required data
- a9s Data Service Framework deployment manifest
- User defined ops files
- a9s-PG database backup, see download backup
- Service instance backups
- CF internal database (the table with the service bindings is sufficient)
Solution Steps
- Stop services (e.g. CC/API, Service Broker, Deployer) accessing the database
- Deploy a9s-pg
- For a9s-pg data recovery, follow the instructions starting at "Find a9s-PG master node"
- Start services accessing the database again.
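If the services that access the database run as BOSH instance groups, the stop and start steps above could be sketched as follows. This is only an illustration under assumptions: the function name toggle_db_clients, the DRY_RUN switch, and the <deployment>/<instance-group> argument form are hypothetical, and the actual deployment and instance group names depend on your environment. The sketch relies only on the bosh stop and bosh start commands and the -n (non-interactive) flag.

```shell
# Sketch: stop or start BOSH instance groups that access a9s-pg.
# Each target has the (hypothetical) form <deployment>/<instance-group>.
# Set DRY_RUN=1 to print the commands instead of executing them.
toggle_db_clients() {
  action="$1"   # "stop" or "start"
  case "$action" in
    stop|start) shift ;;
    *) echo "usage: toggle_db_clients stop|start <deployment>/<group>..." >&2
       return 2 ;;
  esac
  for target in "$@"; do
    deployment="${target%%/*}"   # text before the first slash
    group="${target#*/}"         # text after the first slash
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "bosh -n -d ${deployment} ${action} ${group}"
    else
      bosh -n -d "${deployment}" "${action}" "${group}" || return 1
    fi
  done
}
```

Calling toggle_db_clients stop with the relevant targets before deploying a9s-pg, and toggle_db_clients start with the same targets after the data has been recovered, keeps both steps symmetric.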
Scenario: A Data Service Breaks
This scenario describes the case in which the installation of a data service has become inconsistent or has been completely lost. The recovery of a single data service instance must be performed on the respective a9s Data Service Dashboard and is not covered here. Since the stateful data is stored in a9s-pg, only the a9s Data Service needs to be redeployed.
As the recovery process is comparable to a redeployment, note that you must rebind your application(s) to your a9s Data Service Instance and restage them immediately afterwards.
Taking Cloud Foundry as an example, this means that you should execute the following commands:
$ cf unbind-service <my_app> <my_service_instance>
$ cf bind-service <my_app> <my_service_instance>
$ cf restage <my_app>
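When several applications are bound to the same service instance, the three commands above can be repeated per application. The following is a minimal sketch under assumptions: the function name rebind_apps and the DRY_RUN switch are hypothetical, and the cf commands are the ones shown above.

```shell
# Sketch: rebind and restage every given app for one service instance.
# Usage: rebind_apps <my_service_instance> <my_app>...
# Set DRY_RUN=1 to print the commands instead of executing them.
rebind_apps() {
  service_instance="$1"
  shift
  for app in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "cf unbind-service ${app} ${service_instance}"
      echo "cf bind-service ${app} ${service_instance}"
      echo "cf restage ${app}"
    else
      # Abort on the first failure so no app is left half-rebound unnoticed.
      cf unbind-service "${app}" "${service_instance}" &&
        cf bind-service "${app}" "${service_instance}" &&
        cf restage "${app}" || return 1
    fi
  done
}
```

Restaging restarts the application, so it may be preferable to process applications one at a time during a maintenance window.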
Required data
- a9s Data Service Framework deployment manifest
- User defined ops files
- Service instance backups
- CF internal database (the table with the service bindings is sufficient)
Solution Steps
- Create an Ops-file to set up a new errand named recovery_deployment_updater. This errand is responsible for redeploying a Data Service's missing service instances by updating the ones that have a specific state, defined via the strategy.update.instance_type parameter.
In the example below, the Ops-file has been set to update a9s PostgreSQL Service Instances that are already up-to-date and in state provisioned:
- type: replace
  path: /instance_groups/-
  value:
    name: recovery_deployment_updater
    vm_type: nano
    instances: 1
    azs: [z1, z2, z3]
    stemcell: ((iaas.postgresql_service.stemcells.service.alias))
    lifecycle: errand
    jobs:
    - name: deployment-updater
      release: deployment-updater
      properties:
        service_broker:
          api_endpoint: http://prometheus-service-broker.service.dc1.((iaas.consul.domain)):3000
          username: admin
          password: ((/prometheus_service_broker_password))
        strategy:
          update:
            force_update: true
            instance_type: provisioned
    - name: consul
      release: a9s-consul
      consumes:
        consul_nodes: nil
      properties:
        consul:
          domain: ((iaas.consul.domain))
          dc: dc1
          agent_address: 127.0.0.1:8500
          server: false
          encrypt: ((/cdns_encrypt))
          cluster:
            join_hosts: ((iaas.consul.consul_ips))
          ssl_ca: ((/cdns_ssl.ca))
          ssl_cert: ((/cdns_ssl.certificate))
          ssl_key: ((/cdns_ssl.private_key))
    networks:
    - name: ((iaas.postgresql_service.network))
- Redeploy the a9s Data Service, applying the Ops-file.
- Run the recovery_deployment_updater errand of the deployment to redeploy missing service instances:
$ bosh -d <deployment> run-errand recovery_deployment_updater
This errand should update all service instances in the provisioned state.
Note that you can set values other than provisioned for the strategy.update.instance_type parameter. The supported values of this property are listed in the a9s Deployment Updater documentation.
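The redeploy and errand steps of this scenario can be combined into one helper. The following is a sketch under assumptions rather than the documented procedure: the function name recover_data_service and the DRY_RUN switch are hypothetical, and the manifest and Ops-file paths are placeholders. It uses the -o flag of bosh deploy to apply the Ops-file and passes any further arguments (for example your other user defined ops files or variable files) through unchanged.

```shell
# Sketch: redeploy a data service with the recovery Ops-file, then run the errand.
# Usage: recover_data_service <deployment> <manifest> <ops-file> [extra deploy args...]
# Set DRY_RUN=1 to print the commands instead of executing them.
recover_data_service() {
  if [ "$#" -lt 3 ]; then
    echo "usage: recover_data_service <deployment> <manifest> <ops-file> [deploy args...]" >&2
    return 2
  fi
  deployment="$1"
  manifest="$2"
  ops_file="$3"
  shift 3
  if [ "${DRY_RUN:-0}" = "1" ]; then
    extra=""
    if [ "$#" -gt 0 ]; then
      extra=" $*"
    fi
    echo "bosh -n -d ${deployment} deploy ${manifest} -o ${ops_file}${extra}"
    echo "bosh -d ${deployment} run-errand recovery_deployment_updater"
    return 0
  fi
  # Run the errand only if the deployment succeeded.
  bosh -n -d "${deployment}" deploy "${manifest}" -o "${ops_file}" "$@" &&
    bosh -d "${deployment}" run-errand recovery_deployment_updater
}
```

Before deploying, running bosh interpolate <manifest> -o <ops-file> is one way to check that the Ops-file applies cleanly to the manifest.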