Version: Develop

a9s Data Service Framework Recovery

Introduction

Purpose of this document

This document describes the conceptual background and the detailed procedures used in the disaster recovery process of the a9s Data Services.

Target Group

This document is aimed at Platform Operators and other groups that either install or operate one of the pertinent systems.

Recoverability Requirements

In the a9s Data Services, data service instances are redundant and clustered together, with the exception of the single plans of a data service. This reduces both the risk of failure and the number of scenarios in which recovery has to be done by importing backups. For the remaining worst-case scenarios (human error, malicious intent, or acts of nature), backups of the stateful data have to be prepared. Depending on how severe the impact on the system is, it may be enough to just redeploy the broken deployment. One or more of the following scenarios and solutions may apply to the situation.

Because the architecture chosen to reduce the risk of failure is large and complex, recovering from such an event takes a great deal of time until all affected components are redeployed. During recovery, the affected components might not be available for use, and dependent components are affected as well. For example, if a data service deployment has to be recovered, none of the dependent apps can access its data until the recovery process is completed.

Deployments require a blob store, such as Amazon S3 or the equivalent offered by the IaaS in use, to store objects such as release blobs. The BOSH directors store compiled artifacts in the blob store, and these can be reused in a worst-case scenario. This makes system restoration somewhat faster because the compilation steps are skipped.

Conceptual Scenarios and Solutions

This section describes several scenarios of system failures that could happen to the environment. Additionally, every scenario includes a series of steps to resolve the problem and get the environment running again.

Scenario: Full Disaster Recovery

This scenario describes a full platform recovery, from scratch, on the same IaaS platform.

Requirements

Cloud Foundry is still available and knows the data service instances

Required Data

a9s Data Service Framework deployment manifest
User defined ops files
a9s-pg backup, see download backup
Service instance backups
CF internal database (the table with the service bindings is sufficient)

Solution Steps

Deploy the BOSH director (see Deploying BOSH) or use BBR to restore a previous installation (see Restoring with BBR)
Deploy Consul DNS
Deploy a9s-pg and recover its data, see Scenario: a9s-pg Breaks
Deploy a9s CF Service Guard
Deploy Data Services
Deploy Backup Manager
Run the deployment_updater errand for each a9s Data Service deployment to redeploy missing service instances
Redeploy missing service instances

$ bosh run-errand deployment-updater -d <data-service>

Service instances can be restored via the service dashboard or via the a9s API.
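
When several a9s Data Service deployments are affected, the errand step above can be scripted. The following is a minimal sketch, not an official tool: the deployment names are hypothetical placeholders, and the errand name is taken from the command shown above (check the actual name with `bosh -d <deployment> errands`). By default the function only prints the commands; set `RUN=""` to execute them.

```shell
#!/usr/bin/env bash
# Sketch: run the updater errand for each data service deployment.
# ${RUN-echo} expands to "echo" when RUN is unset (dry run) and to nothing
# when RUN is set to an empty string, in which case the command is executed.

run_updater_errands() {
  for deployment in "$@"; do
    ${RUN-echo} bosh -d "$deployment" run-errand deployment-updater
  done
}

# Hypothetical deployment names; replace them with your own.
run_updater_errands postgresql-service mongodb-service
```

Running the errands one deployment at a time keeps the load on the BOSH director predictable and makes it easier to see which deployment failed.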

Scenario: a9s-pg Breaks

This scenario describes the recovery of a9s-pg when the previous VMs and persistent disks were lost.

Required data

a9s Data Service Framework deployment manifest
User defined ops files
a9s-pg database backup, see download backup
Service instance backups
CF internal database (the table with the service bindings is sufficient)

Solution Steps

Stop the services that access the database (e.g. CC/API, Service Broker, Deployer)
Deploy a9s-pg
To recover the a9s-pg data, follow the instructions starting at "find a9s-PG master node"
Start the services that access the database again
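
The stop and start steps above could be scripted along the following lines. This is only a sketch under assumptions: it assumes the services are BOSH-managed instance groups that can be halted with `bosh stop`, and the deployment and instance group names in `CLIENTS` are hypothetical. By default the function only prints the commands; set `RUN=""` to execute them.

```shell
#!/usr/bin/env bash
# Sketch: stop or start every instance group that accesses a9s-pg.
# CLIENTS holds hypothetical "deployment/instance-group" pairs.
CLIENTS="postgresql-service/broker postgresql-service/deployer"

for_each_client() {
  action="$1"   # "stop" or "start"
  for client in $CLIENTS; do
    deployment="${client%%/*}"   # part before the first slash
    group="${client#*/}"         # part after the first slash
    # ${RUN-echo} prints the command unless RUN is set to an empty string.
    ${RUN-echo} bosh -n -d "$deployment" "$action" "$group"
  done
}

for_each_client stop
# ... deploy a9s-pg and restore its data here ...
for_each_client start
```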

Scenario: A Data Service Breaks

This scenario describes the case in which the installation of a data service has become inconsistent or has been completely lost. The recovery of a single data service instance must be performed on the respective a9s Data Service Dashboard and is not covered here. Since the stateful data is stored in a9s-pg, only the a9s Data Service needs to be redeployed.

Rebinding Application after Recovery

As the recovery process is comparable to a redeployment, you must rebind your application(s) to your a9s Data Service Instance and restage them immediately afterwards.

Taking Cloud Foundry as an example, this means that you should execute the following commands:

$ cf unbind-service <my_app> <my_service_instance>
$ cf bind-service <my_app> <my_service_instance>
$ cf restage <my_app>
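
If several applications are bound to the same service instance, the three commands above can be wrapped in a small helper. The following is a minimal sketch; the function name, the service instance name, and the app names are hypothetical. By default it only prints the commands; set `RUN=""` to execute them.

```shell
#!/usr/bin/env bash
# Sketch: rebind and restage every app bound to one service instance.
# Usage: rebind_and_restage <service_instance> <app> [<app> ...]

rebind_and_restage() {
  service_instance="$1"
  shift
  for app in "$@"; do
    # ${RUN-echo} prints the command unless RUN is set to an empty string.
    ${RUN-echo} cf unbind-service "$app" "$service_instance"
    ${RUN-echo} cf bind-service "$app" "$service_instance"
    ${RUN-echo} cf restage "$app"
  done
}

# Hypothetical names; replace them with your own.
rebind_and_restage my-postgres app-one app-two
```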

Required data

a9s Data Service Framework deployment manifest
User defined ops files
Service instance backups
CF internal database (the table with the service bindings is sufficient)

Solution Steps

Create an Ops-file to set up a new errand named recovery_deployment_updater. This errand is responsible for redeploying a Data Service's missing service instances by updating the ones that have a specific state, defined via the strategy.update.instance_type parameter.

In the example below, the Ops-file has been set to update a9s PostgreSQL Service Instances that are already up to date and in the provisioned state:

- type: replace
  path: /instance_groups/-
  value:
    name: recovery_deployment_updater
    vm_type: nano
    instances: 1
    azs: [z1, z2, z3]
    stemcell: ((iaas.postgresql_service.stemcells.service.alias))
    lifecycle: errand
    jobs:
      - name: deployment-updater
        release: deployment-updater
        properties:
          service_broker:
            api_endpoint: http://prometheus-service-broker.service.dc1.((iaas.consul.domain)):3000
            username: admin
            password: ((/prometheus_service_broker_password))
          strategy:
            update:
              force_update: true
              instance_type: provisioned
      - name: consul
        release: a9s-consul
        consumes:
          consul_nodes: nil
        properties:
          consul:
            domain: ((iaas.consul.domain))
            dc: dc1
            agent_address: 127.0.0.1:8500
            server: false
            encrypt: ((/cdns_encrypt))
            cluster:
              join_hosts: ((iaas.consul.consul_ips))
            ssl_ca: ((/cdns_ssl.ca))
            ssl_cert: ((/cdns_ssl.certificate))
            ssl_key: ((/cdns_ssl.private_key))
    networks:
      - name: ((iaas.postgresql_service.network))

Redeploy the a9s Data Service, applying the Ops-file.
Run the recovery_deployment_updater errand of the deployment to redeploy missing service instances:

bosh -d <deployment> run-errand recovery_deployment_updater

This errand should update all service instances with the provisioned state.
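
The two steps above can be combined into one helper. The following is a minimal sketch; the function name, deployment name, and file paths are hypothetical, and a real deployment usually needs further ops files and variable files in addition to the one shown. By default the function only prints the commands; set `RUN=""` to execute them.

```shell
#!/usr/bin/env bash
# Sketch: redeploy a data service with the recovery ops file applied,
# then run the recovery errand.
# Usage: recover_data_service <deployment> <manifest> <ops_file>

recover_data_service() {
  deployment="$1"
  manifest="$2"
  ops_file="$3"
  # ${RUN-echo} prints the command unless RUN is set to an empty string.
  ${RUN-echo} bosh -n -d "$deployment" deploy "$manifest" -o "$ops_file"
  ${RUN-echo} bosh -d "$deployment" run-errand recovery_deployment_updater
}

# Hypothetical names and paths; replace them with your own.
recover_data_service postgresql-service postgresql-service.yml recovery-updater-ops.yml
```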

Note: You can set values other than provisioned for the strategy.update.instance_type parameter. The supported values of this property are listed in the a9s Deployment Updater documentation.
