Version: 31.1.1

a9s Data Service Framework Recovery

Introduction

Purpose of this document

This document describes the conceptual background and the detailed procedures of the disaster recovery process for the a9s Data Services.

Target Group

This document is aimed at Platform Operators and other groups that either install or operate one of the pertinent systems.

Recoverability Requirements

In the a9s Data Services, data service instances are redundant and clustered together, with the exception of the single plans of a data service. This reduces the risk of, and the number of scenarios in which, recovery has to be done by importing backups. For the remaining worst-case scenarios (human error, malicious intent, or acts of nature), backups of the stateful data have to be prepared. Depending on how severe the impact on the system is, it may be enough to redeploy the broken deployment. One or more of the following scenarios and solutions may apply to the situation.

Because the architecture chosen to reduce the risk of failure is large and complex, recovery from such an event takes a great deal of time until all affected components are redeployed. During recovery, the affected components might not be available for use, and dependent components are affected as well. For example, if a data service deployment has to be recovered, all dependent apps cannot access its data until the recovery process is completed.

Deployments require a blob store, such as Amazon S3 or the equivalent offered by the IaaS in use, to store objects such as release blobs. The BOSH directors store compiled artifacts in the blob store. These can be reused in a worst-case scenario, which makes system restoration somewhat faster because the compilation steps are skipped.

Conceptual Scenarios and Solutions

This section describes several scenarios of system failures that could happen to the environment. Additionally, every scenario includes a series of steps to resolve the problem and get the environment running again.

Scenario: Full Disaster Recovery

This scenario describes a full platform recovery, from scratch, on the same IaaS platform.

Requirements

Cloud Foundry is still available and knows the data service instances.

Required Data

  • a9s Data Service Framework deployment manifest
  • User defined ops files
  • a9s-pg backup, see download backup
  • Service instance backups
  • CF internal database (the table with the service bindings is sufficient)

Solution Steps

  1. Deploy the BOSH director (see Deploying BOSH) or use BBR to restore a previous installation (see Restoring with BBR)
  2. Deploy Consul DNS
  3. Deploy a9s-pg and recover its data (see Scenario: a9s-pg Breaks)
  4. Deploy Service Guard
  5. Deploy Data Services
  6. Deploy Backup Manager
  7. Run the deployment_updater errand for each a9s Data Service deployment to redeploy missing service instances
  8. Redeploy missing service instances:
     $ bosh run-errand deployment-updater -d <data-service>
  9. Service instances can be restored via the service dashboard or via the a9s API.
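
Because the errand has to be run once per a9s Data Service deployment, the command above can be wrapped in a small loop. The following is a minimal sketch, not part of the a9s tooling: the function name `run_deployment_updater`, the `BOSH_CMD` variable, and the deployment names in the usage comment are assumptions made for illustration. The errand name is copied from the command shown above.

```shell
# Sketch only: function name, BOSH_CMD, and deployment names are hypothetical.
set -eu

# Assumption: the bosh CLI is on the PATH and already targets the director.
# Overriding BOSH_CMD (for example with "echo") prints the commands instead.
BOSH_CMD="${BOSH_CMD:-bosh}"

# Run the deployment-updater errand once for every deployment name given.
run_deployment_updater() {
  for deployment in "$@"; do
    $BOSH_CMD run-errand deployment-updater -d "$deployment"
  done
}

# Usage (replace the placeholder names with your data service deployments):
#   run_deployment_updater <data-service-1> <data-service-2>
```

The loop stops at the first failing errand because of `set -e`, so a broken deployment is noticed before the remaining ones are touched.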

Scenario: a9s-pg Breaks

This scenario describes the recovery of a9s-pg when the previous VMs and persistent disks were lost.

Required data

  • a9s Data Service Framework deployment manifest
  • User defined ops files
  • a9s-PG database backup, see download backup
  • Service instance backups
  • CF internal database (the table with the service bindings is sufficient)

Solution Steps

  1. Stop services (e.g. CC/API, Service Broker, Deployer) accessing the database
  2. Deploy a9s-pg
  3. For a9s-pg data recovery, follow the instructions starting at find a9s-PG master node
  4. Start services accessing the database again.
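
The order of the steps above can be sketched with the BOSH CLI, assuming the services that access the database are BOSH-managed instance groups. Everything named here is an assumption for illustration: the helper functions, the `BOSH_CMD` variable, and all deployment, instance group, and file names in the usage comment. The data recovery itself (step 3) is not scripted and follows the linked instructions.

```shell
# Sketch only: deployment names, instance groups, and file names are placeholders.
set -eu

# Assumption: the bosh CLI is on the PATH and already targets the director.
BOSH_CMD="${BOSH_CMD:-bosh}"

# Stop or start BOSH-managed services that access a9s-pg (steps 1 and 4).
# Usage: toggle_consumers <stop|start> <deployment>:<instance-group> ...
toggle_consumers() {
  action="$1"
  shift
  for target in "$@"; do
    # "<deployment>:<instance-group>" is split at the first colon
    $BOSH_CMD -d "${target%%:*}" "$action" "${target#*:}"
  done
}

# Deploy a9s-pg from its manifest and one ops file (step 2).
# Usage: deploy_a9s_pg <deployment> <manifest> <ops-file>
deploy_a9s_pg() {
  $BOSH_CMD -d "$1" deploy "$2" -o "$3"
}

# Usage, in the order of the solution steps:
#   toggle_consumers stop <cf-deployment>:<api-group> <broker-deployment>:<broker-group>
#   deploy_a9s_pg <a9s-pg-deployment> <manifest.yml> <ops-file.yml>
#   ... restore the a9s-pg backup as described in the linked instructions ...
#   toggle_consumers start <cf-deployment>:<api-group> <broker-deployment>:<broker-group>
```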

Scenario: A Data Service Breaks

This scenario describes the case in which the installation of a data service has become inconsistent or has been completely lost. The recovery of a single data service instance must be performed on the respective data service dashboard and is not covered here. Since the stateful data is stored in a9s-pg, only the a9s Data Service needs to be redeployed.

Required data

  • a9s Data Service Framework deployment manifest
  • User defined ops files
  • Service instance backups
  • CF internal database (the table with the service bindings is sufficient)

Solution Steps

  1. Redeploy the Data Service
  2. Run the deployment_updater errand of the deployment to redeploy missing service instances:
     $ bosh run-errand deployment-updater -d <data-service>
  3. Service instances can be restored via the service dashboard or via the a9s API.
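
The first two steps can be combined into one helper that redeploys the data service from its manifest and user-defined ops files and then runs the errand. This is a minimal sketch under assumptions: the function name `redeploy_data_service`, the `BOSH_CMD` variable, and the file names in the usage comment are hypothetical, and the errand name is copied from the command shown above.

```shell
# Sketch only: function name, BOSH_CMD, and file names are hypothetical.
set -eu

# Assumption: the bosh CLI is on the PATH and already targets the director.
BOSH_CMD="${BOSH_CMD:-bosh}"

# Usage: redeploy_data_service <deployment> <manifest> [ops-file ...]
redeploy_data_service() {
  deployment="$1"
  manifest="$2"
  shift 2

  # Turn every ops file argument into an "-o <file>" pair:
  # append the pairs to the argument list, then drop the originals.
  count=$#
  for ops_file in "$@"; do
    set -- "$@" -o "$ops_file"
  done
  shift "$count"

  # Step 1: redeploy the data service.
  $BOSH_CMD -d "$deployment" deploy "$manifest" "$@"
  # Step 2: redeploy missing service instances.
  $BOSH_CMD run-errand deployment-updater -d "$deployment"
}

# Usage (placeholders):
#   redeploy_data_service <data-service> <manifest.yml> <ops-1.yml> <ops-2.yml>
```

With `set -e`, the errand is only started if the deployment step succeeded.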