Version: 47.0.0

a9s Data Services Administrative Tasks

This document explains the most common tasks an a9s Data Service Operator should know.

Map a service instance guid to a BOSH deployment name

In order to find out which BOSH deployment belongs to which service instance, you can use the following command (jq must be installed):

curl --user admin:[deployer-api-password] [deployer-api-endpoint]/deployments.json | jq '.[] | select(.deployment_attributes.instance_guid == "[service-instance-guid]") | .name'

Update All Service Instances

There are three scenarios where the Platform Operator might want to update the existing Service Instance deployments:

  1. A new version of this anynines deployments repo is available and contains new BOSH releases or new configurations
  2. The Platform Operator uploaded a new stemcell to the BOSH director
  3. The Platform Operator changed the cloud-config in their setup

In the first two scenarios, the Platform Operator must first run the templates-uploader errand so that the templates in the a9s Deployer are updated, and then run the deployment_updater errand to update all outdated deployments. In the third scenario, running the deployment_updater errand alone is enough to update all outdated deployments.

You can simply run the deployment_updater errand by executing the following command:

bosh -d <deployment_name> run-errand deployment_updater

For more information about how to configure the a9s Deployment Updater see a9s Deployment Updater - Properties.

Caveats

  • When changing the template_name_v2 property for a service plan, the deployment_updater cannot recognize this change and needs to run for all instances.

Update a Specific Service Instance

Instead of updating all service instances of a service, it is also possible to update only one service instance. To do so, you must first find out the guid of the service instance. When using Cloud Foundry you can simply run:

$ cf service service-instance-name-in-cf --guid
34e68cdf-62cc-4ea6-a3e8-714026dba1f8

In this example, the service instance guid is 34e68cdf-62cc-4ea6-a3e8-714026dba1f8. Next you have to find out the endpoint of the Service Broker and the Service Broker admin password. Once you have this information you can trigger an update of the service instance by executing:

$ curl --user admin:[service-broker-password] -X PATCH [service-broker-hostname]:3000/v2/service_instances/[service-instance-guid] -d '{"plan_id":"6b1973db-e057-4a71-9832-a4b3f27a0d8f", "service_id": "7ee52a02-8839-43c2-a550-728ad736bbda"}'
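
The request above can be wrapped in a small helper that prints the command for review before it is sent. This is a minimal sketch and not part of the a9s tooling: the function name update_instance_cmd and the BROKER_HOST and BROKER_PASSWORD variables are hypothetical, and port 3000 is taken from the example above.

```shell
# Hypothetical helper: prints the curl command that updates one service
# instance instead of running it, so it can be reviewed first.
# $1 = service instance guid, $2 = service_id, $3 = plan_id
# BROKER_HOST and BROKER_PASSWORD are assumed variable names; they are
# printed as unexpanded references so the password is not echoed.
update_instance_cmd() {
  guid="$1"; service_id="$2"; plan_id="$3"
  payload=$(printf '{"plan_id":"%s", "service_id": "%s"}' "$plan_id" "$service_id")
  printf "curl --user \"admin:\${BROKER_PASSWORD}\" -X PATCH \"\${BROKER_HOST}:3000/v2/service_instances/%s\" -d '%s'\n" \
    "$guid" "$payload"
}
```

After exporting BROKER_HOST and BROKER_PASSWORD, the printed line can be pasted into the shell once it has been checked against the service catalog of the broker.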

The service_id and plan_id parameters can be fetched from the service catalog of the broker.

Interact with the Backup Manager

As an a9s Data Service operator, you can interact with the Backup Manager API in order to trigger backups and restores.

To trigger a backup of all service instances, execute:

curl [user]:[password]@[backup manager endpoint]/backup_agent/backup_all -d {}

To trigger a backup for a specific service instance, execute:

curl [user]:[password]@[backup manager endpoint]/backup_agent/backup -H "Content-Type: application/json" -d '{"instance_guid": "[service-instance-guid]"}'

List all backups:

curl [user]:[password]@[backup manager endpoint]/instances

Trigger a restore:

In addition to the service instance guid, you need the id of the backup you want to restore. To find it, call the /instances/[service-instance-guid] endpoint first and read the field id of the specific backup (not the field backup_id!). The backup id is an integer.

Once you have this information, you can trigger the restore of the backup by running:

curl [user]:[password]@[backup manager endpoint]/backup_agent/restores -H "Content-Type: application/json" -d '{"instance_id": "[service-instance-guid]", "backup_id": [id]}'
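
Because the restore request expects the integer id and not the backup_id string, it can help to validate the value before sending it. The following sketch builds the request body and rejects anything that is not a plain integer. The function name restore_payload is hypothetical; the JSON layout is copied from the command above.

```shell
# Hypothetical helper: builds the JSON body for the restore request.
# Rejects a backup id that is not a plain integer, which catches the
# mistake of passing the backup_id string instead of the id field.
# $1 = service instance guid, $2 = backup id (integer)
restore_payload() {
  guid="$1"; id="$2"
  case "$id" in
    ''|*[!0-9]*)
      echo "backup id must be an integer, got: '$id'" >&2
      return 1
      ;;
  esac
  printf '{"instance_id": "%s", "backup_id": %s}\n' "$guid" "$id"
}
```

The output can be passed to the restore call with -d "$(restore_payload "[service-instance-guid]" [id])".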

Decrypt a backup whose encryption key is unknown

To decrypt an existing backup whose encryption key is unknown, access to the a9s Backup Manager is required. If you have access to the a9s Backup Manager, follow these steps to get the encryption key and decrypt the backup.

  1. Download the appropriate backup directly from the Backup Store. The backup name should be in the format <deployment-name>-<unix-timestamp>. To find the backup later in the database, you need the Created at date of the corresponding backup. You can either look it up in the a9s Service Dashboard or convert the unix-timestamp of the backup file to a date.
  2. Connect to the a9s Backup Manager:
bosh -d backup-service ssh backup-manager
  3. Become root:
sudo -i
  4. Open the Rails console of the a9s Backup Manager:
/var/vcap/jobs/anynines-backup-manager/bin/rails_c
  5. Find the correct encryption key for the backup. The name of the backup file typically contains the backup_id, which can be used to find the related encryption key:
Backup.where(backup_id: "89fe4350-784e-4c03-a779-e88067d53cd8-1694685740323").first.credentials[:filter_plugins][0][:password]

Please note that credentials[:filter_plugins] is an Array. The encryption plugin should be the first element (index 0), but it may be in another position, depending on your Backup Manager configuration.

If you are missing the correct backup_id, you can try to find the backup by filtering for the creation date. The date must be in the format Year-Month-Day Hour:Minute:Second. The time must be UTC. As an example we use the date 2018-11-26 13:45:53:

Backup.where("created_at >= ?", "2018-11-26 13:45:53")
  6. Decrypt the backup. As an example, we use the backup file ~/Downloads/d70a4d9-1543239953810 and the password 12345678 obtained in the previous step:
cat ~/Downloads/d70a4d9-1543239953810 | openssl enc -aes256 -md md5 -d -pass 'pass:12345678' | gunzip -c > ~/Downloads/d70a4d9-1543239953810.decrypted
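
For the first step, the timestamp in the backup file name can be converted to the UTC date format used in the created_at filter. This is a minimal sketch under two assumptions: the timestamp is in milliseconds (the examples in this section have 13 digits), and GNU date is available. The function name backup_created_at is hypothetical.

```shell
# Hypothetical helper: prints the UTC creation time encoded in a backup
# file name of the form <deployment-name>-<unix-timestamp>.
# Assumes a millisecond timestamp and GNU date (for the -d @SECONDS form).
backup_created_at() {
  ts_ms="${1##*-}"                      # text after the last hyphen
  date -u -d "@$((ts_ms / 1000))" '+%Y-%m-%d %H:%M:%S'
}
```

For the example file, backup_created_at ~/Downloads/d70a4d9-1543239953810 prints 2018-11-26 13:45:53, which matches the date used in the query above.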

Get the error backtrace from a backup or restore

If a backup or restore fails, the backtrace of the error is saved in the database. Follow these steps to read the error backtrace.

  1. Connect to the a9s Backup Manager:
bosh -d backup-service ssh backup-manager
  2. Become root:
sudo -i
  3. Open the Rails console of the a9s Backup Manager:
/var/vcap/jobs/anynines-backup-manager/bin/rails_c
  4. Get the Instance where the error happened:
instance = Instance.where(instance_id: "instance_guid").first
  5. Get the Backup that failed. There are several options:
  • If it's the last backup that failed:

    backup = instance.backups.last
  • If you know the backup_id, e.g. d25ed99-1543410104023:

    backup = instance.backups.where(backup_id: "d25ed99-1543410104023").first
  • If you don't know the backup_id, you can find it by filtering for the creation date of the Backup/Restore. The date must be in the format Year-Month-Day Hour:Minute:Second. The time must be UTC. As an example we use the date 2018-11-26 13:45:53:

    backup = instance.backups.where("created_at >= ?", "2018-11-26 13:45:53").first
  6. Finally, load the message and decode it:

    Base64.decode64(backup.backup_agent_task.msg)

Backups of a9s-pg

The backup of the a9s-pg can now be handled with the a9s Backup Manager. See a9s_pg_backup for details.

Rotate database encryption salts

To rotate the database encryption salts of the a9s Service Broker, the a9s Deployer or the Backup Manager, execute the following steps. The example uses the elasticsearch-service:

  1. Duplicate the current encryption salt
OLD_SALT=`credhub get -n "/<BOSH director name>/elasticsearch-service/elasticsearch_service_broker_db_salt"`
credhub set -n "/<BOSH director name>/elasticsearch-service/elasticsearch_service_broker_db_salt_old" -t password -w "${OLD_SALT}"

OLD_SALT=`credhub get -n "/<BOSH director name>/elasticsearch-service/elasticsearch_service_deployer_db_salt32"`
credhub set -n "/<BOSH director name>/elasticsearch-service/elasticsearch_service_deployer_db_salt32_old" -t password -w "${OLD_SALT}"
  2. Regenerate the encryption salt
credhub generate -n "/<BOSH director name>/elasticsearch-service/elasticsearch_service_broker_db_salt" -t password -l 32
credhub generate -n "/<BOSH director name>/elasticsearch-service/elasticsearch_service_deployer_db_salt32" -t password -l 32
  3. Redeploy the service
bosh -d elasticsearch-service deploy elasticsearch-service/elasticsearch-service.yml
  4. Execute the errands
bosh -d elasticsearch-service run-errand migrate-deployer-api-encrypted-database-fields
bosh -d elasticsearch-service run-errand migrate-service-broker-encrypted-database-fields
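
The CredHub commands of the first two steps follow the same pattern for every salt, so they can be generated rather than typed. The sketch below only prints the commands and runs nothing. The function names are hypothetical, and the salt names are the ones from the elasticsearch-service example; other services are assumed to follow the same naming pattern, which should be checked against the actual CredHub entries.

```shell
# Hypothetical helper: prints the CredHub commands that duplicate and
# regenerate one salt. Nothing is executed; review the output first.
# $1 = BOSH director name, $2 = deployment name, $3 = salt variable name
print_salt_rotation() {
  path="/$1/$2/$3"
  printf 'OLD_SALT=`credhub get -n "%s"`\n' "$path"
  printf 'credhub set -n "%s_old" -t password -w "${OLD_SALT}"\n' "$path"
  printf 'credhub generate -n "%s" -t password -l 32\n' "$path"
}

# Prints the commands for both salts of the example service.
# $1 = BOSH director name
print_all_salt_rotations() {
  for salt in elasticsearch_service_broker_db_salt elasticsearch_service_deployer_db_salt32; do
    print_salt_rotation "$1" elasticsearch-service "$salt"
  done
}
```

Unlike the steps above, this prints the duplicate and regenerate commands per salt instead of grouping them by step. The result is the same as long as each salt is copied before it is regenerated.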

Delete obsolete backup metadata files

To delete obsolete metadata files from already deleted backups, you can use the delete_metadata_files script on the Backup Manager VM. To do so, execute the following steps:

  1. Connect to the a9s Backup Manager:
bosh -d backup-service ssh backup-manager
  2. Become root:
sudo -i
  3. Execute the script:
/var/vcap/jobs/anynines-backup-manager/bin/delete_metadata_files

Rotate Consul certificates

Prerequisites

Find out BOSH director name
BOSH_NAME=`bosh env --json | jq '.Tables[0].Rows[0].name' -r`
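
All CredHub paths in the following steps depend on BOSH_NAME, and jq -r prints the literal string null when the field is missing. A small guard can catch an empty or null value before it ends up in a path. This is a sketch only; the function name cdns_ca_path is hypothetical.

```shell
# Hypothetical guard: prints the CredHub path of the Consul CA for the
# given director name, or fails if the name is empty or the literal
# string "null" (which jq -r prints when the field is missing).
cdns_ca_path() {
  case "${1:-}" in
    ''|null)
      echo "BOSH director name is not set; check the output of 'bosh env --json'" >&2
      return 1
      ;;
  esac
  printf '/%s/consul-dns/cdns_ca\n' "$1"
}
```

For example, credhub get -n "$(cdns_ca_path "$BOSH_NAME")" only reaches CredHub with a complete path, because the substitution is empty when the guard fails.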
Ensure the current CredHub CA entry is complete

You have to ensure the current CA value for the CredHub entry is not empty:

credhub get -n "/${BOSH_NAME}/consul-dns/cdns_ca" --output-json | jq .value.ca

If the previous command returns null you have to execute the following commands to copy the value of the current certificate into the value for the current CA:

credhub get -k private_key -n "/${BOSH_NAME}/consul-dns/cdns_ca" > /tmp/cdns_ca.private.pem
credhub get -k certificate -n "/${BOSH_NAME}/consul-dns/cdns_ca" > /tmp/cdns_ca.cert.pem

credhub set -n "/${BOSH_NAME}/consul-dns/cdns_ca" -t certificate -c /tmp/cdns_ca.cert.pem -p /tmp/cdns_ca.private.pem -r /tmp/cdns_ca.cert.pem

Rotate an expiring Consul CA and certificate

To rotate an expiring Consul CA and certificate you have to follow these steps:

Duplicate current CA
credhub get -k private_key -n "/${BOSH_NAME}/consul-dns/cdns_ca" > /tmp/cdns_ca.private.pem
credhub get -k certificate -n "/${BOSH_NAME}/consul-dns/cdns_ca" > /tmp/cdns_ca.cert.pem
credhub get -k ca -n "/${BOSH_NAME}/consul-dns/cdns_ca" > /tmp/cdns_ca.ca.pem

credhub set -n "/${BOSH_NAME}/consul-dns/cdns_ca_old" -t certificate -c /tmp/cdns_ca.cert.pem -p /tmp/cdns_ca.private.pem -r /tmp/cdns_ca.ca.pem
Regenerate current CA

The duration parameter is given in days. To avoid having to rotate the CA every year, increase its value.

credhub generate --duration=365 -n "/${BOSH_NAME}/consul-dns/cdns_ca" -c a9sConsulCA --is-ca -t certificate
Redeploy Environment (with old CA, new CA and old certificate)
consul-dns

Apply the following Ops file to the consul-dns deployment and redeploy the consul-dns deployment.

To avoid having to rotate the SSL certificate every year, increase the duration parameter in the following Ops file.

IMPORTANT: Replace <bosh-director-name> with the director name from the prerequisite step Find out BOSH director name in all following Ops files.

- type: replace
  path: /instance_groups/name=consul/jobs/name=consul/properties/consul/ssl_ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"
- type: replace
  path: /variables/name=~1cdns_ssl/options/duration?
  value: 365
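
Replacing the <bosh-director-name> placeholder by hand in every Ops file is error-prone, so the substitution can be scripted. The following is a minimal sketch using sed. The function name fill_director_name and the file names in the usage note are hypothetical.

```shell
# Hypothetical helper: writes a copy of an Ops file to stdout with the
# <bosh-director-name> placeholder replaced by the given director name.
# Assumes the name contains no "|", "&" or "\" characters, which have a
# special meaning in the sed replacement.
# $1 = BOSH director name, $2 = path to the Ops file
fill_director_name() {
  sed "s|<bosh-director-name>|$1|g" "$2"
}
```

A possible use is fill_director_name "$BOSH_NAME" consul-ca-ops.yml > /tmp/consul-ca-ops.yml, followed by a deploy that passes the generated file with the -o option of bosh deploy.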
data-services

Apply the following Ops file to the x-service deployments and redeploy them. After the redeployment, run the templates-uploader errand and then the force_deployment_updater errand.

- type: replace
  path: /instance_groups/name=spi/jobs/name=consul/properties/consul/ssl_ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"
- type: replace
  path: /instance_groups/name=broker/jobs/name=consul/properties/consul/ssl_ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"
- type: replace
  path: /instance_groups/name=deployer-api/jobs/name=consul/properties/consul/ssl_ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"

- type: replace
  path: /instance_groups/name=templates-uploader/jobs/name=template-uploader/properties/template-uploader/template-vars/~1cdns_ssl.ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"

# delete for a9s Prometheus and a9s LogMe
- type: replace
  path: /instance_groups/name=service-dashboard/jobs/name=consul/properties/consul/ssl_ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"

# force update instances
- type: replace
  path: /instance_groups/name=force_deployment_updater/jobs/name=deployment-updater/properties/strategy?
  value:
    update:
      instance_type: provisioned

The a9s Prometheus and a9s LogMe deployments do not contain a service dashboard with a running Consul job. For these deployments, remove the Ops entry with the following path: /instance_groups/name=service-dashboard/jobs/name=consul/properties/consul/ssl_ca

The force update instances Ops entry guarantees that all instances are updated even if they are not outdated. This entry is necessary for a Consul certificate rotation because instances are not reliably detected as outdated when only a CredHub value has changed.

a9s-pg

Apply the following Ops file to the a9s-pg deployment and redeploy the deployment.

- type: replace
  path: /instance_groups/name=pg/jobs/name=consul/properties/consul/ssl_ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"
backup-service

Apply the following Ops file to the backup-service deployment and redeploy the deployment.

- type: replace
  path: /instance_groups/name=backup-manager/jobs/name=consul/properties/consul/ssl_ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"
- type: replace
  path: /instance_groups/name=backup-monit/jobs/name=consul/properties/consul/ssl_ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"
service-guard

Apply the following Ops file to the service-guard deployment and redeploy the deployment.

- type: replace
  path: /instance_groups/name=guard/jobs/name=consul/properties/consul/ssl_ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"
Delete Consul certificate
credhub delete -n /cdns_ssl
Redeploy environment (with old CA, new CA and new certificate)

Redeploy the environment after the Consul certificate has been deleted.

IMPORTANT: Remember to apply the appropriate Ops file from step Redeploy Environment (with old CA, new CA and old certificate) to the corresponding deployment.

IMPORTANT: Do not forget to update the service instances using the deployment_updater errand.

If a service instance still uses the old Consul certificate after you have run the deployment_updater errand, you can use the command from Update a Specific Service Instance to trigger the update for this service instance.

Redeploy Environment (without old CA)

Redeploy the environment after the Consul certificate has been deleted.

IMPORTANT: This time it is important to NOT apply the Ops file from step Redeploy Environment (with old CA, new CA and old certificate).

IMPORTANT: Do not forget to upload the service templates with the templates-uploader errand and update the service instances using the force_deployment_updater errand.

Update The CF Gorouter Request Timeout

The Cloud Foundry (CF) Gorouter enforces a request timeout and cancels requests that exceed it.

By default, the timeout is 15 minutes. To change the default timeout, you have to update the Cloud Foundry deployment (specifically the gorouter job).

You need to add or modify the request_timeout_in_seconds property in the Cloud Foundry deployment manifest when updating the routing bosh release properties. Be aware that this value must be configured as a number (integer) representing the timeout in seconds.

Example:

(...)
jobs:
- name: gorouter
  properties:
    request_timeout_in_seconds: 3600 # 1 hour
(...)

Network Update

a9s Data Services support network update and relocation of the pool of addresses available for the service instances. The operator can update the BOSH Cloud Config and apply the changes by updating the service instances with the Deployment Updater Errand.

Warning: The a9s Data Services do not support network changes that affect the majority of the nodes in a cluster without downtime. After the first node is updated, it may have a different address than the one known to the rest of the cluster, and when the second node goes down for its update, no part of the cluster has a quorum to continue working as it should. Therefore, we recommend using at least 3 availability zones with a distinct network definition for each one, and updating one availability zone at a time. For example:

* Modify the az1
* Apply the update to all service instances
* Modify the az2
* Apply the update to all service instances
* Modify the az3
* Apply the update to all service instances
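
The sequence above can be written down as a dry run that prints the plan for a given deployment and list of availability zones. This is a sketch only: the function name print_network_update_plan is hypothetical, the cloud-config change itself remains a manual step, and the errand call is the one shown in Update All Service Instances.

```shell
# Hypothetical dry run: prints the order of operations for a network
# update that touches one availability zone at a time. Nothing is run.
# $1 = service deployment name, remaining arguments = availability zones
print_network_update_plan() {
  deployment="$1"; shift
  for az in "$@"; do
    echo "# modify the network definition of $az in the cloud-config, then:"
    echo "bosh -d $deployment run-errand deployment_updater"
  done
}
```

Calling print_network_update_plan elasticsearch-service az1 az2 az3 prints one manual step and one errand call per availability zone, in the order they should be applied.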

Redis Network Update

Updating the network directly affects a9s Redis cluster instances because it violates the Cluster Deployment Update Strategy principle, which requires all nodes (master and slaves) to be healthy during the cluster update. When the network update is applied to one availability zone at a time, the node in the updated availability zone receives a new IP address and is unreachable during the cluster update, so the cluster is not fully healthy for some time. However, the 2 other nodes are still accessible.

Note: If the network update changes all availability zones at once, it causes significant downtime: all nodes receive different IPs, so the cluster has no quorum and the Redis deployment is unreachable until all nodes are updated.

The best way to do the update:

  • Update one availability zone at a time.
  • The a9s Redis stop-cluster-update-on-failure property must be set to false. With this setting, the update does not fail when the cluster is not healthy after a node is updated, and instead continues updating the cluster.

Known issues:

  • The deployment update might take longer than usual because the cluster will not be fully healthy and the update will reach the cluster-update-node-timeout. The delay depends on the cluster-update-node-timeout value and the node update order determined by BOSH.
  • It might cause data loss because the stop-cluster-update-on-failure property is set to false. Read the Cluster Deployment Update Strategy section for details.