a9s Data Services Administrative Tasks
This document explains the most common tasks an a9s Data Service Operator should know.
Map a service instance guid to a BOSH deployment name
In order to find out which BOSH deployment belongs to which service instance, you can use the following command (jq
must be installed):
curl --user admin:[deployer-api-password] [deployer-api-endpoint]/deployments.json | jq '.[] | select(.deployment_attributes.instance_guid == "[service-instance-guid]") | .name'
Update All Service Instances
There are three scenarios where the Platform Operator might want to update the existing Service Instance deployments:
- A new version of this anynines deployments repo is available and contains new BOSH releases or new configurations
- The Platform Operator uploaded a new stemcell to the BOSH director
- The Platform Operator changed the cloud-config in their setup
In case of the first two scenarios the Platform Operator first has to execute the templates-uploader
errand so that the templates in the a9s Deployer are updated. After that the Platform Operator can use the
deployment_updater errand to update all outdated deployments.
In case of the third scenario you only have to execute the deployment_updater
errand and all outdated deployments are updated.
You can simply run the deployment_updater
errand by executing the following command:
bosh -d <deployment_name> run-errand deployment_updater
For more information about how to configure the a9s Deployment Updater see a9s Deployment Updater - Properties.
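The three scenarios above differ only in whether the templates-uploader errand has to run first. The following is a minimal sketch of that decision, not an official tool: the scenario labels (new-release, new-stemcell, cloud-config) and the function names are hypothetical and chosen for this example only. It prints the bosh commands instead of executing them.

```shell
#!/usr/bin/env bash
# Sketch: list the errands to run for a given update scenario.
# The scenario labels are hypothetical names used only in this example.
errands_for_scenario() {
  case "$1" in
    new-release|new-stemcell)
      # Templates in the a9s Deployer must be refreshed first.
      echo "templates-uploader"
      echo "deployment_updater"
      ;;
    cloud-config)
      echo "deployment_updater"
      ;;
    *)
      echo "unknown scenario: $1" >&2
      return 1
      ;;
  esac
}

# Print (rather than execute) the bosh commands for one deployment.
print_update_commands() {
  local deployment="$1" scenario="$2" errands errand
  errands=$(errands_for_scenario "$scenario") || return 1
  for errand in $errands; do
    echo "bosh -d ${deployment} run-errand ${errand}"
  done
}

print_update_commands "<deployment_name>" new-stemcell
```

Printing the commands first makes it easy to review the order before running them by hand.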
Caveats
- When changing the template_name_v2 property for a service plan, the deployment_updater cannot recognize this change and must be run for all instances.
Update a Specific Service Instance
Instead of updating all service instances of a service, it is also possible to update only one service instance. To do so, you must first find out the guid of the service instance. When using Cloud Foundry you can simply run:
$ cf service service-instance-name-in-cf --guid
34e68cdf-62cc-4ea6-a3e8-714026dba1f8
In this example, the service instance guid is 34e68cdf-62cc-4ea6-a3e8-714026dba1f8. Next you have to find out the endpoint of the Service Broker and the Service Broker admin password. Once you have this information you can trigger an update of the service instance by executing:
$ curl --user admin:[service-broker-password] -X PATCH [service-broker-hostname]:3000/v2/service_instances/[service-instance-guid] -d '{"plan_id":"6b1973db-e057-4a71-9832-a4b3f27a0d8f", "service_id": "7ee52a02-8839-43c2-a550-728ad736bbda"}'
The service_id and plan_id parameters can be fetched from the service catalog of the broker.
Interact with the Backup Manager
As an a9s Data Service operator, you can interact with the Backup Manager API in order to trigger backups and restores.
To trigger a backup of all service instances, execute:
curl [user]:[password]@[backup manager endpoint]/backup_agent/backup_all -H "Content-Type: application/json" -d '{}'
To trigger a backup for a specific service instance, execute:
curl [user]:[password]@[backup manager endpoint]/backup_agent/backup -H "Content-Type: application/json" -d '{"instance_guid": "[service-instance-guid]"}'
List all backups:
curl [user]:[password]@[backup manager endpoint]/instances
Trigger a restore:
In addition to the service instance guid, you need the backup id of the backup you want to restore.
To find this backup id, first call the /instances/[service-instance-guid] endpoint and read the id field of the specific backup (not the backup_id field!). The backup id is an integer.
Once you have this information, you can trigger the restore of the backup by running:
curl [user]:[password]@[backup manager endpoint]/backup_agent/restores -H "Content-Type: application/json" -d '{"instance_id": "[service-instance-guid]", "backup_id": [id]}'
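Because the restore request expects the integer id and not the string backup_id, it can help to validate the value before sending it. The following is a minimal sketch under that assumption; restore_payload is a hypothetical helper name and not part of the Backup Manager.

```shell
#!/usr/bin/env bash
# Sketch: build the JSON body for the restore request.
# restore_payload is a hypothetical helper, not part of the Backup Manager.
restore_payload() {
  local instance_guid="$1" id="$2"
  # The id must be the integer "id" field, not the string "backup_id" field.
  case "$id" in
    ''|*[!0-9]*)
      echo "backup id must be an integer, got: '${id}'" >&2
      return 1
      ;;
  esac
  printf '{"instance_id": "%s", "backup_id": %s}\n' "$instance_guid" "$id"
}

restore_payload "34e68cdf-62cc-4ea6-a3e8-714026dba1f8" 42
```

The result can be passed to the curl command above with -d "$(restore_payload [service-instance-guid] [id])".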
Decrypt a backup whose encryption key is unknown
To decrypt an existing backup whose encryption key is unknown, access to the a9s Backup Manager is required. If you have access to the a9s Backup Manager, follow these steps to get the encryption key and decrypt the backup.
- Download the appropriate backup directly from the Backup Store. The backup name should be in the format <deployment-name>-<unix-timestamp>. To find the backup later in the database, you need the Created at date of the corresponding backup. You can either look it up in the a9s Service Dashboard or convert the unix-timestamp of the backup file to a date.
- Connect to the a9s Backup Manager:
bosh -d backup-service ssh backup-manager
- Become root:
sudo -i
- Open the Rails console of the a9s Backup Manager:
/var/vcap/jobs/anynines-backup-manager/bin/rails_c
- Find the correct encryption key for the backup. The name of the backup file typically contains the backup_id, which can be used to find the related encryption key:
Backup.where(backup_id: "89fe4350-784e-4c03-a779-e88067d53cd8-1694685740323").first.credentials[:filter_plugins][0][:password]
Please note that credentials[:filter_plugins] returns an Array. The encryption plugin should be the first entry (index 0), but it may be in another position, depending on your Backup Manager configuration.
If you are missing the correct backup_id, you can try to find the backup by filtering for the creation date.
The date must be in the format Year-Month-Day Hour:Minute:Second. The time must be UTC.
As an example we use the date 2018-11-26 13:45:53:
Backup.where("created_at >= ?", "2018-11-26 13:45:53")
- Decrypt the backup. As an example we use the backup file ~/Downloads/d70a4d9-1543239953810 and as password from the previous step 12345678:
cat ~/Downloads/d70a4d9-1543239953810 | openssl enc -aes256 -md md5 -d -pass 'pass:12345678' | gunzip -c > ~/Downloads/d70a4d9-1543239953810.decrypted
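The first step above mentions converting the unix-timestamp of the backup file to a date. The following is a minimal sketch of that conversion, assuming the suffix after the last "-" is a unix timestamp in milliseconds (as the 13-digit examples in this section suggest) and that GNU date is available. backup_created_at is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: derive the UTC creation time from a backup file name.
# Assumes the suffix after the last "-" is a unix timestamp in milliseconds
# and that GNU date is installed (for the -d @SECONDS form).
backup_created_at() {
  local name="$1"
  local millis="${name##*-}"        # e.g. 1543239953810
  local seconds=$(( millis / 1000 ))
  date -u -d "@${seconds}" '+%Y-%m-%d %H:%M:%S'
}

backup_created_at "d70a4d9-1543239953810"   # prints 2018-11-26 13:45:53
```

The output is already in the Year-Month-Day Hour:Minute:Second UTC format used by the created_at filter in the Rails console.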
Get the error backtrace from a backup or restore
If a backup or restore fails the backtrace of the error is saved in the database. With these steps you can read the error backtrace.
- Connect to the a9s Backup Manager:
bosh -d backup-service ssh backup-manager
- Become root:
sudo -i
- Open the Rails console of the a9s Backup Manager:
/var/vcap/jobs/anynines-backup-manager/bin/rails_c
- Get the Instance where the error happened:
instance = Instance.where(instance_id: "instance_guid").first
- Get the Backup that failed. There are multiple options:
If it's the last backup that failed:
backup = instance.backups.last
If you know the backup_id, e.g. d25ed99-1543410104023:
backup = instance.backups.where(backup_id: "d25ed99-1543410104023").first
If you don't know the backup_id, you can find it by filtering for the creation date of the Backup/Restore. The date must be in the format Year-Month-Day Hour:Minute:Second. The time must be UTC. As an example we use the date 2018-11-26 13:45:53:
backup = instance.backups.where("created_at >= ?", "2018-11-26 13:45:53").first
Finally load the message and decode it:
Base64.decode64(backup.backup_agent_task.msg)
Backups of a9s-pg
The backup of the a9s-pg can now be handled with the a9s Backup Manager. See a9s_pg_backup for details.
Rotate database encryption salts
To rotate the database encryption salts of the a9s Service Broker, the a9s Deployer or the Backup Manager, you have to execute
the following steps. The example below uses the elasticsearch-service:
- Duplicate the current encryption salt
OLD_SALT=`credhub get -n "/<BOSH director name>/elasticsearch-service/elasticsearch_service_broker_db_salt"`
credhub set -n "/<BOSH director name>/elasticsearch-service/elasticsearch_service_broker_db_salt_old" -t password -w "${OLD_SALT}"
OLD_SALT=`credhub get -n "/<BOSH director name>/elasticsearch-service/elasticsearch_service_deployer_db_salt32"`
credhub set -n "/<BOSH director name>/elasticsearch-service/elasticsearch_service_deployer_db_salt32_old" -t password -w "${OLD_SALT}"
- Regenerate the encryption salt
credhub generate -n "/<BOSH director name>/elasticsearch-service/elasticsearch_service_broker_db_salt" -t password -l 32
credhub generate -n "/<BOSH director name>/elasticsearch-service/elasticsearch_service_deployer_db_salt32" -t password -l 32
- Redeploy the service
bosh -d elasticsearch-service deploy elasticsearch-service/elasticsearch-service.yml
- Execute the errands
bosh -d elasticsearch-service run-errand migrate-deployer-api-encrypted-database-fields
bosh -d elasticsearch-service run-errand migrate-service-broker-encrypted-database-fields
Delete obsolete backup metadata files
To delete obsolete metadata files from already deleted backups, you can use the delete_metadata_files script on the Backup Manager VM. To do so, execute the following steps:
- Connect to the a9s Backup Manager:
bosh -d backup-service ssh backup-manager
- Become root:
sudo -i
- Execute the script:
/var/vcap/jobs/anynines-backup-manager/bin/delete_metadata_files
Rotate Consul certificates
Prerequisites
Find out BOSH director name
BOSH_NAME=`bosh env --json | jq '.Tables[0].Rows[0].name' -r`
Ensure the current CredHub CA entry is complete
You have to ensure the current CA value for the CredHub entry is not empty:
credhub get -n "/${BOSH_NAME}/consul-dns/cdns_ca" --output-json | jq .value.ca
If the previous command returns null, you have to execute the following
commands to copy the value of the current certificate into the value for the
current CA:
credhub get -k private_key -n "/${BOSH_NAME}/consul-dns/cdns_ca" > /tmp/cdns_ca.private.pem
credhub get -k certificate -n "/${BOSH_NAME}/consul-dns/cdns_ca" > /tmp/cdns_ca.cert.pem
credhub set -n "/${BOSH_NAME}/consul-dns/cdns_ca" -t certificate -c /tmp/cdns_ca.cert.pem -p /tmp/cdns_ca.private.pem -r /tmp/cdns_ca.cert.pem
Rotate an expiring Consul CA and certificate
To rotate an expiring Consul CA and certificate you have to follow these steps:
Duplicate current CA
credhub get -k private_key -n "/${BOSH_NAME}/consul-dns/cdns_ca" > /tmp/cdns_ca.private.pem
credhub get -k certificate -n "/${BOSH_NAME}/consul-dns/cdns_ca" > /tmp/cdns_ca.cert.pem
credhub get -k ca -n "/${BOSH_NAME}/consul-dns/cdns_ca" > /tmp/cdns_ca.ca.pem
credhub set -n "/${BOSH_NAME}/consul-dns/cdns_ca_old" -t certificate -c /tmp/cdns_ca.cert.pem -p /tmp/cdns_ca.private.pem -r /tmp/cdns_ca.ca.pem
Regenerate current CA
To prevent CA rotation every year change the duration parameter.
credhub generate --duration=365 -n "/${BOSH_NAME}/consul-dns/cdns_ca" -c a9sConsulCA --is-ca -t certificate
Redeploy Environment (with old CA, new CA and old certificate)
consul-dns
Apply the following Ops file to the consul-dns
deployment and redeploy the consul-dns
deployment.
To prevent SSL certificate rotation every year change the duration parameter in the following Ops file.
IMPORTANT: Replace <bosh-director-name>
with the director name from step 1
in all following Ops files.
- type: replace
  path: /instance_groups/name=consul/jobs/name=consul/properties/consul/ssl_ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"
- type: replace
  path: /variables/name=~1cdns_ssl/options/duration?
  value: 365
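The placeholder replacement required for the Ops files can be scripted. The following is a minimal sketch, assuming the Ops file is stored with the literal <bosh-director-name> placeholder and that the director name contains no characters that are special to sed ("|", "&" or "\"). fill_director_name is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: fill in the <bosh-director-name> placeholder of an Ops file.
# Reads the Ops file on stdin and writes the result to stdout.
# Assumes the director name contains no "|", "&" or "\" characters,
# which would need escaping for sed.
fill_director_name() {
  local director_name="$1"
  sed "s|<bosh-director-name>|${director_name}|g"
}

# Example with a made-up director name:
echo 'value: "((/<bosh-director-name>/consul-dns/cdns_ca.ca))"' | fill_director_name "my-bosh"
```

With the BOSH_NAME variable from the prerequisites, a call could look like fill_director_name "${BOSH_NAME}" < ops.yml > ops.rendered.yml, where both file names are placeholders.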
data-services
Apply the following Ops file to the x-service
deployments and redeploy the x-service
deployments. Run the
templates-uploader
errand and the force_deployment_updater
errand after you redeployed the deployments.
- type: replace
  path: /instance_groups/name=spi/jobs/name=consul/properties/consul/ssl_ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"
- type: replace
  path: /instance_groups/name=broker/jobs/name=consul/properties/consul/ssl_ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"
- type: replace
  path: /instance_groups/name=deployer-api/jobs/name=consul/properties/consul/ssl_ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"
- type: replace
  path: /instance_groups/name=templates-uploader/jobs/name=template-uploader/properties/template-uploader/template-vars/~1cdns_ssl.ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"
# delete for a9s Prometheus and a9s LogMe
- type: replace
  path: /instance_groups/name=service-dashboard/jobs/name=consul/properties/consul/ssl_ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"
# force update instances
- type: replace
  path: /instance_groups/name=force_deployment_updater/jobs/name=deployment-updater/properties/strategy?
  value:
    update:
      instance_type: provisioned
The a9s Prometheus and a9s LogMe deployments don't contain a
service dashboard with a running Consul job. For these deployments, the Ops entry with the
following path must be deleted: /instance_groups/name=service-dashboard/jobs/name=consul/properties/consul/ssl_ca
The force update instances Ops entry guarantees that all instances are updated even if they
are not outdated. This Ops entry is necessary for a Consul certificate rotation
because, when only a CredHub value changes, there is no guarantee that the instances are considered outdated.
a9s-pg
Apply the following Ops file to the a9s-pg
deployment and redeploy the deployment.
- type: replace
  path: /instance_groups/name=pg/jobs/name=consul/properties/consul/ssl_ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"
backup-service
Apply the following Ops file to the backup-service
deployment and redeploy the deployment.
- type: replace
  path: /instance_groups/name=backup-manager/jobs/name=consul/properties/consul/ssl_ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"
- type: replace
  path: /instance_groups/name=backup-monit/jobs/name=consul/properties/consul/ssl_ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"
service-guard
Apply the following Ops file to the service-guard
deployment and redeploy the deployment.
- type: replace
  path: /instance_groups/name=guard/jobs/name=consul/properties/consul/ssl_ca
  value: "((/<bosh-director-name>/consul-dns/cdns_ca_old.ca))((/<bosh-director-name>/consul-dns/cdns_ca.ca))"
Delete Consul certificate
credhub delete -n /cdns_ssl
Redeploy environment (with old CA, new CA and new certificate)
Redeploy the environment after the Consul certificate has been deleted.
IMPORTANT: Remember to apply the appropriate Ops file from step Redeploy Environment (with old CA, new CA and old certificate) to the corresponding deployment.
IMPORTANT: Do not forget to update the service instances using the
force_deployment_updater
errand.
If a service instance is still using the old Consul certificate after you ran the force_deployment_updater errand,
you can use the command from Update a Specific Service Instance
to trigger an update of that service instance.
Redeploy Environment (without old CA)
Redeploy the environment after the Consul certificate has been deleted.
IMPORTANT: This time it is important to NOT apply the Ops file from step Redeploy Environment (with old CA, new CA and old certificate).
IMPORTANT: Do not forget to upload the service templates with the
templates-uploader
errand and update the service instances using the
force_deployment_updater
errand.
Update The CF Gorouter Request Timeout
When using the Cloud Foundry (CF) Gorouter, it is necessary to be aware that there is a timeout, and
it might reject requests exceeding this timeout.
By default, the timeout is 15 minutes. Therefore, requests longer than this timeout will be canceled.
In order to change the default timeout, it is required to update the Cloud Foundry deployment
(specifically the gorouter job).
You need to add or modify the request_timeout_in_seconds property in the Cloud Foundry deployment manifest
when updating the routing bosh release properties. Be aware that this value must be configured as a number (integer) representing the timeout in seconds.
Example:
(...)
jobs:
- name: gorouter
  properties:
    request_timeout_in_seconds: 3600 # 1 hour
(...)
Network Update
a9s Data Services support network update and relocation of the pool of addresses available for the service instances. The operator can update the BOSH Cloud Config and apply the changes by updating the service instances with the Deployment Updater Errand.
Warning: The a9s Data Services do not support network changes that affect the majority of
the nodes in a cluster without downtime. After the first node is updated, it
may have a different address than the remaining part of the cluster expects, and when the second node goes
down for its update, no part of the cluster has a quorum to continue working as it should.
Therefore we recommend using at least 3 availability zones with distinct network definitions for
each one, and updating one availability zone at a time. For example:
* Modify the az1
* Apply the update to all service instances
* Modify the az2
* Apply the update to all service instances
* Modify the az3
* Apply the update to all service instances
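The zone-by-zone procedure above can be written down as a small plan generator. This is only a sketch: it prints the steps instead of executing them, the cloud-config change itself stays a manual step, and the function name and the <deployment_name> placeholder are assumptions for this example.

```shell
#!/usr/bin/env bash
# Sketch: print the update steps for one availability zone at a time.
# Nothing is executed; editing the cloud-config remains a manual step.
print_network_update_plan() {
  local deployment="$1"; shift
  local az
  for az in "$@"; do
    echo "# Modify the network definition of ${az} in the BOSH cloud-config"
    echo "bosh -d ${deployment} run-errand deployment_updater"
  done
}

print_network_update_plan "<deployment_name>" az1 az2 az3
```

Keeping the errand run directly after each zone change makes it harder to accidentally modify two zones before the instances have been updated.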
Redis Network Update
The a9s Redis cluster instances are directly affected by a network update because it violates the Cluster Deployment Update Strategy principle, which requires all nodes (master and slaves) to be healthy during the cluster update. When the network is updated one availability zone at a time, the node that receives a new IP in the updated availability zone is unreachable during the cluster update, so the cluster is not fully healthy for some time. However, the 2 other nodes are still accessible.
Note: If the network update changes all availability zones at once, it causes significant downtime because all nodes get different IPs. The cluster then has no quorum, and the Redis deployment stays unreachable until all nodes are updated.
The best way to do the update:
- Update one availability zone at a time.
- The a9s Redis stop-cluster-update-on-failure property must be set to false, so that the update does not fail when the cluster is not healthy after a node update and instead continues updating the cluster.
Known issues:
- The deployment update might take longer because the cluster will not be fully healthy and the update will reach the cluster-update-node-timeout. The delay depends on the cluster-update-node-timeout value and the node update order determined by BOSH.
- It might cause data loss because the stop-cluster-update-on-failure property is set to false. Read the Cluster Deployment Update Strategy section for details.