
Cluster Recovery

When a cluster is in a broken state that requires manual intervention, the cluster state and the cause behind the error must be properly identified.

This document describes how to analyze a failed a9s Search cluster and how to bring the failed nodes back up and running.

Cluster Recovery

Before proceeding with the recovery, the current status of the cluster and of all nodes needs to be checked to identify the root cause of the issue.

1. Identify the Cluster Status

To identify the current status and possible issues of the cluster, the OpenSearch API can be used. The API can be called by executing a curl command either on one of the Data Service instances or on a connected application instance.

Retrieve the Credentials

To properly use the OpenSearch API, the admin credentials need to be retrieved. The credentials can be retrieved either from a Service Key of the cluster (Application Developer level) or from the manifest of the deployed Service Instance.
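
For example, on the Application Developer level the credentials can be read from an existing service key with the cf CLI, while on the platform operator level they can be looked up in the deployment manifest. The service instance, key and deployment names below are placeholders:

# Application Developer level: read the credentials from an existing service key
cf service-key <service-instance-name> <service-key-name>

# Platform Operator level: inspect the manifest of the deployed Service Instance;
# the admin credentials are part of the job properties (exact property names may differ)
bosh -d <deployment> manifest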

OpenSearch Metrics

The current status of the cluster can be retrieved from the published OpenSearch metrics *.opensearch.*.*.*.*.cluster_health, which are available in case an a9s Prometheus Service Instance is connected to the cluster.

The most relevant metrics are cluster_health.status, cluster_health.unassigned_shards and discovered_master. They describe the overall status of the cluster, the number of unassigned shards and whether the cluster has a connected master node.

The available metrics are further described in the Service Instance Metrics documentation page.

OpenSearch API

OpenSearch provides its API with several endpoints that can be used to query the actual status of the cluster. One of the available endpoints is _cluster/health, which provides the general status of the whole cluster. The _cat/nodes endpoint provides information about the individual nodes.

The OpenSearch API can either be called using any of the available nodes or through a connected application.

To retrieve general information about the cluster from the OpenSearch API, execute the following command:

curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/_cluster/health?pretty" --insecure -u "<admin>:<password>"

Look for the status of the nodes (red/yellow/green) and the number of unassigned shards.

The status of all nodes can be inspected with the _cat/nodes API endpoint:

curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/_cat/nodes?v" --insecure -u "<admin>:<password>"

The node.role property defines the roles of a node. For example, it can have the value dimr. Each character in the value has a different meaning; possible characters are m (master-eligible), d (data), r (remote_cluster_client) and i (ingest). The node that has the cluster_manager property set to * is the elected master node. Nodes with cluster_manager set to d are disconnected from the cluster.
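
To narrow the output down to the columns mentioned above, the cat API's h parameter can be used; one possible invocation is:

curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/_cat/nodes?v&h=name,node.role,cluster_manager" --insecure -u "<admin>:<password>"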

Service Instance Logs

It can also be valuable to inspect the logs of the cluster, to identify the root cause of the issue. This can be done by retrieving the logs on any of the failed nodes.
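
A minimal sketch for doing so, assuming the standard BOSH log location of the opensearch job; the deployment name and instance ID are placeholders:

# download all job logs from a node to the local machine
bosh -d <deployment> logs os/<vm-id>

# or inspect the logs directly on the node
bosh -d <deployment> ssh os/<vm-id>
sudo -i
ls /var/vcap/sys/log/opensearch/
tail -n 200 /var/vcap/sys/log/opensearch/*.log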

2. Fix potential root causes of the issue

If the root cause of the issue was discovered by inspecting the cluster status and the logs, the cluster can potentially be fixed by applying some of the following steps. Each of the steps listed below should only be executed if its description matches the observed status of the cluster.

Recover Quorum (If lost)

In case a majority of the nodes are down and the remaining nodes cannot elect a new master node, the OpenSearch configuration can be adapted temporarily to allow the remaining nodes to form a cluster again.

The opensearch.yml file inside the /var/vcap/jobs/opensearch/config folder of the OpenSearch nodes contains a property called cluster.auto_shrink_voting_configuration which allows the cluster to forget about failed nodes. Setting this value to true allows the remaining nodes to form a cluster, even if one node is still failing. However, at least two nodes are required to form a cluster using this method.

caution

Setting auto_shrink_voting_configuration to true can lead to data loss, depending on the number of failed nodes and the number of shards allocated to each index. If an index has only one shard configured, or one or more nodes have failed, there is a risk that an index could be lost during this operation. In that case a restore of the latest backup needs to be done to recover the data.
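
With the caveat above in mind, a minimal sketch of applying the setting on a remaining node; the deployment name and instance ID are placeholders, and the original value should be restored once the cluster is healthy again:

bosh -d <deployment> ssh os/<vm-id>

sudo -i
vi /var/vcap/jobs/opensearch/config/opensearch.yml
cluster.auto_shrink_voting_configuration: true

monit restart opensearch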

Start the service as a single-node cluster

In case only one node is available and the other nodes are not properly running, the running node isn't able to form a working cluster by itself. It is possible to start the remaining node as a single-node cluster, enabling the other nodes to rejoin the cluster later. However, this will lead to the loss of all indices that aren't saved on this initial node.

caution

This method should only be used if there is a backup that can be used to recover all lost data. Depending on the number of configured replicas and instances, many indices could get lost.

To start a node as a single-node cluster, adapt the opensearch.yml file inside the /var/vcap/jobs/opensearch/config folder on the node. The fields cluster.initial_master_nodes and discovery.seed_hosts must be adapted temporarily. The field discovery.seed_hosts must be set to an empty array to ensure the single-node cluster won't wait for other nodes to join. For the same reason, the cluster.initial_master_nodes field must be adapted to only contain the name of the running node. Make sure to save the old values so that they can be restored later, as soon as the whole cluster is reestablished.

bosh -d abcd123456 ssh os/4b8594eb-4047-4377-8c2b-3b1e81301566

sudo -i
vi /var/vcap/jobs/opensearch/config/opensearch.yml
cluster.initial_master_nodes: ["os/4b8594eb-4047-4377-8c2b-3b1e81301566"]
discovery.seed_hosts: []

Afterwards, the cluster cache needs to be removed on the node and the opensearch process must be restarted to make the changes effective.

rm -rf /var/vcap/store/opensearch/nodes/0/_state
monit start opensearch

Once the OpenSearch process has started, check whether the OpenSearch API is available again.
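
For example, the health endpoint shown earlier can be used for this check:

curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/_cluster/health?pretty" --insecure -u "<admin>:<password>"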

As soon as the other nodes are repaired, the following commands can be executed on each of the remaining nodes to allow them to rejoin the cluster.

bosh -d abcd123456 ssh os/<vm-id>

sudo -i
rm -rf /var/vcap/store/opensearch/nodes/0/_state
monit restart opensearch

Finally, the previous values for the fields cluster.initial_master_nodes and discovery.seed_hosts need to be restored on the master node.

vi /var/vcap/jobs/opensearch/config/opensearch.yml
monit restart opensearch

Corrupted Indices

If there is an issue due to a corrupted index, try to close and re-open the affected index; this can fix the issue in some cases. The following commands need to be executed to close and re-open the index:

curl -XPOST "https://<ip-of-instance-or-domain-of-service>:9200/<the-index>/_close" --insecure -u "<admin>:<password>"
curl -XPOST "https://<ip-of-instance-or-domain-of-service>:9200/<the-index>/_open" --insecure -u "<admin>:<password>"

Afterwards, the information about the index can be retrieved with:

curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/<the-index>/_recovery?pretty" --insecure -u "<admin>:<password>"

Fix Unassigned Shards

The cluster should be checked for unassigned shards. If there are any unassigned shards, the availability of some parts of the data can be affected. The status of the shards can be inspected by executing the following command:

curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/_cluster/allocation/explain?pretty" --insecure -u "<admin>:<password>"

This command will either return an explanation for an unassigned shard or an error referencing ClusterAllocationExplainRequest in case all shards are allocated. This error is not a problem; it simply indicates that there are no unassigned shards.
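
The affected shards can also be listed individually via the _cat/shards endpoint; one possible invocation is:

curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" --insecure -u "<admin>:<password>"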

Misconfigured Replica Count

If more replicas are configured for a specific index than there are available nodes in the cluster, this will lead to unassigned shards. In that case the configuration of the index can be adapted with the following command (make sure to keep at least 2 replicas per index to prevent data loss):

curl -XPUT "https://<ip-of-instance-or-domain-of-service>:9200/<index>/_settings" --insecure -u "<admin>:<password>" -H 'Content-Type: application/json' -d'
{
  "index" : {
    "number_of_replicas" : <number-of-replicas>
  }
}'
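
Afterwards, the updated setting can be verified by querying the index settings:

curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/<index>/_settings?pretty" --insecure -u "<admin>:<password>"
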
In-memory Lock

In case the shard allocation failed with an error like failure IOException[failed to obtain in-memory shard lock] visible in the instance logs, a possible solution is to increase the maximum number of retries for the allocation. This can be adapted with the following command:

curl -XPUT "https://<ip-of-instance-or-domain-of-service>:9200/<index>/_settings" --insecure -u "<admin>:<password>" -H 'Content-Type: application/json' -d'
{
  "index.allocation.max_retries": 20
}'

No Space Left for Allocation

Depending on the number of saved documents, the instance size and whether the default parachute limit was modified, shard allocation can also fail because there is no disk space left. Check the disk usage either in the Service Dashboard or directly on the instances.
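
For example, the disk usage per node can be checked with the _cat/allocation endpoint, or directly on an instance (the mount point below assumes the standard BOSH persistent disk location):

curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/_cat/allocation?v" --insecure -u "<admin>:<password>"

# directly on an instance
df -h /var/vcap/store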

In that case a proper solution is to upgrade the Service Instance to a plan with more disk space.

Manually Trigger Re-allocation of the Shards

If the underlying issue is resolved and there are still some unassigned shards, they can be manually rerouted and assigned by executing the following command:

curl -XPOST "https://<ip-of-instance-or-domain-of-service>:9200/_cluster/reroute?retry_failed=true" --insecure -u "<admin>:<password>"

3. Restart the Cluster

After the failed and the working nodes of the cluster have been identified, the cluster can be recovered by restarting the failed nodes so that they re-join the cluster.

If the current master node is healthy, it should be restarted first. Afterwards all master-eligible nodes should be restarted.

The master node can be identified by retrieving the information about all nodes from the _cat/nodes API endpoint:

curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/_cat/nodes?v" --insecure -u "<admin>:<password>"

The node that has the cluster_manager property set to * is the master node.

The nodes should be restarted sequentially, allowing each node to rejoin before proceeding.

To restart a node, connect to the instance via ssh and execute:

bosh -d <deployment> ssh os/<index>
sudo -i
monit restart all
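
After each restart, it can be helpful to wait until the cluster has stabilized before moving on to the next node; a sketch using the health endpoint's wait_for_status parameter:

curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/_cluster/health?wait_for_status=green&timeout=120s&pretty" --insecure -u "<admin>:<password>"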

4. Reinitiate the Cluster from Backup

If the actual issue of the cluster couldn't be identified or fixed, it is possible to restore the backup and bootstrap the cluster to make it available again. Depending on the point in time the last backup was created, this may result in data loss.

Stop the OpenSearch process on all nodes:

bosh -d <deployment> ssh os/<index>
sudo -i

monit stop opensearch

Back up the data directory of all nodes to keep the data in case of any problems or for further analysis. The backup can be stored in a local directory under /var/vcap/store. If the node does not have enough space available, back up the data to the local machine using bosh scp.

cp -rp /var/vcap/store/opensearch /var/vcap/store/opensearch.bkp
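
If the node does not have enough space, a sketch of copying the data directory to the local machine instead, assuming the deployment and instance names used above and a bosh CLI that supports recursive copies:

bosh -d <deployment> scp --recursive os/<index>:/var/vcap/store/opensearch ./opensearch.bkp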

Afterwards, the data directory can be deleted by executing the following command on every node:

rm -rf /var/vcap/store/opensearch/nodes/*

Then the OpenSearch process needs to be started again on all nodes, one after another. This can be done by executing the following command on each node:

monit start opensearch

To access the cluster again without the admin user, a new service key needs to be created. Furthermore, all applications need to be bound again, as the old users were deleted along with the indices.
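
With the cf CLI this typically looks as follows; the application, service instance and key names are placeholders:

# create a new service key to obtain fresh credentials
cf create-service-key <service-instance-name> <service-key-name>
cf service-key <service-instance-name> <service-key-name>

# re-bind each application and restage it so that new credentials are created and picked up
cf unbind-service <app-name> <service-instance-name>
cf bind-service <app-name> <service-instance-name>
cf restage <app-name>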

Afterwards, the restore of the backup needs to be initiated. This can be done through the Service Dashboard as described in the Restore a Backup documentation.