Cluster Recovery
When a cluster is in a broken state that requires manual intervention, the cluster state and the underlying error must be identified first.
This document describes how to analyze a failed a9s LogMe2 cluster and bring failed nodes back up and running.
Cluster Recovery
Before going ahead with the recovery, the current status of the cluster and of all nodes needs to be checked to identify the root cause of the issue.
1. Identify the Cluster Status
To identify the current cluster status and possible issues, the OpenSearch API can be queried to gather more information. The API can be called either through the OpenSearch Dashboard or by executing a curl command on any of the Data Service instances or connected application instances. The usage of the a9s LogMe2 Dashboard is explained in the See Your Logs documentation section.
Retrieve the Credentials
To use the OpenSearch API, the admin credentials need to be retrieved first. The credentials can be retrieved either from a Service Key of the cluster (Application Developer level) or from the manifest of the deployed Service Instance.
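On the Application Developer level, a Service Key can, for example, be created and read with the cf CLI. The instance and key names below are placeholders; the returned credentials typically include the host as well as the admin username and password:
cf create-service-key <service-instance-name> <service-key-name>
cf service-key <service-instance-name> <service-key-name>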
OpenSearch Metrics
The current status of the cluster can be retrieved from the published OpenSearch metrics (*.opensearch.*.*.*.*.cluster_health), which are available in case an a9s Prometheus Service Instance is connected to the cluster.
The most relevant metrics are cluster_health.status, cluster_health.unassigned_shards and discovered_master. They describe the overall status of the cluster, the number of unassigned shards and whether the cluster has a connected master node.
OpenSearch API
OpenSearch provides an API with several endpoints that can be used to query the actual status of the cluster. One of the available endpoints is _cluster/health, which provides the general status of the whole cluster. The _cat/nodes endpoint provides information about the individual nodes.
The OpenSearch API can either be called using any of the available nodes or through a connected application.
To retrieve the general information about the cluster from the OpenSearch API, the following command needs to be executed:
curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/_cluster/health?pretty" --insecure -u "<admin>:<password>"
Look for the status of the cluster (red/yellow/green) and the number of unassigned shards.
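An abridged example response could look like the following; the values shown are purely illustrative:
{
  "cluster_name" : "abcd123456",
  "status" : "yellow",
  "discovered_master" : true,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_shards" : 24,
  "unassigned_shards" : 4,
  "active_shards_percent_as_number" : 85.7
}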
The status of all nodes can be inspected with the _cat/nodes API endpoint:
curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/_cat/nodes?v" --insecure -u "<admin>:<password>"
The node.role property defines the roles of a node. For example, it can have the value dimr. Each character in the value has a different meaning. Possible characters are: m (master-eligible), d (data), r (remote_cluster_client) and i (ingest). The node that has the cluster_manager property set to * is the elected master node; all other connected nodes show - in this column. Nodes that do not appear in the output at all are disconnected from the cluster.
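An abridged example of the output, with most columns omitted and illustrative addresses and names, could look like this:
ip           node.role  cluster_manager  name
172.28.1.10  dimr       -                os/<vm-id-1>
172.28.1.11  dimr       *                os/<vm-id-2>
172.28.1.12  dimr       -                os/<vm-id-3>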
Service Instance Logs
It can also be valuable to inspect the logs of the cluster to identify the root cause of the issue. This can be done by retrieving the logs from any of the failed nodes.
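For example, the logs can be inspected via BOSH as sketched below. On BOSH-deployed instances the job logs usually end up under /var/vcap/sys/log; the exact sub-directory and file names may differ per deployment:
bosh -d <deployment> ssh os/<index>
sudo -i
# job logs typically live under /var/vcap/sys/log/<job-name>; adjust the path if it differs
ls /var/vcap/sys/log/opensearch/
tail -n 200 /var/vcap/sys/log/opensearch/*.log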
2. Fix Potential Root Causes of the Issue
If the root cause of the issue was discovered by inspecting the cluster status and the logs, the cluster can often be fixed by applying some of the following steps. Each of the steps listed below should only be executed if its description matches the observed status of the cluster.
Recover Quorum (If lost)
In case a majority of the nodes are down and the remaining nodes cannot elect a new master node, the OpenSearch configuration can be adapted temporarily to allow the remaining nodes to form a cluster again.
The opensearch.yml file inside the /var/vcap/jobs/opensearch/config folder of the OpenSearch nodes contains a property called cluster.auto_shrink_voting_configuration which allows the cluster to forget about failed nodes. Setting this value to true allows the remaining nodes to form a cluster, even if one node is still failing. However, at least two nodes are required to form a cluster using this method.
Setting auto_shrink_voting_configuration to true can lead to data loss, depending on the number of failed nodes and the number of shards allocated to each index. If an index has only one shard configured, or one or more nodes have failed, there is a risk that an index could get lost during this operation. In that case a restore of the latest backup needs to be done to recover the data.
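A possible sequence for applying this change temporarily, using the job paths and monit commands that appear elsewhere in this document, could look like the following; remember to revert the setting once the cluster is healthy again:
bosh -d <deployment> ssh os/<index>
sudo -i
vi /var/vcap/jobs/opensearch/config/opensearch.yml
# add or change: cluster.auto_shrink_voting_configuration: true
monit restart opensearch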
Start the service as a single-node cluster
In case only one node is available and the other nodes are not properly running, the running node isn't able to form a working cluster by itself. It is possible to start the remaining node as a single-node cluster, enabling the other nodes to rejoin the cluster later. However, this will lead to the loss of all indices that aren't saved on this initial node.
This method should only be used if there is a backup that can be used to recover all lost data. Depending on the number of configured replicas and instances, many indices could get lost.
To start a node as a single-node cluster, adapt the opensearch.yml file inside the /var/vcap/jobs/opensearch/config folder on the node. The fields cluster.initial_master_nodes and discovery.seed_hosts must be adapted temporarily.
The field discovery.seed_hosts must be set to an empty array to ensure the single-node cluster won't wait for other nodes to join. For the same reason, the cluster.initial_master_nodes field must be adapted to contain only the name of the running node. Make sure to save the old values so they can be restored later, as soon as the whole cluster is reestablished.
bosh -d abcd123456 ssh os/4b8594eb-4047-4377-8c2b-3b1e81301566
sudo -i
vi /var/vcap/jobs/opensearch/config/opensearch.yml
cluster.initial_master_nodes: ["os/4b8594eb-4047-4377-8c2b-3b1e81301566"]
discovery.seed_hosts: []
Afterwards, the cluster cache needs to be removed on the node and the opensearch process must be restarted to make the changes effective.
rm -rf /var/vcap/store/opensearch/nodes/0/_state
monit start opensearch
Once the OpenSearch process has started, check whether the OpenSearch API is available again.
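For example, the health endpoint used earlier can be queried on the node; a response containing a status field indicates that the API is reachable again:
curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/_cluster/health?pretty" --insecure -u "<admin>:<password>"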
As soon as the other nodes are repaired, the following commands can be executed on each of the remaining nodes to allow them to rejoin the cluster.
bosh -d abcd123456 ssh os/<vm-id>
sudo -i
rm -rf /var/vcap/store/opensearch/nodes/0/_state
monit restart opensearch
Finally, the previous values for the fields cluster.initial_master_nodes and discovery.seed_hosts need to be restored on the master node.
vi /var/vcap/jobs/opensearch/config/opensearch.yml
monit restart opensearch
Corrupted Indices
If there is an issue due to a corrupted index, try to close and re-open the affected index. This can fix the issue in some cases. The following commands need to be executed to close and re-open the index:
curl -XPOST "https://<ip-of-instance-or-domain-of-service>:9200/<the-index>/_close" --insecure -u "<admin>:<password>"
curl -XPOST "https://<ip-of-instance-or-domain-of-service>:9200/<the-index>/_open" --insecure -u "<admin>:<password>"
Afterwards, the information about the index can be retrieved with:
curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/<the-index>/_recovery?pretty" --insecure -u "<admin>:<password>"
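If it is unclear which index is affected, the _cat/indices endpoint lists all indices together with their health and status:
curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/_cat/indices?v" --insecure -u "<admin>:<password>"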
Fix Unassigned Shards
The cluster should be checked for unassigned shards. If there are any unassigned shards, this can affect the availability of some parts of the data. The status of the shards can be inspected by executing the following command:
curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/_cluster/allocation/explain?pretty" --insecure -u "<admin>:<password>"
This command will either return a list of all unassigned shards or an error called ClusterAllocationExplainRequest in case all shards are allocated. If this specific error occurs, it is not a problem; it just indicates that there are no unassigned shards.
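For orientation, an abridged response for an unassigned shard might look like the following; the index name and values are purely illustrative:
{
  "index" : "logs-2024-01",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "last_allocation_status" : "no_attempt"
  }
}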
Misconfigured Replica Count
If there are more replicas configured for a specific index than there are available nodes in the cluster, this will lead to unassigned shards. In that case the configuration for the index can be adapted with the following command (make sure to have at least 2 replicas per index to prevent data loss):
curl -XPUT "https://<ip-of-instance-or-domain-of-service>:9200/<index>/_settings" --insecure -u "<admin>:<password>" -H 'Content-Type: application/json' -d'
{
  "index" : {
    "number_of_replicas" : <number-of-replicas>
  }
}'
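The new replica count can then be verified by reading the index settings back:
curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/<index>/_settings?pretty" --insecure -u "<admin>:<password>"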
In-memory Lock
In case the shard allocation failed with an in-memory error like failure IOException[failed to obtain in-memory shard lock] visible in the instance logs, a possible solution is to increase the maximum number of retries for the allocation. This can be adapted with the following command:
curl -XPUT "https://<ip-of-instance-or-domain-of-service>:9200/<index>/_settings" --insecure -u "<admin>:<password>" -H 'Content-Type: application/json' -d'
{
  "index.allocation.max_retries": 20
}'
No Space Left for Allocation
Depending on the number of saved documents, the instance size and whether the default parachute limit was modified, this can lead to issues with the shard allocation as well. Check the disk usage either in the Service Dashboard or directly on the instances.
In that case, a proper solution is to upgrade the Service Instance to a plan with more disk space.
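The disk usage can be checked with the _cat/allocation endpoint, which reports the disk usage per node, or directly on an instance with df; the paths below follow the BOSH layout used elsewhere in this document:
curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/_cat/allocation?v" --insecure -u "<admin>:<password>"
bosh -d <deployment> ssh os/<index>
sudo -i
# disk usage of the persistent disk holding the OpenSearch data
df -h /var/vcap/store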
Manually Trigger Re-allocation of the Shards
If the underlying issue is resolved and there are still some unassigned shards, they can be manually rerouted and assigned by executing the following command:
curl -XPOST "https://<ip-of-instance-or-domain-of-service>:9200/_cluster/reroute?retry_failed=true" --insecure -u "<admin>:<password>"
3. Restart the Cluster
After the failed and the working nodes of the cluster have been identified, the cluster can be recovered by restarting the failed nodes so that they rejoin the cluster.
If the current master node is healthy, it should be restarted first. Afterwards, all master-eligible nodes should be restarted.
The master node can be identified by retrieving the information about all nodes from the _cat/nodes API endpoint:
curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/_cat/nodes?v" --insecure -u "<admin>:<password>"
The node that has the cluster_manager property set to * is the master node.
The nodes should be restarted sequentially, allowing each node to rejoin before proceeding.
To restart a node, connect to the instance via ssh and execute:
bosh -d <deployment> ssh os/<index>
sudo -i
monit restart all
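Between restarts, it can be verified that the node has rejoined the cluster by checking the node list again:
curl -XGET "https://<ip-of-instance-or-domain-of-service>:9200/_cat/nodes?v" --insecure -u "<admin>:<password>"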
4. Reinitiate the Cluster from Backup
If the actual issue of the cluster couldn't be identified or fixed, it is possible to restore the backup and bootstrap the cluster to make it available again. Depending on when the last backup was created, this may result in data loss.
Stop the OpenSearch process on all nodes:
bosh -d <deployment> ssh os/<index>
sudo -i
monit stop opensearch
Backup the data directory of all nodes to keep the data in case of any problems or for further analysis. The backup can be stored in a local directory under /var/vcap/store. If the node does not have enough space available, back up to the local machine using bosh scp.
cp -rp /var/vcap/store/opensearch /var/vcap/store/opensearch.bkp
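If the backup has to be copied to the local machine instead, bosh scp can be used, for example as sketched below; the deployment name and instance index are placeholders and the recursive flag is assumed to be available in the installed bosh CLI:
# --recursive copies the whole backup directory to the local machine; adjust paths as needed
bosh -d <deployment> scp --recursive os/<index>:/var/vcap/store/opensearch.bkp ./opensearch-backup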
Afterwards, the data directory can be deleted on all nodes by executing the following command on every node:
rm -rf /var/vcap/store/opensearch/nodes/*
Then the OpenSearch process needs to be restarted on all nodes, one after another. This can be done by executing the following command:
monit start opensearch
To access the cluster again without the admin user, a new service key needs to be created. Furthermore, all applications need to be bound again, as the old users were deleted along with the indices.
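On the Application Developer level this typically involves the cf CLI; the service instance, key and application names below are placeholders:
cf create-service-key <service-instance-name> <service-key-name>
cf unbind-service <app-name> <service-instance-name>
cf bind-service <app-name> <service-instance-name>
cf restage <app-name>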
Afterwards, the restore of the backup needs to be initiated. This can be done through the Service Dashboard as described in the Restore a Backup documentation.
In case the OpenSearch dashboard needs to be accessed to check the health and data of the cluster, make sure to delete the cookies in your browser. This ensures that outdated credentials cached within the cookies are removed.