Sunday 16 December 2018

Elasticsearch Monitoring and Troubleshooting Notes



 Get the cluster health status

curl -XGET http://localhost:9200/_cluster/health?pretty

Get shard details

curl -XGET http://localhost:9200/_cat/shards

Force reroute an identified shard to a specific node

         Usually required to force-assign unassigned_shards to a node
for shard in $(curl -s -XGET http://localhost:9200/_cat/shards | grep UNASSIGNED | awk '{print $2}'); do
  # column 2 of _cat/shards is the shard number; the index name below must match the index it belongs to
  curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
    "commands" : [ {
      "allocate" : {
        "index" : "index_taken_from_step_2",
        "shard" : '"$shard"',
        "node" : "datanode",
        "allow_primary" : true
      }
    } ]
  }'
  sleep 5
done

All shards on a node are stuck in the INITIALIZING state

        Restarting the box may be the only solution.
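
One way to confirm that shards on the node are stuck (a minimal sketch; the systemd service name is an assumption that depends on how Elasticsearch was installed):

# list shards stuck in INITIALIZING together with the node they sit on
curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,node' | grep INITIALIZING

# restart Elasticsearch on the affected box (systemd-based install assumed)
sudo systemctl restart elasticsearch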


Troubleshooting UnavailableShardsException

a. The index folder was deleted
     Recreate the index (see the example after this list)

b. The node hosting the primary shard left the cluster
     Reboot the node
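
For case (a), a minimal sketch of recreating the index; the index name, shard count and replica count are placeholders and should match the original index (mappings omitted):

curl -XPUT 'localhost:9200/index_name' -d '{
  "settings" : {
    "number_of_shards" : 5,
    "number_of_replicas" : 1
  }
}'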


Fixing the UNASSIGNED shard error

Look up the faulty shard

 curl -XGET 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED

  For ES 5+:
  curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty'

If it looks like the unassigned shards belong to an index you thought you deleted already,
or an outdated index that you don’t need anymore,
then you can delete the index to restore your cluster status to green:

curl -XDELETE 'localhost:9200/index_name/'


Possible reasons for the unassigned shard error

Shard allocation is purposefully delayed
Too many shards, not enough nodes
You need to re-enable shard allocation
Shard data no longer exists in the cluster
Low disk watermark
Multiple Elasticsearch versions

Shard allocation is purposefully delayed


When a node leaves the cluster, the master node temporarily delays shard reallocation to avoid needlessly wasting resources on rebalancing shards, in the event the original node is able to recover within a certain period of time (one minute, by default). If this is the case, your logs should look something like this:

[TIMESTAMP][INFO][cluster.routing] [MASTER NODE NAME] delaying allocation for [54] unassigned shards, next check in [1m]
      You can dynamically modify the delay period like so:

curl -XPUT 'localhost:9200/<INDEX_NAME>/_settings' -d
'{
"settings": {
   "index.unassigned.node_left.delayed_timeout": "30s"
}
}'
       Replacing <INDEX_NAME> with _all will update the threshold for all indices in your cluster.

After the delay period is over, you should start seeing the master assigning those shards. If not, keep reading to explore solutions to other potential causes.
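
To confirm that the master is picking those shards up again once the delay expires, you can re-check the unassigned_shards counter from the cluster health API used at the top of these notes:

curl -s 'localhost:9200/_cluster/health?pretty' | grep unassigned_shards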

  Too many shards, not enough nodes

        The number of nodes must be at least the number of replicas + 1
Add new nodes or decrease the replica count (see the example below)
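
A sketch of lowering the replica count for a single index; <INDEX_NAME> and the count are placeholders, and this assumes you can tolerate fewer copies of the data:

curl -XPUT 'localhost:9200/<INDEX_NAME>/_settings' -d '{
  "index" : {
    "number_of_replicas" : 1
  }
}'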

You need to re-enable shard allocation

curl -XPUT 'localhost:9200/_cluster/settings' -d
'{ "transient":
{ "cluster.routing.allocation.enable" : "all"
}
}'
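
To double-check that allocation is enabled again, read the cluster settings back:

curl -XGET 'localhost:9200/_cluster/settings?pretty'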

Low disk watermark

Query:  curl -s 'localhost:9200/_cat/allocation?v'
Raise the low disk watermark (default is 85%) so shards can be allocated on fuller disks:
curl -XPUT 'localhost:9200/_cluster/settings' -d
'{
"transient": {
  "cluster.routing.allocation.disk.watermark.low": "90%"
}
}'
Change "transient" to "persistent" if the setting needs to survive a cluster restart.

   Shard data no longer exists in the cluster

For example, suppose primary shard 0 of the constant-updates index is unassigned. It may have been created on a node without any replicas (a technique used to speed up the initial indexing process), and the node left the cluster before the data could be replicated. The master detects the shard in its global cluster state file, but can’t locate the shard’s data in the cluster.

Another possibility is that a node may have encountered an issue while rebooting. Normally, when a node resumes its connection to the cluster, it relays information about its on-disk shards to the master, which then transitions those shards from “unassigned” to “assigned/started”. When this process fails for some reason (e.g. the node’s storage has been damaged in some way), the shards may remain unassigned.

In this scenario, you have to decide how to proceed: try to get the original node to recover and rejoin the cluster (and do not force allocate the primary shard), or force allocate the shard using the Reroute API and reindex the missing data using the original data source, or from a backup.

If you decide to allocate an unassigned primary shard, make sure to add the "allow_primary": "true" flag to the request:

curl -XPOST 'localhost:9200/_cluster/reroute' -d
'{ "commands" :
  [ { "allocate" :
  { "index" : "constant-updates", "shard" : 0, "node": "<NODE_NAME>", "allow_primary": "true" }
  }]
}'
Without the "allow_primary": "true" flag, we would have encountered the following error:

{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[NODE_NAME][127.0.0.1:9301][cluster:admin/reroute]"}],"type":"illegal_argument_exception","reason":"[allocate] trying to allocate a primary shard [constant-updates][0], which is disabled"},"status":400}
The caveat with forcing allocation of a primary shard is that you will be assigning an “empty” shard. If the node that contained the original primary shard data were to rejoin the cluster later, its data would be overwritten by the newly created (empty) primary shard, because it would be considered a “newer” version of the data.

You will now need to reindex the missing data, or restore as much as you can from a backup snapshot using the Snapshot and Restore API.
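
A minimal sketch of restoring the index from a snapshot; the repository name (my_backup) and snapshot name (snapshot_1) are assumptions, and the snapshot repository must already be registered:

curl -XPOST 'localhost:9200/_snapshot/my_backup/snapshot_1/_restore' -d '{
  "indices" : "constant-updates"
}'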


