Fix an Oversharded Elasticsearch Cluster

TL;DR

The default settings for Logstash index rotation are bad and will break your cluster after a few months unless you change the rotation strategy.

If you’re anything like me, you probably read somebody’s cool blog about how awesome ELK stack is and just had to have a piece of it. So you went through the quick start guide, googled your way through getting it up and running, then BAM you had an awesome logging system with all the bells and whistles! And at first it was awesome, you could do a keyword search across millions of logs from hundreds (or tens) of servers, make cool dashboards about which rogue countries are trying to view /wp-admin on your precious servers, even hook it into an alerting system…. But then things weren’t so good….

One day you look at the monitoring dashboard for your cluster and notice that things are a little off. For one thing, the load average is >0.5 on a server that brings in only a couple thousand logs per day!? How can this be happening? The old rusty but trusty RSyslog server never did this!?

Well, the answer lies in the interaction of ElasticSearch’s allocation of indices and shards.

Indices and Shards

When data is stored in ElasticSearch, it is made up of Indices. Each index is a group of documents, the fundamental unit of storage in ES. Since you should, ideally, have multiple servers in a cluster, each index is broken up into multiple shards. These shards can be primary or secondary depending on the state of your cluster, if it is healthy it will only interact with the primaries but if it is in recovery it will use the secondary shards to rebuild the indices. Not only do shards help with failover, they also help with performance (sometimes).

If you have 10 servers in your cluster, and all of todays logs are stored in one index with one primary and one secondary shard, it will only ever touch one server at a time. That means that when somebody runs a report about web traffic today, all that load will hit only one server. By sharding the index, you can multiply the performance by reading from all the servers with shards of that index.

And here lies the issue: By default, elasticsearch is configured to run on a very large cluster with very large datasets. But if you run a typical logging serer usecase, this will suffer greatly over time.

On a default install of ELK, each new index has 1 secondary (replica), with 5 shards. Now that’s great if you’re running a massive cluster, but if you’re only bringing in about 100MB per day of data it’s very overkill. Elastic.co recommends that shards be about 1-50GB in size for efficient storage, and indexes be rotated when the shards reach about that size. And of course, you shouldn’t have more shards than servers in the cluster because it will throw out any kind of load balancing algorithm that ES will perform in the background.

Breaking down over time

What I’ve experienced many times now is that ElasticSearch will eventually degrade because of the default settings.

Since Logstash, Filebeat, etc. are creating new indexes every day, the number of allocated shards will grow dramatically over time. One of my single-node clusters was found to have over 850 open shards after just a few months without careful watching. In another case, a SIEM using ES to store alerts choked on 740 shards with only about 200MB of actual data stored!

Prevent your cluster from jamming up

Of course, this isn’t a general recommendation for everyone, rather for small clusters that see less than about a gigabyte per day of logs.

Simply change the index pattern for the elasticserach output for your Logstash server:

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logstash-%{+YYYY.MM}"
  }
}

This will change the config to rotate the logs monthly, giving you nice big shards that won’t start clogging up the server after a few months. Some may instead opt for weekly index rotation:

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logstash-%{+YYYY.ww}"
  }
}

Either way, it will allow you to retain many more logs before running into shard allocation issues.

Fixing a broken cluster

If the cluster was installed with default settings, and now months later is having issues with shard allocation, this is a method to restore cluster health.

Using the reindex API, we can squash the daily indices together into monthly rollups, then delete the daily indices to free up shards.

First, get a list of indices on the server:

$ curl -s localhost:9200/_cat/indices | grep 2019.09

green  open logstash-2019.09.10  5 0   98132 0 39.9mb 39.9mb
green  open logstash-2019.09.13  5 0  108087 0 42.3mb 42.3mb
green  open logstash-2019.09.22  5 0   94699 0 38.2mb 38.2mb
green  open logstash-2019.09.04  5 0   97855 0   40mb   40mb
green  open logstash-2019.09.17  5 0   96680 0 39.5mb 39.5mb
green  open logstash-2019.09.30  5 0   96478 0 38.7mb 38.7mb
green  open logstash-2019.09.20  5 0   96381 0 39.4mb 39.4mb
green  open logstash-2019.09.18  5 0   96429 0 39.5mb 39.5mb
green  open logstash-2019.09.19  5 0  100313 0 40.5mb 40.5mb
green  open logstash-2019.09.27  5 0   94546 0 36.5mb 36.5mb
green  open logstash-2019.09.28  5 0  104873 0 39.2mb 39.2mb
green  open logstash-2019.09.14  5 0   98297 0 40.9mb 40.9mb
green  open logstash-2019.09.06  5 0   95710 0   39mb   39mb
green  open logstash-2019.09.07  5 0   96077 0 39.3mb 39.3mb
green  open logstash-2019.09.25  5 0  103183 0 40.5mb 40.5mb
green  open logstash-2019.09.05  5 0   98573 0 40.7mb 40.7mb
green  open logstash-2019.09.29  5 0   96221 0 37.9mb 37.9mb
green  open logstash-2019.09.01  5 0   98182 0 41.2mb 41.2mb
green  open logstash-2019.09.09  5 0   97034 0 39.8mb 39.8mb
green  open logstash-2019.09.02  5 0   96864 0 39.6mb 39.6mb
green  open logstash-2019.09.08  5 0  114932 0 45.2mb 45.2mb
green  open logstash-2019.09.21  5 0   94980 0 38.6mb 38.6mb
green  open logstash-2019.09.11  5 0   98533 0 39.6mb 39.6mb
green  open logstash-2019.09.15  5 0  103652 0 44.5mb 44.5mb
green  open logstash-2019.09.23  5 0   94977 0 38.6mb 38.6mb
green  open logstash-2019.09.12  5 0   97144 0 39.3mb 39.3mb
green  open logstash-2019.09.16  5 0   98439 0 41.1mb 41.1mb
green  open logstash-2019.09.03  5 0   96099 0 39.6mb 39.6mb
green  open logstash-2019.09.26  5 0   92118 0 38.5mb 38.5mb
green  open logstash-2019.09.24  5 0   98305 0 40.2mb 40.2mb

Here we can see all 30 indexes produced last month. Each is allocated five shards, for a total of 150 shards. Each one will be a chunk of about <10MB on disk, far below the recommended shard size.

We can confirm the number of shards by using the cat api to list them:

$ curl -s localhost:9200/_cat/shards | grep '2019.09' | wc -l
150

To help our cluster out, we will reindex them together and conserve some shards. First, create a new index for the month of September:

$ curl -X PUT  "localhost:9200/logstash-2019.09"

Optionally set the replicas for this index. I have one server, so I will set it to zero to save space.

$ curl -X PUT "localhost:9200/logstash-2019.09/_settings?pretty" -H 'Content-Type: application/json' -d'
{
    "index" : {
        "number_of_replicas" : 0
    }
}
'

Next, run the reindex job to merge the indices:

$ curl -X POST "localhost:9200/_reindex?pretty" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "logstash-2019.09.*"
  },
  "dest": {
    "index": "logstash-2019.09"
  }
}
'

This will take all the indices that match the source query and merge them together into the destination. This will take some time, so I highly recommend running this command from screen or tmux to avoid having the session drop. When I merged these indices it took about 20 minutes on a low end Celeron CPU with an aging SSD, your mileage will vary.

While the job is in progress, you can check it from another terminal:

$ curl localhost:9200/_cat/tasks
...
indices:data/write/reindex  xxx:3439666 -      transport 1572045692559 23:21:32 13.8m     10.xx.yy.zz nodey
...

Finally, it will finish after some time.

Check the status of the index and make sure it appears correctly:

$ curl -s localhost:9200/_cat/indices | grep '2019.09 '
green open logstash-2019.09   5 0 2953793 0  1.3gb  1.3gb

When the reindex is done, it will output something like this:

{
  "took" : 1700736,
  "timed_out" : false,
  "total" : 2953793,
  "updated" : 0,
  "created" : 2953793,
  "deleted" : 0,
  "batches" : 2954,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}

Great! Next, we can remove the source indices.

$ curl -X DELETE "localhost:9200/logstash-2019.09.*"

And once again check the index list to see that it is the only one remaining:

$ curl -s localhost:9200/_cat/indices | grep 2019.09
green open logstash-2019.09  5 0 2953793 0  1.3gb  1.3gb

And of course, we can see that the number of shards is dramatically lower by comparing it to the 150 we got before squashing the indices:

$ curl -s localhost:9200/_cat/shards | grep '2019.09' | wc -l
5

Best of all is that the load average will quickly go back down to an acceptable level and stay there!

Hopefully, this information will help restore performance to somebody else’s cluster, and ideally stop these kinds of horrible performance issues from happening in the first place.

Since the release of 7.0, there are new features to help rotate and manage indices, but the defaults remain the same.