ElasticSearch broke all my nice things (a story of cascading failure)

About two weeks ago, I upgraded my single-node Elasticsearch cluster from 6.8.6 to the latest 7.9 release. Last night, all hell broke loose…

The upgrade itself wasn’t perfect. There were some issues with my setup that the helpful “Upgrade Assistant” failed to flag before I had already committed. I was missing a few formerly optional parameters in my elasticsearch.yml config file, some odd field mappings were no longer supported, and my grok scripts had date format issues. But after a few head scratches and some furious keyboard abuse, my system was back up and running, better than ever!
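To give a flavour of the kind of change involved (this is an illustration, not a list of exactly what bit me): Elasticsearch 7.x wants an explicit discovery configuration on any node bound to a real network interface, something 6.x never demanded of a lone node. On a single-node setup that can mean adding a line like this:

# Hypothetical example: tell a standalone 7.x node not to look for peers.
# discovery.type is a real setting; whether you need it depends on your bind address.
echo 'discovery.type: single-node' | sudo tee -a /etc/elasticsearch/elasticsearch.yml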

Well, as it turns out, it wasn’t exactly in good shape. In the name of progress, Elasticsearch feverishly deprecates previously recommended options, and a perfectly valid configuration from only a few versions ago can turn out to be ground zero for a complete showstopper.

In my case, the issue was that the index template created by Logstash (a very long time ago) was no longer valid because of a breaking change. The reason this issue took so long to surface is that this particular index rotates on a monthly basis to keep the shards nice and big. During the upgrade process, there was no warning or error to suggest that anything was wrong.
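If you want to check whether the same landmine is waiting for you, the legacy template API will show exactly what Logstash installed. Something along these lines should do it (the template name logstash matches the one I delete further down and may differ in your setup):

# Dump the index template Logstash created and look for the deprecated
# _default_ mapping type, which Elasticsearch 7.x rejects outright
curl -s 'localhost:9200/_template/logstash?pretty' | grep -B2 -A2 '_default_'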

The next morning, I noticed that there weren’t any new security alerts. That was strange, because there’s always some new and exciting Mirai botnet variant poking around. Later that day I noticed that my Nextcloud was unresponsive when uploading photos. By the evening when work was done, I did some investigation. What I found was that Logstash and Elasticsearch had broken pretty much everything.

The first clue was the steadily increasing storage chart for my elastic virtual machine, going from about 84% to 100% full in only a couple of hours. Curious, I ssh’ed in to check. What I found was some absolutely monstrous syslog files, possibly some of the largest I’ve ever seen:

$ ls -lah /var/log/syslog*
-rw-r----- 1 syslog adm  11G Sep  1 17:40 /var/log/syslog
-rw-r----- 1 syslog adm  14G Sep  1 06:25 /var/log/syslog.1

Using less, I looked into each file. Inside both were messages like this one, duplicated millions and millions of times:

Sep  1 06:25:04 elastic logstash[873]: [2020-09-01T06:25:04,052][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-2020.09", :_type=>"_doc", :routing=>nil}, #<LogStash::Event:0x109be9b2>], :response=>{"index"=>{"_index"=>"logstash-2020.09", "_type"=>"_doc", "_id"=>nil, "status"=>400, "error"=>{"type"=>"illegal_argument_exception", "reason"=>"[_default_] mappings are not allowed on new indices and should no longer be used. See [https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-7.0.html#default-mapping-not-allowed] for more information."}}}}
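If you want a rough sense of the scale, a quick grep over the syslog files gives a count per file (the exact message text may vary between Logstash versions):

# Count how many times the failed-index warning appears in each file
grep -c 'Could not index event to Elasticsearch' /var/log/syslog /var/log/syslog.1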

I surmised that each of these log entries was created by one attempt to create the new index. Each attempt to create the new index was caused by one event processed by Logstash. And, of course, each event was a syslog message, an httpd log, or a Suricata network event. Naturally, I have many systems that auto-update at midnight on the 1st of the month and then reboot. Each of these triggers quite a bit of network traffic and therefore many Suricata logs. Thus the perfect storm was created: dozens of devices updating at once generated billions of error logs that filled up my logging server.

After reading the linked document, I determined that the fastest way to get back up and running was to stop the services, remove the index template that Logstash had created (Elasticsearch has to be back up to accept the deletion), and then let Logstash regenerate it on its first startup. It was a fairly quick fix:

sudo systemctl stop logstash
sudo systemctl stop elasticsearch

sudo systemctl start elasticsearch
sleep 120  #coffee break

# Elasticsearch needs to be running to accept the DELETE; with Logstash
# still stopped, remove the stale template so it gets regenerated on startup
curl -X DELETE localhost:9200/_template/logstash

sudo systemctl start logstash
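If you’re paranoid (and after a night like that, you should be), a couple of quick queries against the REST API will confirm the state of things, assuming the default localhost:9200 binding:

# Cluster health should come back green, or yellow on a single-node
# cluster with unassigned replica shards
curl -s 'localhost:9200/_cluster/health?pretty'

# Once Logstash has reconnected, the regenerated template should no
# longer contain a _default_ mapping (expect this to print 0)
curl -s 'localhost:9200/_template/logstash?pretty' | grep -c '_default_'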

Soon after, the system was back up and running. However, it turned out that there was more damage than I first thought.

In addition to the Suricata network IDS, I also use the Wazuh host IDS. Wazuh is based on OSSEC, a trusty alerting engine for the paranoid and security-minded sysadmin. Part of OSSEC’s functionality is to read system logs in real time and detect potentially malicious activity. Normally this is a very useful feature, but that fateful night it caused the death of a second system.

You see, each time a message containing the word “error” is logged, an event is forwarded to the OSSEC server. This is a good thing for it to do, since errors are usually worth knowing about. In this case, there were billions of errors, and the effort of decoding and indexing all of these events caused tremendous IO load on both the OSSEC server VM and the underlying KVM hypervisor. By about 09:00, that disk had also filled up with logs.

Now, most homelabbers thin provision. It’s the dirty secret we tell ourselves is okay. In my case, both of my KVM hosts were overprovisioned. When the syslog hurricane hit, the two bloated VMs filled the storage on their respective hosts, and every other VM locked up once its block devices were remounted read-only. Once I had deleted some old indices and syslogs, I was able to shrink both qcow2 disks using qemu-img, deflating the images and unlocking IO for the rest of my machines.
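For anyone who hasn’t had to do this before, the shrink itself is conceptually simple: with the VM powered off, rewriting the qcow2 image drops the unallocated and zeroed clusters. The paths below are illustrative rather than my actual layout, and depending on your storage setup you may need to trim or zero the freed blocks inside the guest first (fstrim or similar) before the space can actually be reclaimed:

# See how far the image has grown on disk compared to its virtual size
qemu-img info /var/lib/libvirt/images/elastic.qcow2

# With the VM shut down, rewrite the image; qemu-img convert skips
# unallocated and zeroed clusters, so the copy comes out smaller
qemu-img convert -O qcow2 /var/lib/libvirt/images/elastic.qcow2 /var/lib/libvirt/images/elastic-shrunk.qcow2
mv /var/lib/libvirt/images/elastic-shrunk.qcow2 /var/lib/libvirt/images/elastic.qcow2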

By the end of this ordeal, I’d learned some important lessons about overprovisioning and the very limited performance of cheap Celeron NUCs. I’ve probably done permanent damage to the cheap off-brand SSDs as well, but that was always a risk I was willing to take.

But really, my frustrations are directed at the Elastic Stack. I don’t understand why functionality that was once the default needs to be deprecated so quickly, and I’m very frustrated by the lack of error rate limiting in Logstash. All of this could have been avoided if the Elastic services focused more on stability than on flashy features that most deployments don’t use.

I’m considering moving away from the ELK stack after this. It’s caused too many issues for what it gives me, and it requires more maintenance than most of the actual applications I self-host.

As somebody in the software industry said, “your competition isn’t competitors, it’s Microsoft Excel”. I think in this case, the competition isn’t another log server, it’s grep itself.