Metrics Collector Journey
Heili, a software-as-a-service monitoring platform, processes a lot of metrics data. As we grew, the number of processed metrics grew as well. At some point we started to struggle with slow metrics collection and short but annoying data transfer blackouts. Those blackouts lasted only a split second each, and we determined that they were somehow related to the data transfer from the queue server to the database. A drill-down pointed to Logstash, the component we used for that transfer. Before I elaborate, here is a short overview of the Heili architecture.
Architecture and background
Heili runs in a managed Kubernetes cluster (GKE) on Google Cloud Platform. Most of the nodes in the cluster are preemptible VMs, which have a maximum life span of 24 hours. Using preemptible instances allows us to significantly reduce the cost of running in a cloud environment.
Heili collects metrics using either the Telegraf agent or a Prometheus client. The metrics are received by RabbitMQ and stored in Elasticsearch. Using Elasticsearch allows us to get better insights and train anomaly detection models, which in turn increases the reliability of our clients' applications.
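As an illustration of the ingestion side, Telegraf can publish its metrics to RabbitMQ with its built-in AMQP output plugin. The broker address and exchange name below are placeholders, not our actual setup:

```toml
# Telegraf output sketch: publish collected metrics to RabbitMQ as JSON.
# Broker address and exchange name are illustrative placeholders.
[[outputs.amqp]]
  brokers = ["amqp://rabbitmq.example.internal:5672"]
  exchange = "metrics"
  data_format = "json"
```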
Initial design and the collector
Heili is a real-time monitoring platform. Any and all metrics from the queue must be stored in Elasticsearch in real time. Any delay in processing can cause false alarms or, worse, cause a real problem on a monitored system to be missed. Delivering those metrics from the queue is the job of the collector component. We started by testing two collectors: Logstash and Fluentd. Both are open source, have big communities and a lot of plugins. Most importantly, our team had experience with both of them under heavy load from previous projects. (There are other collectors on the market, but with minimal time to market, we decided to test only components we had experience with.)
Logstash is developed by Elastic, so it has native support for Elasticsearch and a built-in plugin for RabbitMQ. Using a simple configuration for input, parsing and output, we had our metrics stored in Elasticsearch within a couple of minutes. It was part of our original design.
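For reference, a minimal Logstash pipeline of the kind described here might look like the sketch below; the host names, queue name and index pattern are placeholders rather than our production values:

```conf
# Minimal Logstash pipeline sketch: RabbitMQ in, JSON parse, Elasticsearch out.
# All connection details are illustrative placeholders.
input {
  rabbitmq {
    host    => "rabbitmq.example.internal"
    queue   => "metrics"
    durable => true
  }
}
filter {
  json {
    source => "message"
  }
}
output {
  elasticsearch {
    hosts => ["http://elasticsearch.example.internal:9200"]
    index => "metrics-%{+YYYY.MM.dd}"
  }
}
```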
Fluentd was developed by Treasure Data, but it is now a CNCF project. It has few built-in plugins but dozens of community ones that can be easily added. For us this meant that we could not use the official Docker image and had to build our own with the plugins we need pre-installed. The process of creating one is simple and straightforward. Its configuration is as simple as Logstash's.
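Such a custom image can be a thin layer over the official one, following the plugin-installation pattern from the official image's documentation. The base image tag and the plugin list below are illustrative:

```dockerfile
# Custom Fluentd image sketch with extra plugins pre-installed.
# Base image tag and plugin list are illustrative.
FROM fluent/fluentd:v1.16-1
USER root
RUN gem install fluent-plugin-elasticsearch fluent-plugin-amqp --no-document
USER fluent
```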
With both collectors running in the test environment, we started to increase the metrics load. At first both collectors behaved great, but at some point Fluentd started to skip metrics. The reason was the format of the metrics we were receiving. While some of them were plain JSON, others were collections of JSON documents (not a JSON list, just JSON documents separated by newlines). Logstash's native JSON parser had no issues there, but Fluentd's did not handle them well at all. Another issue was the metrics timestamp: not all of them were read correctly by Fluentd. We considered both issues critical, so we stopped the testing.
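To make the format issue concrete, here is a sketch (the payload is a made-up example) of a single queue message carrying several newline-separated JSON documents, and why a parser that expects one JSON document per message fails on it:

```ruby
require 'json'

# One message body carrying several JSON documents separated by newlines
# (not a JSON array) -- the payload itself is a made-up example.
payload = <<~MSG
  {"metric":"cpu.usage","value":42.5}
  {"metric":"mem.used","value":1024}
MSG

# Parsing the whole body as a single JSON document fails:
begin
  JSON.parse(payload)
rescue JSON::ParserError
  puts "single-document parse failed"
end

# Parsing line by line works:
records = payload.each_line.map { |line| JSON.parse(line) }
puts records.length  # => 2
```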
Conclusion: in its default configuration, Logstash was the right choice for Heili.
Production in Scale
Our scale kept increasing and we started to have issues with Logstash. We use preemptible instances, and Logstash's startup time was slow. During instance changes, customers with larger volumes started to experience delays in collecting messages from the queue, and we started to get false alarms about missing data. Investigating this problem, we found that Logstash has a large memory footprint that was destabilizing the entire cluster. Having identified the performance issue, we increased the resource allocation. That appeared to help, but it was the right time to optimise.
Back to Fluentd
From personal experience, I knew that both of those issues could be solved by Fluentd. It is written in CRuby, so it has a smaller memory footprint and loads faster (compared to Logstash, which is built with JRuby and runs on the Java virtual machine).
Now we had time to build a better solution. We had to write a custom AMQP input plugin that could handle our special metrics format, solving the earlier JSON-collections issue.
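The heart of such a plugin is small. Here is a sketch of the parse routine (the function name is hypothetical, and the real plugin also has to handle per-record timestamps), assuming each message body may contain one or more newline-delimited JSON documents:

```ruby
require 'json'

# Hypothetical core of a custom parser: split the message body on
# newlines and emit one record per JSON document, skipping blank lines
# and collecting unparseable ones instead of dropping the whole message.
def parse_metric_payload(body)
  records = []
  errors  = []
  body.each_line do |line|
    line = line.strip
    next if line.empty?
    begin
      records << JSON.parse(line)
    rescue JSON::ParserError
      errors << line
    end
  end
  [records, errors]
end

records, errors = parse_metric_payload(%({"a":1}\n\n{"b":2}\nnot-json))
puts records.length  # => 2
puts errors.length   # => 1
```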
We were ready to start testing collectors again.
First test results were amazing:
Fluentd started in an average of 10.7 seconds (including the Docker image download), while Logstash took 69.5 seconds (storing its image in Google Container Registry, as we did for Fluentd's, might have shaved about 15 seconds off that average).
Memory usage: just look at the screenshot (both collectors consume data from identical queues, running on identical nodes with identical resource configurations):
Not everything is gold
As I mentioned before, in our case Fluentd's parsing is different from Logstash's. It shows in the CPU usage; see the graph. It is not a serious drawback, as it adds only a couple of milliseconds to the entire parsing process.
Logstash is really great; it can handle a really heavy load of big log messages. However, it looks like in a dynamic infrastructure such as Heili's, the load time becomes a real issue. In more static environments with predictable loads it does a really good job.
When you pick a collector for logs or any other data (both of these can pass data between a variety of sources and destinations), test it against your own use case. And don't be afraid to change the component if the use case changes.