7 min read

Why are log analytics so important in a monitoring system?

A monitoring system is a necessary component of any data platform. We can find a lot of different services that use different approaches to the same thing, and all of them provide quite similar features and work reliably, which is most important. Moreover, there is also an opportunity to take advantage of reading logs. This is a really useful feature when debugging an issue in an application or check what is happening on your platform. How can this be achieved?

Why do we need logs?

We need to start with some questions to explain why and for what monitoring systems with log analytics can be useful. The first thing is about performing complex monitoring of each process in our platform. When talking about Big Data solutions, it is imperative to check that all real-time processing jobs work as expected, because we have to act quickly if there are any issues. It is also important to validate how any changes in the code help in the processing part.

Here we can talk about processing jobs that run on Apache Spark and Apache Flink. The first part of the monitoring process is focused on getting metrics like the number of processed events, JVM statistics or used Task Managers. The second is about log analytics. We want to detect any warnings or errors in the log files and analyze them later during post mortem or to find any invalid data sources. Moreover, we can set up alerts based on the log files that could be really helpful for detecting issues, even with a different component.

There is also a need to provide all log files in real time, because any lag in sending them can cause problems and would not provide the required effect for IT and business developers. In the case of a Flink job, we want to check that all triggers work as expected, and if not then we would need to find the reason for this in the log files. We want to find values in logs later by looking for an exact phrase.

There are several solutions on the market, and we have tried many different approaches to finding the one tailored to the needs of the service.

Elastic stack and friends

The most common solution when talking about log analytics is Elastic stack. We can use ElasticSearch for indexing, Logstash for processing log files that are sent by Filebeat or Fluentd from machines directly and Kibana for data visualization and alerts. It is a really mature and well-developed platform where you can find a lot of plugins.

It is a great solution for indexing logs for business developers when you have to index all the content of log files. We also need to remember about technical requirements. You can tune up the parameters, but it still requires a great amount of CPU and RAM to run everything smoothly.

A great rival in the market

We had Elastic stack in one project. We had Filebeat, Logstash, ElasticSearch and Kibana and we were not able to make it faster, even after implementing some changes. The overall performance was not the best and we therefore started searching for a more powerful solution. Our case was focused on getting logs from Flink jobs and NiFi pipelines because we wanted to check what was inside their logs and find some target values in the historical data.

We have a monitoring system based on Prometheus and Grafana. We started by searching for available solutions that would provide better performance, and we could add log analytics in the Grafana directly.


Then we decided to test Loki. It is a horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus. It is designed to be very cost effective and easy to operate. It does not index the contents of the logs, but rather a set of labels for each log stream. The project was started in 2018 and was developed by Grafana Labs so, as you may expect, we can query Loki data in Grafana, which happens to be very useful.

Loki in practise

We started migration from ELK to Loki stack on development environment in one project, staying with two simultaneously working log analytics tools The first great thing about Loki is the simple configuration file. The second is about the overall simplicity of the installation process. We prepared Ansible roles for Promtail and Loki installation, and added some Molecule tests that could be implemented easily.

We decided to implement the Loki-based solution on the production environment where we process over several hundred thousand events per second, so it was the best place to test it with huge, real-time data processing pipelines. The overall experience was great, because Promtail and Loki can handle all the traffic and we can deliver all the log files in near real-time with no data loss, which is crucial.

Promtail can be installed on any server and can also be easily installed on the Kubernetes pods. We can run some relabeling on the machine directly or use regex for reducing the number of sent log files. This shows that we can adjust Promtail to suit our needs.


Loki provides LogQL for running queries on logs. It is really useful that its syntax is similar to PromQL, so most users can run queries with no issues. Moreover, Grafana supports adding panels that are based on the number of searched phrases in logs, which can be helpful as we can subsequently add alerts for it too. This feature is really welcomed by all kinds of developers, because they can find any relationships between metrics and the content of the log files in the dashboard in one tool, making work really efficient.

getindata loki lokiql bigdata

What about high availability? A default setup can be installed on the virtual machine and all data can be saved locally. It is the most basic configuration and if we want to provide high availability, we need to set up S3 storage (storing chunks) and a key-value database (storing index) like Cassandra. Here we can use Google Cloud Storage with Google BigTable or Amazon S3 with Amazon DynamoDB.


Monitor everything

A monitoring system is a must in any data platform or any different IT service. It provides knowledge about a current situation with processes, and any issues can be resolved automatically. We can trigger some actions in case of problems, like restarting a Flink job if it is down based on the metrics that shows the number of Task Managers, for example. If not, then we would get an alert. As described in the book called Site Reliability Engineering:

‘Alerts signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation’.

We can make our platform more stable and we can check how our enhancements work in development, staging and production environments by only taking a look at one tool. This is great for administrators and developers.

big data
13 April 2020

Want more? Check our articles

Radio DaTa Podcast

Data Journey with Yetunde Dada & Ivan Danov (QuantumBlack) – Kedro (an open-source MLOps framework) – introduction, benefits, use-cases, data & insights used for its development

In this episode of the RadioData Podcast, Adam Kawa talks with Yetunde Dada & Ivan Danov  about QuantumBlack, Kedro, trends in the MLOps landscape e.g…

Read more
complex event processing apache flink

My experience with Apache Flink for Complex Event Processing

My goal is to create a comprehensive review of available options when dealing with Complex Event Processing using Apache Flink. We will be building a…

Read more
airbyte column selectionobszar roboczy 1 4

Less data, less problems: Airbyte’s column selection is finally here

The Airbyte 0.50 release has brought some exciting changes to the platform: checkpointing (so that you don’t have to start from scratch in case of…

Read more
getindator man standing in front of a modern scheme showing mil 476f21ba 2f04 44d0 8c3b 8493e593b122

News Recommendation: the challenging area in building recommendation systems

Remember our whitepaper “Guide to Recommendation Systems. Implementation of Machine Learning in Business” from the middle of last year? Our data…

Read more
transfer legacy pipeline modern using gitlab cicd

How we helped our client to transfer legacy pipeline to modern one using GitLab's CI/CD - Part 3

Please dive in the third part of a blog series based on a project delivered for one of our clients. Please click part I, part II to read the…

Read more
backendobszar roboczy 1 2 3x 100

Data Mesh as a proper way to organise data world

Data Mesh as an answer In more complex Data Lakes, I usually meet the following problems in organizations that make data usage very inefficient: Teams…

Read more

Contact us

Interested in our solutions?
Contact us!

Together, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.

What did you find most impressive about GetInData?

They did a very good job in finding people that fitted in Acast both technically as well as culturally.
Type the form or send a e-mail:
The administrator of your personal data is GetInData Poland Sp. z o.o. with its registered seat in Warsaw (02-508), 39/20 Pulawska St. Your data is processed for the purpose of provision of electronic services in accordance with the Terms & Conditions. For more information on personal data processing and your rights please see Privacy Policy.

By submitting this form, you agree to our Terms & Conditions and Privacy Policy