Use-cases/Project
7 min read

Why is log analytics so important in a monitoring system?

A monitoring system is a necessary component of any data platform. There are many services on the market that approach the same problem differently, yet most of them provide similar features and, most importantly, work reliably. On top of metrics, there is also an opportunity to take advantage of reading logs. This is a really useful feature when debugging an issue in an application or checking what is happening on your platform. How can this be achieved?

Why do we need logs?

We need to start with some questions to explain why, and for what, monitoring systems with log analytics can be useful. The first is about performing comprehensive monitoring of every process on our platform. When talking about Big Data solutions, it is imperative to check that all real-time processing jobs work as expected, because we have to act quickly if there are any issues. It is also important to verify how changes in the code affect the processing itself.

Here we can talk about processing jobs that run on Apache Spark and Apache Flink. The first part of the monitoring process is focused on collecting metrics such as the number of processed events, JVM statistics or the number of Task Managers in use. The second part is log analytics. We want to detect any warnings or errors in the log files and analyze them later during a post mortem, or use them to find invalid data sources. Moreover, we can set up alerts based on the log files, which can be really helpful for detecting issues, even in a different component.

There is also a need to deliver all log files in real time, because any lag in shipping them causes problems and makes the logs far less useful to IT and business developers. In the case of a Flink job, we want to check that all triggers work as expected, and if not, we need to find the reason in the log files. We also want to be able to search the logs later for an exact phrase.

There are several solutions on the market, and we have tried many different approaches to find the one best tailored to our needs.

Elastic stack and friends

The most common solution when talking about log analytics is the Elastic Stack. We can use Elasticsearch for indexing, Logstash for processing the log files that Filebeat or Fluentd ship directly from the machines, and Kibana for data visualization and alerting. It is a really mature and well-developed platform with a lot of plugins available.
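As a rough illustration of that shipping step, a minimal Filebeat configuration that tails Flink logs and forwards them to Logstash could look like the sketch below; the log path and the Logstash host are example values, not taken from any particular deployment:

```yaml
# Minimal Filebeat sketch: tail Flink log files and ship them to Logstash.
# Both the path and the Logstash address are placeholders.
filebeat.inputs:
  - type: log
    paths:
      - /var/log/flink/*.log

output.logstash:
  hosts: ["logstash.example.internal:5044"]
```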

It is a great solution for indexing logs for business developers when you have to index the full content of the log files. We also need to keep the hardware requirements in mind: you can tune the parameters, but it still takes a large amount of CPU and RAM to run everything smoothly.

A great rival in the market

In one of our projects we ran the Elastic Stack: Filebeat, Logstash, Elasticsearch and Kibana. We were not able to make it faster, even after implementing some changes. The overall performance was not the best, so we started searching for a more powerful solution. Our case was focused on collecting logs from Flink jobs and NiFi pipelines, because we wanted to check what was inside those logs and find specific values in the historical data.

We already have a monitoring system based on Prometheus and Grafana, so we started by searching for available solutions that would provide better performance and let us add log analytics directly in Grafana.

[Diagram: monitoring setup with Prometheus, Elastic and Grafana]

Then we decided to test Loki. It is a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. It is designed to be very cost-effective and easy to operate: it does not index the contents of the logs, but rather a set of labels for each log stream. The project was started in 2018 and is developed by Grafana Labs so, as you might expect, Loki data can be queried in Grafana, which turns out to be very useful.
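Hooking Loki into an existing Grafana instance is just another data source. A minimal provisioning sketch, assuming Loki listens on its default port 3100 on a host we simply call loki, could look like this:

```yaml
# Example Grafana data source provisioning file for Loki.
# The URL is a placeholder for wherever Loki is reachable in your setup.
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```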

Loki in practice

In one project we started the migration from the ELK stack to Loki on the development environment, keeping both log analytics tools running simultaneously. The first great thing about Loki is its simple configuration file (a minimal sketch is shown below). The second is the overall simplicity of the installation process. We prepared Ansible roles for installing Promtail and Loki, and added some Molecule tests, which could be implemented easily.
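To give an idea of what that configuration looks like, here is a minimal single-node sketch with local storage; exact field names can differ between Loki versions, so treat it as an illustration rather than a drop-in file:

```yaml
# Single-node Loki with a local BoltDB index and filesystem chunk storage.
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m

schema_config:
  configs:
    - from: 2020-01-01
      store: boltdb            # index on local disk
      object_store: filesystem # chunks on local disk
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: /var/loki/index
  filesystem:
    directory: /var/loki/chunks
```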

We then decided to implement the Loki-based solution on the production environment, where we process several hundred thousand events per second, so it was the best place to test it with huge, real-time data processing pipelines. The overall experience was great: Promtail and Loki can handle all the traffic and we can deliver all the log files in near real time with no data loss, which is crucial.

Promtail can be installed on any server and can also be easily deployed on Kubernetes pods. We can relabel log streams directly on the machine, or use regular expressions to reduce the number of log lines that are sent (see the sketch below). This shows that we can adjust Promtail to suit our needs.
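A hedged sketch of such a Promtail configuration follows; the job name, paths, labels and the DEBUG filter are only examples, and pipeline stage behaviour can vary slightly between versions:

```yaml
# Promtail: tail Flink logs, attach labels, drop DEBUG lines, push to Loki.
server:
  http_listen_port: 9080

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: flink
    static_configs:
      - targets: [localhost]
        labels:
          job: flink
          env: production
          __path__: /var/log/flink/*.log
    pipeline_stages:
      # Drop noisy DEBUG lines before they are shipped to Loki.
      - match:
          selector: '{job="flink"} |= "DEBUG"'
          action: drop
```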

[Diagram: Promtail in a Prometheus-based Big Data monitoring setup]

Loki provides LogQL for running queries on logs. It is really useful that its syntax is similar to PromQL, so most users can run queries with no issues. Moreover, Grafana supports panels based on the number of occurrences of a searched phrase in the logs, which is helpful because we can subsequently add alerts on top of them too. This feature is really welcomed by all kinds of developers, because they can correlate metrics with the content of the log files on a dashboard in a single tool, making their work really efficient.
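For example, assuming the labels we attached in the Promtail sketch above, a plain search and a metric query suitable for such a panel or alert could look like this:

```logql
# All error lines from the Flink job logs
{job="flink", env="production"} |= "ERROR"

# Number of error lines in the last 5 minutes, usable in a Grafana panel or alert
count_over_time({job="flink", env="production"} |= "ERROR" [5m])
```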

[Screenshot: LogQL query on Big Data logs in Grafana]

What about high availability? A default setup can be installed on a single virtual machine with all data saved locally. This is the most basic configuration; if we want high availability, we need to set up object storage (for the chunks) and a key-value database (for the index), such as Cassandra. Here we can use Google Cloud Storage with Google Bigtable, or Amazon S3 with Amazon DynamoDB.
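On AWS, for instance, this boils down to pointing the schema and storage sections at S3 and DynamoDB, roughly as in the sketch below; the bucket, region and store names are placeholders and should be checked against the Loki version in use:

```yaml
# Sketch of a Loki schema/storage section using S3 for chunks and DynamoDB for the index.
schema_config:
  configs:
    - from: 2020-04-01
      store: aws            # DynamoDB-backed index
      object_store: s3      # chunks in S3
      schema: v11
      index:
        prefix: loki_index_
        period: 168h

storage_config:
  aws:
    s3: s3://eu-west-1/loki-chunks-bucket   # placeholder region/bucket
    dynamodb:
      dynamodb_url: dynamodb://eu-west-1    # placeholder region
```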

[Diagram: log analytics with Loki and Grafana]

Monitor everything

A monitoring system is a must in any data platform, or any other IT service. It provides knowledge about the current state of all processes, and some issues can even be resolved automatically. We can trigger actions in case of problems, for example restarting a Flink job when the metric showing the number of Task Managers indicates that it is down. If automatic recovery is not possible, we get an alert. As described in the book Site Reliability Engineering:

‘Alerts signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation’.
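Such an alert can be expressed as a Prometheus alerting rule. The sketch below assumes Flink's Prometheus reporter exposes the number of registered Task Managers and that the expected count is four; both are illustrative assumptions rather than values from our setup:

```yaml
groups:
  - name: flink-alerts
    rules:
      - alert: FlinkTaskManagersMissing
        # Metric from Flink's PrometheusReporter; adjust the expected count to your job.
        expr: flink_jobmanager_numRegisteredTaskManagers < 4
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Flink job is running with fewer Task Managers than expected"
```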

We can make our platform more stable, and we can check how our enhancements behave in the development, staging and production environments by looking at just one tool. This is great for both administrators and developers.

big data
technology
prometheus
Grafana
13 April 2020
