Logs can provide a lot of useful information about the environment and the status of an application, and they should be part of our monitoring stack. We'll discuss how metrics enriched with logs become a valuable source of truth about the platform, and then we'll talk about observability, which has become the key to running a stable, efficient and continuously working service in any environment. This is the key to making our platform fully observable.
Observability is quite similar to DevOps in that it is not limited to technology: it also covers organizational culture and approach. The concept of observability is prominent in the DevOps approach because, as described, the goal of monitoring is not only to collect and process logs and metrics - the system should also expose information about its own state so that it can be observed. That is what we call observability; a good synonym would be understandability.
Let’s take a look at a Kubernetes-native solution that we can easily install on any cloud or on-premises infrastructure.
Managed vs. self-hosted solutions
The first issue that arises when talking about the use of the public cloud is choosing the right kind of solution: a fully managed service provided by the cloud provider, or a self-managed application. Each of the main public cloud providers delivers its own solution for collecting and analyzing logs from our applications and infrastructure: Google Cloud has Cloud Logging, AWS has CloudWatch Logs and Microsoft Azure has Azure Monitor. Initially, it looks great: the service is fully managed, we don't need to worry about scalability, and we can easily integrate it with any cloud service or with our own applications. Unfortunately, such an approach does not provide much freedom in terms of configuration and, in many cases, it can incur quite high costs.
This is where self-managed log analytics tools come into play. The most popular one is the Elastic stack, but there are also some great alternatives. The most interesting one, which we use in multiple projects for different customers at GetInData, is Loki, made by Grafana Labs - not to mention Graylog, Datadog, LogDNA and Sumo Logic.
The second issue is the volume of logs: how many logs are produced, how many of them we actually need in our platform, and how this volume looks from the system's perspective. It is necessary to plan the infrastructure, configure each log pipeline and estimate the costs - the last point being especially important when deciding to go with cloud managed services.
The third issue is about visualisation and alerting.
To summarise this part of the article, let’s analyze the following areas:
Number and size of the logs that will be sent to the system
Can we filter logs at the source (for example, not sending all logs at INFO level) to reduce the amount of data sent? A sketch of such source-side filtering follows this list.
High Availability of the system
The age of the data we would like to run queries on.
The length of time we need to store these logs.
Can we use our current visualisation tool or do we need to install an additional application?
How can we manage access to the logs?
How can we provide alerts based on the content of the logs?
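As referenced above, one way to filter logs at the source is to drop unwanted lines in the log shipper itself, before they ever reach the log analytics system. Below is a minimal sketch of a Promtail pipeline fragment that drops debug-level lines; the regular expression is only an assumption about the log format and would need to match your own applications.

```yaml
# Fragment of a Promtail scrape config (sketch): drop DEBUG lines before they leave the node
pipeline_stages:
  - drop:
      expression: ".*level=(debug|DEBUG).*"   # lines matching this regex are never sent to Loki
```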
Use case: Grafana Loki in the cloud
Loki is a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. It was designed to be very cost-effective and easy to operate. It does not index the contents of the logs, but rather a set of labels for each log stream. The project was started in 2018 and is developed by Grafana Labs, so as you may expect, we can query Loki data in Grafana, which happens to be very useful.
Currently, the latest release is 2.2.0, and the project is being developed quickly, with new features and enhancements arriving regularly - which is crucial when choosing the right tool.
Logs ingestion
Loki is responsible for aggregating logs and running queries on them, yet it still requires an external application to deliver logs to it. The first way is to add a dedicated pipeline to the application, from which we can push logs to Loki directly; the second, recommended and most widely used way, is to use a dedicated agent such as Promtail (Loki's own agent), Fluentd or Fluent Bit.
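As a rough sketch of the second approach, a minimal Promtail configuration only needs to know where to push logs and what to scrape. The Loki URL, namespace and label names below are assumptions and will differ per environment.

```yaml
# Minimal Promtail configuration (sketch, values are illustrative)
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml                      # where Promtail remembers how far it has read
clients:
  - url: http://loki-gateway:3100/loki/api/v1/push   # assumed Loki push endpoint
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                                    # discover pods running in the cluster
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app                            # becomes a Loki label we can query on
```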
Two modes of Loki - installation and configuration
Loki can work in two different modes: monolithic and microservice.
The first one is a great way to start the journey with Loki, or for a platform where we don't expect a high log load, as the setup is simple and most users can get it running with no major issues. On the other hand, Loki can be run as a set of microservices, which is the key to making the log analytics platform easily and horizontally scalable (depending on the infrastructure it is installed on). The mode is selected per process, as sketched below.
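Assuming the standard loki.yaml layout, the top-level target setting controls which components a given process runs; in monolithic mode everything runs in a single binary, while in microservice mode each deployment runs one component.

```yaml
# loki.yaml fragment (sketch)
# Monolithic mode: one process runs every component
target: all

# In microservice mode, each deployment would instead run a single component, e.g.:
# target: distributor
# target: ingester
# target: querier
```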
Installing Loki can be easily achieved by using the official Helm chart maintained by Grafana Labs. We can customize the values file, add our own configuration and quickly deploy it to the target environment.
The best option for installing Loki in microservice mode is Kubernetes - each public cloud provider delivers its own managed Kubernetes service, like AWS EKS, Azure AKS or Google GKE. These are the main components of Loki:
Distributor - responsible for handling incoming log streams sent by clients.
Ingester - responsible for writing log data to long-term storage backends on the write path and returning log data for in-memory queries on the read path.
Querier - handles queries using the LogQL query language, fetching logs both from the ingesters and long-term storage.
(Optional) Query frontend - provides the querier’s API endpoints and can be used to accelerate the read path.
To deploy multiple ingesters it is necessary to use etcd, Consul or memberlist: one of these backs the hash ring that is used to shard series/logs across the ingesters.
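For example, when memberlist is used as the ring backend, the relevant fragment of the Loki configuration could look roughly like this; the headless service name is an assumption.

```yaml
# Loki configuration fragment (sketch): ingesters join a hash ring via memberlist
memberlist:
  join_members:
    - loki-memberlist.logging.svc.cluster.local   # assumed headless service used for gossip
ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist        # could also be consul or etcd
      replication_factor: 3      # each stream is written to three ingesters
```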
The next requirement of Loki is storage. Fortunately, since the release of v1.5.0, we can use object storage alone instead of mixing object storage with a key-value database (like Cassandra or Google Cloud Bigtable), which makes the whole platform cheaper and easier to maintain. We need to create a new bucket in AWS S3, Google Cloud Storage or Microsoft Azure Blob Storage, set the standard storage class, add the required permissions to the IAM user/role used by Loki, and that's all - we can then start Loki.
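A rough sketch of what such an object-storage-only setup looks like in the Loki configuration, here with the boltdb-shipper index store and an S3 bucket; the bucket name, region and paths are assumptions.

```yaml
# Loki storage configuration fragment (sketch): index and chunks both end up in object storage
schema_config:
  configs:
    - from: 2021-01-01
      store: boltdb-shipper          # index files are shipped to object storage, no key-value database needed
      object_store: s3
      schema: v11
      index:
        prefix: loki_index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
    shared_store: s3
  aws:
    s3: s3://eu-west-1/my-loki-logs  # assumed region and bucket; credentials come from the IAM role
```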
Moreover, if we need queries to be as fast as possible, we can add volumes to our Loki deployment so that the querier can cache query results on local storage, reducing the time spent running the same query again.
Alerting out of the box
Loki includes a component called the Ruler that is responsible for continually evaluating a set of configurable queries and alerting when certain conditions occur, e.g. a high percentage of error logs. It can then send an event to Alertmanager, from which the alert can be forwarded to email or a Slack channel.
Ruler supports object storage or local storage to store its state. It's important to mention that Ruler is also horizontally scalable. Similar to the ingesters, the Rulers establish a hash ring to divide up the responsibilities of evaluating rules.
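Below is a hedged sketch of a rule file the Ruler could evaluate - the label selector, threshold and severity are illustrative assumptions, not values from a real deployment.

```yaml
# Loki Ruler rule group (sketch): alert when the error-log rate stays high for five minutes
groups:
  - name: example-app-alerts
    rules:
      - alert: HighErrorLogRate
        expr: sum(rate({app="example-app"} |= "error" [5m])) > 10   # LogQL metric query
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "example-app is logging errors at a high rate"
```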
One single place to see everything
One of the most important facts about Loki is that it is supported by Grafana. We can configure it in a few simple steps, create dashboards showing the number of occurrences of an error or of a given piece of information, set up alerts from Grafana, or combine such a panel with Prometheus metrics from, say, a Flink job. This is a great opportunity to create a complex dashboard that makes the platform fully observable. It can also be useful for creating a self-healing platform - an action can be triggered based on the content of the logs.
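As an illustration, a Grafana panel backed by the Loki data source could graph the number of error lines over time with a LogQL metric query along these lines; the app label is an assumption.

```
sum(count_over_time({app="flink-job"} |= "ERROR" [5m]))
```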
Simplicity vs. performance
Loki doesn’t require too many resources, especially when compared to the Elastic stack. Unfortunately, this comes at the cost of query speed, which is not ideal - and it is the main reason why Loki is a great tool for developers who want to understand the logs of their applications, rather than for running business analyses based on logs.
Data in Elasticsearch is stored on disk as unstructured JSON objects, and both the keys of each object and the contents of each key are indexed. In Loki, logs are stored in plaintext form and tagged with a set of label names and values, and only the label pairs are indexed. This trade-off makes it cheaper to operate than a full index and allows developers to aggressively log from their applications.
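In practice this means a Loki query first narrows the search using the indexed labels and only then scans the matching chunks for text, for example (the labels and filter text are assumptions):

```
{namespace="production", app="payment-service"} |= "connection timeout"
```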
A simple, well-performing log analytics tool
Loki seems to be the most interesting platform for technical log analytics, and it’s an open-source project. We can simply install it on any available Kubernetes cluster in any environment with object storage, or even on a virtual machine, while its features meet production requirements such as high availability, alerting and data visualization in a tool that supports access management.
At GetInData, we have evaluated multiple configuration setups and we really know how to create a valuable, well-performing and scalable platform for log analytics. If you want to know more, do not hesitate to contact us.
big data
analytics
monitoring system
Grafana
Loki
22 April 2021