Tutorial
11 min read

Flink on Kubernetes - how and why?

flink-on-kubernetes-how-why-getindata-albert-big-data-blog-macrovector

Flink is an open-source stream processing framework that supports both batch processing and data streaming programs. Streaming happens as data flows through the system with no compulsory time limitations in output.  Flink is used in a lot of projects, and we can see it in action in rule-based alerting, anomaly detection systems, web applications with personalized content or personalization of the offer in e-commerce. It can be integrated with the Hadoop ecosystem and Kubernetes, or it can run stand-alone.

Kubernetes has become the major player in the containerization field. It provides a lot of advantages and new challenges - we can call it the next step of IT evolution. Its maturity and main features allow more and more services to become available and to be deployed directly on Kubernetes. Apache Flink is a great example of such a service. We can find multiple ways to install and run it directly on Kubernetes which provide better scalability, easier implementation of Continuous Integration and Continuous Deployment, easier version upgrades and container reuse.

The most important difference between the solutions mentioned below is that they include different operators. Operators are software extensions to Kubernetes that make use of custom resources to manage applications and their components. Operators are used due to their flexibility (developers can create their own objects and simplify deployment) and offers the possibility to add custom logic, however operator preparation and validation is not simple - it requires a lot of work.

Note: Apache Flink by default exposes its own web UI with a description of the job, metrics, diagram of application, and information about TaskManagers.

Common aspects of Flink in Kubernetes

Let’s start with monitoring. We can set up the exporter, like the Prometheus exporter, add annotations to our pods and start scraping metrics from all the pods on which we run Flink jobs. It’s quite simple when using Prometheus and Kubernetes service discovery feature where we can define target annotations.

The second thing is High Availability. We need to start by connecting Flink jobs to object storage or HDFS in high availability in which we store data and Flink can save its checkpoints and savepoint. Moreover, Flink requires additional service to run two JobManagers to provide a high availability setup - here we can go with connecting them to Zookeeper or using the Kubernetes service to manage the active Flink instance.

Flink K8S operator

GitHub repository:
https://github.com/lyft/flinkk8soperator

lyft-flink-on-kubernetes-getindata-big-data

This service is developed by Lyft and is a great example of software that is constantly being improved by the community.  The operator acts as a control plane to manage the complete deployment lifecycle of the application. It is worth mentioning that the last version of this operator was released at the end of April 2020 with release v0.5.0. There is also a dedicated Slack workspace for the users. There is an annotation in the README file that the project is still in beta phase.

Installation and the upgrade process is quite simple. We can go with Kubernetes manifests or Helm chart - if we want to change something then we can go directly to GitHub to the fork repository and do what we want.

The CICD process is quite simple. We can use Kubernetes manifests or Helm charts with all updates and configuration there. Here, we use custom definitions managed by the operator, like the JobManager Deployment and Service, or TaskManager. If you have experience with deploying applications to Kubernetes, you'll know it is a smooth process. We can run additional checks to validate that our job has started successfully by calling the Flink job directly. This operator delivers blue green deployment that is a valuable feature. Blue-green deployment is a technique that reduces downtime and risk by running two identical production environments called Blue and Green. At any time, only one of the environments is live, with the live environment serving all production traffic - in this case we replace the old job with the new one when both are running and only one of them processes data.

The speed of submitting new jobs is not the best and it depends mostly on the amount of available resources - blue green deployment can take several minutes while some solutions described here are much faster.

Any configuration related to the job is managed in the same way, like in the YARN cluster, and it can be achieved by using templates with ConfigMaps stored in the repository. There are no problems with connecting Flink to object storage, kerberized components like Kafka or HDFS.

Advantages:

  • Blue-Green deployment
  • nice working custom objects on Kubernetes
  • open-source code available on GitHub
  • lightweight solution
  • simple addition of sidecar containers

Disadvantages:

  • no central management for jobs
  • no available SQL editor
  • we can’t run session clusters
  • starting  a job may take some time - still in beta

Kubernetes Operator for Apache Flink

GitHub repository:
https://github.com/GoogleCloudPlatform/flink-on-k8s-operator

gcp-flink-kubernetes-getindata-big-data-cloud

It is available in the Google Cloud repository but it’s worth mentioning that it’s not officially supported by the company from Mountain View. Talking about Operator, the installation of it is pretty straightforward and we can use Kubernetes manifests or Helm charts to deploy it in our environment.

It creates a custom resource definition FlinkCluster and runs a controller Pod to keep watching custom resources. Once a FlinkCluster custom resource is created and detected by the controller, the controller creates the underlying Kubernetes resource (e.g., JobManager Pod) based on the spec of the custom resource.

CICD pipelines are simple to build as we describe our Flink job within the Helm chart or Kubernetes manifest, so it's quite a typical setup for an application running in a containerized environment. Of course it’s important to check if a job is up by calling the Kubernetes API and getting the status of Flink’s pods and its object.

The latest release, version number 0.2.0, was added in September 2020 and it’s quite an early beta version. It provides the possibility to restart failed jobs from the last savepoint and we can also set up batch scheduling for JobManager and TaskManager pods.

Advantages:

  • open-source code available on GitHub
  • lightweight solution
  • simple adding of sidecar containers
  • job and session cluster available
  • support for GCP services (Google Cloud Storage connector, IAM service accounts)
  • support for Python Apache Beam

Disadvantages:

  • no central management for jobs
  • no SQL editor available
  • beta release 0.2.0 - not the most mature solution

Ververica Platform

Ververica is a platform created by the team behind Apache Flink, so we can be sure that we are dealing with a product from an experienced company that understands Flink and how it works under the hood. One of the most unique features of Ververica Platform is a web dashboard in which the user can create deployment targets (add information about the name of target group and assign it to the chosen namespace), default configuration of Flink job, session clusters on which we can run Flink SQL queries or jobs; managing the lifecycle of our application and, moreover, there is quite a powerful REST API to manage job configuration, creating new jobs. Ververica applies new enhancements based on user feedback.

ververica-platform-getindata-flink-on-kubernetes

It is also a useful platform for running Flink SQL queries directly from Ververica and taking a look into the results from the UI instead of the terminal.

The CICD process requires using the REST API to which we send requests with configuration in YAML or JSON format so it can be easily implemented into any CICD tool like Gitlab CI, Jenkins or GitHub actions.

Installation of the Ververica Platform comes with its own SQLite database (it can be replaced with PostgreSQL, MySQL in the case of any version except the Community Edition) in which it stores all its data. It also simplifies configuration of external blob storage that can be used as the artifactory for Flink JARs as well as the place to store save points and checkpoints of jobs. It works with S3-compatible object storage as well as with kerberized HDFS.

The Ververica Platform is available as a free Community Edition which comes with a limitation of one namespace for jobs and no authentication; Startup Edition and Enterprise Edition. It’s worth mentioning that the authors are constantly improving their product and we can see frequent updates.

Advantages:

  • simple installation process
  • powerful REST API that is useful for CICD pipelines
  • management of Flink jobs (configuration, state)
  • Flink SQL editor
  • Autorestart mechanism directly within the Ververica Platform (not only in Flink jobs

Disadvantages:

  • limitations in the Community Edition, i.e. limited number of active namespaces and no authentication
  • using custom Flink images requires multiple changes in the Dockerfile and utilized Flink shell scripts

Native Kubernetes from official Apache Flink

Documentation:
https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/standalone/kubernetes.html

We can’t forget about the possibilities that are already available directly within Apache Flink. Since the release of 1.12.0 Flink has improved all aspects of running it in Kubernetes. It doesn’t require the addition of any CustomObjects or additional applications for managing Flink.

We have the following deployments in this case:

  • A Flink Session cluster is executed as a long-running Kubernetes Deployment. You can run multiple Flink jobs in a Session cluster. Each job needs to be submitted to the cluster after the cluster has been deployed.
  • A Flink Application cluster is a dedicated cluster which runs a single application

Native Kubernetes doesn’t introduce any operator and the installation is quite simple. Moreover, we are sure that it will be developed with adding enhancements to the service constantly as it is supported by a wide group of contributors of Apache Flink. Currently it’s really limited but in some cases it may be enough.

CICD process is the same as in any other application running in Kubernetes - we deploy everything straight from the CICD tool and we need to verify if the application is up or if it has any issues.

Advantages:

  • no installation required
  • simple adding of sidecar containers

Disadvantages:

  • no custom objects
  • no support for per-job clusters
  • limited features with no restart mechanism excluding Flink job directly.

Several ways to run Flink on Kubernetes

getindata-flink-on-kubernetes-big-data

Based on our experience, we can recommend using Ververica Platform as the reliable solution for most use cases. It provides a fast way to run the Flink job -  restarting the job with the new cluster is quite fast in comparison to, for example, Flink based on Lyft Operator and we can find some useful features like Flink SQL editor.

Surely, the choice of a perfect Flink operator depends on the exact use case, technical requirements and number of jobs. All the presented operators come from strong players in the Big Data market.

At GetInData you have access to 50+ distributed systems and cloud experts working with big data systems based on Apache Flink in multiple configurations. Do not hesitate to contact us, our team will be happy to discuss your real-time big data streaming project.

big data
kubernetes
apache flink
flink
getindata
CI/CD
23 February 2021

Want more? Check our articles

big data technology warsaw summit 2021 adam kawa przemysław gamdzyk
Big Data Event

The Big Data Technology Summit 2021 - review of presentations

Since 2015, the beginning of every year is quite intense but also exciting for our company, because we are getting closer and closer to the Big Data…

Read more
1YkseCzHNQ9Sxsi4BHnoCOQ
Use-cases/Project

Enabling Hive on Spark on CDH 5.14 — a few problems (and solutions)

Recently I’ve had an opportunity to configure CDH 5.14 Hadoop cluster of one of GetInData’s customers to make it possible to use Hive on Spark…

Read more
getindata grafana loki monitoring
Use-cases/Project

Why are log analytics so important in a monitoring system?

A monitoring system is a necessary component of any data platform. We can find a lot of different services that use different approaches to the same…

Read more
obszar roboczy 12 6blog

GetInData in 2020 - our achievements and challenges in Big Data environment

The end of 2020 has come, and it's time to stop for a moment and look back. The past year was not the easiest one and presented us with many…

Read more
cloud computing insurance
Tutorial

Cloud computing standard for the insurance industry

On June 16, 2021, the Polish Insurance Association published the Cloud computing standard for the insurance industry. It is a set of rules for the…

Read more
big data for e commerce

Big Data for E-commerce.

The year 2020 was full of challenges in many areas, and in many companies and organizations.  Often, it was necessary to introduce radical changes or…

Read more

Contact us

Interested in our solutions?
Contact us!

Together, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.

By submitting this form, you agree to our  Terms & Conditions