Tutorial

Apache NiFi and Apache NiFi Registry on Kubernetes

Apache NiFi is a popular big data processing engine with a graphical web UI that gives non-programmers the ability to create data pipelines quickly and without code, freeing them from text-based methods of implementation. This makes NiFi a widely used tool with a broad range of features.

If you are interested in reading more about using Apache NiFi in production from the developer's perspective, and about all the tradeoffs that come with its simplicity, then you should read the blog series:

Why Kubernetes?

Kubernetes is an open-source system for managing containerized applications across multiple hosts. It is becoming increasingly popular, also in the Big Data world, as it is by now a mature solution that appeals to many users. Beyond its popularity, Kubernetes enables faster delivery of results and simplifies deployments and updates (once we have the CI/CD pipelines and Helm charts prepared).

Main challenges with NiFi on Kubernetes

Moving some applications to Kubernetes is pretty straightforward, but this is not the case with Apache NiFi. It is a stateful application and, in most cases, it is deployed as a cluster on bare-metal or virtual machines. It would be simple if not for one important aspect of its architecture: NiFi nodes do not share or replicate processing data between cluster nodes, which makes the whole process more complicated. Fortunately, there is a way to overcome the problems related to running NiFi in cluster mode (which additionally requires an installed ZooKeeper): split the pipelines across separate NiFi instances, each working standalone. This seems to be the right way to run stable NiFis on Kubernetes, and we can easily manage their configurations from a repository by using, for example, Helm charts with a dedicated values file per instance.
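The per-instance configuration could look like the following sketch of a Helm values file (the key names are assumptions and depend on the chart you use; the point is that each standalone instance gets its own small, versioned file in the repository):

```yaml
# Hypothetical values-ingestion.yaml for one standalone NiFi instance
nifi:
  clusterMode: false        # standalone node, so no ZooKeeper is needed
  image:
    repository: apache/nifi
    tag: "1.13.2"
  jvmHeap: "4g"
persistence:
  repositories:
    storageClass: fast-ssd  # assumed StorageClass name
    size: 50Gi
```

Deploying a new instance is then just `helm install nifi-ingestion ./nifi -f values-ingestion.yaml`, and adding a pipeline group means adding one more values file.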

Here we come to the first important thing about NiFi: it performs a great deal of disk reads and writes, especially when we have many pipelines that constantly read data, run operations on it and send it on, or schedule and monitor processes based on the content of the data flow. This requires fairly performant storage under the hood. Slow disk or network storage does not seem to be the right solution, so here I recommend using object storage (on-premise, like Ceph) or faster storage. In the case of a highly loaded NiFi, the right solution would be local SSDs, but these will not provide real High Availability of the service.
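The local-SSD option can be expressed as a StorageClass that NiFi's repository volumes then reference; this is only a sketch (the class name and the assumption of pre-provisioned local volumes are mine, and a Ceph-backed class would use a different provisioner):

```yaml
# StorageClass backed by pre-provisioned local SSDs
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: kubernetes.io/no-provisioner  # local volumes are created manually
volumeBindingMode: WaitForFirstConsumer    # bind only once the NiFi pod is scheduled
```

NiFi's flowfile, content and provenance repositories would each get a PersistentVolumeClaim using this class, which also pins the pod to the node holding its data — the HA limitation mentioned above.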


The second thing is network performance between the Kubernetes cluster and the Hadoop or Kafka clusters. Since we read data from a source, process it and then save it somewhere, we need a stable, fast network connection to be sure there won't be any bottlenecks. This is especially important when saving files, for example writing a file to HDFS and then rebuilding a Hive table.

The third thing is the migration process. It is not about switching the NiFis over in one day; we should plan at least a week to move pipeline after pipeline and check that NiFi works as expected.

The fourth challenge is building the right CI/CD pipeline for our needs. This, in fact, consists of multiple layers. The first is building the base Docker image (here we can also use the official one) and, on top of it, specific images that contain our custom NARs; fortunately, multistage Dockerfiles work well in these cases and are the right choice here. The second part is creating the deployment to Kubernetes. Helm and Helmfile are extensive, designed to serve as the base for many deployments, and it's easy to learn all the details.
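A multistage build for custom NARs could be sketched like this (the Maven builder image, source paths and NiFi version are assumptions; the idea is that build tools never end up in the runtime image, only the NARs do):

```dockerfile
# Stage 1: build our custom processors
FROM maven:3.8-openjdk-8 AS build
COPY custom-processors/ /src/
WORKDIR /src
RUN mvn -q package

# Stage 2: copy only the resulting NARs into the NiFi lib directory
FROM apache/nifi:1.13.2
COPY --from=build /src/target/*.nar /opt/nifi/nifi-current/lib/
```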

The fifth challenge focuses on exposing NiFi to the users. In the Kubernetes world, we can expose an application using a Service or an Ingress, which is straightforward when we talk about NiFi over HTTP, but it becomes more complicated with HTTPS. In the Ingress, the SSL passthrough option must be enabled. One problem that occurred in the project was that the certificate presented by the Ingress was treated by NiFi as an attempt at user authentication. Alternatively, the following scenario would work: HTTPS traffic is routed to a web service responsible for SSL termination, which then encrypts the traffic once again towards NiFi.
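With the ingress-nginx controller, the passthrough variant boils down to one annotation; note that the controller itself must also be started with the `--enable-ssl-passthrough` flag for the annotation to take effect (the host, Service name and port below are examples):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nifi
  annotations:
    # Hand the TLS connection to NiFi untouched instead of terminating it
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
spec:
  rules:
    - host: nifi.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: nifi
                port:
                  number: 8443
```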

The sixth challenge is monitoring. We mainly use Prometheus in our on-premise projects, and configuring the Prometheus exporter is as simple as adding a separate process in NiFi that pushes metrics to the PushGateway, from which Prometheus can read them. Here we come to an issue with NiFi: there appear to be memory leaks in some processors. It is crucial, when starting to run NiFi on Kubernetes, to monitor its resource usage and verify how it handles processing data. One piece of advice: set the minimum JVM heap to the same value as the maximum, and leave a bigger gap between the JVM heap size and the RAM limit of the whole pod.
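Concretely, if `bootstrap.conf` pins the heap with `java.arg.2=-Xms4g` and `java.arg.3=-Xmx4g`, the pod limit should sit noticeably higher to leave room for off-heap memory; the exact numbers below are only an illustration:

```yaml
# Pod resources for a NiFi with a pinned 4g JVM heap
resources:
  requests:
    memory: "5Gi"
  limits:
    memory: "6Gi"   # ~2Gi headroom above the heap to avoid OOM kills
```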

The seventh part is about managing certificates, because there are multiple ways to achieve this. The simplest solution is to generate a certificate each time using the NiFi Toolkit running as a sidecar of NiFi. Another is to use a pre-created certificate, truststore and keystore (store them as Kubernetes secrets or, as recommended, in a secret manager like Google Cloud Secret Manager or HashiCorp Vault), then mount them into the NiFi statefulset. If we need to add a certificate to the truststore, we can either import it by re-uploading the truststore or import it dynamically during each start.
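The dynamic-import variant can be done with an init container that runs `keytool` against the mounted truststore before NiFi starts; this is a sketch with placeholder paths, image and password:

```yaml
# Import an extra CA into the mounted truststore on every pod start
initContainers:
  - name: import-ca
    image: openjdk:8-jre
    command: ["keytool", "-importcert", "-noprompt",
              "-keystore", "/certs/truststore.jks",
              "-storepass", "changeit",
              "-alias", "extra-ca",
              "-file", "/certs/extra-ca.pem"]
    volumeMounts:
      - name: certs
        mountPath: /certs
```

In a real setup the store password would of course come from a secret, not a literal.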

How to deploy NiFi on Kubernetes

The Helm chart for Apache NiFi and NiFi Registry is available here.

NiFi Registry on Kubernetes - prerequisites and deployment

Apache NiFi Registry was created to serve as a kind of Git repository for Apache NiFi pipelines. Under the hood, NiFi Registry uses a Git repository and stores all the information about changes in its own database (by default an H2 database, but we can also set up a PostgreSQL or MySQL database instead). It is not a perfect solution; there are a lot of difficulties with using it as part of a CI/CD pipeline. But, in all honesty, it seems to be the right solution within the NiFi ecosystem for storing all pipelines and their history in one place. The biggest challenge in managing it seems to be storing and updating its keystore and truststore, exactly as with Apache NiFi.


Read more about using NiFi Registry in production and how we can make a CICD pipeline for NiFi with its help.

Like Apache NiFi, NiFi Registry is also a stateful application. It requires storage for the locally cloned Git repository and for the database (if we choose to use the default H2). Here I recommend using PostgreSQL or MySQL, which are robust solutions compared to H2 and can be managed separately from the NiFi Registry.
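Switching the Registry to an external database is a matter of a few entries in `nifi-registry.properties`; the host, database name and credentials below are placeholders, and the PostgreSQL JDBC driver jar must be present in the configured driver directory:

```properties
# nifi-registry.properties: use an external PostgreSQL instead of embedded H2
nifi.registry.db.url=jdbc:postgresql://postgres:5432/nifi_registry
nifi.registry.db.driver.class=org.postgresql.Driver
nifi.registry.db.driver.directory=/opt/nifi-registry/drivers
nifi.registry.db.username=nifireg
nifi.registry.db.password=changeme
```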

NiFi on Kubernetes and Apache Ranger - how should you combine them?

An important aspect of any enterprise-grade data platform is managing permissions to all services. The most basic setup manages permissions, users and groups directly in Apache NiFi or Apache NiFi Registry, but thankfully we can use something more comprehensive and flexible: Apache Ranger. It is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.


With its extensive features, we can easily set up NiFi and NiFi Registry policies directly in the GUI or via the REST API and, moreover, see an audit log with access-denied information for any user rejected by the Ranger policies. In this case, we would need to configure Ranger Audit in the NiFis and set up the connection to Infra Solr and HDFS, used for storing hot and cold audit data.

The basic requirement is having Apache Ranger installed.

The first step of connecting Ranger and NiFi is building a new Docker image with the Ranger plugin for each NiFi. The best way to achieve this is a multistage Dockerfile in which we build the plugin and then add it to the layer with NiFi itself. You can see the example in the repository.

The second step is adding new policies in Ranger, which we can easily test during the first run of our secured setup.

The next step is to create all the files required by Ranger, such as the configuration of the connection from NiFi to Ranger and the audit settings, and to update NiFi's authorizers to use the Ranger plugin. Then we need to start the NiFis and verify that everything works as expected.
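The authorizers update amounts to an entry like the following in `authorizers.xml`; the identifier and file paths are placeholders for the security and audit configuration files created in this step:

```xml
<!-- Sketch of the Ranger authorizer entry in authorizers.xml -->
<authorizer>
    <identifier>ranger-provider</identifier>
    <class>org.apache.nifi.ranger.authorization.ManagedRangerAuthorizer</class>
    <property name="Ranger Security Config Path">/opt/nifi/conf/ranger-nifi-security.xml</property>
    <property name="Ranger Audit Config Path">/opt/nifi/conf/ranger-nifi-audit.xml</property>
    <property name="Ranger Service Type">nifi</property>
    <property name="Ranger Application Id">nifi</property>
</authorizer>
```

The `nifi.properties` authorizer setting then has to reference this identifier instead of the default file-based one.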

NiFi and Kerberos

Setting up Kerberos within NiFi is a piece of cake once the deployment process is finished. We need to ensure connectivity between the Kubernetes cluster and Kerberos service providers like FreeIPA or AD. The most important part is creating a headless keytab, i.e. one whose principal does not contain the hostname, to simplify the management of keytabs for the NiFis. Moreover, we need to mount the krb5.conf file and install the Kerberos client packages.
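Once the keytab and krb5.conf are mounted into the pod, wiring them into NiFi is a few lines in `nifi.properties`; the realm, principal and paths below are examples:

```properties
# nifi.properties: Kerberos settings pointing at the mounted files
nifi.kerberos.krb5.file=/etc/krb5.conf
nifi.kerberos.service.principal=nifi@EXAMPLE.COM
nifi.kerberos.service.keytab.location=/opt/nifi/keytabs/nifi.headless.keytab
```

Because the principal is headless (no hostname), the same keytab can be mounted into every NiFi instance.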

Summary: NiFi entered a new era

Apache NiFi is one of the most popular Big Data applications and has been in use for a long time. With multiple releases behind it, it has become a mature solution, and we can find many services like NiFi Registry or the NiFi Toolkit within its ecosystem. Using Kubernetes as the platform for running NiFi simplifies the deployment, management, upgrade and migration processes, which are complex with the older setups.

Surely, Kubernetes is not a remedy for all of NiFi's issues, but it may be a useful next step towards making the NiFi platform a better one.

16 June 2021

