
NiFi Ingestion Blog Series. PART II - We have deployed, but at what cost… - CI/CD of NiFi flow

Apache NiFi, a big data processing engine with a graphical web UI, was created to give non-programmers the ability to swiftly and codelessly create data pipelines, freeing them from those dirty, text-based methods of implementation. Unfortunately, we live in a world of trade-offs: features come at a price. The purpose of our blog series is to present our experience and the lessons learned when working with production NiFi pipelines. It is organised into the following articles:

Apache NiFi - why do data engineers love it and hate it at the same time?

Part I - Fast development, painful maintenance

Part II - We have deployed, but at what cost… - CI/CD of NiFi flow

Part III - No coding, just drag and drop what you need, but if it's not there… - custom processors, scripts, external services

Part IV - Universe made out of flow files - NiFi architecture

Part V - It’s fast and easy, what could possibly go wrong - one year history of certain NiFi flow

Part VI - I have only one rule and that's… - recommendations for using Apache NiFi


This post presents our approach to CI/CD of NiFi flows. You can read the previous posts on our blog.

NiFi Registry - the repository for the NiFi flows

NiFi is known for not forgiving mistakes. The lack of built-in functionality to undo changes prompted the developers to create an external versioning system that would enable rollbacks to the latest working version. This is how NiFi Registry was born. At first glance it is really useful, as it allows easy versioning of each process group in NiFi. Commits are made directly from the process group's web canvas, which keeps usage simple.

One may think that it is just like Git for NiFi, but NiFi Registry lacks many features of a modern VCS. There are a few reasons for this. Firstly, storing a flow is more complicated than storing code. Components of the flow often refer to other components inside and outside of the versioned process groups, and they also contain internal metadata which cannot be moved to a repository. Secondly, this additional complexity makes it difficult to adopt the mature solutions implemented in other VCSs, most noticeably features related to branches, such as merging and rebasing. If you collaborate with someone on the development of one flow, the most comfortable setup would be to create two versions of the flow, let each collaborator make their changes, and subsequently merge the partial results into a complete one. This is unfortunately not possible - one of you will have to redo their changes manually.

Environment separation nightmare intro

The limitations mentioned above can still be worked around, as long as you synchronize within your team. Beyond versioning, a VCS is also used to ease migration between environments. At some point a flow has to move from the development environment to test and production. Just as for regular applications, automated migration is preferred, as it saves time and is less error-prone. There are two possible approaches to achieve this: one registry shared by all environments, or a separate registry for each of DEV/TEST/PROD. The first option is probably easier but raises a lot of security issues, the most obvious being that the production system can be modified from development without restrictions. Because of these security concerns, we decided to go with the second one. That meant we needed a way to persist and import everything we wanted to migrate, and some way of doing it automatically. Luckily, both NiFi and NiFi Registry provide a REST API for every operation that you can perform through the web interface. Unfortunately, it is just a low-level REST API: if we want to use it for automation, we need to handle connecting to components, mapping all the entities to objects and so on.
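As a sketch of what driving these APIs by hand looks like, the snippet below fetches one versioned flow snapshot from NiFi Registry. The registry address, bucket and flow identifiers are hypothetical placeholders, and the endpoint path should be checked against your NiFi Registry version:

```python
import json
from urllib import request

def snapshot_url(registry, bucket_id, flow_id, version):
    """Build the NiFi Registry REST URL for one versioned flow snapshot."""
    return (f"{registry}/nifi-registry-api/buckets/{bucket_id}"
            f"/flows/{flow_id}/versions/{version}")

def export_snapshot(registry, bucket_id, flow_id, version):
    """Download the snapshot and parse it into a plain dict."""
    with request.urlopen(snapshot_url(registry, bucket_id, flow_id, version)) as resp:
        return json.load(resp)

# Hypothetical usage:
# flow = export_snapshot("http://localhost:18080", "prod-bucket", "ingest-flow", 7)
```

Every higher-level operation in a deployment script ends up being a composition of small calls like this one.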

NiFi Toolkit to the rescue! … well kind of…

To address the issue above, the Apache community has created NiFi Toolkit - command line utilities that make use of the REST APIs, for example for importing and exporting flows or starting and stopping process groups and cluster nodes. Although NiFi Toolkit provides plenty of useful features, it still doesn't support migration from one NiFi Registry to another. You will definitely need to write some code to manage this.

In the code of NiFi Registry we can find POJOs for all the entities used by the REST API, plus some helper methods that make it easy to handle communication with our components. The Toolkit is a really handy wrapper, but that is all it is; any logic beyond communication must be implemented by ourselves. For example, there is no command to import/export all the flows within a bucket - the only available option is to import/export a single flow.

When creating deployment scripts, we decided to extend NiFi Toolkit and create a Java application that makes use of the existing POJOs and methods. At some point it turned out that even though all the POJOs were available, only a handful of helper methods were implemented. You want to get the status of your process group? No problem. You want to get the status of all the processors inside that process group? Yeah… no. You have to implement that yourself, and if that wasn't bad enough, integrating with the NiFi Toolkit code without changing its source requires some rather ugly hacks.
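For instance, listing the run state of every processor inside a group comes down to one NiFi API call plus some unpacking of the response; the endpoint and field names below are as we remember them, so treat them as assumptions:

```python
import json
from urllib import request

def processors_in_group(nifi, group_id):
    """Call the NiFi API for all processors directly inside a process group."""
    url = f"{nifi}/nifi-api/process-groups/{group_id}/processors"
    with request.urlopen(url) as resp:
        return json.load(resp)["processors"]

def run_states(processors):
    """Reduce the processor entities to a name -> run state mapping."""
    return {p["component"]["name"]: p["component"]["state"] for p in processors}
```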

Additional hurdles in details

Whilst creating automated deployment for flows, we had to solve a lot of additional issues, mainly caused by missing functionality in NiFi Toolkit. Yes, it is technically doable to wire NiFi and NiFi Registry API calls into a deployment process - except for a dozen corner cases that need to be supported:

  • Object identification - NiFi Registry provides its own identification of objects, so it does not depend on NiFi UUIDs or processor names. Nonetheless, ID generation causes some issues during migrations.

    • Process group dependencies - If a versioned process group uses any other versioned process group, it contains a reference to that group. This creates a relation between process groups in the NiFi Registry. In consequence, when you import flows, you need to do it in the correct order, so that no group is imported before its dependencies.
    • UUID mapping - When a process group is imported, a new identifier is generated. This creates an issue when one process group depends on another: when the dependency is imported it gets a new UUID, so the group that uses it needs its reference updated.
    • Registry URL in references - The path to a dependency in NiFi Registry contains the URL of the registry as one of the properties. When moving to a different environment, you have to update this value to the URL of the registry you will be using.
    • Registry visibility - This issue can occur during testing, e.g. on Docker: the URL visible to the user is different from the one under which NiFi sees NiFi Registry, resulting in hard-to-track errors.
  • Scopes - NiFi Registry stores everything about your flow, but only within the versioned process group. That can be an issue for variables and controller services, which often have global scope.

    • Controller Service Migration - NiFi Registry doesn't keep controller services defined outside the versioned process group. We had to create an additional mechanism dedicated to importing and exporting them.
    • Mapping Controller Services - If a processor refers to a controller service outside its process group, NiFi Registry will keep the reference but will not preserve the controller service itself. As a result, after importing, the processor will have an invalid reference. We had to create a mechanism to map those references after importing the flow.
    • Controller Service dependencies - Some controller services are used not only by processors but also by other controller services. During migration all components get new UUIDs, so it is necessary to update all the references.
  • Controlling the state of the flow - Components in NiFi have a state: the run status for processors, and the number of queued flowfiles for queues. To allow modification, we need to stop the processor or empty the queue. This can be tricky without using the web UI.

    • Graceful stop mechanism - To avoid conflicts while deploying a new version, it is best to ensure that there are no flowfiles left in the flow's queues. It is not enough to simply switch all components to the stopped state; you have to implement a mechanism that waits for all flowfiles to be processed and then stops everything. The particular steps will depend on the structure of the flow.
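To illustrate the first two groups of issues, here is a sketch of the import-ordering and reference-fixing logic. The `versionedFlowCoordinates` field names match what exports looked like at the time of writing; verify them against your NiFi version before relying on this:

```python
from graphlib import TopologicalSorter  # Python 3.9+

def import_order(dependencies):
    """dependencies: {flow_id: set of flow_ids it references}.
    Returns an order in which every flow comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())

def remap_references(snapshot, flow_id_map, new_registry_url):
    """Recursively patch version-control coordinates in an exported snapshot:
    swap dependency flow IDs for their newly generated values and point the
    registry URL at the target environment."""
    if isinstance(snapshot, dict):
        coords = snapshot.get("versionedFlowCoordinates")
        if isinstance(coords, dict):
            coords["registryUrl"] = new_registry_url
            if coords.get("flowId") in flow_id_map:
                coords["flowId"] = flow_id_map[coords["flowId"]]
        for value in snapshot.values():
            remap_references(value, flow_id_map, new_registry_url)
    elif isinstance(snapshot, list):
        for item in snapshot:
            remap_references(item, flow_id_map, new_registry_url)
    return snapshot
```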

Please be aware that what may seem like a tiny scripting project can turn into a complex implementation.
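The graceful stop mechanism, for example, can be sketched as a generic drain-and-wait loop. The three callbacks stand in for the concrete NiFi API calls (stopping source processors, reading the queued-flowfile count, stopping the rest), which will differ per flow:

```python
import time

def graceful_stop(stop_sources, queued_count, stop_all,
                  poll_seconds=5.0, timeout_seconds=600.0):
    """Stop the source processors, wait until every queue has drained,
    then stop the remaining components; fail loudly on timeout."""
    stop_sources()
    deadline = time.monotonic() + timeout_seconds
    while queued_count() > 0:
        if time.monotonic() > deadline:
            raise TimeoutError("flowfiles still queued after timeout")
        time.sleep(poll_seconds)
    stop_all()
```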

Summary

What we loved: NiFi Registry is a great tool for development, and together with NiFi Toolkit it provides great aid in migrations between environments.

What we hated: the technology is still immature and requires a lot of additional work to make the actual migration process automated and reliable.

This has been a really tedious journey, and it is worth recalling what we were trying to achieve: we deployed a system (NiFi) that allows creating extensive data pipelines without writing a single line of code, and we ended up having to write a lot of code to be able to deploy those pipelines.

15 September 2020
