
NiFi Ingestion Blog Series. PART II - We have deployed, but at what cost… - CI/CD of NiFi flow

Apache NiFi, a big data processing engine with a graphical web UI, was created to give non-programmers the ability to swiftly and codelessly create data pipelines, freeing them from those dirty, text-based methods of implementation. Unfortunately, we live in a world of trade-offs: features come at a price. The purpose of our blog series is to present our experience and the lessons learned when working with production NiFi pipelines. It is organised into the following articles:

Apache NiFi - why do data engineers love it and hate it at the same time?

Part I - Fast development, painful maintenance

Part II - We have deployed, but at what cost… - CI/CD of NiFi flow

Part III - No coding, just drag and drop what you need, but if it's not there… - custom processors, scripts, external services

Part IV - Universe made out of flow files - NiFi architecture

Part V - It’s fast and easy, what could possibly go wrong - one year history of certain NiFi flow

Part VI - I have only one rule and that's… - recommendations for using Apache NiFi


This post presents our approach to CI/CD of NiFi flows. You can read the previous posts on our blog.

NiFi Registry - the repository for the NiFi flows

NiFi is known for not forgiving mistakes. The lack of built-in functionality to undo changes prompted the developers to create an external versioning system that would enable rollbacks to the latest working version. This is how NiFi Registry was born. At first glance it is really useful, as it allows easy versioning of each process group in NiFi. Commits are made directly from the process group's web canvas, which keeps usage simple.

One may think that it is just like Git for NiFi, but NiFi Registry lacks many features of a modern VCS. There are a few reasons for this. Firstly, storing a flow is more complicated than storing code. Components of the flow often refer to other components inside and outside of the versioned process groups, and they also contain internal metadata which cannot be moved to a repository. Secondly, this additional complexity makes it difficult to adopt the mature solutions implemented in other VCSs, most noticeably features related to branches, such as merging and rebasing. If you collaborate with someone on the development of one flow, the most comfortable setup would be to create two versions of the flow, let each collaborator make their changes, and subsequently merge the partial results into a complete one. This is unfortunately not possible - one of you will have to redo their changes manually.

Environment separation nightmare intro

The limitations mentioned above can still be worked around, as long as you synchronize within your team. Beyond versioning, a VCS is also used to ease migration between environments. At some point a flow has to move from the development environment to test and production. Just as for regular applications, automated migration is preferred, as it saves time and is less error-prone. There are two possible approaches to achieve this: one registry shared by all environments, or a separate registry for each of DEV/TEST/PROD. The first option is probably easier but raises a lot of security issues, the most obvious being that the production system can be modified from development without restrictions. Because of these security concerns, we decided to go with the second one. That meant we needed a way to persist and import everything we wanted to migrate, and some way of doing it automatically. Luckily, both NiFi and NiFi Registry provide a REST API for every operation that you can perform through the web interface. Unfortunately, it is just a low-level REST API: if we want to use it for automation, we need to handle connecting to components, mapping all the entities to objects and so on.
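As a sketch of what driving these APIs by hand looks like, the snippet below fetches one versioned flow snapshot from NiFi Registry. The registry address, bucket and flow identifiers are hypothetical placeholders, and the endpoint path should be checked against your NiFi Registry version:

```python
import json
from urllib import request

def snapshot_url(registry, bucket_id, flow_id, version):
    """Build the NiFi Registry REST URL for one versioned flow snapshot."""
    return (f"{registry}/nifi-registry-api/buckets/{bucket_id}"
            f"/flows/{flow_id}/versions/{version}")

def export_snapshot(registry, bucket_id, flow_id, version):
    """Download the snapshot and parse it into a plain dict."""
    with request.urlopen(snapshot_url(registry, bucket_id, flow_id, version)) as resp:
        return json.load(resp)

# Hypothetical usage:
# flow = export_snapshot("http://localhost:18080", "prod-bucket", "ingest-flow", 7)
```

Every higher-level operation in a deployment script ends up being a composition of small calls like this one.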

NiFi Toolkit to the rescue! … well kind of…

To address the issue above, the Apache community has created NiFi Toolkit - command line utilities that make use of the REST APIs, for example for importing and exporting flows or starting and stopping process groups and cluster nodes. Although NiFi Toolkit provides plenty of useful features, it still doesn't support migration from one NiFi Registry to another. You will definitely need to write some code to manage this.

In the code of NiFi Registry we can find POJOs for all the entities used by the REST API, plus some helper methods that make it easy to handle communication with our components. The Toolkit is a really handy wrapper, but that is all it is; any logic beyond communication must be implemented by ourselves. For example, there is no command to import/export all the flows within a bucket - the only available option is to import/export a single flow.

When creating deployment scripts, we decided to extend NiFi Toolkit and create a Java application that makes use of the existing POJOs and methods. At some point it turned out that even though all the POJOs were available, only a handful of helper methods were implemented. You want to get the status of your process group? No problem. You want to get the status of all the processors inside that process group? Yeah… no. You have to implement that yourself, and if that wasn't bad enough, integrating with the NiFi Toolkit code without changing its source requires some rather ugly hacks.
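For instance, listing the run state of every processor inside a group comes down to one NiFi API call plus some unpacking of the response; the endpoint and field names below are as we remember them, so treat them as assumptions:

```python
import json
from urllib import request

def processors_in_group(nifi, group_id):
    """Call the NiFi API for all processors directly inside a process group."""
    url = f"{nifi}/nifi-api/process-groups/{group_id}/processors"
    with request.urlopen(url) as resp:
        return json.load(resp)["processors"]

def run_states(processors):
    """Reduce the processor entities to a name -> run state mapping."""
    return {p["component"]["name"]: p["component"]["state"] for p in processors}
```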

Additional hurdles in details

Whilst creating automated deployment for flows, we had to solve a lot of additional issues, mainly caused by missing functionality in NiFi Toolkit. Yes, it is technically doable to wire NiFi and NiFi Registry API calls into a deployment process - except for a dozen corner cases that need to be supported:

  • Object identification - NiFi Registry provides its own identification of objects, so it does not depend on NiFi UUIDs or processor names. Nonetheless, ID generation causes some issues during migrations.

    • Process group dependencies - If a versioned process group uses any other versioned process group, it contains a reference to that group. This creates a relation between process groups in the NiFi Registry. In consequence, when you import flows, you need to do it in the correct order, so that no group is imported before its dependencies.
    • UUID mapping - When a process group is imported, a new identifier is generated. This creates an issue when one process group depends on another: when the dependency is imported it gets a new UUID, so the group that uses it needs its reference updated.
    • Registry URL in references - The path to a dependency in NiFi Registry contains the URL of the registry as one of the properties. When moving to a different environment, you have to update this value to the URL of the registry you will be using.
    • Registry visibility - This issue can occur during testing, e.g. on Docker: the URL visible to the user is different from the one under which NiFi sees NiFi Registry, resulting in hard-to-track errors.
  • Scopes - NiFi Registry stores everything about your flow, but only within the versioned process group. That can be an issue for variables and controller services, which often have global scope.

    • Controller Service Migration - NiFi Registry doesn't keep controller services defined outside the versioned process group. We had to create an additional mechanism dedicated to importing and exporting them.
    • Mapping Controller Services - If a processor refers to a controller service outside its process group, NiFi Registry will keep the reference but will not preserve the controller service itself. As a result, after importing, the processor will have an invalid reference. We had to create a mechanism to map those references after importing the flow.
    • Controller Service dependencies - Some controller services are used not only by processors but also by other controller services. During migration all components get new UUIDs, so it is necessary to update all the references.
  • Controlling the state of the flow - Components in NiFi have a state: the run status for processors, and the number of queued flowfiles for queues. To allow modification, we need to stop the processor or empty the queue. This can be tricky without using the web UI.

    • Graceful stop mechanism - To avoid conflicts while deploying a new version, it is best to ensure that there are no flowfiles left in the flow's queues. It is not enough to simply switch all components to the stopped state; you have to implement a mechanism that waits for all flowfiles to be processed and then stops everything. The particular steps will depend on the structure of the flow.
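To illustrate the first two groups of issues, here is a sketch of the import-ordering and reference-fixing logic. The `versionedFlowCoordinates` field names match what exports looked like at the time of writing; verify them against your NiFi version before relying on this:

```python
from graphlib import TopologicalSorter  # Python 3.9+

def import_order(dependencies):
    """dependencies: {flow_id: set of flow_ids it references}.
    Returns an order in which every flow comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())

def remap_references(snapshot, flow_id_map, new_registry_url):
    """Recursively patch version-control coordinates in an exported snapshot:
    swap dependency flow IDs for their newly generated values and point the
    registry URL at the target environment."""
    if isinstance(snapshot, dict):
        coords = snapshot.get("versionedFlowCoordinates")
        if isinstance(coords, dict):
            coords["registryUrl"] = new_registry_url
            if coords.get("flowId") in flow_id_map:
                coords["flowId"] = flow_id_map[coords["flowId"]]
        for value in snapshot.values():
            remap_references(value, flow_id_map, new_registry_url)
    elif isinstance(snapshot, list):
        for item in snapshot:
            remap_references(item, flow_id_map, new_registry_url)
    return snapshot
```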

Please be aware that what may seem like a tiny scripting project can turn into a complex implementation.
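The graceful stop mechanism, for example, can be sketched as a generic drain-and-wait loop. The three callbacks stand in for the concrete NiFi API calls (stopping source processors, reading the queued-flowfile count, stopping the rest), which will differ per flow:

```python
import time

def graceful_stop(stop_sources, queued_count, stop_all,
                  poll_seconds=5.0, timeout_seconds=600.0):
    """Stop the source processors, wait until every queue has drained,
    then stop the remaining components; fail loudly on timeout."""
    stop_sources()
    deadline = time.monotonic() + timeout_seconds
    while queued_count() > 0:
        if time.monotonic() > deadline:
            raise TimeoutError("flowfiles still queued after timeout")
        time.sleep(poll_seconds)
    stop_all()
```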

Summary

What we loved: NiFi Registry is a great tool for development, and together with NiFi Toolkit it provides great aid in migrations between environments.

What we hated: the technology is still immature and requires a lot of additional work to make the actual migration process automated and reliable.

This has been a really tedious journey, and it is worth recalling what we were trying to achieve: we deployed a system (NiFi) that allows creating extensive data pipelines without writing a single line of code, and we ended up having to write a lot of code to be able to deploy those pipelines.

15 September 2020
