How we helped our client transfer a legacy pipeline to a modern one using GitLab's CI/CD - Part 3
Please dive into the third part of a blog series based on a project delivered for one of our clients. Please click part I or part II to read the earlier posts.
Apache NiFi, a big data processing engine with a graphical web UI, was created to give non-programmers the ability to swiftly and codelessly create data pipelines and free them from those dirty, text-based methods of implementation. Unfortunately, we live in a world of trade-offs: features come with a price. The purpose of our blog series is to present our experience and lessons learned when working with production NiFi pipelines. This will be organised into the following articles:
Apache NiFi - why do data engineers love it and hate it at the same time?
Part I - Fast development, painful maintenance
Part II - We have deployed, but at what cost… - CI/CD of NiFi flow
Part IV - Universe made out of flow files - NiFi architecture
Part V - It’s fast and easy, what could possibly go wrong - one year history of certain NiFi flow
I have only one rule and that’s … - recommendations for using Apache NiFi
This post presents our approach to CI/CD of NiFi flows. You can read the previous posts on our blog.
NiFi is known for not forgiving any mistakes. The lack of built-in functionality to undo changes prompted the developers to create an external versioning system, which would enable rollbacks to the latest working version. This is how NiFi Registry was born. At first glance one can find it really useful, as it allows easy versioning of each process group in NiFi. Commits are done directly from the process group's web canvas, which keeps it simple to use.
One may think that it is just like Git for NiFi, but NiFi Registry lacks many features of a modern VCS. There are a few reasons why this is the case. Firstly, storing a flow is more complicated than storing code. Components of the flow often refer to other components inside and outside of the versioned process group. They also contain internal metadata which cannot be moved to a repository. Secondly, this additional complexity makes it difficult to reuse the mature solutions implemented in other VCS. The most noticeable gaps are branch-related features such as merging and rebasing. If you collaborate with someone on the development of one flow, the most comfortable setup would be to create two versions of the flow, have each collaborator make their changes, and then merge the partial results into a complete whole. This is unfortunately not possible - one of you will have to redo their changes manually.
The limitations mentioned above are still something that can be worked with, as long as you synchronize within your team. That aside, a VCS is also used to improve migrations between environments. At some point, a flow will move from the development environment to test and production. Just as for regular applications, automated migration is preferred, as it saves time and is less error-prone. There are two possible approaches to achieve this: one registry for all environments, or a separate registry for each of DEV/TEST/PROD. The first option is probably easier but raises a lot of security issues, the most obvious being that the production system can be modified from development without restrictions. Because of these potential security concerns, we decided to go with the second one. That means we needed a way to persist and import everything we want to migrate. We also needed some way of doing that automatically. Luckily, both NiFi and NiFi Registry provide a REST API for all operations that you can do through the web interface. Unfortunately, this is just a low-level REST API. If we want to use it for automation, we need to handle connecting to components, mapping all the entities to objects and so on.
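To give a feel for what "low-level" means here, below is a minimal sketch of moving one flow version between two registries over the REST API. The host names, bucket and flow UUIDs are placeholders, and the endpoint paths are the standard /nifi-registry-api ones from the version we worked with - they may differ in yours, so treat this as an illustration rather than a drop-in script.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: export the latest version of a flow from a DEV registry and push it
// to a TEST registry. All URLs and UUIDs below are placeholders.
public class RegistryMigrationSketch {
    static final HttpClient HTTP = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        String devRegistry  = "http://dev-registry:18080/nifi-registry-api";
        String testRegistry = "http://test-registry:18080/nifi-registry-api";
        String bucketId = "replace-with-bucket-uuid";  // IDs differ between registries
        String flowId   = "replace-with-flow-uuid";

        // 1. Export the latest snapshot of the flow from the DEV registry.
        HttpRequest export = HttpRequest.newBuilder()
                .uri(URI.create(devRegistry + "/buckets/" + bucketId
                        + "/flows/" + flowId + "/versions/latest"))
                .GET()
                .build();
        String snapshotJson = HTTP.send(export, HttpResponse.BodyHandlers.ofString()).body();

        // 2. Import it as a new version of the corresponding flow in the TEST registry.
        //    In practice the snapshot metadata (bucket/flow identifiers, version number)
        //    has to be rewritten first - that is exactly the "mapping entities to objects"
        //    work mentioned above.
        HttpRequest importReq = HttpRequest.newBuilder()
                .uri(URI.create(testRegistry + "/buckets/" + bucketId
                        + "/flows/" + flowId + "/versions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(snapshotJson))
                .build();
        System.out.println(HTTP.send(importReq, HttpResponse.BodyHandlers.ofString()).statusCode());
    }
}
```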
To address the issue above, the Apache community created NiFi Toolkit - a set of command-line utilities that make use of the REST APIs, for example for importing and exporting flows and starting/stopping process groups and cluster nodes. Although NiFi Toolkit provides plenty of useful features, it still doesn't support migration from one NiFi Registry to another. One will definitely need to write some code to manage this.
In the code of NiFi Registry, we can find POJOs for all the entities used by the REST API, as well as some helper methods which make it easy to handle communication with our components. The toolkit is a really handy wrapper, but that is all it is: any logic beyond communication must be implemented by ourselves. For example, there is no command to import/export all the flows within a bucket. The only available option is to import/export a single flow.
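The bucket-level export we missed is essentially a loop over the single-flow operation. The sketch below assumes the standard registry REST endpoints and uses a crude regex instead of the registry POJOs to keep it self-contained; the URL and bucket UUID are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: export every flow of one bucket to local JSON files - the bucket-level
// operation that the toolkit does not offer out of the box.
public class BucketExportSketch {
    static final HttpClient HTTP = HttpClient.newHttpClient();
    static final String REGISTRY = "http://dev-registry:18080/nifi-registry-api"; // placeholder
    static final String BUCKET_ID = "replace-with-bucket-uuid";                   // placeholder

    public static void main(String[] args) throws Exception {
        // List all flows stored in the bucket.
        String flowsJson = get(REGISTRY + "/buckets/" + BUCKET_ID + "/flows");

        // Crude extraction of flow identifiers from the JSON response;
        // a real implementation would deserialize into the registry POJOs instead.
        List<String> flowIds = new ArrayList<>();
        Matcher m = Pattern.compile("\"identifier\"\\s*:\\s*\"([0-9a-f-]{36})\"").matcher(flowsJson);
        while (m.find()) {
            flowIds.add(m.group(1));
        }

        // Export the latest snapshot of each flow to a local file.
        for (String flowId : flowIds) {
            String snapshot = get(REGISTRY + "/buckets/" + BUCKET_ID
                    + "/flows/" + flowId + "/versions/latest");
            Files.writeString(Path.of(flowId + ".snapshot.json"), snapshot);
        }
    }

    static String get(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder().uri(URI.create(url)).GET().build();
        return HTTP.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```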
When creating deployment scripts, we decided to extend NiFi Toolkit and create a Java application that makes use of the existing POJOs and methods. At some point it turned out that even though all the POJOs were available, only a handful of helper methods were implemented. You want to get the status of your process group? No problem. You want to get the status of all the processors inside that process group? Yeah… no. You have to implement that yourself, and if that wasn't bad enough, integrating with the NiFi Toolkit code without changing its source requires some rather ugly hacks.
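For the processor-level status, we ended up going straight to NiFi's REST API. A minimal sketch is shown below, assuming the /nifi-api/process-groups/{id}/processors endpoint behaves as it did in the NiFi version we used; the URL and process group UUID are placeholders, and the regex summary stands in for proper POJO deserialization.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: count the scheduled states (RUNNING / STOPPED / DISABLED) of all
// processors directly inside a process group, via NiFi's REST API.
public class ProcessorStatusSketch {
    public static void main(String[] args) throws Exception {
        String nifi = "http://dev-nifi:8080/nifi-api";       // placeholder NiFi URL
        String processGroupId = "replace-with-pg-uuid";       // placeholder process group ID

        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(nifi + "/process-groups/" + processGroupId + "/processors"))
                .GET()
                .build();
        String body = http.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Crude summary of the "state" fields in the JSON payload; a real
        // implementation would deserialize into the entity classes shipped with NiFi.
        Map<String, Long> counts = new TreeMap<>();
        Matcher m = Pattern.compile("\"state\"\\s*:\\s*\"(RUNNING|STOPPED|DISABLED)\"").matcher(body);
        while (m.find()) {
            counts.merge(m.group(1), 1L, Long::sum);
        }
        System.out.println(counts);   // e.g. {RUNNING=12, STOPPED=3}
    }
}
```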
Whilst creating automated deployment for flows, we had to solve a lot of additional issues, mainly caused by missing functionality in NiFi Toolkit. Yes, it is technically doable to build NiFi and NiFi Registry API calls into a deployment process, except for the dozen corner cases that need to be supported:
Object identification - NiFi Registry provides its own identification of objects, so it is not dependent on NiFi UUIDs or processor names. Nonetheless, ID generation causes some issues during migrations.
Scopes - NiFi Registry stores everything about your flow, but only what sits inside the versioned process group. That can be an issue for variables and controller services, which often have a global scope.
Controlling the state of the flow - components in NiFi have a state: the run status for processors and the number of queued flowfiles for queues. Before anything can be modified, we need to stop the processor or empty the queue, which can be tricky without the web UI (see the sketch after this list).
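As an illustration of the last point, here is a minimal sketch of the two state changes a deployment script typically has to make before replacing a running flow: stopping a process group and requesting that a queue be emptied. The endpoint paths match the NiFi version we used but may differ in yours; the URL and UUIDs are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: stop all components of a process group, then ask NiFi to drop the
// flowfiles queued on one connection.
public class FlowStateSketch {
    static final HttpClient HTTP = HttpClient.newHttpClient();
    static final String NIFI = "http://dev-nifi:8080/nifi-api";   // placeholder NiFi URL

    public static void main(String[] args) throws Exception {
        String processGroupId = "replace-with-pg-uuid";            // placeholder IDs
        String connectionId   = "replace-with-connection-uuid";

        // Stop every processor in the process group.
        String stopBody = "{\"id\":\"" + processGroupId + "\",\"state\":\"STOPPED\"}";
        HttpRequest stop = HttpRequest.newBuilder()
                .uri(URI.create(NIFI + "/flow/process-groups/" + processGroupId))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(stopBody))
                .build();
        System.out.println("stop: " + HTTP.send(stop, HttpResponse.BodyHandlers.ofString()).statusCode());

        // Create a drop request for the connection's queue. This only schedules the
        // drop; a real script would poll the request until it finishes and then delete it.
        HttpRequest drop = HttpRequest.newBuilder()
                .uri(URI.create(NIFI + "/flowfile-queues/" + connectionId + "/drop-requests"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        System.out.println("drop: " + HTTP.send(drop, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```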
Please be aware that what may seem like a tiny scripting project can turn into a complex implementation.
What we loved: NiFi Registry is a great tool for development, and together with NiFi Toolkit it is a great aid in migrations between environments.
What we hated: the technology is still immature and requires a lot of additional work to make the actual migration process automated and reliable.
This has been a really tedious journey, and it is worth recalling what we were trying to achieve: we set out to deploy a system (NiFi) that allows building extensive data pipelines without writing a single line of code. We ended up writing a lot of code just to be able to deploy those pipelines.