NiFi Ingestion Blog Series. PART III - No coding, just drag and drop what you need, but if it’s not there… - custom processors, scripts, external services

Apache NiFi, a big data processing engine with a graphical web UI, was created to give non-programmers the ability to swiftly and codelessly create data pipelines and free them from those dirty, text-based methods of implementation. Unfortunately, we live in a world of trade-offs, and those features come with a price. The purpose of our blog series is to present our experience and lessons learned when working with production NiFi pipelines. The series is organised into several articles.

In this post we focus on creating data flows with ready-to-go processors, the limitations of such an approach and the solutions we have applied. You can read the previous posts on our blog.

What should you do when you reach the limits?

We are genuine fans of Lego, which is typical for many engineers ;-) Lego provides different brick series for different age categories, and each series brings different capabilities to what can be built. One can create almost anything with Lego Technic, but building something complex takes time, even for a grown-up. On the other hand, with Lego Duplo, a three-year-old can put up tall buildings really fast. The only issue arises when one wants to add more detail, because the huge Duplo cuboids prevent you from creating custom things. Fortunately, one can mix different Lego series and, for instance, use Lego Classic on top of Duplo.

If data flows were built out of Lego, then NiFi would definitely be Duplo. You can build great things really fast, and there are several options for adding custom logic when no out-of-the-box NiFi processor does the job.

It’s fast, it’s easy and it’s free - custom Groovy scripts

A really nice extension point is the ExecuteGroovyScript processor, which lets you write scripts in Groovy. It’s well integrated with NiFi, and just a couple of lines of Groovy code can solve really complex problems. You put the script body into a text property of the processor and you are done. The only disadvantage is that the script has to be tested manually within NiFi each time it is modified. Because the approach solves many issues quickly, it can become really popular within a NiFi flow. At some point, you realize that the flow contains dozens of inline Groovy scripts that share common pieces of logic.

This can be solved with a Groovy project that contains classes instead of scripts, is fully covered with unit tests and is packaged into a jar file. Once the jar file is deployed on all NiFi nodes, it is included in the classpath of the ExecuteGroovyScript processors, and the class methods from the jar are called instead of writing inline scripts. The serious disadvantage we encountered was that, after uploading new jars onto the NiFi nodes, we had to manually reload the classpath of each processor to get the new version loaded. Another was that the Groovy code can read flowfile attributes, so both systems became tightly coupled. In other words, if you want to change an attribute in NiFi, not only do all the NiFi processors need to be checked, you also need to make sure that the change won't break the Groovy code stored in another repository. That’s how the monolithiest of monoliths get built.
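One way to loosen that coupling can be sketched as follows. The sketch is in Python purely for illustration (the project described above was written in Groovy), and every name in it, including the attribute keys `source.system` and `event.date`, is a hypothetical example rather than something from our flows. The idea is that the business rule only sees plain values, and a single thin adapter is the only code that knows the attribute names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingDecision:
    """Result of the business rule; it knows nothing about flowfiles."""
    target_table: str
    partition: str


def decide_routing(source_system: str, event_date: str) -> RoutingDecision:
    """Pure business logic: plain values in, plain value out."""
    if not source_system:
        raise ValueError("source_system must not be empty")
    # event_date is assumed to be in YYYY-MM-DD format.
    year, month, _day = event_date.split("-")
    return RoutingDecision(
        target_table=f"raw_{source_system.lower()}",
        partition=f"{year}/{month}",
    )


# Hypothetical attribute names; this module is the only place that knows them.
ATTR_SOURCE = "source.system"
ATTR_DATE = "event.date"


def routing_from_attributes(attributes: dict) -> dict:
    """Thin adapter between flowfile-style attributes and the pure function."""
    decision = decide_routing(attributes[ATTR_SOURCE], attributes[ATTR_DATE])
    return {
        "target.table": decision.target_table,
        "target.partition": decision.partition,
    }
```

With this split, renaming an attribute in the flow touches only the adapter, while the unit tests of `decide_routing` stay unchanged.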

We need to go deeper - custom processors?

While scripts provide a great interface for extending the functionality of NiFi, they have some limitations, from both the usability and the maintenance perspective. To maximize customization, we can create our own processors, which have a few notable advantages over scripts. The first is better abstraction: with a processor, the user can look into the built-in documentation, check the help messages next to each property name and so on, whereas with a script, looking into the script code is almost always necessary. We can also define as many output connections as we want instead of just success and failure. In addition, because every custom processor is just a Maven project, it can make use of all the traditional programming practices: versioning with a VCS like Git, using test frameworks and creating CI/CD pipelines. What's more, NiFi provides an interface for adding new components in a plugin-like manner, so there is no need to recompile anything. Since version 1.8 there is even the option of dynamically adding new components at runtime, and switching between versions of a component is available from the web UI. Unfortunately, NiFi ignores components with the same version as ones previously loaded, so it's impossible to dynamically replace a jar with one that has an already existing version.

All those mechanisms are great for programmers, but since NiFi is a tool designed for people who do not necessarily like to code, the additional complexity of creating components is a major downside compared to scripts. Everything has to be done in accordance with the NiFi framework, and the mere necessity of using Maven or a similar build system is a major complication, especially if the task executed by a component is fairly simple. Another disadvantage is that you need either SSH access to the NiFi cluster or a configured CI/CD pipeline to put a custom processor into NiFi, which might be a problem security-wise. And, the same as with scripts, it’s just adding more parts to one monolithic system.

Offloading business logic from NiFi

No one plans to build a monolith monster. It grows out of tiny bricks of tightly coupled things added one by one, day after day. The best way to avoid tight coupling is to use state-of-the-art engineering methods such as... microservices. Microservices encapsulate business logic in elegant, tiny components with a clearly defined API. This is something that really worked in our projects. Whenever some complex logic is required, instead of dozens of untestable NiFi processors, it is really worth creating a REST service endpoint. We favour that approach most of all because NiFi can easily send HTTP/HTTPS requests and handle JSON responses, and there are plenty of mature frameworks for writing REST services in the language of your preference. The lack of unit tests in NiFi is a serious limitation. When you build complex things and have unit tests, you can easily refactor your code and continuously make it better each day. Without them, making improvements is risky and is often avoided, so the code base or NiFi flow becomes difficult to maintain. Moreover, microservices can be used by other systems to communicate with NiFi.
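As a rough illustration of how small such a service can be, here is a minimal sketch that uses only the Python standard library. The `/enrich` path, the `amount` and `high_value` fields, the threshold and the port are all hypothetical choices made for this example; in a real project you would more likely use one of the mature REST frameworks mentioned above.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def enrich(record: dict) -> dict:
    """Hypothetical business rule; unit-testable without NiFi or HTTP."""
    amount = float(record["amount"])
    return {**record, "amount": amount, "high_value": amount >= 1000.0}


class EnrichHandler(BaseHTTPRequestHandler):
    """Exposes enrich() as POST /enrich with JSON in and JSON out."""

    def do_POST(self):
        if self.path != "/enrich":
            self._reply(404, {"error": "not found"})
            return
        try:
            length = int(self.headers.get("Content-Length", 0))
            record = json.loads(self.rfile.read(length))
            self._reply(200, enrich(record))
        except (KeyError, TypeError, ValueError) as exc:
            # Malformed JSON or a missing/invalid field is a client error.
            self._reply(400, {"error": str(exc)})

    def _reply(self, status: int, payload: dict) -> None:
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host: str = "127.0.0.1", port: int = 8080) -> ThreadingHTTPServer:
    """Builds the server; call serve_forever() on the result to start it."""
    return ThreadingHTTPServer((host, port), EnrichHandler)
```

Calling `make_server().serve_forever()` starts the service. On the NiFi side, a processor such as InvokeHTTP could then POST the flowfile content to it. The business rule lives in `enrich`, which can be covered by ordinary unit tests and refactored without touching the flow.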

The microservice approach works well unless large amounts of data have to be sent through the network. In other words, it suits scenarios where the complex logic can be kept separate from the large data volumes. In other cases, Apache Spark jobs can be triggered from NiFi.


What we loved? NiFi is like Lego Duplo, and it's great that it can be extended with other Lego bricks: Groovy scripts, custom processors or logic offloaded to microservices. Each of these approaches has its pros and cons. It's always good to have multiple options and to pick the one that serves your needs best.

What we hated? When working with real-life business logic, we prefer using Apache Spark for bigger data volumes or REST services for smaller amounts of data. In other words, for custom logic we prefer to avoid NiFi.

big data
apache nifi
2 October 2020
