Tutorial
7 min read

NiFi Ingestion Blog Series. PART III - No coding, just drag and drop what you need, but if it’s not there… - custom processors, scripts, external services

Apache NiFI, a big data processing engine with graphical WebUI, was created to give non-programmers the ability to swiftly and codelessly create data pipelines and free them from those dirty, text-based methods of implementation. Unfortunately, we live in a world of trade-offs, and those features come with a price. The purpose of our blog series is to present our experience and lessons learned when working with production NiFI pipelines. This will be organised into the following articles:

Apache NiFi - why do data engineers love it and hate it at the same time?

Part I - Fast development, painful maintenance

Part II - We have deployed, but at what cost… - CI/CD of NiFi flow

Part III - No coding, just drag and drop what you need, but if its not there… - custom processors, scripts, external services

Part IV - Universe made out of flow files - NiFi architecture

Part V - It’s fast and easy, what could possibly go wrong - one year history of certain NiFi flow

I have only one rule and that’s … - recommendations for using Apache NiFi


In this post we focus on creating data flows with ready-to-go processors, the limitations of such an approach and the solutions we have applied. Previous posts you can read on our blog.

What should you do when you reach the limits?

We are genuine fans of Lego which is typical for many engineers ;-) Lego provides different brick series for different age categories and this brings different capabilities to what can be built. One can create almost anything with Lego Technic, but it takes some time to make something in a more complex way, even if you are a grown up. On the other hand, with Lego Duplo, one can create high buildings really fast at the age of three. The only issue is when one wants to add more details because huge Duplo cuboids prevent from creating custom things. Fortunately one can mix different Lego series and for instance use Lego Classic on the top of Duplo.

getindata nifi blog post

If data flows were built out of Lego, then NiFi definitely stands for Duplo. You can build great things really fast and there are some options to create custom logic when no out-of-the-box NiFi processor is present.

It’s fast, it’s easy and it’s free - custom Groovy scripts

The really nice extension can be done with the ExecuteGroovyScript processor that allows writing scripts in Groovy. It’s well integrated with NiFi and just a couple lines of Groovy code can solve really complex problems. The processor allows you to put a script body into a text attribute and you are done. The only disadvantage is the manual testing within NiFi each time the script is modified. As the approach solves many issues quickly, it can get really popular within NiFi flow. At some point, you realize that flow contains dozens of inline Groovy scripts that share some common logic elements.

This can be solved with a Groovy project that contains classes instead of scripts, that are fully covered with unit tests and packaged into a jar file. Once a jar file is deployed on all NiFi nodes, it is included in the classpath of ExecuteGroovyScript processors. Class methods from within the jar file are used instead of writing the inline scripts. The serious disadvantage we encountered was that, when uploading new jars onto NiFi nodes, it was required to manually reload the classpath for each processor to get the new version loaded. Another was that Groovy code can read flowfile attributes and both systems got tightly coupled. In other words, if you want to change an attribute in NiFi, not only all the NiFi processors need to be checked you also need to make sure that it won't break the Groovy code stored in another repository. That’s how the monolithiest of monoliths get built.

We need to go deeper - custom processors?

While scripts provide a great interface to extend functionalities of Nifi, they have some limitations, both from usability and maintenance perspectives. To maximize customization, we can create our own processors, with a few notable advantages compared to scripts. The first is better abstraction; in the case of processors, the user can look into build-in documentation, check help messages next to property name etc. However, in the case of script, looking into the script code is almost always necessary. We can also define as many output connections as we want instead of just success and failure. In addition, because every custom processor is just a maven project, it can make use of all traditional programming features, versioning with VCSs like Git, using test frameworks and creating CI/CD pipelines. What's more, NiFi provides an interface for adding new components in a plugin-like manner, so no need to recompile anything. Since version 1.8 there is even the option of dynamically adding new components during runtime and switching between versions of components is available from the level of WebUI. Unfortunately, NiFi will ignore components with the same version as ones previously loaded, so it's impossible to dynamically replace the jar with an already existing version.

All those mechanisms are great for programmers but since NiFi is a tool designed for people who do not necessarily like to code, the additional complexity in creating components is a major downside compared to scripts. Everything has to be done in accordance with the NiFi framework,

- just the necessity of using Maven or some similar system is a major complication, especially if a task executed by a component is fairly simple. Another disadvantage is that you need to have to access the NiFi cluster via ssh or configured CI/CD to put a custom processor into Nifi, which might be a problem security-wise. The same as with scripts, it’s just adding additional parts to one monolithic system.

Offloading business logic from NiFi

No one plans the building of monolith monsters. It is just tiny bricks of tightly coupled things added one by one each day. The best way to avoid tight coupling is using state-of-the-art engineering methods such as…. microservices. Microservices ensure the encapsulation of business logic into elegant and tiny components which have a clear definition of the API they expose. This is something that really worked in our projects. Whenever some complex logic is required, instead of dozens of untestable NiFi processors, it is really worth creating a REST service endpoint. We favour that approach most of all because NiFi can easily send HTTP/HTTPS requests and handle JSON responses. There are plenty of mature frameworks for writing rest services in a language of your preference. The lack of unit tests in NiFi is a serious limitation. When You build complex things and have unit tests, you can easily refactor your code and continuously make it better each day. Without them, making improvements is risky and is often avoided, thus the code base or NiFi flow gets difficult to maintain. Moreover, microservices can be used by other systems to communicate with NiFi.

The approach with microservices works well unless big amounts of data is sent through the network. In other words, it suits the scenarios where complex logic can be kept separate to data volumes at scale. In other cases, Apache Spark jobs can be triggered from NiFi.

Conclusion

What we loved? NiFi is like Lego Duplo and it's great that it can be extended with other Lego bricks like groovy scripts, custom processors or offloading logic to microservices. Each of the approaches has its pros and cons. It's always good when you have multiple options and pick the one that serves your needs best.

What we hated? When working with real life business logic, we prefer using Apache Spark for bigger data volumes or rest services with smaller amounts of data. In other words, for custom logic we prefer avoiding NiFi.

big data
technology
apache nifi
getindata
2 October 2020

Want more? Check our articles

Big Data Event

2³ Reasons To Speak at Big Data Tech Warsaw 2020 (February 27th, 2020)

Big Data Technology Warsaw Summit 2020 is fast approaching. This will be 6th edition of the conference that is jointly organised by Evention and…

Read more

5 reasons to follow us on Linkedin. Celebrating 1,000 followers on our profile!

We are excited to announce that we recently hit the 1,000+ followers on our profile on Linkedin. We would like to send a special THANK YOU :) to…

Read more
Use-cases/Project

Anomaly detection implemented in podcasting company

Being a Data Engineer is not only about moving the data but also about extracting value from it. Read an article on how we implemented anomalies…

Read more
Tutorial

Apache NiFi - why do data engineers love it and hate it at the same time? Blog Series Introduction

Learning new technologies is like falling in love. At the beginning, you enjoy it totally and it is like wearing pink glasses that prevent you from…

Read more
Tutorial

Avoiding the mess in the Hadoop Cluster

This blog is based on the talk “Simplified Data Management and Process Scheduling in Hadoop” that we gave at the Big Data Technical Conference in…

Read more
Big Data Event

Big Data Tech Warsaw Summit 2019 summary

It’s been already more than a month after Big Data Tech Warsaw Summit 2019, but it’s spirit is still among us — that’s why we’ve decided to prolong it…

Read more

Contact us

Fill out this simple form. Our team will contact you promptly to discuss the next steps.

hello@getindata.comFist bump illustration

Any questions?

Choose one
By submitting this form, you agree to our  Terms & Conditions