Tutorial
8 min read

NiFi Ingestion Blog Series. PART III - No coding, just drag and drop what you need, but if it’s not there… - custom processors, scripts, external services

Apache NiFi, a big data processing engine with a graphical WebUI, was created to give non-programmers the ability to swiftly and codelessly create data pipelines and free them from those dirty, text-based methods of implementation. Unfortunately, we live in a world of trade-offs, and those features come at a price. The purpose of our blog series is to present our experience and the lessons learned when working with production NiFi pipelines. The series is organised into the following articles:


In this post we focus on creating data flows with ready-to-go processors, the limitations of such an approach and the solutions we have applied. You can read the previous posts on our blog.

What should you do when you reach the limits?

We are genuine fans of Lego, which is typical of many engineers ;-) Lego provides different brick series for different age categories, and each brings different capabilities to what can be built. One can create almost anything with Lego Technic, but building something more complex takes time, even if you are a grown-up. On the other hand, with Lego Duplo, one can create tall buildings really fast at the age of three. The only issue arises when one wants to add more detail, because the huge Duplo cuboids prevent you from creating custom things. Fortunately, one can mix different Lego series and, for instance, use Lego Classic on top of Duplo.


If data flows were built out of Lego, then NiFi definitely stands for Duplo. You can build great things really fast and there are some options to create custom logic when no out-of-the-box NiFi processor is present.

It’s fast, it’s easy and it’s free - custom Groovy scripts

A really nice extension can be achieved with the ExecuteGroovyScript processor, which allows writing scripts in Groovy. It is well integrated with NiFi, and just a couple of lines of Groovy code can solve really complex problems. The processor lets you put a script body into a text property and you are done. The only disadvantage is the manual testing within NiFi each time the script is modified. As the approach solves many issues quickly, it can become really popular within the NiFi flow. At some point, you realize that the flow contains dozens of inline Groovy scripts that share some common logic.
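
To give a feel for it, here is a minimal sketch of such an inline script, using the session API that the processor binds for you; the attribute names and the routing rule are made up for illustration:

```groovy
// Minimal inline script for ExecuteGroovyScript - illustrative only.
// 'session', 'log', 'REL_SUCCESS' and 'REL_FAILURE' are bound by the processor.
def flowFile = session.get()
if (!flowFile) return

try {
    // hypothetical business rule: tag events coming from a "priority" source
    def source = flowFile.getAttribute('source.system')   // example attribute name
    flowFile = session.putAttribute(flowFile, 'priority', source == 'crm' ? 'high' : 'normal')
    session.transfer(flowFile, REL_SUCCESS)
} catch (e) {
    log.error('Failed to classify flowfile', e)
    session.transfer(flowFile, REL_FAILURE)
}
```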

This can be solved with a Groovy project that contains classes instead of scripts; the classes are fully covered with unit tests and packaged into a jar file. Once the jar file is deployed on all NiFi nodes, it is included in the classpath of the ExecuteGroovyScript processors, and class methods from the jar are used instead of inline scripts. The serious disadvantage we encountered was that, after uploading new jars onto the NiFi nodes, the classpath of each processor had to be reloaded manually to pick up the new version. Another was that the Groovy code can read flowfile attributes, so both systems become tightly coupled. In other words, if you want to change an attribute in NiFi, not only do all the NiFi processors need to be checked, you also need to make sure that the change won't break the Groovy code stored in another repository. That's how the monolithiest of monoliths get built.
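
As a rough sketch of that setup, the shared logic might live in a plain, unit-testable class like the one below (the package and class names are hypothetical):

```groovy
// Hypothetical utility class from a separate Groovy project,
// unit-tested and packaged into a jar deployed on every NiFi node.
package com.example.nifi.util

class PriorityClassifier {
    // Pure function: no NiFi dependencies, easy to cover with unit tests.
    static String classify(String sourceSystem) {
        sourceSystem == 'crm' ? 'high' : 'normal'
    }
}
```

The inline script in ExecuteGroovyScript then shrinks to a single call such as `flowFile = session.putAttribute(flowFile, 'priority', PriorityClassifier.classify(flowFile.getAttribute('source.system')))`, which is far less error-prone than duplicating script bodies across processors.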

We need to go deeper - custom processors?

While scripts provide a great interface to extend the functionality of NiFi, they have some limitations, both from a usability and a maintenance perspective. To maximize customization, we can create our own processors, which have a few notable advantages compared to scripts. The first is better abstraction: in the case of processors, the user can look into the built-in documentation, check the help messages next to property names, etc., whereas in the case of a script, looking into the script code is almost always necessary. We can also define as many output relationships as we want, instead of just success and failure. In addition, because every custom processor is just a Maven project, it can make use of all the traditional programming practices: versioning with a VCS like Git, using test frameworks and creating CI/CD pipelines. What's more, NiFi provides an interface for adding new components in a plugin-like manner, so there is no need to recompile anything. Since version 1.8 there is even the option of dynamically adding new components at runtime, and switching between component versions is available from the WebUI. Unfortunately, NiFi will ignore components with the same version as ones previously loaded, so it is impossible to dynamically replace a jar with an already existing version.
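
For illustration, a minimal processor skeleton could look roughly like this (the class, tags and relationship names are made up; such projects are typically written in Java and packaged as a NAR, but Groovy syntax is used here for brevity):

```groovy
// Sketch of a custom NiFi processor with more than the usual success/failure relationships.
import org.apache.nifi.annotation.documentation.CapabilityDescription
import org.apache.nifi.annotation.documentation.Tags
import org.apache.nifi.processor.AbstractProcessor
import org.apache.nifi.processor.ProcessContext
import org.apache.nifi.processor.ProcessSession
import org.apache.nifi.processor.Relationship

@Tags(['example', 'classification'])
@CapabilityDescription('Routes flowfiles by source system - this text shows up in the built-in docs.')
class ClassifyEventProcessor extends AbstractProcessor {

    // Unlike a script, a processor can expose any number of relationships.
    static final Relationship REL_HIGH    = new Relationship.Builder().name('high').build()
    static final Relationship REL_NORMAL  = new Relationship.Builder().name('normal').build()
    static final Relationship REL_FAILURE = new Relationship.Builder().name('failure').build()

    @Override
    Set<Relationship> getRelationships() {
        [REL_HIGH, REL_NORMAL, REL_FAILURE] as Set
    }

    @Override
    void onTrigger(ProcessContext context, ProcessSession session) {
        def flowFile = session.get()
        if (!flowFile) return
        // hypothetical routing rule, mirroring the earlier script example
        def target = flowFile.getAttribute('source.system') == 'crm' ? REL_HIGH : REL_NORMAL
        session.transfer(flowFile, target)
    }
}
```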

All those mechanisms are great for programmers, but since NiFi is a tool designed for people who do not necessarily like to code, the additional complexity of creating components is a major downside compared to scripts. Everything has to be done in accordance with the NiFi framework - just the necessity of using Maven or a similar build system is a major complication, especially if the task executed by a component is fairly simple. Another disadvantage is that you need access to the NiFi cluster via SSH, or a configured CI/CD pipeline, to deploy a custom processor into NiFi, which might be a problem security-wise. Just as with scripts, it adds more parts to one monolithic system.

Offloading business logic from NiFi

No one plans to build monolithic monsters. It is just tiny bricks of tightly coupled things, added one by one each day. The best way to avoid tight coupling is to use state-of-the-art engineering methods such as… microservices. Microservices encapsulate business logic in elegant, tiny components with a clearly defined API. This is something that really worked in our projects. Whenever some complex logic is required, instead of dozens of untestable NiFi processors, it is really worth creating a REST service endpoint. We favour that approach most of all because NiFi can easily send HTTP/HTTPS requests and handle JSON responses, and there are plenty of mature frameworks for writing REST services in a language of your preference. The lack of unit tests in NiFi is a serious limitation: when you build complex things and have unit tests, you can easily refactor your code and continuously make it better each day. Without them, making improvements is risky and often avoided, so the code base or NiFi flow becomes difficult to maintain. Moreover, microservices can be used by other systems to communicate with NiFi.
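
To show how small such a service can be, the snippet below exposes a single JSON endpoint using only Groovy and the JDK's built-in HTTP server; in a real project you would pick one of the mature frameworks, and the port, path and field names here are just examples. NiFi can then call it with a processor such as InvokeHTTP and consume the JSON response downstream:

```groovy
// Minimal, framework-free REST endpoint sketch (Groovy + the JDK's built-in HttpServer).
import com.sun.net.httpserver.HttpServer
import groovy.json.JsonOutput
import groovy.json.JsonSlurper

def server = HttpServer.create(new InetSocketAddress(8080), 0)

server.createContext('/classify') { exchange ->
    def event = new JsonSlurper().parse(exchange.requestBody)   // JSON body sent by NiFi
    def response = JsonOutput.toJson([
        id      : event.id,
        priority: event.source == 'crm' ? 'high' : 'normal'     // the tested business logic lives here
    ])
    def bytes = response.getBytes('UTF-8')
    exchange.responseHeaders.add('Content-Type', 'application/json')
    exchange.sendResponseHeaders(200, bytes.length)
    exchange.responseBody.withCloseable { it.write(bytes) }
}

server.start()
```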

The approach with microservices works well unless large amounts of data are sent through the network. In other words, it suits scenarios where the complex logic can be kept separate from the data processed at scale. In other cases, Apache Spark jobs can be triggered from NiFi.

Conclusion

What we loved? NiFi is like Lego Duplo, and it's great that it can be extended with other Lego bricks: Groovy scripts, custom processors or offloading logic to microservices. Each of these approaches has its pros and cons. It's always good to have multiple options and pick the one that serves your needs best.

What we hated? When working with real-life business logic, we prefer using Apache Spark for bigger data volumes, or REST services for smaller amounts of data. In other words, for custom logic we prefer avoiding NiFi.

big data
technology
apache nifi
getindata
2 October 2020
