Tutorial

NiFi Ingestion Blog Series. PART III - No coding, just drag and drop what you need, but if it’s not there… - custom processors, scripts, external services

Apache NiFi, a big data processing engine with a graphical WebUI, was created to give non-programmers the ability to swiftly and codelessly create data pipelines, freeing them from those dirty, text-based methods of implementation. Unfortunately, we live in a world of trade-offs, and those features come at a price. The purpose of our blog series is to present our experience and the lessons learned when working with production NiFi pipelines. This will be organised into the following articles:


In this post we focus on creating data flows with ready-to-go processors, the limitations of such an approach and the solutions we have applied. You can read the previous posts on our blog.

What should you do when you reach the limits?

We are genuine fans of Lego, which is typical of many engineers ;-) Lego provides different brick series for different age groups, and each brings different capabilities to what can be built. One can create almost anything with Lego Technic, but building something more complex takes time, even for a grown-up. On the other hand, with Lego Duplo a three-year-old can build tall towers really fast. The only issue comes when one wants to add more detail, because the huge Duplo cuboids prevent you from creating custom things. Fortunately, different Lego series can be mixed - for instance, Lego Classic on top of Duplo.


If data flows were built out of Lego, then NiFi definitely stands for Duplo. You can build great things really fast and there are some options to create custom logic when no out-of-the-box NiFi processor is present.

It’s fast, it’s easy and it’s free - custom Groovy scripts

A really nice extension can be achieved with the ExecuteGroovyScript processor, which allows writing scripts in Groovy. It is well integrated with NiFi, and just a couple of lines of Groovy code can solve really complex problems. The processor lets you put a script body into a text property and you are done. The only disadvantage is the manual testing within NiFi each time the script is modified. As the approach solves many issues quickly, it can become really popular within a NiFi flow. At some point you realize that the flow contains dozens of inline Groovy scripts that share common logic.
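To give a flavour of how little code such a script needs, here is a minimal sketch of an inline script body. The `session` and `REL_SUCCESS` bindings are provided by the ExecuteGroovyScript processor itself; the attribute names and the prefix-extraction logic are hypothetical examples, not part of the original post.

```groovy
// Sketch of an inline ExecuteGroovyScript body. `session` and `REL_SUCCESS`
// are bindings injected by the processor; attribute names are hypothetical.
def flowFile = session.get()
if (flowFile == null) return

// Example logic: derive a "source.system" attribute from the filename prefix,
// e.g. "crm_orders.csv" -> "crm"
def filename = flowFile.getAttribute('filename') ?: 'unknown'
flowFile = session.putAttribute(flowFile, 'source.system', filename.tokenize('_')[0])

session.transfer(flowFile, REL_SUCCESS)
```

A handful of lines like these, pasted into the Script Body property, is all it takes - which is exactly why such scripts tend to multiply across a flow.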

This can be solved with a Groovy project that contains classes instead of scripts, fully covered with unit tests and packaged into a jar file. Once the jar file is deployed on all NiFi nodes, it is included in the classpath of the ExecuteGroovyScript processors, and class methods from the jar are called instead of writing inline scripts. The serious disadvantage we encountered was that, after uploading a new jar onto the NiFi nodes, the classpath had to be reloaded manually for each processor to pick up the new version. Another was that the Groovy code reads flowfile attributes, so both systems become tightly coupled. In other words, if you want to change an attribute in NiFi, you not only need to check all the NiFi processors, but also make sure the change won't break the Groovy code stored in another repository. That's how the most monolithic of monoliths gets built.
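As an illustration of the class-in-a-jar approach, the inline logic from above could move into a plain Groovy class that is trivially unit-testable outside NiFi. The class and method names here are hypothetical:

```groovy
// Sketch of a unit-testable helper class shipped in a jar on the NiFi
// classpath; the script then calls AttributeUtils.sourceSystem(...) instead
// of duplicating the logic inline. Names are hypothetical.
class AttributeUtils {
    /** Extracts the system prefix from a filename like "crm_orders.csv". */
    static String sourceSystem(String filename) {
        filename?.tokenize('_')?.first() ?: 'unknown'
    }
}

// The kind of assertion a unit test would make:
assert AttributeUtils.sourceSystem('crm_orders.csv') == 'crm'
assert AttributeUtils.sourceSystem(null) == 'unknown'
```

The inline script shrinks to a one-line method call, and the logic gains tests - at the cost of the coupling and redeployment issues described above.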

We need to go deeper - custom processors?

While scripts provide a great interface for extending the functionality of NiFi, they have some limitations, from both usability and maintenance perspectives. To maximize customization, we can create our own processors, which have a few notable advantages over scripts. The first is better abstraction: with a processor, the user can look into the built-in documentation, check the help messages next to each property name and so on, whereas with a script, looking into the script code is almost always necessary. We can also define as many output connections as we want, instead of just success and failure. In addition, because every custom processor is just a Maven project, it can make use of all the traditional programming practices: versioning with a VCS like Git, test frameworks and CI/CD pipelines. What's more, NiFi provides an interface for adding new components in a plugin-like manner, so there is no need to recompile anything. Since version 1.8 there has even been the option of dynamically adding new components at runtime, and switching between component versions is available from the WebUI. Unfortunately, NiFi will ignore components with the same version as ones previously loaded, so it's impossible to dynamically replace a jar with an already existing version.
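To make the comparison with scripts concrete, here is a minimal sketch of a custom processor. NiFi processors are usually written in Java; it is shown in Groovy here for consistency with the earlier snippets, and it assumes the `nifi-api` dependency in the Maven project. The processor name, relationship and attribute logic are hypothetical:

```groovy
// Sketch of a minimal custom NiFi processor (requires the nifi-api
// dependency). Class, relationship and attribute names are hypothetical.
import org.apache.nifi.processor.AbstractProcessor
import org.apache.nifi.processor.ProcessContext
import org.apache.nifi.processor.ProcessSession
import org.apache.nifi.processor.Relationship
import org.apache.nifi.processor.exception.ProcessException

class TagSourceSystem extends AbstractProcessor {
    // Unlike scripts, a processor can declare any number of relationships
    static final Relationship REL_TAGGED = new Relationship.Builder()
        .name('tagged')
        .description('FlowFiles enriched with a source.system attribute')
        .build()

    @Override
    Set<Relationship> getRelationships() { [REL_TAGGED] as Set }

    @Override
    void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        def flowFile = session.get()
        if (flowFile == null) return
        def source = (flowFile.getAttribute('filename') ?: 'unknown').tokenize('_')[0]
        session.transfer(session.putAttribute(flowFile, 'source.system', source), REL_TAGGED)
    }
}
```

The same logic as the inline script, but now with named relationships, built-in documentation hooks and a proper build - along with all the packaging ceremony that entails.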

All those mechanisms are great for programmers, but since NiFi is a tool designed for people who do not necessarily like to code, the additional complexity of creating components is a major downside compared to scripts. Everything has to be done in accordance with the NiFi framework - even the necessity of using Maven or a similar build system is a major complication, especially if the task executed by the component is fairly simple. Another disadvantage is that you need access to the NiFi cluster via ssh, or a configured CI/CD pipeline, to deploy a custom processor into NiFi, which might be a problem security-wise. Just as with scripts, this only adds more parts to one monolithic system.

Offloading business logic from NiFi

No one plans to build monolith monsters. They are just tiny bricks of tightly coupled things, added one by one, day after day. The best way to avoid tight coupling is to use state-of-the-art engineering methods such as... microservices. Microservices encapsulate business logic into elegant, tiny components with a clearly defined API. This is something that really worked in our projects. Whenever some complex logic is required, instead of dozens of untestable NiFi processors, it is really worth creating a REST service endpoint. We favour this approach most of all because NiFi can easily send HTTP/HTTPS requests and handle JSON responses. There are plenty of mature frameworks for writing REST services in the language of your preference. The lack of unit tests in NiFi is a serious limitation: when you build complex things and have unit tests, you can easily refactor your code and continuously make it better; without them, making improvements is risky and often avoided, so the code base or NiFi flow becomes difficult to maintain. Moreover, the microservices can be used by other systems to communicate with NiFi.
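As a sketch of what such an endpoint can look like, here is a framework-free example built on the JDK's bundled HTTP server, small enough that NiFi's InvokeHTTP processor could POST JSON to it and route on the response. The path, port and enrichment rule are hypothetical:

```groovy
// Sketch of a minimal REST endpoint NiFi could call via InvokeHTTP,
// using the JDK's built-in HTTP server. Path, port and the enrichment
// rule are hypothetical examples.
import com.sun.net.httpserver.HttpServer
import groovy.json.JsonOutput
import groovy.json.JsonSlurper

def server = HttpServer.create(new InetSocketAddress(8080), 0)
server.createContext('/enrich') { exchange ->
    // Parse the JSON body sent by NiFi
    def record = new JsonSlurper().parse(exchange.requestBody)
    // Hypothetical business logic, unit-testable outside NiFi
    record.category = (record.amount ?: 0) > 100 ? 'large' : 'small'
    byte[] response = JsonOutput.toJson(record).bytes
    exchange.responseHeaders.add('Content-Type', 'application/json')
    exchange.sendResponseHeaders(200, response.length)
    exchange.responseBody.withCloseable { it.write(response) }
}
server.start()
```

The business logic lives in one small, versioned, testable service with a clear API, while the NiFi flow stays a thin routing layer around an HTTP call.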

The microservice approach works well unless big amounts of data are sent through the network. In other words, it suits scenarios where the complex logic can be kept separate from data volumes at scale. In other cases, Apache Spark jobs can be triggered from NiFi.

Conclusion

What we loved? NiFi is like Lego Duplo, and it's great that it can be extended with other Lego bricks: Groovy scripts, custom processors or logic offloaded to microservices. Each of these approaches has its pros and cons. It's always good to have multiple options and pick the one that serves your needs best.

What we hated? When working with real-life business logic, we prefer using Apache Spark for bigger data volumes, or REST services for smaller amounts of data. In other words, for custom logic we prefer avoiding NiFi.

big data
technology
apache nifi
getindata
2 October 2020

