
NiFi Ingestion Blog Series. PART I - Advantages and Pitfalls of Lego Driven Development

Apache NiFi, a big data processing engine with a graphical web UI, was created to give non-programmers the ability to build data pipelines quickly and without code, freeing them from those dirty, text-based methods of implementation. Unfortunately, we live in a world of trade-offs, and features come with a price. The purpose of this blog series is to present our experience and the lessons we learned while working with production NiFi pipelines. The series is organised into several articles.

This post is a brief summary of the strengths and weaknesses of Apache NiFi from the perspective of a flow developer.


Rapid Development

Let’s first go through the things that are part of the user experience: imagine you are a data engineer and your task is to create a proof of concept for simple data ingestion into a data lake. Assume you may have multiple sources and that no transformations or business logic are required. The use case is just moving data from point A to point B, with some logging in the middle. NiFi seems to be the right tool for the job.

The first part is development. You’d like to be done with your work and move on to more important stuff, like saving the world or that funny YouTube video. Fortunately for you, NiFi has hundreds of ready-to-use processors, which include a variety of connectors to data sources and data sinks. All you have to do is drag and drop them from the top bar onto the canvas and configure them. If you have any doubts about the properties or a component in general, you can right-click on it and open its documentation. After that, you create connections between processors to set the order in which they will be executed. BAM, you’re pretty much done. Because the data flow you created is its own visual representation, checking whether the logic is correct is a trivial task.

The next step is testing whether everything works. NiFi allows you to start and stop each processor separately, so you can control up to which point you want to execute your flow, or from which point to start. You can also check the input and output of every stage of processing. In case of failure, you get a bulletin describing what happened. Combine those three features and you get great insight into the exact workings of the flow. With that, the proof of concept solution is done. You’re generally impressed with the simplicity and capabilities of your new tool.

Maturity of Development

Refactoring is a natural part of the development process. This often includes reorganising the flow and moving processors to another process group. At this point, it becomes apparent that any element containing data cannot be removed or moved to a different process group. This seems to be a valid choice for NiFi’s architecture, but in many cases it is a pitfall that forces you to look for hidden flowfiles before refactoring process groups. You need to find them manually, as there is no automatic way to delete all data within a process group.
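To make the hunt for leftover flowfiles more concrete, here is a minimal Python sketch of the search itself. It works on a simplified in-memory model: the `ProcessGroup` and `Connection` classes and the `queued_flowfiles` field are hypothetical and are not NiFi's API. In a real setup the queue sizes would first have to be read from the NiFi instance.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Connection:
    name: str
    queued_flowfiles: int  # hypothetical field: flowfiles waiting in this queue


@dataclass
class ProcessGroup:
    name: str
    connections: List[Connection] = field(default_factory=list)
    children: List["ProcessGroup"] = field(default_factory=list)


def find_non_empty_queues(group: ProcessGroup, path: str = "") -> Iterator[Tuple[str, int]]:
    """Yield (path, queued count) for every connection that still holds data."""
    current = f"{path}/{group.name}" if path else group.name
    for conn in group.connections:
        if conn.queued_flowfiles > 0:
            yield f"{current}/{conn.name}", conn.queued_flowfiles
    # Nested process groups can hide data too, so walk them recursively.
    for child in group.children:
        yield from find_non_empty_queues(child, current)


# Made-up example flow: one nested group with a single non-empty queue.
ingest = ProcessGroup(
    name="ingest",
    connections=[Connection("fetch -> log", 0)],
    children=[
        ProcessGroup(
            name="sftp-source",
            connections=[Connection("list -> fetch", 3), Connection("fetch -> out", 0)],
        )
    ],
)

blocking = list(find_non_empty_queues(ingest))
print(blocking)  # [('ingest/sftp-source/list -> fetch', 3)]
```

The returned paths are the queues that would have to be emptied before the group can be moved or deleted.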

As a result of being lost in thought, you delete the wrong component… yup, your worries are spot-on: NiFi does not have an “undo” button anywhere. When you do something, it’s set in stone. Frustration makes you look for a solution to this problem. The only one available is storing a copy of the process group in NiFi Registry, a version control system for flows. Although its functionality is fairly limited, it allows rollbacks to the last committed version.

Things are rarely simple in IT. Even if they appear so at first, it doesn’t usually last long. More often than not, the main functionality, the so-called “happy path”, is only a small fraction of the actual work. Sooner or later, other scenarios need to be considered: whenever we connect to an external service, retries are needed, error states need to be handled, etc.
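As an illustration of how much non-happy-path work even one external call needs, here is a minimal Python sketch of a retry wrapper. The helper `call_with_retries` and the fake service `flaky_fetch` are hypothetical names made up for this example, and the linear backoff is just one possible policy.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(action: Callable[[], T], attempts: int = 3, delay_seconds: float = 0.0) -> T:
    """Run `action`, retrying on ConnectionError up to `attempts` times in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except ConnectionError:
            if attempt == attempts:
                raise  # error state: give up and let the caller decide what to do
            time.sleep(delay_seconds * attempt)  # simple linear backoff
    raise AssertionError("unreachable")


# Hypothetical external service that fails twice before answering.
calls = {"count": 0}


def flaky_fetch() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("service unavailable")
    return "payload"


fetched = call_with_retries(flaky_fetch, attempts=3)
print(fetched, calls["count"])
```

The happy path is the single `return action()` line; everything else exists only for the failure scenarios.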

The issues mentioned above are not NiFi-specific but rather a natural consequence of IT development. Because of that, most programming languages in current use have mechanisms created specifically to mitigate them. In object-oriented programming, whenever we have functionality that is used multiple times or is simply complex, we can extract certain steps into separate methods, classes or packages. Given that a method or object can have explicitly stated parameters which we supply, it’s possible to hide a significant portion of the complexity. This behaviour is, at least currently (June 2020), impossible to achieve in NiFi. All information in NiFi is stored in flowfile attributes, which are passed implicitly through all steps of processing. There is no way of knowing what exactly is required without knowing the specifics of the mechanism (the abstraction leaks). This is like programming with all the instance variables being global and accessed directly. If that wasn’t enough, inner process groups do not throw exceptions: if an error occurs, it has to be handled in all parent process groups.
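The difference between explicit parameters and implicitly passed attributes can be sketched in a few lines of Python. This is only an analogy, not NiFi code: both functions are hypothetical, and apart from `filename` the attribute names are invented for the example.

```python
from typing import Dict


# Explicit contract: the signature states exactly what the step needs.
def build_target_path(base_dir: str, source_name: str, file_name: str) -> str:
    return f"{base_dir}/{source_name}/{file_name}"


# Implicit contract, similar in spirit to flowfile attributes: the step reads
# whatever it needs from a shared bag of strings that every step can modify.
def build_target_path_from_attributes(attributes: Dict[str, str]) -> str:
    return "{}/{}/{}".format(
        attributes["target.base.dir"],  # hypothetical attribute name
        attributes["source.name"],      # hypothetical attribute name
        attributes["filename"],
    )


explicit = build_target_path("/data/lake", "crm", "users.csv")

attributes = {"target.base.dir": "/data/lake", "source.name": "crm", "filename": "users.csv"}
implicit = build_target_path_from_attributes(attributes)

# If an upstream step drops or renames an attribute, the implicit version
# fails only at run time, far from where the change was made.
del attributes["source.name"]
try:
    build_target_path_from_attributes(attributes)
    missing = None
except KeyError as err:
    missing = err.args[0]

print(explicit, implicit, missing)
```

With the explicit version, a reader sees the required inputs in the signature; with the attribute-based version, the only way to learn them is to read the body of every step.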

Unit tests and integration tests are important. They let you modify and refactor the code to make it cleaner while being sure that business features behave the same way before and after the changes. Remember when you were a little kid and testing your code basically meant running it on some input once and hoping that nothing would crash later? Yup, NiFi doesn’t have any testing framework for flows, so no automated tests, no matter how loud you cry.
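For comparison, this is roughly what that safety net looks like in ordinary code. It is a minimal sketch using Python's standard `unittest` module. The `route_by_size` rule is a hypothetical piece of business logic invented for the example.

```python
import unittest


def route_by_size(size_bytes: int, threshold_bytes: int = 1024) -> str:
    """Hypothetical business rule: decide where a file of a given size is routed."""
    if size_bytes < 0:
        raise ValueError("size cannot be negative")
    return "large" if size_bytes >= threshold_bytes else "small"


class RouteBySizeTest(unittest.TestCase):
    def test_small_file(self):
        self.assertEqual(route_by_size(10), "small")

    def test_threshold_is_inclusive(self):
        self.assertEqual(route_by_size(1024), "large")

    def test_negative_size_is_rejected(self):
        with self.assertRaises(ValueError):
            route_by_size(-1)


# Run the tests programmatically so the example is self-contained.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(RouteBySizeTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Once such tests exist, the body of `route_by_size` can be rewritten freely, and a failing test shows immediately if the behaviour has changed.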


What we loved: NiFi web UI allows lightning fast development.

What we hated: It has serious limitations when compared to programming languages.

The better you know a technology, the more of its limitations you are aware of. Would we choose NiFi for our previous projects if we had the knowledge we have now? The answer is YES, but in some cases we would use NiFi in a slightly different way. Stay tuned for further posts to read what we mean by that.

Don't forget to read the previous blog post "Apache NiFi - why do data engineers love it and hate it at the same time"

big data
apache nifi
Software Engineering
8 September 2020
