Tutorial
7 min read

NiFi Ingestion Blog Series. Part VI - I only have one rule and that is … - recommendations for using Apache NiFi

Apache NiFi, a big data processing engine with graphical WebUI, was created to give non-programmers the ability to swiftly and codelessly create data pipelines and free them from those dirty, text-based methods of implementation. Unfortunately, we live in a world of trade-offs, and those features come with a price. The purpose of our blog series is to present our experience and lessons learned when working with production NiFi pipelines. This will be organised into the following articles:


In this post we try to sit back, think of all the details presented in the previous articles and extract some general rules and lessons learned that may be useful to other data engineers. 

Harness the complexity 

Flow visualization is one of the greatest features of NiFi: usually, when you create a flow, it's basically self-explanatory. If we want to keep reaping that benefit, we must keep the structure of the flow fairly clear. The rules are mostly analogous to those used in writing code. It's worth mentioning that what is considered a clear structure is subjective, so the suggestions here are rules of thumb rather than a rigid framework; nonetheless, here they are:

  • If your flow doesn't fit on the visible canvas, or you have to zoom out so far that you can no longer read the processor names, it's probably time to put some parts of the flow into process groups.
  • If your flow gets too deep, with more than four levels of nesting, it may be time to restructure it or move logic into scripts or custom processors.
  • If possible, avoid manipulating attributes that are widely used in the flow. An attribute, once set, is often used in many places, and changing it in one place has a great chance of breaking logic in later steps.
  • Keep as much as possible explicit. It's really tempting, especially while writing scripts, to just manipulate the attributes that are implicitly available, but if you ever have to debug where a certain attribute changes, it's usually better to have it specified in processor properties than to have to look for it in a script body. In the case of custom processors, keep good documentation.
  • Sometimes your flow simply has too much logic inside. In this case, you might think about splitting it into a few flows and connecting them with an external service like Kafka. The decrease in size makes each flow more readable and keeps the parts moderately separate, so you only have to debug specific parts. It is like eating spaghetti made of a single noodle: the shorter the noodle, the cleaner the flow.

Choose the right tool

To avoid pitfalls while developing, it's important to remember that while NiFi is great for a variety of problems, for some it's just… not. Developing some features in NiFi is simply not feasible when they are readily available in other tools, so it is a huge mistake (unfortunately one made more often than it should be) to equate the size of your technological stack with the complexity of the solution. Integrating with other processing engines, databases, microservices and so on is a core assumption built into the design of NiFi. So even if most of the processing lives in NiFi, it's always good to ask yourself whether this is the right tool for the job.

If you want to do stream processing with windowing or complex logic, consider other technologies like Apache Flink. If you need batch processing on a Hadoop cluster, think of executing Hive queries from NiFi. If the processing cannot be defined with SQL, consider writing a separate Spark job for it. On the other hand, if you need to manage files on HDFS or generate and run Hive queries, then NiFi is a really good choice.
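To make the "generate and run Hive queries" case concrete, the sketch below shows in plain Python the kind of templating a NiFi flow typically performs (for example, ReplaceText with Expression Language feeding a PutHiveQL processor). The table name, partition column and HDFS path are invented for illustration:

```python
# Sketch of the query generation NiFi handles well. In a real flow the
# table, date and location would come from flowfile attributes; here
# they are made-up example values.

def add_partition_query(table, partition_date, hdfs_location):
    """Generate an ALTER TABLE statement that registers a new HDFS
    directory as a Hive partition."""
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (dt='{partition_date}') "
            f"LOCATION '{hdfs_location}/dt={partition_date}'")

# A daily ingestion flow would emit one such statement per landed directory
print(add_partition_query("events", "2020-12-02", "/data/events"))
# ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt='2020-12-02') LOCATION '/data/events/dt=2020-12-02'
```

This is exactly the sweet spot: NiFi tracks the files arriving on HDFS, fills in the template, and hands the statement to Hive, with no separate job to deploy.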

Choosing NiFi does not mean avoiding writing any code, but the amount of code that has to be written can be significantly reduced.

Keep your finger on the pulse

Development in NiFi is based on using out-of-the-box processors, which makes developers more dependent on available solutions than in classic development. We can of course create our own custom solutions by implementing the functionality with a flow, a script or some other custom approach, but this is usually problematic maintenance-wise. Consequently, it's vital to stay up to date with the features added in new versions of NiFi. We learned this when we needed a retry mechanism for communicating with 3rd party services like Hive, HDFS, etc. There was no available solution, so we implemented a retry process group that did what we needed. The only issue was that the process group contained eleven processors and was placed in multiple places in the flow, which resulted in around 250 extra processors. Fortunately, a couple of months later the RetryFlowFile processor was released, so we upgraded NiFi to a newer version and used it instead.
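The core of such a retry mechanism is a small routing decision made per flowfile. The following is an illustrative sketch in plain Python, not NiFi code: the attribute name, relationship names and retry limit are our own choices here, loosely modelled on what a retry loop (or the RetryFlowFile processor) has to do:

```python
# Illustrative sketch of per-flowfile retry routing. Attribute and
# relationship names below are assumptions for this example, not the
# exact ones RetryFlowFile uses.

MAX_RETRIES = 3
RETRY_ATTRIBUTE = "flowfile.retries"

def route_for_retry(attributes, max_retries=MAX_RETRIES):
    """Increment the retry counter stored in the flowfile's attributes
    and decide which relationship the flowfile should follow.

    Returns a (relationship, updated_attributes) pair: 'retry' while
    attempts remain, 'retries_exceeded' once the limit is passed.
    """
    attrs = dict(attributes)  # flowfile attributes are string-valued
    retries = int(attrs.get(RETRY_ATTRIBUTE, "0")) + 1
    attrs[RETRY_ATTRIBUTE] = str(retries)
    if retries <= max_retries:
        return "retry", attrs
    return "retries_exceeded", attrs

# Simulate a flowfile that keeps failing against an external service
attrs = {}
for _ in range(4):
    relationship, attrs = route_for_retry(attrs)
print(relationship, attrs[RETRY_ATTRIBUTE])  # retries_exceeded 4
```

Simple as it is, expressing this loop with generic processors took eleven of them per occurrence; a single purpose-built processor collapses all of that, which is why tracking new releases pays off.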

The lessons learned are clear:

  • If you are solving a generic problem, sooner or later someone else will solve it too. 
  • Keep up to date with the NiFi change log to see what is coming with the latest releases. 

CI and CD take time

From our experience, continuous integration and continuous deployment of NiFi projects are much more time-consuming than with other processing technologies. Depending on how sensitive the data is and how critical the process is, there are a few options for handling it.

  • Develop directly on a single main NiFi cluster. Disclaimer: developing on production is generally a really bad idea, but in some cases it's possible. You have one registry, so versioning is easy.
  • Have a single NiFi Registry and multiple NiFi clusters. If you can afford to have production elements modifiable from a development environment, this can solve a lot of the problems with moving flows around.
  • If you can't do any of the above… well, good luck, there is a lot of work ahead of you. NiFi has no out-of-the-box support for migrating whole flows between environments. It is doable, but really hard if you want an automated process.
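Part of that last option can be scripted against the NiFi Registry REST API, which exposes versioned flow snapshots as JSON. Below is a minimal, hypothetical sketch of two pieces of such a promotion script; the endpoint shape follows the Registry REST API, but the hostname, bucket and flow IDs, and the variable names are invented for illustration:

```python
# Hypothetical fragment of a flow-promotion script. Registry hostname,
# IDs and variable keys below are made up; the /buckets/.../flows/...
# path shape follows the NiFi Registry REST API.

DEV_REGISTRY = "https://dev-registry.example.com/nifi-registry-api"

def snapshot_url(registry_base, bucket_id, flow_id, version):
    """Build the URL for one version of a flow snapshot in a registry."""
    return (f"{registry_base}/buckets/{bucket_id}"
            f"/flows/{flow_id}/versions/{version}")

def rewrite_variables(snapshot, overrides):
    """Replace environment-specific variables (hostnames, paths, etc.)
    in a snapshot dict before importing it into the target environment."""
    flow = snapshot.get("flowContents", {})
    variables = dict(flow.get("variables", {}))
    variables.update(overrides)
    flow["variables"] = variables
    snapshot["flowContents"] = flow
    return snapshot

url = snapshot_url(DEV_REGISTRY, "bucket-1234", "flow-5678", 3)
print(url)
# A real script would GET this URL, run the snapshot through
# rewrite_variables, and push the result to the target registry.
```

The URL construction is the easy part; rewriting environment-specific values in the snapshot, and doing so reliably for every processor and controller service, is where most of the automation effort hides.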


Conclusion

This is the sixth and final post in our series. We've seen comments under some of the previous posts saying that NiFi s**ks, which we don't agree with. We are engineers who have spent quite some time with NiFi, so we write about the things we had issues with and solved. The technology is not fully mature yet; it is still evolving. For many scenarios, development in NiFi is lightning fast, and it is definitely, without any shadow of a doubt, a technology we recommend.

big data
technology
apache nifi
hadoop
hdfs
hive
kafka
2 December 2020
