Apache NiFi, a big data processing engine with a graphical web UI, was created to let non-programmers build data pipelines quickly and without code, freeing them from those dirty, text-based methods of implementation. Unfortunately, we live in a world of trade-offs, and those features come at a price. The purpose of our blog series is to present our experience and the lessons we learned while working with production NiFi pipelines. The series is organised into the following articles:
Apache NiFi - why do data engineers love it and hate it at the same time?
Part I - Fast development, painful maintenance
Part II - We have deployed, but at what cost… - CI/CD of NiFi flow
Part IV - Universe made out of flow files - NiFi architecture
Part V - It’s fast and easy, what could possibly go wrong - one year history of certain NiFi flow
I have only one rule and that’s … - recommendations for using Apache NiFi
In this post we sit back, think over the details presented in the previous articles, and extract some general rules and lessons learned that may be useful to other data engineers.
Flow visualization is one of the greatest features of NiFi. Usually, when you create a flow, it is basically self-explanatory. If we want to keep reaping this benefit, we must keep the structure of the flow fairly clear. The rules are mostly analogous to those used in writing code. What counts as a clear structure is subjective, so the suggestions here are rules of thumb rather than a rigid framework. Nonetheless, here they are:
To avoid pitfalls while developing, it's important to remember that while NiFi is great for a variety of problems, for some it’s just… not. Developing a feature in NiFi is simply not worthwhile when other tools already provide it, so it is a huge mistake (unfortunately made more often than it should be) to equate the size of your technological stack with the complexity of the solution. Integration with other processing engines, databases, microservices etc. is a core assumption built into the design of NiFi. So even if most of the processing is done in NiFi, it’s always good to ask yourself whether it is the right tool for the job.
If you want to do stream processing with windowing or more complex logic, consider other technologies such as Apache Flink. If you need batch processing on a Hadoop cluster, think of executing Hive queries from NiFi. If the processing cannot be defined with SQL, consider writing a separate Spark job for it. On the other hand, if you need to manage files on HDFS or generate and run Hive queries, then NiFi is a really good choice.
Development in NiFi is based on out-of-the-box processors, which makes developers more dependent on the available solutions than in classic development. We can of course build our own custom solutions by implementing the functionality with a flow, a script or some other custom approach, but that is usually problematic maintenance-wise. Consequently, it’s vital to stay up to date with the features added in new versions of NiFi. We learned this when we needed a retry mechanism for communicating with third-party services like Hive, HDFS, etc. There was no available solution, so we implemented a retry process group that did what we needed. The only issue was that the process group contained eleven processors and was placed in multiple places in the flow, which resulted in around 250 extra processors. Fortunately, a couple of months later the RetryFlowFile processor was released, so we upgraded NiFi to a newer version and switched to the built-in processor.
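To make the retry idea concrete, here is a minimal Python sketch of the general technique: keep a retry counter in a flow file attribute and give up once a limit is reached. It is not NiFi code and not how RetryFlowFile is implemented. The `FlowFile` class, the function names, the `flowfile.retries` attribute name and the limit of 3 are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical names and values, chosen for this illustration only.
RETRY_ATTRIBUTE = "flowfile.retries"
MAX_RETRIES = 3


@dataclass
class FlowFile:
    """Toy stand-in for a flow file: content plus string attributes."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


def route_after_failure(flow_file: FlowFile, max_retries: int = MAX_RETRIES) -> str:
    """Pick a route for a flow file whose last external call failed.

    Returns "retry" (and increments the counter attribute) while the limit
    has not been reached, otherwise "retries_exceeded".
    """
    attempts = int(flow_file.attributes.get(RETRY_ATTRIBUTE, "0"))
    if attempts >= max_retries:
        return "retries_exceeded"
    flow_file.attributes[RETRY_ATTRIBUTE] = str(attempts + 1)
    return "retry"


def send_with_retries(
    flow_file: FlowFile,
    call_service: Callable[[FlowFile], None],
    max_retries: int = MAX_RETRIES,
) -> str:
    """Call an external service, looping back on failure up to the limit."""
    while True:
        try:
            call_service(flow_file)
            return "success"
        except ConnectionError:
            if route_after_failure(flow_file, max_retries) == "retries_exceeded":
                return "retries_exceeded"
```

With a limit of 3, a service that never recovers is called four times in total: the initial attempt plus three retries. A real flow would normally also wait between attempts, which this sketch leaves out. Wrapping this logic in a single reusable unit, rather than copying an eleven-processor group around the flow, is what keeps the flow small.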
The lessons learned are clear:
From our experience, continuous integration and continuous deployment of NiFi projects are much more time-consuming than with other processing technologies. Depending on how sensitive the data is and how critical the process is, there are a few options for handling it.
This is the sixth and last post in our series. Under some of the previous posts we’ve seen comments saying that NiFi s**ks, which we don't agree with. We are engineers who have spent quite some time with NiFi, so we write about the issues we ran into and solved. The technology is not fully mature yet; it is still evolving. For many scenarios, development in NiFi is lightning fast, and it is definitely, without any shadow of a doubt, a technology we recommend.