GetInData in 2021 - let’s celebrate our achievements in the Big Data world!
The year 2021 passed in the blink of an eye and the time has come to summarize our goals at GetinData and define our challenges for the next year…
Read moreApache NiFi, a big data processing engine with graphical WebUI, was created to give non-programmers the ability to swiftly and codelessly create data pipelines and free them from those dirty, text-based methods of implementation. Unfortunately, we live in a world of trade-offs, and those features come with a price. The purpose of our blog series is to present our experience and lessons learned when working with production NiFi pipelines. This will be organised into the following articles:
Apache NiFi - why do data engineers love it and hate it at the same time?
Part I - Fast development, painful maintenance
Part II - We have deployed, but at what cost… - CI/CD of NiFi flow
Part IV - Universe made out of flow files - NiFi architecture
Part V - It’s fast and easy, what could possibly go wrong - one year history of certain NiFi flow
I have only one rule and that’s … - recommendations for using Apache NiFi
In this post we try to sit back, think of all the details presented in the previous articles and extract some general rules and lessons learned that may be useful to other data engineers.
Flow visualization is one of the greatest features of NiFi, usually when you create the flow, it’s basically self-explanatory. If we want to keep reaping the benefits of it, we must keep the structure of the flow fairly clear. Rules are mostly analogical to those used in writing code. It’s worth mentioning that what is considered a clear structure is subjective, so suggestions here will be more like rule of thumb, rather than a rigid framework, nonetheless here they are:
To avoid pitfalls while developing, it's important to remember that while NiFi is great for a variety of problems, for some it’s just… not. Developing some features in NiFi is just not feasible when they are available in other tools so it is a huge mistake (unfortunately happening more often than it should) to equate the size of your technological stack with the complexity of the solution. It is a core assumption built into the design of Nifi to integrate with other processing engines, databases, microservices etc. So even if most of the processing is in Nifi, it’s always good to ask yourself whether this is the right tool for this job.
If you want to do the stream processing with windowing or some logic, consider other technologies like Apache Flink. If you need batch processing on a Hadoop cluster, think of executing Hive queries from NiFi. If the processing cannot be defined with SQL, consider writing a separate Spark job for it. On the other hand, if one needs to manage files on HDFS or generate and run Hive queries, then NiFi is a really good choice.
The development in NiFi is based on using out-of-the-box processors, that makes the developers dependent on available solutions more than in classic development. We can of course create our own custom solutions by implementing the functionality with some flow, script or other custom approach, but it’s usually problematic maintenance-wise. In consequence, it’s vital to stay up to date with features added to new versions of NiFi. This happened to us when we needed a retry mechanism for communicating with 3rd party services like Hive, HDFS, etc. There was no available solution so we implemented a retry process group that has done what we needed. The only issue with this was that the process group contained eleven processors and was placed in multiple places in the flow, which resulted in around 250 extra processors. Fortunately, a couple of months later the RetryFlowFile
processor was released and we upgraded Nifi to a newer version and used the available processor.
The lessons learned are clear:
From our experience, continuous integration and continuous deployment of NiFi projects are much more time consuming than other processing technologies. Depending on how sensitive the data is and how critical the process, there are a few options of handling it.
This is the 6th post in our series and the last. We’ve seen certain comments saying that NiFi s**ks under some previous posts - which we don't agree with. We are the engineers who have spent quite some time with NiFi, so we write about the things that we had issues with and solved. The technology is not fully mature yet, it is still evolving. For many scenarios, the development of NiFi is lightning fast and is definitely, without any shadow of a doubt, the technology we recommend.
The year 2021 passed in the blink of an eye and the time has come to summarize our goals at GetinData and define our challenges for the next year…
Read moreWelcome to another Power of Big Data series post. In the series, we present the possibilities offered by solutions related to the management, analysis…
Read moreWhat is CICD? It is an acronym for Continuous Integration Continuous Delivery / Deployment. CICD can be also described as the methodology focused on…
Read moreFlink complex event processing (CEP).... ....provides an amazing API for matching patterns within streams. It was introduced in 2016 with an…
Read moreWe would like to announce the dbt-flink-adapter, that allows running pipelines defined in SQL in a dbt project on Apache Flink. Find out what the…
Read moreRecently I’ve had an opportunity to configure CDH 5.14 Hadoop cluster of one of GetInData’s customers to make it possible to use Hive on Spark…
Read moreTogether, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.
What did you find most impressive about GetInData?