6 Big Data Trends For 2021

2020 was a very tough year for everyone. It was a year full of emotions, constant adaptation and transformation, both in our private and professional lives. All of this was driven by a nasty piece of RNA that spread around the world and, as a consequence, changed our comfortable status quo. Thanks to the enormous work of scientists and professionals all over the world, 2021 is a year of hope that the pandemic will come under control, and a year of adapting to the “new normal”, as we will surely not revert to our pre-pandemic reality.

How does this impact the data management field? The pandemic boosted digitalisation, so data became even more important for companies than it was before. Given that we have seen a significant evolution of Big Data technologies over the past two years, we can expect the upcoming year to be pretty interesting.

At Getindata we work with various customers globally, so we can track the evolution of data platforms and the perception of data products in different companies. We are also tech enthusiasts and early adopters of new tech stacks (not to mention our contributions to Open Source), so we held an internal discussion about what to expect in the data management field in the future. Here is a summary of what we think the trends for the Big Data ecosystem will be in 2021.

MLOps

Machine Learning Operations (MLOps) was already a hot topic in 2020, but in our opinion the coming year will be the time for wide adoption of this paradigm. Last year many companies hit a wall when scaling up their Machine Learning modelling, and sorting this out became their number one priority; some of them have already done research or proof-of-concept projects to test out technologies. However, MLOps should not be seen as just a technology for automating ML modelling and serving. It is more about setting up a cross-functional process for creating, testing and serving ML models, one that involves different IT capabilities in the organization that did not work closely together before. It is also a way to bridge the knowledge gap between data analytics teams and IT platform teams, which usually have complementary but barely overlapping technology experience. So if you are only planning to start ML initiatives in your organization, it might be a good idea to shape them as a full MLOps implementation, with proper training, knowledge sharing and good practices in place. Instead of finding your own way to do ML efficiently, which takes time, such a program could put you directly on the right path.

From a technology stack perspective, we have a few tools that are de facto industry standards in their areas (Jupyter, Pandas), but there are many different ideas about how to tackle this matter. They have been turned into software products, software components and whole platforms, with new names popping up pretty frequently. The next year or two will show us which approach becomes an industry standard.

Stream Processing

Real-time analytics implemented with stream processing engines has been around for a while already. What is worth noticing, however, is that these engines have gone from being very complicated pieces of software, implemented and maintained only by experienced professionals, to fairly accessible products that even less tech-savvy people can work with. To give an example, the numerous APIs in Flink, such as its SQL and Python support, mean there is something for everyone. Stream processing capabilities available at your fingertips in public clouds make it even easier to start. In many cases there is actually little extra cost in switching from batch to real-time processing, and the switch brings benefits, as these frameworks already have capabilities in place that make your data ingestion less error-prone and messy. With a gentler learning curve, more use cases can be considered. In many business environments there is an increasing appreciation for having data not only on time but online, available almost instantly, so that you can actually act upon it. As we all know, the value of information decreases over time. Data consumers demand information now, not the next business day. Business stakeholders finally have the possibility to get something out of it for themselves.
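To make the difference between batch and real-time processing more tangible, here is a minimal sketch in plain Python. It is not Flink code; the function name `tumbling_window_counts` and the event shape are our own assumptions for illustration. It counts events per key in fixed, non-overlapping (tumbling) windows and emits each result as soon as its window closes, instead of waiting for the whole dataset.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# Assumed event shape for this sketch: (event time in seconds, key).
Event = Tuple[int, str]


def tumbling_window_counts(
    events: Iterable[Event], window_seconds: int
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_start, counts per key) as soon as each window closes.

    Assumes events arrive ordered by timestamp. Real engines also deal with
    late and out-of-order data, which this sketch deliberately ignores.
    """
    current_start = None
    counts: Counter = Counter()
    for timestamp, key in events:
        window_start = timestamp - (timestamp % window_seconds)
        if current_start is None:
            current_start = window_start
        elif window_start != current_start:
            # The previous window is complete, so emit it right away.
            yield current_start, dict(counts)
            current_start = window_start
            counts = Counter()
        counts[key] += 1
    if current_start is not None:
        # Flush the last, still open window when the input ends.
        yield current_start, dict(counts)


# Hypothetical click events: (seconds since start, page name).
clicks = [(0, "home"), (12, "cart"), (59, "home"), (61, "home"), (130, "cart")]
results = list(tumbling_window_counts(clicks, window_seconds=60))
```

Because the function is a generator, a consumer can act on the first window while later events are still arriving, which is the property that batch jobs lack.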

Cloud Native

The classic Hadoop environment, as we all know, is a technology in decline. Changes among the vendors that supported it commercially have only hastened the end of its domination in the Big Data ecosystem (not to mention public cloud providers, whose data analytics offerings became alluring alternatives). The cloud-native movement, with its containerised software running on modern orchestrators, its everything-as-code paradigm and programmatic infrastructure, has changed the way we build and serve applications. The data management world picked this up with a bit of hesitation and some legacy baggage, but today no one questions that this is how our data platforms will evolve. A clear separation of data storage from processing and querying is progressing in Open Source software, and you can already work this way in the public cloud. In 2021 everyone is looking for Spark fully supported on Kubernetes. There is still an open question about data storage for on-premise deployments: HDFS seems to be the most solid and performant solution, but there are a few initiatives aimed at solving the storage problem for data-intensive applications, such as Ozone. From the user perspective, query federation engines like Trino (formerly known as PrestoSQL) or BigQuery are the next way of working with distributed data sources, and the upcoming year will definitely increase their adoption. There is also the brand new concept of data mesh, a domain-oriented and decentralised data paradigm, but before we start thinking about the possibilities it opens up and learning about its challenges, we first need to get used to the fact that we can easily go beyond our data lake.

Data Discovery

While the idea of having a central catalog of all your data assets is nothing new, it has not seen much adoption, apart perhaps from tech companies, which made it the starting point for their data scientists' work. Data-driven companies, where all data is widely available for everyone to analyse, found it necessary to invest in such a solution. If your data is still maintained in organizational silos, you might not see the value in Data Discovery. However, once you start going beyond your well-trodden paths of doing analytics, you will face the challenge of maintaining knowledge about your data before your data scientists get flooded with the datasets you make available to them. Such a catalog usually includes not only information about data sources but also metadata such as profiling or quality, so you can consciously pick your data for analysis. Data Discovery also becomes a sort of knowledge management tool for the organization. However, this is not just a technological problem, as it is closely related to the Data Governance practices implemented in the organization. We see large potential for organizations in efficiently maintaining knowledge about their data.
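As a rough illustration of what such a catalog entry could hold, here is a minimal sketch in plain Python. The `DatasetEntry` fields and the `search` helper are hypothetical names chosen for this example and do not reflect the data model of any particular Data Discovery tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetEntry:
    """One catalog record: where the data lives plus descriptive metadata."""

    name: str
    owner: str
    description: str
    tags: List[str] = field(default_factory=list)
    row_count: int = 0  # profiling metadata (assumed field)
    null_ratio: Dict[str, float] = field(default_factory=dict)  # quality metadata


def search(catalog: List[DatasetEntry], term: str) -> List[DatasetEntry]:
    """Return entries whose name, description or tags mention the term."""
    needle = term.lower()
    return [
        entry
        for entry in catalog
        if needle in entry.name.lower()
        or needle in entry.description.lower()
        or any(needle == tag.lower() for tag in entry.tags)
    ]


# Hypothetical catalog content, for illustration only.
catalog = [
    DatasetEntry(
        name="sales.orders",
        owner="sales-team",
        description="One row per customer order",
        tags=["sales", "daily"],
        row_count=1_200_000,
        null_ratio={"discount_code": 0.87},
    ),
    DatasetEntry(
        name="web.page_views",
        owner="web-team",
        description="Raw page view events",
        tags=["clickstream"],
    ),
]
matches = [entry.name for entry in search(catalog, "orders")]
```

Keeping profiling and quality figures next to the description is what lets an analyst judge a dataset before querying it.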

Data Quality

There are two more areas that we think will grow next year. As today's data pipelines are more likely to be structured as code, the question of maintaining data quality and observability arises. DataOps is nothing more than DevOps adjusted to data processing. While the pipelines currently running might still live in ETL tools with fancy graphical user interfaces, there will probably be far fewer concerns about building new ones purely as code, with rich testing practices and reproducibility.
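A minimal sketch of what data quality checks written as code might look like, in plain Python. The helper names `not_null`, `unique` and `run_checks` are our own assumptions for illustration and do not come from any specific data quality library.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
# A check takes all rows and returns a list of human-readable problems.
Check = Callable[[List[Row]], List[str]]


def not_null(column: str) -> Check:
    """Build a check that reports rows where the column is missing or None."""

    def check(rows: List[Row]) -> List[str]:
        return [
            f"row {i}: {column} is null"
            for i, row in enumerate(rows)
            if row.get(column) is None
        ]

    return check


def unique(column: str) -> Check:
    """Build a check that reports repeated values in the column."""

    def check(rows: List[Row]) -> List[str]:
        seen = set()
        problems = []
        for i, row in enumerate(rows):
            value = row.get(column)
            if value in seen:
                problems.append(f"row {i}: duplicate {column}={value!r}")
            seen.add(value)
        return problems

    return check


def run_checks(rows: List[Row], checks: List[Check]) -> List[str]:
    """Run every check and collect all reported problems in order."""
    problems: List[str] = []
    for check in checks:
        problems.extend(check(rows))
    return problems


# Hypothetical pipeline output, for illustration only.
orders = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 1, "amount": 5.0},
]
problems = run_checks(orders, [not_null("amount"), unique("order_id")])
```

Since the checks are ordinary functions, they can live in version control next to the pipeline and run in the same CI job, which is the DevOps habit that DataOps borrows.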

Public Cloud

Last but not least, the public cloud offering for analytics is finally a real alternative for data management. Major vendors not only follow the latest advancements in Big Data technology, but in many cases actively participate in charting its development paths. When you move to the cloud, you can focus on shaping your data product to support your use cases instead of managing the complexity of all the moving parts, so many companies want to try it out in the upcoming year. This also holds for companies in heavily regulated sectors.

Last year showed us that talking about trends and trying to predict what will be hot next year can be really tricky when reality decides to play a game with us and turn everything upside down. The year 2021 seems to be more under control, but it still comes with a huge dose of uncertainty. In data management, however, we do not expect revolutions. It is more of a constant evolution, but with more attention from stakeholders, as digitisation has become the only way forward for many companies.

18 January 2021
