Tech News
9 min read

6 Big Data Trends For 2021

big-data-blog-trends-getindata

2020 was a very tough year for everyone. It was a year full of emotions, constant adoption and transformation - both in our private and professional lives. All this driven by a piece of nasty RNA that was spreading around and as a consequence - changing our comfortable status quo. Thanks to enormous work of scientists and professionals all over the world 2021 is the year of hope, that pandemia will be under our control, and adoption to “new normal” as for sure we will not revert back to our pre-pandemic reality. 

How does it impact the data management field? Pandemia boosted digitalisation, so data became even more important for companies that it was before. Taking into consideration that over the past 2 years we have noticed a significant evolution of Big Data technologies we can expect that the upcoming year will be pretty interesting.

In Getindata we work with various customers globally, so we can track the evolution of data platforms and perception of data products in different companies. We are also tech enthusiasts and early adopters of new tech stacks (not to mention our contributions to Open Source) so we had an internal discussion about what to expect in the data management field in the future. Here is the summary of what we think will be trends for the Big Data ecosystem in 2021.

MLOps

getindata-big-data-trends-2021-develop-deploy-test

Machine Learning Operations (MLOps) has been a hot topic in 2020 already, but in our opinion, the following year will be a time for wide adoption of this paradigm. Many companies last year hit the wall of scaling up their current Machine Learning modelling and sorting out this problem became a number 1 priority - some of them have been already doing some research or proof of concepts projects for testing out some technologies. However, MLOps should not be considered as just a technology to automate ML modeling and serving. It is more like setting up the cross-functional process of creating, testing and serving ML models that involves different IT capabilities in the organization that were not closely working together before. It is also a way to bridge the knowledge gap between data analytics teams and IT platforms teams as they usually have complementary but hardly overlapping experience in technologies. So if you are just planning to start ML initiatives in your organization it might be a good idea to shape it as a full MLOps implementation with proper training, knowledge sharing and good practices in place. Instead of finding your way to do ML efficiently, what takes time, such a program could place you directly on a correct path.

From a technology stack perspective, we have few tools that are de facto industry standard in their areas (Jupyter, Pandas), but there are many different ideas how to tackle this matter turned into software products, software components and whole platforms with many new names popping up pretty frequently. The next year or two will show us which approach will become an industry standard.

Stream Processing

Real-time analytics implemented with stream processing engines have been around for a while already. However, what should be noticed is that from being a very complicated piece of software to be implemented and maintained only by experienced professionals they became pretty accessible complex products that allows even less tech-savvy people to work with them. To give an example - numerous APIs, like SQL and Python support in Flink makes everyone find something to themselves. Stream processing capabilities available at your fingertips in public clouds make even easier to start. Actually, in many cases, there is a little extra cost of switching from batch to real-time processing, with some benefits of such an approach as these frameworks have some capabilities already in place that would make your data ingestion less error-prone and messy. Together with the lower learning curve, the more use cases can be considered. In many business environments, there is an increasing appreciation for having data not only on time but online, almost available instantly, that you can actually act upon. As we all know - the value of information is decreasing over time. Data consumers demand information now, not the next business day. Business stakeholders finally have possibilities to get something out of it for themselves.

Cloud Native

getindata-big-data-blog-cloud-native

Classical Hadoop environment as we all know is rather decadent technology. Changes in the vendor landscape that were supporting it commercially has just hastened the end of its domination in the Big Data ecosystem (not to mention public cloud providers with their data analytics offering which became alluring alternatives). Cloud-native movement with its containerised software running on modern orchestrators, with everything as a code paradigm and programmatic infrastructure, has changed the way we build and serve applications. Data management world picked that up with a bit of hesitation and legacy baggage, but currently, no one is questioning the way we are going to evolve our data platforms. Clear separation of data storage from processing and querying is progressing in by Open Source software, but you can already work like that in the public cloud environment. In 2021 everyone is looking for Spark fully supported on Kubernetes. There is still an open question about the data storage for on-premise deployments - HDFS seems to be the most solid and performant solution but there are few initiatives about solving the problem of storage for data-intensive applications, like Ozone for example. From the user perspective, the idea of query federation engines, like Trino (formerly known as PrestoSQL) or BigQuery, are the next way of working with distributed data sources and the upcoming year will definitely increase their adoption. There is a brand new concept of data mesh with domain-oriented and decentralised data paradigm, but before we start thinking about the possibilities we could earn and learn about challenges, we need to be overwhelmed by the fact we can easily go beyond our data lake.

Data Discovery

While the idea of having a central catalog of all your data assets is nothing new there is not much adoption. Maybe apart from tech companies, which made it as their starting point for their data scientist to start work. Data-driven companies where all data is widely available for everyone to do analysis found it necessary to invest in such a solution. In case your data is still maintained in organizational silos you might not see value in Data Discovery. However, once you start going outside your well-trodden paths of doing analytics, you will face the challenge of maintaining knowledge about data, before your data scientists get flooded with datasets you will make available for them. Such catalog usually not only includes information about data sources but also some metadata like profiling or quality so you can consciously pick your data for analysis. Data Discovery also becomes a sort of knowledge management tool for the organization. However this is not just a technological problem as it is closely related to Data Governance practices implemented in the organizations. We see large potential for organizations in efficiently maintaining knowledge about their data.

amundsen-getindata-big-data-trends

Data Quality

There are two more areas that we think will be growing next year. As our data pipelines, today are more likely to be structured as a code, there is a question about maintaining data quality and observability. DataOps is nothing more like DevOps, but adjusted to data processing. While the currently running pipelines might be still in the ETL tools with fancy graphical user interfaces, there will be probably much less concerns about building new ones with just a code, but with rich testing practices and reproducibility.

Public cloud

Last but not least - public cloud offering for analytics is finally a real alternative for data management. Major vendors not only follow the latest advancements in Big Data technology but in many cases they actively participate in charting development paths. Taking into account that while moving to the cloud you just focus on how you want to shape your data product to support your use cases instead of managing the complexity of all these moving parts, many companies want to try it out in the upcoming year. This is valid also for companies from heavily regulated sectors.

Last year showed us that talking about trends and trying to predict what is going to be hot next year can be really tricky if the reality wants to play a game with us and turn everything upside down. The year 2021 seems to be more under control but still with a huge dose of uncertainty. However in data management, we do not expect revolutions - it is more like a constant evolution but with more attention from stakeholders as digitisation became the only way to go for many companies.

streaming
big data
technology
kubernetes
google cloud platform
data discovery
getindata
Amundsen
18 January 2021

Want more? Check our articles

1 RsDrT5xOpdAcpehomqlOPg
Big Data Event

2³ Reasons To Speak at Big Data Tech Warsaw 2020 (February 27th, 2020)

Big Data Technology Warsaw Summit 2020 is fast approaching. This will be 6th edition of the conference that is jointly organised by Evention and…

Read more
monitoring data platform
Whitepaper

White Paper: Monitoring and Observability for Data Platform

About In this White Paper, we described a monitoring and observing data platform in case of continuously working processes. What you will find there…

Read more
why do big data project fails
Tutorial

Why do Big Data projects fail: Part. 2. The Technological Issues.

In the previous post on our Big Data Blog, we discussed the business reasons behind the failures of Big Data projects. We've listed five major…

Read more
getindata running machine learning platform pipelines kedro kubeflow airflow mariusz strzelecki
Tutorial

Running Machine Learning Pipelines with Kedro, Kubeflow and Airflow

One of the biggest challenges of today’s Machine Learning world is the lack of standardization when it comes to models training. We all know that data…

Read more
kubeflow pipelines runing 5 minutes getindata blog

Kubeflow Pipelines up and running in 5 minutes

The Kubeflow Pipelines project has been growing in popularity in recent years. It's getting more prominent due to its capabilities - you can…

Read more
wp stream blogingobszar roboczy 1 4x 100
Whitepaper

White Paper: Stream Processing Explained

Stream Processing In this White Paper we cover topic such as characteristic of streaming, the challegnges of stream processing, information about open…

Read more

Contact us

Interested in our solutions?
Contact us!

Together, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.

By submitting this form, you agree to our  Terms & Conditions