A year is definitely long enough for new trends and technologies to emerge and gain traction. The Big Data landscape is changing ever faster thanks to innovation, competition, and the use of technologies that have become critical to almost every company on the planet. Let’s look at the 5 current trends that will be covered in detail by selected presentations at the upcoming edition of Big Data Tech Warsaw 2021 (February 25-26th).
The issue of building machine learning systems, especially scalable ones, was presented by Google in a research paper in 2015 ("Hidden Technical Debt in Machine Learning Systems"). At that time, many companies were already in the process of creating large-scale ML systems. Significantly, however, few had a dedicated platform or tools that would support the end-to-end life-cycle of their ML models and the daily work of their ML teams.
Last year we had a number of very interesting MLOps-related presentations at BDTWS 2020, given by speakers from companies such as Spotify, Disney+, and Synerise as part of the Data Science & ML track.
Keven (Qi) Wang will talk about the MLOps journey at H&M on the public cloud. In his talk he will present the entire MLOps stack, which has been adopted by multiple product teams managing hundreds of models across the entire H&M value chain. It enables data scientists to develop models in a highly interactive environment and lets engineers manage large-scale model training and model serving pipelines with full traceability.
Maciej Pieńkosz from Sotrender, a company whose main task is to analyze huge amounts of data coming from Social Media, will talk about their ML use-cases and GCP components they use (e.g. AI Platform Notebooks, AI Platform Training, Cloud Run, Gitlab CI/CD). His presentation will cover the full lifecycle of the ML model - from experimentation, through deployment and training, to model monitoring.
It's hard to operate in the IT industry (especially within Big Data projects implemented on open-source technologies) and not know the Australian company called Atlassian. Jiamei Du will talk about how her company uses A/B experiments to build better products. Part of her story will focus on their MLOps tools and infrastructure to make their A/B experiments as efficient as possible.
One cannot fail to mention the members of GetInData, who will present their experiences in building portable and reusable ML platforms in various environments (cloud, hybrid, on-premise) using a mix of open-source and cloud-based technologies for various customers. They will share their experience and best practices that come from multiple production implementations.
NoMagic robots improve iteratively and continuously thanks to the software 2.0 improvement cycle supported by an in-house data engine. Watch this short video below to see what type of robots they teach using ML/AI.
Those are only a few highlighted examples, but you will definitely learn more about Machine Learning Operations at Big Data Tech Warsaw 2021.
Adopting Machine Learning, Data Science, and AI algorithms and techniques has always required a lot of work, skills, and time. Undoubtedly, however, when done successfully, it brings excellent results. One of the favorite examples to mention is Discover Weekly, implemented by the Swedish, world-renowned company Spotify. Below, you can see slides created by my ex-colleagues at Spotify. On those slides, they describe how Discover Weekly came to be, highlighting technical challenges, data-driven development, and the ML models used to power their recommendations engine. It was a complex process, not done overnight. Integrating all the necessary (open-source) technologies, building a scalable architecture, implementing smart algorithms, and monitoring it all was undoubtedly a big undertaking, at least five years ago.
Today, building dedicated ML platforms and using MLOps toolkits can significantly increase a company's productivity. Very often, companies also switch to the public cloud, which lets them take advantage of ready-to-use libraries and hardware and, as a consequence, makes their job easier. As a result, they can experiment with, train, and deploy new models faster and more cheaply.
Clearly, more and more ML models appear in our daily lives these days.
During the BDTWS 2021 conference, you can count on many presentations that (a) describe use-cases, algorithms, and techniques which show how Machine Learning and Artificial Intelligence solve real-world business problems and (b) share their lessons learned from working with ML, Data Science, and advanced analytics. Let’s highlight a few interesting examples:
Mikio Braun (ex-Zalando) will talk about the lessons he learned from building large-scale production recommender systems. He will, among other things, explain how to bridge the gap between raw mathematical models and algorithms and robust, scalable software systems. His talk will explore the union of theory and practice.
Boxun Zhang (ex-Spotify, currently at Unity) will talk about similar issues in his presentation, although he will focus on the aspect related to real-time and large-scale Machine Learning systems. Boxun will also share several generalizable lessons that make ML systems performant from an ML perspective and scalable from an engineering perspective.
It's also hard not to mention the GetInData members who will present their experiences from a year-long journey of developing a big data analytics platform for Kcell (a large Kazakh telecom) and building data-driven solutions on top of it that help to reduce costs, improve the quality of services, and understand users' needs better.
Machine Learning is often used for prediction, forecasting, and anomaly detection. At BDTWS 2021 we will be able to hear the story of a near real-time ML model built by Ericsson. It is used for predicting telecom system degradation and outages based on historical fault and performance data. This model helps the operations team to conduct proactive monitoring, thanks to which the number of hours that support engineers spend on solving issues has dropped by 30 to 40%. It also improved the UX in pre-paid calls and increased customer retention. Peltarion (a Swedish company that specializes in AI) will describe their state-of-the-art weather forecasting AI service. Sotrender (a Polish company that analyzes data from social media) will explain how they use ML to predict and monitor the effectiveness of campaigns conducted on the Facebook platform.
At Big Data Technology Warsaw Summit 2021 there will also be presentations on the use of data, science and technology to generate insights for search and recommendation systems in an e-commerce platform (Etsy), to build content personalization systems in e-commerce (eBay), run A/B experiments for growth (Atlassian), analyze geophysical data from ground-penetrating radars using deep-learning techniques (SGPR.TECH), and more.
For data-driven companies, things like data quality and observability have always been important, even a long time ago when tools like Hadoop and Hive were open-sourced. On the other hand, they have always been problematic due to the lack of simple-to-use and feature-rich technologies (especially open-source ones). For this reason, many companies haven’t addressed these problems correctly.
Recently, however, the status quo has changed, and new tools have emerged that significantly facilitate data quality and data observability. These include Apache Atlas, Amundsen from Lyft, Dataportal from Airbnb (see a picture below), DataHub from LinkedIn, Data Catalog from Google, and Deequ from Amazon, to name a few. These tools are often integrated together: check how Amundsen can work together with Feast for machine learning discovery or Atlas for data discovery.
There will be a presentation on a new open-source technology called Marquez that can be used for data lineage and observability. This new tool can help to understand how data flows through a company’s systems. Thanks to this, it will be possible to show the dependencies between individual teams consuming and producing data, and it will be easier to audit data pipelines.
While ensuring data quality is important even for small data sets, Criteo representatives will explain how they addressed data quality challenges on their 120+ PB data lake with thousands of jobs. Their journey began two years ago, and they will now share with us the data and thoughts they have collected. The picture below shows data lake anomaly detection at Criteo (source).
The presentation from OLX will concern a pragmatic approach to data quality. It will focus on a review of existing frameworks and approaches to data quality. Besides this, it will cover the principles behind adapting these approaches and designing data quality systems at OLX.
That’s not all, as there will also be presentations about building testable data pipelines at Target and about a tool called Diftong from Klarna for validating big data workflows.
These are the first three trends in Big Data that will be strongly represented in presentations at the BDTWS 2021 conference. There is more to come: go to the next post to learn about the remaining two trends and the presentations related to them!