A Review of the Presentations at the DataMass Gdańsk Summit 2023

The DataMass Gdańsk Summit is behind us, so the time has come to review and summarize the 2023 edition. In this blog post, we share reviews and key takeaways from selected presentations: trends, lessons learned, and case studies from the whole Summit in one article. If you couldn't be in Gdańsk this year, read on to catch up on the hottest news from Big Data, Cloud, ML, and AI.

Before we get to the meat of the event, as co-organizers, we would like to thank all the speakers and participants for their openness and for sharing their knowledge and experience. You created the unique atmosphere of this event, and it owes its special character to you: substantive content and knowledge backed by practice.

Fortunately, we won't have to wait a year for another such event. In the spring, another conference will be held in Warsaw: the Big Data Technology Warsaw Summit. Its call for presentations is now open, so submit your talk proposal.

DataMass ‘23 Recap with a grain of salt…

This year's Summit hosted special guests. Samantha System and Darlene Dorkins, reporters from Clueless Computing, visited the conference to provide coverage and ask the speakers tricky questions.

Clueless Computing is a new format whose purpose is to popularize technology.

Warning: the video is only for people with a great sense of humor.

3 times Data: Data Products, Data Governance, and Data Lake

Review by Radosław Szmit, Senior Data Engineer at GetInData | Part of Xebia

DataMass is one of the most important events for data practitioners during the year.

It has a great atmosphere, a great location, and many enthusiasts from Poland and abroad. Below, I would like to present my review of several selected presentations.

Dainius Kniuksta (Artificial Intelligence Product Lead, Forecast.app) "AI products: faster path from being right to being loved"

Unlike other presentations, this one wasn't about technology at all. The speaker focused on the goal of implementing artificial intelligence solutions and on how customers and users perceive such products. Dainius noted that expectations of AI solutions often differ from what they actually offer. The presentation also touched upon trust in decisions made by AI and on moral aspects, e.g., replacing human employees with automation. Finally, he drew attention to the psychological side of working with AI, in particular the fact that "People do not like being proven incorrect/wrong", and how to address this challenge.

Jakub Nowacki (Senior Data Scientist, Amazon) "Recommender systems: a modern approach"

Jakub told us how Ring approached the problem of building recommendation systems. It turns out that the commonly used methods and algorithms found in ready-made services, such as AWS Personalize, work only in some use cases. Compared to a typical store, Ring has far fewer similar products and much less data. Jakub presented the algorithms Ring used to build an effective recommendation system, along with the AWS cloud services and open source tools involved.

Piotr Pietrzyk (Head of Data Governance, AVON) "Data Lineage, Data Catalog and Data Quality, as part of Data Governance Universe"

The presentation showed how modern technologies can be used to create Data Products and to assure data transparency and the highest data quality from both a business and a regulatory perspective. Piotr showed how important Data Governance is for a large organization and how tragic the consequences of ignoring it can be. He discussed the four pillars of Data Governance, i.e., Data Catalog, Data Quality, Data Lineage, and Data Privacy, and also addressed data strategy and the standards created by the non-profit organization DAMA International.

Tomasz Deręgowski (Director AI/ML Engineering) "ML at Scale. How to produce and operationalize hundreds of ML solutions in a simple and reliable way?"

Tomasz presented the challenges large companies face when implementing and scaling MLOps platforms. Research shows that putting an ML model into production in a large organization takes more than three months. He focused on three key aspects that support this process: processes, roles, and tools.

Marius-Mihai Grumazescu (Data Platform Team Lead, eMAG) and Michał Gutowski (Solutions Engineer, Cloudera) "From Warehouse Walls to Open Waters: eMAG's Data Lake Evolution"

eMAG is an online marketplace and e-commerce leader in Eastern Europe, headquartered in Romania. Like every large company in this industry, it wants to base its decisions on data and become an entirely "data-driven" company. During the talk, the speakers presented the challenges they faced while building their data platform and discussed their experience in quite some detail, including a comparison of several technologies and their effectiveness in selected use cases.

The platform runs on-premise on Cloudera Data Platform, with a cloud migration planned for 2024; some of the data sources are already in the AWS and GCP clouds. Currently, the platform processes approximately 2 PB of data in approximately 200 Airflow pipelines and uses 3,000 processor cores and 45 TB of RAM. The environment serves many different business tasks and is developed by 10 teams. Data is loaded both in batch and as streams using Apache Kafka and, in some cases, Debezium with the Change Data Capture mechanism on the source databases. An essential element of the platform is security based on Kerberos, Apache Ranger, and Apache Knox. In the future, the company also plans to expand the data platform with tools such as Apache Flink, dbt, and a Data Governance tool called DataHub.
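To make the Change Data Capture idea more concrete, here is a minimal Python sketch of how a consumer might replay Debezium-style change events into a replica table. It is only an illustration and not eMAG's implementation: the `products` rows, the `id` key, and the helper names `apply_change` and `replay` are hypothetical. The sketch relies only on the `op`, `before`, and `after` fields of the Debezium event envelope; real events arrive through Kafka and carry additional schema and source metadata.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_change(table: Dict[Any, Row], event: Dict[str, Any], key: str = "id") -> None:
    """Apply one Debezium-style change event to an in-memory replica."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read, update: upsert the new row image
        row = event["after"]
        table[row[key]] = row
    elif op == "d":
        # delete: drop the row identified by the old row image
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


def replay(events: Iterable[Dict[str, Any]], key: str = "id") -> Dict[Any, Row]:
    """Build a replica by applying events in the order they were captured."""
    table: Dict[Any, Row] = {}
    for event in events:
        apply_change(table, event, key)
    return table


# Hypothetical change stream for a "products" table.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "phone", "price": 100}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "case", "price": 10}},
    {"op": "u", "before": {"id": 1, "name": "phone", "price": 100},
     "after": {"id": 1, "name": "phone", "price": 90}},
    {"op": "d", "before": {"id": 2, "name": "case", "price": 10}, "after": None},
]

replica = replay(events)
print(replica)
```

Because each event carries the full new row image, replaying the events in order is enough to keep the replica in sync with the source table.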

Demands of the future: Data Platform + Experience

Review by Mariusz Wojakowski, Senior Data Engineer at GetInData | Part of Xebia

It was my first time at the DataMass conference, and I was delighted at how many experienced people were there! I was also serving as technical support at our booth, so I could freely ask my interlocutors what they were working on. I had a lot of interesting conversations about their platforms, how they approach different ML-related tasks, and what their plans are, e.g., migrating to the public cloud. One of my takeaways from this conference is that platforms offering a great developer, data scientist, and ML engineer experience will become more and more important in the future.

A nice touch for me was also meeting a former client who spoke very positively about the past cooperation with GetInData | Part of Xebia (schedule a complimentary consultation with our experts) - so I would like to convey my sincere thanks to all - current & previous - employees for their contribution 👏

Agnieszka Rybak (Allegro) - Serving ML Models at Scale at Allegro

Agnieszka’s presentation gave insight into how a large organization organizes its ML operations. Allegro engineers have developed many internal tools with a focus on developer experience and ease of use, and the same transformation is now happening for ML engineers, researchers, and data scientists. The presentation gave a general overview of the architecture and technical aspects of the platform created for this purpose. The lessons learned included valuable information about their feature store and - perhaps slightly controversially - about how much it is actually used within product teams (spoiler alert: not that much!). Also, building their own solution rather than taking existing products off the shelf made sense for a company like Allegro, which already had an internal platform that “just” needed to be wisely extended.

Regarding feature stores, I recommend our ebook, "Power Up Machine Learning: Build Feature Store Faster," where you will learn about MLOps issues.

Stanisław Magierski (Google) - Unifying Data Worlds

In the Big Data world, there's a tendency - at least from my point of view - to create a platform you don't need to leave! BigQuery is becoming that sort of thing: from a data warehouse through a data lakehouse to a data & AI platform. Stanisław talked about converging use cases for data scientists, data engineers & data analysts and how nearly all of them could be fulfilled within BigQuery Studio. An excellent addition at the end was a demo that went through scraping conference talk details (using Spark/Dataproc), then enriching them with poems through LLM usage (PaLM) - the whole thing automated in a pipeline (using Vertex AI) and nicely presented in Looker.

Data Science Perspective on DataMass

Review by Piotr Chaberski, Senior Data Scientist at GetInData | Part of Xebia

This was the second time I participated in DataMass. Although the previous edition was also full of high-quality technical and business content, this time - as a data scientist - I was particularly surprised at how packed it was with AI-related topics. What's more, these topics covered many different application areas, ranging from e-commerce through education and geospatial analytics to mental health and neuroscience.

Thanks to presentations from great experts, it was nice to verify the ideas and approaches that we take at GetInData, e.g., to recommendation systems (an overview of Amazon's approaches presented by Jakub Nowacki) or to Large Language Model applications (Kamil Gościmiński from AdAstra with a Company Data AI Assistant, Xebia's Jeroen Overschie with an LLM-based data enrichment tool). It was good to learn that we agree on the current state of the art and on the most promising directions (you can read more about our experience on the GetInData blog; search for the keywords "recommendation" or "LLM"). Another highly practical talk was given by Dainius Kniuksta, who drew on his rich consulting experience and real-world examples to remind us how vital proper customer understanding is in data science.

However, as conferences also aim to inspire and let you see something slightly beyond your usual area of interest, I would like to highlight four presentations that contributed to the impressive diversity of data science-related topics at DataMass 2023.

Kacper Łodzikowski (Pearson) "Advancing education with responsible AI"

Ask people about the importance of education in modern society, and all of them will agree that it is essential, but not many will be able to elaborate concretely. Kacper started with numbers. To cite a few: the formal education market is just 3% digital; by 2030, 84 million young people will be out of school; in 2019, only half of the Polish population knew how to copy a computer file or folder. These figures alone show how much remains to be done to develop educational methods. Kacper showed very clearly what the main challenges and risks in EdTech are and how Pearson is trying to address them. He also highlighted some exciting new opportunities that Generative AI brings to this field and how they compare to traditional approaches.

Marek Skolimowski (DNV): “Tracking all ships in the world. Sea of opportunities.”

Marek’s presentation showed what happens when the Big Data world meets the maritime industry. He talked about the challenges this fusion brings: data quality, data volume, and heavy computation. At first, this might sound like a list of the most standard issues in every data-driven business; however, in this context, they are pretty unique. One reason is that we are dealing with tons of geospatial data coming from the Automatic Identification System installed on ships, which was not originally intended to be used on a global scale. It turns out that with a modern data stack including Azure, Databricks, and Spark, not only does collecting and processing such data become possible, but many interesting analytical use cases also appear, e.g., discovering ports based on ship movement or estimating CO2 emissions.
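As a rough illustration of how ports could be discovered from ship movement, the Python sketch below bins slow-moving position reports into a coarse latitude/longitude grid and flags cells where several distinct ships have stopped. This is an assumption about one possible approach, not the method DNV described; the tuple layout, the `candidate_ports` helper, and all thresholds are hypothetical, and real AIS messages carry many more fields.

```python
import math
from collections import defaultdict
from typing import Iterable, List, Tuple

# Hypothetical simplified position report: (ship_id, latitude, longitude, speed_in_knots)
Ping = Tuple[str, float, float, float]
Cell = Tuple[int, int]


def candidate_ports(
    pings: Iterable[Ping],
    cell_deg: float = 0.1,   # grid cell size in degrees (assumed)
    max_speed: float = 0.5,  # at or below this speed a ship counts as stopped (assumed)
    min_ships: int = 3,      # distinct stopped ships needed to flag a cell (assumed)
) -> List[Cell]:
    """Return grid cells in which at least `min_ships` distinct ships have stopped."""
    ships_per_cell = defaultdict(set)
    for ship_id, lat, lon, speed in pings:
        if speed <= max_speed:
            cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
            ships_per_cell[cell].add(ship_id)
    return sorted(cell for cell, ships in ships_per_cell.items() if len(ships) >= min_ships)


pings = [
    ("A", 54.35, 18.65, 0.0),   # three ships stopped close to each other
    ("B", 54.36, 18.66, 0.2),
    ("C", 54.38, 18.67, 0.1),
    ("D", 54.37, 18.64, 12.0),  # passing through the same cell at speed
    ("E", 55.05, 16.05, 0.0),   # a single ship stopped in open water
]

ports = candidate_ports(pings)
print(ports)
```

Counting distinct ships rather than raw position reports keeps a single vessel that idles for a long time from being mistaken for a port. At global scale the same group-and-count logic would more naturally be expressed in Spark.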

Lilianna Czaniecka (mBank, Gdańsk University of Technology) “Unveiling the Mind: A.I. Decodes Words Directly from Brain Scans”

It was refreshing to see such an academic presentation at an applied cloud technology conference. Today, everybody is talking about Generative A.I., but the use cases presented by Lilianna were mind-blowing (no pun intended). The talk was based on two recent research papers, one from the University of Texas and the other from Osaka University, that apply GenAI to fMRI brain scans to reconstruct the speech or visual experience perceived by a person as generated text or images, respectively. This use of LLMs and diffusion models is fascinating in itself, and it also opens up many interesting practical applications, like enhancing virtual reality experiences, helping people with communication disabilities, or even understanding what is happening in the minds of our pets.

Amit Spinrad (Eleos Health) "Improving therapy through data: the case study of Motivational Interview (MI)."

Amit also showed how data and analytics can be used to understand the human mind better, but from a different angle. He introduced the audience to the therapy technique called Motivational Interviewing (MI) and then showed how modern data science and machine learning techniques can help therapists by automatically analyzing transcripts from MI sessions. It is important to note that the data science team's involvement in handling such a delicate matter as human mental health is just a part of the picture. All the analytics provided by computational techniques could not be applied successfully without another team of clinical experts who label the data, consult on techniques, and evaluate the results.

The Big Data Technology Warsaw Summit 2024 and DataMass 2024!

There are many reasons to join our events in 2024, the Big Data Technology Warsaw Summit and DataMass: staying up to date with the latest trends, networking, and seeing the achievements of practitioners. A reminder: the CFP for Big Data Tech is already underway. Follow the conference's profile on LinkedIn to make sure you don't miss any updates.

See you soon!

Last updated: 30 October 2023

Written by

Radosław Szmit

Data Architect

Sylwia Kołpuć

Senior Marketing Specialist

Piotr Chaberski

Senior Data Scientist

Mariusz Wojakowski

Software Engineer
