
The Big Data Technology Summit 2021 - review of presentations

Since 2015, the beginning of every year has been intense but also exciting for our company, because we get closer and closer to the Big Data Technology Summit, the conference we co-organize. This year we celebrated the 7th edition of the event, and the first one held fully online.

More than a one-day event

Even if this may sound strange, "thanks to" COVID-19 the conference agenda was packed with even more good content than ever. We not only split the conference into two days, but also added more tracks, more presentations and even some extra talks available before the main event on the VoD platform.

Adam Kawa, CTO of GetInData, during the opening session of The Big Data Technology Summit

We also held two panels before the conference; the video recording of one of them, “Pandemic, data and analytics – how might we know what happens next with Covid-19”, is available on YouTube.

Snowflake, Criteo, Atlassian, Twitter, Bolt, Cloudera... - speakers from all over the world

During this year's edition, we had the pleasure of listening to speakers from international companies. With no travel required, the event was easier for participants to attend, and easier for us as organizers to host guest Big Data experts from different continents. It is important to us as organizers that the entire content of the Big Data Technology Summit was highly rated by the participants.

Feedback about the content on The Big Data Technology Summit

Many of the presentations also received really good scores from the attendees. Below you can find a brief review of some of the talks that our team members attended.

Piotr Menclewicz, Big Data Analyst at GetInData

On the first day of the conference, I had the pleasure of watching presentations on Data Strategy and ROI in the parallel sessions. The second track's session, Foundations of Data Teams, was run by Jesse Anderson from the Big Data Institute. Jesse drew a beautiful analogy between building a data team and constructing a house: to make your data efforts work long-term, you need to take care of the foundations, yet in practice we often focus too much on the facade instead of the core structure. We were given great examples of what happens when key components of a data team, such as data science, data engineering or operations, are missing. We also found out that, counterintuitively, a bit of prevention can lead to tons of value.

On the second day of the Big Data Technology Summit, I attended the presentation hosted by Alex Belotserkovskiy from Microsoft. During his talk, Big Data Instruments and Partnerships - Microsoft ecosystem update, Alex explained what a data platform means at Microsoft and what the main components of such a platform are. He walked through the main pillars of the architecture, such as ingestion, storage, preparation, serving and reporting, and showed how each of them can be supported with Microsoft services as well as open-source technology (e.g. Spark, Jupyter, PostgreSQL). We were also given a glimpse into Microsoft's current areas of focus, such as responsible ML and open data.

The last presentation on the Data Strategy and ROI track, How to plan the unpredictable? 7 good practices of organisation and management of fast-paced large-scale R&D projects, was given by Krzysztof Jędrzejewski and Natalia Sikora-Zimna from Pearson. During the talk we had a chance to see practical examples of lessons learned in the battle of managing highly agile and innovative projects. The speakers shared specific assumptions that sound perfectly sensible in theory but tend not to work in practice, and provided remedies for these common pitfalls. We found out why we shouldn't: design everything in advance, try to be aligned with everything and everyone, or build everything on our own.

Michal Rudko, Big Data Analyst at GetInData

MLOps was one of the tracks that attracted the most interest during the Big Data Technology Summit 2021. More and more companies are bringing advanced analytics into their daily core business, which requires proper operationalization and maintenance.

In the first session of the track, Keven Wang guided us through the MLOps journey at H&M: how to manage a large number of ML models and their end-to-end lifecycle, so that a model can be brought online with confidence, its performance monitored, and the process adopted by multiple product teams at H&M. The whole path was presented as a combination of automated and manual steps (e.g. active approvals at crucial moments) with some state-of-the-art tools for model training (Azure Databricks, Kubeflow, Airflow), model management (MLflow) and model deployment (Seldon). I really liked the way these three stack layers were kept separate by design, allowing a degree of flexibility that is quite important in such a dynamic environment. Some of the functionalities were delivered with managed services, which sped up the whole process a lot.

MLOps Journey at H&M, Keven Wang


It was proven once again that an ML product is just another software product, where best practices and software engineering skills are more than welcome in the team. The whole transition, however, requires a mindset shift and proper planning, so it is indeed a journey you have to make in order to have dynamic and fully operational analytics in your data-driven company.

In the afternoon we learned from Maciej Pieńkosz how ML models are trained and deployed on Google Cloud at Sotrender, the social media experts. It came as no surprise that the list of advanced analytics use cases in this industry is long and diverse - to name just a few: sentiment analysis, hate speech detection, image recognition, text extraction, post classification and many more. Each of these model types requires a specific environment for experimentation, training, deployment and maintenance - this is where the Google Cloud Platform, with its services supported by open-source tools, comes into play.
At Sotrender, the whole journey usually starts in a notebook (AI Platform Notebooks), where the data is explored and initial models are built. As the next step, the codebase is refactored into standardized structures and templates, wrapped in a Docker container and trained in the cloud using AI Platform Training. The idea is to develop locally and train in the cloud in order to optimize costs. Using MLflow for experiment tracking allows the whole experiment history to be gathered in one place. Models are then deployed as services and served via a REST API using Cloud Run, chosen as the most flexible and functional solution. The CI pipeline is managed by GitLab, with canary rollouts ensuring smooth and safe change management.
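To make the experiment-tracking step more concrete, here is a minimal, standard-library-only sketch of the idea behind a tracker like MLflow (this is not the MLflow API; the run parameters and the `f1` metric are illustrative): every run's parameters and metrics land in one place, so you can later query for the best run.

```python
import time

class ExperimentTracker:
    """Minimal stand-in for an experiment tracker such as MLflow:
    records parameters and metrics of each run in one place."""

    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        run = {"ts": time.time(), "params": params, "metrics": metrics}
        self.runs.append(run)
        return run

    def best_run(self, metric, higher_is_better=True):
        # Pick the run with the best value of the given metric.
        key = lambda r: r["metrics"][metric]
        return max(self.runs, key=key) if higher_is_better else min(self.runs, key=key)

tracker = ExperimentTracker()
tracker.log_run({"model": "logreg", "C": 1.0}, {"f1": 0.81})
tracker.log_run({"model": "xgboost", "depth": 6}, {"f1": 0.87})
best = tracker.best_run("f1")
```

A real tracker adds artifact storage, a UI and model registry on top, but the core value - one queryable history of all experiments - is exactly this.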
It's not all about the tools and algorithms - we also heard some good practices and tips from an operational standpoint, from both the engineering and ML areas. Again it was stressed that the whole journey requires some human validation at crucial moments, and this is where a monitoring solution for models plays an important role - especially when you have thousands of models in production and would like to react fast in case some of them do not perform as expected.
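The kind of model monitoring check described above can be sketched in a few lines. The model names, metrics and the 0.05 threshold below are hypothetical, not from the talk; the point is only the shape of the check: compare each model's recent score with its baseline and flag the ones that degraded.

```python
def models_to_investigate(baseline, recent, max_drop=0.05):
    """Flag models whose recent metric dropped more than `max_drop`
    below their recorded baseline, or which report no metric at all.
    With thousands of models in production, a sweep like this can
    drive fast alerting."""
    flagged = []
    for model, base in baseline.items():
        current = recent.get(model)
        if current is None or base - current > max_drop:
            flagged.append(model)
    return sorted(flagged)

baseline = {"sentiment": 0.90, "hate_speech": 0.88, "post_class": 0.75}
recent   = {"sentiment": 0.89, "hate_speech": 0.78, "post_class": 0.74}
alerts = models_to_investigate(baseline, recent)  # only hate_speech dropped > 0.05
```

In practice the "human validation at crucial moments" means an alert like this opens a ticket or pages someone rather than automatically retraining the model.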

Maciej Obuchowski, Data Engineer at GetInData

The second talk of the Streaming and Real-Time Analytics track was presented by Simply Business's Michał Wróbel, who talked about Complex event-driven applications with Kafka Streams. Michał told us how Simply Business's streaming infrastructure needs to process many different types of events and answer complicated business-related questions. To manage the large number of event types, they use JSON schemas stored in the Iglu schema registry, driven by CI/CD. The events are processed with Kafka Streams. The advantages of this approach, as Michał noted, were Kafka Streams's small footprint, its stateful processing and its processing guarantees: fault tolerance and exactly-once semantics. The first version of the system wasn't perfect - it suffered from limited parallelism and was complex to manage - but the second version, built by applying Domain-Driven Design principles, fulfilled all the requirements and was easier to operate. The data prepared by the streaming applications was also used to drive Simply Business's Machine Learning applications.

Complex event-driven applications with Kafka Streams, Michał Wróbel
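The value of keeping JSON schemas in a registry is that every event can be validated before it enters the pipeline. As a rough, standard-library-only illustration (a real setup would fetch the schema from a registry such as Iglu and use a proper JSON Schema validator; the field names here are made up):

```python
import json

# Hand-rolled stand-in for a registry-fetched JSON Schema:
# required fields of a hypothetical event type and their types.
EVENT_SCHEMA = {
    "required": {"event_type": str, "entity_id": str, "ts_ms": int},
}

def validate_event(raw):
    """Parse a raw JSON event and reject it if any required
    field is missing or has the wrong type."""
    event = json.loads(raw)
    for field, expected_type in EVENT_SCHEMA["required"].items():
        if not isinstance(event.get(field), expected_type):
            raise ValueError(f"field {field!r} missing or not {expected_type.__name__}")
    return event

ok = validate_event(
    '{"event_type": "quote_created", "entity_id": "q-42", "ts_ms": 1614556800000}'
)
```

Driving the schemas through CI/CD, as Simply Business does, means a schema change is reviewed and deployed like any other code change, so producers and consumers never silently drift apart.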

We finished the track by listening to Ruslan Gibaiev, who told us about Evolving Bolt from batch jobs to real-time stream processing. Bolt's philosophy is efficiency, and Ruslan stressed that it also needs to be applied in the data context. Bolt relies heavily on Debezium for Change Data Capture from MySQL to Kafka. Their approach, consistent with the efficiency principle, builds on the Kafka ecosystem libraries Kafka Streams and KSQL. An important aspect stressed by Ruslan is their extensibility: the code is open source and allows, for example, defining UDFs for KSQL. Working with big data tools also has its disadvantages, such as complicated deployment and operations, and the tools being hard to debug.
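To show what Change Data Capture consumers actually receive, here is a simplified sketch of handling Debezium-style change events (real Debezium events also carry a schema and a `source` block, and the table name and fields below are invented): each event describes a row `before` and `after` a change, plus an operation code, which is enough to maintain a live replica of the table.

```python
import json

def apply_change_event(table_state, raw_event):
    """Apply one simplified Debezium-style change event to an
    in-memory table keyed by row id. Only before/after/op are
    used here; real events carry more metadata."""
    event = json.loads(raw_event)
    op, before, after = event["op"], event.get("before"), event.get("after")
    if op in ("c", "u", "r"):      # create / update / snapshot read
        table_state[after["id"]] = after
    elif op == "d":                # delete: only `before` is populated
        table_state.pop(before["id"], None)
    return table_state

rides = {}
apply_change_event(rides, '{"op": "c", "before": null, "after": {"id": 1, "status": "requested"}}')
apply_change_event(rides, '{"op": "u", "before": {"id": 1, "status": "requested"}, "after": {"id": 1, "status": "finished"}}')
```

Moving from batch jobs to a stream of such events is precisely what lets the downstream Kafka Streams and KSQL applications react in real time instead of waiting for the next batch export.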

Krzysztof Zarzycki, CTO at GetInData

Streaming and Real-Time Analytics was one of the tracks I was most excited about, and the presentation Streaming SQL - Be Like Water My Friend definitely met my expectations. Many of us already know that streaming SQL is becoming mainstream! Sooner or later you will need to use it to gain a technological advantage. The talk given by Volker Janz from InnoGames - where streaming SQL is already very important - was a great introduction to the subject. I see streaming SQL booming in 2021: being used in ETL, business automation, and also in analytics or even Machine Learning, all delivering results instantly while efficiently utilizing resources. During the talk, Volker also showed how streaming SQL looks and feels with a demo of Flink SQL and the Ververica Platform as an operator of Flink on Kubernetes.
Volker's presentation was the highest-rated one at the 7th edition of the Big Data Technology Warsaw Summit, which I think proves it was worth hearing.

Streaming SQL - Be Like Water My Friend, Volker Janz
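The typical first example of streaming SQL is a windowed aggregation, e.g. in Flink SQL something like `SELECT window_start, COUNT(*) FROM TABLE(TUMBLE(TABLE events, DESCRIPTOR(ts), INTERVAL '5' SECOND)) GROUP BY window_start`. As a rough, pure-Python analogy (not Flink; run on a bounded, in-memory list rather than an unbounded stream), a tumbling-window count boils down to bucketing each event by its window start:

```python
from collections import Counter

def tumbling_window_counts(events, window_size_ms):
    """Count events per tumbling window: each event with timestamp
    ts_ms falls into the window starting at the nearest lower
    multiple of window_size_ms - roughly what a streaming-SQL
    TUMBLE aggregation computes, here over a finite list."""
    counts = Counter()
    for ts_ms, _payload in events:
        window_start = (ts_ms // window_size_ms) * window_size_ms
        counts[window_start] += 1
    return dict(counts)

events = [(1000, "a"), (1500, "b"), (2500, "c"), (9999, "d")]
per_window = tumbling_window_counts(events, 5000)
```

What a streaming engine like Flink adds on top of this simple bucketing is the hard part: running it continuously, handling out-of-order events with watermarks, and emitting each window's result as soon as it is complete.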

The agenda of the Big Data Technology Warsaw conference was not just full of great presentations, but also of interesting discussions during the roundtable sessions. Some of the big data experts from our team had the pleasure of leading or joining the discussions.

Tomasz Żukowski, Data Analyst at GetInData

I had the pleasure of facilitating a roundtable discussion about the end-to-end cloud migration journey during Big Data Technology Warsaw 2021. Among the participants we had a mix of cloud practitioners and people planning a migration, representing both big and small organizations. The discussion touched on various problems, but two main issues emerged throughout the conversation: cost control and vendor lock-in.
We agreed that cost control is most important during the migration to the cloud, but also afterwards, during the business-as-usual period. Different approaches were presented:

  • Direct billing monitoring
  • Billing dashboards with slice and dice capabilities
  • Setting limits on various levels of systems
  • Using cost-saving components like preemptible/spot instances

It looks like vendor lock-in haunts professionals working on migrations, and it might be a real issue. Sometimes you might even face direct costs, such as data export costs or network egress (e.g. when exporting data from BigQuery). All sides of the problem have to be considered, and the final decision should be made based on a risk evaluation.
During migration planning, each component should be thoroughly examined to decide whether it should be migrated to a cloud-native or an open-source solution:

  • Apache Kafka vs. Pub/Sub
  • Apache Airflow vs. Cloud Composer

Most importantly, there is no golden rule for such a decision; the reasoning is usually organisation-dependent.
In conclusion, we agreed that migration to the cloud is an IT project like any other, and as such it has to be properly planned, monitored during execution and reevaluated if needed.

Klaudia Wachnio, Marketing Manager at GetInData

One of the best discussions I had the pleasure of attending was the one led by Juliana Araujo from Kambi: Managing a Big Data project - how to make it all work well together? As Juliana mentioned, according to Gartner's latest research, 85% of Big Data projects fail. Why are the statistics so bad, when many companies have great teams of engineers, developers, analytics experts and other big data specialists on board? During the discussion, participants shared their thoughts and experiences in this field and discussed the key aspects of succeeding in big data projects. The most important thing seems to be that the development team and the stakeholders work towards their goals together. It's hard to accept that after many months of a project you can get good-quality data from the engineering side, but with no business value. Some of you may now be asking: how is this possible? Here I should mention the long-known difficulty in communication between IT teams and the business. Stakeholders should understand that they need to put effort into this kind of project, and the engineering team should understand that they cannot leave stakeholders out of the planning process, or the results will never meet expectations. A good understanding of business needs should be a priority for both teams.
As the participants mentioned, working in an agile way can be a good approach for Big Data projects, because you don't build the whole system from day one. Your architecture can evolve as the business needs evolve and, most importantly, you can deliver some business value at an early stage of the project.

Adam Kawa, The Big Data Technology Warsaw Summit 2021

What to expect next? Big Data Technology Warsaw Summit 2022.

We already can't wait for the 8th edition of the event - and we hope you can't either! We don't know yet whether it will be an online or a live event. All we can promise now is that, together with Evention, we will do our best to prepare a conference more exciting and more full of high-quality presentations than any event you have ever attended before.

Thank you, and hopefully see you next year!

big data
analytics
conference
google cloud platform
apache flink
bigdatatech
getindata
stream processing
open source
26 March 2021

