Radio DaTa Podcast
Data Journey with Yetunde Dada & Ivan Danov (QuantumBlack) – Kedro (an open-source MLOps framework) – introduction, benefits, use-cases, data & insights used for its development

In this episode of the RadioData Podcast, Adam Kawa talks with Yetunde Dada & Ivan Danov about QuantumBlack, Kedro and trends in the MLOps landscape, such as the proliferation of MLOps tools and the rise of LLMOps. We encourage you to listen to the whole podcast or, if you prefer reading, skip to the key takeaways listed below.

_________________

Host: Adam Kawa, GetInData | Part of Xebia CEO

Since 2010, Adam has been working with Big Data at Spotify (where he proudly operated one of the largest and fastest-growing Hadoop clusters in Europe), Truecaller and as a Cloudera Training Partner. Nine years ago, he co-founded GetInData | Part of Xebia – a company that helps its customers to become data-driven and builds custom Big Data solutions. Adam is also the creator of many community initiatives such as the RadioData podcast, Big Data meetups and the DATA Pill newsletter.

Guests: Yetunde Dada & Ivan Danov

Yetunde is a Product Director at QuantumBlack and has been with the company for almost 4 years.

Ivan is a Software Engineer and has been working for QuantumBlack for 6 years. He has been working on Kedro since the beginning. 

_________________

QuantumBlack

QuantumBlack, a McKinsey company, is a data science and advanced analytics company that works with customers from various industries. QuantumBlack was founded in 2009 and has its headquarters in London, United Kingdom. The company became part of McKinsey & Company, a global management consulting firm, in 2015 and now operates as part of McKinsey's global analytics practice.

_________________

Key takeaways:

1. What is Kedro?

Kedro is an open-source, Python workflow development framework that helps ML practitioners write maintainable and modular analytics code which is production ready. It achieves this by enabling teams to adopt software engineering best practices.

Most companies have separate research and production units. Research units often work with Jupyter notebooks and are responsible for inventing new solutions, whereas production units try to implement their work and run it in a production environment.

Kedro tries to give everyone, regardless of team, the same software engineering practices, so that the code is production ready right from the start, or needs much less refactoring than it would without those practices.

As a result, many of the data engineering and data science prototypes that teams write are production ready right from the start, or become production ready with a little bit of work. In the end, this approach brings more value to the company.

2. What are the most important reasons why data practitioners choose to use Kedro?

If you are a data scientist, you may want to pick up Kedro because you collaborate with other team members and want to write well-structured code that is easy to share, maintain and understand.

ML engineers often pick up Kedro because it helps them to create an environment where other team members can write prototypes in a specific, well structured way. It allows the users to build software that is easily scalable and can be run in different environments.

If you are a data engineer, you are involved in creating large scale feature engineering pipelines or some form of data cleaning. Kedro can provide a well structured workflow for those types of tasks.

The last group that benefits from Kedro is project leads. Kedro Viz gives them a bird's-eye view of the pipeline structure.

The code that is produced with Kedro is more modular and usable across different projects.

3. How do large enterprises work with Kedro? What are the differences between enterprises that use and do not use Kedro?

The example from QuantumBlack is that, before Kedro, each team developed its part of the code in a different programming language, and integrating those parts afterwards was a nightmare. The code was hard to read and understand, and it was complicated to move between different projects and environments.

When Kedro was introduced, it gave everyone a common language for communicating about a project. It provided a level of abstraction that helped with communication about data engineering tasks: people could suddenly talk about developing a "node" or a "pipeline", and everyone had a common understanding of what that meant.
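To make the "node" and "pipeline" vocabulary concrete, here is a minimal sketch in plain Python. It deliberately does not use Kedro's API: the `Node` tuple, the `run_pipeline` helper and the dataset names are hypothetical. They only illustrate the idea that a node is a function with named inputs and outputs, and that a pipeline is a set of nodes whose order follows from those names.

```python
from typing import Any, Callable, Dict, List, NamedTuple, Optional


class Node(NamedTuple):
    """A named step: a plain function plus the dataset names it reads and writes."""
    func: Callable[..., Any]
    inputs: List[str]
    output: str


def run_pipeline(nodes: List[Node], data: Dict[str, Any]) -> Dict[str, Any]:
    """Run each node as soon as all of its inputs are available."""
    data = dict(data)  # do not mutate the caller's dictionary
    pending = list(nodes)
    while pending:
        ready = [n for n in pending if all(name in data for name in n.inputs)]
        if not ready:
            missing = sorted({i for n in pending for i in n.inputs if i not in data})
            raise ValueError(f"Cannot resolve inputs: {missing}")
        for n in ready:
            data[n.output] = n.func(*(data[name] for name in n.inputs))
            pending.remove(n)
    return data


def clean(raw: List[Optional[float]]) -> List[float]:
    """Drop missing readings."""
    return [x for x in raw if x is not None]


def average(values: List[float]) -> float:
    """Compute the mean of the cleaned readings."""
    return sum(values) / len(values)


# Nodes are declared out of order on purpose: the dataset names define the order.
nodes = [
    Node(average, ["clean_readings"], "mean_reading"),
    Node(clean, ["raw_readings"], "clean_readings"),
]

result = run_pipeline(nodes, {"raw_readings": [1.0, None, 3.0]})
print(result["mean_reading"])
```

Because each node only declares the names it reads and writes, two people can work on `clean` and `average` separately and still agree on how the pieces fit together.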

It also presented a common code base structure which was cleaner and easier to work with across multiple teams.

As a consequence of that, they started to build bigger and bigger projects which were more usable. They have been able to industrialize the way that they write machine learning code at QuantumBlack thanks to Kedro.

One of the companies that benefited greatly from Kedro is Telkomsel, Indonesia's largest communications company (see the article about Telkomsel and Kedro). Telkomsel used Kedro in several of its data engineering projects, and the benefits it emphasizes are improved collaboration, configuration management and visualization of data pipelines.

Data scientists who did not have a software engineering background became better software engineers by using Kedro.

4. What are the numbers or statistics that show the adoption of Kedro?

Kedro has over 8 thousand stars on GitHub, and the growth of the project has been largely organic. Over 1.6 thousand projects depend on Kedro, and the number is growing. The project also has almost 180 contributors.

Hundreds of companies use Kedro. Some of the most notable ones are listed in the README on the project's main GitHub page, for example Absa, AXA UK, NASA, ING, GetInData and AMAI GmbH.

5. Is there any specific segment of companies that benefit from Kedro the most?

If you are collaborating with others and are building a data engineering or data science pipeline, then Kedro is for you. Kedro supports creating code that should be deployed to production.

Kedro is designed to be platform agnostic. It provides freedom in writing data pipeline code without having to worry about which cloud provider it will run on. A number of plugins (some of which, like Kedro VertexAI and Kedro AzureML, are developed by GetInData) enable different data sources, data platforms and cloud providers to be used with Kedro, and provide a level of abstraction that helps to write more modular and reusable code.
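The platform-agnostic idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Kedro's API: the catalogs below are ordinary dictionaries, and the `customers` dataset name and the `run` and `count_rows` functions are hypothetical. The point is that the pipeline logic asks for a dataset by name, and only the catalog knows where the data lives. In Kedro this role is played by the Data Catalog.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict


def count_rows(records: list) -> int:
    """Pipeline logic: knows nothing about where `records` is stored."""
    return len(records)


def run(catalog: Dict[str, Callable[[], Any]]) -> int:
    """Load the dataset by name from whichever catalog is supplied."""
    return count_rows(catalog["customers"]())


# Catalog A: data held in memory (for example, in a unit test).
memory_catalog = {"customers": lambda: [{"id": 1}, {"id": 2}]}

# Catalog B: the same dataset name, backed by a JSON file on disk.
tmp_dir = tempfile.mkdtemp()
path = Path(tmp_dir) / "customers.json"
path.write_text(json.dumps([{"id": 1}, {"id": 2}, {"id": 3}]))
file_catalog = {"customers": lambda: json.loads(path.read_text())}

in_memory_count = run(memory_catalog)
on_disk_count = run(file_catalog)
print(in_memory_count, on_disk_count)
```

Swapping `memory_catalog` for `file_catalog` changes the storage backend without touching `count_rows`, which is the kind of separation that lets the same pipeline code move between environments.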

Kedro wants to be a bridge between data scientists and production. The team wants data scientists to have a uniform experience, regardless of what they are developing and which cloud provider is going to be used to run the code.

6. Do you analyze data to define the product roadmap for Kedro?

Kedro is supposed to be governed by the community, and all of QuantumBlack's work on it is done in public. You can see the GitHub issues that the team is working on and the milestones that they are currently trying to achieve.

Kedro has an opt-in telemetry plugin that sends usage data back to the team's database, so that they can see which commands are used more often and which are not. This helps them to decide what the Kedro development team should focus on next.
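As a rough illustration of how command-usage data could inform priorities, the sketch below ranks commands by how often they appear in a list of events. The event list is made up for this example and does not reflect real Kedro telemetry data or the plugin's actual data format.

```python
from collections import Counter

# Hypothetical telemetry events: one entry per command invocation.
events = ["kedro run", "kedro viz", "kedro run", "kedro new", "kedro run"]

# Count invocations per command and rank them from most to least used.
usage = Counter(events)
ranked = usage.most_common()

for command, count in ranked:
    print(f"{command}: {count}")
```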

In terms of what is coming next for the Kedro project, QuantumBlack is working on improving templating and configuration management in newly created Kedro projects.

They also want to improve existing features and make sure that they work as intended. They will also focus more on integration with Databricks, SageMaker and AzureML, to equip users with appropriate tools for working with those services.

The Kedro Viz project is also supposed to see improvements in visualizing dashboards and pipelines.

They also plan to improve Kedro online courses and documentation that will explain the basics of Kedro and how to take advantage of its features.

7. Can you share your thoughts on the future evolution of MLOps? What are the most important trends that you see when working with the open-source community and companies in regards to building their ML solutions?

Regarding MLOps tooling, there seem to be too many tools, and they predict either convergence or clear dominant players taking the stage in certain areas of MLOps.

They also expect to see new literature about best practices and code quality in data science projects, similar to what already exists for software engineering.

They also cannot ignore that there is currently a lot of talk about ChatGPT and new language models, which is probably going to be a trend in the upcoming years.

8. Iguazio is a Tel Aviv based company that offers ML platforms for large scale businesses. It is said in the article that Iguazio and QuantumBlack want to team up in the future to create one unified single product, that combines the best of both worlds of Kedro and Iguazio’s product. What does this mean for Kedro?

QuantumBlack has a product that covers a similar area to Iguazio’s, and the two plan to join forces to create a better solution together.

They want Kedro to run natively with Iguazio’s solution. Their goal is an integration in which both groups of users need as few steps as possible to move a project from one environment to the other.

Another benefit is that they have acquired another platform that Kedro runs on, which brings more experience and a better understanding of what Kedro should look like in order to be more flexible and useful in different scenarios. That said, Kedro is still going to be a platform-agnostic tool.

___________________

These are just snippets from the entire conversation which you can listen to here: 

Want to learn more about Kedro? Check out the following articles, tutorials and case-studies:

MLOps
ML
Kedro
LLM
8 September 2023
