Tutorial
9 min read

dbt & Machine Learning? It is possible!

In one of our recent blog posts Announcing the GetInData Modern Data Platform - a self-service solution for Analytics Engineers we shared with you our vision of the Modern Data Platform. We intended to create a workspace every analytics engineer would love to use, so we placed dbt in the center point of our framework. dbt allows us to build data pipelines quickly, efficiently and with great attention to data quality - the only requirement regarding the users skills here is the proficiency in SQL.

Everything's been said so far regarding dbt's capabilities to transform the data within a data warehouse for reporting purposes. But what if we would like to use this powerful tool for something more advanced like ML pipelines? Working on ML models usually requires not only a lot of mathematical background, but also proficiency in programming languages, e.g. Python, which means that it often takes a lot of time & money to generate business value out of advanced analytics solutions. This is not the case with BigQuery ML, where feature engineering pipelines, model training and tuning can be made within a data warehouse without the need to move the data and, what’s very convenient - using SQL only.

This means that work with the model can be undertaken not only by a data scientist, but also by a data analyst or analytics engineer. The use of BigQuery ML may prove to be a solution that will provide sufficient results without the need to create an entire Data Science team. It can also be the beginning of work on more advanced challenges from the world of Data Science. But how can we combine all these data transformations and feature engineering into one consistent & manageable pipeline with all these best practices from the engineering world? This is where dbt and BQML join forces!

In this tutorial we will show you that it is possible and easier than you might think. For this purpose, we will use the dbt_ml package that allows you to create models using dbt. The whole thing will be done directly using the MDP platform, thanks to which we also have not only a model but also a pipeline for training models without additional work related to infrastructure.

BQML & dbt - Modern Data Platform integration overview

Clients segmentation model in MDP

As part of testing BigQuery ML on our platform, we created a simple customer segmentation model. For this purpose, we used a transaction dataset from Kaggle. It contains the household level transactions within two years from a group of 2,500 households that are frequent customers in retail chain stores. This includes all household purchases, not only limited by categories.

purchases-getindata

The data contains information about where and when the transaction took place, what was its value and details related to promotional campaigns. 

Feature engineering

Before we move on to creating a model, we need to look at the data. Check it's quality, e.g. how many nulls occur and what the individual statistics and dependencies look like. For this purpose, we did exploratory data analysis, which we will show here and only the effects - input variables to the model. Such analysis can be performed in R, SQL or Python. We have prepared an input table for our model in the dbt project using SQL. It contains features on a household level: 

  • recency (days since the last transaction)
  • frequency (the number of purchases made, here the condition is that it must be greater than 1)
  • monetary (the average value of the basket). 

ml-agg-simple-rfm-getindata

Model

Not only the preparation, but also the models are characterized by simplicity, because everything is created using SQL. We used the k-means++ algorithm to create five clusters, in this case k-centroid points are randomly picked. The differences between different k-means algorithms are explained in an interesting way in the article Understanding K-Means, K-Means++ and, K-Medoids Clustering Algorithms.

bqml-simple-rfm-kmeans

A simple syntax allows you to define the model parameters. This information is passed to BigQuery, which performs the calculations and writes the results to our BigQuery.

metrcis-numeric-features-getindata

As the output of the model, we have a customer segment for a specific household. The results obtained are five customer groups. The first of them are people who probably no longer shop in the stores they observe. Cluster number two seems to be the group of the best customers because they often visit the store and their basket is characterized by a high average value. The third group is prospective customers worth fighting for, because you can see that they have not had any transaction for a long time. The other two groups are those who visit the chain of stores to a greater or lesser extent. When creating the final tables, we added predicted data to the transactional table to facilitate work from the visualization side.

visualization side

CI/CD

The advantage of using our tool is a ready pipeline, which is created as part of the project. CI/CD checks the code and passes it to Airflow, then the composer passes the information to BigQuery ML to run the model. The platform plays the role of an orchestrator. Thanks to this, we get rid of work related to MLOps.

airflow-update-getindata

Reporting

The created model can be used to perform segmentation on new customers. Thanks to the created pipeline, the CI/CD process loads new data into the model and maps them to the appropriate segments. The data is directly loaded into the table and thus the report prepared on its basis is also updated.

To prepare the visualization, we used Looker Studio, which is integrated with our platform. The table contains some basic metrics regarding revenue, average basket or discounts.

customer segmentation

Conclusions for dbt and ML

In this tutorial we demonstrated how to define an end to end data pipeline using dbt and BigQuery ML. We discussed the pros and cons of such a solution. The complete step-by-step implementation of BQML was also presented in one of our posts A Step-by-Step Guide to Training a Machine Learning Model using BigQuery ML (BQML). The combination of dbt and BigQuery ML allows for the creation of machine learning models, without the need for a dedicated project or using a different language like Python. By leveraging the implemented models, it is possible to build analyses that can address more advanced business problems, such as customer segmentation. The ease of use is undoubtedly a great advantage. Model definition can be done using SQL, and by using the dbt_ml_package, the syntax familiar to dbt users will not be something new.
The output of the model is the BigQuery results, which can be easily analyzed directly there or using other visualization tools. The dbt_ml_package also allows for maintaining everything within the dbt project, providing version control and keeping the analytical layer in one place. This is a significant facilitation because without it, model creation would happen outside the ETL process and would be completely separate.
Such a solution offers great possibilities as it makes machine learning more accessible and can lead to better insights. For companies familiar with ML, this combination can be treated as a reconnaissance stage for larger projects with a low time investment. It's worth mentioning that the ready-made implementation of algorithms offers convenience but comes with a trade-off of limited customization. When implementing an algorithm from scratch, we have a greater sense of control and predictability. However, this drawback is not unique to this type of solution, as similar considerations can be observed when using packages in Python or R.

What’s next?

It turned out that dbt is not only about batch transformations in reporting workloads - it also supports ML models development but quite soon its use cases will be very common in real-time analytics. At GetInData, we recently created a dbt-flink adapter, that allows the running of pipelines defined in SQL in a dbt project on Apache Flink. You can read about it here

If you want to discuss this or any other topics, feel free to sign up for a 30 minutes free consultation with our experts.

More content on that coming soon!

Stay tuned for the future updates on dbt extensions and …other data solutions.

The administrator of your personal data is GetInData Poland Sp. z o.o. with its registered seat in Warsaw (02-508), 39/20 Pulawska St. Your data is processed for the purpose of provision of electronic services in accordance with the Terms & Conditions. For more information on personal data processing and your rights please see Privacy Policy.

By submitting this form, you agree to our Terms & Conditions and Privacy Policy
BigQuery
dbt
ML
Modern Data Platform
1 August 2023

Want more? Check our articles

flink kubernetes how why blog big data cloud
Tutorial

Flink on Kubernetes - how and why?

Flink is an open-source stream processing framework that supports both batch processing and data streaming programs. Streaming happens as data flows…

Read more
big data technology warsaw summit 2020 getindata
Big Data Event

Review of presentations on the Big Data Technology Warsaw Summit 2020

It’s been exactly two months since the last edition of the Big Data Technology Warsaw Summit 2020, so we decided to share some great statistics with…

Read more
getindata big data tech main 1
Big Data Event

A Review of the Presentations at the Big Data Technology Warsaw Summit 2022!

The 8th edition of the Big Data Tech Summit is already over, and we would like to thank all of the attendees for joining us this year. It was a real…

Read more
transfer legacy pipeline modern gitlab cicd kubernetes kaniko
Tutorial

How we helped our client to transfer legacy pipeline to modern one using GitLab's CI/CD - Part 2

Please dive in the second part of a blog series based on a project delivered for one of our clients. If you miss the first part, please check it here…

Read more
dsc3210
Big Data Event

A Review of the Big Data Technology Warsaw Summit 2022! Part 2. Top 3 best-rated presentations

The 8th edition of the Big Data Tech Summit left us wondering about the trends and changes in Big Data, which clearly resonated in many presentations…

Read more
power of big dataobszar roboczy 1 3x 100
Tutorial

Power of the Big Data: Industry

Welcome to the third part of the "Power of Big Data" series, in which we describe how Big Data tools and solutions support the development of modern…

Read more

Contact us

Interested in our solutions?
Contact us!

Together, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.


What did you find most impressive about GetInData?

They did a very good job in finding people that fitted in Acast both technically as well as culturally.
Type the form or send a e-mail: hello@getindata.com
The administrator of your personal data is GetInData Poland Sp. z o.o. with its registered seat in Warsaw (02-508), 39/20 Pulawska St. Your data is processed for the purpose of provision of electronic services in accordance with the Terms & Conditions. For more information on personal data processing and your rights please see Privacy Policy.

By submitting this form, you agree to our Terms & Conditions and Privacy Policy