9 min read

dbt & Machine Learning? It is possible!

In one of our recent blog posts Announcing the GetInData Modern Data Platform - a self-service solution for Analytics Engineers we shared with you our vision of the Modern Data Platform. We intended to create a workspace every analytics engineer would love to use, so we placed dbt in the center point of our framework. dbt allows us to build data pipelines quickly, efficiently and with great attention to data quality - the only requirement regarding the users skills here is the proficiency in SQL.

Everything's been said so far regarding dbt's capabilities to transform the data within a data warehouse for reporting purposes. But what if we would like to use this powerful tool for something more advanced like ML pipelines? Working on ML models usually requires not only a lot of mathematical background, but also proficiency in programming languages, e.g. Python, which means that it often takes a lot of time & money to generate business value out of advanced analytics solutions. This is not the case with BigQuery ML, where feature engineering pipelines, model training and tuning can be made within a data warehouse without the need to move the data and, what’s very convenient - using SQL only.

This means that work with the model can be undertaken not only by a data scientist, but also by a data analyst or analytics engineer. The use of BigQuery ML may prove to be a solution that will provide sufficient results without the need to create an entire Data Science team. It can also be the beginning of work on more advanced challenges from the world of Data Science. But how can we combine all these data transformations and feature engineering into one consistent & manageable pipeline with all these best practices from the engineering world? This is where dbt and BQML join forces!

In this tutorial we will show you that it is possible and easier than you might think. For this purpose, we will use the dbt_ml package that allows you to create models using dbt. The whole thing will be done directly using the MDP platform, thanks to which we also have not only a model but also a pipeline for training models without additional work related to infrastructure.

BQML & dbt - Modern Data Platform integration overview

Clients segmentation model in MDP

As part of testing BigQuery ML on our platform, we created a simple customer segmentation model. For this purpose, we used a transaction dataset from Kaggle. It contains the household level transactions within two years from a group of 2,500 households that are frequent customers in retail chain stores. This includes all household purchases, not only limited by categories.


The data contains information about where and when the transaction took place, what was its value and details related to promotional campaigns. 

Feature engineering

Before we move on to creating a model, we need to look at the data. Check it's quality, e.g. how many nulls occur and what the individual statistics and dependencies look like. For this purpose, we did exploratory data analysis, which we will show here and only the effects - input variables to the model. Such analysis can be performed in R, SQL or Python. We have prepared an input table for our model in the dbt project using SQL. It contains features on a household level: 

  • recency (days since the last transaction)
  • frequency (the number of purchases made, here the condition is that it must be greater than 1)
  • monetary (the average value of the basket). 



Not only the preparation, but also the models are characterized by simplicity, because everything is created using SQL. We used the k-means++ algorithm to create five clusters, in this case k-centroid points are randomly picked. The differences between different k-means algorithms are explained in an interesting way in the article Understanding K-Means, K-Means++ and, K-Medoids Clustering Algorithms.


A simple syntax allows you to define the model parameters. This information is passed to BigQuery, which performs the calculations and writes the results to our BigQuery.


As the output of the model, we have a customer segment for a specific household. The results obtained are five customer groups. The first of them are people who probably no longer shop in the stores they observe. Cluster number two seems to be the group of the best customers because they often visit the store and their basket is characterized by a high average value. The third group is prospective customers worth fighting for, because you can see that they have not had any transaction for a long time. The other two groups are those who visit the chain of stores to a greater or lesser extent. When creating the final tables, we added predicted data to the transactional table to facilitate work from the visualization side.

visualization side


The advantage of using our tool is a ready pipeline, which is created as part of the project. CI/CD checks the code and passes it to Airflow, then the composer passes the information to BigQuery ML to run the model. The platform plays the role of an orchestrator. Thanks to this, we get rid of work related to MLOps.



The created model can be used to perform segmentation on new customers. Thanks to the created pipeline, the CI/CD process loads new data into the model and maps them to the appropriate segments. The data is directly loaded into the table and thus the report prepared on its basis is also updated.

To prepare the visualization, we used Looker Studio, which is integrated with our platform. The table contains some basic metrics regarding revenue, average basket or discounts.

customer segmentation

Conclusions for dbt and ML

In this tutorial we demonstrated how to define an end to end data pipeline using dbt and BigQuery ML. We discussed the pros and cons of such a solution. The complete step-by-step implementation of BQML was also presented in one of our posts A Step-by-Step Guide to Training a Machine Learning Model using BigQuery ML (BQML). The combination of dbt and BigQuery ML allows for the creation of machine learning models, without the need for a dedicated project or using a different language like Python. By leveraging the implemented models, it is possible to build analyses that can address more advanced business problems, such as customer segmentation. The ease of use is undoubtedly a great advantage. Model definition can be done using SQL, and by using the dbt_ml_package, the syntax familiar to dbt users will not be something new.
The output of the model is the BigQuery results, which can be easily analyzed directly there or using other visualization tools. The dbt_ml_package also allows for maintaining everything within the dbt project, providing version control and keeping the analytical layer in one place. This is a significant facilitation because without it, model creation would happen outside the ETL process and would be completely separate.
Such a solution offers great possibilities as it makes machine learning more accessible and can lead to better insights. For companies familiar with ML, this combination can be treated as a reconnaissance stage for larger projects with a low time investment. It's worth mentioning that the ready-made implementation of algorithms offers convenience but comes with a trade-off of limited customization. When implementing an algorithm from scratch, we have a greater sense of control and predictability. However, this drawback is not unique to this type of solution, as similar considerations can be observed when using packages in Python or R.

What’s next?

It turned out that dbt is not only about batch transformations in reporting workloads - it also supports ML models development but quite soon its use cases will be very common in real-time analytics. At GetInData, we recently created a dbt-flink adapter, that allows the running of pipelines defined in SQL in a dbt project on Apache Flink. You can read about it here

Do you use dbt to create ML models or would you like to know more about this? Sign up for a free consultation.

More content on that coming soon!

Stay tuned for the future updates on dbt extensions and …other data solutions.

The administrator of your personal data is GetInData Poland Sp. z o.o. with its registered seat in Warsaw (02-508), 39/20 Pulawska St. Your data is processed for the purpose of provision of electronic services in accordance with the Terms & Conditions. For more information on personal data processing and your rights please see Privacy Policy.

By submitting this form, you agree to our Terms & Conditions and Privacy Policy
Modern Data Platform
1 August 2023

Want more? Check our articles

obszar roboczy 1 100

Towards better Data Analytics - Google Cloud Bootcamp

“Without data, you are another person with an opinion” These words from Edward Deming, a management guru, are the best definition of what means to…

Read more
albert1obszar roboczy 1 100

Apache NiFi and Apache NiFi Registry on Kubernetes

Apache NiFi is a popular, big data processing engine with graphical Web UI that provides non-programmers the ability to swiftly and codelessly create…

Read more
getindata 6 trends big data 2021 blog
Tech News

6 Big Data Trends For 2021

2020 was a very tough year for everyone. It was a year full of emotions, constant adoption and transformation - both in our private and professional…

Read more
Big Data Event

A Review of the Big Data Technology Warsaw Summit 2022! Part 2. Top 3 best-rated presentations

The 8th edition of the Big Data Tech Summit left us wondering about the trends and changes in Big Data, which clearly resonated in many presentations…

Read more
paweł lesszczyński 2obszar roboczy 1 4x 100

Alert backoff with Flink CEP

Flink complex event processing (CEP).... ....provides an amazing API for matching patterns within streams. It was introduced in 2016 with an…

Read more
0 pjPVaAnArwat2ZH8
Big Data Event

Big Data Tech Warsaw Summit 2019 summary

It’s been already more than a month after Big Data Tech Warsaw Summit 2019, but it’s spirit is still among us — that’s why we’ve decided to prolong it…

Read more

Contact us

Interested in our solutions?
Contact us!

Together, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.

What did you find most impressive about GetInData?

They did a very good job in finding people that fitted in Acast both technically as well as culturally.
Type the form or send a e-mail: hello@getindata.com
The administrator of your personal data is GetInData Poland Sp. z o.o. with its registered seat in Warsaw (02-508), 39/20 Pulawska St. Your data is processed for the purpose of provision of electronic services in accordance with the Terms & Conditions. For more information on personal data processing and your rights please see Privacy Policy.

By submitting this form, you agree to our Terms & Conditions and Privacy Policy