16 min read

A Step-by-Step Guide to Training a Machine Learning Model using BigQuery ML (BQML)

What is BigQuery ML? BQML empowers data analysts to create and execute ML models through existing SQL tools & skills. Thanks to that, data analysts can build machine learning models in BigQuery. Below we explain how you can start using BigQuery ML to create and evaluate ML models.

Problem statement

In this example, we’ll build a machine learning model to predict the probability of adding a product to a shopping cart by the e-commerce website user. 

The problem to put it simply, is a binary classification (product added to cart / product not added to cart) problem. In BigQuery ML (BQML), this classification can be delivered by multiple algorithms:

  • Logistic Regression,
  • Boosted Tree Classifier,
  • Deep Neural Network Classifier,
  • Wide-and-Deep Classifier,
  • AutoML Tables.

To minimize the complexity of the example, we decided on a commonly known logistic regression model. This machine learning model produces the classification along with the attribution to class probability.

The data

For the training data, we’ll use the publicly available BigQuery sample dataset for Google Analytics 4 ecommerce web implementation. We’re focusing on the latest Google Analytics 4 version due to the Universal Analytics sunset announced for July 1, 2023:

On July 1, 2023, standard Universal Analytics properties will no longer process data. You'll be able to see your Universal Analytics reports for a period of time after July 1, 2023. However, new data will only flow into Google Analytics 4 properties. Source.

The process

To structure the description of the application of BQML, we’ll use the Cross-industry standard process for Data Mining (CRISP-DM) framework, which will let us split the complex scenario of mode building into the well-defined steps of:

  • Business understanding,
  • Data understanding,
  • Data preparation,
  • Modeling,
  • Evaluation,
  • Deployment.

This approach will be the first, simplified iteration of the whole process. We’ll build more advanced scenarios in the next phases (See: Next steps).


Environment setup

Taking into account  the raw data location (BigQuery) and the available budget and complexity constraints, we decided to use the BigQuery Machine Learning (BQML) features to build our prototype model. 

Using BQML, by its design, provides  several very practical features:

  • data location control - the whole model training process is directly in the data warehouse, there is no need for data export or to move out of the current environment, and we may specify the data processing location (EU/US) to meet the compliance requirements,
  • simplicity - the model definition using SQL, and the quick start from the data to the first full operating machine learning model,
  • it’s a serverless tool - there is no need to manage and maintain the infrastructure,
  • ML capability - there is wide range of built-in model types,
  • simple cost model - in this task we’ll fit the 10 GB data processing free tier (detailed pricing model here).

Where will you need to start? 

To start working with BigQuery ML, you only need the web browser - no IDE or other tools are required. What’s worth mentioning, is that data processing and model training runs entirely on the Google Cloud Platform, so you don’t need a powerful local machine to work, even with large datasets.

Next, just create a project on Google Cloud Platform, and optionally apply for the free trial credits. Under the project, you’ll create BigQuery datasets, and train your ML models.

Important: We called our project `bqmlhackathon`. Feel free to select your own project ID, unique across the whole Google Cloud Platform.

Train your first ML model

Import the data to BigQuery

First, we need to get the data we’ll be using to train the model.

Note: In this example, we’ll use the sample, publicly available Google Merchandise Store data export from Google Analytics 4 to BigQuery. If you’d like to use your own data, please configure the export from your website following these instructions. You can learn more about the dataset here: Google Analytics documentation.

To see the Google Analytics data in BigQuery UI, go to [+Add data] -> Pin a project -> Enter project name: bigquery-public-data and see the ga4_obfuscated_sample_ecommerce.


By completing this step, you’ll see the Google Analytics 4 data in BigQuery. Let’s get familiar with the dataset content.

Data understanding: Exploring the raw dataset

At this step, we’ll do a walk-through of the available data. You can see the schema in BigQuery UI, by clicking on the ga4_obfuscated_sample_ecommerce dataset and events_ table.

The meaning of each column is well described in the Google Analytics documentation.


Note, that the table is partitioned by date. You can explore the available partitions using the drop-down menu in BigQuery UI. Additionally, in your queries you may want to specify the desired rate range by using wildcard tables, i.e. to query data between '20210101' and '20210131'.

--count unique users number in January
COUNT(DISTINCT user_pseudo_id) AS users
_TABLE_SUFFIX BETWEEN '20210101' AND '20210131'

Link to the query: bqml_wildcard_tables.sql

We encourage you to run some basic queries and see how much more you can get by querying the dataset instead of only using Google Analytics UI. Take advantage of BigQuery aggregate functions like AVG(), COUNT(), COUNTIF(), MAX(), MIN(), SUM().

Data preparation: Training datasets

To train a machine learning model,  you’ll need to create features, which could be both raw and transformed data. Based on our business experience, we propose the following set of features. We’ll create this dataset from the raw Google Analytics 4 data in the next steps.


To create it under your project in BigQuery, first you’ll need to create a place for the data and your ML models. In BigQuery, it’s called a dataset.


In the next step, remember to set the proper data location (US/EU) for compliance reasons.

We’ll use eu (multiple regions in the European Union).


Now, you can create your dataset containing Google Analytics 4 data by running the following query: bqml_ga4_dataset.sql

If you changed the project/dataset names to different values than proposed above, update the lines with:




and then also use the updated values in the next steps.

After completing this step, you’ll see your dataset on the left sidebar.


To better understand the prepared dataset, you may want to do some exploratory data analysis using aggregation functions in BigQuery, described in the previous step.

Alternatively, If you prefer a more interactive environment for data exploration, you may visualize and drill-down your query results in Google Data Studio. To do this, just run the query on the dataset you want to visualize, and select [Explore data] -> [Explore with Data Studio]


The data explorer will open in a new window, where you can use many types of data visualization to inspect and discover the dataset used for model training.


In this example, since the dataset is small (<20MB) and limited to a certain time period, we just run the query:  SELECT * FROM `bqmlhackathon.ga4.ga4_sample_ecommerce`, then [Explore data] -> [Explore with Data Studio]. Then, we change the  chart type to map, and select Country as a dimension to visualize.

Additional feature engineering

You may extend the dataset by performing some feature engineering, like data transformations, feature creation, and feature selection.

BigQuery performs some automatic data transformations, to specifically match the input data format to the model input requirements:

  • missing data imputation,
  • feature transformations,
  • category feature encoding.

For more advanced scenarios, you may want to perform more advanced manual data preprocessing, using ML functions such as:


Transform statement in BigQuery ML

With BigQuery ML (BQML), you may train a model using raw data, or optionally add data preprocessing / feature engineering before the model training step using the TRANSFORM clause.

(See: Using the BigQuery ML TRANSFORM clause for feature engineering)

We define data transformations along with the model definition using the TRANSFORM statement in the following way:


If you decide to add the TRANSFORM clause on top of SELECT, the model will only use the features selected in this part, so you can use both feature selection and feature engineering there.

You can use the data transformations to produce features carrying more relevant information for the model. You’ll see the TRANSFORM clause in practice used for the model input data transformations in the next section.

Model training: train your first ML model in BigQuery ML (BQML)

Once we’ve got a prepared dataset in place, we can train our first machine learning model, which will be the logistic regression classifier.

In your future projects, you may need to use a different model type to address your problem. This model selection guide may be useful.

In this simple scenario, you don’t need to create a separate train and test dataset. BigQuery ML will split your input data automatically to the train and test dataset. 

Model training in BigQuery ML

To train the model in BigQuery, you define it in the SQL-like query. 

  1. First, specify the name of the model. 
  2. Next, define the features of the model along with data transformations (section TRANSFORM). 
  3. When we’ve got the data preparation defined, we can define the model type along with the model parameters, and the target column. 
  4. The last part is the regular SQL query which will select the raw input data from a table.


The link to query: bqml_create_model.sql

When you run the query, the model training will start. Model training time depends on the model type. It may take from a few minutes for simple models (like our case) to several hours for more complex models (like neural networks). You may follow the model training progress in the BigQuery UI:


You can also go back to the training process data in the TRAINING tab in your trained model details:


Model evaluation: checking ML model performance

Once the model is trained, you can evaluate the model performance in a few ways. 

Evaluation in BigQuery 

BigQuery also calculates the model performance metrics (Depending on the model type, i.e. precision, recall, accuracy, F1 score, log loss, or ROC AUC). You can preview it on the EVALUATION tab in the BigQuery console:


There are also SQL queries (i.e. ML.EVALUATE, ML.ADVANCED_WEIGHT) that allow access to this data in tabular form. If you want to learn more about the evaluation metrics interpretation, check the ML Crash Course sections about accuracy, precision and recall, and ROC AUC.

At this point you can decide if you want to proceed with the trained model, or whether you want to go back to the model training step and improve the model performance by defining a more complex model.

Model deployment: getting predictions on the new data

When you’re satisfied with the model performance, it’s time to use the model on the previously unseen (by the model) data. Alternatively, you can export the model and use it outside of the BigQuery in your production systems, i.e. to deploy it for online predictions. We’ll cover this topic in the next blog post.

In our example, we’ll predict the probability for the addedToCart event for the new sessions on our e-commerce website. 

First, prepare the dataset with the new data, which was unseen by the model (see: 1_bqml_ga4_dataset.sql and use different dates than were used for model training).

For example, set the date for 20201231, and name the table: bqmlhackathon.ga4.ga4_sample_ecommerce_20201231

Note, that currently 20210131 is the last available date in the public BigQuery GA4 dataset. So, to get predictions on the data unseen by the model during training, we’ll use 20201231. Usually, you’ll use the data after the training period.

You can get the predictions using the following query:


Link to the query: bqml_model_deployment.sql

Congrats! You just used your model to predict the probability of addedToCart event on the new data!


In this post, we explained the process of training, validating and deploying a machine learning model using BigQuery ML (BQML). Our example model predicts the probability of adding an item to the shopping cart on ane-commerce website. We used the Google Analytics 4 export to BigQuery to train a logistic regression classifier and then we predicted the probability of a sample event (addedToCart) on the new sessions.

Soon, we will publish a blog post about advanced model training in BigQuery ML (BQML), so if you don’t want to miss the publication, subscribe to our newsletter. Last but not least, we are happy to discuss Machine Learning models in your business, do not hesitate to contact us.

Interested in ML and MLOps solutions? How to improve ML processes and scale project deliverability? Watch our MLOps demo and sign up for a free consultation.

machine learning
ML Model
ML Model
Machine Learning Model
BigQuery Machine Learning
BigQuery ML
30 May 2022

Want more? Check our articles

backendobszar roboczy 1 2 3x 100

Data Mesh as a proper way to organise data world

Data Mesh as an answer In more complex Data Lakes, I usually meet the following problems in organizations that make data usage very inefficient: Teams…

Read more
1 RsDrT5xOpdAcpehomqlOPg
Big Data Event

2³ Reasons To Speak at Big Data Tech Warsaw 2020 (February 27th, 2020)

Big Data Technology Warsaw Summit 2020 is fast approaching. This will be 6th edition of the conference that is jointly organised by Evention and…

Read more
Big Data Event

A Review of the Big Data Technology Warsaw Summit 2024! Part 2: Private RAG-backed Data Copilot, Allegro and PLAY case studies

In this blogpost series, we share takeaways from selected topics presented during the Big Data Tech Warsaw Summit ‘24. In the first part, which you…

Read more
włdek blogobszar roboczy 1 4x 100

Artificial Intelligence regulatory initiatives of EU countries

AI regulatory initiatives of EU countries On April 21, 2021, the EU Commission adopted a proposal for a regulation on artificial intelligence…

Read more
data enrichtment flink sql using http connector flink getindata big data blog notext

Data Enrichment in Flink SQL using HTTP Connector For Flink - Part Two

In part one of this blog post series, we have presented a business use case which inspired us to create an HTTP connector for Flink SQL. The use case…

Read more
getindata cover nifi ingestion nologo

Apache NiFi - why do data engineers love it and hate it at the same time? Blog Series Introduction

Learning new technologies is like falling in love. At the beginning, you enjoy it totally and it is like wearing pink glasses that prevent you from…

Read more

Contact us

Interested in our solutions?
Contact us!

Together, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.

What did you find most impressive about GetInData?

They did a very good job in finding people that fitted in Acast both technically as well as culturally.
Type the form or send a e-mail: hello@getindata.com
The administrator of your personal data is GetInData Poland Sp. z o.o. with its registered seat in Warsaw (02-508), 39/20 Pulawska St. Your data is processed for the purpose of provision of electronic services in accordance with the Terms & Conditions. For more information on personal data processing and your rights please see Privacy Policy.

By submitting this form, you agree to our Terms & Conditions and Privacy Policy