Tutorial
16 min read

A Step-by-Step Guide to Training a Machine Learning Model using BigQuery ML (BQML)

What is BigQuery ML? BQML empowers data analysts to create and execute ML models through existing SQL tools & skills. Thanks to that, data analysts can build machine learning models in BigQuery. Below we explain how you can start using BigQuery ML to create and evaluate ML models.

Problem statement

In this example, we’ll build a machine learning model to predict the probability of adding a product to a shopping cart by the e-commerce website user. 

The problem to put it simply, is a binary classification (product added to cart / product not added to cart) problem. In BigQuery ML (BQML), this classification can be delivered by multiple algorithms:

  • Logistic Regression,
  • Boosted Tree Classifier,
  • Deep Neural Network Classifier,
  • Wide-and-Deep Classifier,
  • AutoML Tables.

To minimize the complexity of the example, we decided on a commonly known logistic regression model. This machine learning model produces the classification along with the attribution to class probability.

The data

For the training data, we’ll use the publicly available BigQuery sample dataset for Google Analytics 4 ecommerce web implementation. We’re focusing on the latest Google Analytics 4 version due to the Universal Analytics sunset announced for July 1, 2023:


On July 1, 2023, standard Universal Analytics properties will no longer process data. You'll be able to see your Universal Analytics reports for a period of time after July 1, 2023. However, new data will only flow into Google Analytics 4 properties. Source.

The process

To structure the description of the application of BQML, we’ll use the Cross-industry standard process for Data Mining (CRISP-DM) framework, which will let us split the complex scenario of mode building into the well-defined steps of:

  • Business understanding,
  • Data understanding,
  • Data preparation,
  • Modeling,
  • Evaluation,
  • Deployment.

This approach will be the first, simplified iteration of the whole process. We’ll build more advanced scenarios in the next phases (See: Next steps).

data-scheme-bq

Environment setup

Taking into account  the raw data location (BigQuery) and the available budget and complexity constraints, we decided to use the BigQuery Machine Learning (BQML) features to build our prototype model. 

Using BQML, by its design, provides  several very practical features:

  • data location control - the whole model training process is directly in the data warehouse, there is no need for data export or to move out of the current environment, and we may specify the data processing location (EU/US) to meet the compliance requirements,
  • simplicity - the model definition using SQL, and the quick start from the data to the first full operating machine learning model,
  • it’s a serverless tool - there is no need to manage and maintain the infrastructure,
  • ML capability - there is wide range of built-in model types,
  • simple cost model - in this task we’ll fit the 10 GB data processing free tier (detailed pricing model here).

Where will you need to start? 

To start working with BigQuery ML, you only need the web browser - no IDE or other tools are required. What’s worth mentioning, is that data processing and model training runs entirely on the Google Cloud Platform, so you don’t need a powerful local machine to work, even with large datasets.

Next, just create a project on Google Cloud Platform, and optionally apply for the free trial credits. Under the project, you’ll create BigQuery datasets, and train your ML models.

Important: We called our project `bqmlhackathon`. Feel free to select your own project ID, unique across the whole Google Cloud Platform.

Train your first ML model

Import the data to BigQuery

First, we need to get the data we’ll be using to train the model.

Note: In this example, we’ll use the sample, publicly available Google Merchandise Store data export from Google Analytics 4 to BigQuery. If you’d like to use your own data, please configure the export from your website following these instructions. You can learn more about the dataset here: Google Analytics documentation.

To see the Google Analytics data in BigQuery UI, go to [+Add data] -> Pin a project -> Enter project name: bigquery-public-data and see the ga4_obfuscated_sample_ecommerce.

bq-project

By completing this step, you’ll see the Google Analytics 4 data in BigQuery. Let’s get familiar with the dataset content.

Data understanding: Exploring the raw dataset

At this step, we’ll do a walk-through of the available data. You can see the schema in BigQuery UI, by clicking on the ga4_obfuscated_sample_ecommerce dataset and events_ table.

The meaning of each column is well described in the Google Analytics documentation.

bq-events

Note, that the table is partitioned by date. You can explore the available partitions using the drop-down menu in BigQuery UI. Additionally, in your queries you may want to specify the desired rate range by using wildcard tables, i.e. to query data between '20210101' and '20210131'.


--count unique users number in January
SELECT
COUNT(DISTINCT user_pseudo_id) AS users
FROM
`bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE
_TABLE_SUFFIX BETWEEN '20210101' AND '20210131'

Link to the query: bqml_wildcard_tables.sql

We encourage you to run some basic queries and see how much more you can get by querying the dataset instead of only using Google Analytics UI. Take advantage of BigQuery aggregate functions like AVG(), COUNT(), COUNTIF(), MAX(), MIN(), SUM().

Data preparation: Training datasets

To train a machine learning model,  you’ll need to create features, which could be both raw and transformed data. Based on our business experience, we propose the following set of features. We’ll create this dataset from the raw Google Analytics 4 data in the next steps.

bq-table-schema

To create it under your project in BigQuery, first you’ll need to create a place for the data and your ML models. In BigQuery, it’s called a dataset.

bq-dataset

In the next step, remember to set the proper data location (US/EU) for compliance reasons.

We’ll use eu (multiple regions in the European Union).

bq-dataset-location

Now, you can create your dataset containing Google Analytics 4 data by running the following query: bqml_ga4_dataset.sql

If you changed the project/dataset names to different values than proposed above, update the lines with:

bqmlhackathon.ga4.ga4_sample_ecommerce

to

<project_id>.ga4.ga4_sample_ecommerce

and then also use the updated values in the next steps.

After completing this step, you’ll see your dataset on the left sidebar.

bq-dataset-output

To better understand the prepared dataset, you may want to do some exploratory data analysis using aggregation functions in BigQuery, described in the previous step.

Alternatively, If you prefer a more interactive environment for data exploration, you may visualize and drill-down your query results in Google Data Studio. To do this, just run the query on the dataset you want to visualize, and select [Explore data] -> [Explore with Data Studio]

bq-query-results


The data explorer will open in a new window, where you can use many types of data visualization to inspect and discover the dataset used for model training.

data-studio-2

In this example, since the dataset is small (<20MB) and limited to a certain time period, we just run the query:  SELECT * FROM `bqmlhackathon.ga4.ga4_sample_ecommerce`, then [Explore data] -> [Explore with Data Studio]. Then, we change the  chart type to map, and select Country as a dimension to visualize.

Additional feature engineering

You may extend the dataset by performing some feature engineering, like data transformations, feature creation, and feature selection.

BigQuery performs some automatic data transformations, to specifically match the input data format to the model input requirements:

  • missing data imputation,
  • feature transformations,
  • category feature encoding.

For more advanced scenarios, you may want to perform more advanced manual data preprocessing, using ML functions such as:

  • ML.BUCKETIZE
  • ML.QUANTILE_BUCKETIZE
  • ML.POLYNOMIAL_EXPAND
  • ML.FEATURE_CROSS
  • ML.NGRAMS
  • ML.MIN_MAX_SCALER
  • ML.STANDARD_SCALER

Transform statement in BigQuery ML

With BigQuery ML (BQML), you may train a model using raw data, or optionally add data preprocessing / feature engineering before the model training step using the TRANSFORM clause.

(See: Using the BigQuery ML TRANSFORM clause for feature engineering)

We define data transformations along with the model definition using the TRANSFORM statement in the following way:

transform

If you decide to add the TRANSFORM clause on top of SELECT, the model will only use the features selected in this part, so you can use both feature selection and feature engineering there.

You can use the data transformations to produce features carrying more relevant information for the model. You’ll see the TRANSFORM clause in practice used for the model input data transformations in the next section.

Model training: train your first ML model in BigQuery ML (BQML)

Once we’ve got a prepared dataset in place, we can train our first machine learning model, which will be the logistic regression classifier.

In your future projects, you may need to use a different model type to address your problem. This model selection guide may be useful.

In this simple scenario, you don’t need to create a separate train and test dataset. BigQuery ML will split your input data automatically to the train and test dataset. 

Model training in BigQuery ML

To train the model in BigQuery, you define it in the SQL-like query. 

  1. First, specify the name of the model. 
  2. Next, define the features of the model along with data transformations (section TRANSFORM). 
  3. When we’ve got the data preparation defined, we can define the model type along with the model parameters, and the target column. 
  4. The last part is the regular SQL query which will select the raw input data from a table.

model-train-1

The link to query: bqml_create_model.sql

When you run the query, the model training will start. Model training time depends on the model type. It may take from a few minutes for simple models (like our case) to several hours for more complex models (like neural networks). You may follow the model training progress in the BigQuery UI:

bq-progress

You can also go back to the training process data in the TRAINING tab in your trained model details:

bq-training

Model evaluation: checking ML model performance

Once the model is trained, you can evaluate the model performance in a few ways. 

Evaluation in BigQuery 

BigQuery also calculates the model performance metrics (Depending on the model type, i.e. precision, recall, accuracy, F1 score, log loss, or ROC AUC). You can preview it on the EVALUATION tab in the BigQuery console:

model-evaluation

There are also SQL queries (i.e. ML.EVALUATE, ML.ADVANCED_WEIGHT) that allow access to this data in tabular form. If you want to learn more about the evaluation metrics interpretation, check the ML Crash Course sections about accuracy, precision and recall, and ROC AUC.

At this point you can decide if you want to proceed with the trained model, or whether you want to go back to the model training step and improve the model performance by defining a more complex model.

Model deployment: getting predictions on the new data

When you’re satisfied with the model performance, it’s time to use the model on the previously unseen (by the model) data. Alternatively, you can export the model and use it outside of the BigQuery in your production systems, i.e. to deploy it for online predictions. We’ll cover this topic in the next blog post.

In our example, we’ll predict the probability for the addedToCart event for the new sessions on our e-commerce website. 

First, prepare the dataset with the new data, which was unseen by the model (see: 1_bqml_ga4_dataset.sql and use different dates than were used for model training).

For example, set the date for 20201231, and name the table: bqmlhackathon.ga4.ga4_sample_ecommerce_20201231

Note, that currently 20210131 is the last available date in the public BigQuery GA4 dataset. So, to get predictions on the data unseen by the model during training, we’ll use 20201231. Usually, you’ll use the data after the training period.

You can get the predictions using the following query:

model-deployment

Link to the query: bqml_model_deployment.sql

Congrats! You just used your model to predict the probability of addedToCart event on the new data!

Summary

In this post, we explained the process of training, validating and deploying a machine learning model using BigQuery ML (BQML). Our example model predicts the probability of adding an item to the shopping cart on ane-commerce website. We used the Google Analytics 4 export to BigQuery to train a logistic regression classifier and then we predicted the probability of a sample event (addedToCart) on the new sessions.

Soon, we will publish a blog post about advanced model training in BigQuery ML (BQML), so if you don’t want to miss the publication, subscribe to our newsletter. Last but not least, we are happy to discuss Machine Learning models in your business, do not hesitate to contact us.


Did you like our post? If you want more, do not hesitate to download our free Ebook “MLOps: Power Up Machine Learning Process. Introduction to Vertex AI, Snowflake and dbt Cloud”.

machine learning
BigQuery
ML Model
ML Model
Machine Learning Model
BQML
BigQuery Machine Learning
BigQuery ML
30 May 2022

Want more? Check our articles

power of big dataobszar roboczy 1 3x 100
Tutorial

Power of the Big Data: Industry

Welcome to the third part of the "Power of Big Data" series, in which we describe how Big Data tools and solutions support the development of modern…

Read more
getindata 1000 followers

5 reasons to follow us on Linkedin. Celebrating 1,000 followers on our profile!

We are excited to announce that we recently hit the 1,000+ followers on our profile on Linkedin. We would like to send a special THANK YOU :) to…

Read more
kafka gobblin hdfs getindata linkedin
Tutorial

Data pipeline evolution at Linkedin on a few pictures

Data Pipeline Evolution The LinkedIn Engineering blog is a great resource of technical blog posts related to building and using large-scale data…

Read more
avoiding the mess in the hadoop cluster
Tutorial

Avoiding the mess in the Hadoop Cluster

This blog is based on the talk “Simplified Data Management and Process Scheduling in Hadoop” that we gave at the Big Data Technical Conference in…

Read more
getindata modern data platform features tools
Tech News

GetInData Modern Data Platform - features & tools

About the GetInData Modern Data Platform In our previous article you learned what our take on the Modern Data Platform is and that we took some steps…

Read more
1wersjaobszar roboczy 1 4
Tutorial

Feature Store comparison: 4 Feature Stores - explained and compared

In this blog post, we will simply and clearly demonstrate the difference between 4 popular feature stores: Vertex AI Feature Store, FEAST, AWS…

Read more

Contact us

Interested in our solutions?
Contact us!

Together, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.

The administrator of your personal data is GetInData Sp. z o.o. Sp.k with its registered seat in Warsaw (02-508), 39/20 Pulawska St. Your data is processed for the purpose of provision of electronic services in accordance with the  Terms & Conditions. For more information on personal data processing and your rights please see  Privacy Policy.

By submitting this form, you agree to our Terms & Conditions and Privacy Policy