Tutorial
12 min read

Deploying MLflow on the Google Cloud Platform using App Engine

MLOps platforms delivered by GetInData allow us to pick best of breed technologies to cover crucial functionalities. MLflow is one of the key components in the open-source-based MLOps platforms as it acts both as an experiment tracker as well as a centralized model registry. 

In my previous blog post I wrote about deploying serverless MLflow in GCP using Cloud Run. This solution has its benefits as well as some drawbacks due to the complexity in the authorization layer. There are other options - one of such is using Google App Engine Flexible environment to host the MLflow and benefit from out-of-the-box Identity-Aware-Proxy integration, which will handle authorization transparently. In this blog post I will explain how to deploy such a configuration - this time, with much fewer manual steps as the solution will be deployed using Terraform.

Prerequisites

  • Access to the Google Cloud Platform, including: CloudSQL, GCS, Secret Manager, App Engine, Artifact Registry
  • Terraform >= 1.1.7
  • Docker (for building the image, but can also be done in CI)

Target setup

The final setup described in this blog post will look like this:

final-setup-bigdata-mlflow

Step 1: Pre-configuring OAuth 2.0 Client

App Engine-based MLflow will use Identity Aware Proxy as its authorization layer. In order to configure it, you need to obtain OAuth 2.0 Client ID and Client Secret. Follow the official guide here or refer to the previous blog post (step #2).

Once the OAuth 2.0 credentials are created, make sure that Authorized redirect URIs field has a value following the pattern (usually this value gets pre-filled with the correct value, but you should verify this):

https://iap.googleapis.com/v1/oauth/clientIds/<CLIENT ID>:handleRedirect

This is required for the IAP proxy to work properly with the App Engine.

Store the Client ID and Client Secret in 2 separate secrets within the Secret Manager and save the Resource ID of each secret - it will be passed to the App Engine.

secret-manager-mlflow

Step 2: Build the docker image for MLflow on App Engine

The docker image for MLflow will be similar to the GetInData’s public MLflow Docker image from https://github.com/getindata/mlflow-docker, with the difference being the base image with the gcloud SDK installed. The reason behind this requirement is that there is no built-in mechanism to mount secrets from the Secret Manager (with a database password) in the App Engine service natively, which shifts the responsibility of obtaining the secret to the actual application code / container entrypoint. 

Dockerfile

FROM google/cloud-sdk:385.0.0
ARG MLFLOW_VERSION="1.26.0"

ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8

RUN echo "export LC_ALL=$LC_ALL" >> /etc/profile.d/locale.sh
RUN echo "export LANG=$LANG" >> /etc/profile.d/locale.sh

ENV TINI_VERSION v0.19.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini

RUN pip3 install --no-cache-dir --ignore-installed google-cloud-storage && \
    pip3 install --no-cache-dir PyMySQL mlflow==$MLFLOW_VERSION pyarrow

COPY start.sh start.sh
RUN chmod +x start.sh


ENTRYPOINT ["/tini", "--", "./start.sh"]


start.sh

#!/usr/bin/env bash
set -e
echo "Obtaining credentials"
DB_PASSWORD=$(gcloud secrets versions access --project=${GCP_PROJECT} --secret=${DB_PASSWORD_SECRET_NAME} latest)
BACKEND_URI=${BACKEND_URI:-"mysql+pymysql://${DB_USERNAME}:${DB_PASSWORD}@/${DB_NAME}?unix_socket=/cloudsql/${DB_INSTANCE_CONNECTION_NAME:-"NOT_SET"}"}

mlflow server \
  --backend-store-uri ${BACKEND_URI} \
  --default-artifact-root ${GCS_BACKEND} \
  --host 0.0.0.0 \
  --port ${PORT}

Build the image and push it to the Google Artifact Registry, preferably to the same region in which you will deploy the App Engine app.

Step 3: Prepare the Terraform inputs

The full Terraform module which handles deployment of MLflow to App Engine is shared in the repository linked below.

Here are the required inputs:

  • project - Google Cloud Platform Project ID
  • env - environment name (of your choice)
  • prefix - additional prefix for resources (such as CloudSQL, GCS buckets etc.)
  • region - region for CloudSQL, Secret Manager and GCS buckets
  • app_engine_region - region for App Engine (follow the official documentation)
  • docker_image - full URI of the docker image with MLflow (see Step 2 above)
  • machine_type - machine type for the Cloud SQL (e.g. db-n1-standard-1)
  • availability_type - availability type for Cloud SQL instance (defaults to ZONAL)
  • secret_for_oauth_client_secret - Secret Manager Resource ID of Client Secret obtained in Step 1
  • secret_for_oauth_client_id - Secret Manager Resource ID of Client ID obtained in Step 1

MLflow Terraform Module details

App Engine

Secrets will be passed securely with the use of data sources:

data "google_secret_manager_secret_version" "oauth_client_id" {
  secret = var.secret_for_oauth_client_id
}

data "google_secret_manager_secret_version" "oauth_client_secret" {
  secret = var.secret_for_oauth_client_secret
}

resource "google_app_engine_application" "mlflow_app" {
  project     = var.project
  location_id = var.app_engine_region
  iap {
    enabled              = true
    oauth2_client_id     = data.google_secret_manager_secret_version.oauth_client_id.secret_data
    oauth2_client_secret = data.google_secret_manager_secret_version.oauth_client_secret.secret_data
  }
}

The App Engine app definition contains autoscaling, Cloud SQL connection configuration as well as a few required environment variables (note that only the Secret ID is passed, the DB password will not be exposed).

⚠️ If your project already contains an App Engine app, you will need to import it before applying the changes in terraform, as only one single App Engine app per project is allowed. Using the service name default will block you from deleting the app using terraform destroy, so you might want to change this parameter.

resource "google_app_engine_flexible_app_version" "mlflow_default_app" {
  project    = var.project
  service    = "default"
  version_id = "v1"
  runtime    = "custom"

  deployment {
    container {
      image = var.docker_image
    }
  }

  liveness_check {
    path = "/"
  }

  readiness_check {
    path = "/"
  }

  beta_settings = {
    cloud_sql_instances = google_sql_database_instance.mlflow_cloudsql_instance.connection_name
  }

  env_variables = {
    GCP_PROJECT                 = var.project
    DB_PASSWORD_SECRET_NAME     = google_secret_manager_secret.mlflow_db_password_secret.secret_id,
    DB_USERNAME                 = google_sql_user.mlflow_db_user.name
    DB_NAME                     = google_sql_database.mlflow_cloudsql_database.name
    DB_INSTANCE_CONNECTION_NAME = google_sql_database_instance.mlflow_cloudsql_instance.connection_name
    GCS_BACKEND                 = google_storage_bucket.mlflow_artifacts_bucket.url
  }

  automatic_scaling {
    cpu_utilization {
      target_utilization = 0.75
    }

    min_total_instances = 1
    max_total_instances = 4
  }

  resources {
    cpu       = 1
    memory_gb = 2
  }

  delete_service_on_destroy = true
  noop_on_destroy           = false

  timeouts {
    create = "20m"
  }

  depends_on = [
    google_secret_manager_secret_iam_member.mlflow_db_password_secret_iam,
    google_project_iam_member.mlflow_gae_iam
  ]

  inbound_services = []
}

CloudSQL instance

The module generates a random password for the database and stores it in the Secret Manager. MySQL 8.0 database flavor is used by default, but can be changed to any MLflow-supported one.

The database will be backed up by the disk with an auto-resize option to handle scaling up while keeping the initial costs to minimum.

resource "random_password" "mlflow_db_password" {
  length  = 32
  special = false
}

resource "random_id" "db_name_suffix" {
  byte_length = 3
}

resource "google_sql_database_instance" "mlflow_cloudsql_instance" {
  project = var.project

  name             = "${var.prefix}-mlflow-${var.env}-${var.region}-${random_id.db_name_suffix.hex}"
  database_version = "MYSQL_8_0"
  region           = var.region

  settings {
    tier              = var.machine_type
    availability_type = var.availability_type

    disk_size       = 10
    disk_autoresize = true

    ip_configuration {
      ipv4_enabled = true
    }

    maintenance_window {
      day          = 7
      hour         = 3
      update_track = "stable"
    }

    backup_configuration {
      enabled            = true
      binary_log_enabled = true
    }

  }
  deletion_protection = false
}

resource "google_secret_manager_secret" "mlflow_db_password_secret" {
  project   = var.project
  secret_id = "${var.prefix}-mlflow-db-password-${var.env}-${var.region}"

  replication {
    user_managed {
      replicas {
        location = var.region
      }
    }
  }
}

resource "google_secret_manager_secret_version" "mlflow_db_password_secret" {
  secret      = google_secret_manager_secret.mlflow_db_password_secret.id
  secret_data = random_password.mlflow_db_password.result
}

MLflow requires the database to be created.

resource "google_sql_database" "mlflow_cloudsql_database" {
  project  = var.project
  name     = "mlflow"
  instance = google_sql_database_instance.mlflow_cloudsql_instance.name
}

resource "google_sql_user" "mlflow_db_user" {
  project    = var.project
  name       = "mlflow"
  instance   = google_sql_database_instance.mlflow_cloudsql_instance.name
  password   = random_password.mlflow_db_password.result
  depends_on = [google_sql_database.mlflow_cloudsql_database]
}

Google Cloud Storage

Lastly, the GCS bucket for MLflow artifacts will be created - by default, it will use a MULTI_REGIONAL class for the highest availability. Adjust the configuration to your preference, e.g. enable object versioning.

resource "google_storage_bucket" "mlflow_artifacts_bucket" {
  name                        = "${var.prefix}-mlflow-${var.env}-${var.region}"
  location                    = substr(var.region, 0, 2) == "eu" ? "EU" : "US"
  storage_class               = "MULTI_REGIONAL"
  uniform_bucket_level_access = true
}

resource "google_storage_bucket_iam_member" "mlflow_artifacts_bucket_iam" {
  depends_on = [google_app_engine_application.mlflow_app]
  bucket     = google_storage_bucket.mlflow_artifacts_bucket.name
  role       = "roles/storage.objectAdmin"
  for_each   = toset(["serviceAccount:${var.project}@appspot.gserviceaccount.com", "serviceAccount:service-${data.google_project.project.number}@gae-api-prod.google.com.iam.gserviceaccount.com"])
  member     = each.key
}

Step 4: Deploy MLflow on App Engine using terraform

First, make sure that you have enabled the following APIs in your project: App Engine Flexible API, Cloud SQL Admin API.

Provided that you have cloned the repository with the MLflow terraform module (link in Summary below), run the following commands in the mlflow directory. Alternatively, integrate the module with your existing Infrastructure-as-a-Code project.

  1. Make sure that you have access to the target google project (e.g. by logging using gcloud auth application-default login

  2. terraform init

  3. Create terraform.tfvars file,example:

    project_id="<my project>"
    env="dev"
    prefix="abc"
    region="us-central1"
    app_engine_region="us-central"
    docker_image="us-central1-docker.pkg.dev/<artifact path>/mlflow:latest"
    machine_type="db-f1-micro"
    secret_for_oauth_client_secret="projects/<projectid>/secrets/oauth-client-secret"
    secret_for_oauth_client_id="projects/<projectid>/secrets/oauth-client-id"
  4. ⚠️ (optional) If your project already contains an App Engine app, you need to import it using the following command:
    terraform import google_app_engine_application.mlflow_app <my project>

  5. terraform plan -out mlflow.plan

  6. Verify the plan

  7. terraform apply mlflow.plan

    Depending on the settings, region etc. it can take from 5 to 15 minutes to deploy the whole stack.

Accessing serverless MLflow behind Identity Aware Proxy

Once deployed, the MLflow service can be accessed either from the browser or from a backend service.

Browser

All user accounts (or the whole domain) needs to have an IAP-secured Web App User role in order to be able to access the MLflow. Note that applying the permissions does not have an immediate effect and you will probably have to wait a few minutes before the user will be able to access the MLflow UI.

Visiting the deployed App Engine URL will automatically redirect you to the SSO page for your Google Account.

google-account-mlflow

google-account-2-mlflow

Service-to-service

Whether in CI/CD scripts or in Python code, you can access an MLflow instance by URL with the addition of the Authorization HTTP header.

First, make sure that the service account you will be using has the following roles:

  • IAP-secured Web App User
  • Service Account Token Creator

Once the roles are set up, use one of the following options:

You have the service account json key.

  1. gcloud auth activate-service-account --key-file=./path-to/key.json
  2. export TOKEN=$(gcloud auth print-identity-token --audiences="${OAUTH_CLIENT_ID}")

You are running on a Compute Engine VM / Cloud Run / GKE / other GCP-backed service which uses a service account natively

export TOKEN=$(curl -s -X POST -H "content-type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" -d "{\"audience\": \"${OAUTH_CLIENT_ID}\", \"includeEmail\": true }" "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/$(gcloud auth list --filter=status:ACTIVE --format='value(account)'):generateIdToken"  | jq -r '.token')

You want to verify whether a given service account can generate a token (using impersonation)

export TOKEN=$(gcloud auth print-identity-token --impersonate-service-account="<service account email>" --include-email --audiences="${OAUTH_CLIENT_ID}")

You are using Python to make requests

  1. Make sure that `google-cloud-iam` package is installed

  2. Obtain the token

    from google.cloud import iam_credentials
    import requests
    client = iam_credentials.IAMCredentialsClient()
    sa = "<Service Account Email>"
    client_id = "<OAuth 2.0 Client ID>"
    token = client.generate_id_token(
                name=f"projects/-/serviceAccounts/{sa}",
                audience=client_id,
                include_email=True,
    ).token
    
    result = requests.get("https://<redacted>.r.appspot.com/api/2.0/mlflow/experiments/list", 
                         headers={"Authorization": f"Bearer {token}"})
    print(result.json())

You are using MLflow Python SDK

Set MLFLOW_TRACKING_TOKEN environment variable to the token value (obtained using any of the above methods).

Alternatively, you can re-use the Python code above and implement the mlflow.request_header_provider plugin.

Summary

I hope this guide helped you to deploy secure, serverless MLflow instances on the Google Cloud Platform using App Engine. Happy (serverless) experiment tracking!

A special thank you to Mateusz Pytel for the initial configuration.

Repository with MLflow App Engine Terraform module is here

If you have any questions or concerns about deployment of MLflow on Google Cloud Platform, we encourage you to contact us.

Don’t miss the next MLflow and GCP blog post!

Sign up for the newsletter and stay up to date.

The administrator of your personal data is GetInData Poland Sp. z o.o. with its registered seat in Warsaw (02-508), 39/20 Pulawska St. Your data is processed for the purpose of provision of electronic services in accordance with the Terms & Conditions. For more information on personal data processing and your rights please see Privacy Policy.

By submitting this form, you agree to our Terms & Conditions and Privacy Policy
terraform
MLFlow
Google Cloud Platform
GCP
app engine
12 July 2022

Want more? Check our articles

complex event processing apache flink
Tutorial

My experience with Apache Flink for Complex Event Processing

My goal is to create a comprehensive review of available options when dealing with Complex Event Processing using Apache Flink. We will be building a…

Read more
runningkedroeverywhereobszar roboczy 1 4
Tutorial

Running Kedro… everywhere? Machine Learning Pipelines on Kubeflow, Vertex AI, Azure and Airflow

Building reliable machine learning pipelines puts a heavy burden on Data Scientists and Machine Learning engineers. It’s fairly easy to kick-off any…

Read more
ml getindataobszar roboczy 1
Use-cases/Project

Real-time Machine Learning: considerations based on Fraud Detection use case

When it comes to machine learning, most products are designed to work in batches, meaning they process data at fixed intervals rather than in real…

Read more
getindator beautiful magi lake with data visualization under th 04d517e5 6cb7 49b2 af1a 77884a44a1eb
Tutorial

Data lakehouse with Snowflake Iceberg tables - introduction

Snowflake has officially entered the world of Data Lakehouses! What is a data lakehouse, where would such solutions be a perfect fit and how could…

Read more
mlokladka
Tutorial

Your ML prototype doesn't have to be messy. A few words about the GetInData Machine Learning Framework

A prototype is an early sample, model, or release of a product built to test a concept or process. What we have above is a nice, generic definition of…

Read more
wp stream blogingobszar roboczy 1 4x 100
Whitepaper

White Paper: Stream Processing Explained

Stream Processing In this White Paper we cover topic such as characteristic of streaming, the challegnges of stream processing, information about open…

Read more

Contact us

Interested in our solutions?
Contact us!

Together, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.


What did you find most impressive about GetInData?

They did a very good job in finding people that fitted in Acast both technically as well as culturally.
Type the form or send a e-mail: hello@getindata.com
The administrator of your personal data is GetInData Poland Sp. z o.o. with its registered seat in Warsaw (02-508), 39/20 Pulawska St. Your data is processed for the purpose of provision of electronic services in accordance with the Terms & Conditions. For more information on personal data processing and your rights please see Privacy Policy.

By submitting this form, you agree to our Terms & Conditions and Privacy Policy