From 0 to MLOps with ❄️ Part 2: Architecting the cloud-agnostic MLOps Platform for Snowflake Data Cloud

In the first part of this blog post, we presented our kedro-snowflake plugin, which lets you run your Kedro pipelines on the Snowflake Data Cloud in 3 simple steps. This time we demonstrate how you can implement an end-to-end MLOps platform on top of Snowflake, powered by Kedro and MLflow. Inspired by the blog posts 1 and 2, in this article we present a novel approach that tries to address the problem of missing external internet access in Snowpark.

External internet access would be an absolute game changer for implementing any integration with third-party tools not natively available in the Snowflake ecosystem and, to the best of our knowledge, it is on the Snowflake roadmap. So let’s dive into the MLOps Platform for Snowflake Data Cloud.

MLOps Platform pillars - recap

What is Kedro? The MLOps Framework

Kedro is a widely adopted MLOps framework in Python that brings engineering back to the data science world to help productionize machine learning code seamlessly. Kedro lets you build machine learning pipelines that can work on cloud, edge or on-premises platforms. It's open source and offers tools for data scientists and engineers to create, share and collaborate on machine learning workflows. Additionally, Kedro allows you to track the entire machine learning lifecycle from data preparation to model deployment.

Kedro is a tool that can make your Machine Learning projects more scalable and flexible, while keeping things simple. It can run on any platform, whether cloud or edge computing and is designed to be easily scalable. With Kedro, data scientists and engineers can build machine learning workflows compatible with different platforms without worrying about scalability. In addition, Kedro provides flexibility in building machine learning workflows, supporting various data sources, models, and deployment targets. It makes it easier for teams to experiment with new technologies and techniques while maintaining a consistent pipeline.
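To make the pipeline idea more concrete, here is a toy sketch in plain Python of how steps can be wired together by dataset names rather than by direct function calls. It is only an illustration of the concept and deliberately does not use Kedro's API: the `Node` type, `run_pipeline` and the dataset names (`raw`, `parts`, `model`, `mae`) are hypothetical names made up for this example.

```python
from typing import Any, Callable, NamedTuple


class Node(NamedTuple):
    """A pipeline step: a plain function plus the names of its inputs and output."""
    func: Callable[..., Any]
    inputs: list[str]
    output: str


def run_pipeline(nodes: list[Node], catalog: dict[str, Any]) -> dict[str, Any]:
    """Run nodes in dependency order, resolving inputs by dataset name."""
    data = dict(catalog)
    pending = list(nodes)
    while pending:
        # A node is ready once all of its input datasets exist.
        ready = [n for n in pending if all(name in data for name in n.inputs)]
        if not ready:
            raise ValueError("unresolvable inputs; check dataset names")
        for n in ready:
            data[n.output] = n.func(*(data[name] for name in n.inputs))
            pending.remove(n)
    return data


def split(rows: list[float]) -> dict[str, list[float]]:
    # Keep the first 75% of rows for training, the rest for testing.
    cut = int(len(rows) * 0.75)
    return {"train": rows[:cut], "test": rows[cut:]}


def train(parts: dict[str, list[float]]) -> float:
    # The "model" here is just the mean of the training rows.
    return sum(parts["train"]) / len(parts["train"])


def evaluate(model: float, parts: dict[str, list[float]]) -> float:
    # Mean absolute error of the constant prediction on the test rows.
    return sum(abs(x - model) for x in parts["test"]) / len(parts["test"])


# Nodes are declared out of order on purpose: the order is derived from names.
nodes = [
    Node(evaluate, ["model", "parts"], "mae"),
    Node(train, ["parts"], "model"),
    Node(split, ["raw"], "parts"),
]
result = run_pipeline(nodes, {"raw": [1.0, 2.0, 3.0, 4.0]})
```

Because the steps only know dataset names, the same definitions can in principle be executed by different backends, which is the property that makes a pipeline framework portable across platforms.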

What is MLflow? The platform for machine learning lifecycle management

MLflow is an open source platform that manages the entire lifecycle of machine learning models. It offers tools to track, manage and visualize workflows, from data preparation to model deployment. Additionally, MLflow promotes collaboration between data scientists and engineers by providing a shared language and understanding of the machine learning process.

Why should you consider MLflow in your MLOps toolbox? Here are the three main benefits MLflow introduces:

  1. Efficient Machine Learning Development - MLflow provides tools for tracking, managing and visualizing the entire machine learning workflow from data preparation to model deployment, which helps in the efficient development of machine learning models.
  2. Collaboration - MLflow enables collaboration among data scientists and engineers by providing a common language and shared understanding of the machine learning process. MLflow makes it easier to work with other team members and share knowledge.
  3. Continuous Improvement - MLflow provides tools for tracking model performance, which helps improve models over time. By continuously monitoring and analyzing model performance, data scientists can identify the areas where they need to improve their models or processes.

Both Kedro and MLflow are projects supported by the Linux Foundation.

What is Terraform?

Terraform is an open-source software tool that enables the infrastructure as code (IaC) approach in cloud computing, network automation and security. It provides tools to manage your infrastructure using simple, declarative configuration files instead of complex, error-prone manual configurations or scripts. Terraform helps you automate provisioning, updating and deleting resources across multiple cloud providers such as AWS, Azure and Google Cloud Platform (GCP).

GetInData is also an active contributor to the official Snowflake Terraform Provider. In particular, we have recently added support for external function translators, which was required for the MLflow integration presented here.

MLOps Snowflake Platform - high level architecture

In the recent release of our kedro-snowflake plugin, we added beta support for MLflow integration.

The diagram below presents the proposed MLOps platform architecture for the AWS cloud:

Figure: MLOps platform architecture on AWS

GCP and Azure deployment scenarios are very much the same except for:

  • API Gateway (GCP) or API Management (Azure) for exposing MLflow API
  • MLflow hosting - e.g. App Engine, Cloud Run (GCP) or Azure Container Apps, Azure Container Instances (Azure)
  • Online model serving - Vertex Endpoints (GCP) or Azure ML Endpoints

Technical deep-dive into kedro-snowflake <-> MLflow integration

Kedro-snowflake <-> MLflow integration is based on the following concepts:

  • Snowflake external functions that are used for wrapping POST requests to the MLflow instance. In the minimal setup the following wrapping external functions for MLflow REST API calls must be created:
  • Snowflake external function translators for changing the format of the data sent/received from the MLflow instance.
  • Snowflake API integration for setting up a communication channel from the Snowflake instance to the cloud HTTPS proxy/gateway service where your MLflow instance is hosted (e.g. Amazon API Gateway, Google Cloud API Gateway or Azure API Management).
  • Snowflake storage integration to enable your Snowflake instance to upload artifacts (e.g. serialized models) to the cloud storage (Amazon S3, Azure Blob Storage, Google Cloud Storage) used by the MLflow instance.
  • Offline model deployment within Snowflake is handled by the MLflow-Snowflake plugin, which allows you to deploy models trained in popular frameworks (such as PyTorch, Scikit-Learn, TensorFlow, LightGBM, XGBoost, ONNX and others) natively in Snowflake as User-Defined Functions, which then can be called directly from SQL to efficiently (using vectorization) perform inference and obtain the models’ predictions. The plugin is backed by the Snowflake team and is actively developed. 
  • Online model deployment to SageMaker Endpoints, using the official MLflow plugin.
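The role of the external function translators can be sketched as follows. Snowflake sends external function arguments as a JSON body of the form `{"data": [[row_number, arg1, ...], ...]}` and expects results back in the same row-numbered shape, while the MLflow REST API expects flat JSON payloads (for example for `runs/log-metric`). The snippet below is a Python illustration of that reshaping only: in Snowflake the translators themselves are JavaScript UDFs, and the column layout `[row_number, run_id, key, value]` as well as both function names are assumptions made for this example, not the plugin's actual code.

```python
import json


def to_mlflow_log_metric(snowflake_body: str, timestamp_ms: int) -> list[dict]:
    """Turn Snowflake's batched rows into MLflow `runs/log-metric` payloads.

    Assumed (hypothetical) column layout per row: [row_number, run_id, key, value].
    """
    rows = json.loads(snowflake_body)["data"]
    return [
        {"run_id": run_id, "key": key, "value": float(value), "timestamp": timestamp_ms}
        for _row_number, run_id, key, value in rows
    ]


def to_snowflake_response(row_numbers: list[int], results: list[dict]) -> str:
    """Wrap per-row results in the {"data": [[row_number, value], ...]} shape."""
    return json.dumps({"data": [[n, r] for n, r in zip(row_numbers, results)]})


# One row, as Snowflake would send it for a single external function call.
request_body = json.dumps({"data": [[0, "run-123", "rmse", 0.42]]})
payloads = to_mlflow_log_metric(request_body, timestamp_ms=1687392000000)

# The remote result is stubbed here as an empty object for each row.
response_body = to_snowflake_response([0], [{}])
```

Keeping this reshaping in translators means the gateway in front of MLflow can stay a thin HTTPS proxy without any custom format conversion of its own.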

Your MLOps Snowflake Platform: Pros, cons and alternative approaches

Our MLOps platform assumes a fully native integration with the Snowflake ecosystem and leverages it for data access, pipeline orchestration, model training as well as model deployment and inference. Such an architecture has a number of advantages, just to name a few:

  • simpler security setup (also thanks to limited data egress)
  • fewer dependencies on external services
  • substantially fewer data transfers
  • lower pipeline node startup overhead compared to SageMaker/AzureML/Vertex AI
  • a unified data and machine learning platform

This is, of course, not a one-size-fits-all architecture, and there are shortcomings that have not yet been addressed in the Snowflake Data Cloud, namely:

  • GPU support
  • preemptible data warehouses
  • access to the external services
  • multiple Python versions (already available in Public Preview)
  • Docker support

Alternatively, one can use Snowflake only in a data-centric way and offload the machine learning pipeline orchestration and training to external systems such as Azure ML, Vertex AI or SageMaker. This introduces a trade-off: some of the ML/AI-native capabilities of the external services can be leveraged, but data transfer and cross-cloud connection/setup costs may appear. Such an approach may, however, be desirable when there is a need to reuse an external pipeline orchestrator that is already used in the organization, such as Apache Airflow.

By building your core ML project and delivering the business value on top of the Kedro framework, later migration and switching between those two approaches is possible with minimal effort. So far, we’ve open sourced 6 major plugins for Kedro: Kedro-Snowflake, Kedro-AzureML, Kedro-VertexAI, Kedro-SageMaker, Kedro-Airflow and Kedro-Kubeflow (check our GitHub repos). 

The table below summarizes the pros and cons of the different approaches:

Table (image, with legend): comparison of the Airflow, Snowflake and Vertex approaches

Summary and future improvements

In this condensed blog post, we presented our approach to architecting a cloud-agnostic MLOps platform on top of Snowflake Data Cloud, based on three building blocks: Kedro, MLflow and Terraform. The proposed solution works around the missing external access to third-party services in the Snowflake ecosystem. Once implemented, that feature would simplify the overall architecture by removing the need for external function wrappers and API gateways.

But it’s not the end of the story - stay tuned for the third part of this blogpost in which we are going to present how to extend this platform even further to support Large Language Models (LLMs).

If this topic caught your interest, now is a good moment: on June 27, at the Snowflake Warsaw Meetup, I will be giving a talk titled "From 0 to MLOps with ❄️ Snowflake Data Cloud". I encourage you to join!

22 June 2023
