From 0 to MLOps with ❄️ Snowflake Data Cloud in 3 steps with the Kedro-Snowflake plugin

MLOps on Snowflake Data Cloud

MLOps is an ever-evolving field, and with the selection of managed and cloud-native machine learning services expanding by the day, it can be challenging to choose the right platform for running machine learning pipelines and deploying trained models. Three significant pain points persist in the MLOps landscape:

the lack of easy access to a company's valuable data,
the need for quick local iteration on ML pipelines,
the lack of a seamless transition to the cloud environment.

With Snowflake being a powerful data warehouse and Snowpark being easy to use, together they make a strong candidate for building complex ML pipelines. If you are not familiar with Snowpark yet, there are many great articles that introduce its core concepts and show how to use it for writing data science and machine learning (ML) code.

There are, however, at least a few shortcomings of the currently proposed approaches that have not yet been addressed:

ML pipelines orchestration - in the current state, two strategies can be pursued:
- using an external orchestrator service or tool, such as AzureML Pipelines or Apache Airflow for invoking Snowpark code directly
- manually wrapping Snowpark code into Python UDFs and using them for building a directed acyclic graph (DAG) of steps of the Snowflake native tasks mechanism

Unfortunately, neither of these methods is free from flaws. The former requires additional scheduling components in the architecture, which makes it more complex and less platform-independent. The latter is less user-friendly, as it requires not only developing the training code but also defining Snowflake DAGs of tasks in plain SQL or Terraform.

ML model lifecycle management - there isn’t any automation in place that makes it easy to promote/deploy training pipelines between environments, i.e. Development - Test - Production. You have to prepare the Continuous Integration/Continuous Training (CI/CT) processes on your own.
Code standardization and project templates - in its current state, Snowpark does not come with any built-in mechanism for code structuring, unit testing or automated documentation generation.
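To make the second orchestration strategy more concrete, here is a minimal sketch of what hand-maintaining a DAG of Snowflake tasks can look like. It only renders CREATE TASK statements as text and does not connect to Snowflake. The warehouse, task and stored procedure names are hypothetical placeholders, and the stored procedures are assumed to exist already.

```python
# Hypothetical names throughout: the warehouse, task and procedure names
# below are placeholders, not objects created by any tool in this article.
WAREHOUSE = "ML_WH"

# Each step: (task name, stored procedure to call, upstream task names).
STEPS = [
    ("PREPROCESS", "PREPROCESS_SP", []),
    ("TRAIN", "TRAIN_SP", ["PREPROCESS"]),
    ("EVALUATE", "EVALUATE_SP", ["TRAIN"]),
]


def create_task_sql(name, procedure, parents, warehouse=WAREHOUSE):
    """Render one CREATE TASK statement; child tasks list their parents in AFTER."""
    lines = [f"CREATE OR REPLACE TASK {name}", f"  WAREHOUSE = {warehouse}"]
    if parents:
        lines.append(f"  AFTER {', '.join(parents)}")
    lines.append(f"AS CALL {procedure}();")
    return "\n".join(lines)


statements = [create_task_sql(*step) for step in STEPS]
print("\n\n".join(statements))
```

Every step added to the pipeline means another statement, and another set of AFTER dependencies, to write and keep in sync with the training code by hand.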

The above list of challenges clearly indicates that an integration of the Snowflake environment with an MLOps framework, such as Kedro, is missing.

Today we are proudly announcing a solution that will fill this gap - the kedro-snowflake plugin. In the next post we will also guide you through the whole MLOps platform and ML model deployment on Snowflake. However, let's first take a look at what Kedro is and then let's build an ML pipeline in Kedro and execute it in the Snowflake environment in 3 simple steps.

Kedro - the MLOps Framework

Kedro is a widely-adopted, open-source Python framework that claims to bring engineering back to the data science world. The rationale behind using Kedro as a framework for creating maintainable and modular training code is, in many respects, similar to preferring Terraform over a cloud vendor's native SDK for infrastructure provisioning, and can be summarized in the following points:

standardization of ML project layout,
portability of ML pipelines,
reusability of the code base, modules or even whole pipelines,
a faster development loop thanks to the possibility of running/testing pipelines locally,
clear and maintainable codebase with no dependencies on cloud-specific APIs (as an analogy to Terraform providers) and separation of runtime configurations,
multi-cloud readiness,
hooks support for further automation,
seamless integration with 3rd party tools like MLflow, pandas-profiling or Docker through the plugin mechanism,
suitable for easy integration with CI/CD tools for a true MLOps experience.
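The points about local runs and independence from cloud-specific APIs can be illustrated with a small sketch. This is a plain-Python imitation of the idea and deliberately not Kedro's API: the node layout, the `run_locally` helper and the dataset names are all hypothetical. Nodes are pure functions, and an in-memory dictionary stands in for the data catalog.

```python
# Plain-Python imitation of the idea, NOT Kedro's API: a node is a pure
# function plus the names of the datasets it reads and writes.
def split_rows(rows):
    """Split rows into train and test parts (first 80% / last 20%)."""
    cut = int(len(rows) * 0.8)
    return rows[:cut], rows[cut:]


def mean_target(train_rows):
    """'Train' a trivial baseline model: the mean of the target column."""
    return sum(r["y"] for r in train_rows) / len(train_rows)


# (function, input dataset names, output dataset names) - hypothetical layout.
NODES = [
    (split_rows, ["rows"], ["train", "test"]),
    (mean_target, ["train"], ["model"]),
]


def run_locally(nodes, catalog):
    """Run nodes in order against an in-memory catalog (a plain dict)."""
    for func, inputs, outputs in nodes:
        result = func(*(catalog[name] for name in inputs))
        if len(outputs) == 1:
            result = (result,)  # single outputs are wrapped so zip() works
        catalog.update(zip(outputs, result))
    return catalog


catalog = run_locally(NODES, {"rows": [{"y": float(i)} for i in range(10)]})
print(catalog["model"])
```

Because the functions know nothing about where their data lives, the same node code can be tested on a laptop and later wired to a different runtime by changing only the configuration around it.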

We at GetInData|Part of Xebia are strong advocates of the Kedro framework as our technology of choice for deploying robust and user-friendly MLOps platforms on many cloud platforms. With our open-source Kedro plugins, you can write your pipeline code and focus on the target model. Then, with the Kedro plugins, you deploy it to any supported platform (see: Running Kedro… everywhere? Machine Learning Pipelines on Kubeflow, Vertex AI, Azure and Airflow - GetInData) without changing the code, which makes local iterations fast and the move to the cloud seamless.

As of May 2023 we support:

Google Cloud Platform (http://github.com/getindata/kedro-vertexai),
Microsoft Azure (https://github.com/getindata/kedro-azureml),
Amazon Web Services (https://github.com/getindata/kedro-sagemaker),
Airflow (https://github.com/getindata/kedro-airflow-k8s),
Kubeflow (https://github.com/getindata/kedro-kubeflow).

Now the time has come for Snowflake…

Kedro-Snowflake plugin behind the scenes

kedro-snowflake is our newest plugin that allows you to run full Kedro pipelines in Snowflake. Right now it supports:

Kedro starter, to get you up to speed fast
automatically creating Snowflake Stored Procedures from Kedro nodes (using Snowpark SDK)
translating the Kedro pipeline into Snowflake task DAGs
running the Kedro pipeline fully within Snowflake, without an external system
using Kedro's official SnowparkTableDataSet
automatically storing intermediate data results as Transient Tables (if Snowpark's DataFrames are used)

The core idea of this plugin is to programmatically traverse a Kedro pipeline and translate its nodes into corresponding Stored Procedures and at the same time wrap them into Snowflake tasks, while preserving the inter-node dependencies to form exactly the same pipeline DAG on the Snowflake side. The end result is a Snowflake DAG of tasks like this:

[Image: Snowflake DAG of tasks]

that corresponds to the Kedro pipeline:

[Image: Kedro pipeline]
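The dependency-preserving part of this translation can be sketched in a few lines. This is an illustration of the general technique under stated assumptions, not the plugin's actual implementation: the pipeline is described by a hypothetical dictionary of node names with their inputs and outputs, and each node's parent tasks are the nodes that produce its inputs.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline description: node name -> (inputs, outputs).
NODES = {
    "preprocess": (["raw_table"], ["clean_table"]),
    "train": (["clean_table"], ["model"]),
    "evaluate": (["model", "clean_table"], ["metrics"]),
}


def node_dependencies(nodes):
    """Map each node to the set of nodes that produce its inputs."""
    producer = {out: name for name, (_, outs) in nodes.items() for out in outs}
    return {
        name: {producer[i] for i in inputs if i in producer}
        for name, (inputs, _) in nodes.items()
    }


deps = node_dependencies(NODES)
# Parents always come before their children in this order.
order = list(TopologicalSorter(deps).static_order())

# One (task, parent tasks) pair per node, parents sorted for stable output.
tasks = [(name, sorted(deps[name])) for name in order]
for name, parents in tasks:
    print(name, "after", parents or "nothing (root task)")
```

Inputs that no node produces, such as `raw_table` here, are treated as external data and create no dependency, so `preprocess` becomes the root of the resulting DAG.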

It also comes with a built-in snowflights (port of the official spaceflights, extended with Snowflake-related features) starter that will help to bootstrap your Snowflake-based ML projects in seconds.

Quick start - your ML pipeline in 3 steps with Kedro-Snowflake plugin

Let’s start with the snowflights Kedro starter. First, prepare your environment (i.e. your preferred Python virtual environment). Then install our kedro-snowflake plugin:

pip install "kedro-snowflake>=0.1.2"

Next, create your first ML pipeline using Kedro and Snowflake. The starter will guide you through the Snowflake connection configuration, including the Snowflake account and warehouse details:

kedro new --starter=snowflights --checkout=0.1.2

Then run the starter pipeline:

kedro snowflake run --wait-for-completion

That’s it! You can follow the ML pipeline execution both in the terminal and in the Snowflake UI:

[Image: ML pipeline execution in the Snowflake UI]

This starter showcases the Kedro-Snowflake integration, including the connection with Snowflake, the translation of a Kedro ML pipeline into a Snowflake-compatible format, and the execution of the pipeline in the Snowflake environment. Feel free to build your own pipeline based on this starter or from scratch with our plugin. See more in the Kedro Snowflake plugin documentation.

We also recommend our video tutorial, in which Marcin Zabłocki shows how to run an ML pipeline on Snowflake.

Summary

In this short blog post we presented our newest kedro-snowflake plugin. Thanks to this plugin, you can build your ML pipelines in Kedro and execute them in a scalable Snowflake environment in three simple steps. Stay tuned for the second part of this blog post, in which we are going to present the whole MLOps platform and ML model deployment, with the kedro-snowflake plugin as its core component.


Last updated: 17 May 2023

Written by

Marek Wiewiórka

Big data architect

Marcin Zabłocki

MLOps Architect

Michał Bryś

Machine Learning Engineer and Technical Product Owner
