
Kubeflow Pipelines up and running in 5 minutes

The Kubeflow Pipelines project has been growing in popularity in recent years. It's becoming more prominent thanks to its capabilities: you can orchestrate almost any machine learning workflow and run it on a Kubernetes cluster. Although KFP is powerful, its installation process can be painful, especially on cloud providers other than Google (the main contributor to the Kubeflow project). The complexity and high barrier to entry seem to discourage Data Scientists from even giving it a go. At GetInData, we have developed a platform-agnostic Helm Chart for Kubeflow Pipelines that will let you get started within minutes, whether you're using GCP, AWS or want to run KFP locally.

How to run Kubeflow Pipelines on a local machine?

Before you start, make sure you have the following software installed:

  • Docker with ~10GB of RAM reserved and at least 20GB of free disk space,
  • Helm v3.6.3 or newer,
  • kind.
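
Before going further, you can quickly sanity-check the prerequisites from a terminal (assuming all three tools are on your PATH):

docker info | grep -i memory
helm version
kind version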

Once you have all of the required software, the installation is just a breeze!

  1. Create a local kind cluster:

kind create cluster --name kfp --image kindest/node:v1.21.14

It usually takes 1-2 minutes to spin up a local cluster.
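
You can verify that the cluster is reachable (kind registers the kubectl context as kind-<cluster-name>, so kind-kfp in this case):

kubectl cluster-info --context kind-kfp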

  2. Install Kubeflow Pipelines from GetInData's Helm Chart:

helm repo add getindata https://getindata.github.io/helm-charts/

and then:

helm install my-kubeflow-pipelines getindata/kubeflow-pipelines --version 1.6.2 \
  --set platform.managedStorage.enabled=false \
  --set platform.cloud=gcp \
  --set platform.gcp.proxyEnabled=false

Now you need to wait a few minutes (usually up to 5, depending on your machine) for the local kind cluster to spin up all the apps. Don't worry if you see the ml-pipeline or metadata-grpc-deployment pods in a CrashLoopBackOff state for some time - they will become ready once the services they depend on launch.
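
If you prefer not to watch the pods manually, you can block until everything is up; a one-liner sketch, with an arbitrary 10-minute timeout:

kubectl wait --for=condition=Ready pods --all --timeout=10m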

The KFP instance will be ready once all of the pods report the Running status:

kubectl get pods

NAME                                              READY   STATUS    RESTARTS   AGE
cache-deployer-deployment-db7bbcff5-pzvwx         1/1     Running   0          7m42s
cache-server-748468bbc9-9nqqv                     1/1     Running   0          7m41s
metadata-envoy-7cd8b6db48-ksbkt                   1/1     Running   0          7m42s
metadata-grpc-deployment-7c9f96c75-zqt2q          1/1     Running   2          7m41s
metadata-writer-78f67c4cf9-rkfkk                  1/1     Running   0          7m42s
minio-6d84d56659-gcrx9                            1/1     Running   0          7m41s
ml-pipeline-8588cf6787-sp68f                      1/1     Running   1          7m42s
ml-pipeline-persistenceagent-b6f5ff9f5-qzmsl      1/1     Running   0          7m42s
ml-pipeline-scheduledworkflow-6854cdbb8d-ml5mf    1/1     Running   0          7m42s
ml-pipeline-ui-cd89c5577-qhgbc                    1/1     Running   0          7m42s
ml-pipeline-viewer-crd-6577dcfc8-k24pc            1/1     Running   0          7m42s
ml-pipeline-visualizationserver-f9895dfcd-vv4k8   1/1     Running   0          7m42s
mysql-6989b8c6f6-g6mb4                            1/1     Running   0          7m42s
workflow-controller-6d457d9fcf-gnbrh              1/1     Running   0          7m42s

Access local Kubeflow Pipelines instance

In order to connect to the KFP UI, create a port-forward to the ml-pipeline-ui service:

kubectl port-forward svc/ml-pipeline-ui 9000:80

and open http://localhost:9000/#/pipelines in your browser.


Implementation details

Our platform-agnostic KFP Helm Chart is based on the original chart maintained by the GCP team. At the moment of the fork, the GCP chart was running version 1.0.4; we upgraded all of the components so that KFP runs the up-to-date version 1.6.0 (at the time of writing this post). GCP-specific components, such as CloudSQLProxy and ProxyAgent, were refactored to be deployed conditionally, based on the values provided to the chart.

We introduced a setting to enable or disable managed storage. Once enabled, it can use:

  • CloudSQL and Google Cloud Storage - when running on the Google Cloud Platform,
  • Amazon RDS and S3 - when running on AWS.

If managed storage is disabled, a local MySQL database and MinIO storage buckets are created instead (as in this post). For now, Azure support is pending - feel free to create a pull request to our repository!
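
For example, to deploy on GCP with managed storage enabled, the --set flags from the install command above can be collected into a values file. A minimal sketch - only the keys used in this post are confirmed, the commented-out connection settings are hypothetical placeholders (check the chart's Artifact Hub page for the exact names):

# values.yaml
platform:
  cloud: gcp
  gcp:
    proxyEnabled: false
  managedStorage:
    enabled: true
    # cloudsqlConnectionName: my-project:europe-west1:kfp-db  # hypothetical
    # storageBucketName: my-kfp-artifacts                     # hypothetical

helm install my-kubeflow-pipelines getindata/kubeflow-pipelines --version 1.6.2 -f values.yaml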

Next steps to running Kubeflow Pipelines

Now that you have a fully working local Kubeflow Pipelines instance, you can learn the KFP DSL and start building your own machine learning workflows without having to provision a full Kubernetes cluster.
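
To get a feel for the DSL, here is a minimal "hello world" sketch using the v1 Pipelines SDK (pip install kfp), run against the port-forwarded local instance from the previous section; the component and pipeline names are made up for illustration:

import kfp
from kfp import dsl

# A plain Python function wrapped into a containerized KFP component.
def greet(name: str) -> str:
    return f"Hello, {name}!"

greet_op = kfp.components.create_component_from_func(greet, base_image="python:3.8")

@dsl.pipeline(name="hello-kfp", description="A minimal demo pipeline")
def hello_pipeline(name: str = "Kubeflow"):
    greet_op(name)

if __name__ == "__main__":
    # Submit a one-off run through the local port-forward.
    client = kfp.Client(host="http://localhost:9000")
    client.create_run_from_pipeline_func(hello_pipeline, arguments={"name": "GetInData"})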


I encourage you to also explore GetInData's Kedro Kubeflow plug-in, which enables you to run Kedro pipelines on Kubeflow Pipelines. It supports translation from the Kedro pipeline DSL to KFP (using the Pipelines SDK) and deployment to a running Kubeflow cluster with convenient commands. Once you create your Kedro pipeline, configure the plug-in to use the local KFP instance by setting the host parameter in conf/base/kubeflow.yaml:

host: http://localhost:9000

# (...) rest of the kubeflow.yaml config
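
With the host configured, the plug-in's commands let you push and run the pipeline; the two below should cover the basic flow (refer to the plug-in documentation for the authoritative list):

kedro kubeflow upload-pipeline
kedro kubeflow run-once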

To stay up-to-date with the KFP Helm Chart, follow its Artifact Hub page! If you would like to know more about the Kedro Kubeflow plug-in, check the documentation here.


Interested in ML and MLOps solutions? Wondering how to improve ML processes and scale project deliverability? Watch our MLOps demo and sign up for a free consultation.

23 September 2021

