Machine Learning Features discovery with Feast and Amundsen

One of the main challenges of today's Machine Learning initiatives is the need for a centralized store of high-quality data that Data Scientists can reuse across different models. Tools that fill this gap are called Feature Stores, and you can read about them in Adi’s fantastic post: What are Feature Stores and Why Are They Critical for Scaling Data Science?

Many companies build a Feature Store tailored to their needs, but one of the most popular open-source implementations is Feast. Feast recently joined the LF AI & Data Foundation as a reference solution for storing features. It addresses the problem by:

  1. Providing a single data access layer that decouples models from the infrastructure used to generate, store, and serve feature data.
  2. Decoupling the creation of features from the consumption of features through a centralized store, thereby allowing teams to ship features into production with minimal engineering support.
  3. Providing point-in-time correct retrieval of feature data for both model training and online serving.
  4. Encouraging reuse of features by allowing organizations to build a shared foundation of features.
  5. Providing data-centric operational monitoring that ensures operational teams can run production machine learning systems confidently at scale.
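
To make point 3 more concrete, here is a minimal sketch of what a point-in-time correct lookup means: for each training event, use the latest feature value recorded at or before the event timestamp, never a later one. This is not Feast's implementation or API; the function name and the sample history are hypothetical and use only the Python standard library.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_value(history, event_time):
    """Return the latest feature value recorded at or before event_time.

    history: list of (timestamp, value) tuples sorted by timestamp.
    Returns None when no value had been recorded yet at event_time.
    """
    timestamps = [ts for ts, _ in history]
    # Index of the first timestamp strictly greater than event_time.
    idx = bisect_right(timestamps, event_time)
    if idx == 0:
        return None
    return history[idx - 1][1]


# Hypothetical history of one feature for a single entity.
history = [
    (datetime(2020, 11, 1), 3),
    (datetime(2020, 11, 10), 5),
    (datetime(2020, 11, 20), 8),
]
```

Looking up an event dated 15 November with this history picks the value written on 10 November and ignores the later one, which is what keeps training data free of leakage from the future.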

For one of the projects I’m working on at GetInData, Feast was selected as the backend of the feature store. Installing the core components lets users register, browse and update feature definitions using the Python SDK, but this is not the most user-friendly interface: it lacks a full-text search API and the ability to collaborate on features (such as adding and editing descriptions). In addition, the only gateway to the data is a Python REPL or Jupyter, so accessing it requires some coding skills.

In search of the Feast UI

“With 1.2k stars on GitHub there must be a UI somewhere” - this was my first thought. Unfortunately, the project name is not very distinctive, so searching for “feast ui” on Google doesn’t return very meaningful results. But it leads to some interesting insights:

I tried to revive the old UI, but without success. At that point, I was really close to starting to develop the UI myself, but decided to try another way first.

Meet Amundsen

Amundsen is a data discovery tool that collects metadata from your databases, pushes it to an internal Neo4j graph database and Elasticsearch, and exposes it through a nice, interactive frontend. The tool is widely adopted at Lyft, ING and in many data-oriented projects supported by GetInData. Using the web portal, users can search for the data they are interested in, assign tags, mark datasets as “starred” and even edit the descriptions of tables and columns. Interestingly, it is part of LF AI & Data as well.

Usually, you use Amundsen to load the structure of your database (using databuilder) into a database-oriented model with:

  • column being a part of the table,
  • table residing in schema,
  • schema being stored in cluster,
  • cluster belonging to some kind of database technology.

At first glance, the structure doesn’t really fit the Feast schema, as the structure there looks like this:

  • entity and feature are properties of the feature table,
  • feature table has many properties: labels, definition of batch source and - optionally - stream source,
  • entity and feature table are registered within a project.

Still, users can think of this data in tabular form using this mapping:

  • columns are either entities or features,
  • table is simply a feature table,
  • schema is a Feast project,
  • cluster is the name of the Feast instance (usually there is only one in the company),
  • database technology is always “feast”.
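
The mapping can be sketched in a few lines of Python. This is only an illustration with hypothetical class and function names, not the code of the actual extractor. It assumes that the Feast project takes the schema level of Amundsen's hierarchy and that the table key follows the database://cluster.schema/table pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeastFeatureTable:
    """Hypothetical, simplified view of a Feast feature table."""
    project: str
    name: str
    entities: tuple
    features: tuple


def to_amundsen_table(feature_table, cluster="feast-instance"):
    """Map a feature table onto the database/cluster/schema/table model."""
    return {
        "database": "feast",               # database technology is always "feast"
        "cluster": cluster,                # name of the Feast instance (assumed value)
        "schema": feature_table.project,   # Feast project
        "table": feature_table.name,       # feature table
        # Columns are either entities or features.
        "columns": list(feature_table.entities) + list(feature_table.features),
        "key": f"feast://{cluster}.{feature_table.project}/{feature_table.name}",
    }


# Hypothetical feature table used only for illustration.
driver_stats = FeastFeatureTable(
    project="dev",
    name="driver_stats",
    entities=("driver_id",),
    features=("trips_today", "rating"),
)
table = to_amundsen_table(driver_stats)
```

With the sample above, the entity and both features end up as three columns of a single "driver_stats" table in the "dev" schema.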

With this mapping in mind I was finally able to try Amundsen as a Feast user interface!

Amundsen’s Feast Extractor

Implementing an Amundsen extension that scrapes metadata from Feast turned out to be a straightforward task. Databuilder defines an Extractor as a simple class that generates metadata objects one after another, so a few calls to Feast’s Python SDK are all it takes. Recently, my implementation was merged into Databuilder master, so you can try it yourself! The job that does all the work can be defined as in the sample script.
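
To illustrate the Extractor idea without depending on Databuilder or a running Feast instance, here is a minimal sketch that follows the same init / extract / get_scope shape, where extract returns one record per call and None when nothing is left. The class name, the plain-dict configuration and the in-memory feature tables are all hypothetical; the real FeastExtractor reads from Feast through its Python SDK and emits Databuilder's own metadata objects.

```python
class InMemoryFeatureTableExtractor:
    """Hypothetical extractor that mimics the init/extract/get_scope shape."""

    def init(self, conf):
        # conf is a plain dict in this sketch, not Databuilder's config object.
        self._records = self._generate(conf["feature_tables"])

    def _generate(self, feature_tables):
        # Lazily yield one metadata record per feature table.
        for table in feature_tables:
            yield {
                "name": table["name"],
                "columns": table["entities"] + table["features"],
            }

    def extract(self):
        # Return the next record, or None once everything has been emitted.
        return next(self._records, None)

    def get_scope(self):
        return "extractor.in_memory_feature_table"


extractor = InMemoryFeatureTableExtractor()
extractor.init({
    "feature_tables": [
        {"name": "driver_stats", "entities": ["driver_id"], "features": ["trips_today"]},
    ]
})
first = extractor.extract()
second = extractor.extract()
```

The caller simply keeps calling extract until it returns None, which is why a generator is a natural fit for the internal state.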

Apart from features and entities, the Extractor also pushes other metadata exposed by Feast:

  • feature table creation date
  • labels
  • specification of batch source
  • (if there is a stream source) specification of stream source.
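
One possible way to collect these extra fields into text blocks that a frontend could display is sketched below. The function name, the block names and the JSON rendering are assumptions made for illustration and do not describe how the merged extractor formats its output.

```python
import json
from datetime import datetime


def build_descriptions(created_at, labels, batch_source, stream_source=None):
    """Collect the extra feature table metadata into named text blocks."""
    details = [f"Created at: {created_at.isoformat()}"]
    if labels:
        details.append("Labels:")
        details.extend(f"- {key}: {value}" for key, value in sorted(labels.items()))
    descriptions = {
        "details": "\n".join(details),
        "batch_source": json.dumps(batch_source, indent=2, sort_keys=True),
    }
    # The stream source is optional, so add it only when one is defined.
    if stream_source is not None:
        descriptions["stream_source"] = json.dumps(
            stream_source, indent=2, sort_keys=True
        )
    return descriptions


# Hypothetical values for a feature table without a stream source.
example = build_descriptions(
    created_at=datetime(2020, 12, 1, 9, 30),
    labels={"team": "ml"},
    batch_source={"type": "file", "path": "/data/driver_stats"},
)
```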

In the frontend, it looks as follows:

[Screenshot: a Feast feature table displayed in the Amundsen frontend]

On the left, you can see all the information extracted from Feast; on the right, the list of entities and features in tabular form.

The process that extracts the data from Feast into Amundsen runs every hour as a scheduled Kubeflow Pipelines workflow:

[Screenshot: the scheduled workflow in Kubeflow Pipelines]

Summary

Amundsen’s frontend solves all the requirements that I had for a Feast UI:

  • all features are searchable by name and description
  • descriptions can be edited without leaving the UI, which makes collaboration easy
  • feature tables can be tagged for easier discovery
  • the Feast project name indicates how ready a feature table is to be used in models (with names: dev, test, beta, prod)
  • quick preview of a sample of the features (requires Apache Superset setup and a bit of configuration)
  • ability to present feature statistics, like the count of nulls, mean value or average string length (requires a bit of extra work, as Feast has not stored statistics internally since the 0.8 release)
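
As an illustration of the "extra work" mentioned in the last point, the statistics could be computed outside Feast, for example from a sample of feature values. The sketch below is a hypothetical helper built on the Python standard library; it is not part of Feast or Amundsen, and loading the results into Amundsen is left out.

```python
from statistics import mean


def feature_statistics(values):
    """Compute simple statistics for one feature from a list of sampled values."""
    non_null = [v for v in values if v is not None]
    stats = {"count": len(values), "null_count": len(values) - len(non_null)}
    # bool is a subclass of int, so exclude it from the numeric values.
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    strings = [v for v in non_null if isinstance(v, str)]
    if numeric:
        stats["mean"] = mean(numeric)
    if strings:
        stats["avg_string_length"] = mean(len(s) for s in strings)
    return stats


# Hypothetical samples of a numeric and a string feature.
numeric_stats = feature_statistics([1, None, 3])
string_stats = feature_statistics(["ab", "abcd", None])
```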

If you plan to use Feast, but you’re a bit afraid of the lack of user interface, definitely try Amundsen with FeastExtractor. Both projects are supported by LF AI&Data, so they are not going anywhere soon. And, to be frank, seeing how these two can support each other just blew my mind ;-)

If you're looking for a company to help you scale and operationalize your ML efforts with tools like Feast and Kubeflow, just write to us.

3 December 2020
