Feature store - managing multiple data sources with Feast

As the effort to productionize ML workflows is growing, feature stores are also growing in importance. Their job is to provide standardized and up-to-date features ready to use in production models, making it possible to reuse the existing features for different models, as well as to serve as a data discovery platform - a database of feature metadata. On top of that, feature stores may provide limited feature engineering capabilities.

Currently, judging by the number of GitHub stars, the most popular open source feature store implementation is Feast. The combined data discovery capabilities of Feast and Amundsen have already been presented in Mariusz's genius blog post Machine Learning Features discovery with Feast and Amundsen. In this blog post, we will cover how to use point-in-time joins on data flowing from different sources, namely a relational database and a real-time data stream, to serve production-ready features.
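To make the idea of a point-in-time join concrete, here is a minimal, library-free sketch in Python. It is not how Feast implements the join; the function name point_in_time_join, the field names and the sample values are our own assumptions for illustration. For each event, it picks the latest feature row of the same user whose timestamp is not later than the event, so that no information from the future leaks into the result.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime


def point_in_time_join(events, feature_rows):
    """Attach to each event the latest feature row at or before the event time.

    events:       dicts with "user_id" and "event_timestamp"
    feature_rows: dicts with "user_id", "event_timestamp" and feature values
    All names are illustrative assumptions, not Feast API.
    """
    # Group feature rows per user and sort them by time for binary search.
    by_user = defaultdict(list)
    for row in feature_rows:
        by_user[row["user_id"]].append(row)
    for rows in by_user.values():
        rows.sort(key=lambda r: r["event_timestamp"])

    joined = []
    for event in events:
        rows = by_user.get(event["user_id"], [])
        timestamps = [r["event_timestamp"] for r in rows]
        # Index of the first feature row strictly after the event.
        idx = bisect_right(timestamps, event["event_timestamp"])
        features = {}
        if idx > 0:
            latest = rows[idx - 1]
            features = {key: value for key, value in latest.items()
                        if key not in ("user_id", "event_timestamp")}
        joined.append({**event, **features})
    return joined


# Made-up sample data: two traffic feature rows and two orders for one user.
traffic_features = [
    {"user_id": 1, "event_timestamp": datetime(2022, 9, 1, 10), "product_visits": 2},
    {"user_id": 1, "event_timestamp": datetime(2022, 9, 1, 12), "product_visits": 5},
]
orders = [
    {"user_id": 1, "event_timestamp": datetime(2022, 9, 1, 11), "order_id": 100},
    {"user_id": 1, "event_timestamp": datetime(2022, 9, 1, 9), "order_id": 99},
]

result = point_in_time_join(orders, traffic_features)
```

In this sample, the order placed at 11:00 receives the value observed at 10:00 rather than the later one from 12:00, while the order placed at 9:00 has no earlier feature row and therefore gets no feature value at all.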

Feast demo

To showcase Feast's capabilities, we created a demo containing a sample business case with two data sources: a Postgres database and a Kafka topic. Its source code is available on GitHub.

Let's assume the following business case - an e-commerce company offers products on its website. The company has access to data about users and their orders (stored in Postgres) and a data stream regarding website traffic, ingested via Kafka topics.

The company wants to create features with the help of Feast and use them to better understand its customers' behavior. On top of the data that defines the user (coming from the Postgres table), there is a need to engineer features based on web traffic data, such as the number of landings on the listing, product and photo pages.

Sample user behavior was modeled and corresponding data was generated using the invaluable doge_datagen project. This is the user behavior model's diagram:

[Figure: diagram of the user behavior model]

Feast requires a feature_store.yaml configuration file and one or more Python files with Feast object definitions. The configuration we tested included using Postgres as an offline store and Redis as an online store.
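For orientation, a feature_store.yaml for such a setup could look roughly like the text produced by the sketch below. The key names follow the Feast documentation for the Postgres offline store and the Redis online store as we understand it for the 0.2x releases, so verify them against the version you install. The function name render_feature_store_yaml, the project name and all connection values are placeholders made up for this example, not the demo's actual configuration.

```python
def render_feature_store_yaml(project, pg, redis_connection):
    """Return the text of a feature_store.yaml for a Postgres + Redis setup.

    pg is a dict with host, port, database, db_schema, user and password.
    Key names are an assumption based on the Feast docs; check your version.
    """
    lines = [
        f"project: {project}",
        "registry: data/registry.db",
        "provider: local",
        "offline_store:",
        "  type: postgres",
        f"  host: {pg['host']}",
        f"  port: {pg['port']}",
        f"  database: {pg['database']}",
        f"  db_schema: {pg['db_schema']}",
        f"  user: {pg['user']}",
        f"  password: {pg['password']}",
        "online_store:",
        "  type: redis",
        f'  connection_string: "{redis_connection}"',
    ]
    return "\n".join(lines) + "\n"


# Placeholder values only; replace them with your own connection details.
config_text = render_feature_store_yaml(
    project="feast_demo",
    pg={
        "host": "localhost",
        "port": 5432,
        "database": "feast",
        "db_schema": "public",
        "user": "feast",
        "password": "change-me",
    },
    redis_connection="localhost:6379",
)
print(config_text)
```

Generating the file from code is only one option; writing the same keys by hand into feature_store.yaml works just as well.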

Support for Postgres and Kafka sources is new in Feast, appearing in versions 0.21 and 0.22 respectively. As of version 0.23, Kafka stream support is in the experimental stage and we believe there is still room for improvement with regard to its interface. As stated in the Feast docs, Kafka sources must have a batch source specified, which can be used for retrieving historical features. Therefore, even if the Kafka stream is used only as a source for the online store, a mock batch source must be created. So in the demo we ended up with three defined data sources:

  • for orders, we created a batch Postgres source
  • for traffic, we created a mock batch Postgres source and an online Kafka source

These sources identify their records using a subset of three defined entities: user, order and web traffic id. Finally, we defined two feature views, one for order details and the other for user traffic, which we used to create a simple visualization. Below is an animation of an example demo run:

[Animation: example demo run]

To showcase one step in visualizing the data, we created histograms showing the number of visits on the listing, product and photo pages per successful transaction. As you can see, all of these histograms show similar distributions (which is expected, as this can be deduced from the datagen model). While most of the customers decide to buy the product after just a few clicks, some seem to ponder for a long time before finally clicking the shopping cart icon!

[Figure: histograms of listing, product and photo page visits per successful transaction]
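The computation behind such histograms can be sketched in a few lines of plain Python. This is not the demo's code: the event names ("listing", "product", "photo", "purchase"), the helper names and the sample events are assumptions made for illustration. For every successful transaction, the sketch records how many pages of each type the user visited since their previous purchase, and then renders the frequencies as a simple text histogram.

```python
from collections import Counter, defaultdict

PAGE_TYPES = ("listing", "product", "photo")


def visits_per_transaction(events):
    """Return {page_type: [visit counts preceding each purchase]}.

    events: time-ordered (user_id, event_type) tuples, where "purchase"
    marks a successful transaction. Event names are illustrative assumptions.
    """
    running = defaultdict(Counter)  # visits since each user's last purchase
    samples = {page: [] for page in PAGE_TYPES}
    for user_id, event_type in events:
        if event_type == "purchase":
            for page in PAGE_TYPES:
                samples[page].append(running[user_id][page])
            running[user_id].clear()
        elif event_type in PAGE_TYPES:
            running[user_id][event_type] += 1
    return samples


def text_histogram(values):
    """Render value frequencies as one row of '#' marks per distinct value."""
    counts = Counter(values)
    return "\n".join(f"{value:>3} | {'#' * counts[value]}" for value in sorted(counts))


# Made-up events for two users and three purchases.
events = [
    (1, "listing"), (1, "product"), (1, "photo"), (1, "purchase"),
    (2, "listing"), (2, "listing"), (2, "product"), (2, "product"),
    (2, "photo"), (2, "purchase"),
    (1, "listing"), (1, "product"), (1, "purchase"),
]
samples = visits_per_transaction(events)
print(text_histogram(samples["listing"]))
```

With real data, the lists in samples could be passed directly to a plotting library instead of the text renderer.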

Caveats

While working on the demo, we encountered some issues which you may want to be aware of, especially if you are thinking about using Feast to implement a feature store in your own project. Most of those issues can be linked to the early stage of the project, so be sure to check out new Feast releases regularly.

Feature extraction is still in the alpha stage and its capabilities are limited. For example, entity values are not provided in the input dataframe of "on demand" feature views, so the dataframe is not suitable for extracting aggregates. Due to this limitation, we ended up transforming the Kafka source data before passing it on to Feast. Moreover, we did not find a way to take a random sample from the offline/online store. The documentation says it is possible to provide either a pd.DataFrame or SQL. The SQL solution seems more elegant, but we did not manage to get this option working.
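The transformation of the Kafka data mentioned above can be sketched as a small stateful step that sits between the raw traffic topic and the topic Feast reads from. The version below leaves out the Kafka client entirely and works on plain dictionaries; the function name aggregate_traffic and the field names are hypothetical and do not come from the demo. Each output record carries the entity key, the event timestamp and the running per-user counts, so that Feast only has to store ready-made feature values.

```python
from collections import Counter, defaultdict

PAGE_TYPES = ("listing", "product", "photo")


def aggregate_traffic(raw_events):
    """Yield one feature record per relevant raw event with running per-user counts.

    raw_events: iterable of dicts with "user_id", "page" and "timestamp".
    Field names are illustrative assumptions, not the demo's schema.
    """
    counts = defaultdict(Counter)
    for event in raw_events:
        page = event["page"]
        if page not in PAGE_TYPES:
            continue  # skip events that do not feed any feature
        user_counts = counts[event["user_id"]]
        user_counts[page] += 1
        yield {
            "user_id": event["user_id"],
            "event_timestamp": event["timestamp"],
            **{f"{p}_visits": user_counts[p] for p in PAGE_TYPES},
        }


# Made-up raw events; timestamps are plain integers to keep the example short.
raw = [
    {"user_id": 1, "page": "listing", "timestamp": 1},
    {"user_id": 1, "page": "product", "timestamp": 2},
    {"user_id": 2, "page": "listing", "timestamp": 3},
    {"user_id": 1, "page": "cart", "timestamp": 4},
    {"user_id": 1, "page": "listing", "timestamp": 5},
]
records = list(aggregate_traffic(raw))
```

Keeping the counters in process memory is a simplification; a production version would need durable state so that the counts survive a restart.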

What's more, as it stands, debugging is a pain, as exceptions from internal libraries abound during development. When a feast apply command results in a Pandas or psycopg2 exception, deducing what the problem is demands imagination and time. It also seems that Feast makes some assumptions about schemas that are not documented. For example, the event timestamp field and the created timestamp field of a source must not be the same field, as a source defined that way throws an internal SQL generation exception. During our testing, the Kafka online source also silently failed to ingest records, without any exception, because of a schema mismatch.
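Because such problems surface late or not at all, a few lightweight checks run before feast apply or before producing records to Kafka can save some debugging time. The sketch below is one possible shape for them, covering the two issues described above. It is plain Python and not part of Feast; the helper names check_timestamp_fields and schema_errors and the sample schema are hypothetical.

```python
from datetime import datetime

# Expected schema of a traffic record: field name -> Python type (illustrative).
TRAFFIC_SCHEMA = {
    "user_id": int,
    "event_timestamp": datetime,
    "listing_visits": int,
}


def check_timestamp_fields(event_timestamp_field, created_timestamp_field):
    """Reject a source definition that reuses one column for both timestamps."""
    if event_timestamp_field == created_timestamp_field:
        raise ValueError(
            "event timestamp and created timestamp must be different fields, "
            f"both are set to {event_timestamp_field!r}"
        )


def schema_errors(record, schema):
    """Return a list of human-readable mismatches between a record and a schema."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    for field in sorted(record.keys() - schema.keys()):
        errors.append(f"unexpected field: {field}")
    return errors


# Made-up records: the second one has a wrong type, a missing and an extra field.
good = {"user_id": 1, "event_timestamp": datetime(2022, 9, 5), "listing_visits": 3}
bad = {"user_id": "1", "event_timestamp": datetime(2022, 9, 5), "extra": 0}

print(schema_errors(good, TRAFFIC_SCHEMA))
print(schema_errors(bad, TRAFFIC_SCHEMA))
```

Logging or rejecting every record with a non-empty error list makes a mismatch visible on the producer side, instead of leaving it to be discovered through missing values in the online store.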

However, it seems that the Feast team is aware of the aforementioned debugging issue and some fixes may be applied in subsequent releases.

Feast Feature Store Demo Summary

Feature stores play an important role in operationalizing machine learning models, and Feast is certainly one of the most popular open source projects in this area. The Feast framework still has some teething issues, which we summed up in the previous section. However, the pace of improvement is fast and we hope that, as time goes by, the rough edges will be eliminated and the solution's productivity multiplier will be even greater. If you would like to know more about our other Feature Store implementations, check out our blog post Feature Store comparison: 4 Feature Stores - explained and compared.

5 September 2022