
Semi-supervised learning on real-time data streams

Unlabeled data is inherent to many machine learning applications. Sometimes we do not know the result of the action described by the input data immediately, because the result arrives later in time. In other cases the results never become available at all, for example when there is a non-negligible cost associated with acquiring labels. So, how can you benefit from unlabeled data in order to improve prediction?

In this blog post, I will present a possible way to make better use of partially labeled data. Using the techniques described below, I will augment the features of labeled data with features extracted from nearby unlabeled data, in order to improve the accuracy of label prediction. The augmentation algorithm generates the features automatically, based on the provided hyperparameters, which makes it a useful tool during the data exploration process. Promising generated features can then be used, for example, to improve machine learning models.

This blog post is based on my Master's thesis, "Semisupervised learning under delayed label setting". Let’s start!

Semi-supervised problem example case

One example of a semi-supervised problem is the prediction of flight delays. In this case, while the features of the flight (such as distance, departure time, planned arrival time, origin and destination location) may be available before landing, the value that we want to predict, the actual delay, is only available after landing. This scenario, where the event features are available before the event labels, is called a delayed label acquisition scenario.

It is possible to use semi-supervised learning for such a scenario. The simplest way to adapt the input stream to semi-supervised learning is to derive a new data stream that outputs one unlabeled record when the event features become available, followed by a second, labeled record when the event label becomes available.
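To make this concrete, here is a minimal Python sketch of such a stream conversion; the `Event` fields and the record layout are illustrative assumptions, not the exact interface used in the thesis.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Tuple

@dataclass
class Event:
    features: list        # known at event time (e.g. at departure)
    event_time: float     # when the features become available
    label: Any            # known only later (e.g. the actual delay)
    label_time: float     # when the label becomes available

def delayed_label_stream(events: list) -> Iterator[Tuple[float, list, Any]]:
    """Unroll each event into an unlabeled record at event_time and a
    labeled record at label_time, then emit all records in time order."""
    records = []
    for e in events:
        records.append((e.event_time, e.features, None))     # unlabeled record
        records.append((e.label_time, e.features, e.label))  # labeled record
    return iter(sorted(records, key=lambda r: r[0]))

# A flight departing at t=0.0 whose actual delay becomes known at t=2.5
flights = [Event([350.0, 14.5], event_time=0.0, label=12.0, label_time=2.5)]
for t, features, label in delayed_label_stream(flights):
    print(t, features, label)
```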

The semi-supervised data stream augmentation algorithm concept

In order to make the input data augmentation useful, let's assume that events which occurred shortly before the current one may contain valuable information for predicting it. In the flight delay prediction example, this assumption makes sense: flights taking place at a similar time and location experience shared external conditions, such as weather or traffic congestion. We may therefore expect patterns to emerge among nearby delay values.

As a result, if nearby events are indeed correlated, then in order to improve the quality of a prediction, the features of events that took place at a similar time and location should be included, whether those events are labeled or not.

The idea behind the implementation of the automatic data augmentation algorithm is to keep a buffer of features of recently observed events. Those features can be concatenated, clustered and aggregated, or otherwise processed into new features.

The data stream may be logically partitioned in order to increase the locality of the buffer. In the flight delay prediction example, it may make sense to create a separate buffer partition for each destination airport: if many flights at a particular airport were recently delayed, we can expect the following flights to be delayed as well.
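A minimal sketch of such a partitioned sliding buffer, assuming a fixed per-partition capacity and using the destination airport as a hypothetical partition key:

```python
from collections import defaultdict, deque

class PartitionedBuffer:
    """Keeps the most recent max_size feature vectors per partition key."""

    def __init__(self, max_size: int):
        self.partitions = defaultdict(lambda: deque(maxlen=max_size))

    def add(self, key, features):
        self.partitions[key].append(features)  # oldest record drops out first

    def get(self, key):
        return list(self.partitions[key])

# One logical partition per destination airport
buffer = PartitionedBuffer(max_size=100)
buffer.add("JFK", [350.0, 14.5, 1.0])  # hypothetical flight feature vector
nearby = buffer.get("JFK")             # recent events for the same airport
```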

Solution

The semi-supervised data stream augmentation algorithm was implemented using [MOA](https://moa.cms.waikato.ac.nz/). Four main approaches to enriching the input data with the records kept in the buffer were tested:

1. The naïve approach

The simplest approach is to concatenate all of the buffer's records to the current record, without any further feature extraction. However, this approach's efficiency can be hampered by high data dimensionality - if the incoming event and each of the n buffered events contain p attributes, the resulting vector contains (n + 1) * p attributes.
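A sketch of the naïve augmentation, illustrating the (n + 1) * p growth; zero-padding a buffer that is not yet full is my assumption:

```python
import numpy as np

def naive_augment(event: np.ndarray, buffer: list, n: int) -> np.ndarray:
    """Concatenate the current event with up to n buffered events,
    zero-padding when the buffer holds fewer than n records."""
    p = event.shape[0]
    recent = list(buffer[-n:])
    padded = recent + [np.zeros(p)] * (n - len(recent))
    return np.concatenate([event] + padded)  # (n + 1) * p attributes

event = np.array([1.0, 2.0, 3.0])               # p = 3 attributes
buffer = [np.array([4.0, 5.0, 6.0])]
print(naive_augment(event, buffer, n=4).shape)  # (15,) == (4 + 1) * 3
```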

2. The concatenated feature extraction approach

Using dimensionality reduction techniques on the buffer records can improve on the naïve approach. In this approach, each of the events currently in the buffer is processed using a feature extraction algorithm, and the resulting extracted features are concatenated with the original event attributes.
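A sketch of this approach, with scikit-learn's PCA standing in for the feature extraction step; the choice of extractor and the target dimensionality k are assumptions, not necessarily what the thesis used:

```python
import numpy as np
from sklearn.decomposition import PCA

def extract_and_augment(event: np.ndarray, buffer: list, k: int) -> np.ndarray:
    """Reduce each buffered record from p to k dimensions with PCA fitted
    on the buffer, then concatenate the result with the current event."""
    X = np.vstack(buffer)                 # shape (n, p)
    reduced = PCA(n_components=k).fit_transform(X)  # shape (n, k)
    return np.concatenate([event, reduced.ravel()])  # p + n * k attributes

buffer = [np.random.rand(5) for _ in range(10)]      # n = 10, p = 5
event = np.random.rand(5)
print(extract_and_augment(event, buffer, k=2).shape)  # (25,) == 5 + 10 * 2
```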

3. The cluster aggregation approach

In this approach, a clustering model is trained in an unsupervised way on the events currently kept in the buffer. Each buffered element is then assigned to a cluster, which gives us the number of elements belonging to each cluster group. This cluster histogram is concatenated to the current input data record.
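A sketch of the cluster aggregation approach, using k-means as the unsupervised clustering step (the algorithm and the number of clusters are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_augment(event: np.ndarray, buffer: list, n_clusters: int) -> np.ndarray:
    """Cluster the buffered records, count the members of each cluster,
    and append this histogram to the current event's attributes."""
    X = np.vstack(buffer)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    counts = np.bincount(labels, minlength=n_clusters)  # one count per cluster
    return np.concatenate([event, counts])              # p + n_clusters attributes

buffer = [np.random.rand(4) for _ in range(20)]
event = np.random.rand(4)
print(cluster_augment(event, buffer, n_clusters=3).shape)  # (7,) == 4 + 3
```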

4. The cluster complex event aggregation approach

As in the previous approach, each element in the buffer is clustered. However, instead of aggregating per-cluster counts, occurrences of complex event decision rules are counted. Each rule is represented as a regular expression over the sequence of cluster assignments in the buffer. An example of such a rule is Ca X* Cb, which is satisfied when some event in the buffer belongs to cluster Ca and a later event in the buffer belongs to cluster Cb.
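A sketch of the rule counting, assuming the buffered events have already been assigned cluster labels; each cluster is encoded as a single character so that a rule like Ca X* Cb becomes an ordinary regular expression. The encoding and the non-overlapping match semantics are my assumptions:

```python
import re

def encode(cluster_sequence: list) -> str:
    """Map each cluster id to a single character, e.g. cluster 0 -> 'a'."""
    return "".join(chr(ord("a") + c) for c in cluster_sequence)

def count_rule(cluster_sequence: list, pattern: str) -> int:
    """Count non-overlapping occurrences of a complex event rule,
    expressed as a regex over the encoded cluster sequence."""
    return len(re.findall(pattern, encode(cluster_sequence)))

# Rule "Ca X* Cb": an event in cluster a, later followed by one in cluster b
buffer_clusters = [0, 2, 2, 1, 0, 1]         # cluster of each buffered event
print(count_rule(buffer_clusters, "a.*?b"))  # hypothetical rule encoding -> 2
```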

Results

The tests were performed on semi-supervised datasets. Among them were three open source datasets, a proprietary closed source dataset, and three generated datasets. The most successful augmentation approaches slightly outperformed the unaugmented approach on three of the four real-life-based datasets. Which approach is best depends on the dataset - approaches 2, 3 and 4 each proved the most performant on at least one tested dataset, while the naïve approach was, as expected, subpar.

Moreover, the computational overhead induced by these approaches was negligible in comparison to the overhead of handling a larger number of attributes.

In other words, a model with fewer total attributes, some of which were generated by the augmentation algorithm, should be faster than an unaugmented model with more attributes in total.

Source code

An open source implementation of the methods tested in the thesis is available at https://github.com/kosmag/FeatExtream. The repository contains MOA extension classes as well as Jupyter notebooks for benchmark data creation and result visualization. The code is licensed under the GNU General Public License v3.0.

Further work

This area of research is expanding quickly, and the solution presented above is meant as a proof of the concept's viability.

In order to create a production-ready implementation of the solution, it makes sense to look into the capabilities of state-of-the-art feature stores, for example [Feast](https://docs.feast.dev/). Using an existing feature store makes it possible to combine the improved prediction performance with scalability, as well as with existing integrations with on-premise and cloud data sources and the supported deployment options.

Conclusions

In conclusion, automated data augmentation, as suggested in the thesis, may provide a boost in semi-supervised prediction performance, at the cost of added complexity in the prediction algorithm. How much performance can be gained this way will vary from dataset to dataset.

The low computational overhead means that such augmentation can be a computationally cheap, automatic way of finding useful features for the model.

Finally, I would like to thank my thesis supervisor. He offered me critical advice as well as constant encouragement. His engagement throughout the thesis writing process made the whole experience much more pleasant.

