Data online generation for event stream processing

In many of the business cases that we solve at GetInData when working with our clients, we need to analyze sessions: series of related events generated by actors, in order to find correlations or patterns in the actors' behavior. Some of the cases include:

Analysis of user web sessions for web analytics.
Funnel analysis, cohort analysis and many more, to understand how product users interact with the product, behave, convert or drop out.
Fraud/anomaly or threat detection to prevent them quickly.
Product recommendations, suggestions and many more…

Generally speaking, this applies wherever behavioral analytics is used.

To research and develop applications for these use cases, we need to create test data for experimentation, functional tests, performance tests, demos and so on.

Randomly generated events produced by most data generators are not enough. The data they produce is unrelated: there are no actor sessions. However, a few projects have tried to tackle this problem. One of them is Interana/eventsim on GitHub, which was sadly abandoned a few years ago.

So, what could a solution that supports the use cases mentioned above look like? We can start by modeling actor behavior: what the actor can do and how it can evolve over time. This can be expressed as a state machine.

There are still a few things missing, though. For a start, we don't want to just have a state machine and then invoke transitions by writing lines of code; that would land us back where we started. We want our state machine to transition automatically, based on some probability.
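The idea of a state machine that moves on its own can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the states (idle, active), the event names and the probabilities are hypothetical, and the code is not taken from any existing tool. In each step a random number decides which transition fires, if any.

```python
import random

# Hypothetical model: each state maps to (event name, next state, probability per step).
# Probabilities are fractions of 1; whatever is left over means "stay in the current state".
TRANSITIONS = {
    "idle": [("start", "active", 0.2)],
    "active": [("act", "active", 0.5), ("stop", "idle", 0.3)],
}


def step(state, rng):
    """Perform one step and return (event name or None, new state)."""
    roll = rng.random()  # uniform number in [0, 1)
    cumulative = 0.0
    for event, next_state, probability in TRANSITIONS[state]:
        cumulative += probability
        if roll < cumulative:
            return event, next_state
    return None, state  # nothing fired in this step


def simulate(steps, seed=42):
    """Run one actor for a number of steps and collect (step, event) pairs."""
    rng = random.Random(seed)
    state = "idle"
    events = []
    for tick in range(steps):
        event, state = step(state, rng)
        if event is not None:
            events.append((tick, event))
    return events


print(simulate(20))
```

Because the events of one actor always follow valid paths through the model, the output forms a session rather than a set of unrelated records.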

Stream processing use case

Let's consider an example in which we are working on pattern recognition. We are trying to find banking application users who were thinking about taking a loan but did not take it. We want to target users with a low balance who opened the loan screen and exited without taking a loan. A model of such an interaction with the application could look like this:

[Diagram: state machine with Offline, Online and Loan screen states and transition probabilities]

A user can log into the application, and when they are Online they can open the loan screen or exit the application. From the Loan screen they can either take a loan or exit.

But what does a 0.1% probability mean? It is the probability that our state machine will automatically make that transition in each step. If we assign a real-world duration to a step, for example 1 minute, a 0.1% probability means that on average a user will log in once every 1000 minutes, or roughly every 17 hours. From another perspective, in each minute 1 in 1000 users will log in.
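The arithmetic behind these numbers can be checked with a short calculation. It assumes that each step is an independent trial, so the number of steps until the first login follows a geometric distribution with mean 1 / p. The variable names are illustrative.

```python
# Probability that a single user logs in during one step (0.1%).
p_login = 0.001
step_minutes = 1

# Mean of a geometric distribution: expected number of steps until the first login.
expected_steps = 1 / p_login
expected_hours = expected_steps * step_minutes / 60

# Expected number of logins per step in a population of 1000 users.
users = 1000
expected_logins_per_step = users * p_login

print(f"{expected_steps:.0f} steps, about {expected_hours:.1f} hours")
print(f"{expected_logins_per_step:.0f} login(s) per step among {users} users")
```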

Using the above, we can generate a clickstream of user actions. But what about the balance? It cannot be easily expressed in a finite-state machine. We could think about adding a new set of states (Offline low balance, Online low balance, Loan Screen low balance), but that looks like duplication, and it gets even messier when we think about adding something else, for example a loan balance.

[Diagram: state machine extended with separate low balance states]

Because the balance does not affect the set of possible actions, it should not be modeled as a state. It is just an attribute of the particular user traversing our state machine, whose value can be emitted when we generate events during transitions. Using attributes, we can go back to our simpler state machine, this time enriched with new actions that do not change the state but do change the values of attributes. How can we generate data suitable for stream processing?

[Diagram: simplified state machine with additional actions that change user attributes]
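The difference between a state and an attribute can be made concrete with a small sketch. The User class and the apply_action helper below are hypothetical names used only for illustration. Some actions keep the state and change the balance, while others change the state and leave the balance alone, and every emitted event carries the current attribute value.

```python
from dataclasses import dataclass


@dataclass
class User:
    """A subject traversing the state machine. Balance is an attribute, not a state."""
    user_id: int
    state: str = "offline"
    balance: int = 0


def apply_action(user, event, next_state, balance_change=0):
    """Apply one transition and return the event emitted for it."""
    user.state = next_state
    user.balance += balance_change
    # The attribute value is attached to the event at emission time.
    return {"user_id": user.user_id, "event": event, "balance": user.balance}


user = User(user_id=1)
events = [
    apply_action(user, "income", "offline", balance_change=1000),    # state unchanged
    apply_action(user, "spending", "offline", balance_change=-250),  # state unchanged
    apply_action(user, "login", "online"),                           # attribute unchanged
]
for event in events:
    print(event)
```

With this split the model keeps three states no matter how many attributes (balance, loan balance and so on) are added later.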

Data Online Generator

We have created a tool that can do all of the above: Data Online Generator (getindata/doge-datagen on GitHub). It can be used to simulate user, system or other actor behavior based on a probabilistic model. It is a state machine which is traversed automatically by multiple subjects, based on the probability defined for each possible transition.

To express the above state machine, we need to define the list of states, the initial state, a factory to create the initial state of users, the number of users, the tick length and the number of ticks to simulate. Then we need to define each transition: its name, beginning and ending state, probability, and optional callback and sinks. The callback is invoked during the transition and allows user attributes to be changed. It can also be used to implement conditional transitions: returning False blocks the state machine from transitioning. For example, we can block the Spending transition if the balance is zero. Event sinks are used to generate events during a transition. Currently DOGE-datagen ships with sinks for Database and Kafka.

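# UserFactory, the *_callback functions and the *_sink objects are defined elsewhere.
# Positional arguments, in the order described above: states, initial state, subject factory,
# number of users (10), tick length (60000, one minute if expressed in milliseconds) and
# number of ticks (1000). The last value appears to be the start timestamp in epoch
# milliseconds: the first events in the example output are stamped one tick after it.
datagen = DataOnlineGenerator(['offline', 'online', 'loan_screen'], 'offline', UserFactory(), 10, 60000, 1000, 1644549708000)
# Probabilities appear to be given in percent (0.1 matches the 0.1% login probability discussed earlier).
# Self-transitions: the state stays 'offline', only user attributes change.
datagen.add_transition('income', 'offline', 'offline', 0.01,
                       action_callback=income_callback, event_sinks=[balance_sink])
datagen.add_transition('spending', 'offline', 'offline', 0.1,
                       action_callback=spending_callback, event_sinks=[trx_sink, balance_sink])
# State-changing transitions that make up the clickstream.
datagen.add_transition('login', 'offline', 'online', 0.1, event_sinks=[clickstream_sink])
datagen.add_transition('logout', 'online', 'offline', 70, event_sinks=[])
datagen.add_transition('open_loan_screen ', 'online', 'loan_screen', 30, event_sinks=[clickstream_sink])
datagen.add_transition('close_loan_screen', 'loan_screen', 'online', 40, event_sinks=[clickstream_sink])
datagen.add_transition('take_loan', 'loan_screen', 'online', 10,
                       action_callback=take_loan_callback, event_sinks=[clickstream_sink, loan_sink, balance_sink])
datagen.start()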
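The configuration above references a user factory and callbacks that are not shown. Below is a minimal sketch of what they might look like. It is an assumption, not the library's code: it presumes that a callback receives the subject and the transition and returns a boolean, it does not inherit from any doge-datagen base class, and the amounts are illustrative (the loan amount of 10000 was picked to match the example output).

```python
import random
from dataclasses import dataclass


@dataclass
class User:
    user_id: int
    balance: int
    loan_balance: int = 0
    last_payment: int = 0


class UserFactory:
    """Hypothetical factory: sequential ids and a random starting balance."""

    def __init__(self, seed=0):
        self._next_id = 0
        self._rng = random.Random(seed)

    def create(self):
        user = User(user_id=self._next_id, balance=self._rng.randint(0, 1000))
        self._next_id += 1
        return user


def income_callback(subject, transition):
    """Assumed signature: (subject, transition) -> bool."""
    subject.balance += random.randint(500, 2000)
    return True  # allow the transition


def spending_callback(subject, transition):
    """Block the 'spending' transition when there is nothing to spend."""
    if subject.balance <= 0:
        return False  # returning False blocks the transition
    subject.last_payment = random.randint(1, min(100, subject.balance))
    subject.balance -= subject.last_payment
    return True


def take_loan_callback(subject, transition):
    subject.balance += 10000
    subject.loan_balance += 10000
    return True


# Calling a callback directly; None stands in for the transition object.
broke = User(user_id=0, balance=0)
print(spending_callback(broke, None))  # False, the transition would be blocked
```

Keeping the condition inside the callback means the state machine definition stays declarative, while rules such as "no spending with an empty account" live next to the attribute they guard.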

Example output:

$ kafka-avro-console-consumer --topic clickstream --consumer.config ~/.confluent/local.conf --bootstrap-server localhost:9092 --from-beginning
{"user_id":"7","event":"login","timestamp":1644549768000}
{"user_id":"7","event":"open_loan_screen ","timestamp":1644549828000}
{"user_id":"7","event":"take_loan","timestamp":1644550008000}
{"user_id":"1","event":"login","timestamp":1644556668000}
{"user_id":"1","event":"open_loan_screen ","timestamp":1644556728000}
{"user_id":"1","event":"close_loan_screen","timestamp":1644556968000}
{"user_id":"4","event":"login","timestamp":1644559668000}
{"user_id":"4","event":"open_loan_screen ","timestamp":1644559728000}
{"user_id":"4","event":"close_loan_screen","timestamp":1644559788000}

$ kafka-avro-console-consumer --topic trx --consumer.config ~/.confluent/local.conf --bootstrap-server localhost:9092 --from-beginning
{"user_id":"8","amount":75,"timestamp":1644553008000}
{"user_id":"6","amount":47,"timestamp":1644558708000}
{"user_id":"3","amount":42,"timestamp":1644563808000}
{"user_id":"0","amount":85,"timestamp":1644583128000}
{"user_id":"2","amount":40,"timestamp":1644597228000}
{"user_id":"5","amount":18,"timestamp":1644601308000}

postgres=# select * from balance;
 timestamp     | user_id | balance 
---------------+---------+---------
 1644550008000 |       7 |   10933
 1644553008000 |       8 |     409
 1644558708000 |       6 |     320
 1644563808000 |       3 |     577
 1644583128000 |       0 |     259
 1644597228000 |       2 |     697
 1644601308000 |       5 |     702

postgres=# select * from loan;
 timestamp     | user_id | loan  
---------------+---------+-------
 1644550008000 |       7 | 10000
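To connect the generated data back to the use case, the clickstream sample above can be scanned for users who opened the loan screen and left without taking a loan. This is a plain Python sketch over the printed lines, not a stream processing job. The function name is hypothetical, the events are assumed to arrive in timestamp order, and the low balance condition is left out because the sample contains no balance rows for these users.

```python
import json

# Clickstream events copied from the example output above.
CLICKSTREAM = """
{"user_id":"7","event":"login","timestamp":1644549768000}
{"user_id":"7","event":"open_loan_screen ","timestamp":1644549828000}
{"user_id":"7","event":"take_loan","timestamp":1644550008000}
{"user_id":"1","event":"login","timestamp":1644556668000}
{"user_id":"1","event":"open_loan_screen ","timestamp":1644556728000}
{"user_id":"1","event":"close_loan_screen","timestamp":1644556968000}
{"user_id":"4","event":"login","timestamp":1644559668000}
{"user_id":"4","event":"open_loan_screen ","timestamp":1644559728000}
{"user_id":"4","event":"close_loan_screen","timestamp":1644559788000}
""".strip().splitlines()


def abandoned_loan_users(lines):
    """Return ids of users who opened the loan screen and closed it without taking a loan."""
    on_loan_screen = set()
    abandoned = []
    for line in lines:
        record = json.loads(line)
        user = record["user_id"]
        event = record["event"].strip()  # some event names carry a trailing space
        if event == "open_loan_screen":
            on_loan_screen.add(user)
        elif event == "take_loan":
            on_loan_screen.discard(user)
        elif event == "close_loan_screen" and user in on_loan_screen:
            on_loan_screen.discard(user)
            abandoned.append(user)
    return abandoned


print(abandoned_loan_users(CLICKSTREAM))  # ['1', '4']
```

User 7 took the loan and is therefore not reported, while users 1 and 4 match the pattern.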

Conclusion

Data Online Generator can be used not only during development and experimentation, but also as part of automated test pipelines. It is designed to be flexible enough to support many use cases, and it is extensible: new event sinks are easy to implement. Visit our GitHub to see examples of how to work with the Data Online Generator.

If you're interested in stream processing, check our CEP Platform.

Last updated: 15 March 2022

Written by

Grzegorz Liter

Staff Data Engineer
