Stream Analytics platform for a Telco – part 1
Some time ago we were contacted by Kcell, one of the largest telecommunication companies in Kazakhstan, asking us to help them build a new real-time platform for complex event processing.
Kcell is part of the Scandinavian telecommunication holding Telia Company. It has a significant consumer base of more than 10 million subscribers in Kazakhstan. Fortunately for us, they are also as open to innovation and trying out new things as we at GetInData are. This allowed us to build, in a joint effort, a modern stream analytics platform on top of cutting-edge technologies like Apache Flink, Apache Kafka, Apache NiFi and others, which we will try to describe in this blog post.
Although this blog post has a single author, it describes the results of joint work done by the whole team of Big Data specialists from two companies (Kcell and GetInData).
When one thinks about which companies have the most data to analyze, the obvious candidates are internet giants like Google, Spotify or Amazon. However, telco operators probably come just behind them. They collect huge volumes of data from sources of different kinds: billing events, data usage events, roaming or location events. Having access to all of those, the first thing you want is to be able to react instantly and automatically upon some predefined patterns of events. Whenever such a chain of events occurs, one may want to send some information to the subscriber, block some of their actions or notify one of the internal departments.
In the first phase of the project, we have built the platform and implemented 10+ use-cases running on top of it. To better understand the possible use-cases let’s have a look at two of them:
Some of the most obvious cases apply to marketing. As a concerned service provider, Kcell cares that the user experience is the best possible. Imagine a situation where a subscriber tops up his or her balance very often in a short period of time, e.g. because he or she uses the internet intensively but is not subscribed to any special tariff. If we spot such a situation, we may send a text message with an offer of a tailored tariff. Even better if we do it right away, while he or she is still thinking about the top-up; this way the chance the message will be successful is definitely higher. One must remember that a satisfied subscriber is less willing to change mobile providers.
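To make the pattern concrete, here is a minimal Python sketch of the sliding-window logic such a rule could use. The thresholds, function names and detector structure are our illustrative assumptions, not Kcell's actual rule, which runs on Flink:

```python
from collections import deque

def make_topup_detector(max_topups=3, window_seconds=3600):
    """Flag a subscriber who tops up more than `max_topups` times
    within a sliding window of `window_seconds` (illustrative values)."""
    history = {}  # subscriber_id -> deque of top-up timestamps

    def on_topup(subscriber_id, timestamp):
        events = history.setdefault(subscriber_id, deque())
        events.append(timestamp)
        # Drop top-ups that have fallen out of the window.
        while events and events[0] <= timestamp - window_seconds:
            events.popleft()
        # True means: trigger a tailored-tariff SMS offer.
        return len(events) > max_topups

    return on_topup
```

In the real platform this per-subscriber state lives in Flink's managed, fault-tolerant state rather than a plain dictionary.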
A different group of cases involves looking for fraudulent actions. Here it is vital to spot such behaviours as soon as possible to limit the loss. One such case that we want to detect is when someone registers in roaming even though he or she does not have a sufficient balance to do so.
As described here, with roaming, the call data is collected by the visited location operator — in some countries, this might still be another domestic operator, but often it can be a telecom service provider located halfway around the world. And normally you wouldn’t get access to roaming call records until this data arrives back at your home network, minutes or hours later. This gives a convenient window of opportunity for fraudulent actions that we want to close.
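A purely illustrative sketch of the core check: keep the latest known balance per subscriber from the billing stream and flag roaming registrations made without sufficient funds. The event shape, field names and threshold are our assumptions:

```python
def make_roaming_fraud_detector(min_balance=1.0):
    """Join balance updates with roaming registrations and flag
    registrations made with an insufficient balance (illustrative)."""
    balances = {}  # latest known balance per subscriber

    def on_event(event):
        kind, subscriber_id, value = event  # assumed event shape
        if kind == "balance":
            balances[subscriber_id] = value
            return None
        if kind == "roaming_registration":
            # True means: raise a fraud alert for this registration.
            return balances.get(subscriber_id, 0.0) < min_balance
        return None

    return on_event
```

In practice this is a stateful join of two streams; the sketch ignores ordering issues, which is exactly why event-time processing (discussed below) matters.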
To come up with a suitable solution architecture we started with creating a list of key requirements both in terms of features needed as well as performance.
First of all, it is worth mentioning the scale of the problem. As we already said, Kcell has more than 10 million subscribers, and we get about 160 thousand events per second on average (and 300k eps at peak!) from just a subset of the most important data sources. This means our platform was supposed to handle around 23 TB/month at the beginning, and this number will only grow with future use cases.
Because our use cases deal with continuously arriving events, there was no choice but to build a stream processing platform. For a streaming platform to be reliable, it is crucial that it is not only performant but also fault-tolerant. What is important to remember is that fault tolerance means not only that the platform should run 24/7 and can restart itself after a failure, but also that its internal state will be rewound to the same value it had before the failure occurred, so that all computations are accurate as if no failure happened at all.
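The rewind requirement can be illustrated with a toy checkpoint-and-replay example, which is, very loosely, what Flink's consistent snapshots automate for distributed state. All names here are illustrative:

```python
import copy

class CheckpointedCounter:
    """Toy checkpoint-and-rewind: snapshot the state together with the
    input position; after a failure, restore the snapshot and replay
    the input from that position to get exactly the same result."""

    def __init__(self):
        self.counts = {}
        self.snapshot = ({}, 0)  # (state copy, position in the input)

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1

    def checkpoint(self, position):
        self.snapshot = (copy.deepcopy(self.counts), position)

    def recover(self):
        state, position = self.snapshot
        self.counts = copy.deepcopy(state)
        return position  # the offset to rewind the source to
```

The crucial point is that the state snapshot and the input position are saved atomically, so replaying from the returned offset neither loses nor double-counts events.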
It was also quite obvious to us that our system should work in event time. This is a key requirement so that the notifications we generate are correct, which is not possible if we do not recreate the original order of events from different systems like billing and roaming. If you want to read more about the pitfalls of processing-time computations, based on a Spotify example, please have a look at our previous post.
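Recreating the original order works roughly like this: events are buffered briefly, and a watermark (the highest timestamp seen so far minus an allowed lateness) decides when an event can safely be released in event-time order. A minimal Python sketch of this idea, not Flink's actual watermark mechanism:

```python
import heapq

def make_event_time_sorter(max_out_of_orderness):
    """Buffer events and release them in event-time order once the
    watermark (max seen timestamp minus allowed lateness) passes them."""
    buffer = []
    max_seen = float("-inf")

    def on_event(timestamp, payload):
        nonlocal max_seen
        heapq.heappush(buffer, (timestamp, payload))
        max_seen = max(max_seen, timestamp)
        watermark = max_seen - max_out_of_orderness
        ready = []
        # Everything at or below the watermark is safe to emit.
        while buffer and buffer[0][0] <= watermark:
            ready.append(heapq.heappop(buffer))
        return ready

    return on_event
```

The trade-off is visible even in the toy: a larger allowed lateness tolerates more disorder between sources like billing and roaming, at the cost of higher latency before results are emitted.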
We also found out that the set of sources and sinks (external systems, e.g. an SMS gateway, to which we need to push our notifications) is not defined upfront at all, and that we would need to adjust it quite frequently. Therefore we came to the conclusion that we needed to build separate layers for ingestion and outgestion with elasticity in mind.
Last but not least, we needed to provide a simple way for business requesters to tweak the running business rules now and then. We were aware, though, that programming was not among their skills. Therefore we had to build a GUI that allows them to adjust parameters like thresholds, simple filters and so on.
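One simple way to make rules tweakable without redeployment is to keep the tunables out of the rule's code, in a parameter store that the GUI edits. A hypothetical sketch (the keys and rule shape are invented for illustration):

```python
def make_rule(params):
    """A business rule whose tunables live in a mutable `params`
    mapping, so an external tool (e.g. a GUI) can adjust them while
    the rule keeps running; the keys here are illustrative."""
    def applies(event):
        return (event["amount"] >= params["threshold"]
                and event["channel"] in params["allowed_channels"])
    return applies
```

In the real platform the updated parameters have to be propagated to the running Flink jobs, but the principle is the same: the rule logic stays fixed while its thresholds and filters change.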
All those requirements led to the architecture shown in the picture below:
At the heart of our stream analytics platform we put Apache Flink as the main processing engine. We really appreciate its features, especially that it:
– provides the most flexible event time support
– handles its state in a fault-tolerant manner, thanks to its algorithm of consistent distributed snapshots, and does not put any limitations on the size of the state, because it can store it outside of the JVM heap in RocksDB
– provides the high-throughput and low-latency capabilities that match the scale of our problem
The other key component is Apache Kafka, which has already become a standard for moving data around. In our solution, it is the central event hub that all data comes through. It is also always the only source and sink for Apache Flink, which allows us to rewind to any particular moment in time.
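Why does routing everything through Kafka enable rewinding? Because a Kafka topic is a replayable, append-only log: consumers control their own read offset, so any retained past position can be re-read. A toy Python sketch of that property, not Kafka's actual API:

```python
class Topic:
    """Toy append-only log illustrating replayability: records are
    immutable and retained, so a consumer can re-read from any past
    offset, e.g. the one saved in its last checkpoint."""

    def __init__(self):
        self._log = []

    def append(self, record):
        self._log.append(record)
        return len(self._log) - 1  # the record's offset

    def read_from(self, offset):
        # Rewinding is just reading again from an earlier offset.
        return self._log[offset:]
```

Combined with checkpointed offsets, this is what lets Flink recover after a failure or reprocess history after a rule change.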
As already mentioned, we noticed that the outermost layers (both ingestion and outgestion) would change quite frequently due to the unreliability of the external systems that we communicate with. Therefore we chose Apache NiFi as the engine for those layers, because it allows us to flexibly adjust our ingestion and outgestion pipelines through a simple GUI.
Next to the strictly functional components we have put additional components like monitoring and log collection, which allow us to reason about the performance of our stream analytics platform. Without those parts, we would not be able to say if our platform works properly. They also let us debug any problem that may slip through our testing process, even though we put a lot of effort into the testing platform, which we will describe later.
Of course, building such a stream analytics platform does not end with choosing the overall architecture. We still needed to make many lower-level decisions, which are no less significant on the road to success. We will go through some of them in the next blog post.
Please sign up for our newsletter to stay up to date!
Edit: you can read the second part of this blog post here.