Can AI automatically fix and optimize IT systems like Flink or Spark?

Will AI replace us tomorrow? In recent years, there have been many predictions about what areas of our lives will be automated and which professions or services will become unnecessary. I'm not talking about sci-fi books, but research-backed analyses. How does this relate to the IT industry? Will ML models be able to solve system problems and optimize Flink, Spark or other data processing systems? Or even replace fully-fledged software engineers?

Becoming a Flink (or other tech) master

While working on one of the projects in Apache Flink, when I was once again analyzing the logs, I wondered how long it would be before I found a solution to the problem at hand. It occurred to me that surely there is a programmer out there who has encountered the same problem and solved it. Apparently, however, they didn't share the solution on Stack Overflow, because I couldn't find a trace of it anywhere. Probably, the next time they encounter a similar error message in the logs, they will immediately know where to look for the problem - because they have previously gained experience by poring over the logs and the thicket of configuration parameters on their own.

However, to be a true expert in a specific technology, you would have to wade through a tremendous number of problems and understand the configuration parameters well enough to exploit their potential and build a reliable, well-tuned solution.

Unfortunately, the process of gaining this knowledge and experience is very painstaking, depends on the project we are working on (not all problems will occur in our project) and, most importantly, is individual to each of us. No one will read the documentation for us, go through hundreds of examples of technology use in various configurations and struggle through countless failures.

But what if it was possible to extract this experience gained over the years and make it available to other engineers - who are not yet experts in the technology?

Let's consider for a moment what this experience is.

Making the expertise an IF -> ELSE algorithm

For me, it is a complex IF ... ELSE algorithm, for example:

if the checkpointing takes too long, take a peek at how the X parameter is configured and how much data is processed per second
when we use Flink in version X and in the logs we encounter exception Y, then raise the version to X' because the developers just fixed this problem in it
when the volume of processed data is about X and the amount of remaining RAM approaches Y, and the exception Z appears in the logs, then increase the amount of memory

All this expertise takes the form of a complex decision tree. The more expertise we have, the more IF ... ELSE rules we remember.
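The three rules listed above can be written down almost literally. Below is a minimal Python sketch. The Observation fields, the thresholds and the exception names are hypothetical placeholders chosen for illustration, not real Flink settings or log messages.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    """A snapshot of a running job; every field is a hypothetical placeholder."""
    checkpoint_duration_s: float
    records_per_second: float
    free_ram_mb: float
    exception: Optional[str] = None


def diagnose(obs: Observation) -> List[str]:
    """Return every piece of advice whose condition matches the observation."""
    advice = []
    # Rule 1: slow checkpoints -> look at checkpoint settings and throughput.
    if obs.checkpoint_duration_s > 60:
        advice.append("review checkpoint configuration and input rate")
    # Rule 2: a known exception that a newer release is assumed to fix.
    if obs.exception == "ExceptionY":
        advice.append("upgrade to the release that fixes ExceptionY")
    # Rule 3: high load, little free memory and a memory-related exception.
    if (obs.records_per_second > 100_000
            and obs.free_ram_mb < 256
            and obs.exception == "ExceptionZ"):
        advice.append("increase the memory available to the job")
    return advice
```

Every new lesson learned becomes one more branch in diagnose, which is exactly why such hand-written rule sets get hard to maintain as expertise grows.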

But wait a minute... the decision tree is, after all, one of the simpler ML algorithms.

Would it be possible to use machine learning to analyze Flink's problems?

ML model for fixing Flink

Let's consider for a moment how such a model could be trained.

Let's take as input parameters:

Some basic configuration parameters, e.g.:

allocated RAM
level of parallelization (parallelism)
the way checkpointing is defined

As output attributes, let's examine:

logs
whether there are errors and warnings in them
the amount of RAM used per worker
the amount of CPU used by the worker

Now we hand the problem over to the ML magician and after a few days and liters of coffee we get a trained recommender model that could:

link errors occurring in the logs to the current Flink configuration and recommend changing it for optimization purposes
recommend a code or configuration change when a specific error is encountered

I bet that such a system would not be completely accurate, but even if it automatically solved 80% of the problems, it would still be pretty good.
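To make the idea more concrete, here is a minimal sketch of how such a recommender could be learned from past incidents. It is a tiny ID3-style decision tree written in plain Python, not a production approach. The incident history, the feature names and the fixes are all made up for illustration, and the configuration values are assumed to be already bucketed into categories such as "low" and "high".

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def majority(labels):
    """Most frequent label in the list."""
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, labels, features):
    """Recursively split on the feature with the highest information gain."""
    if len(set(labels)) == 1 or not features:
        return majority(labels)  # leaf: a recommended fix

    def gain(feature):
        remainder = 0.0
        for value in {row[feature] for row in rows}:
            subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder

    best = max(features, key=gain)
    rest = [f for f in features if f != best]
    branches = {}
    for value in {row[best] for row in rows}:
        picked = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree([rows[i] for i in picked],
                                     [labels[i] for i in picked], rest)
    # Inner node: (feature, children, fallback fix for unseen values).
    return (best, branches, majority(labels))


def recommend(tree, row):
    """Walk the tree for one observation and return the suggested fix."""
    while isinstance(tree, tuple):
        feature, branches, fallback = tree
        if row.get(feature) not in branches:
            return fallback
        tree = branches[row[feature]]
    return tree


# Made-up incident history: observed state plus the fix that helped.
incidents = [
    {"error": "OOM", "memory": "low", "parallelism": "high"},
    {"error": "OOM", "memory": "low", "parallelism": "low"},
    {"error": "TIMEOUT", "memory": "high", "parallelism": "low"},
    {"error": "TIMEOUT", "memory": "low", "parallelism": "low"},
    {"error": "NONE", "memory": "high", "parallelism": "high"},
]
fixes = ["increase memory", "increase memory", "increase parallelism",
         "increase parallelism", "no change"]

tree = build_tree(incidents, fixes, ["error", "memory", "parallelism"])
print(recommend(tree, {"error": "OOM", "memory": "high", "parallelism": "low"}))
```

With real data the hard part would not be the algorithm but collecting enough labelled incidents, i.e. pairs of "what the system looked like" and "what actually fixed it".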

Making systems machine-readable…

Nowadays, systems are designed in such a way that any error is meant to be human-readable - the log should contain the exact place of occurrence along with the full stack trace, and configuration parameters should be well documented, with examples of their use.

The system should be as easy to use and operate as possible... for a human.

However, if we wanted to use more complex ML algorithms to auto-optimize it, the system itself would need some changes to make it more easily manageable by an automated ML algorithm.

Instead of the full error message in the logs, for example, only its unique code would be sufficient.

It would be necessary to unify the way metrics and configuration parameter values are collected, so that they could later be fed as a batch into the ML algorithm.
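As a rough illustration of what such a unified, machine-readable format might look like, the sketch below packs error codes, metrics and configuration values into one record and serializes a batch as JSON Lines. The Snapshot type, its field names and the example error code are assumptions made for this example, not an existing Flink format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Snapshot:
    """One machine-readable observation of a job (all names are hypothetical)."""
    timestamp: int  # seconds since the epoch
    error_codes: List[str] = field(default_factory=list)   # codes, not messages
    metrics: Dict[str, float] = field(default_factory=dict)
    config: Dict[str, str] = field(default_factory=dict)


def to_batch(snapshots: List[Snapshot]) -> str:
    """Serialize snapshots as JSON Lines, one record per line, oldest first."""
    ordered = sorted(snapshots, key=lambda s: s.timestamp)
    return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in ordered)


batch = to_batch([
    Snapshot(2, ["E42"], {"cpu": 0.9}, {"parallelism": "4"}),
    Snapshot(1, [], {"cpu": 0.5}, {"parallelism": "4"}),
])
print(batch)
```

Keeping errors as short codes and everything else as flat key-value pairs means a training pipeline can consume the records without parsing free-form log text.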

Perhaps it would be possible for a technology vendor or the open-source community itself to add an already-trained recommender or error analyzer model. It would run as an additional operator in parallel with our application.

The system would analyze metrics on an ongoing basis, sending recommendations in the form of alerts to Slack.
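A very reduced sketch of that last step is shown below. Simple fixed limits stand in for the trained model, and the function only builds the message body. The metric names and limits are invented for the example. The only outside fact relied on is that Slack incoming webhooks accept a JSON body with a "text" field. Actually posting the payload is left out.

```python
import json
from typing import Dict, Optional


def build_alert(metrics: Dict[str, float], limits: Dict[str, float]) -> Optional[str]:
    """Return a JSON payload describing breached limits, or None if all is well."""
    breached = sorted(name for name, limit in limits.items()
                      if metrics.get(name, 0.0) > limit)
    if not breached:
        return None
    details = ", ".join(f"{name}={metrics[name]} (limit {limits[name]})"
                        for name in breached)
    return json.dumps({"text": f"Recommendation: check {details}"})


# Hypothetical metric names and limits, used only to show the call.
print(build_alert({"cpu": 0.95, "heap": 0.5}, {"cpu": 0.9, "heap": 0.8}))
```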

This would probably be very difficult, but not impossible.

Or making machines able to read systems

Let's imagine the future of such systems in 30 years or so, where advances in software engineering and ML algorithms would be at a completely different level than today.

Take, for example, a stream processing platform.

It has no input parameters, because why should it?

The allocation of memory, CPU and other resources would be done after analyzing the application code and the volume of input data, and would be continuously tuned at runtime.

We don't need to know about it.

The system would automatically scale itself to handle the input traffic and select the appropriate side technologies (e.g.: cache, storage) that can easily handle this volume of traffic.
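Even today, the simplest form of such self-scaling can be expressed as a small calculation. The sketch below derives a parallelism level from the observed input rate. The function name, the 25% headroom and the upper bound are assumptions chosen for illustration, and a real system would measure the per-task capacity rather than take it as a given.

```python
import math


def target_parallelism(records_per_second: float,
                       capacity_per_task: float,
                       headroom: float = 1.25,
                       max_parallelism: int = 128) -> int:
    """Pick enough parallel tasks to absorb the input rate with some slack."""
    if capacity_per_task <= 0:
        raise ValueError("capacity_per_task must be positive")
    needed = math.ceil(records_per_second * headroom / capacity_per_task)
    # Never go below one task or above the configured ceiling.
    return max(1, min(needed, max_parallelism))


print(target_parallelism(10_000, 1_000))
```

Called periodically with fresh metrics, a function like this gives the "continuously tuned at runtime" behaviour in its most basic form.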

For some time the system would have to run "on DEV" in order for the algorithms to select the optimal settings, but after this time the system would deploy itself on production.

There would still have to be some kind of mechanism for debugging our application, though, to eliminate human errors in the code.

Self-Created technologies

Imagine that all we have is the business requirements.

We know the specifics of the input data and we know how we want to process it and where to store the result.

In addition, we have a defined method of using this result.

Today, the engineer takes these requirements and selects the main technology themselves, e.g. Flink, along with all the side ones, e.g. cache, storage and cloud computing.

You need to create a team of experts with experience in these technologies and then create a solution and maintain it on an ongoing basis.

However, one could instead automatically generate the necessary technology based on the data and business requirements, while training ML models to act as optimizers and stability guards.

After all, you wouldn't need to take the whole of Flink, with all its features and the baggage of their stability risks.

Therefore, it would be possible to generate just the code needed to handle our data format at the given volume, run all the processing steps and save the result in the output destination. All this would come together with a system for collecting metrics as input to pre-trained ML models acting as optimizers and stability guards. It would be a bit like compiling a Linux kernel for a specific machine.

The code would contain only the necessary fragments, optimized for a specific business case.

No-code, no-job?

Who knows what the future holds, but there is already a lot of interest in no-code and low-code solutions.

There are more and more technologies, so specialized experts are getting harder and harder to come by, and they are getting more expensive. It is natural that the market does not like a void and is trying to automate those areas where human resources are lacking.

We'll see what the future brings, but it will certainly be interesting :)

Are you a Flink expert? We now have an open position that may interest you: Senior Data Engineer (Flink).

And for more predictions, interesting articles and tutorials, check out the DATA Pill newsletter - a weekly roundup of the best content about Big Data, Cloud, ML and AI. Subscribe and stay up to date on trends.

Last updated: 13 March 2023

Written by

Marcin Kacperek

Senior BigData Developer
