In this episode of the RadioData Podcast, Adam Kawa talks with Varun Bhatnagar from Swedbank. Topics include the Enterprise Analytics Platform, the evolution of MLOps at Swedbank, iterative development of ML models and more.
We encourage you to listen to the whole podcast or, if you prefer, read it here.
Host: Adam Kawa, CEO of GetInData | Part of Xebia
Since 2010, Adam has been working with Big Data at Spotify (where he proudly operated one of the largest and fastest-growing Hadoop clusters in Europe), Truecaller and as a Cloudera Training Partner. Nine years ago, he co-founded GetInData | Part of Xebia – a company that helps its customers to become data-driven and builds custom Big Data solutions. Adam is also the creator of many community initiatives such as the RadioData podcast, Big Data meetups and the DATA Pill newsletter.
Guest: Varun Bhatnagar
Varun is an MLOps and DevOps lead designer at Swedbank, located in India. He started as a consultant working for Ericsson as a Python developer. That was when he developed a passion for visualization, automation and cloud technologies. Since 2014 he’s been helping various customers with DevOps adoption and cloud migration.
He has been interested in MLOps for a few years. It started with developing ML models in Jupyter Notebook and shipping them into production, and now he’s working on developing the data analytics platform for Swedbank in the cloud.
_______________
Swedbank, Sweden's largest bank and the third-largest in the Nordic region, enhanced its market position by moving its capabilities to Azure cloud. With the Enterprise Analytics Platform (EAP), AI and ML tools are easily accessible, benefiting the entire organization. MLOps implementation shortened development cycles, reducing time-to-market.
Varun: The platform is called the Enterprise Analytics Platform and it’s set up on Azure cloud. The migration to the cloud from on-premise was finished in June 2022. The migration was done for around 50 sources and encapsulated around 50 tables, which altogether held around 95 terabytes of data. Right now we’ve got around 20 analytic models developed on the cloud, and there are more to come. We’ve got integration with various BI tools for reporting and analytics and we comply with all required security regulations.
Varun: (stone age) We have to start with the “stone age” as I like to call it. Back in 2019 we started experimenting with Machine Learning and deployed our very first model into production. It was a huge achievement. The code was written locally by a data scientist on a laptop. A lot of hard work had to be done to train and evaluate the model, because getting the data and training the model was hard. Since everything was done manually, it took a lot of time to get it up and running in production. We learned a lot from that experience. At that time we were not fully integrated with banking services. The members of the team were working alone without proper communication. We also lacked automation which slowed things down. Data source availability was also a challenge because data was copied from one local source to another, which took a lot of time and created more data inconsistencies. We realized that machine learning systems cannot be built manually. We understood that we had to have an automated process which would organize the way we managed our models in production.
(bronze age) From 2019 to 2021 we moved towards the bronze age. We were trying to standardize the development and deployment process of the ML models, and we did that by using open source technologies. We had a semi-automated process and a standardized environment for development and deployment. All of those things helped us to standardize processes and shorten the time needed to deploy a model to production. We still had some manual steps in between, which led to bottlenecks and delays. We still had data and schema skews, but far fewer of them. Data was available only in production and was not supposed to be copied to local environments. It was tricky to work around those limitations. We realized that we could improve further. Being on-premise means that you have a limited amount of resources. That was becoming a challenge, because the new use cases were becoming more and more complex and were consuming more and more resources.
(gold age) Finally we arrived at the gold age. Because of the limitations of on-premise solutions, we decided to move to the cloud; that move started in 2021 and is still ongoing. Our on-premise platform was reaching end-of-life, and we faced a choice between renewing our licenses and moving to the cloud. Those, among other factors, contributed to the decision to move to the cloud. This created a demand for the redesign and reengineering of some of our processes. Today we’re fully functional on the cloud. Teams are more collaborative and bring together people with different skill sets. We now have a completely hands-off deployment process with automated checks. We have proper segregation of development and production environments, and we have centralized data access. We keep track of metrics and logs, and we have a proper model registry.
Varun: It is actually a mix of slight improvements in every area. The collaboration between team members has improved and, at the same time, the restructuring allowed people with various skill sets to become part of the teams. The architectural changes also helped with faster iterative development. Centralized data access reduces the time spent copying data between different environments. Now the teams have a very clear goal of what they want to achieve. Because of that we can iterate on the model faster.
Varun: Yes, we see the performance increase, mainly because we improved our process overall. To implement MLOps properly, we settled on 6 components which are must-haves for our whole MLOps process. We have versioning, experiment tracking, artifact tracking, configuration and a development environment that is the same for dev, testing and prod. You need to have testing in place, as well as linting and a repository structure. You might want to have unit tests in place in order to catch as many errors as possible early. The automation process also helps. Reproducibility is improved, because you don’t have to create the whole environment from scratch. The monitoring of the model has also been improved.
Varun: We plan to add more and more capabilities to generate more business value. One of the key focus areas is to make our platform available to as many users as possible. We need to have strong training so that new users feel comfortable when they get on-board. They need to understand the way of working. We want to improve the efficiency of using the resources on the cloud. We also want to improve the automation as much as we can. Also it’s very important to stick to the best practices of development. We also try to use monitoring to a larger extent. Right now we can already detect the staleness of the model, which indicates that it needs retraining. We want to improve on that. We also have to keep up with security updates and we’re continuously working on it. We’re trying to create reusable assets for the users so that new users don't have to reinvent the wheel.
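The interview does not say how staleness is detected at Swedbank. As a minimal sketch of the general idea only, one common approach is to compare a model's recent evaluation scores with the score it had at deployment time and flag it for retraining when the average drops by more than a chosen tolerance. The function name, the parameters and the default tolerance below are all hypothetical, and the sketch assumes a metric where higher is better.

```python
from statistics import mean
from typing import Sequence


def is_model_stale(
    baseline_score: float,
    recent_scores: Sequence[float],
    tolerance: float = 0.05,
) -> bool:
    """Return True when recent performance has dropped below the baseline.

    Hypothetical helper: the names and the default tolerance are assumptions,
    and higher scores are assumed to be better (e.g. accuracy or AUC).
    """
    if not recent_scores:
        raise ValueError("recent_scores must contain at least one value")
    # Average the recent window so that one noisy evaluation
    # does not trigger retraining on its own.
    return mean(recent_scores) < baseline_score - tolerance


if __name__ == "__main__":
    # Illustrative numbers only, not figures from the interview.
    print(is_model_stale(0.90, [0.89, 0.88, 0.90]))
    print(is_model_stale(0.90, [0.82, 0.80, 0.81]))
```

In practice a check like this would run on a schedule against fresh labelled data, and a positive result would open a retraining task rather than retrain automatically.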
Varun: We’re set up on Azure, and for any of the compute workloads we use Databricks. When it comes to version control and CI/CD, we use Azure DevOps with some internal services in the bank. For orchestration we use Azure Data Factory and for sourcing we use Ab Initio. When it comes to open source technologies, we make use of Docker and Kubernetes. We also make use of MLflow.
Varun: The first thing to do is to engage people with different skills and mindsets to work in teams. We would also try to have a clear vision of what we want to achieve and define the milestones better. It’s important not to try to deploy the whole MLOps process in one go; it’s better to do it in phases. There is no single recipe for making MLOps work, so it’s good to learn about MLOps by reading books and articles, extract the most valuable information and apply it to your own use case.
Varun: We redesigned many of our processes at Swedbank while moving to the cloud. We listed the current needs of the organization and we got 3 major categories, which were:
We started by creating a team structure. We divided the teams into three parts. The infrastructure part was responsible for setting up a stable and functional platform. The application enabler part consisted of data scientists and ML engineers, who were responsible for developing the solution for model lifecycle management. The last part was the data and I/O part, which was mainly responsible for acquiring data from new sources and ensuring that the data is available, that the data access policy is applied and that data governance is in place.
To keep the scope manageable, we introduced a phased approach. The first was the solution and design phase (part one and part two). This phase was completely focused on evaluating the machine learning capabilities on Azure. It lasted 6 weeks, during which we defined the scope. In the second part of this phase we wanted to implement a less critical use case, in order to evaluate the capabilities of the AI technologies provided by the cloud and the value we are able to generate.
Next was the migration and implementation phase, where we defined our migration strategy and we had a clear plan of what parts of the solution we had to lift and shift, and what parts of the solution needed reengineering. The focus was to implement the features that we had on-premise.
The last was the enhancement phase, and we’re in it at the moment. Here we plan to improve the existing features and add more features (like monitoring).
Before finalizing any tech stack, we analyzed its capabilities to assess whether they met the needs of the platform. With this capability map, it was easier to plan a backlog.