Radio DaTa Podcast

MLOps in the Cloud at Swedbank - Enterprise Analytics Platform

In this episode of the RadioData Podcast, Adam Kawa talks with Varun Bhatnagar from Swedbank. Topics include the Enterprise Analytics Platform, the evolution of MLOps at Swedbank, iterative development of ML models and more.

We encourage you to listen to the whole podcast or, if you prefer, read it here.

Host: Adam Kawa, GetInData | Part of Xebia CEO

Since 2010, Adam has been working with Big Data at Spotify (where he proudly operated one of the largest and fastest-growing Hadoop clusters in Europe), Truecaller and as a Cloudera Training Partner. Nine years ago, he co-founded GetInData | Part of Xebia – a company that helps its customers to become data-driven and builds custom Big Data solutions. Adam is also the creator of many community initiatives such as the RadioData podcast, Big Data meetups and the DATA Pill newsletter.

Guest: Varun Bhatnagar

Varun is an MLOps and DevOps lead designer at Swedbank, located in India. He started as a consultant working for Ericsson as a Python developer. That was when he developed a passion for visualization, automation and cloud technologies. Since 2014 he’s been helping various customers with DevOps adoption and cloud migration.

He has been interested in MLOps for a few years. It started with developing ML models in Jupyter Notebook and shipping them into production, and now he’s developing the data analytics platform for Swedbank in the cloud.

_______________

Swedbank

Swedbank, Sweden's largest bank and the third-largest in the Nordic region, enhanced its market position by moving its capabilities to Azure cloud. With the Enterprise Analytics Platform (EAP), AI and ML tools are easily accessible, benefiting the entire organization. MLOps implementation shortened development cycles, reducing time-to-market.

What kind of platform are you building in Swedbank?

Varun: The platform is called the Enterprise Analytics Platform and it’s set up on Azure cloud. The migration to the cloud from on-premise was finished in June 2022. The migration was done for around 50 sources and encapsulated around 50 tables, which altogether held around 95 terabytes of data. Right now we’ve got around 20 analytic models developed on the cloud, and there are more to come. We’ve got integration with various BI tools for reporting and analytics and we comply with all required security regulations.

Can you tell us a little about how you discovered the need to move to the cloud and build an ML platform at Swedbank? What challenges did you face when implementing it?

Varun: (stone age) We have to start with the “stone age” as I like to call it. Back in 2019 we started experimenting with Machine Learning and deployed our very first model into production. It was a huge achievement. The code was written locally by a data scientist on a laptop. A lot of hard work had to be done to train and evaluate the model, because getting the data and training the model was hard. Since everything was done manually, it took a lot of time to get it up and running in production. We learned a lot from that experience. At that time we were not fully integrated with banking services. The members of the team were working alone without proper communication. We also lacked automation which slowed things down. Data source availability was also a challenge because data was copied from one local source to another, which took a lot of time and created more data inconsistencies. We realized that machine learning systems cannot be built manually. We understood that we had to have an automated process which would organize the way we managed our models in production.

(bronze age) From 2019 to 2021 we moved towards the bronze age. We were trying to standardize the development and deployment process of the ML models and we did that by using open source technologies. We had a semi-automated process and a standardized environment for development and deployment. All of those things helped us to standardize processes and shorten the time needed to deploy a model to production. We still had some manual steps in between, which led to bottlenecks and delays. We still had data and schema skews but there were far fewer of them. Data was available only in production and was not supposed to be copied to local environments. It was tricky to work around those limitations. We realized that we could improve further. Being on-premise means that you have a limited amount of resources, and this was becoming a challenge because the new use cases were becoming more and more complex and were consuming more and more resources.

(gold age) Finally we arrived at the gold age. Because of the limitations of on-premise solutions, we decided to move to the cloud. The move started in 2021 and is still ongoing. Our on-premise platform was reaching end-of-life, and we faced a choice between renewing our licenses and moving to the cloud. Those, among other factors, contributed to the decision to move to the cloud. This created a demand for the redesign and reengineering of some of our processes. Today we’re fully functional on the cloud. We have more collaborative teams, with people of different skill sets spread across multiple teams. Now we have a completely hands-off deployment process and automated checks. We have proper segregation of development and production environments and we have centralized data access. We keep track of metrics and logs and maintain a proper model registry.

Is there any game-changing technology that the team used or developed that has significantly changed the way data practitioners work at Swedbank?

Varun: It is actually a mix of slight improvements in every area. The collaboration between team members has improved and at the same time, the restructuring that happened allowed people with various skill sets to become a part of the teams. The architectural changes also helped in faster iterative development. Centralized data reduces the time of copying the data between different environments. Now the teams have a very clear goal of what they want to achieve. Because of that we can iterate on the model in a faster way.

Do you see an improvement between the models working on the cloud compared to on-premise?

Varun: Yes, we see a performance increase, mainly because we improved our process overall. For a proper implementation of MLOps, we settled on 6 components which are must-haves for our whole MLOps process. We have versioning, experiment tracking, artifact tracking, configuration and a development environment that is the same for dev, testing and prod. You need to have testing in place, as well as linting and a repository structure. You might want to have unit tests in place in order to catch as many errors as possible early. The automation process also helps. Reproducibility is improved: you don’t have to create the whole environment from scratch. The monitoring of the model has also been improved.
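As an illustration of the unit-testing point, here is a minimal sketch in Python. The feature function `debt_to_income` and its rules are hypothetical and were not mentioned in the episode; the tests are written as plain `test_` functions of the kind a runner such as pytest would collect.

```python
import math


def debt_to_income(monthly_debt: float, monthly_income: float) -> float:
    """Hypothetical feature: ratio of monthly debt to monthly income."""
    if monthly_debt < 0 or monthly_income < 0:
        raise ValueError("amounts must be non-negative")
    if monthly_income == 0:
        # No income: return NaN so a later pipeline step can impute it explicitly.
        return math.nan
    return monthly_debt / monthly_income


def test_regular_case():
    assert debt_to_income(500.0, 2000.0) == 0.25


def test_zero_income_yields_nan():
    assert math.isnan(debt_to_income(500.0, 0.0))


def test_negative_input_is_rejected():
    try:
        debt_to_income(-1.0, 2000.0)
    except ValueError:
        return  # expected: invalid input is rejected
    raise AssertionError("negative debt should raise ValueError")
```

Tests like these are cheap to run in a CI pipeline on every commit, so an edge case such as zero income is caught before a model is retrained or deployed.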

What is your focus this year and how do you want to develop your platform? What features or capabilities do you want to enable in your platform?

Varun: We plan to add more and more capabilities to generate more business value. One of the key focus areas is to make our platform available to as many users as possible. We need to have strong training so that new users feel comfortable when they get on board. They need to understand the way of working. We want to use cloud resources more efficiently. We also want to improve the automation as much as we can. It’s also very important to stick to the best practices of development. We also try to use monitoring to a larger extent. Right now we can already detect the staleness of a model, which indicates that it needs retraining. We want to improve on that. We also have to keep up with security updates and we’re continuously working on it. We’re trying to create reusable assets so that new users don’t have to reinvent the wheel.
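The episode does not say how staleness is detected at Swedbank. One common approach, sketched below under that caveat, is to compare a model's recent evaluation scores with the score it had at deployment and flag it for retraining once the relative drop exceeds a threshold. The function name, the 5% threshold and the sample scores are all illustrative assumptions.

```python
from statistics import mean


def is_stale(baseline_score: float, recent_scores: list[float],
             max_relative_drop: float = 0.05) -> bool:
    """Return True when the recent average score fell too far below the baseline.

    All names and the default threshold are illustrative assumptions.
    """
    if not recent_scores:
        raise ValueError("need at least one recent score")
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    relative_drop = (baseline_score - mean(recent_scores)) / baseline_score
    return relative_drop > max_relative_drop


# Made-up weekly scores for a model that scored 0.86 at deployment time.
weekly_scores = [0.84, 0.83, 0.80, 0.78]
if is_stale(baseline_score=0.86, recent_scores=weekly_scores):
    print("model looks stale: schedule retraining")
```

A check like this only works when ground-truth labels arrive soon enough to compute recent scores; when they do not, monitoring the input data distribution is a frequently used alternative.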

Could you describe your tech stack and the technologies you are using?

Varun: We’re set up on Azure, and for any compute workloads we use Databricks. When it comes to version control and CI/CD we use Azure DevOps with some internal services in the bank. For orchestration we use Azure Data Factory and for sourcing we use Ab Initio. When it comes to open source technologies, we make use of Docker and Kubernetes. We also make use of MLflow.

Imagine that you have to make this journey one more time. What mistakes or pitfalls would you like to avoid? What would you prioritize?

Varun: The first thing to do is to engage people with different skills and mindsets to work in teams. We would also try to have a clear vision of what we want to achieve and to define the milestones better. It’s important not to try to deploy the whole MLOps process in one go; it’s better to do it in phases. There is no single recipe for MLOps to work, so it’s good to learn about MLOps by reading books and articles, extract the most valuable information and apply it to your own use case.

Could you tell us more about your processes at Swedbank?

Varun: We redesigned many of our processes at Swedbank while moving to the cloud. We listed the current needs of the organization and we got 3 major categories, which were:

  • breaking the silos and creating more collaborative teams
  • defining processes and ways of working
  • creating a capabilities and technologies mapping, which basically means that, based on the requirements, we drew a capability map and matched each capability with a technology that fits it

We started by creating a team structure. We divided the teams into three parts: the infrastructure part, which was responsible for setting up a stable and functional platform; the application enabler part, which consisted of data scientists and ML engineers who were responsible for developing the solution for model lifecycle management; and the data and I/O part, which was mainly responsible for acquiring data from new sources and ensuring that the data is available, the data access policy is applied and data governance is in place.

To keep things manageable, we introduced a phased approach. The first was the solution and design phase (part one and part two). This phase was completely focused on evaluating the machine learning capabilities on Azure. It lasted 6 weeks, during which we defined the scope. In the second part of this phase we wanted to implement a less critical use case, in order to evaluate the capabilities of the AI technologies provided by the cloud and the value we are able to generate.

Next was the migration and implementation phase, where we defined our migration strategy and we had a clear plan of what parts of the solution we had to lift and shift, and what parts of the solution needed reengineering. The focus was to implement the features that we had on-premise.

The last was the enhancement phase, which we’re in at the moment. Here we plan to improve the existing features and add more features (like monitoring).

Before finalizing any tech stack we analyzed its capabilities to assess whether they met the needs of the platform. With this capability map, it was easier to plan a backlog.
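A capability map of this kind can be pictured as a simple lookup from each required capability to the technology chosen for it. The sketch below is an illustration only: the technology pairings come from the tech stack named earlier in the interview, while the capability labels, the unassigned "model monitoring" entry and the `backlog_items` helper are assumptions made for the example.

```python
from typing import Optional

# Required capability -> technology chosen for it.
# None marks a capability that has no technology assigned yet (assumed entry).
capability_map: dict[str, Optional[str]] = {
    "compute workloads": "Databricks",
    "version control and CI/CD": "Azure DevOps",
    "orchestration": "Azure Data Factory",
    "data sourcing": "Ab Initio",
    "model monitoring": None,
}


def backlog_items(cap_map: dict[str, Optional[str]]) -> list[str]:
    """List the capabilities that still lack a technology, in insertion order."""
    return [capability for capability, tech in cap_map.items() if tech is None]


print(backlog_items(capability_map))
```

Every capability that maps to `None` is a gap, and each gap is a natural candidate for a backlog item.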

You can listen to the whole episode here: 

Subscribe to the Radio DaTa podcast to stay up-to-date with the latest technology trends and discover the most interesting data use cases!

18 September 2023
