Tech News

10 min read

Announcing the GetInData Modern Data Platform - a self-service solution for Analytics Engineers

The GID Modern Data Platform is live now!

The Modern Data Platform (or Modern Data Stack) is on the lips of basically everyone in the data world right now. The need for a more self-service approach towards data-driven insight development has been observed in many of our clients for some time now. One of the most frequently observed painful points was the lengthy time-to-value due to complex, unstandardised data projects & pipelines requiring a lot of devops and engineering effort. Data quality issues and the lack of a consistent, holistic approach to data testing only added to this frustration. Interestingly, this was a common theme in all types of business - from FinTech start-ups to big telco enterprises. This is why we took a closer look at this topic (read our previous blog post: Modern Data Platform - the what's, why's and how's. Demystifying the buzzword) and outlined what new things the modern data stack brings to the industry and how it's even possible. We understood that there was a need to change the way the data teams were organized and data-driven insights were produced, so we compiled our long and vast industry experience and adapted this pattern quickly into a repeatable solution. We also recognized that platform usability was not fully realized with the bag of unintegrated tools available, so we supplemented the design with our own tools to build a seamless analyst’s workflow. Now we are ready to announce that we rolled up our sleeves and evolved this blurrily-defined concept into a solution - a GID Modern Data Platform developed in our R&D Labs. In this article you will learn what it is all about - both from a functional and technical perspective. As usual - be prepared for a nicely cooked data meal containing best-in-class ingredients with an open source flavor (and no black-boxed recipes!). If you’re new to the topic of the Modern Data Platform and have no clue what all the fuss is about - you can quickly get onboarded by listening to our Radio DaTa podcast recorded by Jakub Pieprzyk.

architecture-bigdata-modern-data-platform — A concept of the Modern Data Platform

The motivation to create the GID Modern Data Platform

Every market trend has a reason - it is no different with the Modern Data Stack.

More and more products are being built around data, which means that data literacy is basically a must nowadays, if you’d like to stay afloat in this rapidly changing market. The expectations in different industries for data access democratization are as high as never before. This is why there is such a big need for self-service data platforms with a low-code approach that make it possible for the users - analysts or analytics engineers to independently design, develop and maintain their data pipelines, utilizing best engineering practices. When you think about a Modern Data Stack, you probably instantly imagine a cloud data warehouse ingested by modern data integration tools (e.g. Airbyte or Fivetran), data transformation using SQL (like in dbt)and interactive reports generated in a self-service business intelligence tool (like Looker) as the icing on top of the data product cake. These tools are definitely the key building blocks for delivering every data platform, but they alone are not enough to build a seamless user experience. Some of the challenges stumbled upon by analysts include:

How to define data ingestion themselves without the help of data engineers?
How to work on their analysis without installing heavy tooling on their workstation?
How to transfer ingestion, transformation or BI code from development to production reliably?
How to make the results of data pipelines available in BI tools?
How to make a data catalog common across all these tools?
How to start with a new project. Is there a template to start with?

The data infrastructure engineers are also troubled with some of the following questions:

How do you integrate these tools with existing infrastructure?
How do you make the platform multi cloud and independent of any vendor?
How do you secure data access across the whole stack?
How do you automate infrastructure and pipeline deployment to many environments?
How do you make the platform cost efficient?

Below you will discover how we addressed these challenges with our platform design. But before we dive into the design, let’s first define the target we are aiming at.

So, what’s the deal? The MDP manifest.

Let’s start from a definition of what we feel a Modern Data Platform should offer. This is a handful of key values which we had in mind when designing our platform. We divided them into three different perspectives in order to address all the points of view that need to be taken into consideration. Some of these functionalities are an inherent part of the core products we use in our stack (e.g. cloud data warehouses), but in order to achieve others (like simplicity or standardization), we had to come up with some dedicated components.

User’s attractions

This is the most important point of view because it defines the main functional reasons behind the solution. This encapsulates the value proposition of the product.

Self-Service

Every product team should be independent across the whole span of the data project - including data ingestion, discovery, exploration, transformation, deployment and reportingand. It shouldn’t need a data or software engineer to understand their data. You can probably recall such situations when according to your gut feeling it was the time to act, but you lacked the data engineering resources to support your business decision because of the data, which was stored in sophisticated structures and required technical skills to make sense out of it. As a result you missed a certain business opportunity. It’d be great if a data platform could allow all the SQL-savvy analysts to act independently, wouldn’t it? Think of this solution as Google Maps in the times when you had to hire a guide so as not to get lost in an unknown city. Now you can be your own guide …in the data world and make data-driven decisions.

Simplicity

This comes down to an easy-to-use user interface supported by a web based workbench app for communication with the platform and low-code, SQL-based data transformations for business logic definition, so that it can be easily understood by product teams. There is no need to depend on data engineers and their heavy and sophisticated toolset and waste a lot of energy on repetitive translations from business to technical language. Just sign in to your workbench - delivering value from your data will be within your reach.

Integrated Data Quality & Observability

Data quality should be guaranteed as early in the pipeline as possible via integrated and automated data testing. This fosters trust in data and reduces the risk of making business decisions based on faulty information. This is also possible thanks to data monitoring and alerting on discrepancies in a timely manner. Provisioning of a data catalog with data lineage helps the user to understand data provenance. Forget about the times when you had to double check every figure in order to make sure that it’s reliable and that the data that you base your decisions upon is fresh and consistent. Thanks to the mechanisms that enforce data quality control, you’re informed when your data deviates from the expectations. Data lineage on the other hand, supports the troubleshooting process to diagnose where the data faults come from.

Integrated Data Discovery

A data catalog is a self-service single source of truth on the data and associated metadata, being an incentive for collaboration and maintaining good data management practices. It was definitely quite frustrating when the definition of your data assets were spread all over your organization, incomplete or outdated. People often used to look for support on Slack or other communication channels. Now, with the emergence of modern data catalogs, you can get a “Google search” for your data with some useful context and other metadata insights on top.

Standardization

Keeping everything intact and ready for scaling includes templating similar projects in standardized structures, best practices from the market regarding project structure and well-thought naming convention. Having one framework for the data project templates helps to organize the data assets in your company. This addresses the common problem of a long and steep learning curve that a new person undertakes when being onboarded to the team and prevents having to reinvent the wheel every time a new data product is about to be created. This also makes time-to-market significantly shorter.

Data architecture - safety & flexibility first

Finally, all the most crucial data architecture decisions and assumptions are gathered in one place.

Extensibility & Flexibility

The solution is deployable in any environment (local, dev or production), any popular public cloud (GCP, AWS, Azure) and in any storage facility, while the analyst workflow in any of those configurations stays the same. This highly adaptable and configurable framework is achieved mostly thanks to the consistent abstraction and logical architecture of the whole solution. Unlike on most of the proprietary enterprise platforms, the majority of the components are interchangeable and extensible thanks to a strong open source focus.

Data Security

Consistent (defined at the source) and easily maintainable data & metadata access control policies are common for all the data clients in the stack.

Data Mesh Readiness

Support for organizations with data teams that would like to incorporate all the data mesh principles (such as data as a product, domain oriented data ownership & architecture, self-serve data infrastructure as a platform and federated decentralized data governance). This readiness is granted automatically when all previously mentioned features in the platform are provided.

Deployment & maintenance - automation is the key

This section covers all the assumptions on what it takes to provision and maintain the underlying infrastructure and core components in order to have a good understanding of the associated complexity and costs.

Easy to set up & maintain

No lengthy experimentation and development. Stack is provisioned by Infrastructure as a Code (IaaC), which is also used for all the necessary upgrades. The setup also addresses all the scalability issues with the support of native cloud configurations.

Deployment automation

Pipeline automation via CI/CD with all the engineering best practices in place that hides deployment complexity from the analytics engineers and other users.

Pay-as-you-go

No up-front payments and costly licensing fees, you only pay for what you use.

Adaptability

Most of the components are interchangeable, hence there is negligible risk of getting locked in with a certain solution.

Conclusion

Unblocking the potential of data is something that we’ve always been passionate about. We strongly believe that a well designed Modern Data Platform is a powerful weapon for product teams to make informed data-driven decisions. We put our almost decade-long data industry expertise & experience and our client stories into one pot and seasoned it with our passion to deliver value. We already know that it tasted really good for some of our clients (subscribe to our newsletter to stay tuned). If you want to learn more about our ingredients and how we serve our data meal to our customers, check out the ‘GetInData Modern Data Platform - features & tools’! and if you are interested in the Modern Data Platform for your company, sign up for a full live demo here.

big data

google cloud platform

Airflow

dbt

Modern Data Platform

Last updated: 13 September 2022

Written by

Michał Rudko

Big Data Analyst / Architect

Like this post?
Spread the word

Want more? Check our articles

Success Stories

Customer Story: Platform focused on centralizing data sources and democratization of data with ING

The client who needs Data Analytics Platform ING is a global bank with a European base, serving large corporations, multinationals and financial…

finding your way llm getindataobszar roboczy 1 4

Tutorial

Finding your way through the Large Language Models Hype

With the introduction of ChatGPT, Large Language Models (LLMs) have become without doubt the hottest topic in AI and it doesn’t seem that this is…

Use-cases/Project

Enabling Hive on Spark on CDH 5.14 — a few problems (and solutions)

Recently I’ve had an opportunity to configure CDH 5.14 Hadoop cluster of one of GetInData’s customers to make it possible to use Hive on Spark…

Tutorial

Automated Machine Learning (AutoML) with BigQuery ML. Start Machine Learning easily and validate if ML is worth investing in or not.

Machine learning is becoming increasingly popular in many industries, from finance to marketing to healthcare. But let's face it, that doesn't mean ML…

Use-cases/Project

Real-time Machine Learning: considerations based on Fraud Detection use case

When it comes to machine learning, most products are designed to work in batches, meaning they process data at fixed intervals rather than in real…