Tutorial

GCP Workflows - how can you integrate a lightweight managed orchestrator into the Modern Data Stack?


The Modern Data Stack has been around for some time already. Both the tools and the integration patterns have become more mature and battle-tested. We shared our solution blueprints a couple of months ago: GetInData Modern Data Platform - features & tools. Since then we’ve received plenty of questions about our platform and the proposed technologies - one of the most common ones was: why don’t we start quick & small and go for a lightweight managed orchestrator? This was a clear indication that we needed to be prepared with some alternatives to Apache Airflow and Cloud Composer. This is how we incorporated GCP Workflows as a fully-fledged element of our stack. Below you'll learn about its typical use cases and how we approached integrating GCP Workflows with the other components.

Part 1: Introduction

Before we deep dive into data pipelines and different ways they can be orchestrated, let’s start with some fundamentals. 

What is dbt?

dbt (Data Build Tool) is an open-source command-line tool that enables data analysts and engineers to transform, test and document their data pipelines.

dbt allows users to build and deploy data models, transformations and tests in a streamlined and repeatable manner.

dbt is a very attractive tool because of its modularity and version-controlled approach to data transformation and modeling. It also provides automated testing, documentation and the flexibility to work with a variety of databases and data warehouses.

dbt is free and open-source, meaning that users can use, modify and distribute the software without incurring additional costs. This makes dbt an accessible and cost-effective option for companies of all sizes, especially smaller startups or organizations with limited budgets. 

Orchestration with Apache Airflow

The definition of data transformations and their dependencies is definitely a core functionality of a data pipeline; however, we would not be able to say that we have a fully automated stack without proper orchestration. Initially, we integrated our solution with Apache Airflow, one of the most commonly used tools on the market.

As a part of this integration, at GetInData we developed the dbt-airflow-factory package, which combines dbt artifacts with Airflow by transforming the dbt-generated manifest file on-the-fly into a Directed Acyclic Graph (DAG) with a manageable graphical representation of the data pipeline. When shifting our attention to GCP Workflows, it became obvious that we would need to come up with a similar automation, and this is where the idea of dbt-workflows-factory originated.
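
To give a feel for what this transformation involves, below is a minimal sketch - not the actual dbt-airflow-factory or dbt-workflows-factory code - that reads a dbt manifest.json and extracts the model-level dependency graph that such a DAG is built from:

import json

def model_dependencies(manifest_path):
    # Load the manifest produced by dbt (target/manifest.json).
    with open(manifest_path) as f:
        manifest = json.load(f)

    nodes = manifest["nodes"]
    graph = {}
    for node_id, node in nodes.items():
        # Keep only models; tests, seeds etc. are skipped in this sketch.
        if node["resource_type"] != "model":
            continue
        # Upstream dependencies that are themselves models.
        graph[node_id] = [
            dep for dep in node["depends_on"]["nodes"]
            if dep in nodes and nodes[dep]["resource_type"] == "model"
        ]
    return graph

# Each entry of the resulting graph can then be mapped to one orchestrator task,
# e.g. an Airflow operator or a GCP Batch job triggered from a Workflows step.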

What are GCP Workflows & Cloud Batch?

GCP Workflows is a managed workflow orchestration service. Provided by GCP, it allows the user to automate, manage and analyze complex workflows across multiple services provided by Google. 

Workflows' primary use case is automating data processing pipelines, in which data is ingested from various sources, transformed and exported to other systems or applications.

GCP Workflows is a powerful platform that offers ease of use, seamless integration with other GCP components and cost-effectiveness. With a simple and intuitive interface, users can quickly build and execute complex workflows across GCP services. GCP Workflows charges based on a pay-as-you-go model, with a free tier and pricing plans that include discounts for sustained usage. Overall, GCP Workflows is a versatile and cost-effective platform that can help organizations streamline their workflows and save time and money.

GCP Batch is a managed batch processing service, also provided by Google, that enables users to process large volumes of data using distributed computing techniques. It doesn’t require the user to manage the underlying infrastructure.

It was designed to work seamlessly with other Google Cloud services and allows users to define their batch jobs using Docker containers. It also provides features for job scheduling, monitoring and management.
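
For illustration, this is roughly how a single containerized job can be submitted with the google-cloud-batch Python client. This is a minimal sketch following Google's documented client library, with placeholder values for the project, region, image and machine type:

from google.cloud import batch_v1

def submit_container_job(project_id, region, job_name):
    client = batch_v1.BatchServiceClient()

    # The container to run - for a dbt pipeline this would be the dbt image.
    runnable = batch_v1.Runnable()
    runnable.container = batch_v1.Runnable.Container()
    runnable.container.image_uri = "gcr.io/google-containers/busybox"
    runnable.container.entrypoint = "/bin/sh"
    runnable.container.commands = ["-c", "echo Hello from GCP Batch"]

    task = batch_v1.TaskSpec()
    task.runnables = [runnable]
    task.max_retry_count = 1
    task.max_run_duration = "3600s"

    group = batch_v1.TaskGroup()
    group.task_count = 1
    group.task_spec = task

    # Let Batch provision the VMs - no infrastructure to manage ourselves.
    allocation_policy = batch_v1.AllocationPolicy()
    instances = batch_v1.AllocationPolicy.InstancePolicyOrTemplate()
    instances.policy = batch_v1.AllocationPolicy.InstancePolicy(machine_type="e2-standard-2")
    allocation_policy.instances = [instances]

    job = batch_v1.Job()
    job.task_groups = [group]
    job.allocation_policy = allocation_policy
    job.logs_policy = batch_v1.LogsPolicy()
    job.logs_policy.destination = batch_v1.LogsPolicy.Destination.CLOUD_LOGGING

    request = batch_v1.CreateJobRequest(
        parent=f"projects/{project_id}/locations/{region}",
        job_id=job_name,
        job=job,
    )
    return client.create_job(request)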

GCP Batch also charges based on a pay-as-you-go model, with pricing determined by the number of virtual machine instances used and the duration of their use. There is also a free tier available.

When comparing GCP Batch to Apache Airflow, it's important to note that Airflow is not a serverless platform and requires infrastructure management. 

However, Airflow does offer more flexibility and control over your workflows, as well as a more mature ecosystem of plugins and integrations. Additionally, Airflow is free and open-source, although you will need to manage the infrastructure it runs on yourself.

Running GCP Batch jobs orchestrated from GCP Workflows offers multiple benefits, although it may not always be straightforward to set up. The process is described below, but first let's focus on the pros it provides:

  1. scalability - Batch jobs are great at processing large volumes of data quickly and efficiently, while Workflows easily manages the overall process.
  2. automation - GCP Workflows lets users automate the process from A to Z: triggering jobs, monitoring progress and handling any errors.
  3. flexibility - Workflows can orchestrate batch jobs that utilize various GCP services, which gives users plenty of room for exploration.

This is why we would now like to introduce you to the dbt-workflows-factory - a Python library that allows users to integrate dbt tasks with GCP Workflows and to orchestrate and run them with GCP tools.

Part 2: Running the flow

How can you run a dbt pipeline on Cloud Batch using GCP Workflows?

To run the dbt workflow on Cloud Batch, users can create a simple GCP Workflow that will trigger all of the jobs automatically. 

To do so, you only need to create a simple .yaml file.

You can create such a .yaml using the new dbt-workflows-factory.

Creating configuration

How can you use the dbt-workflows-factory from CLI?

To run the converter from the CLI, you simply need to call the library's command-line entry point:

(screenshot: the dbt-workflows-factory CLI invocation)

Where you specify the arguments as follows:

  1. image_uri: the URL of the Docker image to run
  2. region: the GCP location where the workflow executes (example: us-central1 or europe-west1)
  3. full_command: the full command executed in the image (example: "dbt --no-write-json run --target env_execution --project-dir /dbt --profiles-dir /root/.dbt --select ")
  4. remote_path: the GCS mount path (example: "/mnt/disks/var")
  5. key_volume_mount_path: the path where the volume containing the key is mounted (example: /mnt/disks/var/keyfile_name.json)
  6. key_volume_path: the volume mount definition (example: ["/mnt/disks/var/:/mnt/disks/var/:rw"])
  7. key_path: the remote path of the bucket containing the key to be mounted

How can you use the dbt-workflows-factory from Python?

Specify the parameters and run the converter to create a workflow.yaml file from manifest.json:

from dbt_workflows_converter import DbtWorkflowsConverter, Params

# Converter parameters: the dbt image, GCP region, command and volume/key settings.
params = Params(
    image_uri="my_image_url", region="us-central1", full_command="dbt run",
    remote_path="/mnt/disks/var", key_volume_mount_path="/mnt/disks/var/keyfile_name.json",
    key_volume_path="/mnt/disks/var/:/mnt/disks/var/:rw", key_path="bucketname",
)

converter = DbtWorkflowsConverter(params)
converter.convert()  # writes the generated workflow.yaml to disk

How can you run the configuration from GCP?

When you have your .yaml ready, and the secrets are in the correct bucket, go to GCP Workflows.

You have two options:

  1. Run the job from the GCP page

Click +, set up the region and then paste your .yaml file. 

  2. Run from the GCP CLI

Log in to GCP:

gcloud auth login
gcloud config set project [YOUR_PROJECT_ID]

Then deploy and run the workflow using the .yaml:

gcloud workflows deploy [WORKFLOW_NAME] --source=[WORKFLOW_FILE]
gcloud workflows run [WORKFLOW_NAME]

You can inspect a specific execution with:

gcloud workflows executions describe [EXECUTION_ID] --workflow=[WORKFLOW_NAME]
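
Alternatively, if you prefer to trigger executions programmatically (for example from another service or a CI job), the google-cloud-workflows client library can be used. A minimal sketch, assuming the workflow has already been deployed:

from google.cloud.workflows import executions_v1

def trigger_workflow(project_id, region, workflow_name):
    client = executions_v1.ExecutionsClient()
    # Fully qualified name of the deployed workflow.
    parent = f"projects/{project_id}/locations/{region}/workflows/{workflow_name}"
    execution = client.create_execution(request={"parent": parent})
    # The returned execution name can later be polled with get_execution().
    return execution.name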

Whichever way you run it, you should then see your flow in GCP:

(screenshot: the workflow execution graph in GCP Workflows)

The separate jobs should also be visible in GCP Batch:

(screenshot: the corresponding jobs in GCP Batch)

And that's it!

Part 3: About the library

About

The library streamlines the process of converting dbt tasks into GCP Workflows, making it easier for developers to manage data pipelines on GCP. 

This library is available on GitHub at https://github.com/getindata/dbt-workflows-converter.

The first version of the library is currently only capable of processing the run and model dbt tasks. However, future development plans include parsing more complicated tasks, making it easier to automate even more complex data pipelines.

When to use GCP Workflows instead of Airflow

When deciding between GCP Workflows and Airflow for workflow orchestration, it's important to consider the unique strengths and weaknesses of each platform. 

GCP Workflows is a fully managed, serverless platform that is ideal for running workflows that require integration with other GCP services. It is particularly well-suited for building data pipelines that process data stored in GCP storage services, such as BigQuery or Cloud Storage.

On the other hand, Airflow is a powerful open-source platform that provides a wider range of customization options and supports a larger number of third-party integrations. 

Airflow is an excellent choice for complex workflows that require extensive customization and configuration. Airflow can also run on-premises or in any cloud environment, making it a more flexible option than GCP Workflows.

In summary, if your workflow primarily involves GCP services and requires integration with other GCP tools, GCP Workflows is likely the best choice.

However, if you require more flexibility, control and third-party integrations, Airflow may be the better option. 

Ultimately, the choice between these two platforms will depend on the specific needs of your project, so it's important to evaluate both options carefully before making a decision.

Plans for future development

In the future, the library will be integrated as a plugin for the DP framework, making it more easily accessible for developers. 

Additionally, our Modern Data Platform Framework will be extended to include the possibility of deploying on GCP Workflows, further streamlining the process of automating data pipelines.

The library will also integrate the scheduling of jobs in GCP Workflows via Cloud Scheduler, making it possible to run data pipelines automatically at regular intervals.

Finally, the library will extend the configurability of Workflows jobs, making it possible to fine-tune data pipelines to fit specific needs.

These and more future developments will make it even easier for developers to manage data pipelines on GCP, simplifying and streamlining the data engineering process.

Conclusion

We look forward to seeing the impact that this new package will have and to continuing our mission of providing innovative solutions to the challenges faced by our community.

We would like to take this opportunity to encourage everyone to contribute to the further development of this package on GitHub. Your feedback and suggestions are invaluable to us and we welcome any contributions you may have.

If you would like to learn more about the Modern Data Platform or have any questions or comments, please do not hesitate to contact us. You can also schedule a FREE CONSULTATION with our specialist. We look forward to hearing from you and working together to continue improving our tools and resources.

Airflow
dbt
GCP
Modern Data Platform
GCP Workflows
modern data stack
Apache Airflow
4 July 2023

