As organizations increasingly adopt cloud-native technologies like Kubernetes, managing costs becomes a growing concern. With multiple teams sharing the same cluster, it can be challenging to track and manage costs accurately and provide visibility into team-level expenses.
In this blog post, we'll explore how Kubecost, a widely used open-source cost monitoring and optimization tool, can help show the cost differences between two example teams, each owning and running their own Airflow DAG with varying execution times, in a shared Kubernetes cluster. We'll dive into the details of Kubecost, its features, and how it can help you optimize costs, improve resource utilization and make data-driven decisions.
Our scenario involves two teams, Team_1 and Team_2, each with their own namespace in a shared Kubernetes cluster. Both teams use a shared Airflow instance, an open-source workflow management platform, to manage their workflows and trigger Airbyte sync jobs. Each team has its own DAG (Directed Acyclic Graph) that runs for a different amount of time in its namespace. Team_1's DAG runs cumulatively for a little over 7 hours per day, while Team_2's DAG runs cumulatively for almost 2.5 hours daily.
We take some shortcuts to mimic the Airbyte sync job in each team (which usually takes some time, depending on the amount of data). As you can see in the code block below, the Kubernetes Airflow job pod is deployed into the namespace held by each team (team1 and team2, respectively). The pod itself does a chained job: it triggers a simple Airbyte job to sync data between Sample Data Faker and an S3 bucket, and then runs the stress command as a stand-in for an Airflow transformation stage. The job is triggered on a cron schedule every 20 minutes for Team_1 and every 30 minutes for Team_2; with the stress timeouts of 360 and 160 seconds shown below, that adds up to roughly 7.2 hours and about 2.1 hours of transformation time per day respectively, plus the sync time. We will be gathering data for three days.
from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator
from airflow.operators.empty import EmptyOperator
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
import pendulum
from kubernetes import client as k8s

# Pod override placing the Airbyte trigger task in team1's namespace
executor_config_airbyte1 = {
    "pod_override": k8s.V1Pod(
        metadata=k8s.V1ObjectMeta(
            namespace="team1"
        ),
    )
}

with DAG(dag_id="airbyte_airflow_dag_team1",
         default_args={"owner": "airflow"},
         schedule_interval='*/20 * * * *',
         catchup=False,
         start_date=pendulum.today("UTC").add(days=-1)
         ) as dag:

    # Trigger the Airbyte sync (Sample Data Faker -> S3) and wait for it to finish
    trigger_airbyte_sync_1 = AirbyteTriggerSyncOperator(
        task_id="airbyte_trigger_sync_1",
        airbyte_conn_id="airbyte_1",
        connection_id="58410801-eee4-451e-8a8e-8a9af08f2e75",
        asynchronous=False,
        executor_config=executor_config_airbyte1,
    )

    # Stand-in for the transformation stage: a bare pod running the stress tool
    transformation_1 = KubernetesPodOperator(
        namespace="team1",
        image="progrium/stress",
        name="airflow-airbyte-transformation-1",
        task_id="transformation_1",
        is_delete_operator_pod=True,
        cmds=["stress"],
        arguments=["--cpu", "3", "--io", "3", "--vm", "2", "--vm-bytes", "1024M", "--timeout", "360s"],
        get_logs=True,
        executor_config=executor_config_airbyte1,
    )

    end_task = EmptyOperator(task_id="end")

    trigger_airbyte_sync_1 >> transformation_1 >> end_task
Team_2's DAG is a copy of the one above, differing only in the selected namespace and the stress command arguments:

arguments=["--cpu", "1", "--io", "2", "--vm", "1", "--vm-bytes", "128M", "--timeout", "160s"]
Both DAG files are committed to the Git repository, enabling Airflow to synchronize the repository contents and load the DAGs, making them available for users to view, manage and trigger within the Airflow interface. Helm was used to install Airflow:
helm repo add apache-airflow https://airflow.apache.org
helm install airflow apache-airflow/airflow \
  --namespace airflow --values=airflow-values.yaml \
  --create-namespace=true
The values file is the default one, with this minor change:
dags:
  gitSync:
    enabled: true
    credentialsSecret: gitlab-credentials-secret
    # git repo clone url
    # ssh example: git@github.com:apache/airflow.git
    # https example: https://github.com/apache/airflow.git
    repo: https://gitlab.com/getindata/devops/dags.git
    branch: main
    rev: HEAD
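The credentialsSecret referenced above has to exist in the airflow namespace before git-sync can authenticate. A minimal sketch of creating it, assuming the GIT_SYNC_USERNAME / GIT_SYNC_PASSWORD keys the chart's git-sync sidecar expects (a GitLab access token works as the password):

kubectl create secret generic gitlab-credentials-secret \
  --namespace airflow \
  --from-literal=GIT_SYNC_USERNAME=<gitlab-username> \
  --from-literal=GIT_SYNC_PASSWORD=<gitlab-access-token>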
Once done, set up a port forward to the airflow-webserver pod inside Kubernetes to reach the Airflow UI and confirm that the DAG sync has completed and that you can navigate into the DAG details.
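For example (the service name below assumes the default airflow-webserver service created by our airflow release):

kubectl port-forward svc/airflow-webserver 8080:8080 --namespace airflow

The UI is then available at http://localhost:8080.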
To track costs and resource utilization, we've set up Kubecost, which provides detailed cost allocation and usage metrics, allowing us to drill down into costs by namespace, deployment, and even individual pods.
We deploy Kubecost to our Kubernetes cluster using Helm:
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost -n kubecost kubecost/cost-analyzer \
  --values=kubecost-values.yaml --create-namespace=true
There is nothing fancy in the values file: just Grafana disabled, as it's not currently needed.
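For reference, a minimal sketch of what kubecost-values.yaml can look like, assuming the cost-analyzer chart's global.grafana flags (check the chart documentation for your version):

global:
  grafana:
    enabled: false   # don't deploy the bundled Grafana
    proxy: false

The Kubecost UI can then be reached with a port forward as well, e.g. kubectl port-forward --namespace kubecost deployment/kubecost-cost-analyzer 9090, after which the dashboard is available at http://localhost:9090.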
Airbyte follows a similar path: a clean install, but into two different namespaces:
helm repo add airbyte https://airbytehq.github.io/helm-charts
helm install airbyte airbyte/airbyte --version 0.64.81 -n team1
helm install airbyte airbyte/airbyte --version 0.64.81 -n team2
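One more wiring step is implied by the DAGs above: the airbyte_conn_id used by AirbyteTriggerSyncOperator must point at each team's Airbyte server. A sketch using the Airflow CLI, where the host and port are assumptions based on the Airbyte chart's default airbyte-airbyte-server-svc service; adjust them to your installation:

airflow connections add airbyte_1 \
  --conn-type airbyte \
  --conn-host airbyte-airbyte-server-svc.team1.svc.cluster.local \
  --conn-port 8001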
Kubecost is a comprehensive cost monitoring and optimization platform designed specifically for Kubernetes environments. It provides a wide range of features to help users break down costs by namespace, deployment, service and more, across any major cloud provider or on-prem Kubernetes environment. It can also help to optimize Kubernetes resource utilization, reduce costs and improve overall efficiency.
With Kubecost up and running, we can now see detailed cost information for each team. The overview dashboard provides a clear breakdown of costs by namespace, allowing us to easily identify which team is incurring the most expenses. We can also take a quick look at the Cluster Details page:
By navigating to Monitor > Allocations and filtering the data we are interested in by namespace and a three-day window, we can quickly see Team_1's and Team_2's total costs with a breakdown by deployment:
To drill down into, in this case, a group of resources under the rather suspicious name Uncontrolled workloads, we just need to click on its name to see that hiding behind it are simply the airflow transformation pods, i.e. the step where we run the stress command. These pods are grouped under the "uncontrolled" label because they are bare pods launched by KubernetesPodOperator, with no Kubernetes controller (such as a Deployment or Job) owning them.
Now, back to analysis.
As expected, Team_1's longer and more frequently running DAG results in higher overall costs. Because the cluster used a single-node configuration for this exercise, I've decided to skip the shared and idle costs, encapsulated and presented within the __idle__ section of the cost breakdown, in order to highlight the differences.
Now we can get a precise cost difference between the two teams, with both namespaces combined under a single filter and aggregated by namespace, as presented below.
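The same numbers can also be pulled programmatically through Kubecost's Allocation API, which is handy for automated reports; a sketch against the port-forwarded service (parameter names have changed between Kubecost versions, so verify against your release's API docs):

curl -G http://localhost:9090/model/allocation \
  --data-urlencode 'window=3d' \
  --data-urlencode 'aggregate=namespace' \
  --data-urlencode 'filterNamespaces=team1,team2'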
By analyzing the cost drivers, we can see that the Airflow transformation jobs are the primary contributors to both Team_1's and Team_2's costs. This information can be used to optimize the workflows, reduce costs and improve overall efficiency.
We can also set up a budget for each team / namespace by navigating to Govern > Budgets.
This helps keep track of current consumption versus what we've agreed on. Notifications can also be configured to inform users when a certain threshold is reached, posting a message to email, Slack or MS Teams.
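Budgets and notifications can also be declared in the Helm values rather than the UI. A sketch based on Kubecost's alert configuration, with the webhook URL and threshold as placeholder assumptions; double-check the schema against the Kubecost docs for your version:

notifications:
  alertConfigs:
    globalSlackWebhookUrl: https://hooks.slack.com/services/<your-webhook>
    alerts:
      # Alert when team1's daily spend exceeds the agreed budget
      - type: budget
        threshold: 10        # USD per window
        window: 1d
        aggregation: namespace
        filter: team1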
Kubecost provides valuable insights into cost differences between teams in a shared Kubernetes cluster. By tracking costs and resource utilization, teams can identify areas for optimization and make data-driven decisions to reduce costs. In this scenario, we've seen how Kubecost helped us uncover cost differences between two teams running Airflow DAGs with varying execution times, and set up budgets for each team's namespace. Whether you're a developer, DevOps engineer or financial analyst, Kubecost is an essential tool for anyone looking to manage costs in a Kubernetes environment.
By using Kubecost, teams can:
- break down costs by namespace, deployment, service and even individual pods;
- identify the primary cost drivers in their workflows;
- set budgets per team or namespace and get notified when thresholds are reached;
- make data-driven decisions to improve resource utilization and reduce costs.