Data pipeline evolution at LinkedIn in a few pictures

Data Pipeline Evolution

The LinkedIn Engineering blog is a great source of technical posts about building and using large-scale data pipelines with Kafka and its “ecosystem” of tools. In this post, I collect several pictures and diagrams (along with quotes) that summarise how the data pipeline at LinkedIn has evolved over the years. The content is based on LinkedIn’s articles and presentations, which openly describe the pros and cons of their data infrastructure (thanks, LinkedIn, for sharing!).

Problem Definition

“We had dozens of data systems and data repositories. Connecting all of these would have lead to building custom piping between each pair of systems something like this:”

datapipeline-complex
What every software engineer should know about real-time data’s unifying abstraction by Jay Kreps (https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)

Idealistic Vision

“Instead, we needed something generic like this:”

data-pipeline-simple
What every software engineer should know about real-time data’s unifying abstraction by Jay Kreps (https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)
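
The difference between the two pictures can be put into rough numbers. The sketch below is a back-of-the-envelope count, assuming that every system has to exchange data with every other system in both directions; the function names are made up for this illustration and do not come from LinkedIn's articles.

```python
def point_to_point_pipelines(num_systems: int) -> int:
    """Directed pipelines needed when every system feeds every other one."""
    return num_systems * (num_systems - 1)


def central_log_pipelines(num_systems: int) -> int:
    """Pipelines needed with a central log: one in and one out per system."""
    return 2 * num_systems


# Compare both designs for a growing number of systems.
for n in (5, 12, 24):
    print(n, point_to_point_pipelines(n), central_log_pipelines(n))
```

Under this assumption, 24 systems need 552 custom pipelines in the first picture and 48 in the second, which is why the gap widens so quickly once there are "dozens" of systems.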

Kafka Does All The Magic

“Kafka became a universal pipeline (…) It enabled near real-time access to any data source, empowered our Hadoop jobs, allowed us to build real-time analytics, vastly improved our site monitoring and alerting capability, and enabled us to visualize and track our call graphs.”

kafka-broker
A Brief History of Scaling LinkedIn (https://engineering.linkedin.com/architecture/brief-history-scaling-linkedin)
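
One reason a single pipeline can serve Hadoop jobs, real-time analytics and monitoring at once is that each consumer reads the same log at its own pace. The following is a toy, in-memory sketch of that idea; `ToyLog` and the consumer names are hypothetical and this is not Kafka's client API.

```python
from collections import defaultdict


class ToyLog:
    """An in-memory, append-only log; a toy stand-in, not Kafka's API."""

    def __init__(self):
        self._records = []
        self._offsets = defaultdict(int)  # next offset to read, per consumer

    def append(self, record):
        """Add a record to the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return the next unread records for one consumer and advance its offset."""
        start = self._offsets[consumer]
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = ToyLog()
for event in ("page_view", "profile_update", "search"):
    log.append(event)

realtime = log.poll("monitoring", max_records=1)  # a low-latency reader, one record at a time
batch = log.poll("hadoop_loader")                 # a batch reader, everything available so far
```

Here the producers write each event once, while the `monitoring` and `hadoop_loader` consumers progress independently because each one only tracks its own offset.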

Loading To Hadoop Looks Simple

As simple as the picture below:

kafka-to-hadoop-simple

Deployment Reality (loading to Hadoop only)

“The figure shows the complexity of the data pipelines. Some of the solutions like our Kafka-etl (Camus), Oracle-etl (Lumos) and Databus-etl pipelines were more generic and could carry different kinds of datasets, others like our Salesforce pipeline were very specific. At one point, we were running more than 15 types of data ingestion pipelines and we were struggling to keep them all functioning at the same level of data quality, features and operability.”

gooblin-complex-getindata
Gobblin’ Big Data With Ease by Shirshanka Das and Lin Qiao (https://engineering.linkedin.com/data-ingestion/gobblin-big-data-ease)

Deployment Reality (loading to and from Hadoop)

A quote from Jay Kreps, written several years ago:

“Note that data often flows in both directions, as many systems (databases, Hadoop) are both sources and destinations for data transfer. This meant we would end up building two pipelines per system: one to get data in and one to get data out.”

It seems that the insight quoted above did not save LinkedIn from building a complex system like the one below:

gobblin-big-data-with-ease-qconsf
Gobblin’ Big Data With Ease @ QConSF 2014 by Lin Qiao (http://www.slideshare.net/LinQiao1/gobblin-big-data-with-ease)

Deployment Reconsidered

“Late last year (2013), we took stock of the situation and tried to categorize the diversity of our integrations a little better. (…) We also realized there were some common patterns and requirements. (…) We’ve brought these demands together to form the basis for our uber-ingestion framework Gobblin. As the figure below shows, Gobblin is targeted at “gobbling in” all of LinkedIn’s internal and external datasets through a single framework.”

gooblin-simple
Gobblin’ Big Data With Ease by Shirshanka Das and Lin Qiao (https://engineering.linkedin.com/data-ingestion/gobblin-big-data-ease)
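
The idea of one framework with common requirements handled in a single place can be sketched as follows. This is a minimal illustration, not Gobblin's actual design or API: `IngestionFramework`, the two example sources and the "skip empty records" rule are all assumptions made up for this example.

```python
from typing import Callable, Dict, Iterable, List

# A source is any callable that yields records as dictionaries.
Source = Callable[[], Iterable[dict]]


class IngestionFramework:
    """Hypothetical single entry point for many data sources."""

    def __init__(self):
        self._sources: Dict[str, Source] = {}

    def register(self, name: str, source: Source) -> None:
        """Plug in a new source without writing a new pipeline."""
        self._sources[name] = source

    def run(self, name: str, sink: List[dict]) -> int:
        """Pull all records from one source, apply shared checks, write to the sink."""
        written = 0
        for record in self._sources[name]():
            if not record:  # shared data-quality rule (assumed): skip empty records
                continue
            sink.append({"source": name, **record})
            written += 1
        return written


def salesforce_source():
    yield {"account": "acme"}
    yield {}  # an empty record that the shared check filters out


def kafka_source():
    yield {"event": "page_view"}


framework = IngestionFramework()
framework.register("salesforce", salesforce_source)
framework.register("kafka", kafka_source)

hdfs = []  # stand-in for the Hadoop sink
counts = {name: framework.run(name, hdfs) for name in ("salesforce", "kafka")}
```

The point of the sketch is that the quality check and the write path live in `run` once, so adding a source means registering one more callable rather than operating one more type of pipeline.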

Reconsidered Idea

“Our motivations for building Gobblin stemmed from our operational challenges in building and maintaining disparate pipelines for different data sources across batch and streaming ecosystems. (…) Our first target sink was Hadoop’s ubiquitous HDFS storage system and that has been our focus for most of last year. (…) At LinkedIn, Gobblin is currently integrated with more than a dozen data sources including Salesforce, Google Analytics, Amazon S3, Oracle, LinkedIn Espresso, MySQL, SQL Server, SFTP, Apache Kafka, patent and publication sources, CommonCrawl, etc.”

gobblin-ingest-ecosystem
Bridging Batch and Streaming Data Ingestion with Gobblin by Shirshanka Das (https://engineering.linkedin.com/big-data/bridging-batch-and-streaming-data-ingestion-gobblin)

Sooner or later, Gobblin will also be integrated with sinks other than Hadoop, such as real-time stream processing frameworks (e.g. Samza, Storm, Flink Streaming, Spark Streaming).

“The ideal deployment scenario is where we can deploy Gobblin in continuous ingestion mode. (…) This will bring further latency reductions in our ingestion from streaming sources, enable resource utilization efficiencies and allow us to integrate with streaming sinks seamlessly.”

Coming Full Circle?

Even though Gobblin is probably the most recent data-ingestion innovation at LinkedIn, there is one more brand-new project that might make a difference soon.

“Kafka Connect is a tool for copying data between Kafka and a variety of other systems, ranging from relational databases to logs and metrics, to Hadoop and data warehouses, to NoSQL data stores, to search indexes, and more.”

kafka-connect-source-sink-flow-diagram
Confluent Platform 2.0 is GA! by Neha Narkhede (http://www.confluent.io/blog/confluent-platform-2.0-with-apache-kafka-0.9-ga)

Summary

Although it’s great to see new open-source world-class tools that simplify Big Data ingestion, the reality is much more complex than the vision.

A picture is not always worth a thousand words; sometimes a picture needs to be explained with a thousand words.

21 December 2015
