Lean Big Data - How to avoid wasting money with Big Data technologies and get some ROI

During my 6-year Hadoop adventure, I had the opportunity to work with Big Data technologies at several companies, ranging from fast-growing startups (e.g. Spotify) to global corporations and academic institutes. What really amazed me was the difference in how use-cases were defined, how fast valid solutions were built, and how money was spent and wasted.

To be more specific, I’ve seen many companies with powerful, expensive and battle-ready data infrastructures (i.e. tens of nodes in multiple clusters, a whole Big Data stack, enterprise licences, security restrictions, teams of administrators) and … almost no real production use-cases. That means large cost overruns and little ROI, which is something one should avoid.

This post introduces the notion of “Lean Big Data” as a way to invest smartly in Big Data technologies, and it describes five common pitfalls that can lead to a failed Big Data project. These include deploying Big Data technologies when you don’t have big data, separating application and platform roles too soon, and building a platform without a use-case in mind.

1. Deploying Big Data technologies when you don’t have big data

I think that many companies equate being data-driven with using Big Data technologies. Obviously, that’s not true. You can be data-driven without Big Data. Similarly, you can use Big Data technologies and still not make data-driven decisions or implement product features that are fed by data.

A few years ago, a young marketplace startup discussed their data-processing needs with me. I advised them not to use Big Data technologies, for two reasons: they would iterate slower and spend more money. Both would kill this startup quickly. When you have small data and start using Big Data technologies, you don’t get benefits such as scalability, a cheaper replacement for a data warehouse (DWH) or schema-on-read. You iterate slower because you spend time learning new and complex tools, troubleshooting issues and bugs, discovering unsupported features, making time-consuming mistakes, accumulating technical debt, implementing/testing/deploying/running distributed jobs on a cluster, and trying to recruit and retain Big Data specialists. It’s much easier, faster and cheaper to iterate with MySQL than with Hadoop or HBase (especially when you’re a young startup whose requirements evolve and change frequently).

Simply speaking, you should always be data-driven, but use Big Data technologies only when you have a really good reason for it. Here are some examples:

Spotify decided to migrate to Hadoop and Cassandra because its existing solutions based on traditional databases (e.g. Postgres) were overloaded. The volume of data was increasing exponentially because the product was becoming more and more popular. Last but not least, Spotify didn’t want to enter the world of commercial data warehouses because of their technical limitations, high cost and vendor dependency, and because of its own engineering culture.
Another company that I know was in a slightly different situation. It already had data warehouses, but they were overloaded, and the company simply didn’t want to keep investing more money in scaling them over and over.
Similarly, at Orbitz (a website used to research, plan and book travel), the daily ETL took longer than 24 hours (!) and the cost of the existing data warehouse was really high. Thanks to Hadoop, they not only became able to store all their data for a reasonable price (no sampling, no pruning, no aggregating), but they also started increasing income (as their recommendation engine became more accurate thanks to the larger and richer volume of data analyzed).
The University of Warsaw was implementing a machine learning library to extract knowledge from PDF and plain-text academic papers (10+ terabytes of data). Their compute-intensive algorithms were running for weeks. They needed a technology not to store large volumes of data, because they had large disk arrays for this purpose, but to easily parallelize their computation and distribute it to many worker nodes, so that their ML algorithms could finish in hours rather than weeks.

The above use-cases are examples of why you might want to start using Big Data technologies (of course, there are more!).

However, some other companies that I’ve seen started deploying Big Data technologies when … they became very trendy and cool, seemed “worth trying” or were encouraged by vendors.

2. Building the “data lake” without a use-case in mind and hoping that use-cases will appear

These companies decided to build “production” Hadoop clusters, but they didn’t have valid use-cases identified. As a result, they started generating ideas and running tens (!) of internal POCs to onboard some projects onto the Hadoop platform. Most of them failed and/or generated high costs.

One use-case that I saw in several places seems to be “always working”: just collect all textual documents, upload them to HDFS, use Solr to index them (not Elasticsearch, because Elasticsearch is not included in CDH and HDP) and build a custom front-end website to search across the indexes. This use-case runs on a Hadoop cluster (yes!), but … it could run without Hadoop too, because Solr and Elasticsearch are scalable and distributed technologies that can run without HDFS, YARN, MapReduce, Spark etc.

When you build the data lake “not for today, but for the future”, you will probably end up using Hadoop for the following use-cases:

Ad-hoc analysis and exploration of data using SQL, R and/or Python (the same data is generated by existing ETL processes using traditional technologies, is available for analysis using the same programming languages in older systems, and is additionally copied to Hadoop for running “experiments” – as a consequence, rather than eliminating silos, we create an additional one)
Uploading millions of small XML/JSON/CSV files generated from external tools (Hadoop acts as backup or archive with schema-on-read capabilities)
Uploading larger volumes of previously-ignored data hoping that they’ll be useful later

3. Spending a fortune on cluster infrastructure and tooling to solve all technical problems and issues

The Hadoop ecosystem has several nice properties that allow you to economize, e.g. open-source licences, standard hardware requirements, linear scalability that allows you to start with just a handful of physical/virtual nodes or to use a public cloud where you “pay as you go”, and high-level frameworks. Even so, many companies don’t seem to take advantage of them. What I have seen is truly incredible:

Multiple clusters (POC, PROD, TEST, ACC etc. – similarly to DTAP) in multiple places (on-premise & cloud) … and a few “production-like” use-cases that could easily run on a single and small cluster
An army of Hadoop administrators fixing issues in multiple time zones … and a few analysts working with the clusters on a daily basis
Many commercial licences (e.g. Cloudera Manager) and support from vendors (including expensive ones like Cloudera and Oracle, and cheaper ones from Asia/Europe to “save” costs) … and a lack of internal full-time employees who have practical experience with Big Data
Advanced features like strong security with Kerberos, Hive authorization with Sentry, quotas, auditing, backups … and a single “default” queue for everyone (no multi-tenancy as each cluster is barely used)
Many commercial tools already bought or on the roadmap to evaluate, e.g. WanDISCO (to copy data between POC, PROD & TEST clusters), Control-M, Talend, Qlik Sense, ZoomData (to help get something useful out of data)
No infrastructure as code … but many internal MS Word documents that explain how to click through the Cloudera Manager or Ambari web UIs to deploy the cluster, configure Kerberos and HA, and troubleshoot issues that commonly recur on multiple clusters

It may sound surprising, but at Spotify we ran almost all batch computation on a single multi-tenant production cluster for many years. The access control matrix was rather simple, the Hadoop administration team consisted of 1-5 people, and there was no licence for any Hadoop server in our thousand-node cluster storing tens of PBs of data. Of course, we ran into trouble and wasted some time and money here and there, but we managed to achieve so much with a single but fully utilized cluster!

The more you spend on a Hadoop cluster, the greater the pressure and expectations. For instance, many projects are “forced” to migrate to Big Data. Managers set goals such as “adding 10 new production projects to Hadoop in 2016” that can be achieved only by migrating “fake” projects. These projects often die later. What’s more, most of these projects don’t actually want to go from “POC” to “PROD”, because they would iterate slower due to various procedures such as less frequent upgrades of Hadoop, fewer permissions for junior employees, stricter resource management to ensure SLAs, and a more restricted development cycle.

4. Starting with separate “platform” and “application” teams

When you start with Big Data, it’s good to build a small cross-functional team that drives the whole effort. Such a team can consist of 3-6 people, such as a Linux administrator, a Java/Scala developer and a SQL analyst, working full-time or part-time. Ideally, you should have someone who has previous experience in Big Data. Thanks to a frequent and quick feedback loop, the administrators know exactly what the infrastructure is really missing and what it doesn’t need at all. The developers and analysts get immediate help when they run into trouble, configuration problems or access issues. Eventually, this team can split into smaller teams like “infrastructure” (administrators, DevOps), “tools & ETL” (engineers) and “insights” (analysts, data scientists), and these teams can split into even smaller teams later. This was the case at Spotify many years ago and it worked well.

I noticed that some companies start with separate “platform” and “application” teams located on different floors or in different buildings or countries. As a result, they tend to run into miscommunication issues and usually aren’t aligned. The “platform” team goes its own way, building a battle-ready and award-winning data infrastructure by scaling the cluster, installing many components, deploying access control rules, thinking about features like backups, auditing and quotas, and introducing “good” procedures and recommendation guides written in Word documents. The “application” teams have a hard time figuring out valid use-cases, learning the complex infrastructure and overcoming the trouble caused by all the procedures, conventions and restrictions introduced by the “platform” team. They iterate even slower due to these rules and complications.

In my opinion, the data infrastructure should be use-case-driven and improved only to address new requirements from the applications and business use-cases. At Spotify, our Hadoop cluster was always falling behind: it was so hard to keep up with scaling, enriching, tuning and stabilizing the cluster because new data, use-cases and requirements were being added all the time, faster and faster. Eventually, Spotify decided to move to the public cloud, hoping that GCP would be able to quickly provide everything the company needed. If your platform is ahead of your use-cases, then you have probably invested too much in it (of course, this doesn’t mean that your platform should be so unreliable that nobody can depend on it).

5. Not having people with real experience in Big Data

Even a single person who has real production experience with Big Data can help a company move in the right direction, avoid making expensive mistakes, hire smart people and talk to vendors effectively. However, this can be a chicken-and-egg problem if the company doesn’t have valid use-cases for Big Data and can’t grow that experience among its internal employees.

In other words

Just to be clear, I don’t criticise companies that wish to innovate, try the Big Data stack and evaluate new cutting-edge technologies. I highly encourage that! I’m just amazed by the scale, complexity and bureaucracy of many Big Data infrastructures.

Where a single, small, stable, well-configured and multi-tenant cluster would be enough, I often see multiple overpriced, underutilized and worrisome clusters.
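
As a rough illustration of this point, the Python sketch below checks whether the combined average load of several barely used clusters would fit on one cluster. The cluster names echo the ones mentioned earlier, but the node counts, utilization figures and the 70% headroom threshold are hypothetical values made up for this example. Real capacity planning would also have to consider peak load, storage and isolation requirements.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    nodes: int               # number of worker nodes
    avg_utilization: float   # average fraction of capacity in use, 0.0-1.0


def busy_nodes(clusters):
    """Total number of node-equivalents that are busy on average."""
    return sum(c.nodes * c.avg_utilization for c in clusters)


def fits_on_single_cluster(clusters, target_nodes, max_utilization=0.7):
    """True if the combined average load fits on `target_nodes` nodes
    while staying under `max_utilization` (an assumed safety margin)."""
    return busy_nodes(clusters) <= target_nodes * max_utilization


# Hypothetical numbers, for illustration only.
clusters = [
    Cluster("POC", nodes=10, avg_utilization=0.05),
    Cluster("TEST", nodes=10, avg_utilization=0.10),
    Cluster("ACC", nodes=10, avg_utilization=0.10),
    Cluster("PROD", nodes=20, avg_utilization=0.25),
]

total_nodes = sum(c.nodes for c in clusters)
print(f"Nodes paid for: {total_nodes}")
print(f"Nodes busy on average: {busy_nodes(clusters):.1f}")
print(f"Fits on 20 nodes: {fits_on_single_cluster(clusters, target_nodes=20)}")
```

With these made-up inputs, most of the paid-for capacity sits idle, which is the situation described above.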

Simple secret

What is the easiest way to waste money with Big Data?

IMHO, by spending it far from where the ROI is generated. In the case of Big Data, that means cluster infrastructure, cluster administration and ETL.
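
A back-of-the-envelope way to see this is to tally where the budget goes and compute the share that is spent close to ROI. The minimal Python sketch below takes its “far from ROI” categories from the answer above. The spend amounts and the other category names are hypothetical placeholders, not real figures.

```python
# Categories the text describes as being far from ROI.
FAR_FROM_ROI = {"cluster infrastructure", "cluster administration", "ETL"}


def roi_share(spend):
    """Return the fraction of total spend that goes to activities
    NOT listed in FAR_FROM_ROI (i.e. spent close to ROI)."""
    total = sum(spend.values())
    if total == 0:
        return 0.0
    close = sum(amount for category, amount in spend.items()
                if category not in FAR_FROM_ROI)
    return close / total


# Hypothetical yearly spend in arbitrary currency units.
spend = {
    "cluster infrastructure": 400,
    "cluster administration": 300,
    "ETL": 200,
    "data analysis": 50,
    "data-driven features": 50,
}

print(f"Share of spend close to ROI: {roi_share(spend):.0%}")
```

Tracking a number like this over time is one simple way to notice when the platform starts consuming most of the budget.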

How to avoid it?

Lean Big Data

I would recommend following the idea described in the “Lean Startup” book. Try to spend as much time and energy as possible on activities that generate real value or validate your hypotheses (e.g. data analysis, data science, data-driven features, important dashboards). Build or improve your data infrastructure in iterations, and only if you see that the next iteration is needed for your company to achieve more.

For example, the first iteration could look as follows:

[Figure: the first iteration]

The next iterations could look as follows:

[Figure: the next iterations]

Basically, try to spend as little time as possible on steps 2 and 6.

Discussion

Do you have a similar opinion? Am I completely wrong? Please share your thoughts in the comments, so that everyone can benefit from your experience and avoid expensive mistakes when building their own Big Data infrastructure.

Last updated: 16 October 2016

Written by

Adam Kawa

CEO and Founder
