During my 6-year Hadoop adventure, I had the opportunity to work with Big Data technologies at several companies, ranging from fast-growing startups (e.g. Spotify) to global corporations and academic institutes. What really amazed me were the differences in how the use-cases were defined, how fast valid solutions were built and how money was spent and wasted.
To be more specific, I’ve seen many companies with powerful, expensive and battle-ready data infrastructures (i.e. tens of nodes in multiple clusters, a whole Big Data stack, enterprise licences, security restrictions, teams of administrators) and … almost no real production use-cases. This means large cost overruns and small ROI, which is something one should avoid.
This post introduces the notion of “Lean Big Data” as a way to invest smartly in Big Data technologies, and it describes five common pitfalls that can lead to a failed Big Data project. These include deploying Big Data technologies when you don’t have big data, separating application and platform roles too soon, and building a platform without a use case in mind.
Deploying Big Data technologies when you don’t have big data
I think that many companies equate being data-driven with using Big Data technologies. Obviously, that’s not true. You can be data-driven without Big Data. Similarly, you can use Big Data technologies and still not make data-driven decisions or implement product features that are fed by data.
A few years ago, a young marketplace startup discussed their data-processing needs with me. I advised them not to use Big Data technologies, for two reasons: they would iterate slower and spend more money. Either of these could kill the startup quickly. When you have small data and start using Big Data technologies, you don’t get benefits such as scalability, a cheaper replacement for a DWH, schema-on-read etc. You iterate slower because you spend time learning new and complex tools, troubleshooting issues and bugs, discovering unsupported features, making time-consuming mistakes, accumulating technical debt, implementing/testing/deploying/running distributed jobs on a cluster, and trying to recruit Big Data specialists and retain them. It’s much easier, faster and cheaper to iterate with MySQL than with Hadoop or HBase (especially when you’re a young startup and your requirements evolve and change frequently).
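To make the point about iterating quickly on small data more concrete, here is a minimal sketch. It is only an illustration under stated assumptions: the `orders` table, its columns and its rows are made up, and Python’s built-in `sqlite3` module stands in for a single-node relational database such as MySQL. The whole “pipeline” is a few lines of SQL that run instantly on one machine, with no cluster to deploy or administer.

```python
import sqlite3

# Hypothetical marketplace data: table name, columns and rows are
# invented purely for illustration.
ORDERS = [
    ("2016-10-01", "books", 12.50),
    ("2016-10-01", "games", 40.00),
    ("2016-10-02", "books", 7.25),
    ("2016-10-02", "books", 15.00),
]


def revenue_by_category(rows):
    """Load rows into an in-memory database and sum revenue per category."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (day TEXT, category TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT category, SUM(amount) FROM orders "
            "GROUP BY category ORDER BY category"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for category, total in revenue_by_category(ORDERS):
        print(category, total)
```

Changing the question (say, revenue per day instead of per category) means editing one SQL statement and re-running the script, which is the kind of fast iteration a young startup usually needs.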
Simply speaking, you should always be data-driven, but use Big Data technologies only when you have a really good reason for it. Here are some examples:
The above use-cases are examples of why you might want to start using Big Data technologies (of course, there are more!).
However, some other companies that I’ve seen started deploying Big Data technologies when … they became very trendy and cool, seemed “worth trying” or were encouraged by vendors.
Building the “data lake” without a use-case in mind and hoping that use-cases will appear
These companies decided to build “production” Hadoop clusters, but they hadn’t identified any valid use-cases. As a result, they started generating ideas and running tens (!) of internal POCs to onboard some projects onto the Hadoop platform. Most of them failed and/or generated high costs.
There is one use-case that I saw in several places and that seems to be “always working”: just collect all textual documents, upload them to HDFS, use Solr to index them (not Elasticsearch, because Elasticsearch is not included in CDH and HDP) and build a custom front-end website to search across the indexes. This use-case runs on a Hadoop cluster (yes!), but … it could run without Hadoop too, because Solr and Elasticsearch are scalable and distributed technologies that can run without HDFS, YARN, MapReduce, Spark etc.
When you build the data lake “not for today, but for the future”, you will probably end up using Hadoop for the following use-cases:
Spending a fortune on cluster infrastructure and toolkit to solve all technical problems and issues
The Hadoop Ecosystem has several nice properties that allow you to economize, e.g. open-source licences, standard hardware requirements, high-level frameworks, and linear scalability that lets you start with just a handful of physical/virtual nodes or use the public cloud where you “pay-as-you-go”. Even so, many companies don’t seem to take advantage of them. What I have seen is truly incredible:
Maybe it sounds surprising, but at Spotify we ran almost all batch computation on a single multi-tenant production cluster for many years. The access control matrix was rather simple, the Hadoop administrators team consisted of 1-5 people, and there was no licence for any Hadoop server in our thousand-node cluster storing tens of PBs of data. Of course, we ran into trouble and wasted some time/money here and there, but we managed to achieve so much with a single, but fully-utilized cluster!
The more you spend on a Hadoop cluster, the greater the pressure and expectations. For instance, many projects are “forced” to migrate to Big Data. Managers set goals like “adding 10 new production projects to Hadoop in 2016” that can be achieved only by migrating “fake” projects. These projects often die later. What’s more, most of these projects actually don’t want to go from “POC” to “PROD”, because they will iterate slower due to various procedures such as less frequent upgrades of Hadoop, fewer permissions for junior employees, tighter resource management to ensure SLAs, and a more restricted development cycle.
Starting with separate “platform” and “application” teams
When you start with Big Data, it’s good to build a small cross-functional team that drives the whole effort. Such a team can consist of 3-6 people, such as a Linux administrator, a Java/Scala developer and a SQL analyst, working full or part time. Ideally, you should have someone who has previous experience in Big Data. Thanks to a frequent and quick feedback loop, the administrators know exactly what the infrastructure really lacks and what it doesn’t need at all. The developers/analysts get immediate help when they run into trouble, configuration problems or access issues. Eventually, this team can split into smaller teams like “infrastructure” (administrators, DevOps), “tools & ETL” (engineers) and “insights” (analysts, data scientists), and these teams can split further even later. This was the case at Spotify many years ago and it worked well.
I noticed that some companies start with separate “platform” and “application” teams located on different floors or in different buildings or countries. As a result, they tend to run into miscommunication issues and usually aren’t aligned. The “platform” team follows its own path, building a battle-ready and award-winning data infrastructure by scaling the cluster, installing many components, deploying access control rules, thinking about features like backups, auditing and quotas, and introducing “good” procedures and recommendation guides written in Word documents etc. The “application” teams have a hard time figuring out the valid use-cases, learning the complex infrastructure and overcoming the trouble caused by all the procedures, conventions and restrictions introduced by the “platform” team. They iterate even slower due to these rules and complications.
In my opinion, the data infrastructure should be use-case-driven and improved only to address new requirements from the applications and business use-cases. At Spotify our Hadoop cluster was always falling behind: it was so hard to keep scaling, enriching, tuning and stabilizing the cluster because new data, use-cases and requirements were added all the time, faster and faster. Eventually Spotify decided to move to the public cloud, hoping that GCP would be able to quickly provide everything the company needs. If your platform is ahead of the use-cases, then you have probably invested too much in it (of course, this doesn’t mean that your platform should be crappy and so unreliable that nobody can depend on it).
Not having people with real experience in Big Data
Even a single person who has real production experience with Big Data can help a company go in the right direction, avoid making expensive mistakes, hire smart people and talk to vendors effectively. However, this can be a chicken-and-egg problem if the company doesn’t have valid use-cases for Big Data and so can’t grow that experience among its own employees.
In other words
Just to be clear, I don’t criticise companies that wish to innovate, try the Big Data stack and evaluate new cutting-edge technologies. I highly encourage doing that! I’m just amazed by the scale, complexity and bureaucracy of many Big Data infrastructures.
Where a single, small, stable, well-configured and multi-tenant cluster can be enough, I often see multiple overpaid, underutilized and concerning clusters.
– Adam Kawa, founder and Big Data architect at GetInData
What is the easiest way to waste money with Big Data?
IMHO, by spending it far from ROI. In the case of Big Data, that means cluster infrastructure, cluster administration and ETL.
How to avoid it?
Lean Big Data
I would recommend following the idea described in the “Lean Startup” book. Try to spend as much time and energy as possible on activities that generate real value or validate your hypotheses (e.g. data analysis, data science, data-driven features, important dashboards). Build or improve your data infrastructure in iterations, and only if you see that the next iteration is needed for your company to achieve more.
For example, the first iteration could look as follows:
The next iterations could look as follows:
Basically, try to spend as little time as possible in steps 2 and 6.
Do you have a similar opinion? Am I completely wrong? Please share your thoughts in the comments, so everyone can benefit from your experience and avoid expensive mistakes when building their own Big Data infrastructure.
- Lean Big Data – How to avoid wasting money with Big Data technologies and get some ROI - October 13, 2016