Lean Big Data – How to avoid wasting money with Big Data technologies and get some ROI
During my 6-year Hadoop adventure, I had an opportunity to work with Big Data technologies at several organizations, ranging from fast-growing startups (e.g. Spotify) to global corporations and academic institutions. What really amazed me was the difference in how the use-cases were defined, how fast valid solutions were built and how money was spent and wasted.
To be more specific, I’ve seen many companies with powerful, expensive and battle-ready data infrastructures (i.e. tens of nodes in multiple clusters, a whole Big Data stack, enterprise licences, security restrictions, teams of administrators) and … almost no real production use-cases. The result is a large cost overrun and a small ROI, which is something one should avoid.
This post introduces the notion of “Lean Big Data” as a way to invest smartly in Big Data technologies, and it describes five common pitfalls that can lead to a failed Big Data project. These include deploying Big Data technologies when you don’t have big data, separating application and platform roles too soon, and building a platform without a use case in mind.
1. Deploying Big Data technologies when you don’t have big data
I think that many companies equate being data-driven with using Big Data technologies. Obviously, it’s not true. You can be data-driven without Big Data. Similarly, you can use Big Data technologies and still not make data-driven decisions or implement product features that are fed by data.
A few years ago, a young marketplace startup discussed their data-processing needs with me. I advised them not to use Big Data technologies, for two reasons: they would iterate slower and spend more money. Either of these could kill a startup quickly. When you have small data and start using Big Data technologies, you don’t get benefits such as scalability, a cheaper replacement for a DWH, schema-on-read etc. You iterate slower because you spend time learning new and complex tools, troubleshooting issues and bugs, discovering unsupported features, making time-consuming mistakes, accumulating technical debt, implementing/testing/deploying/running distributed jobs on a cluster, and trying to recruit Big Data specialists and retain them. It’s much easier, faster and cheaper to iterate with MySQL than with Hadoop or HBase (especially when you’re a young startup and your requirements evolve and change frequently).
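To make the “small data” point concrete, here is a minimal sketch of the kind of question a young marketplace might ask, answered with a plain relational database. It uses SQLite from the Python standard library as a stand-in for MySQL so that it runs without any server; the table, the column names and the sample rows are all invented for illustration.

```python
import sqlite3

# Hypothetical marketplace orders; table and column names are made up.
ORDERS = [
    ("2016-01-01", "books", 12.50),
    ("2016-01-01", "games", 30.00),
    ("2016-01-02", "books", 7.25),
    ("2016-01-02", "books", 10.00),
]


def revenue_per_category(rows):
    """Load rows into an in-memory database and sum revenue by category."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (day TEXT, category TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT category, SUM(amount) FROM orders "
            "GROUP BY category ORDER BY category"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for category, revenue in revenue_per_category(ORDERS):
        print(category, revenue)
```

Changing the question means changing one SQL statement and re-running it in seconds, with no cluster, job scheduler or deployment step involved.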
Simply speaking, you should always be data-driven, but use Big Data technologies only when you have a really good reason for it. Here are some examples:
- Spotify decided to migrate to Hadoop and Cassandra because its existing solutions based on traditional databases (e.g. Postgres) were overloaded. The volume of data was increasing exponentially because the product was becoming more and more popular. Last but not least, Spotify didn’t want to enter the world of commercial data warehouses because of their technical limitations, high cost and vendor dependency, and because they didn’t fit its own engineering culture.
- Another company that I know was in a slightly different situation. They already had data warehouses, but these were overloaded. They simply didn’t want to invest more money into scaling them over and over.
- Similarly, at Orbitz (a website used to research, plan and book travel), the daily ETL took longer than 24 hours (!) and the cost of the existing data warehouse was really high. Thanks to Hadoop, they not only became able to store all data for a reasonable price (no sampling, no pruning, no aggregating), but they also started increasing income (as their recommendation engine became more accurate thanks to the larger and richer volume of data analyzed).
- The University of Warsaw was implementing a Machine Learning library to extract knowledge from PDF and plain-text academic papers (10+ terabytes of data). Their compute-intensive algorithms were running for weeks. They needed a technology not to store large volumes of data, because they had large disk arrays for this purpose, but to easily parallelize their computation and distribute it to many worker nodes, so that their ML algorithms could finish in hours rather than weeks.
The above use-cases are examples of why you might want to start using Big Data technologies (of course, there are more!).
However, some other companies that I’ve seen started deploying Big Data technologies when … these became very trendy and cool, seemed “worth trying” or were encouraged by vendors.
2. Building the “data lake” without a use-case in mind and hoping that use-cases will appear
These companies decided to build “production” Hadoop clusters, but they didn’t have valid use-cases identified. As a result, they started generating ideas and running tens (!) of internal POCs to onboard some projects on the Hadoop platform. Most of them failed and/or generated high costs.
One use-case that I saw in several places seems to be “always working” – just collect all textual documents, upload them to HDFS, use Solr to index them (not Elasticsearch, because Elasticsearch is not included in CDH and HDP) and build a custom front-end website to search across the indexes. This use-case runs on a Hadoop cluster (yes!), but … it could run without Hadoop too, because Solr and Elasticsearch are scalable and distributed technologies that can run without HDFS, YARN, MapReduce, Spark etc.
When you build the data lake “not for today, but for the future”, you will probably end up using Hadoop for the following use-cases:
- Ad-hoc analysis and exploration of data using SQL, R and/or Python (the same data is generated by existing ETL processes using traditional technologies, available for analysis using the same programming languages in older systems and additionally copied to Hadoop for running “experiments” – in consequence, rather than eliminating silos, we create an additional one)
- Uploading millions of small XML/JSON/CSV files generated from external tools (Hadoop acts as backup or archive with schema-on-read capabilities)
- Uploading larger volumes of previously-ignored data hoping that they’ll be useful later
3. Spending a fortune on cluster infrastructure and tooling to solve all technical problems and issues
The Hadoop Ecosystem has several nice properties that allow you to economize, e.g. open-source licences, standard hardware requirements, high-level frameworks, and linear scalability that lets you start with just a handful of physical/virtual nodes or use a public cloud where you “pay-as-you-go”. Even so, many companies don’t seem to take advantage of them. What I have seen is truly incredible:
- Multiple clusters (POC, PROD, TEST, ACC etc. – similarly to DTAP) in multiple places (on-premise & cloud) … and a few “production-like” use-cases that could easily run on a single small cluster
- An army of Hadoop administrators fixing issues in multiple time-zones … and a few analysts working with the clusters on a daily basis
- Many commercial licences (e.g. Cloudera Manager) and support from vendors (including expensive ones like Cloudera and Oracle, and cheaper ones from Asia/Europe to “save” costs) … and a lack of internal full-time employees who have practical experience with Big Data
- Advanced features like strong security with Kerberos, Hive authorization with Sentry, quotas, auditing, backups … and a single “default” queue for everyone (no multi-tenancy as each cluster is barely used)
- Many commercial tools already bought or on the roadmap to evaluate, e.g. WANdisco (to copy data between POC, PROD & TEST clusters), Control-M, Talend, Qlik Sense, Zoomdata (to help get something useful out of data)
- No infrastructure as code … but many internal MS Word documents that explain how to click through the Cloudera Manager or Ambari web UIs to deploy the cluster, configure Kerberos and HA, and troubleshoot issues that commonly recur on multiple clusters
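As a contrast to the Word documents in the last point, here is a minimal sketch of the idea behind infrastructure as code: the settings live in a version-controlled file and the cluster configuration is generated from them, so every cluster gets the same reviewed result. The helper name `render_hadoop_config` and the values are assumptions made for this example; `dfs.replication` and `dfs.namenode.name.dir` are real HDFS property names, and the output follows the `<configuration>`/`<property>` layout of Hadoop’s XML config files. A real setup would more likely use a dedicated tool such as Ansible or Puppet.

```python
import xml.etree.ElementTree as ET

# Hypothetical settings for a small cluster; the values are illustrative only.
HDFS_SITE = {
    "dfs.replication": "3",
    "dfs.namenode.name.dir": "/data/nn",
}


def render_hadoop_config(properties):
    """Render a dict as Hadoop-style <configuration> XML.

    Properties are emitted in sorted order so that the generated file is
    stable and produces clean diffs in version control.
    """
    root = ET.Element("configuration")
    for name in sorted(properties):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = properties[name]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_hadoop_config(HDFS_SITE))
```

Because the output depends only on the declared settings, rebuilding a TEST cluster to match PROD becomes a matter of re-running the script rather than re-reading a manual.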
Maybe it sounds surprising, but at Spotify we ran almost all batch computation on a single multi-tenant production cluster for many years. The access control matrix was rather simple, the Hadoop administration team consisted of 1-5 people, and there was no licence for any Hadoop server in our thousand-node cluster storing tens of PBs of data. Of course, we ran into trouble and wasted some time/money here and there, but we managed to achieve so much with a single but fully-utilized cluster!
The more you spend on a Hadoop cluster, the greater the pressure and expectations. For instance, many projects are “forced” to migrate to Big Data. The managers set goals such as “adding 10 new production projects to Hadoop in 2016” that can be achieved only by migrating “fake” projects. These projects often die later. What’s more, most of these projects actually don’t want to go from “POC” to “PROD”, because they will iterate slower due to various procedures such as less frequent upgrades of Hadoop, fewer permissions for junior employees, stricter resource management to ensure SLAs, and a more restricted development cycle.
4. Starting with separate “platform” and “application” teams
When you start with Big Data, it’s good to build a small cross-functional team that drives the whole effort. Such a team can consist of 3-6 people, for example a Linux administrator, a Java/Scala developer and a SQL analyst, working full or part-time. Ideally, you should have someone who has previous experience in Big Data. Thanks to a frequent and quick feedback loop, the administrators know exactly what the infrastructure really lacks and what it doesn’t need at all. The developers/analysts get immediate help when they run into trouble with configuration or access. Eventually, this team can split into smaller teams like “infrastructure” (administrators, DevOps), “tools & ETL” (engineers) and “insights” (analysts, data scientists), and these teams can split further later on. This was the case at Spotify many years ago and it worked well.
I noticed that some companies start with separate “platform” and “application” teams located on different floors or in different buildings/countries. As a result, they tend to run into miscommunication issues and usually aren’t aligned. The “platform” team goes its own way, building a battle-ready and award-winning data infrastructure by scaling the cluster, installing many components, deploying access control rules, thinking about features like backups, auditing and quotas, and introducing “good” procedures and recommendation guides written in Word documents etc. The “application” teams have a hard time figuring out valid use-cases, learning the complex infrastructure and overcoming the trouble caused by all the procedures, conventions and restrictions introduced by the “platform” team. They iterate even slower due to these rules and complications.
In my opinion, the data infrastructure should be use-case-driven and improved only to address new requirements from the applications and business use-cases. At Spotify our Hadoop cluster was always falling behind – it was so hard to keep up with scaling, enriching, tuning and stabilizing the cluster because new data, use-cases and requirements were added all the time, faster and faster. Eventually, Spotify decided to move to the public cloud, hoping that GCP would be able to quickly provide everything the company needs. If your platform is far ahead of the use-cases, then you have probably invested too much in it (of course, this doesn’t mean that your platform should be crappy and so unreliable that nobody can depend on it).
5. Not having people with real experience in Big Data
Even a single person who has real production experience with Big Data can help a company go in the right direction, avoid making expensive mistakes, hire smart people and talk to vendors in the right way. However, this can be a chicken-and-egg problem if the company doesn’t have valid use-cases for Big Data and so can’t grow that experience among its own employees.
In other words
Just to be clear, I don’t criticise companies that wish to innovate, try the Big Data stack and evaluate new cutting-edge technologies. I highly encourage doing that! I’m just amazed by the scale, complexity, and bureaucracy of many Big Data infrastructures.
Where a single, small, stable, well-configured and multi-tenant cluster can be enough, I often see multiple overpriced, underutilized and worrying clusters.
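One rough way to check which side of that sentence you are on is to put numbers on it. The sketch below computes utilization per cluster and the monthly infrastructure cost per production use-case across all clusters. The `Cluster` fields, the cluster names and every figure are invented for illustration, and core-hours are only one of several possible ways to measure usage.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    monthly_cost: float          # hypothetical currency units
    capacity_core_hours: float   # core-hours available per month
    used_core_hours: float       # core-hours actually consumed per month
    production_use_cases: int


def utilization(cluster):
    """Fraction of the available core-hours that was actually used."""
    return cluster.used_core_hours / cluster.capacity_core_hours


def cost_per_use_case(clusters):
    """Total monthly cost divided by the number of production use-cases."""
    total_cost = sum(c.monthly_cost for c in clusters)
    total_use_cases = sum(c.production_use_cases for c in clusters)
    if total_use_cases == 0:
        return float("inf")  # money spent, nothing in production
    return total_cost / total_use_cases


# Made-up example: three clusters, only one of them serving production.
CLUSTERS = [
    Cluster("POC", 10000.0, 1000.0, 50.0, 0),
    Cluster("TEST", 10000.0, 1000.0, 100.0, 0),
    Cluster("PROD", 20000.0, 2000.0, 500.0, 2),
]

if __name__ == "__main__":
    for c in CLUSTERS:
        print(c.name, f"{utilization(c):.0%}")
    print("cost per production use-case:", cost_per_use_case(CLUSTERS))
```

If the cost per use-case is high while utilization stays low, that points to consolidating clusters before buying anything new.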
What is the easiest way to waste money with Big Data?
IMHO, spending it far from where the ROI is. In the case of Big Data, that means cluster infrastructure, cluster administration, and ETL.
How to avoid it?
Lean Big Data
I would recommend following the idea described in the “Lean Startup” book. Try to spend as much time and energy as possible on activities that generate real value or validate your hypotheses (e.g. data analysis, data science, data-driven features, important dashboards). Build or improve your data infrastructure in iterations, and only if you see that the next iteration is needed for your company to achieve more.
For example, the first iteration could look as follows:
The next iterations could look as follows:
Basically, try to spend as little time as possible on steps 2 and 6.
Do you have a similar opinion? Am I completely wrong? Please share your thoughts in the comments, so everyone can benefit from your experience and avoid expensive mistakes when building their own Big Data infrastructure.