AirHadoop – “Airbnb” For Hadoop Clusters
In this blog post we share the motivation, current status and challenges of our new project, AirHadoop. AirHadoop follows the sharing economy model: it aims to let companies use idle Hadoop clusters that belong to somebody else to temporarily gain more computing power and storage.
A sharing economy is an economic model in which individuals are able to borrow or rent assets owned by someone else. The model is most likely to be used when the price of a particular asset is high and the asset is not fully utilized all the time. The poster child of the sharing economy, Airbnb, allows users to rent out their spare rooms or vacant homes to strangers for a price that is typically lower than a hotel's. At GetInData, we observed that a similar concept could be successfully applied to companies and Hadoop clusters.
Utilizing idle Hadoop clusters
While supporting various clients, ranging from fast-growing startups to global corporations, we noticed that many Hadoop clusters alternate between periods when they are completely busy and periods when they are almost idle. For example, a European company that typically runs jobs during the daytime can rent out its compute resources to US companies during the nighttime (which is daytime in San Francisco), and vice versa.
Similarly, HDFS storage can also be rented out (regardless of the timezone). We saw several Hadoop clusters with HDFS 95% full, but we saw many more clusters with HDFS only 40% full. Companies that need more disk space could transfer a portion of their cold datasets to other HDFS clusters to temporarily get more room for their data.
I wish that I could use the virtually endless computing power of other Hadoop clusters to speed up our ad-hoc experiments. Especially at a price that is 10x lower than in the Amazon or Google clouds.
– John Kowalski, a data analyst at a fast-growing mobile startup
We’ve heavily invested in our Big Data infrastructure, but we can’t actually make any data-driven decisions, because it’s impossible to hire good data scientists these days. If somebody else could temporarily use our infrastructure, then we could increase our ROI.
– Thomas Sheeran, the director of a global corporation
Currently, we are building the MVP of AirHadoop. The MVP is conceptually simple, but it demonstrates the powerful idea behind the project. Each consultant at GetInData deploys a several-node Hadoop cluster on her laptop and lets others copy data to her cluster and run jobs on it remotely. During our last one-week sprint, one of our consultants was able to copy “Hamlet” by Shakespeare to a peer cluster and run WordCount in Spark on top of it. The results of this job were later moved back to her cluster. This outcome is more than promising and it excites us! 🙂
During the next sprints we want to attack one of the challenges described below.
While building the MVP and talking to companies interested in this project, we have identified several challenges and requirements:
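To make the MVP workflow concrete, here is a minimal sketch of the three steps described above (copy the data to a peer cluster, run WordCount there, move the results back). It only builds the shell commands and does not run them. The function name, host names, the `/airhadoop/in` and `/airhadoop/out` staging paths and the `wordcount.py` job script are all hypothetical, and the sketch assumes that `spark-submit` is already configured to talk to the peer cluster's YARN. It is an illustration, not the actual AirHadoop code.

```python
import shlex
from typing import List


def plan_remote_wordcount(local_nn: str, peer_nn: str,
                          dataset: str, job_script: str) -> List[List[str]]:
    """Build (but do not run) the commands for one borrow-a-cluster round trip.

    local_nn / peer_nn are NameNode addresses such as "host:8020";
    dataset is an absolute HDFS path. All paths below are illustrative.
    """
    src = f"hdfs://{local_nn}{dataset}"
    remote_in = f"hdfs://{peer_nn}/airhadoop/in{dataset}"     # hypothetical staging dir
    remote_out = f"hdfs://{peer_nn}/airhadoop/out{dataset}"   # hypothetical output dir
    local_out = f"hdfs://{local_nn}/results{dataset}"
    return [
        # 1. Copy the input dataset to the peer cluster.
        ["hadoop", "distcp", src, remote_in],
        # 2. Run the Spark job on the peer cluster (assumes its YARN config is in use).
        ["spark-submit", "--master", "yarn", job_script, remote_in, remote_out],
        # 3. Move the results back to the home cluster.
        ["hadoop", "distcp", remote_out, local_out],
    ]


# Example: the "Hamlet" experiment with made-up host names.
for cmd in plan_remote_wordcount("my-laptop:8020", "peer-laptop:8020",
                                 "/books/hamlet.txt", "wordcount.py"):
    print(shlex.join(cmd))
```

Keeping the plan as plain data makes it easy to review before anything touches a borrowed cluster, which matters when the cluster belongs to somebody else.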
This is obviously not a complete list. At GetInData, however, we believe that the more challenges, the better!
Once our MVP becomes rock-solid, we would like to invite a few carefully selected companies to join AirHadoop and try it.
One idea is to invite Spotify, the awesome company where some of us worked in the past. As this Swedish company has been migrating to Google Cloud, its 2000-node Hadoop cluster (probably the largest one in Europe) will eventually no longer be needed. By participating in AirHadoop, Spotify can rent out the storage space and processing power of this cluster and get some easy cash to pay a part of the bill sent by Google.
We will provide the next status update in a year (April 1st, 2017).
If your company would like to join AirHadoop, please ping us at firstname.lastname@example.org.
Although we can't guarantee that AirHadoop will become production-ready anytime soon, we can immediately help you with any other Big Data challenges, such as implementing batch and real-time applications (Spark, Storm, Flink), building rock-solid data infrastructures (Kafka, Hadoop, Hive, HBase, Elasticsearch, Druid and more), administering Hadoop clusters (security, multi-tenancy, reliability, performance) or delivering Big Data training.