AirHadoop – “AirBnb” For Hadoop Clusters

In this blog post we share the motivation, current status and challenges of our new project, called AirHadoop. AirHadoop follows the sharing-economy model: it aims to let companies temporarily gain more computing power and storage by using idle Hadoop clusters that belong to somebody else.

Sharing economy

A sharing economy is an economic model in which individuals are able to borrow or rent assets owned by someone else. The sharing-economy model is most likely to be used when the price of a particular asset is high and the asset is not fully utilized all the time. The poster child of the sharing economy, Airbnb, allows users to rent out their spare rooms or vacant homes to strangers for a price that is typically lower than at hotels. At GetInData, we observed that a similar concept could be successfully applied to companies and Hadoop clusters.

Utilizing idle Hadoop clusters

While supporting various clients ranging from fast-growing startups to global corporations, we noticed that many Hadoop clusters alternate between periods when they are completely busy and periods when they are almost idle. For example, a European company that typically runs jobs during the daytime can rent out its compute resources to US companies during the nighttime (which is daytime in San Francisco), and vice versa.

Similarly, HDFS storage can also be rented out (regardless of the timezone). We saw several Hadoop clusters with HDFS being 95%-full, but on the other hand, we saw many more clusters with HDFS being only 40%-full. Companies that seek more disk space could transfer a portion of their cold datasets to other HDFS clusters to temporarily get more room for their data.
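
As a sketch of how such an offload decision could work (the policy, thresholds and `Dataset` structure here are hypothetical, not part of AirHadoop), one might pick the coldest datasets by last-access time until enough local space is freed:

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    path: str          # HDFS path of the dataset
    size_gb: float     # space it occupies on the local cluster
    days_idle: int     # days since the data was last accessed

def pick_cold_datasets(datasets, needed_gb, min_days_idle=90):
    """Hypothetical policy: select the coldest eligible datasets whose
    combined size frees at least `needed_gb` of local HDFS space."""
    candidates = sorted(
        (d for d in datasets if d.days_idle >= min_days_idle),
        key=lambda d: d.days_idle,
        reverse=True,  # coldest first
    )
    picked, freed = [], 0.0
    for d in candidates:
        if freed >= needed_gb:
            break
        picked.append(d)
        freed += d.size_gb
    return picked, freed
```

The actual transfer of the selected paths to a peer cluster would then be a separate step (e.g. a bulk inter-cluster copy), outside the scope of this sketch.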

I wish that I could use the virtually endless computing power of other Hadoop clusters to speed up our ad-hoc experiments. Especially for a price that is 10x lower than at the Amazon or Google clouds.
– John Kowalski, a data analyst at a fast-growing mobile startup

We’ve heavily invested in our Big Data infrastructure, but we can’t actually make any data-driven decisions, because it’s impossible to hire good data scientists these days. If somebody else could temporarily use our infrastructure, then we could increase our ROI.
– Thomas Sheeran, a director at a global corporation

Current status

Currently, we are building the MVP of AirHadoop. The MVP is conceptually simple, but it demonstrates the powerful idea behind the project. Each consultant at GetInData deploys a several-node Hadoop cluster on her laptop and lets others copy data to her cluster and run some jobs remotely. During our last one-week sprint, one of our consultants was able to copy “Hamlet” by Shakespeare to a peer cluster and run WordCount in Spark on top of it. The results of this job were later moved back to her cluster. This outcome is more than promising and it excites us! 🙂
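
Conceptually, the job our consultant ran boils down to the classic map-shuffle-reduce WordCount. A minimal, cluster-free sketch of that same logic in plain Python (the Spark version simply distributes these steps across executors):

```python
import re
from collections import Counter

def word_count(text):
    """Plain-Python equivalent of the WordCount job:
    the 'map' phase splits lines into lowercase words,
    the 'reduce' phase sums the count per word."""
    words = re.findall(r"[a-z']+", text.lower())  # map: text -> words
    return Counter(words)                         # shuffle + reduce: word -> count

counts = word_count("To be, or not to be: that is the question")
```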

During the next sprints we want to attack some of the challenges described below.

Existing challenges

While building the MVP and talking to companies interested in this project, we have identified several challenges and requirements:

  • Hadoop clusters participating in the AirHadoop project must be properly configured and “certified”, e.g. for privacy, security, multi-tenancy, performance and reliability.
  • The price must be an order of magnitude lower than the price at AWS or GCP. It can be calculated automatically based on the law of supply and demand.
  • Running real-time stream-processing computations (which run as “never-ending” jobs) is tricky, because all of the resources can be granted and revoked throughout the day.
  • The most popular projects from the Hadoop ecosystem should be supported, e.g. HDFS, YARN, Hive, Kafka, HBase and Solr.
  • Fraud must be detected automatically, e.g. to avoid situations where somebody uses your cluster to store or analyze offensive, embarrassing or illegal datasets.
  • Transferring data between Hadoop clusters (potentially across the globe) can be too time-consuming and expensive.
  • This is obviously not a complete list. At GetInData, however, we believe that the more challenges, the better!
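
For the pricing point above, a first approximation (purely illustrative; the rates and the floor constant are made up) could scale a base hourly rate by the current demand/supply ratio, while capping it an order of magnitude below the reference public-cloud price:

```python
def spot_price(base_rate, demand_units, supply_units, cloud_rate):
    """Hypothetical supply-and-demand pricing: scale a base hourly
    rate by utilization pressure, but always stay an order of
    magnitude below the reference public-cloud rate."""
    if supply_units <= 0:
        raise ValueError("no capacity offered")
    pressure = demand_units / supply_units
    price = base_rate * max(pressure, 0.1)  # floor so idle capacity isn't free
    return min(price, cloud_rate / 10)      # target: 10x cheaper than the cloud

# Low demand: price follows the demand/supply ratio.
low = spot_price(base_rate=0.02, demand_units=50, supply_units=100, cloud_rate=0.40)
# High demand: price hits the "10x cheaper than the cloud" cap.
high = spot_price(base_rate=0.02, demand_units=500, supply_units=100, cloud_rate=0.40)
```

In practice the cap is what keeps the headline promise honest: however scarce the capacity, the renter never pays more than a tenth of the corresponding cloud rate.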

Next steps

Once our MVP becomes rock-solid, we would like to invite a few carefully-selected companies to join AirHadoop and try it.

One idea is to invite Spotify – the awesome company where some of us worked in the past. As this Swedish company has been migrating to Google Cloud, its 2000-node Hadoop cluster (probably the largest one in Europe) will eventually become redundant. By participating in AirHadoop, Spotify could rent out the storage space and processing power of this cluster and earn some easy cash to pay part of the bill sent by Google.

We will provide the next status update in a year (April 1st, 2017).

Join us

If your company would like to join AirHadoop, please ping us at info@getindata.com.

Although we can’t guarantee that AirHadoop will become production-ready anytime soon, we can immediately help you with any other Big Data challenges, such as implementing batch and real-time applications (Spark, Storm, Flink), building rock-solid data infrastructures (Kafka, Hadoop, Hive, HBase, Elasticsearch, Druid and more), administering Hadoop clusters (security, multi-tenancy, reliability, performance) or delivering Big Data training.

Post by Adam Kawa

Adam became a fan of Big Data after implementing his first Hadoop job in 2010. Since then he has been working with Hadoop at Spotify (where he proudly operated one of the largest and fastest-growing Hadoop clusters in Europe for two years), Truecaller, an Authorized Cloudera Training Partner, and now at GetInData. He works with technologies like Hadoop, Hive, Spark, Flink, Kafka, HBase and more. He has helped a number of companies ranging from fast-growing startups to global corporations. Adam regularly blogs about Big Data and is a frequent speaker at major Big Data conferences and meetups. He is the co-founder of the Stockholm HUG and the co-organizer of the Warsaw HUG.

2 Responses to AirHadoop – “AirBnb” For Hadoop Clusters

1. I work with Hadoop clusters in my everyday job. I have never seen a production cluster connected to the internet. So that’s kind of not good.
   The only Hadoop vendor with a product that actually supports multi-tenancy is MapR. Many MapR customers still prefer to use multiple clusters instead of using the technically superior multi-tenancy. So that’s kind of not good either.
   In all clusters of entire countries, there may be no more than a handful of clusters using Hadoop security features. So yeah. Security eh?

   The idea is good, but as a platform, Hadoop (as in the platform/ecosystem) is technically retarded legacy software. Every ecosystem app is its own silo. You may be able to build what you want out of AirHadoop using Mesos, Myriad and MapR, but the technology support for that is not yet sufficient.

   Good luck.
