Refcardz: Getting Started with Apache Hadoop
We are happy to say that our Refcard, titled Getting Started with Apache Hadoop, has already been published by DZone.
This Refcard presents Apache Hadoop, a software framework that enables distributed storage and processing of large datasets using simple, high-level programming models. The card covers the most important concepts of Hadoop, describes its architecture, and explains how to start using it, as well as how to write and run applications on it.
In a nutshell, Hadoop is an open-source project of the Apache Software Foundation that can be installed on a cluster of servers so that these servers can communicate and work together to store and process large datasets. Hadoop has become very successful in recent years thanks to its ability to effectively crunch big data. It allows companies to store all of their data in one system and perform analysis on that data that would otherwise be impossible, or prohibitively expensive, with traditional solutions.
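To give a flavor of the "simple high-level programming model" mentioned above, here is a minimal sketch of MapReduce, the model Hadoop popularized, written in plain Python with no Hadoop dependencies. The function names and the local "shuffle" step are illustrative only; on a real cluster, Hadoop runs the map and reduce tasks on many nodes and performs the shuffle-and-sort phase for you.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    return (word, sum(counts))

def run_wordcount(lines):
    # Local stand-in for Hadoop's shuffle-and-sort: group the
    # intermediate (word, 1) pairs by key before reducing.
    intermediate = sorted(
        pair for line in lines for pair in map_phase(line)
    )
    return dict(
        reduce_phase(word, (count for _, count in pairs))
        for word, pairs in groupby(intermediate, key=itemgetter(0))
    )

print(run_wordcount(["hello hadoop", "hello big data"]))
# → {'big': 1, 'data': 1, 'hadoop': 1, 'hello': 2}
```

The appeal of the model is that the programmer writes only the two small, single-purpose functions; distribution, parallelism, and fault tolerance are the framework's job.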
Many companion tools built around Hadoop offer a wide variety of processing techniques. Integration with ancillary systems and utilities is excellent, making real-world work with Hadoop easier and more productive. These tools together form the Hadoop ecosystem.
You can think of Hadoop as a Big Data operating system that makes it possible to run different types of workloads over all your huge datasets. These workloads range from offline batch processing through machine learning to real-time stream processing.
To install Hadoop, you can download the code from http://hadoop.apache.org or, more conveniently, use one of the Hadoop distributions. The three most widely used ones come from Cloudera (CDH), Hortonworks (HDP), and MapR. A Hadoop distribution is a set of tools from the Hadoop ecosystem, bundled together and guaranteed by the respective vendor to work and integrate with each other well. Additionally, each vendor offers tools (open-source or proprietary) to provision, manage, and monitor the whole platform.