Refcardz: Getting Started with Apache Hadoop

We are happy to say that our Refcard, titled Getting Started with Apache Hadoop, has already been published by DZone.

This Refcard presents Apache Hadoop, a software framework that enables distributed storage and processing of large datasets using simple high-level programming models. The card covers the most important concepts of Hadoop, describes its architecture, and explains how to start using it and how to write and execute various applications on Hadoop.

In a nutshell, Hadoop is an open-source project of the Apache Software Foundation that can be installed on a cluster of servers so that these servers can communicate and work together to store and process large datasets. Hadoop has become very successful in recent years thanks to its ability to crunch big data effectively. It allows companies to store all of their data in one system and perform analysis on this data that would otherwise be impossible or very expensive with traditional solutions.
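To give a feel for what a "simple high-level programming model" can look like, here is a minimal sketch of the MapReduce idea as a word count in plain Python. It runs on a single machine and does not use Hadoop or any of its APIs; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. On a real cluster, the framework would run the map and reduce steps in parallel on many servers and handle the grouping step itself.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key (done here in memory; a framework does this between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters store big data"]))
```

The point of the model is that the author writes only the per-record map logic and the per-key reduce logic; distribution, grouping, and fault handling are left to the framework.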

Many companion tools built around Hadoop offer a wide variety of processing techniques. Integration with ancillary systems and utilities is excellent, making real-world work with Hadoop easier and more productive. These tools together form the Hadoop ecosystem.

You can think of Hadoop as a Big Data Operating System that makes it possible to run different types of workloads over all your huge datasets. This ranges from offline batch processing through machine learning to real-time stream processing.

To install Hadoop, you can take the code from or (the more recommended option) use one of the Hadoop distributions. The three most widely used ones come from Cloudera (CDH), Hortonworks (HDP), and MapR. A Hadoop distribution is a set of tools from the Hadoop ecosystem that are bundled together and guaranteed by the respective vendor to work and integrate well with each other. Additionally, each vendor offers tools (open-source or proprietary) to provision, manage, and monitor the whole platform.

To read about HDFS, YARN, MapReduce, Hive, Pig, Tez, and others, download the Refcard: Getting Started with Apache Hadoop. To learn more, attend our highly practical and engaging Hadoop trainings.

Post by Piotr Krewski

Piotr Krewski has extensive practical experience in writing applications that run on Hadoop clusters, as well as in maintaining, managing, and expanding Hadoop clusters. At Spotify, he was part of the team operating arguably the biggest Hadoop cluster in Europe. He is a co-founder of GetInData, where he currently works as an architect and engineer, helping companies build scalable, distributed architectures for storing and processing big data. Piotr also serves as a Hadoop instructor, delivering GetInData's proprietary trainings for administrators, developers, and analysts working with Big Data solutions.
